Zanopia – Stateless application, database & storage architecture

Objects and the cloud.

Is Cloud based Tape Backup a great new business?



I am certainly not an expert in tape storage, but the question is interesting. It was said at Chirp 2010, in the Scaling Twitter presentation, that “Disk is the new Tape”, but is it really? Could it make sense for somebody to buy a massive tape library, bolt on a REST interface and charge for tape on a pay-as-you-backup model?

To start, let’s assume that there’s some clever way to lay out each user’s data on tape so that tens of separate tapes don’t need to be mounted by the robot to retrieve it, all the while serializing the writing of data to tape. There is surely enough cleverness out there, with the help of some disk caching, to do this intelligently. Alternatively, it could be an archival feature of current cloud offerings. Now, what does tape really cost in big volumes? The math is tricky because the medium is not what costs the most money. Some San Diego Supercomputer Center researchers gave the math a shot, and their findings seem well thought through. The article states a price of $500/TB/year for one copy.

The same article also states that the price for storage on Amazon S3 is $1850/TB/year, plus $205 to initially store the data. Clearly AWS S3 provides a higher quality of storage, with more than one copy and much more readily available access. Wow, the cost difference is around a factor of 4! Hmm, so maybe it would be possible to create the Acme Cloud Tape Backup Service and make a killing! After all, cloud storage is sometimes referred to as WORN storage anyway (write once, read never). A mammoth installation should be able to provide tape at even more economical prices than this study indicates, bringing the price still lower; now this starts to really look interesting!
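To make the comparison concrete, here is the arithmetic as a tiny Python sketch. The figures are the ones quoted above (the $500/TB/year single-copy tape estimate and S3’s $1850/TB/year plus $205 initial ingest); the 3-year amortization period for the one-time fee is my own assumption.

```python
# Back-of-the-envelope cost comparison using the figures cited above.
# All dollar amounts come from the quoted article; the amortization
# period for the one-time ingest fee is an assumption for illustration.

def s3_cost_per_tb_year(years=1):
    """Yearly S3 cost per TB, with the one-time $205 ingest fee
    amortized over the given number of years."""
    return 1850 + 205 / years

tape_one_copy = 500                   # $/TB/year, single tape copy
tape_two_copies = 2 * tape_one_copy   # $1000/TB/year with a second copy

print(f"S3 vs single-copy tape: {s3_cost_per_tb_year(3) / tape_one_copy:.1f}x")   # 3.8x
print(f"S3 vs two-copy tape:    {s3_cost_per_tb_year(3) / tape_two_copies:.1f}x") # 1.9x
```

Even with two tape copies (see the counter-arguments below), the gap narrows to roughly a factor of 2 rather than 4.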

Then again, there are some counter arguments:

  • For a near-line storage offering to be interesting, it would need to be as reliable as, and much less expensive than, disk-based storage. One tape copy will never do, so we had better keep 2 copies, cutting our margin in half and raising the price to $1000/TB/year.
  • Even though folks use cloud storage for backup, it’s way more responsive than backup needs it to be. After all, it’s fast enough to serve up web pages; now just try that with tape! We could imagine a website with a new URI scheme, say tape://, with a nice blinking Javascript banner saying Please wait while the Robot loads your request…

  • We must also admit that the cost of using disks can realistically be reduced significantly with different redundancy models, such as the dispersed storage that Cleversafe and Amplidata propose. It is surely possible to reach the $1000/TB/year number if disk storage is allowed latency and I/O performance similar to tape.

  • Tape has power consumption arguments in its favor, but it would be fairly simple to power down full hard drives and achieve similar results.

  • The level of investment in hard drive technologies speaks in their favor over time, especially since much of the cost of tape is associated with the physical handling of tapes by robots and humans: costs that are not subject to Moore’s law or its derivatives.

  • Another advantage of hard disks is their ability to tell us when they are failing and need replacement. Tapes, on the other hand, must simply be replaced on a regular schedule because, especially with a single copy, by the time read errors occur due to tape degradation it is simply too late.

The question is interesting, though. We have to admit, tape is still cheaper than disk, and it doesn’t consume power while remembering what you asked it to remember. So yeah, I think there just might be a market, but whoever does it had better be very good at what they do.

One last thought:
You have to admit Exabyte was a very cool and forward looking name for a tape drive company!
On the other hand their site is Temporarily Unavailable today.

Written by Giorgio Regni

January 14, 2011 at 5:58 pm

Posted in Cloud Storage, Storage

Comparing Scality RING Object Store & Hadoop HDFS file system


It’s a question that I get a lot, so I thought I’d answer it here; that way I can point people to this blog post when it comes up again!

First, an introduction.

What are Hadoop and HDFS?


Apache Hadoop is a software framework that supports data-intensive distributed applications. It’s open source software released under the Apache license. It can work with thousands of nodes and petabytes of data and was significantly inspired by Google’s MapReduce and Google File System (GFS) papers.


Hadoop was not fundamentally developed as a storage platform but since data mining algorithms like map/reduce work best when they can run as close to the data as possible, it was natural to include a storage component.

This storage component does not need to satisfy generic storage constraints; it just needs to be good at storing data for map/reduce jobs on enormous datasets, and this is exactly what HDFS does.

About Scality RING object store

About Scality

Our core RING product is a software-based solution that utilizes commodity hardware to create a high performance, massively scalable object storage system.

Our technology has been designed from the ground up as a multi-petabyte-scale, tier 1 storage system to serve billions of objects to millions of users at the same time.

We did not come from the backup or CDN spaces

Surprisingly for a storage company, we came from the anti-abuse email space for internet service providers.

Why we developed it

Scality RING object store architecture
The initial problem our technology was born to solve is the storage of billions of emails: highly transactional data, crazy IOPS demands, and a need for an architecture flexible and scalable enough to handle exponential growth. Yes, even with the likes of Facebook, Flickr, Twitter and YouTube, email storage still more than doubles every year, and it’s accelerating!

Rather than dealing with a large number of independent storage volumes that must be individually provisioned for capacity and IOPS (as with a file-system-based architecture), RING mutualizes the storage system. Essentially, capacity and IOPS are shared across a pool of storage nodes in such a way that it is not necessary to migrate or rebalance users should a performance spike occur. This removes much of the complexity from an operational point of view, as there is no longer a strong affinity between where a user’s metadata is located and where the actual content of their mailbox is.

Another big area of concern is the underutilization of storage resources: it’s typical to see disk arrays in a SAN less than half full because of IOPS and inode (number of files) limitations. We designed automated tiered storage that takes care of moving data to less expensive, higher-density disks according to object access statistics, since multiple RINGs can be composed one after the other or in parallel: for example, 7K RPM drives for large objects and 15K RPM or SSD drives for small files and indexes. In this way, we can make the best use of different disk technologies, namely, in order of performance: SSD, SAS 10K and terabyte-scale SATA drives.
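As an illustration of the kind of policy just described, here is a minimal tiering-decision sketch. The thresholds, ring names and the choose_tier function are all invented for the example; this is not Scality’s actual policy engine.

```python
# Illustrative sketch (not Scality's real code): route an object to a
# ring/tier based on its size and recent access count, along the lines
# of the policy described above (SSD / 15K RPM for small hot objects
# and indexes, high-density 7K RPM SATA for large or cold objects).

SMALL_OBJECT_BYTES = 64 * 1024   # threshold is an assumed example value

def choose_tier(size_bytes, reads_last_30d):
    if size_bytes < SMALL_OBJECT_BYTES or reads_last_30d > 100:
        return "ssd-ring"        # small or hot: SSD / 15K RPM ring
    if reads_last_30d > 0:
        return "sas-ring"        # warm data: SAS 10K ring
    return "sata-ring"           # large, cold data: 7K RPM SATA ring

print(choose_tier(4 * 1024, 500))   # small, hot index blob -> ssd-ring
print(choose_tier(2 * 10**9, 0))    # large, cold video -> sata-ring
```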

To remove the typical limitation in terms of the number of files stored on a disk, we use our own data format to pack objects into larger containers. This actually solves multiple problems:

  • write IO load is more linear, meaning much better write bandwidth
  • each disk or volume is accessed through a dedicated IO daemon process and is isolated from the main storage process; if a disk crashes, it doesn’t impact anything else
  • billions of files can be stored on a single disk
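The container idea above can be sketched in a few lines. This toy version keeps the index in memory and uses a byte buffer as a stand-in for the on-disk container file; the real on-disk format is of course more involved and is not public here.

```python
# Minimal sketch of "pack many objects into one container file":
# objects are appended sequentially (linear write I/O) and located
# later through a small index, so the number of filesystem inodes no
# longer limits the number of stored objects. Format details invented
# for illustration.

import io

class Container:
    def __init__(self):
        self.data = io.BytesIO()   # stands in for one big file on disk
        self.index = {}            # object key -> (offset, length)

    def put(self, key, payload: bytes):
        offset = self.data.seek(0, io.SEEK_END)   # append-only, linear writes
        self.data.write(payload)
        self.index[key] = (offset, len(payload))

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        self.data.seek(offset)
        return self.data.read(length)

c = Container()
c.put("mail:42", b"hello")
c.put("mail:43", b"world")
print(c.get("mail:42"))   # b'hello'
```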

Comparison matrix

Let’s compare both systems in this simple table:

Architecture
  Hadoop HDFS: Centralized around a name node that acts as a central metadata server; any number of data nodes.
  Scality RING: Fully distributed architecture using consistent hashing in a 20-byte (160-bit) key space; each node runs the same code.

Single point of failure
  Hadoop HDFS: The name node is a single point of failure: if it goes down, the filesystem is offline.
  Scality RING: No single point of failure; metadata and data are distributed across the cluster of nodes.

Clustering/nodes
  Hadoop HDFS: Static configuration of name nodes and data nodes.
  Scality RING: Peer-to-peer algorithm based on CHORD, designed to scale past thousands of nodes. The complexity of the algorithm is O(log(N)), N being the number of nodes. Nodes can enter or leave while the system is online.

Replication model
  Hadoop HDFS: Data is replicated on multiple nodes; no need for RAID.
  Scality RING: Data is replicated on multiple nodes; no need for RAID.

Disk usage
  Hadoop HDFS: Objects are stored as files, with the typical inode and directory tree issues.
  Scality RING: Objects are stored in an optimized container format to linearize writes and reduce or eliminate inode and directory tree issues.

Replication policy
  Hadoop HDFS: Global setting.
  Scality RING: Per-object replication policy, between 0 and 5 replicas. Replication is based on projection of keys across the RING and adds no runtime overhead, as replica keys can be calculated and do not need to be stored in a metadata database.

Rack awareness
  Hadoop HDFS: Rack-aware setup supported in 3-copies mode.
  Scality RING: Rack-aware setup supported.

Data center awareness
  Hadoop HDFS: Not supported.
  Scality RING: Supported, including asynchronous replication.

Tiered storage
  Hadoop HDFS: Not supported.
  Scality RING: Supported; rings can be chained or used in parallel. A plugin architecture allows the use of other technologies as a backend, for example dispersed storage or iSCSI SAN.
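For readers unfamiliar with consistent hashing, here is a toy sketch of a lookup on a 160-bit ring, including replica keys computed by projection rather than stored in a metadata database. The projection function (equal rotations around the ring) and the class layout are simplified stand-ins for illustration, not Scality’s actual implementation.

```python
# Toy consistent-hashing ring over a 160-bit key space. Each object key
# maps to the first node clockwise from it; lookup is O(log N) via
# binary search. Replica keys are derived from the primary key by a
# fixed projection, so any node can recompute them without consulting
# a metadata store.

import bisect
import hashlib

KEYSPACE = 2 ** 160

def h(value: str) -> int:
    """160-bit key from SHA-1 (matches the 20-byte key space above)."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.nodes = sorted((h(n), n) for n in nodes)
        self.points = [p for p, _ in self.nodes]

    def node_for(self, key: int) -> str:
        i = bisect.bisect_right(self.points, key) % len(self.nodes)
        return self.nodes[i][1]

def replica_keys(key: int, copies: int):
    # Project replicas at equal rotations around the ring (simplified).
    return [(key + i * KEYSPACE // copies) % KEYSPACE for i in range(copies)]

ring = Ring([f"node{i}" for i in range(8)])
key = h("mailbox:alice")
print([ring.node_for(k) for k in replica_keys(key, 3)])
```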

Conclusion – Domain Specific Storage?

The FS part in HDFS is a bit misleading: it cannot be mounted natively to appear as a POSIX filesystem, and that’s not what it was designed for. As a distributed processing platform, Hadoop needs a way to reliably and practically store the large datasets it needs to work on, and pushing the data as close as possible to each computing unit is key for obvious performance reasons.

As I see it, HDFS was designed as a domain specific storage component for large map/reduce computations. Its usage can possibly be extended to similar specific applications.

Scality RING can also be seen as domain specific storage; our domain being unstructured content: files, videos, emails, archives and other user generated content that constitutes the bulk of the storage capacity growth today.

Scality RING and HDFS share the fact that both would be unsuitable for hosting raw MySQL database files; however, they do not try to solve the same problems, and this shows in their respective designs and architectures.

Written by Giorgio Regni

December 7, 2010 at 6:45 pm

Posted in Storage


Automatic ID assignment in a distributed environment


In the area of distributed computing, unique ID assignment for the machines composing a cluster is often required. When the number of machines is huge, these IDs are generally generated automatically and randomly. This works if systems pick really big random numbers (e.g. 128 bits), since collisions are then very unlikely.
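The claim about big random numbers is easy to check with the standard birthday-paradox approximation, 1 − e^(−n(n−1)/(2|E|)), for n machines drawing IDs from a space of size |E|:

```python
# Birthday-paradox collision probability for n machines picking random
# IDs from a space of `space` values.

import math

def collision_probability(n, space):
    # -expm1(-x) computes 1 - exp(-x) accurately even for tiny x
    return -math.expm1(-n * (n - 1) / (2 * space))

# 1000 machines with 128-bit random IDs: collisions are negligible
print(collision_probability(1000, 2 ** 128))   # ~1.5e-33

# 1000 machines with 16-bit IDs: a collision is almost certain
print(collision_probability(1000, 2 ** 16))    # ~0.9995
```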

Occasionally it is required that systems choose IDs with a small number of bits (e.g. 1000 machines with a 16-bit ID). In this case the ID assignment scheme cannot rely on randomness; it becomes a network and computational challenge, posed here as a puzzle:

In ancient China, under the Tang dynasty, Hong WeiAn was an official at court. He was a very talented accountant and his favorite occupation was to submit maths games to his subordinates. One day he brought them all into the imperial garden and asked them the following problem:

"My friends, you are 100 people today. Put yourselves in a circle. You all have to choose a unique number from 0 to 99, but there are some rules…".

"1 – The first rule is that you can only talk to one person at a time."

"2 – The second rule is that you must not elect a leader".

"3 – There must be as few conversations as possible."

And he finished:

"4 – Don't rely too much on randomness! Because if you choose a number randomly, the chances of two of you having the same number are nearly 100%."


For clarity, in modern language it would be understood as:

1 – No broadcast

2 – No leader election to do the job.

3 – The goal is to find a converging solution with the fewest possible messages exchanged between the subordinates.

4 – Beware of the Birthday paradox! With n participants choosing among |E| possible values, the probability of a collision is approximately 1 − e^(−n(n−1)/(2|E|)); with n = 100 and |E| = 100, that is nearly 100%.

All subordinates are allowed to remember whom they have spoken with; they must use the best strategy to converge to the solution beyond doubt.

At the court of the Emperor, the subjects were very honest and dedicated, so there are no liars and no failures.

The solution to the problem is the generalization of the algorithm and protocol to any number of subjects, with the fewest possible messages.

Please preferably provide a solution where participants do not exchange lists, but only simple, small question/answer packets.

Please post your solution!

Written by Giorgio Regni

November 15, 2010 at 12:04 pm

Posted in Algorithm

Do not design in a vacuum…


I came upon this picture a few days ago:

OpenOffice Mouse

What happens when you design in a vacuum

Yes, that’s the Open Office Mouse!

Look at the ridiculous number of buttons on this thing, and what about that stick on the left side? I am pretty sure the designer isn’t a lefty, as this looks like it can only work with a right hand…

Here’s an excerpt from their marketing materials:

The OOMouse is one of the first computer mice to incorporate an analog joystick and the first to permit the use of the joystick as a keyboard. In the three joystick-as-keyboard modes, the user can assign up to sixteen different keys or macros to the joystick, which provides for easy movement regardless of whether the user is flying through the cells of a large spreadsheet in Microsoft Excel or on the back of an epic flying mount in World of Warcraft.

A mouse that’s good at both Excel and World of Warcraft! Way to choose your market! By the way, why not talk about Open Office, since it’s an “Open Office” mouse?

At least one thing is clear: this wasn’t designed by Apple, which created and still follows the drive towards simplicity, maybe pushing it too far with the one-button mouse in 2005:

apple one button mouse.jpg

Maybe 2005 was too early, but this one didn’t sell well either. Since then, Apple has embraced the two-button + scroll-wheel design, and now goes even further with a multi-touch trackpad.
I am pretty sure that one-button mouse was also designed in a vacuum, full of “simple is better” like-minded people looking down upon us, the proverbial lemming end users.

Learning from experience

Actually, I fell into the same trap in 2008. We were rolling out a new version of our massively scalable email gateway that promised a cutting-edge new way to stop spam as early as possible, without even getting as far as establishing a TCP connection.

Well, guess what: as technically advanced as it was, it came with countless issues that prevented any of our customers from deploying it. Some of the most important ones:
* Legal issues: rejecting email without sending an error message with a support link, for example, was a big NO
* False positives: legitimate email senders could get blocked by mistake, and it would have been a nightmare for them to debug what was actually happening
* Lots of spammers would actually try even harder, because they didn’t handle this error case as a permanent failure
* Our product could handle thousands of TCP sessions on the same server, so closing a session early really didn’t matter that much

But that didn’t stop us from coding and delivering it; the technical prowess sounded too good to our engineers’ ears, and we didn’t care to listen…

That Henry Ford quote

If I had asked people what they wanted, they would have said a faster horse. – Henry Ford

This quote is pretty popular and is usually used to support the idea that customers do not know what they want and are the last people to listen to for vision. Heck, even Steve Jobs uses it!

“It comes down to the very real fact that most customers don’t know what they want in a new product.” Apple customers should be glad Jobs doesn’t do focus groups. If he had, they may never have enjoyed iPods, iTunes, the iPhone, the iPad, or Apple Stores. Jobs doesn’t need focus groups because he understands his customers really, really well. Yes, sometimes better than they know themselves!… Sure, “listen” to your customers and ask them for feedback. Apple does that all the time. But when it comes to breakthrough success at Apple, Steve Jobs and his team are the company’s best focus group. Asked why Apple doesn’t do focus groups, Jobs responded: “We figure out what we want. You can’t go out and ask people ‘what’s the next big thing?’ There’s a great quote by Henry Ford. He said, “If I’d have asked my customers what they wanted, they would have told me ‘A faster horse.’”” – Steve Jobs

Well, it sounds like Apple still listens to its customers and asks them for feedback, so I guess it’s more about what kind of questions you ask.

Stupid questions, of course, always get stupid answers back…
The quality of your answers comes from the questions you ask.

In your wildest dreams, what should a storage platform look like?

This is the question we asked our customers in 2008, all very large MSOs, cable TV networks and internet service providers.

The answer was clearly the opposite of what they could buy at the time, centralized, monolithic, expensive SAN systems…

We allowed them to dream about the best platform, without worrying about any legacy support or backwards thinking.

It wasn’t easy to get the juices flowing, but after many carefully spaced-out meetings, we came down to this list of requirements:

Problem: “Sharding” of the database creates a hard association between application server and user.
Requirement: A stateless system with automatic index load distribution.

Problem: Single point of failure: when a SAN / NAS / FC switch reboots, service is down for minutes or hours.
Requirement: No component should ever cause a service loss.

Problem: Cost: at $1/mailbox/year just for storage, we cannot compete; beyond 300 TB, cost/TB increases.
Requirement: Be able to compete with Google, below $2/mailbox/year; leverage the decreasing price of generic hardware.

Problem: Managing multiple SANs, volumes and tiering is complex, error-prone and costly.
Requirement: Ease of management: an autonomic, policy-based, self-healing system.

Problem: Staying competitive against Google, Yahoo, etc.
Requirement: Enable new services: text search, photo recognition, transcoding.

This is the list of requirements we based our Scality RING platform on; mind you, this was before “cloud” even became a buzzword…

Here’s the architecture:
Scality Ring Architecture v1.3.png

You can learn more about our technology by visiting our website.

Today it’s live, in production, taking traffic from millions of users. We couldn’t have done it without working with, and getting feedback from, our customers!

Lesson learned, do not design in a vacuum…

Comments welcomed of course.

Written by Giorgio Regni

October 15, 2010 at 4:28 pm

Hooray, LLVM 2.8 has been released! Feature highlights


LLVM 2.8 has been released, with a few major improvements over the previous release. Highlights:



  • Clang C++ is now feature-complete with respect to the ISO C++ 1998 and 2003 standards.
  • Objective-C++ is now supported
  • Added support for SSE, AVX, ARM NEON, and AltiVec vector instructions.
  • Improved generated code quality in some areas:
    • Good code generation for X86-32 and X86-64 ABI handling.
    • Improved code generation for bit-fields, although important work remains.

Clang Static Analyzer

The Clang Static Analyzer project is an effort to use static source code analysis techniques to automatically find bugs in C and Objective-C programs (and hopefully C++ in the future!). The tool is very good at finding bugs that occur on specific paths through code, such as on error conditions.

VMKit: JVM/CLI Virtual Machine Implementation

The VMKit project is an implementation of a Java Virtual Machine (Java VM or JVM) that uses LLVM for static and just-in-time compilation. As of LLVM 2.8, VMKit now supports copying garbage collectors, and can be configured to use MMTk’s copy mark-sweep garbage collector. In LLVM 2.8, the VMKit .NET VM is no longer being maintained.

LLDB: Low Level Debugger

LLDB is a brand new member of the LLVM umbrella of projects. LLDB is a next generation, high-performance debugger. It is built as a set of reusable components which highly leverage existing libraries in the larger LLVM Project, such as the Clang expression parser, the LLVM disassembler and the LLVM JIT.

LLDB is in early development and not included as part of the LLVM 2.8 release, but is mature enough to support basic debugging scenarios on Mac OS X in C, Objective-C and C++. We’d really like help extending and expanding LLDB to support new platforms, new languages, new architectures, and new features.

Who knew there was so much room left for innovation in a compiler toolchain?

Great work again from the LLVM team!

More details in the official release notes.

Downloading right now…

Written by Giorgio Regni

October 7, 2010 at 11:01 pm

Posted in Language