Zanopia – Stateless application, database & storage architecture

Objects and the cloud.

Posts Tagged ‘cloud’

Comparing Scality RING Object Store & Hadoop HDFS file system


It’s a question that I get a lot, so I thought I’d answer it here and point people to this blog post when it comes up again!

So first, an introduction.

What are Hadoop and HDFS?


Apache Hadoop is a software framework that supports data-intensive distributed applications. It’s open source software released under the Apache license. It can work with thousands of nodes and petabytes of data and was significantly inspired by Google’s MapReduce and Google File System (GFS) papers.


Hadoop was not fundamentally developed as a storage platform but since data mining algorithms like map/reduce work best when they can run as close to the data as possible, it was natural to include a storage component.

This storage component does not need to satisfy generic storage constraints; it just needs to be good at storing data for map/reduce jobs on enormous datasets, and this is exactly what HDFS does.

About Scality RING object store

About Scality

Our core RING product is a software-based solution that utilizes commodity hardware to create a high performance, massively scalable object storage system.

Our technology has been designed from the ground up as a multi petabyte scale tier 1 storage system to serve billions of objects to millions of users at the same time.

We did not come from the backup or CDN spaces

Surprisingly for a storage company, we came from the anti-abuse email space for internet service providers.

Why did we develop it?

Scality RING object store architecture
The initial problem our technology was born to solve is the storage of billions of emails, that is: highly transactional data, crazy IOPS demands and a need for an architecture that’s flexible and scalable enough to handle exponential growth. Yes, even with the likes of Facebook, Flickr, Twitter and YouTube, email storage still more than doubles every year and it’s accelerating!

Rather than dealing with a large number of independent storage volumes that must be individually provisioned for capacity and IOPS needs (as with a file-system based architecture), RING instead pools the storage system. Essentially, capacity and IOPS are shared across a pool of storage nodes in such a way that it is not necessary to migrate or rebalance users should a performance spike occur. This removes much of the complexity from an operational point of view, as there’s no longer a strong affinity between where the user metadata is located and where the actual content of their mailbox is.

Another big area of concern is under-utilization of storage resources: it’s typical to see disk arrays in a SAN that are less than half full because of IOPS and inode (number of files) limitations. We designed automated tiered storage that takes care of moving data to less expensive, higher density disks according to object access statistics, as multiple RINGs can be composed one after the other or in parallel. For example, 7K RPM drives can be used for large objects and 15K RPM or SSD drives for small files and indexes. In this way, we can make the best use of different disk technologies, namely, in order of performance, SSD, SAS 10K and terabyte scale SATA drives.
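To make the tiering idea a bit more concrete, here is a minimal sketch of what a policy driven by access statistics could look like. The tier names, the thresholds and the `ObjectStats` fields are all assumptions made up for illustration; this is not the RING's actual policy engine.

```python
from dataclasses import dataclass

# Hypothetical tier labels, fastest first (assumed names).
SSD, SAS, SATA = "ssd", "sas-10k", "sata"


@dataclass
class ObjectStats:
    size_bytes: int       # size of the stored object
    reads_last_30d: int   # access count over a recent window (assumed metric)


def choose_tier(stats: ObjectStats,
                small_object_bytes: int = 64 * 1024,
                hot_reads: int = 100) -> str:
    """Pick a storage tier from object size and recent access counts."""
    if stats.size_bytes <= small_object_bytes and stats.reads_last_30d >= hot_reads:
        return SSD    # small and hot: indexes and small files
    if stats.reads_last_30d > 0:
        return SAS    # still being read: middle tier
    return SATA       # cold data: dense, inexpensive disks
```

A background job could call `choose_tier` on each object periodically and move the object when the answer changes, so the thresholds become the only knobs an operator has to tune.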

To remove the typical limitation in terms of the number of files stored on a disk, we use our own data format to pack objects into larger containers. This actually solves multiple problems:

  • write IO load is more linear, meaning much better write bandwidth
  • each disk or volume is accessed through a dedicated IO daemon process and is isolated from the main storage process; if a disk crashes, it doesn’t impact anything else
  • billions of files can be stored on a single disk
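The packing idea can be sketched in a few lines. This is only an illustration of the general technique (append-only container plus an offset index), not our on-disk format; the `Container` class, its in-memory index and the use of `BytesIO` as a stand-in for a disk file are assumptions.

```python
import io


class Container:
    """Append-only container: many small objects packed into one large file."""

    def __init__(self, backing=None):
        # Any seekable binary file object works; BytesIO keeps the sketch self-contained.
        self.f = backing if backing is not None else io.BytesIO()
        self.index = {}  # key -> (offset, length), kept in memory in this sketch

    def put(self, key: str, data: bytes) -> None:
        # Always write at the tail, so the disk sees sequential writes.
        offset = self.f.seek(0, io.SEEK_END)
        self.f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        self.f.seek(offset)
        return self.f.read(length)
```

Because every object is just an (offset, length) pair inside one big file, the file system only ever sees a handful of inodes, no matter how many objects are stored.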

Comparison matrix

Let’s compare both systems in this simple table:

Feature | Hadoop HDFS | Scality RING
Architecture | Centralized around a name node that acts as a central metadata server. Any number of data nodes. | Fully distributed architecture using consistent hashing in a 20 bytes (160 bits) key space. Each node server runs the same code.
Single Point of Failure | Name node is a single point of failure, if the name node goes down, the filesystem is offline. | No single point of failure, metadata and data are distributed in the cluster of nodes.
Clustering/nodes | Static configuration of name nodes and data nodes. | Peer to Peer algorithm based on CHORD designed to scale past thousands of nodes. Complexity of the algorithm is O(log(N)), N being the number of nodes. Nodes can enter or leave while the system is online.
Replication model | Data is replicated on multiple nodes, no need for RAID. | Data is replicated on multiple nodes, no need for RAID.
Disk Usage | Objects are stored as files with typical inode and directory tree issues. | Objects are stored with an optimized container format to linearize writes and reduce or eliminate inode and directory tree issues.
Replication policy | Global setting. | Per object replication policy, between 0 and 5 replicas. Replication is based on projection of keys across the RING and does not add overhead at runtime as replica keys can be calculated and do not need to be stored in a metadata database.
Rack aware | Rack aware setup supported in 3 copies mode. | Rack aware setup supported.
Data center aware | Not supported | Yes, including asynchronous replication
Tiered storage | Not supported | Yes, rings can be chained or used in parallel. Plugin architecture allows the use of other technologies as backend. For example dispersed storage or ISCSI SAN.
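Two rows of this comparison are easier to picture with code: consistent hashing in a 160-bit key space, and replica keys that are computed instead of stored. The sketch below is a toy under stated assumptions: SHA-1 is used only because it happens to produce 160 bits, the even spacing in `replica_keys` is one possible projection scheme, and the node names are invented. It does not claim to be the RING's actual key layout.

```python
import hashlib
from bisect import bisect_left

KEY_BITS = 160
KEY_SPACE = 1 << KEY_BITS  # 20-byte key space


def key_of(name: str) -> int:
    """Map a name onto the ring (SHA-1 chosen here only for its 160-bit output)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


def replica_keys(key: int, copies: int):
    """Project a key to evenly spaced ring positions (assumed scheme).

    The positions are pure arithmetic on the key, so nothing about
    replicas has to be stored in a metadata database.
    """
    return [(key + i * KEY_SPACE // copies) % KEY_SPACE for i in range(copies)]


class Ring:
    def __init__(self, node_names):
        # Each node sits at the hash of its name; sorted for binary search.
        self.nodes = sorted((key_of(n), n) for n in node_names)
        self.positions = [pos for pos, _ in self.nodes]

    def owner(self, key: int) -> str:
        """First node clockwise from the key, wrapping around at the end."""
        i = bisect_left(self.positions, key) % len(self.nodes)
        return self.nodes[i][1]
```

For example, `[ring.owner(k) for k in replica_keys(key_of("mailbox/42"), 3)]` lists the nodes that would hold three copies of that object. Any client can recompute this list on its own, which is why no central metadata server is needed for placement.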

Conclusion – Domain Specific Storage?

The FS part in HDFS is a bit misleading: it cannot be mounted natively to appear as a POSIX filesystem, and that’s not what it was designed for. As a distributed processing platform, Hadoop needs a way to reliably and practically store the large datasets it needs to work on, and pushing the data as close as possible to each computing unit is key for obvious performance reasons.

As I see it, HDFS was designed as a domain specific storage component for large map/reduce computations. Its usage can possibly be extended to similar specific applications.

Scality RING can also be seen as domain specific storage; our domain being unstructured content: files, videos, emails, archives and other user generated content that constitutes the bulk of the storage capacity growth today.

Scality RING and HDFS share the fact that they would be unsuitable to host the raw files of a MySQL database; however, they do not try to solve the same issues, and this shows in their respective designs and architectures.


Written by Giorgio Regni

December 7, 2010 at 6:45 pm

Posted in Storage


Do not design in a vacuum…


I came upon this picture a few days ago:

OpenOffice Mouse

What happens when you design in a vacuum

Yes, that’s the Open Office Mouse!

Look at the ridiculous number of buttons on this thing, and what about that stick on the left side? I am pretty sure the designer isn’t a lefty as this looks like it can only work with a right hand…

Here’s an excerpt from their marketing materials:

The OOMouse is one of the first computer mice to incorporate an analog joystick and the first to permit the use of the joystick as a keyboard. In the three joystick-as-keyboard modes, the user can assign up to sixteen different keys or macros to the joystick, which provides for easy movement regardless of whether the user is flying through the cells of a large spreadsheet in Microsoft Excel or on the back of an epic flying mount in World of Warcraft.

A mouse that’s good at both Excel and World of Warcraft! Way to choose your market! By the way, why not talk about Open Office, since it’s an “Open Office” mouse?

At least one thing is clear: this wasn’t designed by Apple, which created and follows the drive towards simplicity, maybe pushing it too far with the one button mouse in 2005:

Apple one button mouse

Maybe 2005 was too early, but this one didn’t sell well either. Since then, Apple has embraced the two buttons + scroll wheel design and now goes even further with a multi touch trackpad.
I am pretty sure that one button mouse was also designed in a vacuum, full of like-minded “simple is better” people looking down upon us end users, the proverbial lemmings.

Learning from experience

Actually, I did fall into the same trap in 2008. We were rolling out a new version of our massively scalable email gateway that promised to deliver a cutting edge new way to stop spam as early as possible, without even getting down to establishing a TCP connection.

Well, guess what: as technically advanced as this was, it came along with countless issues that prevented any of our customers from deploying it. Some of the most important ones:

  • Legal issues: rejecting email without sending an error message, with a support link for example, was a big NO
  • False positives: real legitimate email senders could get blocked by mistake and it would have been a nightmare for them to debug what was actually happening
  • Lots of spammers would actually try even harder because they didn’t handle this error case as a permanent failure
  • Our product was able to handle thousands of TCP sessions on the same server, so it really didn’t matter that much to close a session early

But that didn’t stop us from coding and delivering it as the technical prowess sounded too good to our engineer ears and we didn’t care to listen…

That Henry Ford quote

If I had asked people what they wanted, they would have said a faster horse. Henry Ford

This quote is pretty popular and is usually used to support the idea that customers do not know what they want and are the last people to listen to for vision. Heck, even Steve Jobs uses it!

“It comes down to the very real fact that most customers don’t know what they want in a new product.” Apple customers should be glad Jobs doesn’t do focus groups. If he had, they may never have enjoyed iPods, iTunes, the iPhone, the iPad, or Apple Stores. Jobs doesn’t need focus groups because he understands his customers really, really well. Yes, sometimes better than they know themselves!… Sure, “listen” to your customers and ask them for feedback. Apple does that all the time. But when it comes to breakthrough success at Apple, Steve Jobs and his team are the company’s best focus group. Asked why Apple doesn’t do focus groups, Jobs responded: “We figure out what we want. You can’t go out and ask people ‘what’s the next big thing?’ There’s a great quote by Henry Ford. He said, “If I’d have asked my customers what they wanted, they would have told me ‘A faster horse.’”” Steve Jobs

Well, it sounds like Apple still listens to its customers and asks them for feedback, so I guess it’s more about what kind of question you ask.

Stupid questions of course always get stupid answers back…
The quality of your answers comes from the questions you asked.

In your wildest dream, what should a storage platform look like?

This is the question we asked our customers in 2008, all very, very large MSOs, cable TV networks and internet service providers.

The answer was clearly the opposite of what they could buy at the time: centralized, monolithic, expensive SAN systems…

We allowed them to dream about the best platform, without worrying about any legacy support or backwards thinking.

It wasn’t easy to get the juices going but after many carefully spaced out meetings, we came down to this list of requirements:

Problem | Requirement
« Sharding of database » creates a hard association between application server and user | A stateless system. Automatic index load distribution.
Single point of failure: when a SAN / NAS / FC switch reboots, service is down for minutes or hours | No component should ever cause a service loss
COST: At 1USD/mail/year just for storage cannot compete. Beyond 300 TB, cost/TB increases. | Be able to compete with Google, below 2 USD/Mailbox/year. Leverage decreasing price of generic hardware.
Managing multiple SAN, volumes, tiering is complex, error prone and costly. | Ease of management: autonomic, policy based, self-healing system.
Competitive against Google, Yahoo, etc… | Enabling new services: text search, photo recognition, transcoding.

This is the list of requirements we’ve based our Scality Ring platform on. Mind you, this was before cloud even became a buzzword…

Here’s the architecture:
Scality Ring Architecture v1.3

You can learn more about our technology by visiting our website.

Today it’s live, in production, taking traffic from millions of users. We couldn’t have done it without working with and getting feedback from our customers!

Lesson learned, do not design in a vacuum…

Comments welcomed of course.

Written by Giorgio Regni

October 15, 2010 at 4:28 pm

Scality SCOP – $100,000 Incentive Fund for Open Source Software Developers


Full press release and details on

We’re opening the cloud drop by drop! Our goal with this open source library is to promote the use of object based cloud storage and simplify the job of application developers in the process.

We are trying to address the most common user concerns associated with cloud storage (Freedom/Openness, Performance, Security and Visibility) at the client library level so that application developers using Scality Droplet can spend more time focusing on their own user experience instead.

The second part of that strategy is to reward open source developers with our Scality Open Source Program (SCOP) by offering bounties for applications that we feel are a great match for the cloud. Total bounty pot is $100,000, divided into $1000 to $10000 individual app bounties. Look here for the list of applications.

You can also submit your own application idea. If we like it, we’ll create a bounty for it so an open source developer can step up, and you will basically have outsourced the development of your dream application for free 🙂 Apply here while there’s still money left!

Ping me on twitter @GiorgioRegni

Written by Giorgio Regni

September 23, 2010 at 5:22 am

Scale out application development


First post! That warrants an introduction:

I am Giorgio Regni, CTO of Scality, a start-up with the goal of solving the very real and pressing problem of storing unstructured content for large service providers like web hosters, cable companies, ISPs and web2.0 type companies.

We have developed a unique distributed object store technology based on the Chord peer to peer addressing algorithm to handle petabytes of data with online replication, tiering and multi data center fault tolerance on top of inexpensive generic x86 hardware. It's actually taking full production traffic today at Telenet, a Belgian cable & internet service provider, serving millions of Zimbra email users.
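For readers who have not met Chord before, here is a toy sketch of its finger-table lookup, the part that lets a node find the owner of a key in a logarithmic number of hops. It uses a 6-bit key space and the ten node ids from the classic Chord paper example. The `ChordRing` class and the global view of all node ids are simplifications assumed for illustration; a real deployment computes fingers per node over the network, and this is not our production code.

```python
M = 6             # bits in the toy key space (a real ring would use far more)
SPACE = 1 << M


def between(x, a, b, inclusive_right=False):
    """True if x is in the ring interval (a, b), or (a, b] when inclusive_right."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b   # interval wraps past zero


class ChordRing:
    def __init__(self, node_ids):
        self.ids = sorted(node_ids)

    def successor(self, k):
        """First node at or after key k, going clockwise."""
        k %= SPACE
        for n in self.ids:
            if n >= k:
                return n
        return self.ids[0]

    def fingers(self, n):
        """Finger i of node n points at the successor of n + 2**i."""
        return [self.successor(n + (1 << i)) for i in range(M)]

    def lookup(self, start, key):
        """Route from node `start` to the owner of `key`; returns (owner, hops)."""
        key %= SPACE
        n, hops = start, 0
        while True:
            succ = self.successor(n + 1)
            if between(key, n, succ, inclusive_right=True):
                return succ, hops
            # Forward to the farthest finger that still precedes the key.
            n = next(f for f in reversed(self.fingers(n)) if between(f, n, key))
            hops += 1


ring = ChordRing([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
print(ring.lookup(8, 54))   # (56, 2): node 8 forwards to 42, then 51, which knows 56 owns key 54
```

Each forward roughly halves the remaining distance to the key, which is where the O(log N) lookup cost comes from.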

Prior to that, as VP Engineering of Bizanga, I developed a massively scalable anti-abuse layer for email service providers that is today fighting spam and switching more than 1 billion emails a day across the globe at well known locations like Comcast, Cox and Charter in the US.

This blog is about sharing ideas on next generation software platforms to support the exponentially growing needs of today's applications. Exabytes of data, billions of users, millions of requests per second: the singularity is near, it's mind boggling and we're only at the beginning!

As the likes of Facebook, YouTube, Twitter and their increasing capacity & performance race have shown, there need to be changes down to the very core of application architectures, and I find this entire subject fascinating. Current solutions all fall into two camps:

  • Safely reuse old ideas and incrementally improve on them: Virtual machines, SAN, NAS, …

  • Brand new projects starting from scratch, promising a lot but lacking in maturity, interoperability and real world operation feedback

My focus is on very large scale applications enabling services for millions of users. I truly believe a complete next generation application stack to satisfy such gigantic scalability requirements is the way to go, even if it involves rethinking a lot of layers in the process (object storage, distributed metadata, stateless everything, etc…).

My goal is to share news, observations, ideas, vision as well as code. I strongly value feedback and comments, especially from people who are in disagreement!

Thanks for reading that far!

I can be reached on twitter @GiorgioRegni and linkedin




Written by Giorgio Regni

August 23, 2010 at 10:19 pm