I am certainly not an expert in tape storage, but the question is interesting. It was said that “Disk is the new Tape” at Chirp 2010 in the Scaling Twitter presentation, but is it really? Could it make sense for somebody to buy a massive tape library, bolt on a REST interface and charge for tape on a pay-as-you-backup model?
To start, let’s assume there’s some clever way to lay out each user’s data on tape so that the robot doesn’t need to mount tens of separate tapes to retrieve it, all while serializing writes to tape. There is surely enough cleverness out there, with the help of some disk caching, to do this intelligently. Alternatively, it could be an archival feature of current cloud offerings. Now, what does tape really cost in big volumes? The math is tricky because the medium is not what costs the most. Some San Diego Supercomputer Center researchers gave the math a shot and their findings seem well thought through. The article states a price of $500/TB/year for one copy.
The same article states that storage at Amazon S3 costs $1850/TB/year, plus $205 to initially store the data. Clearly AWS S3 provides a higher quality of storage, with more than one copy and much more readily available access. Wow, the cost difference is around a factor of 4! Hmm, so maybe it would be possible to create the Acme Cloud Tape Backup Service and make a killing! After all, cloud storage is sometimes referred to as WORN storage anyway (write once, read never). A mammoth installation should be able to provide tape at even more economical prices than this study indicates, bringing the price still lower; now this starts to look really interesting!
Then again, there are some counter arguments:
- For a near-line storage offering to be interesting, it would need to be as reliable as, and much less expensive than, disk-based storage. One tape copy will never do, so we had better keep 2 copies, cutting our margin in half and raising the price to $1000/TB/year.
- We must also admit that the cost of using disks can realistically be reduced significantly with different redundancy models, such as the dispersed storage that Cleversafe and Amplidata propose. It is surely possible to reach the $1000/TB/year number if disk storage is allowed latencies and I/O performance similar to tape’s.
- Tape has power consumption arguments in its favor, but it would be fairly simple to power down drives that are full and achieve similar results.
- The level of investment in hard drive technology speaks in its favor over time, especially since much of the cost of tape is tied to the physical handling of cartridges by robots and humans; costs that are not subject to Moore’s law or its derivatives.
- Another advantage of hard disks is their ability to tell us when they are failing and need replacement. Tapes, on the other hand, must simply be replaced on a regular schedule because, especially with a single copy, by the time read errors occur due to tape degradation it is already too late.
The question is interesting though. We have to admit that tape is still cheaper than disk, and it doesn’t consume power remembering what you asked it to remember. So yeah, I think there just might be a market, but whoever does it had better be very good at what they do.
One last thought:
You have to admit Exabyte was a very cool and forward looking name for a tape drive company!
On the other hand their site is Temporarily Unavailable today.
It’s a question I get a lot, so I thought I’d answer it here so I can point people to this blog post when it comes up again!
So first, an introduction:
What are Hadoop and HDFS?
Apache Hadoop is a software framework that supports data-intensive distributed applications. It’s open source software released under the Apache license. It can work with thousands of nodes and petabytes of data and was significantly inspired by Google’s MapReduce and Google File System (GFS) papers.
Hadoop was not fundamentally developed as a storage platform, but since data mining algorithms like map/reduce work best when they can run as close to the data as possible, it was natural to include a storage component.
This storage component does not need to satisfy generic storage constraints; it just needs to be good at storing data for map/reduce jobs over enormous datasets, and this is exactly what HDFS does.
About Scality RING object store
Our core RING product is a software-based solution that utilizes commodity hardware to create a high performance, massively scalable object storage system.
Our technology has been designed from the ground up as a multi petabyte scale tier 1 storage system to serve billions of objects to millions of users at the same time.
We did not come from the backup or CDN spaces.
Surprisingly for a storage company, we came from the anti-abuse email space for internet service providers.
Why did we develop it?
The initial problem our technology was born to solve is the storage of billions of emails – that is, highly transactional data, crazy IOPS demands, and a need for an architecture flexible and scalable enough to handle exponential growth. Yes, even with the likes of Facebook, Flickr, Twitter and YouTube, email storage still more than doubles every year, and it’s accelerating!
Rather than dealing with a large number of independent storage volumes that must be individually provisioned for capacity and IOPS (as with a file-system based architecture), RING instead mutualizes the storage system. Essentially, capacity and IOPS are shared across a pool of storage nodes in such a way that it is not necessary to migrate or rebalance users when a performance spike occurs. This removes much of the complexity from an operational point of view, as there is no longer a strong affinity between where a user’s metadata is located and where the actual content of their mailbox is.
Another big area of concern is underutilization of storage resources: it’s typical to see disk arrays in a SAN running less than half full because of IOPS and inode (number of files) limitations. We designed automated tiered storage that takes care of moving data to less expensive, higher density disks according to object access statistics, as multiple RINGs can be composed one after the other or in parallel; for example, 7K RPM drives for large objects and 15K RPM or SSD drives for small files and indexes. In this way, we can make the best use of different disk technologies, namely, in order of performance: SSD, SAS 10K and terabyte-scale SATA drives.
To remove the typical limitation on the number of files stored on a disk, we use our own data format to pack objects into larger containers. This actually solves multiple problems:
- write IO load is more linear, meaning much better write bandwidth
- each disk or volume is accessed through a dedicated IO daemon process and is isolated from the main storage process; if a disk crashes, it doesn’t impact anything else
- billions of files can be stored on a single disk
Let’s compare both systems in this simple table:
| | Hadoop HDFS | Scality RING |
| --- | --- | --- |
| Architecture | Centralized around a name node that acts as a central metadata server. Any number of data nodes. | Fully distributed architecture using consistent hashing in a 20 byte (160 bit) key space. Every node runs the same code. |
| Single point of failure | The name node is a single point of failure; if it goes down, the filesystem is offline. | No single point of failure; metadata and data are distributed across the cluster of nodes. |
| Clustering/nodes | Static configuration of name nodes and data nodes. | Peer-to-peer algorithm based on CHORD, designed to scale past thousands of nodes. Complexity is O(log(N)), N being the number of nodes. Nodes can enter or leave while the system is online. |
| Replication model | Data is replicated on multiple nodes; no need for RAID. | Data is replicated on multiple nodes; no need for RAID. |
| Disk usage | Objects are stored as files, with the typical inode and directory tree issues. | Objects are stored in an optimized container format that linearizes writes and reduces or eliminates inode and directory tree issues. |
| Replication policy | Global setting. | Per-object replication policy, between 0 and 5 replicas. Replication is based on projection of keys across the RING and adds no runtime overhead, as replica keys can be calculated and do not need to be stored in a metadata database. |
| Rack aware | Supported in 3-copies mode. | Supported. |
| Data center aware | Not supported. | Yes, including asynchronous replication. |
| Tiered storage | Not supported. | Yes; rings can be chained or used in parallel, and a plugin architecture allows other technologies as backends, for example dispersed storage or an iSCSI SAN. |
Conclusion – Domain Specific Storage?
The FS part of HDFS is a bit misleading: it cannot be mounted natively to appear as a POSIX filesystem, and that’s not what it was designed for. As a distributed processing platform, Hadoop needs a way to reliably and practically store the large datasets it works on, and pushing the data as close as possible to each computing unit is key for obvious performance reasons.
As I see it, HDFS was designed as a domain specific storage component for large map/reduce computations. Its usage can possibly be extended to similar specific applications.
Scality RING can also be seen as domain specific storage; our domain being unstructured content: files, videos, emails, archives and other user generated content that constitutes the bulk of the storage capacity growth today.
Scality RING and HDFS share the fact that both would be unsuitable for hosting a MySQL database’s raw files; however, they do not try to solve the same problems, and this shows in their respective designs and architectures.
In distributed computing, assigning a unique ID to each machine in a cluster is often required. When the number of machines is huge, these IDs are generally generated automatically and randomly. This works if systems pick really big random numbers (e.g. 128 bits), so that collisions are unlikely.
Occasionally, systems must choose IDs with a small number of bits (e.g. 1000 machines with a 16-bit ID). In this case the ID assignment scheme cannot rely on randomness and becomes a networking and computational challenge. It is exposed here:
In ancient China, under the Tang dynasty, Hong WeiAn was an official at court. He was a very talented accountant and his favorite occupation was to submit maths games to his subordinates. One day he brought them all into the imperial garden and asked them the following problem:
"My friends, you are 100 people today. Put yourselves in a circle. You must each choose a unique number from 0 to 99, but there are some rules…".
"1 – The first rule is that you may not announce your number to everyone at once".
"2 – The second rule is that you must not elect a leader".
"3 – There must be as few conversations as possible."
And he finished:
"4 – Don't rely too much on randomness! If you each pick a number at random, the chances that two of you end up with the same number are nearly 100%."
For clarity, in modern language the rules would read:
1 – No broadcast
2 – No leader election to coordinate the job.
3 – The goal is to find a converging solution with as few messages exchanged between the subordinates as possible.
The solution to the problem is the generalization of the algorithm and protocol to any number of participants, with the fewest possible messages.
Preferably, provide a solution where participants do not exchange lists, only small question/answer packets.
Please post your solution!
I came upon this picture a few days ago:
What happens when you design in a vacuum
Yes, that’s the Open Office Mouse!
Look at the ridiculous number of buttons on this thing, and what about that stick on the left side? I am pretty sure the designer isn’t a lefty, as this looks like it can only work with the right hand…
Here’s an excerpt from their marketing materials:
The OOMouse is one of the first computer mice to incorporate an analog joystick and the first to permit the use of the joystick as a keyboard. In the three joystick-as-keyboard modes, the user can assign up to sixteen different keys or macros to the joystick, which provides for easy movement regardless of whether the user is flying through the cells of a large spreadsheet in Microsoft Excel or on the back of an epic flying mount in World of Warcraft.
A mouse that’s good at both Excel and World of Warcraft! Way to choose your market! By the way, why not talk about OpenOffice, since it’s an “Open Office” mouse?
At least one thing is clear: this wasn’t designed by Apple, which created and follows the drive towards simplicity, maybe pushing it too far with the one-button mouse in 2005:
Maybe 2005 was too early, but that one didn’t sell well either. Since then, Apple has embraced the two buttons + scroll wheel design, and even goes further with a multi-touch trackpad now.
I am pretty sure that one-button mouse was also designed in a vacuum full of “simple is better” like-minded people, looking down upon us, the proverbial lemming end users.
Learning from experience
Actually, I fell into the same trap in 2008. We were rolling out a new version of our massively scalable email gateway, which promised a cutting-edge new way to stop spam as early as possible, without even getting as far as establishing a TCP connection.
Well, guess what: as technically advanced as this was, it came with countless issues that prevented any of our customers from deploying it. Some of the most important ones:
* Legal issues: rejecting email without sending an error message (with a support link, for example) was a big NO
* False positives: legitimate email senders could get blocked by mistake, and it would have been a nightmare for them to debug what was actually happening
* Lots of spammers would actually try even harder, because they didn’t handle this error case as a permanent failure
* Our product could handle thousands of TCP sessions on the same server, so closing a session early didn’t actually matter that much
But that didn’t stop us from coding and delivering it; the technical prowess sounded too good to our engineers’ ears, and we didn’t care to listen…
That Henry Ford quote
If I had asked people what they wanted, they would have said a faster horse. Henry Ford
This quote is pretty popular and is usually used to support the idea that customers do not know what they want and are the last people to listen to for vision. Heck, even Steve Jobs used it!
“It comes down to the very real fact that most customers don’t know what they want in a new product.” Apple customers should be glad Jobs doesn’t do focus groups. If he had, they may never have enjoyed iPods, iTunes, the iPhone, the iPad, or Apple Stores. Jobs doesn’t need focus groups because he understands his customers really, really well. Yes, sometimes better than they know themselves!… Sure, “listen” to your customers and ask them for feedback. Apple does that all the time. But when it comes to breakthrough success at Apple, Steve Jobs and his team are the company’s best focus group. Asked why Apple doesn’t do focus groups, Jobs responded: “We figure out what we want. You can’t go out and ask people ‘what’s the next big thing?’ There’s a great quote by Henry Ford. He said, ‘If I’d have asked my customers what they wanted, they would have told me: a faster horse.’” Steve Jobs
Well, it sounds like Apple still listens to its customers and asks them for feedback, so I guess it’s more about what kind of questions you ask.
Stupid questions, of course, always get stupid answers back…
The quality of your answers comes from the questions you ask.
In your wildest dream, what should a storage platform look like?
This is the question we asked our customers in 2008 – all very, very large MSOs, cable TV networks and internet service providers.
The answer was clearly the opposite of what they could buy at the time, centralized, monolithic, expensive SAN systems…
We allowed them to dream about the best platform, without worrying about any legacy support or backwards thinking.
It wasn’t easy to get the creative juices flowing, but after many carefully spaced out meetings, we came down to this list of requirements:
| Pain point at the time | Dream requirement |
| --- | --- |
| « Sharding of database » creates a hard association between application server and user | A stateless system. Automatic index load distribution. |
| Single point of failure: when a SAN / NAS / FC switch reboots, service is down for minutes or hours | No component should ever cause a service loss |
| Cost: at 1 USD/mailbox/year just for storage, we cannot compete. Beyond 300 TB, cost/TB increases. | Be able to compete with Google, below 2 USD/mailbox/year. Leverage the decreasing price of generic hardware. |
| Managing multiple SANs, volumes and tiering is complex, error prone and costly. | Ease of management: an autonomic, policy-based, self-healing system. |
| Be competitive against Google, Yahoo, etc. | Enable new services: text search, photo recognition, transcoding. |
This is the list of requirements we based our Scality RING platform on – mind you, this was before cloud even became a buzzword…
Here’s the architecture:
You can learn more about our technology by visiting our website.
Today it’s live, in production, taking traffic from millions of users. We couldn’t have done it without working with, and getting feedback from, our customers!
Lesson learned: do not design in a vacuum…
Comments welcomed of course.
LLVM 2.8 has been released, with a few major improvements over the previous release. Highlights:
- Clang C++ is now feature-complete with respect to the ISO C++ 1998 and 2003 standards.
- Objective-C++ is now supported
- Added support for SSE, AVX, ARM NEON, and AltiVec. (vector instructions)
- Improved generated code quality in some areas:
- Good code generation for X86-32 and X86-64 ABI handling.
- Improved code generation for bit-fields, although important work remains.
Clang Static Analyzer
The Clang Static Analyzer project is an effort to use static source code analysis techniques to automatically find bugs in C and Objective-C programs (and hopefully C++ in the future!). The tool is very good at finding bugs that occur on specific paths through code, such as on error conditions.
VMKit: JVM/CLI Virtual Machine Implementation
The VMKit project is an implementation of a Java Virtual Machine (Java VM or JVM) that uses LLVM for static and just-in-time compilation. As of LLVM 2.8, VMKit now supports copying garbage collectors, and can be configured to use MMTk’s copy mark-sweep garbage collector. In LLVM 2.8, the VMKit .NET VM is no longer being maintained.
LLDB: Low Level Debugger
LLDB is a brand new member of the LLVM umbrella of projects. LLDB is a next generation, high-performance debugger. It is built as a set of reusable components which highly leverage existing libraries in the larger LLVM Project, such as the Clang expression parser, the LLVM disassembler and the LLVM JIT.
LLDB is in early development and not included as part of the LLVM 2.8 release, but is mature enough to support basic debugging scenarios on Mac OS X in C, Objective-C and C++. We’d really like help extending and expanding LLDB to support new platforms, new languages, new architectures, and new features.
Who knew there was so much room left for innovation in a compiler toolchain?
Great work again from the LLVM team!
More details in the official release notes.
Downloading right now…
Amazon Simple Storage Service (“S3”) was one of the first solutions allowing any user to store and access their files and documents securely and durably on the Internet.
Because the S3 protocol includes a comprehensive set of features guaranteeing the security, integrity and durability of storage, and because it has been made publicly available, it has been widely adopted by open source and proprietary client tools, becoming a de facto standard.
The S3 protocol enabled an interesting model of charging users for data transfer and data storage. Consequently, it has been adopted by the hosting industry, which now offers services compatible with the S3 protocol.
The idea is that every user owns a set of “buckets” (currently a maximum of 100 per user at Amazon). Each bucket can be viewed as a directory containing files. By default a bucket is private and only its owner can access it, but a bucket can be made public, and it is possible to set more fine-grained permissions on it, e.g. allowing other users to view files. Users can issue all kinds of requests: listing buckets, putting files into buckets, getting files, modifying the ACLs of files, binding metadata to files, etc.
The S3 protocol is roughly a REST protocol augmented with strong authentication: first of all it guarantees the identity of the user, but it also allows nice features like access delegation to other users, temporary access, and so on. The protocol also makes it possible to guarantee the integrity of the content by using an MD5 checksum computed before transfer and re-computed when the file is finally stored in the hosting provider’s cloud. File transfers can be encrypted by using HTTPS instead of the default HTTP.
Libdroplet is a C library that implements the S3 protocol and facilitates the writing of tools that interact with S3 services.
Libdroplet comes with a set of features that enhance the S3 protocol:
- Multi-profile system
- Fully multi-threaded (efficient in a data center environment)
- Virtual directories with true absolute and relative path support
- On-the-fly encryption/decryption and buffered I/O
- Manages storage pricing
- Simplified metadata management
It also includes a small shell tool that lets you browse buckets, with file and directory completion.
First, download the latest version of libdroplet here.
Untar the archive and compile it:
$ tar zxvf scality-Droplet-2a678dd.tar.gz
$ cd scality-Droplet-2a678dd
$ make
$ sudo make install
$ mkdir ~/.droplet
$ cp doc/default.profile ~/.droplet
$ cp doc/AWS_US-Standard_Storage.pricing ~/.droplet
$ edit ~/.droplet/default.profile
(set the access_key and secret_key that you got from your hosting provider)
Test your configuration with dplsh:
bucket1:/> mkdir foo
bucket1:/> cd foo
bucket1:/foo/> ls -l
bucket1:/foo/> put /etc/hosts
bucket1:/foo/> ls -l
It is possible to create new profiles in the ~/.droplet directory, e.g. foo.profile. A profile can then be selected with the DPLPROFILE environment variable. You can find additional help on dplsh.
The Droplet library API is split into three layers:
- S3 request builder API
- S3 convenience API
- Vdir high-level API
The Vdir high-level API lets you refer to files by absolute or relative paths (e.g. "../foo/bar"); for this it uses S3's Delimiter feature. It also enables features like on-the-fly encryption, buffered I/O, etc.:
//manipulate virtual directories
dpl_status_t dpl_opendir(dpl_ctx_t *ctx, char *path, void **dir_hdlp);
dpl_status_t dpl_readdir(void *dir_hdl, dpl_dirent_t *dirent);
int dpl_eof(void *dir_hdl);
void dpl_closedir(void *dir_hdl);
dpl_status_t dpl_chdir(dpl_ctx_t *ctx, char *path);
dpl_status_t dpl_mkdir(dpl_ctx_t *ctx, char *path);
dpl_status_t dpl_rmdir(dpl_ctx_t *ctx, char *path);
dpl_status_t dpl_openwrite(dpl_ctx_t *ctx, char *path, u_int flags, dpl_dict_t *metadata, dpl_canned_acl_t canned_acl, u_int data_len, dpl_vfile_t **vfilep);
dpl_status_t dpl_write(dpl_vfile_t *vfile, char *buf, u_int len);
dpl_status_t dpl_openread(dpl_ctx_t *ctx, char *path, u_int flags, dpl_condition_t *condition, dpl_buffer_func_t buffer_func, void *cb_arg);
dpl_status_t dpl_unlink(dpl_ctx_t *ctx, char *path);
dpl_status_t dpl_getattr(dpl_ctx_t *ctx, char *path, dpl_condition_t *condition, dpl_dict_t **metadatap);
A C example (adapted from examples/recurse.c, with the error handling and boilerplate completed so the listing compiles):
/* simple example which recurses a directory tree */
#include <stdio.h>
#include <string.h>
#include <droplet.h>

static dpl_status_t recurse(dpl_ctx_t *ctx, char *dir, int level)
{
  void *dir_hdl;
  dpl_dirent_t dirent;
  dpl_status_t ret;
  int i;

  //vfs style call to change directory
  ret = dpl_chdir(ctx, dir);
  if (DPL_SUCCESS != ret)
    return ret;
  //vfs style call to open a directory
  ret = dpl_opendir(ctx, ".", &dir_hdl);
  if (DPL_SUCCESS != ret)
    return ret;
  while (!dpl_eof(dir_hdl)) {
    //vfs style readdir
    ret = dpl_readdir(dir_hdl, &dirent);
    if (DPL_SUCCESS != ret)
      return ret;
    if (strcmp(dirent.name, ".")) {
      for (i = 0; i < level; i++)
        printf(" ");
      printf("%s\n", dirent.name);
      if (DPL_FTYPE_DIR == dirent.type) {
        ret = recurse(ctx, dirent.name, level + 1);
        if (DPL_SUCCESS != ret)
          return ret;
      }
    }
  }
  dpl_closedir(dir_hdl); //close a directory
  if (level > 0) {
    //vfs like functions manipulate relative paths
    ret = dpl_chdir(ctx, "..");
    if (DPL_SUCCESS != ret)
      return ret;
  }
  return DPL_SUCCESS;
}

int main(int argc, char **argv)
{
  dpl_ctx_t *ctx;
  char *bucket = NULL;
  dpl_status_t ret;

  if (2 != argc) {
    fprintf(stderr, "usage: recurse bucket\n");
    return 1;
  }
  bucket = argv[1];
  //initialize the lib
  ret = dpl_init();
  if (DPL_SUCCESS != ret) {
    fprintf(stderr, "dpl_init failed\n");
    return 1;
  }
  //create a droplet context
  ctx = dpl_ctx_new(NULL, NULL);
  if (NULL == ctx) {
    fprintf(stderr, "dpl_ctx_new failed\n");
    return 1;
  }
  ctx->cur_bucket = bucket; //set current bucket
  ret = recurse(ctx, "/", 0);
  if (DPL_SUCCESS != ret)
    fprintf(stderr, "error recursing\n");
  dpl_ctx_free(ctx); //free the droplet context
  dpl_free(); //terminates the library
  return 0;
}
That's all folks!
Full press release and details on scop.scality.com.
We’re opening the cloud drop by drop! Our goal with this open source library is to promote the use of object based cloud storage and simplify the job of application developers in the process.
We are trying to address the most common user concerns associated with cloud storage (freedom/openness, performance, security and visibility) at the client library level, so that application developers using Scality Droplet can spend more time focusing on their own user experience instead.
The second part of that strategy is to reward open source developers with our Scality Open Source Program (SCOP) by offering bounties for applications that we feel are a great match for the cloud. Total bounty pot is $100,000, divided into $1000 to $10000 individual app bounties. Look here for the list of applications.
You can also submit your own application idea; if we like it, we’ll create a bounty for it so an open source developer can step up – and you’ve basically just outsourced the development of your dream application for free 🙂 Apply here while there’s still money left!
Ping me on twitter @GiorgioRegni