October 16, 2007

First Release of HBase in Hadoop

With all of the excitement of the past few weeks, including IBM partnering with Google to create a distributed systems lab for college students based on Hadoop, and Kosmix releasing the open source Kosmos Distributed File System with Hadoop compatibility, we thought it would be a good time to talk about an area of the Hadoop project that we’ve spent a lot of time on over the past year, and that we want to invite you to use and contribute to: HBase.

There is a lot to the “secret sauce” that is Powerset Web search, such as the XLE (licensed from PARC), ranking algorithms, and the ever-important onomasticon. These components are proprietary.

For any other component, we try to use open source software if it is available. One of the unsung heroes that forms the foundation for all of these components is the ability to process insane amounts of data. This is especially true for a Natural Language search engine. A typical keyword search engine will gather hundreds of terabytes of raw data to index the Web. Then, that raw data is analyzed to create a similar amount of secondary data, which is used to rank search results. Since our technology creates a massive amount of secondary data through its deep language analysis, Powerset will be generating far more data than a typical search engine, eventually ranging up to petabytes of data.

Google uses a number of well-known components to fulfill its data processing needs: a distributed file system (GFS), Map/Reduce, and Bigtable.

Instead of creating a proprietary copy of these pieces of infrastructure, Powerset decided to turn to Hadoop, a Lucene sub-project that provides a framework for running data-intensive applications on large clusters of commodity hardware.

Powerset’s already benefited greatly from the use of Hadoop: our index build process is entirely based on a Hadoop cluster running the Hadoop Distributed File System (HDFS) and makes use of Hadoop’s map/reduce features.

Unfortunately, there was no Hadoop equivalent to Google’s Bigtable storage engine. Because we have benefited greatly from the available Hadoop technology, Powerset decided to give back to the community by developing an open source analog to Bigtable that is built on top of HDFS. After all, we needed to develop it anyway, it isn’t part of the Powerset “secret sauce”, and we, in turn, could benefit from contributions from other members of the community.

For the past 10 months, Michael Stack (known to his friends as St.Ack) and I have been working full time on developing the open source Bigtable-like storage engine that we call HBase, a sub-project of Hadoop.
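For readers who have not met Bigtable before, its data model can be thought of, roughly, as a sparse sorted map from (row key, column, timestamp) to a value. The toy sketch below illustrates that idea in plain in-memory Python. It is not HBase's API: the class name ToyTable, its put, get, and scan methods, and the sample row and column names are all hypothetical, chosen only for illustration, and nothing here is distributed or persistent.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory illustration of a Bigtable-style data model.

    This is NOT HBase's API; every name here is an assumption made for
    the sake of the example.
    """

    def __init__(self):
        # row key -> column name -> list of (timestamp, value),
        # kept sorted by timestamp so the newest version is last.
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        bisect.insort(self._rows[row][column], (timestamp, value))

    def get(self, row, column):
        """Return the newest value for a cell, or None if it is absent."""
        cells = self._rows.get(row, {}).get(column)
        return cells[-1][1] if cells else None

    def scan(self, start_row, stop_row):
        """Yield (row, column, newest value) for start_row <= row < stop_row,
        visiting rows in sorted key order."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self._rows[row][column][-1][1]


table = ToyTable()
table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
table.put("org.apache/hadoop", "anchor:text", "Hadoop", timestamp=1)

newest = table.get("com.example/index", "contents:html")
com_rows = list(table.scan("com", "d"))
```

Two properties are worth noticing in the sketch: rows are visited in sorted key order, which is what makes range scans cheap in a real system, and absent cells cost nothing, which is what "sparse" means here.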

Because of the progress that we and the other HBase contributors have made in the past couple of months, we’re happy to announce that the first HBase release will be available with the Hadoop 0.15.0 release. As with any project, there’s always more to do, but that’s exactly why we invite you to join the community. If you’re an interested software developer or a company that needs a platform for large-scale computing, and in particular if you are looking for a distributed storage system for massive amounts of structured data, try HBase. We’re happy to help you get started: all of our contact information is available on the HBase site.

We hope that you’ll both use HBase and contribute!

Jim Kellerman, Senior Engineer, Powerset

Comments (17)

Blah, as interesting as a flat tire.

Instead of generating a search engine that falls far short of Google and only a fringe crowd will use, you could use your collective energies to making a more accurate eng/any latin character language translation software.

If I understood right: you are going to copy google? IMHO it’s wrong way, you’ll never beat’em on their lands..

very very nice topic thnx (:

very interesting, but I don’t agree with you Idetrorce

Oh, and did not know about it. Thanks for the information …

Your ideas are great! Thanks a lot! I am looking forward to new ideas.

Your crawl is by far the most offensive ever to hit our site. I hope your company falls flat on its inefficient face.

Preved dyatlam!

Will this be any better than Hypertable (hypertable.org) which claims it is already compatible with HadoopFS? I wonder if Hypertable is automatically faster because it is written in C++?

Maybe someone can do a comparison. And throw Amazon SimpleDB into the mix, although SimpleDB is bound to be slower, and has some policy limitations, e.g. a limit on number of attributes, etc.

thanx.

interesting that googs and kosmix teamed up for this project.


© 2007 - Powerset Inc. - All rights reserved.