First Release of HBase in Hadoop
With all of the excitement of the past few weeks, from IBM partnering with Google to create a distributed systems lab for college students based on Hadoop, to Kosmix releasing the open source Kosmos Distributed File System with Hadoop compatibility, we thought it would be a good time to talk about an area of the Hadoop project that we’ve spent a lot of time on over the past year, and that we invite you to use and contribute to: HBase.
There is a lot to the “secret sauce” that is Powerset Web search: the XLE (licensed from PARC), ranking algorithms, and the ever-important onomasticon. These components are proprietary.
For any other component, we try to use open source software if it is available. One of the unsung heroes that forms the foundation for all of these components is the ability to process insane amounts of data. This is especially true for a Natural Language search engine. A typical keyword search engine will gather hundreds of terabytes of raw data to index the Web. Then, that raw data is analyzed to create a similar amount of secondary data, which is used to rank search results. Since our technology creates a massive amount of secondary data through its deep language analysis, Powerset will be generating far more data than a typical search engine, eventually ranging up to petabytes of data.
Google uses a number of well-known components to fulfill its data processing needs: a distributed file system (GFS), Map/Reduce, and Bigtable.
Rather than creating a proprietary copy of these pieces of infrastructure, Powerset decided to turn to Hadoop, a Lucene sub-project that provides a framework for running data-intensive applications on large clusters of commodity hardware.
Powerset’s already benefited greatly from the use of Hadoop: our index build process is entirely based on a Hadoop cluster running the Hadoop Distributed File System (HDFS) and makes use of Hadoop’s map/reduce features.
Unfortunately, there was no Hadoop equivalent to Google’s Bigtable storage engine. Because we have benefited greatly from the available Hadoop technology, Powerset decided to give back to the community by developing an open source analog to Bigtable that is built on top of HDFS. After all, we needed to develop it anyway, it isn’t part of the Powerset “secret sauce”, and we, in turn, can benefit from contributions from other members of the community.
For the past 10 months, Michael Stack (known to his friends as St.Ack) and I have been working full time on developing the open source Bigtable-like storage engine that we call HBase, a sub-project of Hadoop.
Because of the progress that we and the other HBase contributors have made in the past couple of months, we’re happy to announce that the first HBase release will be available with the Hadoop 0.15.0 release. As with any project, there’s always more to do, but that’s exactly why we invite you to join the community. If you’re an interested software developer or a company that needs a platform for large scale computing, in particular if you are looking for a distributed storage system for massive amounts of structured data, try HBase. We’re happy to help you get started: all of our contact information is located conveniently within the HBase site.
We hope that you’ll both use HBase and contribute!
– Jim Kellerman, Senior Engineer, Powerset


Comments (17)
Awesome!
Posted by Vincent | October 18, 2007 02:43 PM
Blah, as interesting as a flat tire.
Instead of generating a search engine that falls far short of Google and only a fringe crowd will use, you could use your collective energies to make a more accurate eng/any latin character language translation software.
Posted by Rick | October 22, 2007 05:03 PM
If I understood right: you are going to copy google? IMHO it’s wrong way, you’ll never beat’em on their lands..
Posted by Gene | October 29, 2007 10:37 AM
Premature ejaculation, health, relationships, liver cancer, esophageal cancer, brain tumor, cervical cancer, colon cancer
Posted by srysey | November 09, 2007 09:02 AM
very very nice topic thnx (:
Posted by havalanı transfer ankara | November 14, 2007 09:17 PM
very interesting, but I don’t agree with you Idetrorce
Posted by Idetrorce | December 15, 2007 11:32 AM
Oh, and did not know about it. Thanks for the information …
Posted by Andy | December 19, 2007 05:06 PM
very nice
Posted by sohbet | December 31, 2007 06:47 PM
Your ideas are great! Thanks a lot! I am looking forward to new ideas.
Posted by Snowcore | January 11, 2008 10:08 AM
Your crawl is by far the most offensive ever to hit our site. I hope your company falls flat on its inefficient face.
Posted by Michael | February 21, 2008 02:32 AM
Preved dyatlam!
Posted by Jenna | February 25, 2008 10:03 AM
Will this be any better than Hypertable (hypertable.org) which claims it is already compatible with HadoopFS? I wonder if Hypertable is automatically faster because it is written in C++?
Maybe someone can do a comparison. And throw Amazon SimpleDB into the mix, although SimpleDB is bound to be slower, and has some policy limitations, e.g. a limit on number of attributes, etc.
Posted by Gerard Sychay | March 11, 2008 07:52 PM
very interesting, but I don’t agree with you Idetrorce.
thanx.
Posted by iso belgesi | March 16, 2008 01:27 AM
Your crawl is by far the most offensive ever to hit our site. I hope your company falls flat on its inefficient face.
Posted by teşvik belgesi | March 25, 2008 11:18 AM
If I understood right: you are going to copy google? IMHO it’s wrong way, you’ll never beat’em on their lands..
Posted by teşvik belgesi | March 25, 2008 11:22 AM
Preved dyatlam!
Posted by ISO 9001 | March 25, 2008 09:19 PM
interesting that googs and kosmix teamed up for this project.
Posted by Trade | April 04, 2008 12:55 AM