Powerset Blog
First Release of HBase in Hadoop
With all of the excitement in the past few weeks over IBM partnering with Google to create a distributed systems lab for college students based on Hadoop, and Kosmix releasing the open source Kosmos Distributed File System with Hadoop compatibility, we thought it would be a good time to talk about an area of the Hadoop project that we’ve spent a lot of time on over the past year and that we invite you to use and contribute to: HBase.
There is a lot to the "secret sauce" that is Powerset Web search: the XLE (licensed from PARC), ranking algorithms, and the ever-important onomasticon. These components are proprietary.
For any other component, we try to use open source software if it is available. One of the unsung heroes that forms the foundation for all of these components is the ability to process insane amounts of data. This is especially true for a Natural Language search engine. A typical keyword search engine will gather hundreds of terabytes of raw data to index the Web. Then, that raw data is analyzed to create a similar amount of secondary data, which is used to rank search results. Since our technology creates a massive amount of secondary data through its deep language analysis, Powerset will be generating far more data than a typical search engine, eventually ranging up to petabytes of data.
Google uses a number of well-known components to fulfill its data processing needs: a distributed file system (GFS), Map/Reduce, and Bigtable.
Rather than creating a proprietary copy of these pieces of infrastructure, Powerset decided to turn to Hadoop, a Lucene sub-project that is a framework for running data-intensive applications on large clusters of commodity hardware. Powerset has already benefited greatly from the use of Hadoop: our index build process is entirely based on a Hadoop cluster running the Hadoop Distributed File System (HDFS) and makes use of Hadoop’s map/reduce features.
Unfortunately, there was no Hadoop equivalent to Google’s Bigtable storage engine. Because we have benefited greatly by leveraging the available Hadoop technology, Powerset decided to give back to the community by developing an open source analog to Bigtable that is built on top of HDFS. After all, we need to develop it anyway, it isn’t part of the Powerset "secret sauce", and we, in turn, could benefit from contributions from other members of the community.
For the past 10 months, Michael Stack (known to his friends as St.Ack) and I have been working full time on developing the open source Bigtable-like storage engine that we call HBase, a sub-project of Hadoop. Because of the progress that we and the other HBase contributors have made in the past couple of months, we’re happy to announce that the first HBase release will be available with the Hadoop 0.15.0 release. As with any project, there’s always more to do, but that’s exactly why we invite you to join the community. If you’re an interested software developer or a company that needs a platform for large-scale computing, and in particular if you are looking for a distributed storage system for massive amounts of structured data, try HBase. We’re happy to help you get started: all of our contact information is available on the HBase site.
We hope that you’ll both use HBase and contribute! – Jim Kellerman, Senior Engineer, Powerset