April 16, 2008

Semantic Technology Happy Hour, Round 1

Powerset, Metaweb, and Radar Networks share many similarities: we’re semantic technology companies, we’re building bleeding edge products, we employ a small army of PhDs, and our offices are within a couple blocks of each other in the SOMA district of San Francisco. With such geographic and conceptual proximity, Powerset’s co-founder and CTO Barney Pell suggested that we start referring to the neighborhood as SEMA, or the Semantic Technology Area.

For an inaugural SEMA event, Powerset decided to host the first Semantic Technology Happy Hour last night. In addition to sizable contingents from the core Trinity, this Happy Hour included guest representatives from BooRah and TrueKnowledge, who will both be participating with Powerset at next week’s Alternative Search Engine Meetup. A few members of the press also were present. In particular, Dan Farber wrote a great article on his blog about the event.


Overall, the format was casual and focused on building personal relationships among the represented companies. As you can imagine in a group with a high median IQ, conversations tended to be spectacularly geeky. TrueKnowledge gave us a peek under the hood of its beta and Powerset demonstrated our integration with Freebase. At one point, a group stood in front of the projected screen and threw out queries to see all of the different Freebase types that Powerset could handle. After being lubricated by a drink or two, inter-company gaming commenced, featuring pool, foosball, and ping pong. A few of the hardcore folks from Powerset and Metaweb ended up at the Hotel Utah for after-hours merrymaking.

In the spirit of building the semantic technology community, there will likely be a follow-up event in the next few months. If you’re a semantic technology company, especially in SEMA, contact me (mark AT powerset.com) if you’d like to be included in the next iteration of the Semantic Technology Happy Hour.



February 22, 2008

HBase User Group & Update

HBase has been gaining a lot of traction since our last post. Hadoop was promoted to a top-level Apache project and Yahoo announced that it has the world’s largest Hadoop production application. HBase is now a full sub-project of Hadoop. Also, other companies like Rapleaf are using HBase in production.

After the success of the HBase tech presentation at Rapleaf HQ, Powerset has decided to organize a second HBase User Group meeting. If you’re using HBase now or evaluating HBase as a data store, join us to network with other database geeks and give suggestions and feedback to the HBase core development team. The User Group will meet on Tuesday, March 4, from 5:00-7:00 p.m. at the Powerset HQ. Just register on the event page at Upcoming so we can plan how much food and booze to stock.

If you’re interested in HBase or Hadoop and can’t make it to the User Group meeting, there’s a Hadoop Summit at Yahoo on March 25. Powerset will be doing a presentation on HBase and Rapleaf will be showing off their application.

October 16, 2007

First Release of HBase in Hadoop

With all of the excitement in the past few weeks with IBM partnering with Google to create a distributed systems lab for college students based on Hadoop and Kosmix releasing the open source Kosmos Distributed File System with Hadoop compatibility, we thought it would be a good time to talk about an area of the Hadoop project that we’ve spent a lot of time on over the past year and want to invite you to use and contribute to: HBase.

There is a lot to the “secret sauce” that is Powerset Web search, like the XLE, licensed from PARC, ranking algorithms, and the ever-important onomasticon. These components are consequently proprietary.

For any other component, we try to use open source software if it is available. One of the unsung heroes that forms the foundation for all of these components is the ability to process insane amounts of data. This is especially true for a Natural Language search engine. A typical keyword search engine will gather hundreds of terabytes of raw data to index the Web. Then, that raw data is analyzed to create a similar amount of secondary data, which is used to rank search results. Since our technology creates a massive amount of secondary data through its deep language analysis, Powerset will be generating far more data than a typical search engine, eventually ranging up to petabytes of data.

Google uses a number of well-known components to fulfill their data processing needs: a distributed file system (GFS), Map/Reduce, and Bigtable.

Instead of creating a proprietary copy of these pieces of infrastructure, Powerset decided to turn to Hadoop, a Lucene sub-project that is a framework for running data-intensive applications on large clusters of commodity hardware.

Powerset’s already benefited greatly from the use of Hadoop: our index build process is entirely based on a Hadoop cluster running the Hadoop Distributed File System (HDFS) and makes use of Hadoop’s map/reduce features.

Unfortunately, there was no Hadoop equivalent to Google’s Bigtable storage engine. Because we have benefited greatly by leveraging the available Hadoop technology, Powerset decided to give back to the community by developing an open source analog to Bigtable that is built on top of HDFS. After all, we needed to develop it anyway, it wasn’t part of the Powerset “secret sauce”, and we, in turn, could benefit from contributions from other members of the community.

For the past 10 months, Michael Stack (known to his friends as St.Ack) and I have been working full time on developing the open source Bigtable-like storage engine that we call HBase, a sub-project of Hadoop.

Because of the progress that we and the other HBase contributors have made in the past couple of months, we’re happy to announce that the first HBase release will be available with the Hadoop 0.15.0 release. As with any project, there’s always more to do, but that’s exactly why we invite you to join the community. If you’re an interested software developer or a company that needs a platform for large scale computing, in particular if you are looking for a distributed storage system for massive amounts of structured data, try HBase. We’re happy to help you get started: all of our contact information is located conveniently within the HBase site.

We hope that you’ll both use HBase and contribute!

Jim Kellerman, Senior Engineer, Powerset

September 17, 2007

Powerset launches Powerset Labs at TechCrunch40

Powerset is undertaking a huge task: building a natural language search engine that reads and understands every sentence on the Web. The good news is that thanks to the technology that we’ve licensed from PARC combined with homegrown technology developed by our strong team of linguists and computer scientists, we’re well on the path to achieving this goal.

We realize that most companies wait to launch until they have a completely usable beta version. Because Powerset is a natural language search engine, the earlier we have input from the best natural language processing units on the planet – the brains of humans – the quicker our search engine will improve. Through a combination of quantitative feedback, qualitative suggestions and AI learning techniques, Powerset will get much smarter when people are interacting with it.

That’s the reason we’ve decided to create Powerset Labs, which we are opening up to the first group of users today at TechCrunch40. Powerset Labs is a community where users can provide feedback on our product design and natural language engine. While users will not be able to interact with the Powerset open search box across the Web, we are giving users a peek at technology demonstrations that show off some of Powerset’s natural language processing capabilities.

Though the content of Powerset Labs will change based on user feedback, we wanted to share with you what we demoed today at TC40: Powermouse and Use Cases.

Powermouse is a window into Powerset’s natural language index. When Powerset reads sentences in Wikipedia, we go from open text to representations of meaning. In other words, we take text and turn it into structured “facts.” When users enter a query into Powermouse, they’ll be able to browse the “facts” stored in our index. In the example below, when wrestling star “Hulk Hogan” is entered in the first “something” box, users can see all of the facts we’ve indexed about him. Now, if you add “defeat” into the connection box, users see all of the facts that we’ve indexed from Wikipedia about wrestlers that Hogan has defeated. Here’s a Hulk Hogan screenshot. In addition to showing off the power of our index, Powermouse also shows a different type of interface that’s possible with a natural language index.

Use Cases demonstrates how a natural language query can exploit Powerset’s index. Unlike Powermouse, Use Cases lets users express their intent in natural language. We’ve picked about a dozen use cases that illustrate how a natural language index can return results that are qualitatively different from keyword results. For example, here’s a query of “Who mocked Blair?” that shows how Powerset understands all of the various ways “mock” can be expressed in English. After a query, we encourage users to tell Powerset which of our results are good and which are not. We also ask that users vote on which results are better: the Powerset results or the keyword results on the right. All of this user feedback is what will help make Powerset a better search engine.

Once users have tried out the applications in Powerset Labs, we invite them to submit ideas about how to make them better. Within Powerset Labs, users can browse through ideas, vote on the best ideas and comment on ideas. As users participate in Powerlabs, they’ll get karma points for everything they do. Eventually, users with the most karma will get perks within the community. The bottom line: We’re listening and we’ll try to implement your brilliance as soon as we possibly can.

As a note, Powerset has received a lot of attention over the past few months and we’ve been overwhelmed with the number of people who have signed up for Powerset Labs. Instead of letting everyone in at once, we’ve decided to let people into Powerset Labs in the order they signed up. We want to make sure that each group of Powerset Labs users gets a great experience, so we’re going to grow the community slowly and carefully. If you’d like to sign up, go to labs.powerset.com and we’ll be letting in the next wave of users as soon as possible.

We’re really excited to see the Powerset Labs community grow, to gather and implement your feedback, to share with you more and more technology demonstrations and prototypes, and ultimately to deliver a transformative search engine to the world.

Oh, and we have our first official social media press release about our launch at TechCrunch40 if you are press and need contact information, quotes or screenshots.

August 31, 2007

Parsing Miss South Carolina's Statement

It’s not like it’s easy to parse Wikipedia, but at least most of its text is (usually) written with correct spelling, capitalized proper names, meaningful paragraph structure and so on. A natural question is: how will our system perform on the rest of the Web with all of its slang, non-standard syntax, and so on? To put Powerset to the test, two of our engineers, Lukas Biewald and Brendan O'Connor, ran our entire parsing and indexing system on the hardest corpus we could find: Miss South Carolina's response to the question, "Recent polls have shown that a fifth of Americans can't locate the US on a map. Why do you think this is?"

They fed this transcription into the XLE verbatim, disfluencies and all:

I personally believe that U.S. Americans are unable to do so because uh some uh people out there in our nation don't have maps and uh I believe that our ed- education like such as in South Africa and uh the- the Iraq everywhere like such as and I believe that they should uh our education over here in the U.S. should help the U.S. or- or- should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future

One might think that such a convoluted mess of words (I hesitate to call it "English") would be impossible to parse, but here are the c-structures that our parser generates:

Unsurprisingly, the sentence is fragmented quite a bit, but the parser clearly managed to extract useful structure throughout the sentence. The last large verb phrase “should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future” seems very close to correct, which is pretty impressive (Language Log has more discussion on the weird “the Iraq” construction). The output of the semantics system is too long to put here, but in some ways, it’s amazing that we were able to extract any semantics at all.

And, believe it or not, we can actually run some queries against the Carolina Index (as it's known at Powerset). It's hard to think of a reasonable question, but we asked, "Who does education help?" and returned and highlighted the right answer: "Americans". Or should we have returned "U.S. Americans"?


August 30, 2007

Barney to speak at the Singularity Summit

The Singularity Institute is hosting the Singularity Summit in San Francisco on September 8-9. Dr. Barney Pell, Powerset’s own CEO, will be presenting in the first session called “What are the Pathways and Major Challenges.” In his talk, Barney will predict that the path to Artificial General Intelligence (AGI) will be based on a rich interplay in which top-down engineering and bottom-up brain simulation approaches meet in the middle. He’ll also talk about the role of economics in accelerating the development of AI systems. Powerset is actually an example of this: once natural language becomes a part of core search, companies will invest large funds in natural language technologies, dwarfing historical investments to date. If you want to read more, you can get your tickets today (only $50!) or read Barney’s more detailed blog post or even listen to Barney’s interview with Dan Farber. We hope to see you there!

July 26, 2007

The Tyranny of the Common Name

An acquaintance of mine asked me to email him something, so I asked him for his email address. He gave me his business card. It had only his first and last name printed on it, nothing else.

I said, “Uh…wait a sec…so what’s your email address?”

He said, “Just put my name into any reasonable search engine and my homepage will pop right up.”

I was immediately a little bit jealous – I’d always wanted to be able to use my name alone as a ‘personal URI’, but either I’m not famous enough or my name isn’t unique enough. Or some combination of the two.

If I put my first and last name into any reasonable search engine, I do not pop right up. In fact, the most famous person with my name is an English darts player, who became even more famous for the first televised nine-dart finish.

I have on occasion told people they could find me (by which of course, I meant ‘find my little self-created electronic shrine on the web’) by putting my full name – first, middle, last – into a reasonable search engine and then my homepage (more or less) pops right up. And of course not everyone who might want to find me knows those details or realizes that they form a useful search key for finding me on the web.

Now, there’s a big business in trying to stand out in the results from search engines and I’m not even going to go there in this essay. “Optimizing” (some would say “gaming”) search engine results is in many ways a recapitulation of the process of trying to be first (or sometimes last) in lexicographic sort order in the yellow pages: AAAAAAA Plumbers has a better chance of getting called by a desperate homeowner with a leaking toilet than the next name on the list. Ever since I was a child I’ve thought that people with names closer to the front of the alphabet have a bit of an advantage over those of us whose names put us in the back of the pack. But perhaps this advantage is cancelled out by those occasions when being first is the last thing one would want.

I think the little accidents (and sometimes choices) of personal and cultural history that automatically improve the visibility of certain individuals and bury others are a never-ending source of wonder.

Conventions for personal names are really quite complicated, a topic of long and deep study called onomastics. If you go looking for this in Wikipedia you’ll be disappointed – it’s only a placeholder for this topic. It’s a hypernym of what you really want: anthroponymy (scans like anthropology).

The patronymic conventions which are frequently encountered in the modern descendants of Indo-European cultures are in fact rather recent (Johnson was at one time the son of someone named John, and Berger was something like a townsperson), but when considered from a more global perspective things get really interesting. It’s been estimated that there are 296 million Lis in the world (nearly the population of the US), this being the most common surname in China. (The notion of a surname is really not quite right here, since in that part of the world the family name is usually cited first, but that’s how they’re called in my source. The same is true in Japan, Korea and elsewhere.) In Burma, it was pretty common to name your kid after the day of the week he or she was born on, so there’s a dearth of first names there. In places where there are not a lot of people it’s not so hard to identify someone and so a single name does just fine; just don’t try to google someone from the Maldives and expect their email address to pop right up.

I really feel sorry for people who have names that are already in use as common nouns (or adjectives) – the Browns, Greens, Bushes, and Stones of the world – they’re even further down the search engine result list. Unless, again, you’re already famous, like Robin Hood. Oh wait, we now need to distinguish him from the rest of the robin hoods of the world out there giving to the poor after stealing from the rich. Oh, the irony of having one’s personal name become a household name! Or worse, a verb (e.g. to get borked or meired).

Perhaps someday conventions will arise to allow us commoners to distinguish ourselves from the hordes of others clamoring to use our names. I know someone who signs his emails ‘Jeff (meaning “an instance of Jeff” in Lisp). Or perhaps we could start using our email addresses or screen names – it would certainly help in disambiguating names in scientific papers. Anyone for an ISO standard for adding diacritics, indexicals, or skolems to names to distinguish one from another?

I’d love to hear from you if you are interested in this topic. Just put my full name, all three of them, into any reasonable search engine and you’ll find me.

-John (Brandon) Lowe, Senior Scientist

July 17, 2007

Powerset to Demo at SF Beta

If you can’t wait for the release of Powerlabs in September and want to get a sneak peek at Powerset’s natural language search platform, Powerset will be doing a demo at SF Beta on July 24. SF Beta is held at the very chic 111 Minna Gallery, which has cool art on the walls, a full bar, and plenty of space for mingling and greeting. Powerset’s Director of Product, Dr. Scott Prevost, will be on hand to give a demo. Tickets are sold at a discount before the event, so save a few dollars for another drink and get your tickets today. We hope to see you there!

July 16, 2007

Implicature

Linguists use odd words. Some of them sound very strange, but actually refer to something fairly ordinary, when you think about it. They’re ideas where “there ought to be a word” – so a word gets invented.

I recently learned one of these words which I’ve found to be very useful. It’s influenced my “mental language” for understanding all sorts of things related to communication. This word is “implicature”.

“Implicature” was coined by a guy named Paul Grice to help describe situations in which what a speaker means is not the same as what she actually says. It happens in all sorts of ways. For example, suppose someone says “I went to the grocery store and saw my grandmother.” Most people would assume that the speaker was stating that he saw his grandmother at the grocery store, but of course that’s not what was said. There is an implicature that the grandmother was seen at the grocery store. It can be more subtle, however. For example, an indirect answer counts: “Are you going to the party?” “I have to go to a wedding.” The speaker didn’t say that she’s not going to the party.

The above examples are fairly straightforward. But implicatures can have teeth. “Some power companies are not environmentally insensitive.” This sentence has an implicature which says that most of them are insensitive. Politicians are of course masters of this kind of implicature.

In fact, the concept is so broad that it can be seen almost everywhere, especially in conversation. Sarcasm and irony are kinds of implicatures.

I like the word “implicature” because it’s a pointer to the fact that there are always unspoken assumptions. Communication requires a context. This simple idea is behind one of the most complex (or complexly argued) terms in modern philosophy, literary criticism, and historical analysis — deconstruction.

“Sound and fury, signifying nothing”… maybe, but that “signifying” itself has a lot of life in it. Signify something today!

-Doug Cutrell, Powerset Engineer

July 13, 2007

Powerset to Host a Lunch 2.0

Powerset is hosting a Lunch 2.0 on Thursday, August 9 from 11:30 a.m. to 1:30 p.m. at our SoMa offices here in sunny San Francisco (really, SoMa often has great weather during the day). You’ll get to meet other San Francisco hipster-geeks, eat excellent BBQ, enjoy a beautiful summer afternoon, and see a demo of our next-generation search product. If you’d like to come, just sign up on the invitation page at Upcoming.org. We hope to see you there!

June 26, 2007

Noun-Noun Compound is Like a Chocolate Box

A noun-noun compound is a noun phrase composed only of nouns, like chocolate box. And ok, the title above isn’t quite right. It should be “A Noun-Noun Compound is like a Box of Chocolates.” And in fact Chocolate Box, although a common phrase (383,000 hits on Google today), is ambiguous in meaning. Is it a box made of chocolate or a box filled with chocolate? Logically, and based on life experience, either of these works. But what it clearly doesn’t mean is a box used to spread chocolate or a box used in a chocolate room.

Why would I even propose these meanings? Because these are logical meanings for some other noun compounds. It’s what is meant by butter knife – a knife used to spread butter – and kitchen knife – a knife used in the kitchen.

How do we, as language speakers, know that chocolate box usually means a box filled with chocolates, but butter knife means a knife used to spread or cut butter? We can also talk about a steel knife – a knife made of steel – which is like the box made of chocolate interpretation of chocolate box.

The answer of course is that we know what the individual words mean, and we know a lot about the world and how these words are used in life. We know that engine oil is oil used to lubricate an engine, while olive oil is oil extracted from olives. But it’s much harder for a computer program to figure out this sort of thing.

My PhD students and I have tackled the problem of how to write programs that figure out what the relationship is between two nouns in a noun compound. We’ve tackled it in two different ways. My former student Barbara Rosario created an algorithm that classifies each word in the noun-noun compound into a category, and then formulates hypotheses about the relations between the two words based on those categories. So PLANT followed-by LIQUID means that the liquid is derived from or extracted from the plant, giving us compounds like olive oil, orange juice, and almond extract. We found this method worked well in a specialized domain (biomedical text). But for general text, it can be hard to get things right using this technique. For example, MATERIAL followed-by INSTRUMENT commonly can go two different ways. A steel knife is a knife made from steel but a steel cutter is a tool used to cut steel.

So later another student of mine, Preslav Nakov, came up with a very simple, very effective solution to the problem. What he realized is that noun-noun compounds are like chocolate truffles; you have to break them apart to see what is inside.

That is, we break the two nouns apart, flip their order, and then look for verbs that lie inside. So to find out what the relationship is between chocolate and box, we search for the following on a web search engine:

box that * chocolate
boxes which * chocolate
box which * chocolates
and so on. (The * is a wildcard; any word can be placed in its position.) We then parse the text summaries that the search engine returns, to find which ones have verbs in the * position, and then count up the most frequent verbs. These verbs (sometimes along with their associated prepositions) very often give a succinct and meaningful summary of the relationship between the two nouns. For example, the most frequent verbs associated with olive oil are come from, obtained from, made from, and produced from. Verbs frequently associated with chocolate box include hold and contain. From these verbs we can infer the relationship that lies hidden between the nouns.
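The flip-and-count idea can be illustrated with a minimal Python sketch. It is not the published algorithm: the function names, the hand-written snippets, and the explicit singular/plural word forms are all assumptions made for illustration. It also skips the parsing step, so it counts any single word found in the * position rather than only verbs, and it ignores multi-word fillers such as "come from".

```python
import re
from collections import Counter


def paraphrase_queries(modifier_forms, head_forms, relativizers=("that", "which")):
    """Build wildcard queries with the two nouns flipped, e.g. 'box that * chocolate'."""
    return [
        f"{head} {rel} * {modifier}"
        for head in head_forms
        for rel in relativizers
        for modifier in modifier_forms
    ]


def count_fillers(modifier_forms, head_forms, snippets, relativizers=("that", "which")):
    """Count the single words that fill the * slot in the given text snippets."""
    heads = "|".join(re.escape(h) for h in head_forms)
    rels = "|".join(re.escape(r) for r in relativizers)
    modifiers = "|".join(re.escape(m) for m in modifier_forms)
    pattern = re.compile(
        rf"\b(?:{heads})\s+(?:{rels})\s+(\w+)\s+(?:{modifiers})\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


# Hand-written stand-ins for the text summaries a search engine would return.
snippets = [
    "A gift box that contains chocolates.",
    "Boxes which hold chocolate are kept cool.",
    "She bought a box that contains chocolate.",
]

queries = paraphrase_queries(("chocolate", "chocolates"), ("box", "boxes"))
fillers = count_fillers(("chocolate", "chocolates"), ("box", "boxes"), snippets)
print(queries[0])
print(fillers.most_common())
```

A fuller version would run a part-of-speech tagger or parser over the snippets and keep only the fillers that are verbs, optionally together with a following preposition.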

Why does this matter? Say you were building a search engine that can process natural language, and you want to deal with a user query like tell me about student protests. You want to accurately expand this search to look for demonstrations that draw students but not demonstrations that condemn students, which would be a paraphrase for war protests. Smart paraphrasing may lead to smarter search results.

-Marti Hearst

June 22, 2007

"Who proved Fermat’s last theorem?"

Query of the Week: Who proved Fermat’s last theorem?


Sometimes how you say something is just as important as what you say. This is true in life and in searching the Internet. For example, let’s look at the query “who proved Fermat’s last theorem.” Powerset knows you are looking for a “who” and searches for matching sentences that fill in the “who” blank. Other search engines look for specific words, but have no way of knowing about the missing information you’re looking for.

In this example, Powerset produces results that highlight passages from Wikipedia about proving Fermat’s last theorem. Note that the first result answers the question outright and even identifies the “who”, while other results more loosely match to proving special cases of the theorem.


When the question is switched from “who” to “when,” Powerset retrieves the matching passages that include a date or time. Notice how the highlighting of the first result changes. This is an example of how different question words can change the shape of the results, and draw the user’s attention directly to the interesting parts.

-Scott Prevost, PhD, Director of Product

June 21, 2007

Powerset to launch front-end on Ruby

Yes, it’s true. Today, Powersetter Kevin Clark posted an article on his blog, confirming that Powerset will launch its front end in Ruby.

Powerset already uses Ruby for many internal tools, our blog and website look beautiful thanks to Mephisto, and top Ruby engineers seem to be spontaneously generating at our Headquarters. We’re glad to be a member of the Ruby community and look forward to giving back.

June 18, 2007

Search Engines Leaking Oil for Holes

One morning back in grad school I was sitting in Cafe Milano on Bancroft Avenue in Berkeley, reading the campus newspaper, the Daily Cal. In true Berkeley form, in the ’70s the newspaper’s staff had rebelled and taken the paper off-campus, away from the control of university officials. Along with being anti-authority, the paper also didn’t seem to appreciate overzealous copy editing. While reading it that morning, I came across the following sentence:

The complex houses married and single students and their families.

Did you have trouble understanding this sentence? I must have stared at it for 5 minutes before it made sense to me. This is what linguists call a Garden Path Sentence, after the expression “leading someone down the garden path,” meaning to deceive. The idea is the garden path is very pleasant, so the person being led down it is distracted and doesn’t realize that they are being deceived. (My former officemate Dan Jurafsky included this example in his textbook Speech and Language Processing; it’s now something of a classic.)

Let’s analyze why this sentence is so confusing.

When you first start reading it, you see the word The which is a common way to start a sentence in English, and usually signals that the words that follow will be a noun phrase (a sequence of adjectives followed by nouns). Next you see complex, and since this word is most often used as an adjective, and since you saw the word the right before it, you’re very primed to think it’s an adjective. You then see the next word, houses, which is a perfectly fine, very common plural noun. So all is well, you’re reading a simple noun phrase, although it is a bit strange, since houses are usually described as large or charming or decrepit, but not complex. And then the kicker – you see married which is usually a verb, but houses never get married! Now you have to back up and see what went wrong. If you stare long enough you will probably realize that this sentence is talking about housing complexes that bored students … sorry, that board students, of both the married and single variety.

You can make your own garden path sentences by following a few simple heuristics – this is how I made the title for this post. The trick is to choose words that can act as both nouns and verbs, or as both adjectives and nouns, words like store, search, and post. Then follow the ambiguous word by another word that can take on more than one form. The hard part is to then add on another noun phrase that makes sense with the less common interpretation of the second word.

So in my example, Search Engines Leaking Oil for Holes, I intend you to interpret the first two words in their most common interpretation, as a plural noun-noun compound, search engines. I then take advantage of the fact that search can be both a noun and a verb, and add the verb leaking to change the meaning to be search a leaking engine. I then tack on the rest to complete the sentence.

Another one I came up with this way is:

Blog Posts Digest Stories

The idea here is that posts when modifying blog has come to mean the outcome of posting something to a blog, so it’s closely related to the verb form of posts. I then tack on digest which is also both a noun and a verb representing similar concepts. The nominal form “digest” can be thought of as the outcome of someone reading a lot of articles, metaphorically “digesting” them, and producing a shortened list. So blog posts digest stories is kind of a double-entendre.
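The first two heuristics, picking a word with more than one part of speech and following it with another ambiguous word, can be sketched in a few lines of Python. This is only a candidate generator under stated assumptions: the tiny lexicon and its part-of-speech labels are hand-made for illustration, the function names are hypothetical, and the "hard part" of adding a continuation that forces the less common reading is left to a human.

```python
from itertools import permutations

# Toy lexicon mapping each word to the parts of speech it can take.
# The entries are hand-picked for illustration, not drawn from a real dictionary.
LEXICON = {
    "search": {"noun", "verb"},
    "post": {"noun", "verb"},
    "store": {"noun", "verb"},
    "complex": {"adjective", "noun"},
    "houses": {"noun", "verb"},
    "students": {"noun"},
    "oil": {"noun"},
}


def is_ambiguous(word):
    """A word is a candidate if it can act as more than one part of speech."""
    return len(LEXICON.get(word, ())) > 1


def garden_path_openers():
    """Pair every ambiguous word with every other ambiguous word, in both orders."""
    ambiguous = sorted(word for word in LEXICON if is_ambiguous(word))
    return [f"{first} {second}" for first, second in permutations(ambiguous, 2)]


openers = garden_path_openers()
print(len(openers))
print("complex houses" in openers)
```

Most of the generated pairs will not lead anywhere, but "complex houses" from the Daily Cal sentence is among them, which suggests that ranking the candidates is where the real work lies.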

I’ll close with a challenge for Powerset fans: how hard would it be to come up with an automatic garden-path sentence generator?

-Marti Hearst

Editor’s Note: Marti is a professor at Berkeley in the very cool School of Information and a consultant at Powerset. All opinions expressed are more or less hers. When she’s inspired by a cool feature of language, she’ll blog it here.

June 16, 2007

"what did steve jobs say about the iPod?"

Suppose you wanted to find every sentence in Wikipedia where Steve Jobs reportedly said something about the iPod. Pretty easy, right? You’re a pretty good searcher, so you pull up your favorite search engine and type in Steve Jobs iPod site:en.wikipedia.org.


OK. Maybe half the results are useful, but others, including the top three, are clearly off the mark. So then you try Steve Jobs said iPod site:en.wikipedia.org.


Hmm, not much better. The first result is clearly not what you wanted. The second result might be valuable (Steve Jobs said something), but you can’t really tell because of the ellipsis. Did he really say something about the iPod? Again, maybe half of the results are directly relevant, but you’ve got to click some links to tell for sure. And you’re not even sure you’ve captured all the different ways “saying” something can be expressed.

Luckily, Google has a feature that lets you search alternative sets of words. Since you’re an advanced user, you type in a more complex query: Steve Jobs said OR mentioned OR claimed iPod site:en.wikipedia.org. Same problems. Google doesn’t seem to understand that it’s important that Steve Jobs did the saying, and the thing he said something about was the iPod. So it’s a good thing Google allows wildcard search that lets you specify the order of the words. Using this knowledge, you try yet another query: “Steve Jobs said * iPod” OR “Steve Jobs mentioned * iPod” OR “Steve Jobs claimed * iPod” site:en.wikipedia.org. That should do the trick.

What? No results. This is starting to get a little frustrating. Time to return to daydreaming about that new iPhone you plan to buy.

Now suppose you had Powerset instead of Google or Yahoo. Powerset analyzes your query for its meaning, and then looks for sentences in its index that have similar meaning. You decide to give it a shot and try a query a normal person might understand:

What did Steve Jobs say about the iPod?


Whew! That sure was easier than playing the keyword guessing game. Powerset searched all pages in Wikipedia where Steve Jobs is saying, stating, telling, mentioning, claiming, announcing, etc. something about the iPod. The trick isn’t just knowing that “mentioning” and “saying” can mean the same thing, it’s also knowing that in a given sentence, Steve Jobs is doing the saying, and the thing he’s saying something about is the iPod. This is possible because Powerset matches the structure and meaning of your query with the structure and meaning of every sentence and document in the index, and then returns those passages that truly match your intent.

This is one illustration of the power of natural language search. With Powerset you often end up with the information you want. With keyword search you often end up with a new research project.

-Scott Prevost, PhD, Director of Product


© 2007 - Powerset Inc. - All rights reserved.