Powerset

Powerset Blog

Noun-Noun Compound is Like a Chocolate Box

Posted by Mark Johnson Tue, 26 Jun 2007 02:03:00 GMT

A noun-noun compound is a noun phrase composed only of nouns, like chocolate box. And ok, the title above isn’t quite right. It should be "A Noun-Noun Compound is like a Box of Chocolates." And in fact Chocolate Box, although a common phrase (383,000 hits on Google today), is ambiguous in meaning. Is it a box made of chocolate or a box filled with chocolate? Logically, and based on life experience, either of these works. But what it clearly doesn’t mean is a box used to spread chocolate or a box used in a chocolate room.

Why would I even propose these meanings? Because these are logical meanings for some other noun compounds. It’s what is meant by butter knife – a knife used to spread butter – and kitchen knife – a knife used in the kitchen.

How do we, as language speakers, know that chocolate box usually means a box filled with chocolates, but butter knife means a knife used to spread or cut butter? We can also talk about a steel knife – a knife made of steel – which is like the box made of chocolate interpretation of chocolate box.

The answer of course is that we know what the individual words mean, and we know a lot about the world and how these words are used in life. We know that engine oil is oil used to lubricate an engine, while olive oil is oil extracted from olives. But it’s much harder for a computer program to figure out this sort of thing.

My PhD students and I have tackled the problem of how to write programs that figure out what the relationship is between two nouns in a noun compound. We’ve tackled it in two different ways. My former student Barbara Rosario created an algorithm that classifies each word in the noun-noun compound into a category, and then formulates hypotheses about the relations between the two words based on those categories. So PLANT followed-by LIQUID means that the liquid is derived from or extracted from the plant, giving us compounds like olive oil, orange juice, and almond extract. We found this method worked well in a specialized domain (biomedical text). But for general text, it can be hard to get things right using this technique. For example, MATERIAL followed-by INSTRUMENT commonly can go two different ways. A steel knife is a knife made from steel but a steel cutter is a tool used to cut steel.
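The category-based idea can be sketched in a few lines of Python. This is only a toy illustration: the `CATEGORY` lexicon and `RELATIONS` table below are hand-written assumptions built from the examples in this post, not the categories or rules used in the actual research.

```python
# Toy lexicon: noun -> semantic category (hand-written assumption).
CATEGORY = {
    "olive": "PLANT", "orange": "PLANT", "almond": "PLANT",
    "oil": "LIQUID", "juice": "LIQUID", "extract": "LIQUID",
    "steel": "MATERIAL",
    "knife": "INSTRUMENT", "cutter": "INSTRUMENT",
}

# Category pair (modifier, head) -> candidate relations (hand-written assumption).
RELATIONS = {
    ("PLANT", "LIQUID"): ["derived from"],
    ("MATERIAL", "INSTRUMENT"): ["made of", "used to cut"],
}

def candidate_relations(modifier, head):
    """Return the relations hypothesized for a noun-noun compound."""
    pair = (CATEGORY.get(modifier), CATEGORY.get(head))
    return RELATIONS.get(pair, [])

print(candidate_relations("olive", "oil"))     # one hypothesis
print(candidate_relations("steel", "knife"))   # two hypotheses
print(candidate_relations("steel", "cutter"))  # the same two: categories alone cannot decide
```

The last two calls return the same list, which is exactly the steel knife versus steel cutter problem: the category pair alone does not say which reading applies.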

So later another student of mine, Preslav Nakov, came up with a very simple, very effective solution to the problem. What he realized is that noun-noun compounds are like chocolate truffles; you have to break them apart to see what is inside. That is, we break the two nouns apart, flip their order, and then look for verbs that lie inside. So to find out what the relationship is between chocolate and box, we search for the following on a web search engine:

box that * chocolate
boxes which * chocolate
box which * chocolates

and so on. (The * is a wildcard; any word can be placed in its position.) We then parse the text summaries that the search engine returns, to find which ones have verbs in the * position, and then count up the most frequent verbs. These verbs (sometimes along with their associated prepositions) very often give a succinct and meaningful summary of the relationship between the two nouns. For example, the most frequent verbs associated with olive oil are come from, obtained from, made from, and produced from. Verbs frequently associated with chocolate box include hold and contain. From these verbs we can infer the relationship that lies hidden between the nouns.
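Here is a rough Python sketch of those steps: build the flipped wildcard queries, pull out the word in the * position from returned text, and count. Several things are assumptions made for illustration: the `SNIPPETS` list is invented stand-in text rather than real search results, no search engine is actually called, pluralization is naive, and the sketch skips the parsing step that checks whether the filler is really a verb.

```python
import re
from collections import Counter

def plural(noun):
    # Naive English pluralization; enough for the examples here.
    return noun + ("es" if noun.endswith(("s", "x", "ch", "sh")) else "s")

def build_queries(modifier, head):
    """Wildcard queries with the nouns flipped: head first, modifier last."""
    queries = []
    for h in (head, plural(head)):
        for rel in ("that", "which"):
            for m in (modifier, plural(modifier)):
                queries.append(f"{h} {rel} * {m}")
    return queries

def count_fillers(snippets, modifier, head):
    """Count the single words found in the wildcard position."""
    heads = "|".join(re.escape(w) for w in (plural(head), head))
    mods = "|".join(re.escape(w) for w in (plural(modifier), modifier))
    pattern = re.compile(
        rf"\b(?:{heads}) (?:that|which) (\w+) (?:{mods})\b", re.IGNORECASE
    )
    counts = Counter()
    for text in snippets:
        for filler in pattern.findall(text):
            counts[filler.lower()] += 1
    return counts

# Invented stand-in for search result summaries (not real data).
SNIPPETS = [
    "A gift box that contains chocolates from Belgium.",
    "We sell boxes which hold chocolate bars.",
    "This is a box that contains chocolate.",
]

print(build_queries("chocolate", "box"))
print(count_fillers(SNIPPETS, "chocolate", "box").most_common())
```

A fuller version would lemmatize the fillers (so that contains and contain are counted together), keep only the ones tagged as verbs, and allow a trailing preposition as in come from.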

Why does this matter? Say you were building a search engine that can process natural language, and you want to deal with a user query like tell me about student protests. You want to accurately expand this search to look for demonstrations that draw students, but not demonstrations that condemn students; the condemn reading is the right paraphrase for war protests, not for student protests. Smart paraphrasing may lead to smarter search results.

-Marti Hearst

"Who proved Fermat’s last theorem?"

Posted by Mark Johnson Fri, 22 Jun 2007 13:06:00 GMT


Query of the Week: Who proved Fermat’s last theorem?

Sometimes how you say something is just as important as what you say. This is true in life and in searching the Internet. For example, let’s look at the query “who proved Fermat’s last theorem.” Powerset knows you are looking for a “who” and searches for matching sentences that fill in the “who” blank. Other search engines look for specific words, but have no way of knowing about the missing information you’re looking for. In this example, Powerset produces results that highlight passages from Wikipedia about proving Fermat’s last theorem. Note that the first result answers the question outright and even identifies the “who”, while other results more loosely match to proving special cases of the theorem.

When the question is switched from “who” to “when,” Powerset retrieves the matching passages that include a date or time. Notice how the highlighting of the first result changes. This is an example of how different question words can change the shape of the results, and draw the user’s attention directly to the interesting parts.

-Scott Prevost, PhD, Director of Product

Powerset to launch front-end on Ruby

Posted by Mark Johnson Thu, 21 Jun 2007 17:28:00 GMT

Yes, it’s true. Today, Powersetter Kevin Clark posted an article on his blog, confirming that Powerset will launch its front end in Ruby. Powerset already uses Ruby for many internal tools, our blog and website look beautiful thanks to Mephisto, and top Ruby engineers seem to be spontaneously generating at our Headquarters. We’re glad to be a member of the Ruby community and look forward to giving back.

Search Engines Leaking Oil for Holes

Posted by Mark Johnson Mon, 18 Jun 2007 23:44:00 GMT

One morning back in grad school I was sitting in Cafe Milano on Bancroft Avenue in Berkeley, reading the campus newspaper, the Daily Cal. In true Berkeley form, in the 70’s the newspaper’s staff had rebelled and taken the paper off-campus, away from the control of university officials. Along with being anti-authority, the paper also didn’t seem to appreciate overzealous copy editing. While reading it that morning, I came across the following sentence:

The complex houses married and single students and their families.

Did you have trouble understanding this sentence? I must have stared at it for 5 minutes before it made sense to me. This is what linguists call a Garden Path Sentence, after the expression "leading someone down the garden path," meaning to deceive. The idea is the garden path is very pleasant, so the person being led down it is distracted and doesn’t realize that they are being deceived. (My former officemate Dan Jurafsky included this example in his textbook Speech and Language Processing; it’s now something of a classic.)

Let’s analyze why this sentence is so confusing. When you first start reading it, you see the word The which is a common way to start a sentence in English, and usually signals that the words that follow will be a noun phrase (a sequence of adjectives followed by nouns). Next you see complex, and since this word is most often used as an adjective, and since you saw the word the right before it, you’re very primed to think it’s an adjective. You then see the next word, houses, which is a perfectly fine, very common plural noun. So all is well, you’re reading a simple noun phrase, although it is a bit strange, since houses are usually described as large or charming or decrepit, but not complex.

And then the kicker – you see married which is usually a verb, but houses never get married! Now you have to back up and see what went wrong. If you stare long enough you will probably realize that this sentence is talking about housing complexes that bored students … sorry, that board students, of both the married and single variety.

You can make your own garden path sentences by following a few simple heuristics – this is how I made the title for this post. The trick is to choose words that can act as both nouns and verbs, or as both adjectives and nouns, words like store, search, and post. Then follow the ambiguous word by another word that can take on more than one form. The hard part is to then add on another noun phrase that makes sense with the less common interpretation of the second word. So in my example, Search Engines Leaking Oil for Holes, I intend you to interpret the first two words in their most common interpretation, as a plural noun-noun compound, search engines. I then take advantage of the fact that search can be both a noun and a verb, and add the verb leaking to change the meaning to be search a leaking engine. I then tack on the rest to complete the sentence. Another one I came up with this way is:

Blog Posts Digest Stories

The idea here is that posts when modifying blog has come to mean the outcome of posting something to a blog, so it’s closely related to the verb form of posts. I then tack on digest which is also both a noun and a verb representing similar concepts. The nominal form "digest" can be thought of as the outcome of someone reading a lot of articles, metaphorically "digesting" them, and producing a shortened list. So blog posts digest stories is kind of a double-entendre. I’ll close with a challenge for Powerset fans: how hard would it be to come up with an automatic garden-path sentence generator?
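As a very small first step toward that challenge, the easy part of the heuristics can be sketched in Python: find an ambiguous word followed by another ambiguous word. The `LEXICON` below is a hand-written assumption covering only words from this post, and the sketch does not attempt the hard part, which is adding a continuation that fits the less common reading.

```python
# Toy lexicon: word -> parts of speech it can take (hand-written assumption).
LEXICON = {
    "search": {"noun", "verb"},
    "engines": {"noun"},
    "blog": {"noun", "verb"},
    "posts": {"noun", "verb"},
    "digest": {"noun", "verb"},
    "complex": {"adjective", "noun"},
    "houses": {"noun", "verb"},
    "stories": {"noun"},
}

def is_ambiguous(word):
    """True if the lexicon lists more than one part of speech for the word."""
    return len(LEXICON.get(word, ())) > 1

def candidate_openings(words):
    """Ordered pairs of distinct words where both words are ambiguous."""
    return [
        (a, b)
        for a in words
        for b in words
        if a != b and is_ambiguous(a) and is_ambiguous(b)
    ]

for pair in candidate_openings(list(LEXICON)):
    print(" ".join(pair))
```

Most of the pairs it prints are nonsense; a real generator would need frequency information about which reading of each word is the common one, plus some way to judge whether the finished sentence is grammatical.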

-Marti Hearst

*Editor’s Note: Marti is a professor at Berkeley in the very cool School of Information and a consultant at Powerset. All opinions expressed are more or less hers. When she’s inspired by a cool feature of language, she’ll blog it here.*

"what did steve jobs say about the iPod?"

Posted by Mark Johnson Sat, 16 Jun 2007 01:57:00 GMT

Suppose you wanted to find every sentence in Wikipedia where Steve Jobs reportedly said something about the iPod. Pretty easy, right? You’re a pretty good searcher, so you pull up your favorite search engine and type in Steve Jobs iPod site:en.wikipedia.org.

OK. Maybe half the results are useful, but others, including the top three, are clearly off the mark. So then you try Steve Jobs said iPod site:en.wikipedia.org.

Hmm, not much better. The first result is clearly not what you wanted. The second result might be valuable (Steve Jobs said something), but you can’t really tell because of the ellipsis. Did he really say something about the iPod? Again, maybe half of the results are directly relevant, but you’ve got to click some links to tell for sure. And you’re not even sure you’ve captured all the different ways “saying” something can be expressed.

Luckily, Google has a feature that lets you search alternative sets of words. Since you’re an advanced user, you type in a more complex query: Steve Jobs said OR mentioned OR claimed iPod site:en.wikipedia.org. Same problems. Google doesn’t seem to understand that it’s important that Steve Jobs did the saying, and the thing he said something about was the iPod.

So it’s a good thing Google allows wildcard search that lets you specify the order of the words. Using this knowledge, you try yet another query: “Steve Jobs said * iPod” OR “Steve Jobs mentioned * iPod” OR “Steve Jobs claimed * iPod” site:en.wikipedia.org. That should do the trick. What? No results. This is starting to get a little frustrating. Time to return to daydreaming about that new iPhone you plan to buy.

Now suppose you had Powerset instead of Google or Yahoo. Powerset analyzes your query for its meaning, and then looks for sentences in its index that have similar meaning. You decide to give it a shot and try a query a normal person might understand: **What did Steve Jobs say about the iPod?**


Whew! That sure was easier than playing the keyword guessing game. Powerset searched all pages in Wikipedia where Steve Jobs is saying, stating, telling, mentioning, claiming, announcing, etc. something about the iPod. The trick isn’t just knowing that “mentioning” and “saying” can mean the same thing, it’s also knowing that in a given sentence, Steve Jobs is doing the saying, and the thing he’s saying something about is the iPod. This is possible because Powerset matches the structure and meaning of your query with the structure and meaning of every sentence and document in the index, and then returns those passages that truly match your intent.
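To make the difference from keyword matching concrete, here is a toy Python sketch of matching on who said what about which topic. It is not a description of how Powerset works internally: the `Fact` records, the `INDEX` contents, and the `SAY_VERBS` set are all invented for illustration, and a real system would have to derive such records by parsing sentences.

```python
from typing import NamedTuple

# Verbs treated as ways of "saying" (list taken from the examples in this post).
SAY_VERBS = {"say", "state", "tell", "mention", "claim", "announce"}

class Fact(NamedTuple):
    speaker: str
    verb: str         # verb lemma
    topic: str
    sentence_id: int  # placeholder for the sentence the fact came from

# Invented, pre-analyzed index entries (hypothetical, not real Wikipedia data).
INDEX = [
    Fact("Steve Jobs", "announce", "iPod", 1),
    Fact("Steve Jobs", "mention", "iPhone", 2),  # right speaker, wrong topic
    Fact("a reviewer", "say", "iPod", 3),        # right topic, wrong speaker
    Fact("Steve Jobs", "claim", "iPod", 4),
]

def who_said_about(speaker, topic):
    """Sentences where `speaker` is doing the saying and `topic` is what it is about."""
    return [
        f.sentence_id
        for f in INDEX
        if f.speaker == speaker and f.verb in SAY_VERBS and f.topic == topic
    ]

print(who_said_about("Steve Jobs", "iPod"))
```

A keyword search for the three words Steve Jobs, said, and iPod has no way to rule out entries 2 and 3, because it does not know who is the speaker and what is the topic.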

This is one illustration of the power of natural language search. With Powerset you often end up with the information you want. With keyword search you often end up with a new research project.

-Scott Prevost, PhD, Director of Product

"There was cat all over the driveway"

Posted by Mark Johnson Tue, 12 Jun 2007 04:15:00 GMT

As a graduate student, I took courses from UC Berkeley’s Professor Charles (Chuck) Fillmore, one of the world’s greatest linguists. Prof. Fillmore’s specialty is the interrelationship between meaning and the structure of language, and his FrameNet project is building a high-quality, immensely valuable computational linguistics resource. When he lectures, Prof. Fillmore has a very sweet, understated style, so I’m always surprised by his manner when he discusses language topics related to violent themes; the best description is: "with relish".

I still remember an example he gave in class one day when he was illustrating the difference between count nouns and mass nouns in English. Count nouns refer to nouns that are individual objects with precise boundaries, like trees, houses, cats. Mass nouns (also known as uncountable nouns) often refer to things that do not have well-defined boundaries and are not easily identified as discrete entities: "surf", "traffic", and "electricity" are examples. In English, we indicate the difference between a mass noun and a count noun by the words we put in front of them. You can’t have three traffics or four electricities. And you can’t say "Give me cup." without a preceding article (such as "a" or "the").

Some words fall into both categories, depending on how you think about them. Sand in aggregate is a mass noun, but you can look at individual grains of sand as well. It’s the usage of the word that is key. In English we have to use three words to indicate the count version of sand ("grain of sand") and just one for the mass version. Mass nouns can also refer to groups of count nouns, as in "furniture" and "poultry".

Which leads me to Prof. Fillmore’s illustrative and memorable example of mass versus count nouns. He noted the difference between saying "There are cats all over the driveway." and "There is cat all over the driveway." The second sentence suggests that something awful has happened, turning the cat from a count into a big mass. And he said it with a twinkle in his eye.

-Marti Hearst

*Editor’s Note: Marti is a professor at Berkeley in the very cool School of Information and a consultant at Powerset. All opinions expressed are more or less hers. When she’s inspired by a cool feature of language, she’ll blog it here.*

"politicians who died in office"

Posted by Mark Johnson Fri, 08 Jun 2007 21:17:00 GMT

As a teaser for Powerlabs, we’ve decided to release a “Query of the Week.” Each week, we’ll release a screenshot of Powerset search results that demonstrate aspects of our Natural Language technology. When Powerlabs is fully launched, you’ll get to play with our search engine yourself, so be sure to sign up today at http://www.powerset.com. The query of this week is…

politicians who died in office


For those who think natural language search is just about asking questions, note that this query isn’t a question or even a complete sentence. The point is that Powerset respects the natural linguistic structure of the query. The results are appropriately limited to sentences that explicitly refer to politicians dying in office, not just pages that contain the key words.

Note that Powerset also knows that “governors” are types of politicians. This is an example of how Powerset uses ontologies to identify specific examples that match more abstract concepts. In the second result, Powerset actually knows that Daniel MacDonald is a member of the Canadian Parliament, and that MPs are types of politicians.
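The idea of matching a specific term against a more abstract one can be illustrated with a tiny is-a hierarchy in Python. The `IS_A` table here is a made-up fragment for illustration only; it says nothing about the ontologies Powerset actually uses.

```python
# Toy is-a hierarchy: term -> its more general parent (hand-written assumption).
IS_A = {
    "governor": "politician",
    "member of parliament": "politician",
    "politician": "person",
}

def is_a(term, ancestor):
    """Walk up the hierarchy from `term`; True if `ancestor` is reached.

    Assumes the table contains no cycles.
    """
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False

print(is_a("governor", "politician"))              # a governor counts as a politician
print(is_a("member of parliament", "politician"))  # so does an MP
print(is_a("politician", "governor"))              # but not the other way around
```

With a lookup like this, a query about politicians can match a sentence that only mentions a governor or an MP.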

Look for another query next week!

- Scott Prevost, Director of Product

UPDATE (6/11/07) – The search results above represent an index built only with Wikipedia articles, not the entire Web. For comparison, try a site-restricted search.

Powerlabs Now Accepting Signups!

Posted by Mark Johnson Wed, 06 Jun 2007 17:40:00 GMT

Many of you have left comments on the Powerset blog like "when will u b ready?" essentially asking when users will get to see and play with Powerset. We’ve listened! While we’re not ready quite yet to show off our stuff, we are getting very close and will be launching Powerlabs soon. In Powerlabs, you’ll be able to see product demos and product ideas, learn about natural language search, and, most importantly, give your feedback on our product and contribute to next generation search. We’ll post more details shortly, but please sign up as soon as possible to make sure that you’re in the earliest wave of users we let into Powerlabs. If you’re interested, go to our homepage and sign up today for Powerlabs or sign up below.

Sign up for Powerlabs today

Cheers!
-Mark Johnson, Product Manager

Opposites Attract

Posted by Mark Johnson Mon, 04 Jun 2007 22:40:00 GMT

In the old Family Feud game show, contestants battle to fill in the blank with the most popular answers for questions like "Name a famous Julia" or "Name something you do on the beach". The skill of the game is to figure out what other people think and are likely to say; this is a hard game to do well at if you are not deeply familiar with the idiosyncrasies of the culture and the language.

Native speakers of a language have a lot of tacit knowledge about which words tend to go together and which clash. If we play a game where I say a word and you say the first word that comes to your mind, I can predict with some degree of accuracy what you’ll say. (Quick – I say "doctor" – what do you say?) This tacit knowledge is based partly on which words we’ve heard and read together in the past. Don’t believe me? Two computational linguistics researchers named Justeson and Katz did a study back in the early 90’s that showed that by counting how often pairs of words occurred close together in a stretch of text, they could predict which words would be considered the opposites – or antonyms – of a given word, and which would not.

Are antonyms obvious? What is the opposite of "light"? Most people would say "dark", but why not "dim"? What’s the opposite of "big"? Most people would say "small", but why not "little"? What is the opposite of "rough"? Why is it "smooth" rather than "even"? The answer is that we as native speakers have been exposed to these words co-occurring in sentences day after day, week after week, month after month, and year after year, till we mentally "hear" these as being the correct antonyms and don’t ever consider the alternatives.

Initially, I found it counter-intuitive that antonyms co-occur. You would think that they talk about things that are opposite one another, so their contexts shouldn’t overlap. But this is where the linguistic aspect comes in. Antonyms are often used to contrast concepts, and so often appear close together in text. Thus you might say "She’s not happy, she’s sad" or "Do you want the big one or the little one?" We even hear antonyms together in common phrases and clichés, such as "it’s like night and day".

In summary, although there are many concepts that represent opposite meanings, only certain words are conventionally used to express these opposites, thus achieving the status of antonyms. People learn which ones sound right by frequent exposure to the words within text and spoken language. If you don’t know the language well, you use the wrong adjective and you lose at Family Feud. And that’s the long and the short of it.
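The counting idea behind the antonym study can be sketched in Python. This is a simplified illustration and not the procedure from the study itself: the `ADJECTIVES` set and the `SENTENCES` list are invented, co-occurrence is counted per sentence, and there is no statistical test for whether a count is higher than chance.

```python
import re
from collections import Counter
from itertools import combinations

# Adjectives we track (hand-picked assumption).
ADJECTIVES = {"big", "small", "little", "light", "dark", "dim", "happy", "sad"}

def cooccurrence_counts(sentences):
    """Count how many sentences contain each pair of tracked adjectives."""
    counts = Counter()
    for sentence in sentences:
        found = sorted(set(re.findall(r"[a-z]+", sentence.lower())) & ADJECTIVES)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts

def rank_partners(word, counts):
    """Rank the adjectives that co-occur with `word`, most frequent first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if word == a:
            scores[b] += n
        elif word == b:
            scores[a] += n
    return scores.most_common()

# Invented example sentences (not a real corpus).
SENTENCES = [
    "She's not happy, she's sad.",
    "Big cities and small towns both have traffic.",
    "A big house with a small garden.",
    "Do you want the big one or the little one?",
    "It was dark outside but light in the kitchen.",
]

counts = cooccurrence_counts(SENTENCES)
print(rank_partners("big", counts))
print(rank_partners("light", counts))
```

On a handful of made-up sentences the ranking only reflects how the sentences were chosen; the point of the study was that the same kind of counting over a large body of real text singles out the conventional antonym.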

(Did you say "nurse" above? Of course you did!)

-Marti Hearst

*Editor’s Note: Marti is a professor at Berkeley in the very cool School of Information and a consultant at Powerset. All opinions expressed are more or less hers. When she’s inspired by a cool feature of language, she’ll blog it here.*