Powerset Blog

Noun-Noun Compound is Like a Chocolate Box

Posted by Mark Johnson Tue, 26 Jun 2007 02:03:00 GMT

A noun-noun compound is a noun phrase composed only of nouns, like chocolate box. And ok, the title above isn’t quite right. It should be "A Noun-Noun Compound is like a Box of Chocolates." And in fact chocolate box, although a common phrase (383,000 hits on Google today), is ambiguous in meaning. Is it a box made of chocolate or a box filled with chocolate? Logically, and based on life experience, either of these works. But what it clearly doesn’t mean is a box used to spread chocolate or a box used in a chocolate room.

Why would I even propose these meanings? Because these are logical meanings for some other noun compounds. It’s what is meant by butter knife – a knife used to spread butter – and kitchen knife – a knife used in the kitchen.

How do we, as language speakers, know that chocolate box usually means a box filled with chocolates, but butter knife means a knife used to spread or cut butter? We can also talk about a steel knife – a knife made of steel – which is like the box made of chocolate interpretation of chocolate box.

The answer of course is that we know what the individual words mean, and we know a lot about the world and how these words are used in life. We know that engine oil is oil used to lubricate an engine, while olive oil is oil extracted from olives. But it’s much harder for a computer program to figure out this sort of thing.

My PhD students and I have tackled the problem of how to write programs that figure out what the relationship is between two nouns in a noun compound. We’ve tackled it in two different ways. My former student Barbara Rosario created an algorithm that classifies each word in the noun-noun compound into a category, and then formulates hypotheses about the relations between the two words based on those categories. So PLANT followed-by LIQUID means that the liquid is derived from or extracted from the plant, giving us compounds like olive oil, orange juice, and almond extract. We found this method worked well in a specialized domain (biomedical text). But for general text, it can be hard to get things right using this technique. For example, MATERIAL followed-by INSTRUMENT commonly can go two different ways. A steel knife is a knife made from steel but a steel cutter is a tool used to cut steel.
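
To make the category idea concrete, here is a minimal Python sketch. It is not the actual algorithm described above: the tiny word-to-category lexicon, the rule table, and the function name candidate_relations are all made up for illustration, using only the examples from this post.

```python
# Hypothetical lexicon mapping each noun to a coarse category.
# A real system would draw categories from a much larger resource.
CATEGORY = {
    "olive": "PLANT", "orange": "PLANT", "almond": "PLANT",
    "oil": "LIQUID", "juice": "LIQUID", "extract": "LIQUID",
    "steel": "MATERIAL",
    "knife": "INSTRUMENT", "cutter": "INSTRUMENT",
}

# Hypothetical rules: (modifier category, head category) -> possible relations.
RELATIONS = {
    ("PLANT", "LIQUID"): ["derived or extracted from"],
    ("MATERIAL", "INSTRUMENT"): ["made from", "used to cut"],
}


def candidate_relations(modifier, head):
    """Return the relations hypothesized for the compound 'modifier head'."""
    pair = (CATEGORY.get(modifier), CATEGORY.get(head))
    return RELATIONS.get(pair, [])


if __name__ == "__main__":
    for modifier, head in [("olive", "oil"), ("steel", "knife"), ("steel", "cutter")]:
        print(modifier, head, "->", candidate_relations(modifier, head))
```

In this sketch olive oil gets a single hypothesis, while steel knife and steel cutter both get the same two candidates. The categories alone cannot tell them apart, which is the weakness described above.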

So later another student of mine, Preslav Nakov, came up with a very simple, very effective solution to the problem. What he realized is that noun-noun compounds are like chocolate truffles; you have to break them apart to see what is inside. That is, we break the two nouns apart, flip their order, and then look for verbs that lie inside. So to find out what the relationship is between chocolate and box, we search for the following on a web search engine:

box that * chocolate
boxes which * chocolate
box which * chocolates

and so on. (The * is a wildcard; any word can be placed in its position.) We then parse the text summaries that the search engine returns, to find which ones have verbs in the * position, and then count up the most frequent verbs. These verbs (sometimes along with their associated prepositions) very often give a succinct and meaningful summary of the relationship between the two nouns. For example, the most frequent verbs associated with olive oil are come from, obtained from, made from, and produced from. Verbs frequently associated with chocolate box include hold and contain. From these verbs we can infer the relationship that lies hidden between the nouns.
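
The steps above can be sketched roughly in Python. This is an illustration under stated assumptions, not the actual implementation: the function names are hypothetical, the pluralizer is deliberately naive, and a small hand-made dictionary of verb forms (known_verbs) stands in for the parser. The sketch does not call any search engine; it assumes you already have the result snippets as plain strings, and it only handles a single word in the * position, so verb-plus-preposition paraphrases like "come from" are out of scope.

```python
import re
from collections import Counter


def pluralize(noun):
    """Naive English plural, good enough for 'box' and 'chocolate'."""
    return noun + ("es" if noun.endswith(("s", "x", "z", "ch", "sh")) else "s")


def paraphrase_queries(modifier, head):
    """Flip the noun order and build wildcard queries such as 'box that * chocolate'."""
    queries = []
    for h in (head, pluralize(head)):
        for rel in ("that", "which"):
            for m in (modifier, pluralize(modifier)):
                queries.append(f"{h} {rel} * {m}")
    return queries


def count_verbs(snippets, modifier, head, known_verbs):
    """Count verbs found in the wildcard position of the returned snippets.

    known_verbs maps an inflected form to its base form (for example
    'holds' -> 'hold'). It is an assumption standing in for real parsing.
    """
    heads = "|".join(re.escape(w) for w in (head, pluralize(head)))
    mods = "|".join(re.escape(w) for w in (modifier, pluralize(modifier)))
    pattern = re.compile(
        rf"\b(?:{heads})\s+(?:that|which)\s+(\w+)\s+(?:{mods})\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            word = match.group(1).lower()
            if word in known_verbs:  # skip non-verbs in the * position
                counts[known_verbs[word]] += 1
    return counts
```

To try it, pass count_verbs a list of snippet strings together with the two nouns and your verb dictionary, then call most_common() on the result to see which verbs dominate.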

Why does this matter? Say you were building a search engine that can process natural language, and you want to deal with a user query like "tell me about student protests." You want to accurately expand this search to look for demonstrations that draw students, but not demonstrations that condemn students. That second reading is the right paraphrase for war protests (demonstrations that condemn war), not for student protests. Smart paraphrasing may lead to smarter search results.

-Marti Hearst