
Thursday, September 17, 2009

Revisiting Optimal Number of Keywords


When building a query, how many keywords is enough and what number is too many?

There really are no absolutes here because so many variables are involved, but there are some guiding principles which I've found consistently helpful.

The first item on the Query Checklist remains highly relevant: 'How many key concepts are contained in the question?' If you are merely interested in mp3 players, there's one concept contained in two words. On the other hand, if you want to know 'How many buffalo are there today in North America?', then you have four key concepts with which to contend:
what - many (number), buffalo
where - North America
when - today
Generally, the more defined the objective, the more concepts there are. Searching for just one concept, or for more than three, can be problematic because of literal matching.

Literal matching: Too few terms
One-word queries are often ineffective because they match so much information that is irrelevant. The reason results are irrelevant is that the search wasn't defined sufficiently to begin with. I may want to find information on buffalo, but if I search only using that word, I will have to browse through a lot of information I may not care about. Interestingly, about half the college-aged subjects taking College Board's ICT test a couple years ago used one-word queries (citation needed--anyone up for the challenge?). One-word or single concept queries are probably good enough if you want to do a broad scan of the information landscape pertaining to a product, a person, an idea, etc., but they tend to cast a very wide net and consequently slow you down.

Part of the problem with a word like buffalo is that it has more than one meaning. You only had one meaning in mind, but the search engine doesn't know that because it looks for literal matches. This is where I usually introduce the 1 in 5 rule (although it's more of a phenomenon of language than it is a rule). On average, there are five terms that may be used for the concept you have in mind. You say buffalo, others say bison. There are probably only a couple more (ungulate anyone?), but in some cases there could be many more alternate terms (this happens especially with verbs).
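
To make literal matching concrete, here is a minimal sketch in Python. The four "documents" and the helper names (literal_match, search) are made up for illustration, and real search engines do far more than this; the point is only that a one-word query pulls in every sense of the word and still misses a page that uses an alternate term.

```python
# Toy illustration of literal matching. The documents below are invented
# for this sketch; they are not real search results.
documents = [
    "Best buffalo wings recipes",
    "Buffalo Bills season schedule",
    "American bison population estimates for North America",
    "Buffalo herd counts in Yellowstone",
]

def literal_match(query, text):
    """Return True only if every query word appears verbatim in the text."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in query.split())

def search(query):
    """Return the documents that literally contain every query word."""
    return [doc for doc in documents if literal_match(query, doc)]

one_word = search("buffalo")   # matches wings, football, and the herd page
alternate = search("bison")    # matches only the page that says "bison"

print(one_word)
print(alternate)
```

In this toy set, the one-word query returns three pages, two of them off-topic, and never sees the population page because its author wrote bison.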

Literal matching: Too many terms
Trying to match all the same words used by an author becomes increasingly difficult the more words you use. The beauty of search engines is that when you use words in a meaningful context, they tend to retrieve the meaning you have in mind. That's why a search for many buffalo North America doesn't yield information about buffalo wings or the Buffalo Bills football team. But those may not be the exact words an expert used when writing on the subject I'm researching. He or she may have used population instead of many, bison instead of buffalo. Proper nouns such as North America are more likely to be matched. The more terms you use, the less likely it becomes that you will find an exact match.
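
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It rests on two simplifying assumptions of mine: every concept has five equally likely terms (the 1 in 5 rule taken literally), and an author picks the term for each concept independently of the others. Neither holds exactly, and proper nouns will beat these odds, so treat the output as an illustration of the trend and not as a measurement. The function name exact_match_probability is hypothetical.

```python
from fractions import Fraction

def exact_match_probability(concepts, terms_per_concept=5):
    """Chance that every query term matches the author's wording.

    Assumes each concept has `terms_per_concept` equally likely terms
    and that the choices are independent (simplifying assumptions).
    """
    return Fraction(1, terms_per_concept) ** concepts

# Print the odds for queries of one to five concepts.
for n in range(1, 6):
    p = exact_match_probability(n)
    print(f"{n} concept(s): 1 in {p.denominator}")
```

Under these assumptions the odds fall from 1 in 5 for a single concept to 1 in 125 for three and 1 in 625 for four, which is the sense in which each added term works against you.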

In my career, I've had a lot of success searching with two or three concepts. It requires keeping important concepts in mind that aren't really needed in the query--such as today in the buffalo example. Scanning the results, I look for current data, not information from the 1800s.

Sure, you can use queries containing more than three concepts, but unless you have a good idea what words an expert used, you're pushing your luck. Probability is against you. You're better off keeping your query simple.