Tuesday, September 15, 2009

Revisiting Word Order

When does the order of keywords matter?

The ninth item of the query checklist was always last because keyword order mattered the least. This remains largely the case.

Take a query I used today while doing some IMSA program planning: business ethics simulation. There are five other ways to order the terms. But does it make any difference?
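For anyone who wants to repeat the experiment with a different query, here is a minimal Python sketch that lists every ordering of a set of keywords. The function name `keyword_orderings` is just an illustrative choice, and the sketch assumes the keywords are separated by spaces and are all different.

```python
from itertools import permutations

def keyword_orderings(query: str) -> list[str]:
    """Return every ordering of the space-separated keywords in `query`."""
    words = query.split()
    # permutations() yields each possible arrangement of the words once
    return [" ".join(p) for p in permutations(words)]

orderings = keyword_orderings("business ethics simulation")
for q in orderings:
    print(q)
```

Three keywords give six orderings: the original plus the five alternatives. Four keywords would give 24, which is one reason testing every permutation gets impractical quickly.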

I analyzed the top ten results in Google, Bing and Yahoo for each ordering. Here's how many different results each engine returned across all six orderings (a total of 60 different results per engine is theoretically possible: six orderings times ten results):
14 - Google
15 - Bing
15 - Yahoo

A few other insights are worth mentioning:

Google returned the identical top result no matter the keyword order. The second and third slots were filled consistently by the same two pages with minimal alternation. In all, six returns were common across all possible keyword combinations. Queries that returned the most diverse results were: business ethics simulation, ethics simulation business and ethics business simulation. I'm not sure what to make of this observation, but I thought I'd mention it nonetheless. Any ideas?

Compared to Google, Bing was more varied in its ranking of results. No page was consistently the top result, although five pages appeared in the top ten on all trials. While Bing produced one more distinct page than Google, several of its pages were from the same site. Of greater interest, Bing and Google each returned a number of pages not replicated by the other (see below).

Yahoo, like Google, consistently returned the identical top page no matter what the query order. The second return was also identical across all queries, although this page was related to the first, so not entirely a unique return. Again, five of the same results were found with every query. Yahoo did not return Google's top return at all, but both Google and Bing included Yahoo's top result.
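The two per-engine tallies discussed above (how many different pages turned up across all orderings, and how many pages appeared for every ordering) amount to a set union and a set intersection. Here is a rough Python sketch, assuming you have recorded the top-ten URLs for each ordering by hand. The function name `tally` and the sample URLs are placeholders, not the actual results from this test.

```python
def tally(results_by_query: dict[str, list[str]]) -> tuple[set[str], set[str]]:
    """Return (pages seen for any ordering, pages seen for every ordering)."""
    result_sets = [set(urls) for urls in results_by_query.values()]
    if not result_sets:
        return set(), set()
    distinct = set().union(*result_sets)        # appeared at least once
    common = set.intersection(*result_sets)     # appeared every time
    return distinct, common

# Placeholder data for illustration only (not the results from this post)
sample = {
    "business ethics simulation": ["a.example", "b.example", "c.example"],
    "ethics simulation business": ["a.example", "c.example", "d.example"],
    "simulation business ethics": ["a.example", "b.example", "e.example"],
}
distinct, common = tally(sample)
print(len(distinct), "different pages;", len(common), "common to every ordering")
```

The sketch ignores ranking position, so a page that moves from first to tenth still counts as the same page.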

All three search engines combined produced a total of 31 unique returns. If I had stopped after entering the first query--business ethics simulation--the three search engines would have yielded 21 different pages. Fifteen additional queries netted only 10 additional, unique pages. Probably not worth the effort.

Pages unique to each search engine:
7 - Google
4 - Bing
9 - Yahoo

What to make of this? The biggest lesson, it seems to me, is that searching different databases is more worthwhile than playing with word order. Without looking past the first page of each, I netted about twice as many highly ranked results as I would have using Google alone. (Whether the results are all that relevant is a matter for further investigation.) By contrast, I netted only 4-5 new pages by sticking with one search engine and varying the keyword order.
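Counting the pages unique to each search engine is a set difference: take one engine's pages and remove anything another engine also returned. Here is a minimal Python sketch, assuming each engine's results have already been pooled into one set. The name `exclusive_pages` and the sample URLs are hypothetical and do not reproduce the data from this test.

```python
def exclusive_pages(results_by_engine: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each engine, return the pages that no other engine returned."""
    exclusive = {}
    for engine, pages in results_by_engine.items():
        # Pool the pages returned by every other engine
        others = set().union(
            *(p for name, p in results_by_engine.items() if name != engine)
        )
        exclusive[engine] = pages - others
    return exclusive

# Placeholder data for illustration only (not the results from this post)
sample_engines = {
    "Google": {"a.example", "b.example", "c.example"},
    "Bing": {"b.example", "c.example", "d.example"},
    "Yahoo": {"c.example", "e.example", "f.example"},
}
exclusive = exclusive_pages(sample_engines)
combined = set().union(*sample_engines.values())
for engine, pages in exclusive.items():
    print(len(pages), "-", engine)
print(len(combined), "pages in total across all engines")
```

Comparing the size of `combined` with the size of any single engine's set shows how much an extra engine adds.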

Based on the number of unique results, if you're not using Yahoo, you might consider adding it to your list of go-to search engines.

Changing the word order does produce some differences, but maybe not enough (in this case) to warrant going through all the permutations. In general, stick with the natural language order of the words. It seems natural to say business ethics simulation. The other forms seem a bit awkward or forced. Since search engines look for words in relationship to one another, and this is the order most people might use when writing about business ethics simulations, it's good enough. I'm sure you can think of cases when a particular order works better. If so, post a reply.

There's one case when order is highly important: when operators are used. An operator modifies the keywords around it, so if it is placed in the wrong position, the results may be wildly unpredictable. For example: business OR ethics OR simulation (a student favorite when they stumble upon the OR operator).

Next time: revisiting the optimal number of keywords.


Scot Witt said...

Hi Carl-

Having worked with Google on a major database project which used Googlebase and Google, we found we could not predict the Google search results from day to day.

It seems the almighty search algorithm includes either a number-of-hits history or a click-through history component, perhaps both.

I'd suggest doing your test over two or three days and documenting results. I think you'll be surprised. I waited a couple of hours and the results changed with some finely honed keywords (scientific and consumer level botanical names).

Another thought- Everyone seems to think I'm pretty good with searches (writer, analyst, words? What can I tell you?) and I find the real issue is narrowing the search, which is why I do a general search first. If I get what I need, I'm done. If I get a lot of strays, I go into the advanced search and start narrowing the results, using a lot of Boolean NOT logic on common hits which are confusing the issue.

Carl Heine said...

Your points are well taken. From a practical perspective, I don't think most people will think to repeat queries once they've done them, except maybe to look again at a result they didn't check out the first time. Then they may notice variations in the results obtained. I found some minor changes in the results the second time I submitted a query, which--as you found--indicates there are dynamic things to which the search algorithm is sensitive. Word order swapping still produces fewer new results than trying the same query in a different database.

Your use of the NOT operator is typically something I try to avoid. I may address the strategy of exclusion in an upcoming blog post and welcome your input.