Search engine company Cuil launched a new service earlier today, claiming a 120 billion page index – larger than that of any of its competitors, including Google. Building an index of this magnitude is definitely impressive (or shall I say cool), but it addresses only one aspect of the ever-growing problem of information overload on the web. The others are maintaining freshness, scaling query execution, delivering relevance and providing a usable presentation.
Perhaps the greatest achievements of Google are index freshness and query scalability. They don’t need thousands of machines for nothing. PageRank did introduce an ingenious and elegant way of ranking results, but it has not fundamentally changed the TF-IDF algorithms used by information retrieval engines 30 years its senior. So where is the brilliance? In my humble opinion it’s the sheer engineering required to incrementally update a very large weighted reachability matrix, scale the crawler and deliver sub-second query execution time.
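To make the distinction concrete, here is a minimal sketch of the two ideas side by side: a basic TF-IDF score for matching a query against page text, and a PageRank-style power iteration over a link graph. This is a toy illustration under my own assumptions, not how Google or Cuil actually implement anything. The function names `tf_idf_scores` and `pagerank`, the smoothed IDF formula, the damping factor of 0.85 and the tiny in-memory data are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum.

    `docs` maps a page name to its text. Tokenization is a plain
    whitespace split, which is an assumption made for brevity.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many pages contain the term at all.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0 or not tokens:
                continue
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df) + 1.0  # smoothed IDF (one common variant)
            score += tf * idf
        scores[name] = score
    return scores


def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps every page to the list of pages it links to. Every link
    target is assumed to also appear as a key. Pages without outgoing
    links spread their rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy data: four pages and the links between them.
docs = {
    "a": "search engines build a large index of the web",
    "b": "a fresh index matters more than a large index",
    "c": "ranking results by links between pages",
    "d": "news about cooking and travel",
}
links = {"a": ["b"], "b": ["c"], "c": ["b"], "d": ["b"]}

text_scores = tf_idf_scores("large index", docs)
link_scores = pagerank(links)
```

Note that `pagerank` recomputes everything from scratch on every call. That is fine for four pages, and it is exactly the part that stops being trivial at web scale, where the link graph changes constantly and the ranks have to be updated incrementally.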
Cuil are claiming a massive improvement in these areas, which could potentially reduce the cost of index maintenance and consequently increase its freshness, as well as deliver lightning-fast performance to end-users. Cuil are also among the first large-scale search engines to introduce some level of semantic analysis. The ‘Explore by Category’ widget on the top right of the results page is probably the most significant usability improvement delivered by Cuil today. This small feature, if backed by a rich taxonomy (it’s too early to say, but the few queries I have tried did come up with relevant categories), holds huge promise. It has the potential to change the way people interact with search results – navigating deeper into a concept rather than manually refining the query.
Relevance of the result set is a different story. Here, again based on a brief experiment, I was not overly impressed. Yes, results on the first page (depending on how broad the query was) came up relevant, but were they the most relevant? This, of course, is highly dependent on the subjective measure of what one considers relevant to a search term.
Now this is what really got me puzzled – presentation. What is this nonsense? If all pages are considered equal candidates for a match and users expect the ‘answer’ to be in one ‘most relevant’ page, what value does a news portal arrangement provide? Is the second block on the third row more relevant than the one right next to it? Just show results in a way that enables a quick scan. At the end of the day, users are looking for the needle – they don’t want to stay and read, and they sure don’t care about a bigger haystack.