Friday, November 24, 2006

A search engine written in Haskell. Part two

NOTE: full text search is incorporated to TCache in the Data.TCache.IndexText module   where you can find an example.


In the first post  of this Blog , I described what would be the core or a search engine. Now it´s time to extend it a little more. Up to now, the search return a list of URI strings (unique keys in the case of database records, file-names in the case of files on a file-system, URL´s in the case of Web pages.
 
But it is meaningless to present this list to the user. An excerpt of each object, with the paragraphs that contains the keywords searched for may be something very valuable. At the end, if I take a little more effort and runtime resources to present something with meaning, it is probable that the user will not repeat the search many times and browse many files.  This is something that, no doubt, would be far more costly in every aspect.
 
 So the purpose is to extract an excerpt from the list of URIs returned by the core search procedure.
 
showExcerpts:: [String]-> String-> [(String,Int)]
showExcerpts xs r=  se xs text [] 0 []
    where
        text= show r

        se:: [String]->String->String->Int->[(String,Int)]->[(String,Int)]

        se _ [] _ _ fr= fr

        se qs t@(x:xs) f hasKeys fr
          | (not.null) $ filter (==True)$ map ((flip  isPrefixOf) t) qs= 
                 se qs xs (append f  x) (hasKeys+1) fr --some word found

               
          | x `elem` ".,"=  -- limit of fragment
                case hasKeys of
                 0 -> se qs xs [] 0 fr  --the previous fragment was

                                        --void of keys
                 _ -> se qs xs [] 0 ((f,hasKeys):fr) --added the old,

                                                     --new fragment starts

          | otherwise= se qs xs (append f x) hasKeys fr


        append l c= l ++ [c]

 
ShowExcerpts take a set of keywords, a Show-able something and return the paragraphs that contains at least one of the keywords and the number of keywords found in each paragraphs.
 
for example,  In this test I use text from Paul Graham the genius :
 
 
 
main=
        print $ showExcerpts ["Daddy","model"] text

  where text=  "When people care enough about something to do it well, those who do it best tend to be far better than everyone else. There's a huge gap between Leonardo and second-rate contemporaries like Borgognone. You see the same gap between Raymond Chandler and the average writer of detective novels. A top-ranked professional chess player could play ten thousand games against an ordinary club player without losing once.Like chess or painting or writing novels, making money is a very specialized skill. But for some reason we treat this skill differently. No one complains when a few people surpass all the rest at playing chess or writing novels, but when a few people make more money than the rest, we get editorials saying this is wrong.Why? The pattern of variation seems no different than for any other skill. What causes people to react so strongly when the skill is making money?I think there are three reasons we treat making money as different: the misleading model of wealth we learn as children; the disreputable way in which, till recently, most fortunes were accumulated; and the worry that great variations in income are somehow bad for society. As far as I can tell, the first is mistaken, the second outdated, and the third empirically false. Could it be that, in a modern democracy, variation in income is actually a sign of health?The Daddy Model of WealthBecause kids are unable to create wealth, whatever they have has to be given to them. And when wealth is something you're given, then of course it seems that it should be distributed equally. [2] As in most families it is. The kids see to that. \"Unfair,\" they cry, when one sibling gets more than another.In the real world, you can't keep living off your parents. If you want something, you either have to make it, or do something of equivalent value for someone else, in order to get them to give you enough money to buy it. In the real world, wealth is (except for a few specialists like thieves and speculators) something you have to create, not something that's distributed by Daddy. And since the ability and desire to create it vary from person to person, it's not made equally.It may seem unlikely in principle that one individual could really generate so much more wealth than another. The key to this mystery is to revisit that question, are they really worth 100 of us? Would a basketball team trade one of their players for 100 random people? What would Apple's next product look like if you replaced Steve Jobs with a committee of 100 random people? [6] These things don't scale linearly. Perhaps the CEO or the professional athlete has only ten times (whatever that means) the skill and determination of an ordinary person. But it makes all the difference that it's concentrated in one individual. When we say that one kind of work is overpaid and another underpaid, what are we really saying? In a free market, prices are determined by what buyers want. People like baseball more than poetry, so baseball players make more than poets. To say that a certain kind of work is

 
 the results are:
 
C:\Documents and Settings\Propietario\Escritorio\haskell>main

[(" Why else would this idea occur in this odd context? Whereas if the speaker were still operating on the Daddy Model",2),(" The appearance of the word unjust here is the unmistakable spectral signature of the Daddy Model",2),(" not something that's distributed by Daddy",1),(" variation in income is actually a sign of health?The Daddy Model of Wealth Because kids are unable to create wealth",2)]
   
 
 Things to note:
  • The paragraphs are not sorted by relevance, but the relevance is calculated (the numbers attached to each paragraph).
  • There is no limit in the number of paragraphs to return
 
This is by design. It is a task of the consumer to decide either to sort it, limit the number of paragraphs or whatever. That is so because this algorithm do it all in one single pass, so that the lazy nature of Haskell permit getExcerpts stop wasting resources when the consumer decides that there are enough paragraphs extracted. Great!.
 
 there are many many cases like this in web programming where laziness pays a lot: HTML Pagination for example. This is something that happens too in the presentation of search results.
 
Note that there are hard coded thinks, for example, I take the dot and the comma as the only separators of paragraphs. By the way, How the object can be obtained realistically ?. This is a mook-up. it is certainly not general and hardly reusable. 
 
For that matter I have enhanced the Type class Indexable  for retrieving, convert to string and give the separators to split the text in words and paragraphs.
 
class Indexable a where
    listIze   :: a -> [String]
    getFromURI:: String -> IO a --get the object as s String
    toString :: a-> String
    expSeparators :: [Char]
    wordSeparators:: [Char]

 
The additions are not orthogonal with listIze, but at this moment, it is enough
 
So how looks like the integration of this stuff with the previous search engine described in the first post ?
 
-- return a list of URIs and excerpts for each document that match the search criteria

searchResults keys= do
  let xs=listIze $ Words str
  uris<- search xs
  rs  <- map getFromURI uris
 
  return $ zip uris (map (showExcerpts xs) rs)

 
searchResults is, basically, a map to call showExcerpts for each URI returned by search.
 
Again, there is no limit in the number of the results returned. The lazy nature of Haskell permits me to give to the consumer total freedom and responsibility, without complicating the code with clumsy parameters and buffering.  
 
The caching of document content for extracting excerpts is not the search engine responsibility up to now. Google do it, so it may be a good idea to cache the content if possible. The transactional cache  can be a good candidate for this role. Specially when the objects indexed are being updated transactionally and the index engine has to take care of the changes. I´m working on it. 
 

No comments: