Tuesday, November 14, 2006

A search engine written in Haskell

NOTE: full text search is incorporated to TCache in the Data.TCache.IndexText module   where you can find an example.

It is so easy to transform ideas into programs using Haskell.
I was in a borin conference. The conferences excite my imagination in the same way than walking fast. I started to think about how the core of a search engine would look like. I also need one for my portal http://www.FreeChooser.com (more on this portal coming soon). I needed to index not only web pages, but also Haskell data records.
My search engine needed a compact index database, well designed so that the the search should be fast, even when searching for many words at the same time, it would return a short set of documents that contains all the words requested. This is the way to achieve a compromise between simplicity, speed, ease of use and resource waste.  I didn't realize that the core idea could have been achieved in less than 50 lines of Haskell code!!!
I heard time ago about a search engine www.alltheweb.com that used bit maps to achieve the fastest search results six years ago.
During the conference, I figured out a way to use bitmaps. For each word, the index must store a set of bits. each bit correspond to a document identified by the position in the bitmap (an Inte(ger)) and the association of this Int(eger) with the URL of the document. the bit will have a 1 when the document has this word and 0 when not. Very simple.
To search for the documents that contain term1 AND term2 ... AND TermN is just  a bitwise AND of the respective bitmap terms.
So I started to code it as soon as I returned home. Here is the code:

Module Search where
-- The Core of a Search Engine
-- Author: Alberto Gómez Corona
-- Language: Haskell
-- Terms of use: you can do whatever you want
-- with this code as long as you keep this notice

--store the URI list of indexed objects (reverse to intKey)
keyInt= unsafePerformIO $ new (==) hashString :: HashTable String Int

--store the bitmaps for each word
hs= unsafePerformIO $ new (==) hashString :: HashTable String Integer

--store the URN for each bit position
intKey= unsafePerformIO $ new (==) hashInt:: HashTable Int String

--store the last bit position/las int id of the last object indexed
indexRef =unsafePerformIO $ newMVar 0 :: MVar Int

-- add the content of an object, identified by the uri string:
addObject:: (Indexable a)=> String->a->IO()
addObject uri content= do
  mk<- HT.lookup keyInt uri
  case mk of
    Just k -> return () --to avoid repeated indexing of documents
    Nothing-> addObject1 uri content
 addObject1  uri content=do
  n <- takeMVar indexRef
  add n (k n) $ listIze content
  update keyInt uri n
  update intKey n uri
  putMVar indexRef (n+1)

   add n k  []= return ()
   add n k (x:xs)= do
        mv<- HT.lookup hs x
        case mv of
                  Just v -> update hs x (v .|. k)
                  Nothing-> update hs x k
        add n k xs
   k n= shiftL 1 n

-- search for a set of ANDed keywords and return the list of URIs That contain all of them
search xs = do
  n <- readMVar indexRef
  print n
  mbi<- mapM (HT.lookup hs)xs
  let bi = catMaybes mbi
  -- if a word has not been found then return empty results
  if length bi/= length mbi then return [] else do
   let r= foldr (.&.) (head bi)  bi -- select te doc bits that have all the words  
       --r has the bits of the objects that match
   print r
   getObjecURIs n 0 r []

    getObjecURIs :: Int->Int->Integer->[String]->IO [String]
    getObjecURIs 0 _ _ rs= return rs
    getObjecURIs n i r rs=
        case (testBit r 0) of
          True -> do
                  urn <- HT.lookup intKey i
                  getObjecURIs (n-1)(i+1) (shiftR r 1) (fromJust urn:rs)

          False-> getObjecURIs (n-1)(i+1) (shiftR r 1) rs
Of course, it is necessary to add persistence for the index database hs and the rest of the associated HashTables and so on.
There are two procedures addObject (to add a new object URI and his string content) to the index database. The other method is Search. It takes a list of words as arguments and return the list of documents that have all the words.
To feed addObject,  I also started to define  filters for special kind of objects: (text files, Haskell registers, XML/HTML files). The code below shows an early attemp to do so. Note the type signature of adObject.
I have defined a Class Indexable that has a function that convert the object into a list of strings:

--a is indexable if it can be serialized into a list of word strings        

class Indexable a where
    listIze:: a->[String]

data Words= Words String

instance Indexable Words where
  listIze (Words s)= str2List s (\c ->isAlphaNum c || c=='@') where

   str2List [] _        = []
   str2List s  f        = h:r where       
    (h,t)        = myreads2 s f
    r        = str2List t f

    myreads2 [] _=([],[])
    myreads2 (c:rest) f | not(f c)  = ([], rest)
                        | otherwise = (c:r, skipspaces s f)
                            where (r,s) = myreads2 rest f

    skipspaces [] _ = []
    skipspaces (c:rest) f  | not(f c)        = skipspaces rest f
                            | otherwise        = c:rest

data XMLText= XMLText String

instance Indexable XMLText where
  listIze (XMLText s)= str2List s (\c ->isAlphaNum c || c=='@') where

   str2List "" _        = []
   str2List ('<':rs) f  = str2List (tail $ dropWhile(/='>') rs) f
   str2List s  f        = h:r where
    (h,t)        = myreads2 s f
    r                = str2List t f

    myreads2 [] _=([],[])
    myreads2 (c:rest) f | not(f c)  = ([], rest)
                       | otherwise = (c:r, skipspaces s f)
                                   where (r,s) = myreads2 rest f

    skipspaces [] _ = []
    skipspaces ('<':rs) f = skipspaces (tail $ dropWhile(/='>') rs) f
    skipspaces (c:rest) f
               | not(f c)        = skipspaces rest f
             | otherwise        = c:rest

data Idx a= Idx a
instance Show a => Indexable (Idx a) where
    listIze  a = listIze $ Words $ show a

instance Show a=> Show(Idx a) where
    show (Idx a)= show a
With this quick and dirty filter definitions for Text, XML/HTML and Haskell registers, we can do this:

data Register=Reg String Integer deriving Show


  addObject "input1.txt" $ Words "John, 18"

  addObject "input2.html" $ XMLText "<html><body><p>John</p><body>18</body></html>"

  addObject "HaskelReg.hs" $ Idx $ Reg "John" 18

  r<- search ["John","18"]
  print r

The source of Search.hs can be found here
See part 2  of the search engine

Note: full text Search is now a module in TCache (Data.TCache.IndexText) integrated in the query language (Data.TCache.IndexQuery) see:


No comments: