It is so easy to transform ideas into programs using Haskell.
I was in a borin conference. The conferences excite my imagination in the same way than walking fast. I started to think about how the core of a search engine would look like. I also need one for my portal http://www.FreeChooser.com (more on this portal coming soon). I needed to index not only web pages, but also Haskell data records.
My search engine needed a compact index database, well designed so that the the search should be fast, even when searching for many words at the same time, it would return a short set of documents that contains all the words requested. This is the way to achieve a compromise between simplicity, speed, ease of use and resource waste. I didn't realize that the core idea could have been achieved in less than 50 lines of Haskell code!!!
I heard time ago about a search engine www.alltheweb.com that used bit maps to achieve the fastest search results six years ago.
During the conference, I figured out a way to use bitmaps. For each word, the index must store a set of bits. each bit correspond to a document identified by the position in the bitmap (an Inte(ger)) and the association of this Int(eger) with the URL of the document. the bit will have a 1 when the document has this word and 0 when not. Very simple.
To search for the documents that contain term1 AND term2 ... AND TermN is just a bitwise AND of the respective bitmap terms.
So I started to code it as soon as I returned home. Here is the code:
: |
Module Search where ------------------------------------------------- -- The Core of a Search Engine -- Author: Alberto Gómez Corona -- Language: Haskell -- Terms of use: you can do whatever you want -- with this code as long as you keep this notice ------------------------------------------------ --store the URI list of indexed objects (reverse to intKey) keyInt= unsafePerformIO $ new (==) hashString :: HashTable String Int --store the bitmaps for each word hs= unsafePerformIO $ new (==) hashString :: HashTable String Integer --store the URN for each bit position intKey= unsafePerformIO $ new (==) hashInt:: HashTable Int String --store the last bit position/las int id of the last object indexed indexRef =unsafePerformIO $ newMVar 0 :: MVar Int -- add the content of an object, identified by the uri string: addObject:: (Indexable a)=> String->a->IO() addObject uri content= do mk<- HT.lookup keyInt uri case mk of Just k -> return () --to avoid repeated indexing of documents Nothing-> addObject1 uri content where addObject1 uri content=do n <- takeMVar indexRef add n (k n) $ listIze content update keyInt uri n update intKey n uri putMVar indexRef (n+1) where add n k []= return () add n k (x:xs)= do mv<- HT.lookup hs x case mv of Just v -> update hs x (v .|. k) Nothing-> update hs x k add n k xs k n= shiftL 1 n -- search for a set of ANDed keywords and return the list of URIs That contain all of them search xs = do n <- readMVar indexRef print n mbi<- mapM (HT.lookup hs)xs let bi = catMaybes mbi -- if a word has not been found then return empty results if length bi/= length mbi then return [] else do let r= foldr (.&.) (head bi) bi -- select te doc bits that have all the words --r has the bits of the objects that match print r getObjecURIs n 0 r [] where getObjecURIs :: Int->Int->Integer->[String]->IO [String] getObjecURIs 0 _ _ rs= return rs getObjecURIs n i r rs= case (testBit r 0) of True -> do urn <- HT.lookup intKey i getObjecURIs (n-1)(i+1) (shiftR r 1) (fromJust urn:rs) False-> getObjecURIs (n-1)(i+1) (shiftR r 1) rs |
Of course, it is necessary to add persistence for the index database hs and the rest of the associated HashTables and so on.
There are two procedures addObject (to add a new object URI and his string content) to the index database. The other method is Search. It takes a list of words as arguments and return the list of documents that have all the words.
To feed addObject, I also started to define filters for special kind of objects: (text files, Haskell registers, XML/HTML files). The code below shows an early attemp to do so. Note the type signature of adObject.
I have defined a Class Indexable that has a function that convert the object into a list of strings:
--a is indexable if it can be serialized into a list of word strings class Indexable a where listIze:: a->[String] data Words= Words String instance Indexable Words where listIze (Words s)= str2List s (\c ->isAlphaNum c || c=='@') where str2List [] _ = [] str2List s f = h:r where (h,t) = myreads2 s f r = str2List t f myreads2 [] _=([],[]) myreads2 (c:rest) f | not(f c) = ([], rest) | otherwise = (c:r, skipspaces s f) where (r,s) = myreads2 rest f skipspaces [] _ = [] skipspaces (c:rest) f | not(f c) = skipspaces rest f | otherwise = c:rest data XMLText= XMLText String instance Indexable XMLText where listIze (XMLText s)= str2List s (\c ->isAlphaNum c || c=='@') where str2List "" _ = [] str2List ('<':rs) f = str2List (tail $ dropWhile(/='>') rs) f str2List s f = h:r where (h,t) = myreads2 s f r = str2List t f myreads2 [] _=([],[]) myreads2 (c:rest) f | not(f c) = ([], rest) | otherwise = (c:r, skipspaces s f) where (r,s) = myreads2 rest f skipspaces [] _ = [] skipspaces ('<':rs) f = skipspaces (tail $ dropWhile(/='>') rs) f skipspaces (c:rest) f | not(f c) = skipspaces rest f | otherwise = c:rest data Idx a= Idx a instance Show a => Indexable (Idx a) where listIze a = listIze $ Words $ show a instance Show a=> Show(Idx a) where show (Idx a)= show a |
With this quick and dirty filter definitions for Text, XML/HTML and Haskell registers, we can do this:
data Register=Reg String Integer deriving Show main=do addObject "input1.txt" $ Words "John, 18" addObject "input2.html" $ XMLText "<html><body><p>John</p><body>18</body></html>" addObject "HaskelReg.hs" $ Idx $ Reg "John" 18 r<- search ["John","18"] print r |
The source of Search.hs can be found
here
See part 2 of the search engine
Note: full text Search is now a module in TCache (Data.TCache.IndexText) integrated in the query language (Data.TCache.IndexQuery) see:
http://hackage.haskell.org/package/TCache
See part 2 of the search engine
Note: full text Search is now a module in TCache (Data.TCache.IndexText) integrated in the query language (Data.TCache.IndexQuery) see:
http://hackage.haskell.org/package/TCache
No comments:
Post a Comment