Friday, November 24, 2006

A search engine written in Haskell. Part two

NOTE: full text search is incorporated to TCache in the Data.TCache.IndexText module   where you can find an example.


In the first post  of this Blog , I described what would be the core or a search engine. Now it´s time to extend it a little more. Up to now, the search return a list of URI strings (unique keys in the case of database records, file-names in the case of files on a file-system, URL´s in the case of Web pages.
 
But it is meaningless to present this list to the user. An excerpt of each object, with the paragraphs that contains the keywords searched for may be something very valuable. At the end, if I take a little more effort and runtime resources to present something with meaning, it is probable that the user will not repeat the search many times and browse many files.  This is something that, no doubt, would be far more costly in every aspect.
 
 So the purpose is to extract an excerpt from the list of URIs returned by the core search procedure.
 
showExcerpts:: [String]-> String-> [(String,Int)]
showExcerpts xs r=  se xs text [] 0 []
    where
        text= show r

        se:: [String]->String->String->Int->[(String,Int)]->[(String,Int)]

        se _ [] _ _ fr= fr

        se qs t@(x:xs) f hasKeys fr
          | (not.null) $ filter (==True)$ map ((flip  isPrefixOf) t) qs= 
                 se qs xs (append f  x) (hasKeys+1) fr --some word found

               
          | x `elem` ".,"=  -- limit of fragment
                case hasKeys of
                 0 -> se qs xs [] 0 fr  --the previous fragment was

                                        --void of keys
                 _ -> se qs xs [] 0 ((f,hasKeys):fr) --added the old,

                                                     --new fragment starts

          | otherwise= se qs xs (append f x) hasKeys fr


        append l c= l ++ [c]

 
ShowExcerpts take a set of keywords, a Show-able something and return the paragraphs that contains at least one of the keywords and the number of keywords found in each paragraphs.
 
for example,  In this test I use text from Paul Graham the genius :
 
 
 
main=
        print $ showExcerpts ["Daddy","model"] text

  where text=  "When people care enough about something to do it well, those who do it best tend to be far better than everyone else. There's a huge gap between Leonardo and second-rate contemporaries like Borgognone. You see the same gap between Raymond Chandler and the average writer of detective novels. A top-ranked professional chess player could play ten thousand games against an ordinary club player without losing once.Like chess or painting or writing novels, making money is a very specialized skill. But for some reason we treat this skill differently. No one complains when a few people surpass all the rest at playing chess or writing novels, but when a few people make more money than the rest, we get editorials saying this is wrong.Why? The pattern of variation seems no different than for any other skill. What causes people to react so strongly when the skill is making money?I think there are three reasons we treat making money as different: the misleading model of wealth we learn as children; the disreputable way in which, till recently, most fortunes were accumulated; and the worry that great variations in income are somehow bad for society. As far as I can tell, the first is mistaken, the second outdated, and the third empirically false. Could it be that, in a modern democracy, variation in income is actually a sign of health?The Daddy Model of WealthBecause kids are unable to create wealth, whatever they have has to be given to them. And when wealth is something you're given, then of course it seems that it should be distributed equally. [2] As in most families it is. The kids see to that. \"Unfair,\" they cry, when one sibling gets more than another.In the real world, you can't keep living off your parents. If you want something, you either have to make it, or do something of equivalent value for someone else, in order to get them to give you enough money to buy it. In the real world, wealth is (except for a few specialists like thieves and speculators) something you have to create, not something that's distributed by Daddy. And since the ability and desire to create it vary from person to person, it's not made equally.It may seem unlikely in principle that one individual could really generate so much more wealth than another. The key to this mystery is to revisit that question, are they really worth 100 of us? Would a basketball team trade one of their players for 100 random people? What would Apple's next product look like if you replaced Steve Jobs with a committee of 100 random people? [6] These things don't scale linearly. Perhaps the CEO or the professional athlete has only ten times (whatever that means) the skill and determination of an ordinary person. But it makes all the difference that it's concentrated in one individual. When we say that one kind of work is overpaid and another underpaid, what are we really saying? In a free market, prices are determined by what buyers want. People like baseball more than poetry, so baseball players make more than poets. To say that a certain kind of work is

 
 the results are:
 
C:\Documents and Settings\Propietario\Escritorio\haskell>main

[(" Why else would this idea occur in this odd context? Whereas if the speaker were still operating on the Daddy Model",2),(" The appearance of the word unjust here is the unmistakable spectral signature of the Daddy Model",2),(" not something that's distributed by Daddy",1),(" variation in income is actually a sign of health?The Daddy Model of Wealth Because kids are unable to create wealth",2)]
   
 
 Things to note:
  • The paragraphs are not sorted by relevance, but the relevance is calculated (the numbers attached to each paragraph).
  • There is no limit in the number of paragraphs to return
 
This is by design. It is a task of the consumer to decide either to sort it, limit the number of paragraphs or whatever. That is so because this algorithm do it all in one single pass, so that the lazy nature of Haskell permit getExcerpts stop wasting resources when the consumer decides that there are enough paragraphs extracted. Great!.
 
 there are many many cases like this in web programming where laziness pays a lot: HTML Pagination for example. This is something that happens too in the presentation of search results.
 
Note that there are hard coded thinks, for example, I take the dot and the comma as the only separators of paragraphs. By the way, How the object can be obtained realistically ?. This is a mook-up. it is certainly not general and hardly reusable. 
 
For that matter I have enhanced the Type class Indexable  for retrieving, convert to string and give the separators to split the text in words and paragraphs.
 
class Indexable a where
    listIze   :: a -> [String]
    getFromURI:: String -> IO a --get the object as s String
    toString :: a-> String
    expSeparators :: [Char]
    wordSeparators:: [Char]

 
The additions are not orthogonal with listIze, but at this moment, it is enough
 
So how looks like the integration of this stuff with the previous search engine described in the first post ?
 
-- return a list of URIs and excerpts for each document that match the search criteria

searchResults keys= do
  let xs=listIze $ Words str
  uris<- search xs
  rs  <- map getFromURI uris
 
  return $ zip uris (map (showExcerpts xs) rs)

 
searchResults is, basically, a map to call showExcerpts for each URI returned by search.
 
Again, there is no limit in the number of the results returned. The lazy nature of Haskell permits me to give to the consumer total freedom and responsibility, without complicating the code with clumsy parameters and buffering.  
 
The caching of document content for extracting excerpts is not the search engine responsibility up to now. Google do it, so it may be a good idea to cache the content if possible. The transactional cache  can be a good candidate for this role. Specially when the objects indexed are being updated transactionally and the index engine has to take care of the changes. I´m working on it. 
 

Thursday, November 16, 2006

Haskell Transactional Cache

NOTE: last version download at: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/TCache

I was impressed by Haskell since I started to learn it a year ago. I was looking for a language that can be used for genetic programming, and haskell eases the creation custom languages such is the one I needed.

Soon I was fascinated by the pure functional approach. I was also attracted by the concept of referential transparency, that makes life easier for code clarity and maintainability. The smart type system also was impressive on avoiding common errors at compilation time that plagued other languages. For these reason and a lot more I though that Haskell has reasonable perspectives to be the future Java. Concerning Web programming, despite the timid effort done up to now, surprisingly, Haskell is in the best track to win the race for the best Web engine. Haskell Server Pages has a quite good web framework with unique capabilities. HSP Client-side has the best approach for Ajax programming.

But Haskell still has many holes for rapid application development, something that is vital for escalating Haskell from naive, fun driven web projects to industrial acceptance. It lacks a MVC (Model View Controller) framework. Haskell has not something like Rails for Rubi or Hibernate for Java.

One of the problems of Web development is the three tier programming problems. It is about how to keep low the impedance between layers when each layer has his own architectural requirements so that they are designed around different paradigms. With impedance I refer to the additional problems introduced within each layer, at every stage of the software life-cycle, produced by the mismatch with other layers.

Typically, the data layer is constructed around a database and a relational language. The business layer uses a object oriented language, while the presentation layer uses HTML, Javascript and some other web technologies. Depending on the framework chosen, more often than not, the middle layer end with a mess of relational, business and presentation code.

That happens even when the framework explicitly forbid that, unless some clear configurable interfaces are defined to isolate them. By using the natural data on each layer and providing  mapping mechanisms between the idiosyncrasies of the respective layers avoids the middle layer contamination. This keep the impedance low not only at the design and development time but also in  test, integration and
maintenance. Each layer expert  can perform his work with little care about the technologies configurations, design decisionsand changes in the other layers. Such kind of mappings are exemplified, in the case of the Business-Data layer boundary, by storage-independent data persistence mechanisms, such is Hibernate for Java and Rails for Rubi.  In the case of the Business-Presentation Layer boundary, something like the Custom Tags of JSP and ASP.NET do the same role into isolating the presentation from the  logic of the business layer . They are also mapping mechanisms. XSLT also play this role.

In the other side, I HATE designing and programming databases (database record and field access is so low level that it can ruin the readability and elegance of any algorithm). I want to use my own defined objects, not the pieces of information defined to accomplish with low level data requirements. I want not to take care about on every single access to the data I need.

So i need to:


  1.  Abstract myself from the physical data structure and rather use the data definitions that fit with the nature of the problem to solve

  2.  I need Transaction-aware data objects, because in the multi-threaded world of Internet servers, two clients may want to buy the same Item at the same time and I want to decrease the stock of this item two times instead of one, for example (if there are two items at least).

  3.  I also need a way to define mappings between the objects I use and the physical storage, no matter that it is a file-system or a Database.

  4.  should be abstract to work with any kind of data definitions.

Once the database is just a persistence mechanism for a web application the transactionality can not be provided by the database because the updates are done in the cached objects. The data has to be transactional in memory. It is very important to have a general mechanism of transactions for this
reason.

 I created TCache.hs (download), a transactional cache that abstract the user from physical data and
transactional plumbing. The data access uses a very simple semantics: to access a data object I use just the data object itself updated with just the fields necessary for composing a unique key.

The transactional cache takes care of transactions, synchronizationwith the physical data ( one way, from the cache to the PD up to now) and maintenance of the cache size by dropping the data with older access.

The transaction uses STM (Software Transactional Memory) so that noblocking is used, and the resulting application is faster at doing transactions. For this reason it only compiles in GHC.

The cache is implemented in a Hash Table (Data.HashTable) so that many thread may be doing transactions at the same time.

Below you can find an usage example:



module Main where
-------------------------------------------------
-- A example of Transactional cache usage (TCache.hs)
-- (Something like the Java Hibernate)
-- Author: Alberto Gómez Corona Nov 2006
-- Language: Haskell
-- Terms of use: you can do whatever you want
-- with this code as long as you keep this notice
------------------------------------------------

import TCache

import Control.Concurrent
import System.Directory



--1 and 4: The data elements to be used in the example: A user will repeatedly buy Items.

data  Data=   User{uname::String, uid::String, spent:: Int} |
              Item{iname::String, iid::String, price::Int, stock::Int}

              deriving (Read, Show)

--3 The mappings between the cache and the phisical storage are defined by the interface IResource

--      to extract the resource unique key,
--      to read the resource from the physical storage,
--      to store it and
--      to delete the resource from the physical storage.


instance IResource Data where
        keyResource         User{uid=id}= id
        keyResource         Item{iid=id}= id      
        readResource    e= do s<- readFile$ keyResource e
                              return$ Just $ read s                                 
        writeResource   e= writeFile (keyResource e) $ show e

        delResource     e= removeFile $ keyResource e

-- buy is the operation to be performed in the example


--4 withResources gets a partial definition of each resource necessary for extracting the key, fill all the rest of the data structures (if found ) and return a list of Maybe Data. BuyIt is part of the domain problem. it receive this list and generates a new list of dat objects that are updated in the cache. buyIt is executed atomically.


user `buy` item=
    withResources[user,item] buyIt

where

    buyIt[Just us,Just it]
       | stock it > 0= [us',it'] `debug1` "john buy a PC"
       | otherwise   = error "stock is empty for this product"

    buyIt[_,_] = error "either the user or the item does not exist"
    us'= us{spent=spent us + price it}
    it'= it{stock= stock it-1}

main= do
        -- create resources (acces no resources and return two new Data objects defined in items)
        withResources[]items

        --11 PCs are charged  to the John´s account in paralel, to show transactionality
        --because there are only 10 PCs in stock, the last thread must return an error

        for 11 $ forkIO $ User{uid="U12345"} `buy` Item{iid="I54321"}
      
        --wait 5 seconds        
        threadDelay 5000000

        -- write the cache content in a persistent store (invoque writeResource for each resource)
        -- in a real application clearSyncCache can be used instead to adjust size and write the cache periodically

        syncCache (refcache :: Cache Data)

        -- the files have been created. the files U12345 and I54321 must contain the result of the 11 iterations

  where
        items _=
              [User "John" "U12345" 0
              ,Item "PC" "I54321" 6000 10
]


Here follows is a typical output. I  printed a trace with the text atomic try as the first statement of the atomic transaction. You can se how the first transaction initiated blocks the rest and even some retry. That is so because all the updates go to the same registers. Overall, STM is faster than blocking and far easier. My first version used  blocking and the code were far more complicated.



"atomic try"
"atomic try"
john buy a PC
"atomic try""atomic try""atomic try""atomic try""atomic try""atomic try""atomic
try""atomic try""atomic try""atomic try"
john buy a PC

john buy a PC

john buy a PC


john buy a PC
john buy a PC



john buy a PC
john buy a PC


john buy a PC


john buy a PC
john buy a PC

"atomic try""atomic try"
john buy a PC

john buy a PC
"atomic try"
john buy a PC"atomic try"
main: stock is empty for this product



The final content of the files are:



U12345:  User {uname = "John", uid = "U12345", spent = 60000}

I54321:  Item {iname = "PC", iid = "I54321", price = 6000, stock = 0}



The amount spent by John is the sum of th prices of the ten PCs in stock and the stock is 0, so the transactions have been managed correcly.

There are other operations; getResources that simply return the data requested withour doing any transaction.   clearSyncCacheProc start the process for cache maintenance and syncronization according with his passed parameters: time between checks, cache size and filtering criteria. getResourcesID permits not only update but also delete cache elements in the course of a transaction.

The module is far from finised. The syncronization mechanism must be refined for example with cache invalidation machanisms. I welcome anyone interested in to improve it.



NOTES:


I just realized that my transactional cache has been used in this excelent paper: Dissecting Transactional Executions in Haskell.


The test case used in the study is more or less the same example described above. Since all the parallel treads update the same two objects all the time, this is the worst case. TCache uses in-memory non blocking transactions . Each transaction do not block, but instead he try to perform the task and rollback at the very end if things have been changed by other thread in the meantime. Just like databases. The bad thing is that the more CPU cores are executing the example, the more work being rolled back is done.

Fortunately this is not the case in real life. TCache and the STM transactions are designed with the assumption that the application manages many objects and the threads update simultaneously different objects most of the time. A user does not buy eleven times simultaneously the same product. And non blocking transactions scale better than blocking ones in real life scenarios, just because they don´t block.

Tuesday, November 14, 2006

A search engine written in Haskell

NOTE: full text search is incorporated to TCache in the Data.TCache.IndexText module   where you can find an example.

It is so easy to transform ideas into programs using Haskell.
 
I was in a borin conference. The conferences excite my imagination in the same way than walking fast. I started to think about how the core of a search engine would look like. I also need one for my portal http://www.FreeChooser.com (more on this portal coming soon). I needed to index not only web pages, but also Haskell data records.
 
My search engine needed a compact index database, well designed so that the the search should be fast, even when searching for many words at the same time, it would return a short set of documents that contains all the words requested. This is the way to achieve a compromise between simplicity, speed, ease of use and resource waste.  I didn't realize that the core idea could have been achieved in less than 50 lines of Haskell code!!!
 
I heard time ago about a search engine www.alltheweb.com that used bit maps to achieve the fastest search results six years ago.
 
During the conference, I figured out a way to use bitmaps. For each word, the index must store a set of bits. each bit correspond to a document identified by the position in the bitmap (an Inte(ger)) and the association of this Int(eger) with the URL of the document. the bit will have a 1 when the document has this word and 0 when not. Very simple.
 
To search for the documents that contain term1 AND term2 ... AND TermN is just  a bitwise AND of the respective bitmap terms.
 
So I started to code it as soon as I returned home. Here is the code:
 
 
 
:

Module Search where
-------------------------------------------------
-- The Core of a Search Engine
-- Author: Alberto Gómez Corona
-- Language: Haskell
-- Terms of use: you can do whatever you want
-- with this code as long as you keep this notice
------------------------------------------------


--store the URI list of indexed objects (reverse to intKey)
keyInt= unsafePerformIO $ new (==) hashString :: HashTable String Int

--store the bitmaps for each word
hs= unsafePerformIO $ new (==) hashString :: HashTable String Integer

--store the URN for each bit position
intKey= unsafePerformIO $ new (==) hashInt:: HashTable Int String

--store the last bit position/las int id of the last object indexed
indexRef =unsafePerformIO $ newMVar 0 :: MVar Int


-- add the content of an object, identified by the uri string:
addObject:: (Indexable a)=> String->a->IO()
addObject uri content= do
  mk<- HT.lookup keyInt uri
  case mk of
    Just k -> return () --to avoid repeated indexing of documents
    Nothing-> addObject1 uri content
 where
 addObject1  uri content=do
  n <- takeMVar indexRef
  add n (k n) $ listIze content
  update keyInt uri n
  update intKey n uri
  putMVar indexRef (n+1)

  where
   add n k  []= return ()
   add n k (x:xs)= do
        mv<- HT.lookup hs x
        case mv of
                  Just v -> update hs x (v .|. k)
                  Nothing-> update hs x k
        add n k xs
  
   k n= shiftL 1 n

-- search for a set of ANDed keywords and return the list of URIs That contain all of them
search xs = do
  n <- readMVar indexRef
  print n
  mbi<- mapM (HT.lookup hs)xs
  let bi = catMaybes mbi
  -- if a word has not been found then return empty results
  if length bi/= length mbi then return [] else do
   let r= foldr (.&.) (head bi)  bi -- select te doc bits that have all the words  
       --r has the bits of the objects that match
   print r
   getObjecURIs n 0 r []

  where
    getObjecURIs :: Int->Int->Integer->[String]->IO [String]
    getObjecURIs 0 _ _ rs= return rs
    getObjecURIs n i r rs=
        case (testBit r 0) of
          True -> do
                  urn <- HT.lookup intKey i
                  getObjecURIs (n-1)(i+1) (shiftR r 1) (fromJust urn:rs)

          False-> getObjecURIs (n-1)(i+1) (shiftR r 1) rs
 
  
Of course, it is necessary to add persistence for the index database hs and the rest of the associated HashTables and so on.
There are two procedures addObject (to add a new object URI and his string content) to the index database. The other method is Search. It takes a list of words as arguments and return the list of documents that have all the words.
 
To feed addObject,  I also started to define  filters for special kind of objects: (text files, Haskell registers, XML/HTML files). The code below shows an early attemp to do so. Note the type signature of adObject.
 
I have defined a Class Indexable that has a function that convert the object into a list of strings:
 
 




--a is indexable if it can be serialized into a list of word strings        

class Indexable a where
    listIze:: a->[String]
  

data Words= Words String

instance Indexable Words where
  listIze (Words s)= str2List s (\c ->isAlphaNum c || c=='@') where

   str2List [] _        = []
   str2List s  f        = h:r where       
    (h,t)        = myreads2 s f
    r        = str2List t f

    myreads2 [] _=([],[])
    myreads2 (c:rest) f | not(f c)  = ([], rest)
                        | otherwise = (c:r, skipspaces s f)
                            where (r,s) = myreads2 rest f

    skipspaces [] _ = []
    skipspaces (c:rest) f  | not(f c)        = skipspaces rest f
                            | otherwise        = c:rest


data XMLText= XMLText String

instance Indexable XMLText where
  listIze (XMLText s)= str2List s (\c ->isAlphaNum c || c=='@') where

   str2List "" _        = []
   str2List ('<':rs) f  = str2List (tail $ dropWhile(/='>') rs) f
   str2List s  f        = h:r where
    (h,t)        = myreads2 s f
    r                = str2List t f

    myreads2 [] _=([],[])
    myreads2 (c:rest) f | not(f c)  = ([], rest)
                       | otherwise = (c:r, skipspaces s f)
                                   where (r,s) = myreads2 rest f

    skipspaces [] _ = []
    skipspaces ('<':rs) f = skipspaces (tail $ dropWhile(/='>') rs) f
    skipspaces (c:rest) f
               | not(f c)        = skipspaces rest f
             | otherwise        = c:rest

data Idx a= Idx a
instance Show a => Indexable (Idx a) where
    listIze  a = listIze $ Words $ show a

instance Show a=> Show(Idx a) where
    show (Idx a)= show a
   
With this quick and dirty filter definitions for Text, XML/HTML and Haskell registers, we can do this:
 
 

data Register=Reg String Integer deriving Show


main=do

  addObject "input1.txt" $ Words "John, 18"

  addObject "input2.html" $ XMLText "<html><body><p>John</p><body>18</body></html>"

  addObject "HaskelReg.hs" $ Idx $ Reg "John" 18

  r<- search ["John","18"]
  print r
 

The source of Search.hs can be found here
 
See part 2  of the search engine

Note: full text Search is now a module in TCache (Data.TCache.IndexText) integrated in the query language (Data.TCache.IndexQuery) see:

http://hackage.haskell.org/package/TCache