Haskell and Web: 2006

Wednesday, December 13, 2006

Some ideas in progress

I´m wondering about how the search engine and the transactional cache can be used together in new ways. The transactional cache has only access by unique index, there is no freedom to access objects by arbitrary expressións with boolean conditions and field values a la SQL. In the other side, the search engine has no relevance and is not enough flexible for some needs. Can I inprove the search engine to provide both functionalities?. I found that it is possible:

To improve the relevance of the search engine results
Search queries with arbitrary boolean expressions
Query for haskell objects with certain field values.
To use the query language to retrieve sets of haskell registers and perform atomic operations with these registers.

A search engine written in Haskell. Part two

NOTE: full text search is incorporated to TCache in the Data.TCache.IndexText module where you can find an example.

In the first post of this Blog , I described what would be the core or a search engine. Now it´s time to extend it a little more. Up to now, the search return a list of URI strings (unique keys in the case of database records, file-names in the case of files on a file-system, URL´s in the case of Web pages.

But it is meaningless to present this list to the user. An excerpt of each object, with the paragraphs that contains the keywords searched for may be something very valuable. At the end, if I take a little more effort and runtime resources to present something with meaning, it is probable that the user will not repeat the search many times and browse many files. This is something that, no doubt, would be far more costly in every aspect.

So the purpose is to extract an excerpt from the list of URIs returned by the core search procedure.

showExcerpts:: [String]-> String-> [(String,Int)] showExcerpts xs r= se xs text [] 0 [] where text= show r se:: [String]->String->String->Int->[(String,Int)]->[(String,Int)] se _ [] _ _ fr= fr se qs t@(x:xs) f hasKeys fr \| (not.null) $ filter (==True)$ map ((flip isPrefixOf) t) qs= se qs xs (append f x) (hasKeys+1) fr --some word found \| x `elem` ".,"= -- limit of fragment case hasKeys of 0 -> se qs xs [] 0 fr --the previous fragment was `--void of keys _ -> se qs xs [] 0 ((f,hasKeys):fr) --added the old,` `--new fragment starts \| otherwise= se qs xs (append f x) hasKeys fr append l c= l ++ [c]`

ShowExcerpts take a set of keywords, a Show-able something and return the paragraphs that contains at least one of the keywords and the number of keywords found in each paragraphs.

for example, In this test I use text from Paul Graham the genius :

main= print $ showExcerpts ["Daddy","model"] text where text= "When people care enough about something to do it well, those who do it best tend to be far better than everyone else. There's a huge gap between Leonardo and second-rate contemporaries like Borgognone. You see the same gap between Raymond Chandler and the average writer of detective novels. A top-ranked professional chess player could play ten thousand games against an ordinary club player without losing once.Like chess or painting or writing novels, making money is a very specialized skill. But for some reason we treat this skill differently. No one complains when a few people surpass all the rest at playing chess or writing novels, but when a few people make more money than the rest, we get editorials saying this is wrong.Why? The pattern of variation seems no different than for any other skill. What causes people to react so strongly when the skill is making money?I think there are three reasons we treat making money as different: the misleading model of wealth we learn as children; the disreputable way in which, till recently, most fortunes were accumulated; and the worry that great variations in income are somehow bad for society. As far as I can tell, the first is mistaken, the second outdated, and the third empirically false. Could it be that, in a modern democracy, variation in income is actually a sign of health?The Daddy Model of WealthBecause kids are unable to create wealth, whatever they have has to be given to them. And when wealth is something you're given, then of course it seems that it should be distributed equally. [2] As in most families it is. The kids see to that. \"Unfair,\" they cry, when one sibling gets more than another.In the real world, you can't keep living off your parents. If you want something, you either have to make it, or do something of equivalent value for someone else, in order to get them to give you enough money to buy it. In the real world, wealth is (except for a few specialists like thieves and speculators) something you have to create, not something that's distributed by Daddy. And since the ability and desire to create it vary from person to person, it's not made equally.It may seem unlikely in principle that one individual could really generate so much more wealth than another. The key to this mystery is to revisit that question, are they really worth 100 of us? Would a basketball team trade one of their players for 100 random people? What would Apple's next product look like if you replaced Steve Jobs with a committee of 100 random people? [6] These things don't scale linearly. Perhaps the CEO or the professional athlete has only ten times (whatever that means) the skill and determination of an ordinary person. But it makes all the difference that it's concentrated in one individual. When we say that one kind of work is overpaid and another underpaid, what are we really saying? In a free market, prices are determined by what buyers want. People like baseball more than poetry, so baseball players make more than poets. To say that a certain kind of work is

the results are:

C:\Documents and Settings\Propietario\Escritorio\haskell>main

[(" Why else would this idea occur in this odd context? Whereas if the speaker were still operating on the Daddy Model",2),(" The appearance of the word unjust here is the unmistakable spectral signature of the Daddy Model",2),(" not something that's distributed by Daddy",1),(" variation in income is actually a sign of health?The Daddy Model of Wealth Because kids are unable to create wealth",2)]

Things to note:

The paragraphs are not sorted by relevance, but the relevance is calculated (the numbers attached to each paragraph).
There is no limit in the number of paragraphs to return

This is by design. It is a task of the consumer to decide either to sort it, limit the number of paragraphs or whatever. That is so because this algorithm do it all in one single pass, so that the lazy nature of Haskell permit getExcerpts stop wasting resources when the consumer decides that there are enough paragraphs extracted. Great!.

there are many many cases like this in web programming where laziness pays a lot: HTML Pagination for example. This is something that happens too in the presentation of search results.

Note that there are hard coded thinks, for example, I take the dot and the comma as the only separators of paragraphs. By the way, How the object can be obtained realistically ?. This is a mook-up. it is certainly not general and hardly reusable.

For that matter I have enhanced the Type class Indexable for retrieving, convert to string and give the separators to split the text in words and paragraphs.

class Indexable a where listIze :: a -> [String] getFromURI:: String -> IO a --get the object as s String toString :: a-> String expSeparators :: [Char] wordSeparators:: [Char]

The additions are not orthogonal with listIze, but at this moment, it is enough

So how looks like the integration of this stuff with the previous search engine described in the first post ?

-- return a list of URIs and excerpts for each document that match the search criteria
searchResults keys= do let xs=listIze $ Words str uris<- search xs rs <- map getFromURI uris return $ zip uris (map (showExcerpts xs) rs)

searchResults is, basically, a map to call showExcerpts for each URI returned by search.

Again, there is no limit in the number of the results returned. The lazy nature of Haskell permits me to give to the consumer total freedom and responsibility, without complicating the code with clumsy parameters and buffering.

The caching of document content for extracting excerpts is not the search engine responsibility up to now. Google do it, so it may be a good idea to cache the content if possible. The transactional cache can be a good candidate for this role. Specially when the objects indexed are being updated transactionally and the index engine has to take care of the changes. I´m working on it.

Thursday, November 16, 2006

Haskell Transactional Cache

NOTE: last version download at: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/TCache

I was impressed by Haskell since I started to learn it a year ago. I was looking for a language that can be used for genetic programming, and haskell eases the creation custom languages such is the one I needed.

Soon I was fascinated by the pure functional approach. I was also attracted by the concept of referential transparency, that makes life easier for code clarity and maintainability. The smart type system also was impressive on avoiding common errors at compilation time that plagued other languages. For these reason and a lot more I though that Haskell has reasonable perspectives to be the future Java. Concerning Web programming, despite the timid effort done up to now, surprisingly, Haskell is in the best track to win the race for the best Web engine. Haskell Server Pages has a quite good web framework with unique capabilities. HSP Client-side has the best approach for Ajax programming.

But Haskell still has many holes for rapid application development, something that is vital for escalating Haskell from naive, fun driven web projects to industrial acceptance. It lacks a MVC (Model View Controller) framework. Haskell has not something like Rails for Rubi or Hibernate for Java.

One of the problems of Web development is the three tier programming problems. It is about how to keep low the impedance between layers when each layer has his own architectural requirements so that they are designed around different paradigms. With impedance I refer to the additional problems introduced within each layer, at every stage of the software life-cycle, produced by the mismatch with other layers.

Typically, the data layer is constructed around a database and a relational language. The business layer uses a object oriented language, while the presentation layer uses HTML, Javascript and some other web technologies. Depending on the framework chosen, more often than not, the middle layer end with a mess of relational, business and presentation code.

That happens even when the framework explicitly forbid that, unless some clear configurable interfaces are defined to isolate them. By using the natural data on each layer and providing mapping mechanisms between the idiosyncrasies of the respective layers avoids the middle layer contamination. This keep the impedance low not only at the design and development time but also in test, integration and
maintenance. Each layer expert can perform his work with little care about the technologies configurations, design decisionsand changes in the other layers. Such kind of mappings are exemplified, in the case of the Business-Data layer boundary, by storage-independent data persistence mechanisms, such is Hibernate for Java and Rails for Rubi. In the case of the Business-Presentation Layer boundary, something like the Custom Tags of JSP and ASP.NET do the same role into isolating the presentation from the logic of the business layer . They are also mapping mechanisms. XSLT also play this role.

In the other side, I HATE designing and programming databases (database record and field access is so low level that it can ruin the readability and elegance of any algorithm). I want to use my own defined objects, not the pieces of information defined to accomplish with low level data requirements. I want not to take care about on every single access to the data I need.

So i need to:

Abstract myself from the physical data structure and rather use the data definitions that fit with the nature of the problem to solve
I need Transaction-aware data objects, because in the multi-threaded world of Internet servers, two clients may want to buy the same Item at the same time and I want to decrease the stock of this item two times instead of one, for example (if there are two items at least).
I also need a way to define mappings between the objects I use and the physical storage, no matter that it is a file-system or a Database.
should be abstract to work with any kind of data definitions.

Once the database is just a persistence mechanism for a web application the transactionality can not be provided by the database because the updates are done in the cached objects. The data has to be transactional in memory. It is very important to have a general mechanism of transactions for this
reason.

I created TCache.hs (download), a transactional cache that abstract the user from physical data and
transactional plumbing. The data access uses a very simple semantics: to access a data object I use just the data object itself updated with just the fields necessary for composing a unique key.

The transactional cache takes care of transactions, synchronizationwith the physical data ( one way, from the cache to the PD up to now) and maintenance of the cache size by dropping the data with older access.

The transaction uses STM (Software Transactional Memory) so that noblocking is used, and the resulting application is faster at doing transactions. For this reason it only compiles in GHC.

The cache is implemented in a Hash Table (Data.HashTable) so that many thread may be doing transactions at the same time.

Below you can find an usage example:

module Main where ------------------------------------------------- -- A example of Transactional cache usage (TCache.hs) -- (Something like the Java Hibernate) -- Author: Alberto Gómez Corona Nov 2006 -- Language: Haskell -- Terms of use: you can do whatever you want -- with this code as long as you keep this notice ------------------------------------------------ import TCache import Control.Concurrent import System.Directory

--1 and 4: The data elements to be used in the example: A user will repeatedly buy Items. data Data=   User{uname::String, uid::String, spent:: Int} |               Item{iname::String, iid::String, price::Int, stock::Int}               deriving (Read, Show) --3 The mappings between the cache and the phisical storage are defined by the interface IResource
--      to extract the resource unique key,
--      to read the resource from the physical storage,
--      to store it and
--      to delete the resource from the physical storage.

instance IResource Data where         keyResource         User{uid=id}= id         keyResource         Item{iid=id}= id               readResource    e= do s<- readFile$ keyResource e                               return$ Just $ read s                                          writeResource   e= writeFile (keyResource e) $ show e         delResource     e= removeFile $ keyResource e -- buy is the operation to be performed in the example

--4 withResources gets a partial definition of each resource necessary for extracting the key, fill all the rest of the data structures (if found ) and return a list of Maybe Data. BuyIt is part of the domain problem. it receive this list and generates a new list of dat objects that are updated in the cache. buyIt is executed atomically.

user `buy` item=     withResources[user,item] buyIt where     buyIt[Just us,Just it]        | stock it > 0= [us',it'] `debug1` "john buy a PC"        | otherwise   = error "stock is empty for this product"     buyIt[_,_] = error "either the user or the item does not exist"     us'= us{spent=spent us + price it}     it'= it{stock= stock it-1} main= do         -- create resources (acces no resources and return two new Data objects defined in items)         withResources[]items         --11 PCs are charged to the John´s account in paralel, to show transactionality         --because there are only 10 PCs in stock, the last thread must return an error         for 11 $ forkIO $ User{uid="U12345"} `buy` Item{iid="I54321"}                --wait 5 seconds                 threadDelay 5000000         -- write the cache content in a persistent store (invoque writeResource for each resource)         -- in a real application clearSyncCache can be used instead to adjust size and write the cache periodically         syncCache (refcache :: Cache Data)         -- the files have been created. the files U12345 and I54321 must contain the result of the 11 iterations   where         items _=               [User "John" "U12345" 0               ,Item "PC" "I54321" 6000 10]

Here follows is a typical output. I printed a trace with the text atomic try as the first statement of the atomic transaction. You can se how the first transaction initiated blocks the rest and even some retry. That is so because all the updates go to the same registers. Overall, STM is faster than blocking and far easier. My first version used blocking and the code were far more complicated.

"atomic try"
"atomic try"
john buy a PC
"atomic try""atomic try""atomic try""atomic try""atomic try""atomic try""atomic
try""atomic try""atomic try""atomic try"
john buy a PC

john buy a PC

john buy a PC

john buy a PC
john buy a PC

john buy a PC
john buy a PC

john buy a PC

john buy a PC
john buy a PC
"atomic try""atomic try"
john buy a PC

john buy a PC
"atomic try"
john buy a PC"atomic try"
main: stock is empty for this product

The final content of the files are:

U12345: User {uname = "John", uid = "U12345", spent = 60000}

I54321: Item {iname = "PC", iid = "I54321", price = 6000, stock = 0}

The amount spent by John is the sum of th prices of the ten PCs in stock and the stock is 0, so the transactions have been managed correcly.

There are other operations; getResources that simply return the data requested withour doing any transaction. clearSyncCacheProc start the process for cache maintenance and syncronization according with his passed parameters: time between checks, cache size and filtering criteria. getResourcesID permits not only update but also delete cache elements in the course of a transaction.

The module is far from finised. The syncronization mechanism must be refined for example with cache invalidation machanisms. I welcome anyone interested in to improve it.

NOTES:

I just realized that my transactional cache has been used in this excelent paper: Dissecting Transactional Executions in Haskell.

The test case used in the study is more or less the same example described above. Since all the parallel treads update the same two objects all the time, this is the worst case. TCache uses in-memory non blocking transactions . Each transaction do not block, but instead he try to perform the task and rollback at the very end if things have been changed by other thread in the meantime. Just like databases. The bad thing is that the more CPU cores are executing the example, the more work being rolled back is done.

Fortunately this is not the case in real life. TCache and the STM transactions are designed with the assumption that the application manages many objects and the threads update simultaneously different objects most of the time. A user does not buy eleven times simultaneously the same product. And non blocking transactions scale better than blocking ones in real life scenarios, just because they don´t block.

Tuesday, November 14, 2006

A search engine written in Haskell

NOTE: full text search is incorporated to TCache in the Data.TCache.IndexText module where you can find an example.

It is so easy to transform ideas into programs using Haskell.

I was in a borin conference. The conferences excite my imagination in the same way than walking fast. I started to think about how the core of a search engine would look like. I also need one for my portal http://www.FreeChooser.com (more on this portal coming soon). I needed to index not only web pages, but also Haskell data records.

My search engine needed a compact index database, well designed so that the the search should be fast, even when searching for many words at the same time, it would return a short set of documents that contains all the words requested. This is the way to achieve a compromise between simplicity, speed, ease of use and resource waste. I didn't realize that the core idea could have been achieved in less than 50 lines of Haskell code!!!

I heard time ago about a search engine www.alltheweb.com that used bit maps to achieve the fastest search results six years ago.

During the conference, I figured out a way to use bitmaps. For each word, the index must store a set of bits. each bit correspond to a document identified by the position in the bitmap (an Inte(ger)) and the association of this Int(eger) with the URL of the document. the bit will have a 1 when the document has this word and 0 when not. Very simple.

To search for the documents that contain term1 AND term2 ... AND TermN is just a bitwise AND of the respective bitmap terms.

So I started to code it as soon as I returned home. Here is the code:

Module Search where ------------------------------------------------- -- The Core of a Search Engine -- Author: Alberto Gómez Corona -- Language: Haskell -- Terms of use: you can do whatever you want -- with this code as long as you keep this notice ------------------------------------------------

--store the URI list of indexed objects (reverse to intKey) keyInt= unsafePerformIO $ new (==) hashString :: HashTable String Int --store the bitmaps for each word hs= unsafePerformIO $ new (==) hashString :: HashTable String Integer --store the URN for each bit position intKey= unsafePerformIO $ new (==) hashInt:: HashTable Int String --store the last bit position/las int id of the last object indexed indexRef =unsafePerformIO $ newMVar 0 :: MVar Int -- add the content of an object, identified by the uri string: addObject:: (Indexable a)=> String->a->IO() addObject uri content= do mk<- HT.lookup keyInt uri case mk of Just k -> return () --to avoid repeated indexing of documents Nothing-> addObject1 uri content where addObject1 uri content=do n <- takeMVar indexRef add n (k n) $ listIze content update keyInt uri n update intKey n uri putMVar indexRef (n+1) where add n k []= return () add n k (x:xs)= do mv<- HT.lookup hs x case mv of Just v -> update hs x (v .|. k) Nothing-> update hs x k add n k xs k n= shiftL 1 n -- search for a set of ANDed keywords and return the list of URIs That contain all of them search xs = do n <- readMVar indexRef print n mbi<- mapM (HT.lookup hs)xs let bi = catMaybes mbi -- if a word has not been found then return empty results if length bi/= length mbi then return [] else do let r= foldr (.&.) (head bi) bi -- select te doc bits that have all the words --r has the bits of the objects that match print r getObjecURIs n 0 r [] where getObjecURIs :: Int->Int->Integer->[String]->IO [String] getObjecURIs 0 _ _ rs= return rs getObjecURIs n i r rs= case (testBit r 0) of True -> do urn <- HT.lookup intKey i getObjecURIs (n-1)(i+1) (shiftR r 1) (fromJust urn:rs) False-> getObjecURIs (n-1)(i+1) (shiftR r 1) rs

Of course, it is necessary to add persistence for the index database hs and the rest of the associated HashTables and so on.
There are two procedures addObject (to add a new object URI and his string content) to the index database. The other method is Search. It takes a list of words as arguments and return the list of documents that have all the words.

To feed addObject, I also started to define filters for special kind of objects: (text files, Haskell registers, XML/HTML files). The code below shows an early attemp to do so. Note the type signature of adObject.

I have defined a Class Indexable that has a function that convert the object into a list of strings:

--a is indexable if it can be serialized into a list of word strings class Indexable a where listIze:: a->[String] data Words= Words String instance Indexable Words where listIze (Words s)= str2List s (\c ->isAlphaNum c || c=='@') where str2List [] _ = [] str2List s f = h:r where (h,t) = myreads2 s f r = str2List t f myreads2 [] _=([],[]) myreads2 (c:rest) f | not(f c) = ([], rest) | otherwise = (c:r, skipspaces s f) where (r,s) = myreads2 rest f skipspaces [] _ = [] skipspaces (c:rest) f | not(f c) = skipspaces rest f | otherwise = c:rest data XMLText= XMLText String instance Indexable XMLText where listIze (XMLText s)= str2List s (\c ->isAlphaNum c || c=='@') where str2List "" _ = [] str2List ('<':rs) f = str2List (tail $ dropWhile(/='>') rs) f str2List s f = h:r where (h,t) = myreads2 s f r = str2List t f myreads2 [] _=([],[]) myreads2 (c:rest) f | not(f c) = ([], rest) | otherwise = (c:r, skipspaces s f) where (r,s) = myreads2 rest f skipspaces [] _ = [] skipspaces ('<':rs) f = skipspaces (tail $ dropWhile(/='>') rs) f skipspaces (c:rest) f | not(f c) = skipspaces rest f | otherwise = c:rest data Idx a= Idx a instance Show a => Indexable (Idx a) where listIze a = listIze $ Words $ show a instance Show a=> Show(Idx a) where show (Idx a)= show a

With this quick and dirty filter definitions for Text, XML/HTML and Haskell registers, we can do this:

data Register=Reg String Integer deriving Show main=do addObject "input1.txt" $ Words "John, 18" addObject "input2.html" $ XMLText "<html><body><p>John</p><body>18</body></html>" addObject "HaskelReg.hs" $ Idx $ Reg "John" 18 r<- search ["John","18"] print r

The source of Search.hs can be found here

See part 2 of the search engine

Note: full text Search is now a module in TCache (Data.TCache.IndexText) integrated in the query language (Data.TCache.IndexQuery) see:

http://hackage.haskell.org/package/TCache

Haskell and Web