Thursday, November 16, 2006

Haskell Transactional Cache

NOTE: the latest version can be downloaded at: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/TCache

I have been impressed by Haskell since I started to learn it a year ago. I was looking for a language that could be used for genetic programming, and Haskell eases the creation of custom languages such as the one I needed.

Soon I was fascinated by the pure functional approach. I was also attracted by the concept of referential transparency, which makes life easier for code clarity and maintainability. The smart type system is also impressive in catching at compile time the common errors that plague other languages. For these reasons and many more I thought that Haskell has a reasonable chance of being the next Java. Concerning Web programming, despite the timid effort made up to now, surprisingly, Haskell is on the best track to win the race for the best Web engine. Haskell Server Pages is a quite good web framework with unique capabilities. HSP Client-side has the best approach to Ajax programming.

But Haskell still has many holes for rapid application development, something that is vital for scaling Haskell up from naive, fun-driven web projects to industrial acceptance. It lacks an MVC (Model View Controller) framework; Haskell has nothing like Rails for Ruby or Hibernate for Java.

One of the problems of Web development is the three-tier programming problem: how to keep the impedance between layers low when each layer has its own architectural requirements and is therefore designed around a different paradigm. By impedance I mean the additional problems introduced within each layer, at every stage of the software life-cycle, by the mismatch with the other layers.

Typically, the data layer is built around a database and a relational language. The business layer uses an object-oriented language, while the presentation layer uses HTML, JavaScript and other web technologies. Depending on the framework chosen, more often than not the middle layer ends up as a mess of relational, business and presentation code.

That happens even when the framework explicitly forbids it, unless clear, configurable interfaces are defined to isolate the layers. Using the natural data of each layer, and providing mapping mechanisms between the idiosyncrasies of the respective layers, avoids contaminating the middle layer. This keeps the impedance low not only at design and development time but also during test, integration and maintenance. Each layer's expert can do his work with little concern for the technologies, configurations, design decisions and changes in the other layers. Such mappings are exemplified, at the business-data boundary, by storage-independent data persistence mechanisms such as Hibernate for Java and Rails for Ruby. At the business-presentation boundary, the custom tags of JSP and ASP.NET play the same role in isolating the presentation from the logic of the business layer; they are also mapping mechanisms, and XSLT plays this role too.

On the other hand, I HATE designing and programming databases (database record and field access is so low-level that it can ruin the readability and elegance of any algorithm). I want to use my own defined objects, not pieces of information shaped to meet low-level data requirements. And I do not want to care about the storage on every single access to the data I need.

So I need to:


  1.  Abstract myself from the physical data structure and instead use the data definitions that fit the nature of the problem to solve.

  2.  Have transaction-aware data objects, because in the multi-threaded world of Internet servers two clients may want to buy the same item at the same time, and I want the stock of this item decreased twice, not once (provided there are at least two items in stock).

  3.  Have a way to define mappings between the objects I use and the physical storage, no matter whether it is a file system or a database.

  4.  The mechanism should be abstract, so that it works with any kind of data definition.

Once the database is just a persistence mechanism for a web application, transactionality cannot be provided by the database, because the updates are done on the cached objects. The data has to be transactional in memory. For this reason it is very important to have a general in-memory transaction mechanism.
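To illustrate why the transactions must happen in memory, here is a minimal, self-contained sketch (the `buySTM` helper is illustrative, not part of TCache) of an atomic stock decrement using STM: two simultaneous purchases can never both observe the same stock value and overwrite each other.

```haskell
import Control.Concurrent.STM
import Control.Monad (replicateM)

-- Hypothetical helper: decrement the stock only if there is stock left.
-- Run inside 'atomically', concurrent transactions cannot interleave.
buySTM :: TVar Int -> STM Bool
buySTM stockVar = do
  stock <- readTVar stockVar
  if stock > 0
    then writeTVar stockVar (stock - 1) >> return True
    else return False

main :: IO ()
main = do
  stockVar <- newTVarIO 2              -- two items in stock
  results  <- replicateM 3 (atomically (buySTM stockVar))
  remaining <- readTVarIO stockVar
  print (results, remaining)           -- ([True,True,False],0)
```

The third purchase fails cleanly instead of driving the stock negative, which is exactly the behaviour the buy/sell example below depends on.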

So I created TCache.hs (download), a transactional cache that abstracts the user from the physical data and the transactional plumbing. The data access uses very simple semantics: to access a data object I use the data object itself, filled with just the fields necessary to compose a unique key.

The transactional cache takes care of transactions, of synchronization with the physical data (one way, from the cache to the physical storage, up to now) and of keeping the cache size bounded by dropping the least recently accessed data.

The transactions use STM (Software Transactional Memory), so no locking is involved and the resulting application performs transactions faster. For this reason the module only compiles under GHC.

The cache is implemented on a hash table (Data.HashTable), so that many threads may be running transactions at the same time.
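The idea can be sketched in a few lines. This is not TCache's real implementation (which uses Data.HashTable); it is a simplified, assumed model of a keyed cache living inside a transactional variable, where lookups and insertions compose into atomic transactions.

```haskell
import Control.Concurrent.STM
import qualified Data.Map as Map

-- A toy transactional cache: a Map of keys to values inside a TVar.
type ToyCache v = TVar (Map.Map String v)

cacheInsert :: ToyCache v -> String -> v -> STM ()
cacheInsert c k v = readTVar c >>= writeTVar c . Map.insert k v

cacheLookup :: ToyCache v -> String -> STM (Maybe v)
cacheLookup c k = fmap (Map.lookup k) (readTVar c)

main :: IO ()
main = do
  c <- newTVarIO Map.empty
  -- insert and lookup run as STM actions, so they can be combined
  -- with other reads and writes into one atomic transaction
  atomically $ cacheInsert c "U12345" "John"
  r <- atomically $ cacheLookup c "U12345"
  print r   -- Just "John"
```

Because every operation is an `STM` action, a whole business operation (read several objects, update them, write them back) composes into one `atomically` block, which is what `withResources` exploits below.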

Below is a usage example:



module Main where
-------------------------------------------------
-- An example of Transactional cache usage (TCache.hs)
-- (Something like the Java Hibernate)
-- Author: Alberto Gómez Corona Nov 2006
-- Language: Haskell
-- Terms of use: you can do whatever you want
-- with this code as long as you keep this notice
------------------------------------------------

import TCache

import Control.Concurrent
import Control.Monad (replicateM_)
import System.Directory



--1 and 4: The data elements to be used in the example: a user will repeatedly buy items.

data  Data=   User{uname::String, uid::String, spent:: Int} |
              Item{iname::String, iid::String, price::Int, stock::Int}

              deriving (Read, Show)

--3 The mappings between the cache and the physical storage are defined by the IResource interface:

--      to extract the resource's unique key,
--      to read the resource from the physical storage,
--      to store it, and
--      to delete the resource from the physical storage.


instance IResource Data where
        keyResource User{uid=id} = id
        keyResource Item{iid=id} = id
        readResource  e = do s <- readFile $ keyResource e
                             return $ Just $ read s
        writeResource e = writeFile (keyResource e) $ show e
        delResource   e = removeFile $ keyResource e

-- buy is the operation to be performed in the example


--4 withResources gets a partial definition of each resource, enough to extract its key, fills in the rest of each data structure (if found) and passes a list of Maybe Data to buyIt. buyIt is part of the domain problem: it receives this list and generates a new list of data objects, which are updated in the cache. buyIt is executed atomically.


user `buy` item= withResources [user,item] buyIt
  where
    buyIt [Just us, Just it]
       | stock it > 0 = [us',it'] `debug1` "john buy a PC"
       | otherwise    = error "stock is empty for this product"
       where
         us'= us{spent= spent us + price it}
         it'= it{stock= stock it - 1}
    buyIt [_,_] = error "either the user or the item does not exist"

main= do
        -- create the resources (accesses no resources and returns the two new Data objects defined in items)
        withResources [] items

        -- 11 PCs are bought on John's account in parallel, to show transactionality.
        -- Because there are only 10 PCs in stock, the last thread must return an error.
        replicateM_ 11 $ forkIO $ User{uid="U12345"} `buy` Item{iid="I54321"}

        -- wait 5 seconds
        threadDelay 5000000

        -- write the cache content to the persistent store (invokes writeResource for each resource).
        -- In a real application clearSyncCache can be used instead, to adjust the size
        -- and write the cache periodically.
        syncCache (refcache :: Cache Data)

        -- the files have now been created. The files U12345 and I54321 must contain
        -- the result of the 11 iterations.
  where
        items _=
              [User "John" "U12345" 0
              ,Item "PC" "I54321" 6000 10
              ]

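The IResource instance above persists each object in a file named after its key, relying on the derived Read and Show instances for serialization. A self-contained sketch of that roundtrip, with a simplified record standing in for the example's Data type:

```haskell
-- Simplified stand-in for the User constructor of the example's Data type
data Person = Person { pname :: String, pid :: String }
              deriving (Read, Show, Eq)

main :: IO ()
main = do
  let p = Person "John" "U12345"
  -- writeResource: the key names the file, 'show' serializes the value
  writeFile (pid p) (show p)
  -- readResource: 'read' recovers the original value
  s <- readFile (pid p)
  print (read s == p)   -- True
```

This is why the instance only needs `deriving (Read, Show)` on the data type: the storage format is just the Haskell textual representation.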

Here follows a typical output. I printed a trace with the text "atomic try" as the first statement of the atomic transaction. You can see how, while the first transaction runs, the rest keep trying, and some of them retry. That is so because all the updates go to the same registers. Overall, STM is faster than locking and far easier: my first version used locking and the code was far more complicated.



"atomic try"
"atomic try"
john buy a PC
"atomic try""atomic try""atomic try""atomic try""atomic try""atomic try""atomic
try""atomic try""atomic try""atomic try"
john buy a PC

john buy a PC

john buy a PC


john buy a PC
john buy a PC



john buy a PC
john buy a PC


john buy a PC


john buy a PC
john buy a PC

"atomic try""atomic try"
john buy a PC

john buy a PC
"atomic try"
john buy a PC"atomic try"
main: stock is empty for this product



The final content of the files is:



U12345:  User {uname = "John", uid = "U12345", spent = 60000}

I54321:  Item {iname = "PC", iid = "I54321", price = 6000, stock = 0}



The amount spent by John is the sum of the prices of the ten PCs in stock, and the stock is 0, so the transactions have been managed correctly.

There are other operations: getResources, which simply returns the requested data without performing any transaction; clearSyncCacheProc, which starts the cache maintenance and synchronization process according to the parameters passed (time between checks, cache size and a filtering criterion); and getResourcesID, which permits not only updating but also deleting cache elements in the course of a transaction.

The module is far from finished. The synchronization mechanism must be refined, for example with cache-invalidation mechanisms. I welcome anyone interested in improving it.



NOTES:


I just realized that my transactional cache has been used in this excellent paper: Dissecting Transactional Executions in Haskell.


The test case used in the study is more or less the same example described above. Since all the parallel threads update the same two objects all the time, this is the worst case. TCache uses in-memory non-blocking transactions. A transaction does not block; instead it tries to perform its task and rolls back at the very end if things have been changed by another thread in the meantime, just like databases do. The bad thing is that the more CPU cores execute the example, the more work ends up being rolled back.

Fortunately this is not the case in real life. TCache and the STM transactions are designed under the assumption that the application manages many objects and that the threads update different objects most of the time. A user does not buy the same product eleven times simultaneously. And non-blocking transactions scale better than blocking ones in real-life scenarios, precisely because they don't block.
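Even in the worst case, the optimistic approach stays correct: losing transactions are retried, never lost. A minimal sketch (plain STM, not TCache itself) with many threads hammering a single shared variable, as in the benchmark above:

```haskell
import Control.Concurrent
import Control.Concurrent.STM
import Control.Monad (replicateM_)

main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)   -- the contended shared object
  done    <- newTVarIO (0 :: Int)   -- how many transactions committed
  -- 100 threads all update the same TVar: maximal contention,
  -- so many transactions will be rolled back and retried
  replicateM_ 100 $ forkIO $ atomically $ do
    modifyTVar' counter (+1)
    modifyTVar' done    (+1)
  -- 'check' blocks (retries) until all 100 transactions have committed
  atomically $ readTVar done >>= check . (== 100)
  readTVarIO counter >>= print    -- 100
```

Despite the contention, the final count is exact; the price paid is the retried work, which is why the paper observes more rollbacks as the core count grows.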