NOTE: full text search is incorporated to TCache in the Data.TCache.IndexText module where you can find an example.
In
the first post of this Blog , I described what would be the
core or a search engine. Now it´s time to extend it a little more. Up to now,
the search return a list of URI strings (unique keys in the case of database
records, file-names in the case of files on a file-system, URL´s in
the case of Web pages.
But it is meaningless to present this list to the user. An excerpt of
each object, with the paragraphs that contains the keywords searched
for may be something very valuable. At the end, if I take a little
more effort and runtime resources to present something with meaning, it is
probable that the user will not repeat the search many times and browse many
files. This is something that, no doubt, would be far more
costly in every aspect.
So the purpose is to extract an excerpt from the list of URIs returned
by the core search procedure.
showExcerpts:: [String]-> String-> [(String,Int)] showExcerpts xs r= se xs text [] 0 [] where text= show r se:: [String]->String->String->Int->[(String,Int)]->[(String,Int)] se _ [] _ _ fr= fr se qs t@(x:xs) f hasKeys fr | (not.null) $ filter (==True)$ map ((flip isPrefixOf) t) qs= se qs xs (append f x) (hasKeys+1) fr --some word found | x `elem` ".,"= -- limit of fragment case hasKeys of 0 -> se qs xs [] 0 fr --the previous fragment was --void of keys _ -> se qs xs [] 0 ((f,hasKeys):fr) --added the old, --new fragment starts | otherwise= se qs xs (append f x) hasKeys fr append l c= l ++ [c] |
|
ShowExcerpts take a set of keywords, a Show-able something and return
the paragraphs that contains at least one of the
keywords and the number of keywords found in each paragraphs.
for example, In this test I use text from
Paul
Graham the genius :
main= print $ showExcerpts ["Daddy","model"] text where text= "When people care enough about something to do it well, those who do it best tend to be far better than everyone else. There's a huge gap between Leonardo and second-rate contemporaries like Borgognone. You see the same gap between Raymond Chandler and the average writer of detective novels. A top-ranked professional chess player could play ten thousand games against an ordinary club player without losing once.Like chess or painting or writing novels, making money is a very specialized skill. But for some reason we treat this skill differently. No one complains when a few people surpass all the rest at playing chess or writing novels, but when a few people make more money than the rest, we get editorials saying this is wrong.Why? The pattern of variation seems no different than for any other skill. What causes people to react so strongly when the skill is making money?I think there are three reasons we treat making money as different: the misleading model of wealth we learn as children; the disreputable way in which, till recently, most fortunes were accumulated; and the worry that great variations in income are somehow bad for society. As far as I can tell, the first is mistaken, the second outdated, and the third empirically false. Could it be that, in a modern democracy, variation in income is actually a sign of health?The Daddy Model of WealthBecause kids are unable to create wealth, whatever they have has to be given to them. And when wealth is something you're given, then of course it seems that it should be distributed equally. [2] As in most families it is. The kids see to that. \"Unfair,\" they cry, when one sibling gets more than another.In the real world, you can't keep living off your parents. If you want something, you either have to make it, or do something of equivalent value for someone else, in order to get them to give you enough money to buy it. In the real world, wealth is (except for a few specialists like thieves and speculators) something you have to create, not something that's distributed by Daddy. And since the ability and desire to create it vary from person to person, it's not made equally.It may seem unlikely in principle that one individual could really generate so much more wealth than another. The key to this mystery is to revisit that question, are they really worth 100 of us? Would a basketball team trade one of their players for 100 random people? What would Apple's next product look like if you replaced Steve Jobs with a committee of 100 random people? [6] These things don't scale linearly. Perhaps the CEO or the professional athlete has only ten times (whatever that means) the skill and determination of an ordinary person. But it makes all the difference that it's concentrated in one individual. When we say that one kind of work is overpaid and another underpaid, what are we really saying? In a free market, prices are determined by what buyers want. People like baseball more than poetry, so baseball players make more than poets. To say that a certain kind of work is |
the results are:
C:\Documents and Settings\Propietario\Escritorio\haskell>main
[(" Why else would this idea occur in this odd context? Whereas if the speaker were still operating on the Daddy Model",2),(" The appearance of the word unjust here is the unmistakable spectral signature of the Daddy Model",2),(" not something that's distributed by Daddy",1),(" variation in income is actually a sign of health?The Daddy Model of Wealth Because kids are unable to create wealth",2)]
Things to note:
-
The paragraphs are not sorted by relevance, but the relevance is calculated (the numbers attached to each paragraph).
-
There is no limit in the number of paragraphs to return
This is by design. It is a task of the consumer to decide either to
sort it, limit the number of paragraphs or whatever. That is so
because this algorithm do it all in one single pass, so that the
lazy nature of Haskell permit getExcerpts stop wasting resources when
the consumer decides that there are enough paragraphs extracted. Great!.
there are many many cases like this in web programming where
laziness pays a lot: HTML Pagination for example. This is something that
happens too in the presentation of search results.
Note that there are hard coded thinks, for example, I take
the dot and the comma as the only separators of paragraphs. By
the way, How the object can be obtained realistically ?. This
is a mook-up. it is certainly not general and hardly reusable.
For that matter I have enhanced
the Type class Indexable for retrieving, convert to
string and give the separators to split the text in words and paragraphs.
class Indexable a where listIze :: a -> [String] getFromURI:: String -> IO a --get the object as s String toString :: a-> String expSeparators :: [Char] wordSeparators:: [Char] |
The additions are not orthogonal with listIze, but at this moment, it is enough
So how looks like the integration of this stuff with the previous search engine described in the first post ?
-- return a list of URIs and excerpts for each document that match the search criteria searchResults keys= do let xs=listIze $ Words str uris<- search xs rs <- map getFromURI uris return $ zip uris (map (showExcerpts xs) rs) |
searchResults is, basically, a map to call showExcerpts
for each URI returned by search.
Again, there is no limit in the number of the results
returned. The lazy nature of Haskell permits me to give to the
consumer total freedom and responsibility,
without complicating the code with clumsy parameters and
buffering.
The caching of document content for extracting excerpts is not the search
engine responsibility up to now. Google do it, so it may
be a good idea to cache the content if possible.
The
transactional cache can be a good candidate for this role.
Specially when the objects indexed are being updated
transactionally and the index engine has to take care of the changes. I´m
working on it.