Case Study: Sphinx-based Search
Sphinx is a search server, and powers the search feature on many sites, including Yesod's own site. While the actual code necessary to integrate Yesod with Sphinx is relatively short, it touches on a number of complicated topics, and is therefore a great case study in how to play with some of the under-the-surface details of Yesod.
There are essentially three different pieces at play here:
- Storing the content we wish to search. This is fairly straightforward Persistent code, and we won't dwell on it much in this chapter.
- Accessing Sphinx search results from inside Yesod. Thanks to the sphinx package, this is actually very easy.
- Providing the document content to Sphinx. This is where the interesting stuff happens, and will show how to stream content from the database directly into XML, which then gets sent over the wire to the client.
Sphinx Setup
Unlike many of our other examples, to start with here we'll need to actually configure and run our external Sphinx server. I'm not going to go into all the details of Sphinx, partly because it's not relevant to our point here, and mostly because I'm not an expert on Sphinx.
Sphinx provides three main command line utilities:

- searchd is the actual search daemon. It receives requests from the client (in this case, our web app) and returns the search results.
- indexer parses the set of documents and creates the search index.
- search is a debugging utility that runs simple queries against Sphinx, as shown below.
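To make this concrete, a typical session with these utilities might look as follows. This is a sketch only: exact flags vary between Sphinx versions, the -i flag selects the index to query, and "liberty" is just a hypothetical search term.

indexer searcher            # (re)build the index named "searcher"
search -i searcher liberty  # debugging: query the index from the command line
searchd                     # start the daemon so clients (our web app) can connect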
There are two important settings: the source and the index. The source tells Sphinx where to read document information from. It has direct support for MySQL and PostgreSQL, as well as a more general XML format known as xmlpipe2. We're going to use the last one. This not only will give us more flexibility with choosing Persistent backends, but will also demonstrate some more powerful Yesod concepts.
The second setting is the index. Sphinx can handle multiple indices simultaneously, which allows it to provide search for multiple services at once. Each index will have a source it pulls from.
In our case, we're going to provide a URL from our application (/search/xmlpipe) that provides the XML file required by Sphinx, and then pipe that through to the indexer. So we'll add the following to our Sphinx config file:
source searcher_src
{
    type = xmlpipe2
    xmlpipe_command = curl http://localhost:3000/search/xmlpipe
}

index searcher
{
    source = searcher_src
    path = /var/data/searcher
    docinfo = extern
    charset_type = utf-8
}
In order to build your search index, you would run indexer searcher. Obviously this won't work until your web app is running. For a production site, it would make sense to run this command via a crontab script so the index is regularly updated.
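For example, a crontab entry along these lines would rebuild the index every night at 3am (a hypothetical schedule; the --rotate flag tells indexer to build the new index alongside a running searchd and swap it in when done):

0 3 * * * indexer --rotate searcher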
Basic Yesod Setup
Let's get our basic Yesod setup going. We're going to have a single table in the database for holding documents, which consist of a title and content. We'll store this in a SQLite database, and provide routes for searching, adding documents, viewing documents and providing the xmlpipe file to Sphinx.
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persist|
Doc
    title Text
    content Textarea
|]

data Searcher = Searcher ConnectionPool

mkYesod "Searcher" [parseRoutes|
/ RootR GET
/doc/#DocId DocR GET
/add-doc AddDocR POST
/search SearchR GET
/search/xmlpipe XmlpipeR GET
|]

instance Yesod Searcher

instance YesodPersist Searcher where
    type YesodPersistBackend Searcher = SqlPersist

    runDB action = do
        Searcher pool <- getYesod
        runSqlPool action pool

instance RenderMessage Searcher FormMessage where
    renderMessage _ _ = defaultFormMessage
Hopefully all of this looks pretty familiar by now. Next we'll define some forms: one for creating documents, and one for searching:
addDocForm :: Html -> MForm Searcher Searcher (FormResult Doc, Widget)
addDocForm = renderTable $ Doc
    <$> areq textField "Title" Nothing
    <*> areq textareaField "Contents" Nothing

searchForm :: Html -> MForm Searcher Searcher (FormResult Text, Widget)
searchForm = renderDivs $ areq (searchField True) "Query" Nothing
The True parameter to searchField makes the field auto-focus on page load; concretely, it renders the input with the HTML5 autofocus attribute.
Finally, we have some standard handlers for the homepage (which shows the add-document form and the search form), displaying a document, and adding a document.
getRootR :: Handler RepHtml
getRootR = do
    docCount <- runDB $ count ([] :: [Filter Doc])
    ((_, docWidget), _) <- runFormPost addDocForm
    ((_, searchWidget), _) <- runFormGet searchForm
    let docs = if docCount == 1
                then "There is currently 1 document."
                else "There are currently " ++ show docCount ++ " documents."
    defaultLayout
        [whamlet|
<p>Welcome to the search application. #{docs}
<form method=post action=@{AddDocR}>
    <table>
        ^{docWidget}
        <tr>
            <td colspan=3>
                <input type=submit value="Add document">
<form method=get action=@{SearchR}>
    ^{searchWidget}
    <input type=submit value=Search>
|]

postAddDocR :: Handler RepHtml
postAddDocR = do
    ((res, docWidget), _) <- runFormPost addDocForm
    case res of
        FormSuccess doc -> do
            docid <- runDB $ insert doc
            setMessage "Document added"
            redirect $ DocR docid
        _ -> defaultLayout
            [whamlet|
<form method=post action=@{AddDocR}>
    <table>
        ^{docWidget}
        <tr>
            <td colspan=3>
                <input type=submit value="Add document">
|]

getDocR :: DocId -> Handler RepHtml
getDocR docid = do
    doc <- runDB $ get404 docid
    defaultLayout
        [whamlet|
<h1>#{docTitle doc}
<div .content>#{docContent doc}
|]
Searching
Now that we've got the boring stuff out of the way, let's jump into the actual searching. We're going to need three pieces of information for displaying a result: the document ID it comes from, the title of that document, and the excerpts. Excerpts are the highlighted portions of the document which contain the search term.
So let's start off by defining a Result datatype:
data Result = Result
    { resultId :: DocId
    , resultTitle :: Text
    , resultExcerpt :: Html
    }
Next we'll look at the search handler:
getSearchR :: Handler RepHtml
getSearchR = do
    ((formRes, searchWidget), _) <- runFormGet searchForm
    searchResults <-
        case formRes of
            FormSuccess qstring -> getResults qstring
            _ -> return []
    defaultLayout $ do
        addLucius
            [lucius|
.excerpt {
    color: green;
    font-style: italic
}
.match {
    background-color: yellow;
}
|]
        [whamlet|
<form method=get action=@{SearchR}>
    ^{searchWidget}
    <input type=submit value=Search>
$if not $ null searchResults
    <h1>Results
    $forall result <- searchResults
        <div .result>
            <a href=@{DocR $ resultId result}>#{resultTitle result}
            <div .excerpt>#{resultExcerpt result}
|]
Nothing magical here; we're just relying on the searchForm defined above and on the getResults function, which hasn't been defined yet. getResults takes a search string and returns a list of results. This is where we first interact with the Sphinx API. We'll be using two functions: query returns a list of matches, and buildExcerpts returns the highlighted excerpts. Let's first look at query:
getResults :: Text -> Handler [Result]
getResults qstring = do
    sphinxRes' <- liftIO $ S.query config "searcher" (unpack qstring)
    case sphinxRes' of
        ST.Ok sphinxRes -> do
            let docids = map (Key . PersistInt64 . ST.documentId)
                       $ ST.matches sphinxRes
            fmap catMaybes $ runDB $ forM docids $ \docid -> do
                mdoc <- get docid
                case mdoc of
                    Nothing -> return Nothing
                    Just doc -> liftIO $ Just <$> getResult docid doc qstring
        _ -> error $ show sphinxRes'
  where
    config = S.defaultConfig
        { S.port = 9312
        , S.mode = ST.Any
        }
query takes three parameters: the configuration options, the index to search against (searcher in this case) and the search string. It returns a list of document IDs that contain the search string. The tricky bit here is that those documents are returned as Int64 values, whereas we need DocIds. We're taking advantage of the fact that the SQL Persistent backends use a PersistInt64 constructor for their IDs, and simply wrap up the values appropriately.

We then loop over the resulting IDs to get a [Maybe Result] value, and use catMaybes to turn it into a [Result]. In the where clause, we define our local settings, which override the default port and set up the search to match a document when any term appears in it.
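If it helps to see that ID conversion in isolation, it amounts to nothing more than the following (toDocId is a hypothetical helper for illustration, not something the application defines):

import Data.Int (Int64)

-- Key and PersistInt64 come from the Persistent modules we already import;
-- DocId comes from our persist block above.
toDocId :: Int64 -> DocId
toDocId = Key . PersistInt64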
Let's finally look at the getResult function:
getResult :: DocId -> Doc -> Text -> IO Result
getResult docid doc qstring = do
    excerpt' <- S.buildExcerpts
        excerptConfig
        [T.unpack $ escape $ docContent doc]
        "searcher"
        (unpack qstring)
    let excerpt =
            case excerpt' of
                ST.Ok bss -> preEscapedLazyText $ decodeUtf8With ignore $ L.concat bss
                _ -> return ()
    return Result
        { resultId = docid
        , resultTitle = docTitle doc
        , resultExcerpt = excerpt
        }
  where
    excerptConfig = E.altConfig { E.port = 9312 }

escape :: Textarea -> Text
escape =
    T.concatMap escapeChar . unTextarea
  where
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '&' = "&amp;"
    escapeChar c = T.singleton c
buildExcerpts takes four parameters: the configuration options, the textual contents of the document, the search index and the search term. The interesting bit is that we entity-escape the text content. Sphinx won't automatically escape these characters for us, so we must do it explicitly.
Similarly, the result from Sphinx is a list of lazy ByteStrings. But of course, we'd rather have Html. So we concat that list into a single lazy ByteString, decode it to a lazy text (ignoring invalid UTF-8 character sequences), and use preEscapedLazyText to make sure that the tags inserted for matches are not escaped. A sample of this HTML is:
… Departments. The President shall have <span class='match'>Power</span> to fill up all Vacancies …
… people. Amendment 11 The Judicial <span class='match'>power</span> of the United States shall …
… jurisdiction. 2. Congress shall have <span class='match'>power</span> to enforce this article by …
… 5. The Congress shall have <span class='match'>power</span> to enforce, by appropriate legislation …
Streaming xmlpipe output
We've saved the best for last. For the majority of Yesod handlers, the recommended approach is to load up the database results into memory and then produce the output document based on that. It's simpler to work with, but more importantly it's more resilient to exceptions. If there's a problem loading the data from the database, the user will get a proper 500 response code.
However, generating the xmlpipe output is a perfect example of the alternative. There are potentially a huge number of documents (the yesodweb.com code handles tens of thousands of these), and documents could easily be several hundred kilobytes. If we take a non-streaming approach, this can lead to huge memory usage and slow response times.
So how exactly do we create a streaming response? As we cover in the WAI chapter, WAI provides a ResponseSource constructor that takes a stream of blaze-builder Builders. From the Yesod side, we can bypass the normal Yesod response procedure and send a WAI response directly using the sendWaiResponse function. That gives us two of the pieces of this puzzle.
Now we know we want to create a stream of Builders from some XML content. Fortunately, the xml-conduit package provides this interface directly. xml-conduit provides some high-level interfaces for dealing with documents as a whole, but in our case we'll need to use the low-level Event interface to keep memory usage to a minimum. So the function we're interested in is:
renderBuilder :: Resource m => RenderSettings -> Conduit Event m Builder
In plain English, that means renderBuilder takes some settings (we'll just use the defaults) and converts a stream of Events into a stream of Builders. This is looking pretty good; all we need now is a stream of Events.
Speaking of which, what should our XML document actually look like? It's pretty simple: we have a sphinx:docset root element, a sphinx:schema element containing a single sphinx:field (which defines the content field), and then a sphinx:document for each document in our database. That last element will have an id attribute and a child content element.
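Concretely, the stream we produce should render to something like the following (a hand-written sketch with one hypothetical document, not actual program output):

<?xml version="1.0" encoding="UTF-8"?>
<sphinx:docset xmlns:sphinx="http://sphinxsearch.com/">
    <sphinx:schema>
        <sphinx:field name="content"/>
    </sphinx:schema>
    <sphinx:document id="1">
        <content>Some document content.</content>
    </sphinx:document>
</sphinx:docset>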
Every document is going to start off with the same events (start the docset, start the schema, etc) and end with the same event (end the docset). We'll start off by defining those:
toName :: Text -> X.Name
toName x = X.Name x (Just "http://sphinxsearch.com/") (Just "sphinx")

docset, schema, field, document, content :: X.Name
docset = toName "docset"
schema = toName "schema"
field = toName "field"
document = toName "document"
content = "content" -- no prefix

startEvents, endEvents :: [X.Event]
startEvents =
    [ X.EventBeginDocument
    , X.EventBeginElement docset []
    , X.EventBeginElement schema []
    , X.EventBeginElement field [("name", [X.ContentText "content"])]
    , X.EventEndElement field
    , X.EventEndElement schema
    ]

endEvents =
    [ X.EventEndElement docset
    ]
Now that we have the shell of our document, we need to produce the Events for each individual document. This is actually a fairly simple function:
entityToEvents :: (Entity Doc) -> [X.Event]
entityToEvents (Entity docid doc) =
    [ X.EventBeginElement document [("id", [X.ContentText $ toPathPiece docid])]
    , X.EventBeginElement content []
    , X.EventContent $ X.ContentText $ unTextarea $ docContent doc
    , X.EventEndElement content
    , X.EventEndElement document
    ]
We start the document element with an id attribute, start the content element, insert the content, and then close both elements. We use toPathPiece to convert a DocId into a Text value. Next, we need to be able to convert a stream of these entities into a stream of events. For this, we can use the built-in concatMap function from Data.Conduit.List: CL.concatMap entityToEvents.
But what we really want is to stream those events directly from the database. For most of this book, we've used the selectList function, but Persistent also provides the (more powerful) selectSourceConn function. So we end up with this function:
docSource :: Connection -> C.Source IO X.Event
docSource conn = selectSourceConn conn [] [] C.$= CL.concatMap entityToEvents
The $= operator joins together a source and a conduit into a new source. Now that we have our Event source, all we need to do is surround it with the document start and end events. Thanks to Source's Monoid instance, this is a piece of cake:
fullDocSource :: Connection -> C.Source IO X.Event
fullDocSource conn = mconcat
    [ CL.sourceList startEvents
    , docSource conn
    , CL.sourceList endEvents
    ]
We're almost there; now we just need to tie it all together in getXmlpipeR. We need a database connection to stream from. Normally, connections are checked out and returned automatically via the runDB function. In our case, we want to take a connection and keep it available until the response body has been completely sent. To do this, we use the takeResource function, which registers a cleanup action with the ResourceT monad.
By default, a resource will not be returned to the pool. This has to do with proper exception handling, but is not relevant for our use case. Therefore, we need to force the connection to be returned to the pool.
getXmlpipeR :: Handler RepXml
getXmlpipeR = do
    Searcher pool <- getYesod
    let headers = [("Content-Type", "text/xml")]
    managedConn <- lift $ takeResource pool
    let conn = mrValue managedConn
    lift $ mrReuse managedConn True
    let source = fullDocSource conn C.$= renderBuilder def
        flushSource = fmap C.Chunk source
    sendWaiResponse $ ResponseSource status200 headers flushSource
We get our connection pool from the foundation value, then send a WAI response. We use the ResponseSource constructor, and provide it the status code, response headers, and body. Note that ResponseSource expects a stream of Flush Builder values rather than plain Builders, which is why we wrap each Builder with the Chunk constructor.
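For the curious, Flush (re-exported by Data.Conduit, hence the C.Chunk above) is just a small two-constructor datatype that lets a producer signal when the server should flush its output buffer:

data Flush a = Chunk a | Flush

Since we never need to force a flush mid-stream, wrapping every Builder in Chunk is all we need.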
Full code
{-# LANGUAGE OverloadedStrings, TypeFamilies, TemplateHaskell, QuasiQuotes,
             MultiParamTypeClasses, GADTs, FlexibleContexts #-}
import Yesod
import Data.Text (Text, unpack)
import Control.Applicative ((<$>), (<*>))
import Database.Persist.Sqlite
import Database.Persist.Query.GenericSql (selectSourceConn)
import Database.Persist.Store (PersistValue (PersistInt64))
import qualified Text.Search.Sphinx as S
import qualified Text.Search.Sphinx.Types as ST
import qualified Text.Search.Sphinx.ExcerptConfiguration as E
import qualified Data.ByteString.Lazy as L
import Data.Text.Lazy.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (ignore)
import Data.Maybe (catMaybes)
import Control.Monad (forM)
import qualified Data.Text as T
import Text.Blaze (preEscapedLazyText)
import qualified Data.Conduit as C
import qualified Data.Conduit.List as CL
import qualified Data.XML.Types as X
import Network.Wai (Response (ResponseSource))
import Network.HTTP.Types (status200)
import Text.XML.Stream.Render (renderBuilder, def)
import Data.Monoid (mconcat)
import Data.Conduit.Pool (takeResource, mrValue, mrReuse)

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persist|
Doc
    title Text
    content Textarea
|]

data Searcher = Searcher ConnectionPool

mkYesod "Searcher" [parseRoutes|
/ RootR GET
/doc/#DocId DocR GET
/add-doc AddDocR POST
/search SearchR GET
/search/xmlpipe XmlpipeR GET
|]

instance Yesod Searcher

instance YesodPersist Searcher where
    type YesodPersistBackend Searcher = SqlPersist

    runDB action = do
        Searcher pool <- getYesod
        runSqlPool action pool

instance RenderMessage Searcher FormMessage where
    renderMessage _ _ = defaultFormMessage

addDocForm :: Html -> MForm Searcher Searcher (FormResult Doc, Widget)
addDocForm = renderTable $ Doc
    <$> areq textField "Title" Nothing
    <*> areq textareaField "Contents" Nothing

searchForm :: Html -> MForm Searcher Searcher (FormResult Text, Widget)
searchForm = renderDivs $ areq (searchField True) "Query" Nothing

getRootR :: Handler RepHtml
getRootR = do
    docCount <- runDB $ count ([] :: [Filter Doc])
    ((_, docWidget), _) <- runFormPost addDocForm
    ((_, searchWidget), _) <- runFormGet searchForm
    let docs = if docCount == 1
                then "There is currently 1 document."
                else "There are currently " ++ show docCount ++ " documents."
    defaultLayout
        [whamlet|
<p>Welcome to the search application. #{docs}
<form method=post action=@{AddDocR}>
    <table>
        ^{docWidget}
        <tr>
            <td colspan=3>
                <input type=submit value="Add document">
<form method=get action=@{SearchR}>
    ^{searchWidget}
    <input type=submit value=Search>
|]

postAddDocR :: Handler RepHtml
postAddDocR = do
    ((res, docWidget), _) <- runFormPost addDocForm
    case res of
        FormSuccess doc -> do
            docid <- runDB $ insert doc
            setMessage "Document added"
            redirect $ DocR docid
        _ -> defaultLayout
            [whamlet|
<form method=post action=@{AddDocR}>
    <table>
        ^{docWidget}
        <tr>
            <td colspan=3>
                <input type=submit value="Add document">
|]

getDocR :: DocId -> Handler RepHtml
getDocR docid = do
    doc <- runDB $ get404 docid
    defaultLayout
        [whamlet|
<h1>#{docTitle doc}
<div .content>#{docContent doc}
|]

data Result = Result
    { resultId :: DocId
    , resultTitle :: Text
    , resultExcerpt :: Html
    }

getResult :: DocId -> Doc -> Text -> IO Result
getResult docid doc qstring = do
    excerpt' <- S.buildExcerpts
        excerptConfig
        [T.unpack $ escape $ docContent doc]
        "searcher"
        (unpack qstring)
    let excerpt =
            case excerpt' of
                ST.Ok bss -> preEscapedLazyText $ decodeUtf8With ignore $ L.concat bss
                _ -> return ()
    return Result
        { resultId = docid
        , resultTitle = docTitle doc
        , resultExcerpt = excerpt
        }
  where
    excerptConfig = E.altConfig { E.port = 9312 }

escape :: Textarea -> Text
escape =
    T.concatMap escapeChar . unTextarea
  where
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '&' = "&amp;"
    escapeChar c = T.singleton c

getResults :: Text -> Handler [Result]
getResults qstring = do
    sphinxRes' <- liftIO $ S.query config "searcher" (unpack qstring)
    case sphinxRes' of
        ST.Ok sphinxRes -> do
            let docids = map (Key . PersistInt64 . ST.documentId)
                       $ ST.matches sphinxRes
            fmap catMaybes $ runDB $ forM docids $ \docid -> do
                mdoc <- get docid
                case mdoc of
                    Nothing -> return Nothing
                    Just doc -> liftIO $ Just <$> getResult docid doc qstring
        _ -> error $ show sphinxRes'
  where
    config = S.defaultConfig
        { S.port = 9312
        , S.mode = ST.Any
        }

getSearchR :: Handler RepHtml
getSearchR = do
    ((formRes, searchWidget), _) <- runFormGet searchForm
    searchResults <-
        case formRes of
            FormSuccess qstring -> getResults qstring
            _ -> return []
    defaultLayout $ do
        addLucius
            [lucius|
.excerpt {
    color: green;
    font-style: italic
}
.match {
    background-color: yellow;
}
|]
        [whamlet|
<form method=get action=@{SearchR}>
    ^{searchWidget}
    <input type=submit value=Search>
$if not $ null searchResults
    <h1>Results
    $forall result <- searchResults
        <div .result>
            <a href=@{DocR $ resultId result}>#{resultTitle result}
            <div .excerpt>#{resultExcerpt result}
|]

getXmlpipeR :: Handler RepXml
getXmlpipeR = do
    Searcher pool <- getYesod
    let headers = [("Content-Type", "text/xml")]
    managedConn <- lift $ takeResource pool
    let conn = mrValue managedConn
    lift $ mrReuse managedConn True
    let source = fullDocSource conn C.$= renderBuilder def
        flushSource = fmap C.Chunk source
    sendWaiResponse $ ResponseSource status200 headers flushSource

entityToEvents :: (Entity Doc) -> [X.Event]
entityToEvents (Entity docid doc) =
    [ X.EventBeginElement document [("id", [X.ContentText $ toPathPiece docid])]
    , X.EventBeginElement content []
    , X.EventContent $ X.ContentText $ unTextarea $ docContent doc
    , X.EventEndElement content
    , X.EventEndElement document
    ]

fullDocSource :: Connection -> C.Source IO X.Event
fullDocSource conn = mconcat
    [ CL.sourceList startEvents
    , docSource conn
    , CL.sourceList endEvents
    ]

docSource :: Connection -> C.Source IO X.Event
docSource conn = selectSourceConn conn [] [] C.$= CL.concatMap entityToEvents

toName :: Text -> X.Name
toName x = X.Name x (Just "http://sphinxsearch.com/") (Just "sphinx")

docset, schema, field, document, content :: X.Name
docset = toName "docset"
schema = toName "schema"
field = toName "field"
document = toName "document"
content = "content" -- no prefix

startEvents, endEvents :: [X.Event]
startEvents =
    [ X.EventBeginDocument
    , X.EventBeginElement docset []
    , X.EventBeginElement schema []
    , X.EventBeginElement field [("name", [X.ContentText "content"])]
    , X.EventEndElement field
    , X.EventEndElement schema
    ]

endEvents =
    [ X.EventEndElement docset
    ]

main :: IO ()
main = withSqlitePool "searcher.db3" 10 $ \pool -> do
    runSqlPool (runMigration migrateAll) pool
    warpDebug 3000 $ Searcher pool