There's more than one way to skin a cat, and certainly more than one way to
write code. The various options can sometimes be confusing. And in the case of
the conduit
library, there are also some routes that you shouldn't take.
You'll see what I mean through the examples.
For the most part, using existing Source
s, Sink
s, and Conduit
s is
straight-forward. The problem comes from writing them in the first place. Let's
take a simple example: we want a Source
that will enumerate the Int
s 1 to
1000. For testing purposes, we'll connect it to a Sink
that sums up all of
its input. I came up with six different ways to write the Source
, though two
of those are using functions I haven't yet released.
import Criterion.Main
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.List
import Data.Functor.Identity (runIdentity)
sourceList, unfold, enumft, yielder, raw, state
:: Monad m
=> Int -- ^ stop
-> Source m Int
sourceList stop = CL.sourceList [1..stop]
unfold stop =
CL.unfold f 1
where
f i
| i > stop = Nothing
| otherwise = Just (i, i + 1)
enumft stop = CL.enumFromTo 1 stop
yielder stop =
go 1
where
go i
| i > stop = return ()
| otherwise = do
yield i
go $ i + 1
raw stop =
go 1
where
go i
| i > stop = Done Nothing ()
| otherwise = HaveOutput (go $ i + 1) (return ()) i
state stop =
sourceState 1 pull
where
pull i
| i > stop = return StateClosed
| otherwise = return $ StateOpen (i + 1) i
main :: IO ()
main = do
mapM_ test sources
defaultMain $ map bench' sources
where
sink :: Monad m => Sink Int m Int
sink = CL.fold (+) 0
bench' (name, source) = bench name $ whnf (\i -> runIdentity $ source i $$ sink) 1000
sources =
[ ("sourceList", sourceList)
, ("unfold", unfold)
, ("enumFromTo", enumft)
, ("yield", yielder)
, ("raw", raw)
, ("sourceState", state)
]
test (name, source) = do
let res = runIdentity $ source 1000 $$ sink
putStrLn $ name ++ ": " ++ show res
sourceList
is probably the approach most of us- myself included- would
actually use in real life. It let's us take advantage of all of the
list-processing functions and special syntax that Haskell already provides.
unfold
and enumFromTo
are both new functions for 0.4.2 (in fact, I wrote
them for the purpose of this comparison). They correspond very closely to their
Data.List
and Prelude
counterparts.
yield
is a new option we have starting with conduit
0.4. Due to the unified
datatypes, Source
has inherited a Monad
instance. This allows us to fairly
easily compose together different Source
s, and the yield
function provides
the simplest of all Source
s. In previous versions of conduit
, we could have
used Source
's Monoid
instance instead of do
-notation.
raw
goes directly against the datatypes. I find it interesting that the raw
version isn't really much more complicated than yield
or sourceState
,
though you do have to understand some of the extra fields on the constructors.
Finally, we use sourceState
. This is one of the oldest approaches, since this
function has been available since the first release of conduit
. I think that
this function would compile and run perfectly on conduit 0.0.
The Criterion benchmarks are very informative. Thanks to Bryan's cool new report, let's look at the graph:
unfold
, enumFromTo
, and raw
all perform equally well. sourceList
comes
in behind them: the need to allocate the extra list is the culprit. Behind that
is yield
. To see why, look at the difference between yielder
and raw
.
They're structure almost identically. For the i > stop
case, we have return
()
versus Done Nothing ()
. But in reality, those are the same thing!
return
is defined as Done Nothing
.
The performance gap comes from the otherwise
branch. If we fully expand the
do
-notation, we end up with:
yield i >>= (go $ i + 1)
==> HaveOutput (Done Nothing ()) (return ()) i >> (go $ i + 1)
==> HaveOutput (Done Nothing () >> (go $ i + 1)) (return ()) i
==> HaveOutput (go $ i + 1) (return ()) i
Which is precisely what raw
says. However, without adding aggressive inlining
to conduit
, most of this transformation will occur at runtime, not compile
time. Still, the performance gap is relatively minor, and in most real-world
applications should be dwarfed by the actual computations being performed, so I
think the yield
approach definitely has merit.
What might be shocking is the abysmal performance of sourceState
. It's a full
8 times slower than raw
! There are two major contributing factors here:
- Each step goes through a monadic bind. This is necessitated by the API of
sourceState
. - We have to unwrap the
SourceStateResult
type.
sourceState
was great when it first came out. When conduit
's internals were
ugly and based on mutable variables, it provided a clean, simple approach to
creating Source
s. However, conduit
has moved on: the internals are pure and
easy to work with and we have alternatives like yield
for high-level stuff.
And performance wise, the types now distinguish between pure and impure
actions. sourceState
forces usage of an extra PipeM
constructor at each
step of output generation, which kills GHC's ability to optimize.
So our main takeaway should be: don't use sourceState
. It's there for API
compatibility with older versions, but is no longer the best approach to the
problem. Similarly, we can improve upon sourceIO
, but we have to be a bit
careful here, since we have to ensure that all of our finalizers are called
correctly. Let's take a look at a simple Char
-based file source, comparing a
sourceIO
implementation to the raw constructors.
import Data.Conduit
import qualified Data.Conduit.List as CL
import Control.Monad.Trans.Resource
import System.IO
import Control.Monad.IO.Class (liftIO)
import Criterion.Main
sourceFileOld :: MonadResource m => FilePath -> Source m Char
sourceFileOld fp = sourceIO
(openFile fp ReadMode)
hClose
(\h -> liftIO $ do
eof <- hIsEOF h
if eof
then return IOClosed
else fmap IOOpen $ hGetChar h)
sourceFileNew :: MonadResource m => FilePath -> Source m Char
sourceFileNew fp = PipeM
(allocate (openFile fp ReadMode) hClose >>= go)
(return ())
where
go (key, h) =
pull
where
self = PipeM pull close
pull = do
eof <- liftIO $ hIsEOF h
if eof
then do
release key
return $ Done Nothing ()
else do
c <- liftIO $ hGetChar h
return $ HaveOutput self close c
close = release key
main :: IO ()
main =
defaultMain [bench "old" $ go sourceFileOld, bench "new" $ go sourceFileNew]
where
go src = whnfIO $ runResourceT $ src "source-io.hs" $$ CL.sinkNull
The results are much closer here:
We're no longer getting the benefit of avoiding monadic binds, since by its
very nature this function has to call IO
actions constantly. In fact, I
believe that the performance gap here doesn't warrant avoiding sourceIO
in
normal user code, though it's likely a good idea to look at optimizing the
Data.Conduit.Binary
functions. Perhaps even better is if we can get some
combinators that make it easier to express this kind of control flow.
The story is much the same with Sink
s and Conduit
s, so I won't bore you
with too many details. Let's jump into the code first, and then explain what we
want to notice.
import Criterion.Main
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.List
import Data.Functor.Identity
main :: IO ()
main = defaultMain
[ bench "mapOutput" $ flip whnf 2 $ \i -> runIdentity $ mapOutput (* i) source $$ sink
, bench "map left" $ flip whnf 2 $ \i -> runIdentity $ source $= CL.map (* i) $$ sink
, bench "map right" $ flip whnf 2 $ \i -> runIdentity $ source $$ CL.map (* i) =$ sink
, bench "await-yield left" $ flip whnf 2 $ \i -> runIdentity $ source $= awaitYield i $$ sink
, bench "await-yield right" $ flip whnf 2 $ \i -> runIdentity $ source $$ awaitYield i =$ sink
]
where
source :: Monad m => Source m Int
source = CL.sourceList [1..1000]
sink :: Monad m => Sink Int m Int
sink = CL.fold (+) 0
awaitYield :: Monad m => Int -> Conduit Int m Int
awaitYield i =
self
where
self = do
mx <- await
case mx of
Nothing -> return ()
Just x -> do
yield $ x * i
self
There are five different ways presented to multiple each number in a stream by
2. CL.map
is likely the most obvious choice, since it's a natural analogue to
the list-based map
function. But we have two different ways to use it: we can
either left-fuse the source to the conduit, and then connect the new source to
the sink, or right-fuse the conduit to the sink, and connect the source to the
new sink.
We also have an awaitYield
function, which uses the await
and yield
functions and leverages the Monad
instance of Conduit
. Like map
, we have
both a left and a right version.
We also have a mapOutput
function. In that case, we're not actually using a
Conduit
at all. Instead, we're modifying the output values being produced by
the source directly, without needing to pipe through an extra component. Let's
see our benchmark results:
There are three things worth noticing:
- Like previously, the high-level approach (using
await
andyield
) was slower than using the more highly optimized function fromData.Conduit.List
. - There no clear winner between left and right fusing.
mapOutput
is significantly faster than using aConduit
. The reason is that we're able to eliminate an entire extraPipe
in the pipeline.
mapOutput
will not be an option in the general case. You're restricted in a number of ways:
- It can only be applied to a
Source
, not aSink
. - You have to have transformations which produce one output for one input.
- You can perform any monadic actions.
However, if your use case matches, mapOutput
can be a very convenient optimization.