I've gotten a few emails recently (most recently from Nathan Howell) about a shortcoming of the Conduit datatype. The issue is that a Conduit can only produce output when it is pushed to. However, if you have a Conduit that could produce a large amount of output for a single input (e.g., a decompressor), this could become memory inefficient.

I came up with a simple solution: allow a Conduit to return a stream of outputs for a single input. In code, this turns into just a single additional constructor for the ConduitResult type:
HaveMore (m (ConduitResult input m output)) (m ()) [output]
I'll go into more detail on the m () bit below, but in short it says how to close the Conduit early.
Any time you push to a conduit, it can now say "here's some output, and more is on the way." I've implemented this, and I'm happy with this solution. However, I want to make it better.
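For context, here's a rough sketch of what the full ConduitResult type might look like with this addition. The shapes of Producing and Finished are inferred from the rest of this post; in particular, the leftover-input field on Finished is my assumption, not something stated here.

-- Sketch only: the exact fields of Producing and Finished are assumptions.
data ConduitResult input m output
    = Producing (Conduit input m output) [output]  -- the next Conduit, plus some output
    | Finished (Maybe input) [output]              -- leftover input (assumed), plus final output
    | HaveMore (m (ConduitResult input m output))  -- how to get the rest of the stream
               (m ())                              -- how to close the Conduit early
               [output]                            -- output available right now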
There's one way to do it
With the previous API, there was only one way to encode each operation. If you wanted to implement a map, you had to use the Producing constructor with a single-element list for the output. A concatMap would look something like:
push input = return $ Producing (Conduit push close) (f input)
However, we now have at least two other ways to encode the same thing:
1. Return a HaveMore constructor which contains all of the output, and which will then return the Producing constructor to allow the Conduit to continue.
2. Return the elements one at a time via HaveMore.

(Both approaches are sketched below.)
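Here's a rough sketch of those two encodings for the concatMap example, assuming HaveMore's fields are ordered as shown above and that Producing still carries its [output] list:

-- Encoding 1 (sketch): one HaveMore with all of the output, then Producing to continue.
push input = return $ HaveMore
    (return $ Producing (Conduit push close) [])
    close
    (f input)

-- Encoding 2 (sketch): one element at a time via nested HaveMores.
push input = return $ go (f input)
  where
    go []     = Producing (Conduit push close) []
    go (x:xs) = HaveMore (return $ go xs) close [x]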
Having these multiple approaches makes the internals of the library a bit ugly, and since there are multiple codepaths, it increases the likelihood of bugs. I also think it's difficult for new users to see so many options.
There are two separate issues at play, so let's deal with them separately.
All constructors can return output
In the current setup, all three constructors can return output. This was necessary previously, but no longer. If we removed the [output] field from both Producing and Finished, then a user would be forced to use HaveMore when they want to return output.
My concern here is complicating library usage. A previously simple function like map would now require a few extra hoops to be jumped through. We could address this by keeping the same higher-level interface we had before in conduitState and conduitIO. That would have the downside of a mismatch between the low-level and high-level APIs.
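To make the trade-off concrete, here's a sketch of what map's push might look like if the [output] fields were simply dropped from Producing and Finished, forcing all output through HaveMore (field order as above; f is the function being mapped):

-- Sketch: map under the proposed change. Even a single output element
-- has to go through HaveMore before continuing with Producing.
push input = return $ HaveMore
    (return $ Producing (Conduit push close))
    close
    [f input]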
To chunk, or not to chunk
Another question is chunking. Previously, returning a list of outputs was necessary, since we only had one chance to return output. Now, however, we could just return successive HaveMores. This has the downside of, once again, complicating some implementations. It has an additional downside that it might hurt performance. On the flip side, it may improve performance in some cases, since it would be impossible to return empty lists in a HaveMore.
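As an illustration, if we dropped chunking entirely, HaveMore could carry a single element instead of a list, which would make empty output chunks unrepresentable. This is purely a hypothetical variant, not something proposed above:

-- Hypothetical, un-chunked variant: there is no way to return an empty chunk.
HaveMore (m (ConduitResult input m output)) (m ()) output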
Should closing give a Source?
And as long as we're on the subject of change, let's look at closing a Conduit. This applies in two circumstances: the feeding source closed, or the consuming sink closed. If the feeding source closes, we want to have an opportunity to produce a bit more output. This is necessary, for example, in the case of compression: we want to build up large chunks of compressed data and then generate output. But the last chunk of output has to be manually flushed once we know there's no more input.
On the flip side, if the consuming sink closes, we don't need to produce any more output, as it won't be used. If you look at the definition of HaveMore above, it has a field of type m (), which is how it's closed. This doesn't allow for any new output to be produced, because a HaveMore would only ever be closed if the consuming Sink closed.
At this point, I see two problems with the way conduitClose works:

1. When closing a Conduit, you can only return a single chunk of values, not a stream of values. I can't think of a use case where you would return a large quantity of output from closing, but this limitation does bother me.
2. In the case of a closed sink, the conduit will still try to produce some extra output which may never be used.
There's an easy solution to both problems: closing a Conduit returns a Source, which provides the last set of data. In the case of a closed Sink, the conduit functions would simply call sourceClose immediately. In the case of large output, we could take advantage of Source's natural streaming abilities.
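To make that concrete, here's a sketch of how the Conduit record might change under this proposal. The field names are assumptions based on the functions mentioned in this post (conduitClose is referenced above; conduitPush is my guess), not the actual API:

-- Sketch only: conduitClose returning a Source instead of a final chunk.
data Conduit input m output = Conduit
    { conduitPush  :: input -> m (ConduitResult input m output)
    , conduitClose :: m (Source m output)
    }
-- When the downstream Sink has already closed, the conduit functions would
-- simply call sourceClose on that Source immediately and discard it.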
Feedback wanted
I'm writing this post in hope of getting some good feedback from people. Is my desire for one-way-to-do-things worthwhile, or is it better to complicate the internals of the library in exchange for potentially simpler user code? Does anyone have recommendations for better names for any of the constructors?
Postscript: prior art
While working on this, I reviewed two alternate approaches: enumerator and pipes. Let me explain why I can't reuse their solutions:

1. The Enumeratee type from enumerator is very powerful, much more so than a Conduit. It is a general-purpose Iteratee transformer, capable of doing lots of crazy stuff. That's exactly what I want to avoid for conduit: implementing an Enumeratee is far more complicated than implementing a Conduit, since it requires thinking directly about the inner Iteratee. The simplicity of a Conduit comes from the fact that it is a standalone unit.
2. As usual, pipes looks like a simple, elegant solution. But the big thing it's lacking is proper resource management. Notice how much thought goes into Conduit to ensure that all resources are closed as early as possible, even in the case of early termination. It's true that by using ResourceT, pipes is able to avoid completely losing scarce resources, but holding onto a file handle for too long is not much better. I see no way to adapt any of pipes's approaches to conduit and still maintain our strict resource management.