My understanding of reducers after EuroClojure
12 Jun 2012

At EuroClojure Rich Hickey gave a second, unscheduled talk about the new reducers library. The talk clarified a few points for me that I didn't initially get from the explanatory blog posts on clojure.com/blog. For the Amsterdam Clojure Meetup I wrote down these notes:
Reducers vs the seq/lazy-seq API
This is mostly explained in the first blog post: Reducers - A Library and Model for Collection Processing.
The reducers are partly about de-complecting representation, order, and mechanism. For example: the map reducer is only a recipe for applying a function to each item in the collection. Compare the following:
```clojure
[2 3 4]
=== (vec (map inc [1 2 3]))
=== (into [] (r/map inc [1 2 3]))
=== (reduce conj [] (r/map inc [1 2 3]))
```

(here map is clojure.core/map and r/map is the new clojure.core.reducers/map)
A reducer is only the recipe step of transforming a collection. Building a collection or another result is an explicit, separate reduce step. What gets built and what is done with the input collection are kept separate here. When you replace r/map with a composition of several reducers, the actual construction of the result is still only the final step; compare this to the intermediate lazy seqs that are produced when composing the old seq functions.
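A minimal sketch of such a composition, using the curried one-argument forms of the reducer functions (which return a recipe waiting for a collection):

```clojure
(require '[clojure.core.reducers :as r])

;; A recipe: inc every element, then keep the even results.
;; No work happens and no intermediate collection is built here.
(def inc-then-evens (comp (r/filter even?) (r/map inc)))

;; The result is only constructed in the final step:
(into [] (inc-then-evens [1 2 3 4 5]))
;;=> [2 4 6]
```

Note that with comp the rightmost reducer runs first, so the composition reads in the opposite order of the ->> threading commonly used with the seq functions.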
Reducers and fold for parallelism
This is mostly explained in the second blog post: Anatomy of a Reducer.
The split between building results and the reducer recipes also opens up some nice possibilities for parallelism, through the fork/join framework.
```clojure
(def v (vec (range 100000000000)))

(reduce + (map inc v))    ;; seq functions
(reduce + (r/map inc v))  ;; reducer + sequential reduce
(r/fold + (r/map inc v))  ;; reducer + parallel fold
```
I think the timings were: reduce + r/map about 2x as fast as reduce + map, and r/fold + r/map about 4x as fast as reduce + map.
Fold is simply the name for the new parallel reduce function; it is not meant to refer to functions with the same name in other languages.
In the example the fold function is not passed a combiner; the plus function is used as the combiner here as well. The seed for the reduce at the leaves is the combiner function called with zero arguments.
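In other words, a sketch of the equivalence using fold's full arity (the default partition size is 512):

```clojure
;; With a single function, fold uses it as both reducer and combiner:
(r/fold + (r/map inc v))
;; is equivalent to:
(r/fold 512 + + (r/map inc v))
;; and each leaf reduce starts from the seed (+), which is 0.
```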
Fold will try to do the computation in parallel using fork/join, when both the input collection (which is asked to apply the reducer to itself) and the reducer support this. The check for support is done through protocols. On the data structure side, PersistentVector extends CollFold. On the reducer side, r/map is defined as a folder, while r/take-while is defined as a reducer and does not support fold, because partitioning makes no sense for that computation. When parallel fold is not supported, fold simply falls back to a sequential reduce, as the sketch below shows. See the implementation for details: reducers.clj
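A small sketch of the difference; the fallback happens silently and both calls return the same kind of result:

```clojure
;; r/map yields a folder, so fold over a vector can run in parallel:
(r/fold + (r/map inc (vec (range 1000))))
;;=> 500500

;; r/take-while yields only a reducer, so fold falls back to an
;; ordinary sequential reduce:
(r/fold + (r/take-while #(< % 100) (vec (range 1000))))
;;=> 4950
```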
A common approach for parallel processing is map+reduce: the work is divided into partitions, map is applied to each partition, and the intermediate results are combined with reduce. Fold takes a reduce+combine approach instead: the work on the smallest partition is done with a reduce rather than with a map. Compare the two approaches by trying to express filter in map+reduce versus reduce+combine (a sketch follows below). It appeals to functional programming sensibilities that reduce is a better fit for most operations than map.
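Here is filter in the reduce+combine style, mirroring the example from the Anatomy of a Reducer post: r/append! is the leaf-level reducing function and r/cat is the combiner.

```clojure
;; Each leaf reduce appends matching elements to a partition-local
;; collection; r/cat catenates the per-partition results.
(into []
      (r/fold r/cat r/append!
              (r/filter even? (vec (range 10)))))
;;=> [0 2 4 6 8]
```

The library also ships this combination as r/foldcat. Expressing the same thing in map+reduce style is awkward, because map must emit something for every input element.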
Fold works nicely with Clojure data structures. Many data structures in Clojure are trees, which can easily be partitioned. As Rich explained: explicit parallel data structures (as found in other languages) make no sense, because being parallel is a property of an algorithm, not of the data.
The ultimate benefit of the reducers library is this: in the past you could make your programs faster by simply waiting a while for faster hardware (Moore's law). Fold and reduce+combine bring back that promise for data processing, with more cores rather than faster CPUs.