Data wrangling with transducers for a machine learning problem
28 Apr 2017
The transducers from the net.cgrand.xforms library are great for transforming and analyzing data in Clojure. This blog post shows how the xforms transducers can be used to do data analysis for a machine learning problem from Kaggle, a data science competition platform.
One of the competitions on Kaggle is the Titanic competition. For this competition you are given a dataset about passengers aboard the Titanic, with data such as their age and how much they paid for their ticket. In the training data you are also told whether the passenger survived. The goal of the competition is to predict, for a separate test set, whether each passenger survived.
This tutorial on the Kaggle site explains how to solve such a problem. It covers which steps to take, how to analyze, change or create data, and how to make predictions. The tutorial uses Python to go through all these steps. In this blog post we'll use Clojure instead.
Analyzing the data
| :PassengerId | :SibSp |   :Fare | :Embarked |   :Sex | :Survived | :Parch | :Pclass | :Age |
|--------------+--------+---------+-----------+--------+-----------+--------+---------+------|
|            1 |      1 |    7.25 |         S |   male |         0 |      0 |       3 |   22 |
|            2 |      1 | 71.2833 |         C | female |         1 |      0 |       1 |   38 |
|            3 |      0 |   7.925 |         S | female |         1 |      0 |       3 |   26 |
|            4 |      1 |    53.1 |         S | female |         1 |      0 |       1 |   35 |
;; etc
Example data from the Titanic dataset
Let's say you would like to find out how many people from the training data set survived. With the functions from clojure.core that looks like this:
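A minimal sketch of that count, assuming the training data is held as a vector of maps with the keys from the table above. The name passengers and the four inline rows are illustrative stand-ins for the full data set:

```clojure
;; Illustrative sample: the four rows shown in the table above,
;; reduced to the keys this example needs.
(def passengers
  [{:PassengerId 1 :Sex "male"   :Survived 0}
   {:PassengerId 2 :Sex "female" :Survived 1}
   {:PassengerId 3 :Sex "female" :Survived 1}
   {:PassengerId 4 :Sex "female" :Survived 1}])

;; Keep only the passengers whose :Survived flag is 1, then count them.
(def survivor-count
  (count (filter #(= 1 (:Survived %)) passengers)))
```

For the four sample rows this counts three survivors.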
And here's how to slice the data to see how many people survived based on their gender:
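One way to sketch this with clojure.core alone is to map every passenger to a [sex survived] pair and let frequencies do the counting. As before, passengers and its four rows are assumed sample data, not the full training set:

```clojure
;; Illustrative sample: the four rows shown in the table above.
(def passengers
  [{:PassengerId 1 :Sex "male"   :Survived 0}
   {:PassengerId 2 :Sex "female" :Survived 1}
   {:PassengerId 3 :Sex "female" :Survived 1}
   {:PassengerId 4 :Sex "female" :Survived 1}])

;; Number of survivors per gender.
(def survivors-by-sex
  (->> passengers
       (filter #(= 1 (:Survived %)))
       (map :Sex)
       frequencies))

;; Number of passengers per [gender survived] combination.
(def counts-by-sex-and-survival
  (frequencies (map (juxt :Sex :Survived) passengers)))
```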
With the 'by-key' transducer that looks like:
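A hedged sketch of the same grouping with xforms, assuming the net.cgrand/xforms dependency is on the classpath and using the same illustrative passengers sample. by-key partitions the items by a key function and runs the given transducer (here x/count) on each partition:

```clojure
(require '[net.cgrand.xforms :as x])

;; Illustrative sample: the four rows shown in the table above.
(def passengers
  [{:PassengerId 1 :Sex "male"   :Survived 0}
   {:PassengerId 2 :Sex "female" :Survived 1}
   {:PassengerId 3 :Sex "female" :Survived 1}
   {:PassengerId 4 :Sex "female" :Survived 1}])

;; Group by [gender survived] and count the passengers in each group.
;; by-key emits [key result] pairs, which `into {}` collects into a map.
(def counts-by-sex-and-survival
  (into {}
        (x/by-key (juxt :Sex :Survived) x/count)
        passengers))
```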
For this counting use case there is little difference between using a transducer and using the basic clojure.core functions. And the fact that transducers can be applied to different kinds of data sources, such as streams, is not relevant for our use case.
But when you want to get multiple results per grouping or statistics other than counting, the transducer approach with xforms starts to come out ahead:
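As a sketch of what that can look like, x/transjuxt runs several reducing transducers over each group in a single pass. The statistics chosen here (count, survival rate and mean age per passenger class) and the passengers sample are assumptions for illustration:

```clojure
(require '[net.cgrand.xforms :as x])

;; Illustrative sample: the four rows shown in the table above.
(def passengers
  [{:PassengerId 1 :Pclass 3 :Survived 0 :Age 22}
   {:PassengerId 2 :Pclass 1 :Survived 1 :Age 38}
   {:PassengerId 3 :Pclass 3 :Survived 1 :Age 26}
   {:PassengerId 4 :Pclass 1 :Survived 1 :Age 35}])

;; Per passenger class: the number of passengers, the mean of the 0/1
;; :Survived flag (which is the survival rate) and the mean age.
(def stats-by-class
  (into (sorted-map)
        (x/by-key :Pclass
                  (x/transjuxt {:count         x/count
                                :survival-rate (comp (map :Survived) x/avg)
                                :mean-age      (comp (map :Age) x/avg)}))
        passengers))
```

Each value in stats-by-class is a map with the same keys as the map passed to x/transjuxt, so adding another statistic is a one-line change.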
Or for the distribution of the Age feature in the dataset:
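A minimal sketch for the Age feature, again on an assumed four-row passengers sample. The ten-year bucket width is an arbitrary choice for illustration, and (keep :Age) skips any passenger whose age is missing:

```clojure
(require '[net.cgrand.xforms :as x])

;; Illustrative sample: the ages from the four rows in the table above.
(def passengers
  [{:PassengerId 1 :Age 22}
   {:PassengerId 2 :Age 38}
   {:PassengerId 3 :Age 26}
   {:PassengerId 4 :Age 35}])

;; Summary statistics of :Age, computed in one pass.
;; transjuxt emits a single map, which `first` picks out of the sequence.
(def age-summary
  (first
    (sequence
      (comp (keep :Age)
            (x/transjuxt {:count x/count
                          :mean  x/avg
                          :min   x/min
                          :max   x/max}))
      passengers)))

;; Number of passengers per ten-year age bucket (20 means ages 20 to 29).
(def age-histogram
  (into (sorted-map)
        (comp (keep :Age)
              (x/by-key #(* 10 (quot % 10)) x/count))
        passengers))
```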
With the xforms transducers, all the data analysis from the Titanic tutorial in Python is just as easy to do in Clojure. So the next time you need to do some group-by type operation, you should check out the transducers from net.cgrand.xforms.
To replicate (most of) the charts from the tutorial you can use the Incanter library.
Predicting the test data
The code is on GitHub.