The Split-Apply-Combine Strategy
Many data analysis tasks involve splitting a data set into groups, applying one or more functions to each group, and then combining the results. A standardized framework for handling this sort of computation is described in the paper "The Split-Apply-Combine Strategy for Data Analysis", written by Hadley Wickham.
The DataValueTables package supports the Split-Apply-Combine strategy through the by function, which takes three arguments: (1) a DataValueTable, (2) one or more columns to split the DataValueTable on, and (3) a function or expression to apply to each subset of the DataValueTable.
We show several examples of the by function applied to the iris dataset below:
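using DataValueTables
using CSV
# Load the iris data set that ships with the package's tests
iris = CSV.read(joinpath(Pkg.dir("DataValueTables"), "test/data/iris.csv"), DataValueTable)
# Dimensions of each per-species subset
by(iris, :Species, size)
# Mean petal length per species, ignoring missing values
by(iris, :Species, dt -> mean(dropna(dt[:PetalLength])))
# Number of rows per species, returned as a column named N
by(iris, :Species, dt -> DataValueTable(N = size(dt, 1)))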
The by function also supports the do block form:
by(iris, :Species) do dt
    DataValueTable(m = mean(dropna(dt[:PetalLength])), s² = var(dropna(dt[:PetalLength])))
end
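To make the three steps concrete without depending on the package, here is a minimal sketch in plain Julia. It is not how by is implemented. The toy rows and the helper names split_rows and split_apply_combine are hypothetical and exist only for illustration.

```julia
using Statistics  # standard library; provides mean

# Hypothetical toy data standing in for a table: one NamedTuple per row.
rows = [
    (species = "setosa",     petal_length = 1.4),
    (species = "setosa",     petal_length = 1.6),
    (species = "versicolor", petal_length = 4.0),
    (species = "versicolor", petal_length = 4.4),
]

# Split: collect the rows that share the same value of `key`.
function split_rows(rows, key::Symbol)
    T = eltype(rows)
    groups = Dict{Any, Vector{T}}()
    for row in rows
        group = get!(() -> T[], groups, getproperty(row, key))
        push!(group, row)
    end
    return groups
end

# Apply and combine: call `f` on each group and gather one result per key.
# `f` comes first so the function can be called with a do block.
function split_apply_combine(f, rows, key::Symbol)
    return Dict(k => f(group) for (k, group) in split_rows(rows, key))
end

# Mean petal length per species, written in do block form.
result = split_apply_combine(rows, :species) do group
    mean(r.petal_length for r in group)
end
```

Putting the function argument first is what allows the do block call at the end, which mirrors the do block form of by shown above.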
A second approach to the Split-Apply-Combine strategy is implemented in the aggregate function, which also takes three arguments: (1) a DataValueTable, (2) one or more columns to split the DataValueTable on, and (3) one or more functions that are used to compute a summary of each subset of the DataValueTable. Each function is applied to each column that was not used to split the DataValueTable, creating new columns with names of the form $name_$function, e.g. SepalLength_mean. Anonymous functions and expressions that do not have a name will be called λ1.
We show several examples of the aggregate function applied to the iris dataset below:
aggregate(iris, :Species, sum)
aggregate(iris, :Species, [sum, x->mean(dropna(x))])
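The $name_$function naming rule can be sketched in plain Julia as follows. This is an illustration under stated assumptions, not the package's code. The helpers function_labels and aggregate_names are hypothetical, the numbering of further anonymous functions as λ2, λ3 and so on is an assumption, and the order of the resulting columns is arbitrary.

```julia
# Hypothetical helper: derive a label for each function. Named functions keep
# their own name. Anonymous functions are numbered λ1, λ2, ... (the numbering
# beyond λ1 is an assumption made for this sketch).
function function_labels(fs)
    counter = 0
    return map(fs) do f
        label = string(nameof(f))
        # Julia gives anonymous functions generated names that start with '#'.
        if startswith(label, "#")
            counter += 1
            label = "λ$counter"
        end
        label
    end
end

# Hypothetical helper: one result column name per (function, column) pair.
function aggregate_names(columns, fs)
    return [Symbol(col, "_", label) for label in function_labels(fs) for col in columns]
end

colnames = aggregate_names([:SepalLength, :PetalLength], [sum, x -> x / 2])
```

Here sum contributes the suffix _sum, while the anonymous function contributes the suffix _λ1 because it has no usable name of its own.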
If you only want to split the data set into subsets, use the groupby function:
for subdt in groupby(iris, :Species)
    println(size(subdt, 1))
end