Don’t get lost in a forest

It’s easy to get lost among tree-based models and their many implementations in R. With this post I try to shed some light on the principal methods and algorithms, so you can exploit the power and simplicity of these models.

How to use this repo

My advice is to follow along with the post on rDisorder, keeping the full code at hand.

You can clone the repo (if you need help with cloning, refer to this) to have the data and code on your local storage. Or you can just load the data from the data folder in the repo itself.

There’s also a notebook implementation on Kaggle: you can fork the notebook, run it, and experiment with it.

Below is an intro to machine learning pipelines.

Fast & brief primer on dplyr + intubate

This is a short intro to the dplyr and intubate packages. I didn’t want to overload the Don’t get lost in a forest post that this intro refers to.

Why dplyr and not base R?

With dplyr we can avoid creating temporary datasets, which can save computation time and memory. It might look trivial, but you’ll realize that this approach not only saves a ton of time in the long run, it also improves code clarity and simplicity.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Now let’s say I want to see the mean petal width of every species. How would I do it?

tapply(iris$Petal.Width, iris[,5], mean)

##     setosa versicolor  virginica 
##      0.246      1.326      2.026

This is a very nice one-liner, and for basic stuff like this it is actually the best way to deal with grouping. But what if I want the mean of every variable at the same time?

iris %>% # Take iris data set
    group_by(Species) %>% # group everything by Species
    summarize_each(funs(mean)) # summarize each group with the funs I want

## # A tibble: 3 × 5
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##       <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

Et voilà! The result is a convenient tibble, a sort of data frame 2.0 that you can use just like a base data frame.

Moreover, piping makes reading code more natural: just look at the comments in the code chunk above.
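A note for readers on a recent dplyr: summarize_each() and funs() have since been deprecated. The grouped means above can be sketched with across() instead, assuming dplyr 1.0.0 or later is installed. The name species_means is just an illustrative choice, not something from the original post.

```r
library(dplyr)

# Same grouped summary as above, written with across().
# Inside a grouped summarise(), everything() skips the grouping column.
species_means <- iris %>%              # take the iris data set
    group_by(Species) %>%              # group everything by Species
    summarise(across(everything(), mean))  # mean of every remaining column

species_means
```

The result should again be a tibble with one row per species and one column per measurement.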

Why intubate?

And that’s not all: with the intubate package we can go even further and pipe models as well.

library(intubate)

summary(lm(Petal.Width ~ ., data = iris[sample(1:nrow(iris), nrow(iris) * .7), - 5]))

## 
## Call:
## lm(formula = Petal.Width ~ ., data = iris[sample(1:nrow(iris), 
##     nrow(iris) * 0.7), -5])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37386 -0.12411 -0.01595  0.08865  0.54763 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.05771    0.19494  -0.296    0.768    
## Sepal.Length -0.25555    0.05048  -5.063 1.86e-06 ***
## Sepal.Width   0.23180    0.05419   4.277 4.30e-05 ***
## Petal.Length  0.54363    0.02651  20.503  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1818 on 101 degrees of freedom
## Multiple R-squared:  0.9462, Adjusted R-squared:  0.9446 
## F-statistic: 591.7 on 3 and 101 DF,  p-value: < 2.2e-16

The one-liner above is a mouthful and not so clear. One way to do the same thing more clearly would be to create copies and lookup variables, but we don’t like that much, right?
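For reference, the version with copies and lookup variables might look like the sketch below. The object names (iris_numeric, train_rows, iris_sample, fit) are made up for illustration, and the seed is an assumption added only so the sketch is repeatable.

```r
set.seed(42)  # arbitrary seed, not in the original post

# Copy of the data without the Species column (column 5)
iris_numeric <- iris[, -5]

# Lookup variable: indices of a random 70% of the rows
n_train <- round(nrow(iris_numeric) * 0.7)
train_rows <- sample(seq_len(nrow(iris_numeric)), size = n_train)

# Second copy: the sampled rows only
iris_sample <- iris_numeric[train_rows, ]

# Fit and inspect the linear regression
fit <- lm(Petal.Width ~ ., data = iris_sample)
summary(fit)
```

Every intermediate object here stays in the workspace after the model is fitted, which is exactly what the piped version avoids.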

iris %>% # Take iris data set
    select(-Species) %>% # keep all columns except Species
    sample_n(nrow(iris) * .7) %>% # take a random sample of rows
    ntbt_lm(Petal.Width ~ .) %>% # run a linear regression
    summary # show me a summary

## 
## Call:
## lm(formula = Petal.Width ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.61250 -0.09824 -0.02436  0.10692  0.60443 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.25337    0.22534  -1.124 0.263506    
## Sepal.Length -0.21542    0.05656  -3.809 0.000240 ***
## Sepal.Width   0.23909    0.06015   3.975 0.000132 ***
## Petal.Length  0.52827    0.02883  18.320  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1959 on 101 degrees of freedom
## Multiple R-squared:  0.935,  Adjusted R-squared:  0.9331 
## F-statistic: 484.2 on 3 and 101 DF,  p-value: < 2.2e-16

Much clearer, and it makes doing transformations before running a model feel like a breeze. (The estimates differ slightly from the previous output because each run draws a new random sample.) If you want to use a function that is not included in the intubate package, or if you’re not sure whether it’s implemented, you can call it through the plain ntbt function.

iris %>% # Take iris data set
    select(-Species) %>% # keep all columns except Species
    sample_n(nrow(iris) * .7) %>% # take a random sample of rows
    ntbt(lm, Petal.Width ~ .) %>% # run a linear regression
    summary # show me a summary

## 
## Call:
## lm(formula = Petal.Width ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40603 -0.10177 -0.00889  0.08524  0.60071 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.21301    0.20300  -1.049    0.297    
## Sepal.Length -0.22614    0.05502  -4.110 8.06e-05 ***
## Sepal.Width   0.23660    0.05713   4.142 7.17e-05 ***
## Petal.Length  0.53710    0.02758  19.472  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1811 on 101 degrees of freedom
## Multiple R-squared:  0.9454, Adjusted R-squared:  0.9437 
## F-statistic: 582.4 on 3 and 101 DF,  p-value: < 2.2e-16
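The three summaries above do not match because each one draws a fresh random sample. A minimal sketch of how to make such a pipeline repeatable is to fix the seed before sampling. To keep it runnable with dplyr alone, this sketch passes the data to lm through the pipe's dot placeholder (data = .) rather than through intubate. The helper name fit_once and the seed value 42 are hypothetical choices.

```r
library(dplyr)

# Hypothetical helper: fit the same model on a seeded random sample
fit_once <- function(seed) {
    set.seed(seed)                         # fix the random number generator
    iris %>%                               # take the iris data set
        select(-Species) %>%               # keep all columns except Species
        sample_n(105) %>%                  # 70% of the 150 rows
        lm(Petal.Width ~ ., data = .)      # the dot stands for the piped data
}

first  <- fit_once(42)
second <- fit_once(42)

# With the same seed, both fits use the same rows
identical(coef(first), coef(second))
```

With a different seed for each call, the coefficients would vary from run to run, as they do in the outputs above.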
