On hyperparameter optimisation of random forest when Lasso is used to post-process the trees

In my OpenGeoHub2020 lecture https://github.com/mengluchu/OpenGeoHub2020, I talked about applying Lasso as a post-processing step for random forest to shrink the set of trees. This approach is described in the famous “Elements of Statistical Learning” book by Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman, but I have not seen it implemented or used. I implemented it in my air pollution spatial prediction model and in my R package APMtools. The function is here: https://github.com/mengluchu/APMtools/blob/master/R/prediction_with_pp_La.R My experience is that it improves prediction accuracy and reaches an accuracy almost as good as a meticulously tuned XGBoost.
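
For readers who prefer symbols, the post-processing step can be written as a Lasso regression on the individual tree predictions. This is a sketch assuming squared-error loss; the notation (T_m for the prediction of tree m, alpha_m for its weight, lambda for the Lasso penalty, M for ntree) is my own shorthand and not taken from the package.

```latex
\hat{\alpha}(\lambda) = \arg\min_{\alpha}
  \sum_{i=1}^{N} \Big( y_i - \alpha_0 - \sum_{m=1}^{M} \alpha_m T_m(x_i) \Big)^2
  + \lambda \sum_{m=1}^{M} |\alpha_m|
```

Trees whose weight is shrunk to exactly zero drop out of the ensemble, which is how the number of trees gets reduced.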

I received a question about hyperparameter optimisation with this Random Forest + Lasso approach (let’s call it RFLA): if we use the Lasso for post-processing and reduce the number of trees, how does that fit into the hyperparameter optimisation? In this blog I summarise hyperparameter optimisation for random forest and discuss this question.

In random forest hyperparameter optimisation (even though random forest is not terribly sensitive to it), we commonly optimise the number of variables to select from at each split (mtry, the most important) and the minimum node size (min.node.size, less important), but not the number of trees (ntree), because ntree can be set to a very safe (i.e. large) number and increasing it doesn’t deteriorate the model. Random forest is also perfectly parallelisable, so the computational burden is rarely a concern. This leads to the logic that in practice we typically decide on ntree first, then optimise mtry and min.node.size.
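
A minimal sketch of this workflow in Python with scikit-learn, not the R code I use in APMtools. The assumptions are loud ones: the data are synthetic, max_features stands in for mtry, and min_samples_leaf is only a rough analogue of min.node.size (the two are not defined identically across libraries). The grid values are arbitrary illustrations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for a real dataset (illustration only).
X, y = make_regression(n_samples=300, n_features=12, n_informative=6,
                       noise=10.0, random_state=0)

# Step 1: fix the number of trees at a safe value; it is not tuned.
forest = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)

# Step 2: search over the mtry analogue and the node-size analogue only.
param_grid = {
    "max_features": [2, 4, 8, 12],    # assumed counterpart of mtry
    "min_samples_leaf": [1, 5, 10],   # rough counterpart of min.node.size
}

# 5-fold cross-validation, scored by (negative) RMSE.
search = GridSearchCV(forest, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

print("best setting:", search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```

The point of the design is simply that n_estimators never enters param_grid.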

Let’s first look at the effects of mtry and min.node.size. As we know, a lower mtry reduces model overfitting by increasing the heterogeneity of the candidate predictor variables used to build each tree, which reduces the correlation between trees. At the same time, a small mtry may reduce the strength of a single tree, as fewer predictor variables are considered at each split. For the same reason, a large mtry is more likely to cause overfitting of a single tree.

Suppose we have sufficient observations. If ntree is small (say, 50), mtry is better kept small because we want the trees to be more independent of each other. When ntree is relatively large (say, 2000), mtry may be increased, as we have many trees to aggregate to reduce model overfitting in the first place. However, as mentioned, a larger mtry makes the trees more correlated with each other (high redundancy). In practice, we use cross-validation (CV) to find the optimal mtry.
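
The interplay between ntree and tree correlation can be made concrete with the standard formula for the variance of an average of correlated variables. This is a simplification: it assumes every tree’s prediction has the same variance and every pair of trees has the same correlation, and the symbols (sigma squared, rho, B for ntree) are my own notation.

```latex
\operatorname{Var}\Big( \frac{1}{B} \sum_{b=1}^{B} T_b(x) \Big)
  = \rho \, \sigma^2 + \frac{1 - \rho}{B} \, \sigma^2
```

Increasing ntree only shrinks the second term. The first term depends on the correlation between trees, which is what mtry influences.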

Reducing min.node.size also increases the chance of a single tree overfitting, especially when mtry is large, because a small min.node.size makes each tree more flexible. That said, this hyperparameter is not so important. In practice, it is commonly set to 5 or 10.

As mtry and min.node.size both affect the final results, it sounds reasonable that we should optimise the hyperparameters for RFLA directly, instead of optimising them for random forest and then applying Lasso. However, is this worth the effort? The Lasso has its own hyperparameter optimisation, which further reduces the effect of the random forest hyperparameter settings. With Lasso controlling the tree correlations, mtry could be very large. When mtry is equal to the number of predictors, random forest becomes bagging. This means that bagging + Lasso may be no less effective than random forest or random forest + Lasso. Or, more generally speaking, is bagging with an efficient shrinkage-based tree aggregation strategy good enough?
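
One way to probe this question is to run both variants side by side. Below is a minimal sketch in Python with scikit-learn, not the APMtools implementation. It assumes synthetic data, fits the Lasso on the training-set predictions of the individual trees, and uses max_features=4 versus max_features=1.0 (all predictors, i.e. bagging) as arbitrary stand-ins for a small and a maximal mtry. The helper names tree_predictions and rmse are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

N_TREES = 200  # fixed up front, as discussed above

# Synthetic data stands in for a real dataset (illustration only).
X, y = make_regression(n_samples=600, n_features=12, n_informative=6,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)


def tree_predictions(forest, X):
    """Return one column of predictions per tree: shape (n_samples, n_trees)."""
    return np.column_stack([tree.predict(X) for tree in forest.estimators_])


def rmse(pred, obs):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))


results = {}
for label, max_features in [("random forest", 4), ("bagging", 1.0)]:
    forest = RandomForestRegressor(n_estimators=N_TREES,
                                   max_features=max_features,
                                   min_samples_leaf=5,
                                   random_state=0, n_jobs=-1)
    forest.fit(X_train, y_train)

    # Lasso on the per-tree predictions; the penalty is chosen by 5-fold CV.
    lasso = LassoCV(cv=5, max_iter=10000, random_state=0)
    lasso.fit(tree_predictions(forest, X_train), y_train)

    results[label] = {
        "trees_kept": int(np.sum(lasso.coef_ != 0)),
        "rmse_forest": rmse(forest.predict(X_test), y_test),
        "rmse_with_lasso": rmse(
            lasso.predict(tree_predictions(forest, X_test)), y_test),
    }

for label, res in results.items():
    print(label, res)
```

Whether the two variants end up close on a given dataset is an empirical matter; the sketch only shows how the comparison could be set up.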

Is the invention of random forest overrated?