How to assess the accuracy of a machine learning method when observations are limited?

When relationships are nonlinear, ensemble tree-based methods (e.g. XGBoost, random forest, gradient boosting machines) may give a boost to prediction accuracy. However, when we don't have a very big dataset (say, at least 10,000 observations), such as in air pollution mapping, where complex relationships between air pollution and its predictors have to be derived from several hundred or a few thousand ground monitor observations, how can we validate our model?

The reason this poses a problem is that we need to tune our hyperparameters, and the observations we leave in or out of the training set may alter the relationships we derive. For XGBoost, more than 5 hyperparameters can be tuned, and a change in a single one may give a very different prediction result. By tuning hyperparameters, we run into the problem of information leakage: the data we fit our hyperparameters on are then also used for accuracy assessment.
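The kind of hyperparameter grid search discussed here can be sketched as follows. This is a minimal illustration on synthetic data, using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (which may not be installed everywhere); the grid values echo the settings mentioned later in the post but are examples, not my actual grid.

```python
# Minimal sketch: cross-validated grid search over boosting hyperparameters.
# Assumptions: synthetic data (not the NO2 observations), and scikit-learn's
# GradientBoostingRegressor standing in for XGBoost.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1],  # values discussed later in the post
    "max_depth": [3, 4],
    "n_estimators": [100, 300],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best setting
```

Note that the cross-validated RMSE of the winning setting is exactly the quantity that gets optimistically biased: the same folds both choose the hyperparameters and score them.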

The most intuitive way to address the potential information leakage is to include an external dataset, or to additionally sample a completely untouched test set from the dataset at hand. A cross-validation set is used to tune the hyperparameters during cross-validation, and the result is then tested again on the held-out test set. However, this is also problematic if the split between the cross-validation set and the test set can itself lead to different results.

I did an experiment with ground NO2 observations from Germany and the Netherlands. I have 413 observations in total, of which I held out 10% for independent testing, and I did a grid search for XGBoost. I ran the experiment twice, using a different random seed each time to choose the independent test set (the test set is treated as an external dataset). The two samplings led to considerably different results, particularly in the test RMSE, but also in the optimized hyperparameter values. The differences are in the learning rate and the maximum tree depth: the first time, learning rate = 0.05 and maximum tree depth = 3; the second time, learning rate = 0.1 and maximum tree depth = 4. The first time, the test RMSE is 8.4, about 1 unit larger than the cross-validation result; the second time it is 5.7, about 1.8 units smaller than the cross-validation result.
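The seed effect can be sketched like this. Again a minimal illustration on synthetic data (not the NO2 observations), with scikit-learn's GradientBoostingRegressor standing in for XGBoost; only the random seed of the 10% hold-out split changes between runs.

```python
# Sketch: how the seed used to carve out a 10% test set changes the
# reported test RMSE. Synthetic data; GradientBoostingRegressor is a
# stand-in for XGBoost.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=413, n_features=20, noise=15.0, random_state=0)

rmse_by_seed = {}
for seed in (1, 2):  # two different samplings of the "external" test set
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    rmse_by_seed[seed] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"seed={seed}: test RMSE = {rmse_by_seed[seed]:.2f}")
```

With only ~41 test points, the two RMSE values will generally differ noticeably, which is the instability described above.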


I then looked at relative accuracy indicators (RMSE etc. normalized by the mean of the observations) to reduce the effect of magnitudes (in case I sampled very small values for testing). For test 1, the cross-validation results are RRMSE (relative RMSE): 0.36, rIQR (relative IQR): 0.33, rMAE (relative MAE): 0.25, and R2: 0.68 – closer to the test-set result (see figure below). For test 2, the relative indicators are also closer between cross-validation and the test set, but the cross-validation accuracy still looks underestimated – the test set has an impressive R2 of 0.79.
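The relative indicators can be computed as below. This is a sketch: normalizing by the mean of the observations follows the text, but the exact formulas (e.g. whether the IQR is taken over the prediction errors) are my assumptions, not necessarily those of the original analysis.

```python
# Sketch of the relative accuracy indicators (RRMSE, rIQR, rMAE) used in
# the post; the exact definitions are assumptions.
import numpy as np

def relative_metrics(obs, pred):
    """Accuracy indicators normalized by the mean of the observations."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    mean_obs = obs.mean()

    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # Interquartile range of the errors (assumed definition).
    iqr = np.percentile(err, 75) - np.percentile(err, 25)
    rsq = 1.0 - np.sum(err ** 2) / np.sum((obs - mean_obs) ** 2)

    return {
        "RMSE": rmse, "RRMSE": rmse / mean_obs,
        "IQR": iqr, "rIQR": iqr / mean_obs,
        "MAE": mae, "rMAE": mae / mean_obs,
        "rsq": rsq,
    }

m = relative_metrics([10, 20, 30, 40], [12, 18, 33, 41])
print(m)
```

Dividing each indicator by the mean observation makes results from different random test samples comparable even when the sampled magnitudes differ.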


[Figure: cross-validation result of test 1]

Test 1, test-set results:

```
       RMSE       RRMSE      IQR        rIQR       MAE        rMAE       rsq
Test   8.4197599  0.3650937  7.9581566  0.3939681  6.0183271  0.2609639  0.7093221
```

[Figure: cross-validation result of test 2]

Test 2, test-set vs cross-validation results:

```
       RMSE   RRMSE   IQR    rIQR   MAE    rMAE   rsq
Test   5.67   0.30    3.88   0.23   3.9    0.21   0.79
CV     7.73   0.36    6.5    0.34   5.29   0.24   0.68
```

This means we will reach different conclusions with different external datasets. The first test indicates that the cross-validation results are over-estimated, possibly due to hyperparameter tuning leaking information from our validation sets. On the contrary, the second indicates the information leakage may not exist at all – we even get an under-estimated accuracy assessment.

This demonstrates several things: 1) cross-validation is very important, as how the training and test sets are split plays a major role in the modelling and accuracy assessment; 2) relative accuracy indicators are needed when an external validation set is used; 3) an external test set may add more bias to the accuracy assessment, possibly causing more trouble than the potential information leakage. The question goes back to: how influential is the information leakage?

I then used the hyperparameter settings tuned in test 1 on test 2, which means hyperparameters are barely tuned in test 2. The result: XGBoost obtained worse results on the test set (though still better than the cross-validation results). This suggests the information leakage is not as daunting as tuning hyperparameters insufficiently.

The take-home message is that although there might be information leakage, tuning hyperparameters is essential. With a relatively limited dataset (e.g. hundreds or thousands of observations) but complex relationships and many variables (66 in my case), splitting the dataset into train-validation-test sets, as is done in deep learning for example, is not useful and may even lead to a biased interpretation of the accuracy results.