A very small number of trees may already reveal interesting patterns, and the learning rate is the most important hyperparameter

Did you know that 40 trees can already bring a big leap in prediction quality, even with a complex dataset and complex predictor-response relationships? I recently tested this for random forest and XGBoost, and was surprised by how much the predictions improved with only a small increase in the number of trees.
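A minimal sketch of this kind of experiment, using scikit-learn and the xgboost package on synthetic data (the dataset, predictors, and tree counts here are only illustrative placeholders, not my air pollution setup):

# Sketch: how quickly accuracy changes with the number of trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

for n_trees in (5, 10, 40, 100, 500):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=1).fit(X_tr, y_tr)
    xgb = XGBRegressor(n_estimators=n_trees, learning_rate=0.1, random_state=1).fit(X_tr, y_tr)
    rmse_rf = np.sqrt(mean_squared_error(y_te, rf.predict(X_te)))
    rmse_xgb = np.sqrt(mean_squared_error(y_te, xgb.predict(X_te)))
    print(f"{n_trees:4d} trees: RF RMSE {rmse_rf:.1f}, XGBoost RMSE {rmse_xgb:.1f}")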

Another lesson I learned is about the learning rate for boosting. Even when the learning rate chosen by cross-validation looks slow enough, it can still be too fast and cause artefacts, i.e. strange patterns in the spatial prediction. I often spend a long time tuning XGBoost, using both grid-search cross-validation and manual tuning, but even when the final prediction accuracy looks completely reasonable, the prediction maps can show so many edges and inconsistencies that an expert can tell they are impossible: for example, very low values on the roads but high values right next to them. I found plotting the spatial patterns extremely helpful during hyperparameter tuning, as opposed to relying only on a cross-validation accuracy matrix (RMSE, R2, IQR, MAE, etc.).
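To illustrate what I mean by checking the spatial patterns and not only the accuracy table, here is a small sketch that fits the same XGBoost model with two learning rates and maps the predictions side by side. The data are synthetic (a fake road and a distance-to-road predictor); none of this is my actual air pollution model.

# Sketch: compare predicted maps for two learning rates instead of only RMSE tables.
import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBRegressor

rng = np.random.default_rng(0)

# Synthetic "stations": coordinates plus one road-like predictor (purely illustrative).
xy = rng.uniform(0, 10, size=(300, 2))
road_dist = np.abs(xy[:, 0] - 5)                       # distance to a fake road at x = 5
no2 = 40 * np.exp(-road_dist) + rng.normal(0, 2, 300)  # higher values near the road

# Regular prediction grid covering the same area.
gx, gy = np.meshgrid(np.linspace(0, 10, 200), np.linspace(0, 10, 200))
grid = np.column_stack([gx.ravel(), gy.ravel(), np.abs(gx.ravel() - 5)])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, lr in zip(axes, (0.3, 0.01)):
    model = XGBRegressor(n_estimators=500, learning_rate=lr, max_depth=4)
    model.fit(np.column_stack([xy, road_dist]), no2)
    ax.scatter(grid[:, 0], grid[:, 1], c=model.predict(grid), s=1, cmap="viridis")
    ax.set_title(f"learning_rate = {lr}")
plt.show()

Two maps like these often make edge artefacts obvious long before the accuracy table does.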

For this, I made an R Shiny page for playing around with different hyperparameters; check it out :-).


OpenStreetMap data: querying, downloading, handling

OpenStreetMap (OSM) provides a tremendous amount of information. While it is becoming indispensable in urban GIS (e.g. routing), it also provides invaluable labels for developing supervised machine learning methods (particularly the currently burgeoning deep neural networks) in many applications, such as automatic road and building delineation from very high resolution (VHR) satellite imagery, which raises hopes for high-resolution, global-scale mapping. For example, for air pollution mapping we need the global road infrastructure, but the transportation network provided by OSM is incomplete in many countries, such as China and many African countries. VHR satellite imagery (e.g. WorldView-2) and machine learning (particularly deep neural networks) are promising techniques to complement OSM, and to evaluate the consequences of directly using OSM with missing roads to predict air pollution. Here I provide some useful tools, workflows, and notes for handling OSM data for analysis.

Method 1: Download everything and query

  • Use a mirror to download the full planet file (a combined Python sketch of all three steps follows this list):

wget https://ftp.fau.de/osm-planet/pbf/planet-200413.osm.pbf

  • Filter roads (with osmium):

cmd = "{0} tags-filter {1}.osm.pbf nwr/{2}={3} -o gap_{4}.osm.pbf".format(osmium, fname, keyword, value, out_fname)

  • Convert to GeoPackage (with GDAL):

cmd = "ogr2ogr -f GPKG gap_{}.gpkg gap_{}.osm.pbf".format(out_fname, out_fname)
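Putting the three steps together, a minimal Python wrapper could look like the sketch below. The osmium path, file names, and the highway=motorway tag filter are placeholders; adapt them to your own query.

# Sketch: download a planet extract, filter one tag, and convert it to GeoPackage.
import subprocess

osmium = "osmium"                          # path to the osmium binary (placeholder)
fname = "planet-200413"                    # .osm.pbf file name without extension (placeholder)
keyword, value = "highway", "motorway"     # example tag filter
out_fname = "motorway"

# The full planet file is very large; a regional extract is usually enough.
subprocess.run(["wget", f"https://ftp.fau.de/osm-planet/pbf/{fname}.osm.pbf"], check=True)
subprocess.run([osmium, "tags-filter", f"{fname}.osm.pbf",
                f"nwr/{keyword}={value}", "-o", f"gap_{out_fname}.osm.pbf"], check=True)
subprocess.run(["ogr2ogr", "-f", "GPKG",
                f"gap_{out_fname}.gpkg", f"gap_{out_fname}.osm.pbf"], check=True)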

Pros: everything is stored on your hard disk; if you know exactly what you want, this may be the most straightforward option.

Cons: cannot explore data before downloading. Requires other tools for data processing and analysis. May download lots of redundant data.

Method 2: Interactive query, download, and analysis (using OSMnx):

OSMnx is a brilliant tool, and the paper is also very well written: https://www.researchgate.net/publication/309738462_OSMnx_New_Methods_for_Acquiring_Constructing_Analyzing_and_Visualizing_Complex_Street_Networks

Two ways of installing: conda (or pip), and Docker.

  1. You can create a conda environment to install OSMnx, as indicated in the paper (a minimal usage sketch follows the install commands):

conda config --prepend channels conda-forge

conda create -n ox --strict-channel-priority osmnx
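Once installed, a minimal interactive session could look like the sketch below. The place name is just an example, and the save step uses OSMnx's GeoPackage helper; check the current OSMnx documentation for the exact function names in your version.

# Sketch: query, plot, and export a street network with OSMnx.
import osmnx as ox

# Download the drivable street network for a place (example place name).
G = ox.graph_from_place("Muenster, Germany", network_type="drive")

# Quick visual check of the network.
ox.plot_graph(G)

# Save nodes and edges as a GeoPackage for further GIS analysis.
ox.save_graph_geopackage(G, filepath="muenster_drive.gpkg")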

Pros: very convenient for querying, downloading, visualising, and analysing the data (e.g. finding the fastest routes); well documented with lots of examples; easily reproducible.

Cons: currently only available for Python.

Method 3: Query using the QGIS OSM package, with SQL queries

Pros: the most convenient option for QGIS users.

Cons: not reproducible.

Reducing the data size to 1/10, just by reshaping!

On the way to big data concepts and burgeoning software, don't forget to inspect whether your data frame or spreadsheet stores too many redundant records. Just by reshaping, I reduced the data size to 1/10!

Since I got my hands on global air pollution mapping, I have managed to gather a very comprehensive set of station measurements, through collaborations and by investigating the open science community. My ambition was to make it time-resolved, so I gathered hourly data for a whole year (8760 hours). I got 6 datasets with a total size of 22 GB: some dumped into 365 spreadsheets with several air pollutants, some stored in 13 spreadsheets sorted by space-time, some stored as wide tables, some as long tables, some with UTC time, some with local time. In short, all of them needed to be wrangled into a consistent structure for querying and analysis.

Though I focused on array data management during my Ph.D., I never thought point data would give me trouble with storage. I thought it was easy: I just needed a relational database or HDF5-based tools.

I never thought a meeting with my software developer colleagues would be so useful, as I thought I knew a bit about HDF5 and relational databases, and putting them to use shouldn't be hard.

But with a mere look at the column names I provided, my colleague said: "there is a lot of replication in the data, isn't there?"

I had the columns: time, longitude, latitude, sensor type, value, urban type, … The longitude, latitude, etc. are replicated many times as the time steps accumulate.

I had thought this was the format I am most familiar with for the subsequent analysis; this was the first time I started to care about how much space was wasted.

I went back and built a relational database with two tables. One (the data table) has

time, value, UID (a unique ID linking to each set of coordinates)

The other (the attribute table) has

UID, longitude, latitude, ….

I first tried this on a small dataset with 70 stations; it reduced the file size from around 75 MB to 17 MB.

Now that I am more aware of storage, I grew uneasy about the left-over replication: the UID is repeated for every time step, and the time is repeated across UIDs.

I then rearranged the long table into a wide table, with one column per station. This brings the size down to almost a third of the data table. Compared with the original data, it shrank to 1/10 of the size.
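In pandas, the two reshaping steps can be sketched roughly as below. The column names follow the ones above, but the tiny example table and its values are made up for illustration.

# Sketch: split a long station table into attribute + data tables, then pivot to wide.
import pandas as pd

# Tiny example of the original long format (coordinates repeated for every hour).
df = pd.DataFrame({
    "time": ["2017-01-01 00:00", "2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 01:00"],
    "longitude": [7.62, 7.63, 7.62, 7.63],
    "latitude": [51.96, 51.97, 51.96, 51.97],
    "sensor_type": ["ref", "ref", "ref", "ref"],
    "value": [21.0, 34.0, 19.0, 30.0],
})

# Attribute table: one row per station, with a unique ID.
attrs = df[["longitude", "latitude", "sensor_type"]].drop_duplicates().reset_index(drop=True)
attrs["UID"] = attrs.index

# Data table: only time, UID, and value.
data = df.merge(attrs, on=["longitude", "latitude", "sensor_type"])[["time", "UID", "value"]]

# Wide table: one column per station, so neither the UID nor the time is repeated.
wide = data.pivot(index="time", columns="UID", values="value")
print(wide)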

Before going for “sophisticated” big data solutions, think about “never repeating data”.

How to assess the accuracy of a machine learning method when observations are limited?

When we have nonlinear relationships, ensemble tree-based methods (e.g. XGBoost, random forest, gradient boosting machines) may give a boost to prediction accuracy. However, in situations where we don't have a very big dataset (at least 10,000 observations), such as in air pollution mapping, where we have to derive complex relationships between air pollution and predictors from several hundred or a few thousand ground monitor observations, how can we validate our model?

The reason this poses a problem is that we need to tune our hyperparameters, and the observations we leave in or out of the training set may alter the relationships we derive. For XGBoost, more than 5 hyperparameters can be tuned, and a change in one of them may give a very different prediction result. By tuning hyperparameters we run into the information-leak argument, as the data we fit our hyperparameters on are then also used for accuracy assessment.

The most intuitive way of solving the potential information leak is to include an external dataset, or to additionally sample a completely untouched test set from the data at hand. A cross-validation set is used to tune the hyperparameters, and the result is then tested again on the test set. However, this is also problematic if the split between the cross-validation set and the test set can itself lead to a different result.
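A sketch of this procedure with scikit-learn and xgboost is shown below. The data are synthetic and the search grid is much smaller than in a real grid search; only the overall shape of the workflow (cross-validation for tuning, plus an untouched test set) is the point.

# Sketch: tune hyperparameters by cross-validation, then check on a held-out test set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=413, n_features=66, noise=10, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [3, 4], "n_estimators": [500]}
search = GridSearchCV(XGBRegressor(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_cv, y_cv)

cv_rmse = -search.best_score_
test_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
print(search.best_params_, f"CV RMSE: {cv_rmse:.2f}", f"test RMSE: {test_rmse:.2f}")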

I did an experiment with ground NO2 observations from Germany and the Netherlands. I had altogether 413 observations and held out 10% for independent testing. I did a grid search for XGBoost. For the two tests, I used different seeds to randomly choose the independent test set (the test set is treated as an external dataset). The two samplings led to considerably different results, particularly in the test RMSE, but also in the optimised hyperparameter values. The differences are in the learning rate and maximum tree depth: the first time the learning rate was 0.05 and the maximum tree depth 3, the second time the learning rate was 0.1 and the maximum tree depth 4. The first time the test RMSE was 8.4, about 1 unit larger than the cross-validation result; the second time it was 5.7, about 1.8 units smaller than the cross-validation result.


I then looked at relative accuracy indicators (RMSE etc. normalised by the mean of the observations) to reduce the effect of magnitudes (in case I had sampled very small values for testing). For test 1, the cross-validation results are RRMSE (relative RMSE) 0.36, rIQR (relative IQR) 0.33, rMAE (relative MAE) 0.25, and R2 0.68, closer to the test-set result (see figure below). For test 2, the relative indicators are also closer between cross-validation and test data, but the cross-validation accuracy still looks under-estimated: the test set has an impressive R2 of 0.79.


Cross-validation result of test 1 (figure).

Test-set results of test 1:

        RMSE    RRMSE   IQR     rIQR    MAE     rMAE    rsq
Test    8.42    0.365   7.96    0.394   6.02    0.261   0.709

Cross-validation result of test 2 (figure).


Test 2: test-set vs. cross-validation results:

        RMSE    RRMSE   IQR     rIQR    MAE     rMAE    rsq
Test    5.67    0.30    3.88    0.23    3.90    0.21    0.79
CV      7.73    0.36    6.50    0.34    5.29    0.24    0.68
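For reference, these indicators can be computed along the lines of the helper below. The relative versions divide by the mean of the observations, as described above; my assumption here is that IQR refers to the interquartile range of the prediction errors, and that rsq is the usual coefficient of determination, which may differ slightly from how the numbers above were produced.

# Sketch: absolute and relative (mean-normalised) accuracy indicators.
import numpy as np

def accuracy_indicators(obs, pred):
    obs, pred = np.asarray(obs, dtype=float), np.asarray(pred, dtype=float)
    err = pred - obs
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    iqr = np.percentile(err, 75) - np.percentile(err, 25)  # assumed: IQR of the errors
    rsq = 1 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
    m = obs.mean()
    return {"RMSE": rmse, "RRMSE": rmse / m, "IQR": iqr, "rIQR": iqr / m,
            "MAE": mae, "rMAE": mae / m, "rsq": rsq}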

This means we would draw different conclusions with different external datasets: the first indicates that the cross-validation results are over-estimated, possibly due to hyperparameter tuning leaking information from the validation data. On the contrary, the second indicates that the information leak may not exist at all; we even get an under-estimated accuracy assessment.

This demonstrates several things: 1) cross-validation is very important, as how the training and test sets are split plays a major role in the modelling and the accuracy assessment; 2) relative accuracy indicators are needed when an external validation set is used; 3) an external test set may add more bias to the accuracy assessment, possibly causing more trouble than the potential information leak. The question goes back to: how influential is the information leak?

I then used the hyperparameter settings tuned in test 1 on test 2, which means the hyperparameters are barely tuned for test 2. The result: XGBoost obtained worse results on the test set (though still better than the cross-validation results). This means the information leak is not as daunting as insufficient hyperparameter tuning.

The take-home message is that although there might be an information leak, tuning the hyperparameters is essential. With a relatively limited dataset (e.g. hundreds or a few thousand observations) but complex relationships and many variables (in my case 66), splitting the dataset into train-validation-test, as is done in deep learning for example, is not very useful and may even lead to a biased interpretation of the accuracy results.



Farewell Muenster

This is my last week in Muenster. The summer seems to be on its way, but it already feels like autumn.

I am so fortunate to have been in this group for almost four years, supervised by a great professor, perhaps the best supervisor ever. His mind is so great that I always felt mine so tiny. Sometimes I felt really blessed to be able to work with him and be guided by him, and I am afraid no one else can understand what I am talking about so well. I am so bad at expressing myself and explaining things. I love my colleagues and friends here; they are so nice, each of them! Looking back on these years, I had so much fun. Maybe this is another best four years of my life.