A very small number of trees may already reveal interesting patterns, and the learning rate matters most

Did you know that as few as 40 trees can already bring a big leap in prediction quality, even with a complex dataset and complex predictor-response relationships? I recently tested this for random forest and XGBoost, and was surprised by how much progress a small increase in the number of trees could make.

Another lesson I learned is about the learning rate for boosting. Even when the learning rate selected by cross-validation looks slow enough, it can still be too fast and cause artefacts, or strange patterns, in the spatial prediction. I often spend a long time tuning XGBoost, using both grid-search cross-validation and manual tuning. Although the final prediction accuracy can look completely reasonable, the prediction patterns always show too many edges and inconsistencies that an expert can tell are impossible, for example very low values on the roads but high values right next to them. I found plotting the spatial patterns extremely helpful in the hyper-parameter tuning process, as opposed to relying only on cross-validation accuracy metrics (RMSE, R2, IQR, MAE, etc.).
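To make the two knobs concrete, here is a minimal sketch of gradient boosting with one-split trees (stumps) in plain Python. It is a toy stand-in for XGBoost, not XGBoost itself: the sine-curve data, the function names, and the "largest jump between neighbouring predictions" roughness measure are all assumptions made for illustration. The idea is simply to print, for a few combinations of tree count and learning rate, both an accuracy number and a crude measure of how edgy the prediction surface is.

```python
import math


def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    rs = [residuals[i] for i in order]
    n, total = len(rs), sum(rs)
    best, left_sum = None, 0.0
    for k in range(1, n):
        left_sum += rs[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # no split possible between identical x values
        right_sum = total - left_sum
        # Maximising this quantity is equivalent to minimising squared error.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or gain > best[0]:
            threshold = (xs[k - 1] + xs[k]) / 2
            best = (gain, threshold, left_sum / k, right_sum / (n - k))
    return best[1:]


def fit_boosting(x, y, n_trees, learning_rate):
    """Fit boosted stumps and return a function that predicts for one x value."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, left, right = fit_stump(x, residuals)
        stumps.append((thr, left, right))
        # Shrink each tree's contribution by the learning rate.
        pred = [p + learning_rate * (left if xi <= thr else right)
                for xi, p in zip(x, pred)]

    def predict(xi):
        out = base
        for thr, left, right in stumps:
            out += learning_rate * (left if xi <= thr else right)
        return out

    return predict


def rmse(y, pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))


def max_jump(pred):
    """Largest difference between neighbouring predictions (a crude 'edge' measure)."""
    return max(abs(b - a) for a, b in zip(pred, pred[1:]))


# Hypothetical smooth 1-D "transect": the truth has no sharp edges at all.
x = [i / 10 for i in range(60)]
y = [math.sin(xi) for xi in x]

for n_trees in (5, 40):
    for lr in (0.05, 0.3, 1.0):
        model = fit_boosting(x, y, n_trees, lr)
        pred = [model(xi) for xi in x]
        print(n_trees, lr, round(rmse(y, pred), 3), round(max_jump(pred), 3))
```

Note that the RMSE here is computed on the training data only, so it says nothing about generalisation. With real spatial data the same comparison would be made on maps rather than on a single number.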

For this, I made an R Shiny page for playing around with different hyper-parameters. Check it out :-).

https://lumeng0312.shinyapps.io/xgboost/?_ga=2.229717724.1995623365.1592166857-2130394652.1592166857

OpenStreetMap data: querying, downloading, handling

OpenStreetMap (OSM) provides a tremendous amount of information. While it is becoming indispensable in urban GIS (e.g. routing), it also provides invaluable labels for developing supervised machine learning methods (particularly the currently burgeoning deep neural networks) in many applications, such as automatic road and building delineation from very high resolution (VHR) satellite imagery, which raises hopes for high-resolution global-scale mapping. For example, air pollution mapping needs global road infrastructure, but the transportation network provided by OSM is incomplete in many countries, such as China and countries in Africa. VHR satellite imagery (e.g. WorldView-2) and machine learning (particularly deep neural networks) are promising techniques to complement OSM and to evaluate the consequences of directly using OSM with missing roads to predict air pollution. Here I provide some useful tools, approaches, and notes for handling OSM data for analysis.

Method 1: Download everything and query

  • Use a mirror to download:

wget https://ftp.fau.de/osm-planet/pbf/planet-200413.osm.pbf

  • Filter roads (with osmium):

cmd = "{0} tags-filter {1}.osm.pbf nwr/{2}={3} -o gap_{4}.osm.pbf".format(osmium, fname, keyword, value, out_fname)

  • Convert to GeoPackage (with GDAL):

cmd = "ogr2ogr -f GPKG gap_{}.gpkg gap_{}.osm.pbf".format(out_fname, out_fname)
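The filter and convert steps can be chained from a single Python script. The sketch below is one possible way to do it, assuming `osmium` and `ogr2ogr` are on the PATH. The helper names (`build_commands`, `run`) and the example tag `highway=primary` are made up for illustration. It builds the commands as argument lists rather than one formatted string, which avoids shell quoting problems, and it only prints them unless `dry_run` is switched off.

```python
import shlex
import subprocess


def build_commands(fname, keyword, value, out_fname, osmium="osmium"):
    """Build the osmium filter command and the ogr2ogr conversion command."""
    source = f"{fname}.osm.pbf"
    filtered = f"gap_{out_fname}.osm.pbf"
    gpkg = f"gap_{out_fname}.gpkg"
    # Keep nodes, ways and relations (nwr) tagged keyword=value.
    filter_cmd = [osmium, "tags-filter", source,
                  f"nwr/{keyword}={value}", "-o", filtered]
    # Convert the filtered extract to a GeoPackage.
    convert_cmd = ["ogr2ogr", "-f", "GPKG", gpkg, filtered]
    return filter_cmd, convert_cmd


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the tool fails


# Hypothetical example: primary roads from the planet file downloaded above.
run(build_commands("planet-200413", "highway", "primary", "primary_roads"))
```

Calling `run(..., dry_run=False)` would actually execute the two tools in order, so the filtered `.osm.pbf` exists before the conversion starts.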

Pros: everything is stored on your hard disk. If you know exactly what you want, this may be the most straightforward option.

Cons: you cannot explore the data before downloading. It requires other tools for data processing and analysis, and you may download lots of redundant data.

Method 2: Interactive query, download, and analysis (using OSMnx):

OSMnx is a brilliant tool, and the paper is also very well written: https://www.researchgate.net/publication/309738462_OSMnx_New_Methods_for_Acquiring_Constructing_Analyzing_and_Visualizing_Complex_Street_Networks

There are two ways of installing it: with conda or pip, or with Docker.

  1. You can create a conda environment to install OSMnx, as described in the paper:

conda config --prepend channels conda-forge

conda create -n ox --strict-channel-priority osmnx
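Once the environment is active, a query takes only a few lines. The following is a minimal sketch, not taken from the paper: the wrapper function `download_street_network` and the place name in the commented example are assumptions for illustration, while `graph_from_place` and `graph_to_gdfs` are the OSMnx calls that do the work. OSMnx is imported inside the function so that the file can be loaded even where OSMnx is not installed.

```python
def download_street_network(place, network_type="drive"):
    """Download the street network of a named place and return it as a graph
    plus node and edge GeoDataFrames."""
    import osmnx as ox  # lazy import: only needed when the function is called

    # Geocode the place name and fetch its street network from OSM.
    graph = ox.graph_from_place(place, network_type=network_type)
    # Convert the graph to GeoDataFrames for plotting or further analysis.
    nodes, edges = ox.graph_to_gdfs(graph)
    return graph, nodes, edges


# Example call (needs an internet connection; the place name is illustrative):
# graph, nodes, edges = download_street_network("Utrecht, Netherlands")
# print(len(nodes), len(edges))
```

The returned `edges` GeoDataFrame can be written to disk or joined with other layers in the same way as the GeoPackage produced in Method 1.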

Pros: very convenient for querying, downloading, visualising, and analysing data (e.g. finding the fastest routes). Well documented, with lots of examples, and easily reproducible.

Cons: currently only available in Python.

Method 3: Query in QGIS (OSM package), with SQL queries

Pros: the most convenient tool for QGIS users.

Cons: not reproducible.