R, Python, or both in Machine Learning?

As an R user for almost ten years, I’m gradually switching to Python for machine learning, and pretty much everything. Not to mention deep learning, which has the community almost exclusively in Python, for other data science methods python sees a community growing faster than R. There are no doubt lots of developments in R that are not available yet in Python, like distributional forest, empirical fluctuation process-based time series structural change detection methods (party, efp, bfast, etc.) and the ggplot is extremely powerful. But that’s becoming less and less. Relatively more recent methods, such as catboost or XGBoost, have better Python APIs. classical geospatial analysis methods such as GWR (geographically weighted regression) and Gaussian processes also see lots of developments in Python. The tidyr tools also come naturally in Python (I.e. the pipes are not really needed as the programming is already fully object-based).

The Python array, data frame, geodata handling are becoming more powerful every day, but in R slower. I quite often implement things in R first as I’m most accustomed to it, but always I found a solution in Python that is more neat and simpler.

I won’t say bye to R though as there are still lots of complementary tools, it is convenient for me sometimes and several of my collaborations are still based in R, the community in R is still growing and upcoming. Programming languages function as communication tools. Emerging ideas from R are as exciting as in Python. Just to say for anyone serious about machine learning and is still a faithful R user, you will benefit from being bilingual.


How does spatial or spatiotemporal correlation affect cross-validation results?

Ever since I attended opengeohub summer school https://opengeohub.org/summer_school_2020 from 2018-2019, where in a lecture it has been stated that the spatial correlations lead to overly optimistic cross-validation results. I have appreciated the thoughts it brought to me about the effects of spatial and spatiotemporal correlations in ensemble tree (random forest or boosting)-based classification methods. Is it Ok to use it for spatial prediction? Is it a problem when the covariates (predictors) are spatially correlated, or when the response is spatially correlated, or both?

My answer is: it is not OK to use spatial validation! And it is not a problem if the samples are spatially correlated!

There is a very important fact related to the problem: spatial is nothing so special, it is one of the features, and the so called “dependency” is in every dimension! Not just in the spatial dimension. After understanding this, the reason to my answer is already almost clear.

In the lecture from opengeohub summer school, the example is not very illustrative to the problem, as it used latitude and longitude as predictors. Having said redundant predictor variables usually won’t affect random forest or boosting kind of method, these two variables are a common pitfall to be threw in a model. The reason is that these variable can easily cause extrapolation! Think about they exactly register the areas of each class. pixels are commonly clustered in a class and are far away from pixels in another class. The pixels that consists a cropland are in general far away from pixels that consists of forest. When the latitude and longitude are used, the model is going to say ok, this area is crop and other forest, no need to used any predictor variables.

The solution it gives, is to use spatial validation, which means to divide data into spatial blocks, and each time use one block of data for validation. In this way, the model has a harder time to predict because the model never see the test area before, and the longitude and latitude of course won’t know what’s there. They can only make the best guess based on the areas closing by.

Then in the lecture, a solution that can automatically get rid of this kind of predictors is given by using spatial validation in model training process, and only select 2 variables each time for growing the tree. This, of course, has a high potential of getting rid of longitude and latitude, also the elevation, for the same reason.

However, till now we have not talked about the influence of correlated data. The key point is that we don’t want to the training and test dataset to overlap.

Let’s firstly look at spatially correlated response. There is another literature, by Brenning (2012), implemented in https://mlr-org.com/docs/2018-07-25-visualize-spatial-cv/, stressing on correlated response. Notably, it also used spatial correlation. This paper, as well as the implementation, again, does not give a very good quantification of how exactly and to what extent the correlation in response causes a biased error estimation.

To delve into the effects of spatiotemporal correlations to an ensemble tree based method, let’s say random forest for regression, we can break it into three parts, the data is split into training and test, the training set is used in 1 and 2 and test in 3:

  1. model fitting using bootstrapped samples from the training set,
  2. obtaining the OOB (out of bag) error from the rest of the training set,
  3. cross validation (CV).
  • For 1: Thinking about growing a single tree using highly correlated data, what problem will it bring? 1) The universal Kriging and spatial autoregression model concerns spatially correlated residuals, but the tree model is very flexible and this may come out a smaller concern. 2) The three model concerns the best splitting point by mean of (y – y_mean)2 within each chunk divided by covariates, if lots of points are sampled close together, the split point may be missed and the best variables may not be chosen. This means even though we will grow hundreds or thousands of trees, it might be beneficial that the data are sampled more randomly over space.
  • For 2: If the bootstraps used for training correlates with OOB samples, it means the OOB error is not accurate. This, however, is less important because in practice we rely on CV to evaluate the model and less on the OOB, as the OOB is for each tree and CV is for the forest.
  • For 3: It is absolutely normal to have correlations between training and test data! Otherwise we have the extrapolating problem! In the another word, the test and training may come from different distributions! As the mentioned studies are happy that the CV result is now looking more realistic as it drops down, they may be unfairly asking the model to test on things it may has never seen before. Though we do many folds, remember we will take the average. The harm the spatial validation can do may be much more than the response is spatially correlated.


Brenning, A. (2012). Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest. In 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE. https://doi.org/10.1109/igarss.2012.6352393

About choosing a Ph.D supervisor

I have been asked about my experience studying in different countries (I did my bachelor in China, MSc in US, PhD in Germany, Postdoc in Netherlands), which do I like the best and what are the differences. This, of course, depends on which university, institute, and the group of people I worked with. There are some general patterns, but the variability is way lower than institutes or research groups within a country. I have worked close together with four professors, the dissimilarities (in terms of how they do Science and supervise students) between them are huge.

Many PhDs grow to be extremely similar to their PhD supervisors, if their supervisors are respectful to them. This is at least shown on some of my PhD and Postdoc colleagues. (For not-very-good supervisors, I heard their (former) students told me they learned how not to be a not-very-good supervisor and are really aware of being responsive. )

In short, positive or negative, the influence from a Ph.D. supervisor is on so many aspects of a researcher. So here are some of the tips I want to share. When choosing a Ph.D. supervisor, do:

  1. look at if a supervisor is still leading the field, is he /she still being active, kept updating his/her blogs, Github/Gitlab, does he/she still read literatures or watch lectures?
  2. talk to his/her current Ph.D. and M.Sc students, how much time does he/she have, how bossy and caring is he/she, does he/she really have great insights or just a list of publications?
  3. know what the supervisor is best known for, what extraordinary/interesting things he/she did and is doing?
  4. be as critical as you can. This is hard for a M.Sc., but the research capability of a professor varies greatly. Some of them are, in the Chinese saying “the master of old science“, and worse, they stick on the piece of dying science that they have worked on for so long and once brought them honour. Be their students may likely to continue building on that piece of dying work.


  1. be fooled by the supervisor’s long co-authored publication list, that means a bit more than NOTHING, if not negative. If a supervisor has more than 15 publications a year, you can think of how is this possible. Is it possible to have 15 breakthroughs per year? I’d prefer two years, a really good paper. I once saw someone has 5 first-authored paper a year, my first sense is these papers may not be so good, instead of this person is amazing. We know we need time to think in-depth, to discuss, to implement, analyse and refine.
  2. choose too quick at what at a glance looks interesting. Some research groups may have some models or software under development, this is a very cool thing, but be careful not to be constrained to a certain model or software.

How to prevent the XGBoost from overfitting

Most people using XGBoost got the experience of model over-fitting. I earlier wrote a blog about how cross-validation can be misleading and the importance of prediction patterns https://wordpress.com/view/tomatofox.wordpress.com . This time I just want to note down some very practical tips, something that cross-validation (e.g. with grid search) can’t tell us.

  1. The model overfitting is likely caused by the learning rate being too high. The default, 0.3 is usually too high. You can try 0.005, with a dataset of more than 300 observations.
  2. A very misleading statement in many publications and tutorials is that too many trees in XGBoost (or boosting in general) causes over-fitting. This is ambiguous. Many trees will NOT decrease your model performance. Increasing from the optimum setting of trees will only adding to your computational burden, but it is as safe as trees in random forest! With as many trees you can imagine, the model is as general as with less trees because the gradient just got stuck! The cross-validation result won’t change, neither will the prediction pattern.
  3. So the advice is to use a very low learning rate, say 0.001, and set as many trees as you like, say 3000, for the best fitting. Then for faster achievement of the results, reducing the number of trees. If you use a learning rate of 0.001, 1000 trees should be enough to find the global minimum, as the 0.001*1000 is already 1, the same as doing the gradient descent once with learning rate 1.

Bias in judgements of a grant system

Veni is an dutch science foundation grant for researchers who are within 3 years of the acquisition of their Ph.D.. This is the first personal grant I have applied for. I spent about 3 months working day and nights, working on the proposal, application, and rebuttal. The result is that the three reviewer reports are overall very promising, leading to great hopes, but the funding committees don’t find the proposal (let’s call it proposal A) that appealing. I am completely fine with it, but I do want to speak out my opinions about the bias in the evaluation process.

In comparison, I have a proposal B written afterwards which is with extended methodology and data sources, but a much confined study area — global-scale for the Veni proposal vs. a European country for the proposal B.

My collaborator told me the proposal B was ranked the second, almost in, only my relationship with University is slightly weaker compared to the top ranked, and most people reading it found it strong.

The proposal A even fall out of the “very good” category, that is a shock. The most negative criticism for proposal A is that it is too ambitious and they would prefer a more confined study to fully consider all the uncertainties and interpretations, and this “too ambitious” is then followed in the entire evaluation, even saying my academic performance does not match up with this ambitious goal. This is a very unfair point but this is not what I want to discuss.

I want to share my opinion on this comment of “too ambitious”. I acknowledge that in the beginning I shared the same view with the reviewers, and even argued emotionally with my current supervisor. As a data person in core, I care about uncertainty and interpretability as much as any serious data scientist, which means I also preferred a small study area with lots of data to fully evaluate all the uncertainties and the physical meanings of the atmospheric processes. However, with time passing by and with more knowledges, I do grow a convincing view of the global-scale study, as besides the social and academic impacts, this is the only way of appreciating the heterogeneities between regions, and importantly, with global coverage satellite imagery and more and more geospatial data, it is a matter of time that we make high quality, high resolution global maps, then the next step? A better exposure assessment. I want to shout out that all these evolutions will be happening within the next three years! However, for researchers never gave the global-scale modelling a deep thought, “too ambitious” is going to be the intuition, based on their many years of experiences as professors.

The other point that the committee raised is about the uncertainty from data sources, this part was in the beginning elaborated but suggested to be removed by my supervisor. This is clearly due to different perspectives from researchers from different background caring different aspects. My supervisor thinks no-one would care about the uncertainty. The proposal is limited to less than 1800 words, I agreed purely because I got to elaborate my plans, otherwise the critics would be the plan is not clear.

Reflecting on the two points, before working on the global mapping, I am completely in-line with the committees. If I were the committee, I would reach to the same comments and would also prefer the proposal B, but the key point is, my opinion changed after setting hands on the global mapping — I like both proposals equally! For the globals-scale study, I was not stating the accuracy is the same everywhere, but I did promised uncertainty assessment everywhere. Importantly, if the global mapping is not started, it won’t ever started because the committee will always find it too ambitious. And for the challenging areas there will never be an estimation, and the estimation will never be improved!

Here comes my critics for the grant system: for something no-one has done before, the judgement may be greatly biased! This problem is exacerbated by the word limit. It is hard to both convince the committee and write a plan in limited words. Some changes are needed. At least the rebuttal should be allowed to be longer. Second of all, the committee should analyse the opinions from all the reviewers carefully. It is now clearly indicated in the letter that the committee is not obliged to follow the referees’ report completely.

A very small number of trees may already reveal interesting patterns, learning rate is the most important

Do you know 40 trees already can brings a big leap to the prediction with complex dataset and predictor-response relationships? I recently tested this for random forest and XGboost, and am surprised by the huge progress it could made with a small increase in the number of trees.

Another lesson I learned is about the learning rate for boosting. Though the from the cross-validation process can look slow enough, it could still be too fast and be the cause of artefacts, or strange patterns in the spatial prediction. I often spend a long time to tune the XGBoost, using both grid-search cross-validation and manual tuning, but despite the prediction accuracy in the end could look completely rational, the prediction patterns always shows too much edges and inconsistency that an expert can tell they are impossible. For example, very low values on the roads but high values next to the road. I found plotting the spatial patterns extremely helpful in the hyper-parameter tuning process, as opposing to only based on a cross-validation accuracy matrix (RMSE, R2, IQR, MAE, etc.).

For this, I made an R shiny page for playing around different hyper-parameters, check it out :-).


OpenStreetMap data— Query, downloading, handling

OpenStreetMap provides tremendous amount of information. While it is becoming indispensable in urban GIS (e.g. routing), it also provides invaluable labels for supervised machine learning methods development (particularly currently burgeoning deep neural networks) in many applications, such automatic road and building delineation from very high resolution (VHR) satellite imagery, which arise hopes in high-resolution global-scale mapping. For examples, for air pollution mapping, we need global road infrastructure, but the transportation network provided by OSM is incomplete in many countries, such as in China and African countries. VHR satellite imagery (e.g. worldview2) and machine learning (particularly deep learning neural networks) are promising techniques to complement the OSM, and evaluate the consequences of directly using the OSM with missing roads to predict air pollution. Here I provide some useful tools/ways/notes for handling OSM for analysis.

Method 1: Download everything and query

  • Use a mirror to download:

wget https://ftp.fau.de/osm-planet/pbf/planet-200413.osm.pbfusing

  • Filter roads: (With osmium):

cmd = “{0} tags-filter {1}.osm.pbf nwr/{2}={3} -o gap_{4}.osm.pbf”.format(osmium, fname, keyword, value, out_fname)

  • Convert to gpkg: (with GDAL)

cmd = “ogr2ogr -f GPKG gap_{}.gpkg gap_{}.osm.pbf”.format(out_fname, out_fname)

Pros: store everything on hard disk, if you know exactly what you want, this maybe the most straight-forward option.

Cons: cannot explore data before downloading. Requires other tools for data processing and analysis. May download lots of redundant data.

Method 2: Interactive query, download, and analysis (using OSMnx):

OSMnx is a brilliant tool, the paper is also very well written. https://www.researchgate.net/publication/309738462_OSMnx_New_Methods_for_Acquiring_Constructing_Analyzing_and_Visualizing_Complex_Street_Networks

Two ways of installing: conda or pip, and docker

  1. You can create a conda environment to install the OSMnx, as indicated in the manuscript:

conda config –prepend channels conda-forge

conda create -n ox –strict-channel-priority osmnx

Pros: very convenient to query, download, visualise, and analyse (e.g. find the fastest routes) data, easily reproducible. Well documented and lots of examples. Reproducible.

Cons: currently only with Python

Method 3: Query (using QGIS) osm package, with SQL query

Pros: the most convenient tool for QGIS users,

Cons: not reproducible.

Reducing the data size to 1/10, only by reshaping!

On the way to the big data concept and burgeoning software, don’t forget to inspect if the data frame or spreadsheet has stored too many redundant records. Just by reshaping, I got the data size reduced to 1/10! 

Since I got my hands on global air pollution mapping, I managed to gather a most comprehensive set of station measurements, through collaborations and investigating on the open science community. My ambition was to do it time-resolved, so I gathered hourly data of a year (8760 hours). I got 6 datasets, some dumped into 365 spread sheets, with several air pollutants, with a total size of 22 Gb; some stored in 13 spread sheets, sorted by space-time, some stored wide table, some long, some has UTM time, some local time. In short, all of them need to be wrangled for a consistent structure for querying and analysis. 

Though focused on array data management during my Ph.D., I’ve never thought point data would bring me trouble in storage. I thought it is easy, I just need a relation database or HDF5-based tools.

I never thought a meeting with my software developer colleagues would be useful, as I thought I know a bit about HDF-5 or a relational database, just to put them in use shouldn’t be hard.

But by a mere look at the column names I provided, my colleague said ” there is lots of replications in the data, isn’t it?”

I have the columns: time, longitude, latitude, sensor type, values, urban type,  … The longitude, latitude, etc. are replicated multiple times when the time integrates.

I have thought this is the format that I’m most familiar for the subsequent analysis, this is the first time I started to care how much space are wasted.

I went back to build a relational database with two tables, one (data table) with

time, value, UID ( a unique id linking to each coordinates)

The other (attribute table) with

UID, Longitude, Latitude, …. 

I firstly investigated on a small dataset, which has 70 stations, that reduced the file size from around 75 Mb to 17 Mb.

Now since I’m more aware of the storage, I got uneasy by the left-over replications: the UID is repeating for time iterations, and time repeating among UID.

I then rearranged the long table to the wide table, with each column a station. This makes the size almost a third of the data table. Compared with the original data, it shrinked to 1/10 of the size. 

Before going for “sophisticated” big data solutions, thinking about “never repeating data”.

How to assess accuracy of a machine learning method, when observations are limited?

When we have nonlinear relationships, ensemble tree-based methods (e.g. xgboost, random forest, general boosting machine) may give a boost to your prediction accuracy. However, in situation where we don’t have a very big dataset (at least 10,000), such as in air pollution mapping, when we have to derive complex relationships between air pollution and predictors using several hundreds or thousands of ground monitor observations, how can we validate our model?

The reason that this poses a problem is we need to tune our hyperparameters, and the observations we leave in or out of the training set may alter the relationships we derived. For XGBoost, more than 5 parameters can be tuned and a change in one hyperparameter may give a very different prediction result. By tuning hyperparameters, we run into the argument of information leak, as the data that we fit our hyperparameters onto are then used for accuracy assessment.

The most intuitive way of solving the potential information leak is to include an external dataset, or sampling the dataset at hand additionally a completely untouched test set. A cross-validation set is used to tune hyperparameters while cross-validation, and the result is test again on a test set. However, this is also problematic if the splitting between cross-validation and the test set can lead to a different result.

I did an experiment with ground NO2 observations of Germany and Netherlands. I have altogether 413 observations, I held out 10% for independent testing. I did a grid-search for XGBoost. For two tests, I used different seeds to randomly choose the independent test set. (the test set is treated as an external dataset). The two samplings led to considerably different results, particularly in terms of the test RMSE, and also in hyperparameter values that are optimized. The differences are in learning rate and maximum tree depth, the first time the learning rate=0.05 and maximum tree depth 3, and the second time learning rate = 0.1 and maximum tree depth = 4. The first time the test RMSE is 8.4, about 1 RMSE larger than the cross-validation result, the second time 5.7, about 1.8 RMSE smaller than the cross-validation result.


I then looked at relative accuracy indicators (normalize RMSE etc. by the mean of observations) to reduce the effects of magnitudes (in case I sampled very small values for testing). For test 1, the cross-validation results are RRMSE (relative RMSE): 0.36, rIQR (relative IQR): 0.33, rMAE (relative MAE): 0.25, and R2 0.68 – closer to the test data result (see figure below). For test 2, the relative indicators are also closer between cross-validation and test data, but still, the cross-validation accuracy looks to be underestimated – the test set has an impressive R2 of 0. 79.


cross-validation result of test 1

Test results##      RMSE     RRMSE       IQR      rIQR       MAE      rMAE     rsq ## 8.4197599 0.3650937 7.9581566 0.3939681 6.0183271 0.2609639 0.7093221  cross-validation result of test 2.


Test data##      RMSE     RRMSE       IQR        rIQR     MAE      rMAE       rsq Test  5.67         0.30      3.88         0.23    3.9      0.21      0.79 CV.   7.73         0.36       6.5         0.34    5.29     0.24      0.68

This means we will conclude differently with different external datasets, the first indicates the cross-validation results are over-estimated, possibly due to hyperparameter tuning which leaks information from our validation datasets. On the contrary, the second indicates the information leak may not existing at all, we even get under-estimated accuracy assessment.

 This demonstrated several things: 1) cross-validation is very important, as how the training -test sets are split play a major role in the modelling and accuracy assessment. 2) A relative accuracy indicator is needed when an external validation set is used. 3) An external test set probably may add in more bias in accuracy assessment, possibly causing more trouble than the potential information leak.  The question goes back to: how influential is the information leak?

I then used the hyperparameter settings tuned in test 1 on the test 2, which means we don’t tune hyperparameter much in test 2. The result is the XGBoost obtained worse results on the test set (still better than the cross-validation results). This means the information leak is not as daunting as not tuning hyperparameters sufficiently.

The take home message is though there might be information leak, tuning hyperparameter is essential. With a relatively limited dataset e.g. hundreds or thousands but the relationships are complex and variables multiple (in my case 66), splitting the dataset into train-validation-test, as in deep learning for example, is not useful, but may even lead to biased interpretation of the accuracy results.