Why doesn't a very large number of trees overfit boosting?

We know that boosting sequentially fits each new tree to the residuals of the previous trees, which raises a question: does boosting resemble one very large tree in some sense, since it keeps growing "vertically"? If so, it should be affected by the number of trees, i.e. too many trees would cause overfitting. A small experiment suggests otherwise: knowing that 1000 trees give me an optimal model, I grew 10,000 trees and found the results almost unchanged, just like a random forest.

If you think about the problem from its origin, as a gradient-descent solution, it becomes quite straightforward: each boosting iteration uses the residuals from all observations to build the next tree, and once the gradient no longer descends (it gets stuck in a minimum), the predictions stay the same. This is the main difference from a single very large tree, which does not descend a gradient but keeps splitting at each node using a local optimiser (finding the split that leads to the least variance within each segment). The segments become smaller and smaller until the tree completely overfits.
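
As a minimal sketch of the small experiment above (the dataset and parameter values here are illustrative assumptions, using scikit-learn's gradient boosting), one can compare cross-validation errors with a moderate and a very large number of trees:

```python
# Illustrative sketch: once the learning rate is small, adding trees far
# beyond the optimum barely changes the cross-validation error.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for n_trees in (1000, 10000):
    model = GradientBoostingRegressor(
        n_estimators=n_trees, learning_rate=0.005, max_depth=3, random_state=0
    )
    rmse = -cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()
    print(n_trees, "trees, CV RMSE:", round(rmse, 3))
```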

Niche in Geoscience? No.

“Finding a niche” is a sort of “holy grail” advice that senior researchers pass on to young researchers. Many professors believe that finding their niche led to their success. But they forget that this probably held decades ago; in the modern information age, a “niche” doesn’t exist in Geoscience, and it shouldn’t. Anyone can and should be able to build on top of others’ work. Open science has shown that this is what accelerates science the fastest. Twitter (a pioneer in completely opening its development platform at an early stage) demonstrated it by growing into the giant it is today.

I was told by professors I truly trust, respect, and appreciate that I should find my niche, and I wrote this short blog because I have heard people telling others “you should find your niche” or “we should find our niche” a few times recently. I know they are sharing their precious experience with the utmost sincerity, but useful experience has an expiration date. Trying to find a niche, one may go to the extreme of doing things others won’t, invest against the spirit of open science, or simply become discouraged and lose motivation.

A better way, drawn from what I have seen and what I want to say to myself: don’t be afraid to choose a very hot topic; sit down but keep your head up, keep your eyes open, and keep moving; surpass the years-long hard work of others and let others do the same to you in return.

Deep learning resources

There has never been a better era to teach yourself deep learning! Besides the well-known resource platforms such as Kaggle and the machine learning roadmap, I recommend several resources that I found really amazing, along with a course sequence to follow. The very first step is still the courses offered on Coursera, or the Stanford CS231n on YouTube (see item 4 below); they are great accelerators!

  1. Dive into Deep Learning https://d2l.ai/: “Interactive deep learning book with code, math, and discussions.” This learning material is a classic! I got to know it late, but anyone can benefit from it at any stage. All the scripts can be run in Google Colab, and the explanations are amazing. The Chinese version is the top seller in the Chinese book market, and it reads as if it had been written by Chinese authors, unlike many books with very rough translations. Many universities already use it for classes, and an AWS space can be applied for free for teaching purposes. This book, interactive as its title suggests, may be a better start than the two classical deep learning books, namely Deep Learning with Python and Deep Learning, as it is up to date, very practical with real-life scripts, and enables discussions.
  2. https://distill.pub/: A fantastic online journal with great visualisations.
  3. https://paperswithcode.com/: Papers with code, as the name suggests; this is a great trend being pushed by the machine learning field. In the same vein is OpenReview.
  4. Courses on YouTube: The sequence I recommend watching is (1) Stanford CS231n (the winter or summer semester), which is the most detailed and classical course; (2) MIT 6.S191, which is quite a good introduction to the deep learning realm, less detailed; (3) Deep Unsupervised Learning by Pieter Abbeel at UC Berkeley, for people who would like to dive deeper; (4) DeepMind x UCL | Deep Learning Lectures, which is more fast-paced and advanced, and gives the audience a glimpse into the newest developments up to 2020.
  5. For people who can read Chinese, CSDN hosts numerous insightful blogs and resources. CSDN has been around for ages, but I only got to know it recently; the articles published there have deepened my understanding greatly, and I am inspired by the enthusiasm of the community.
  6. Maybe needless to mention: follow people’s ResearchGate, GitHub, Twitter, and LinkedIn, and subscribe to their YouTube channels so that you always stay updated.

I will keep updating the list. Enjoy learning!

Sharing a great explanation of PCA

PCA was the beginning of my spatiotemporal data analysis journey and accompanied me all the way through my PhD study. It can be understood simply as an orthogonal eigen-decomposition of the covariance matrix, with the variance of each component arranged in decreasing order. However, its links to linear regression, ANOVA, etc. were never imprinted in my mind, and I kept feeling that I did not understand it completely and kept trying to demystify it. Now I have found the best illustration of what confused me; enjoy reading!

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
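
To make the description above concrete (a minimal sketch with synthetic data; NumPy and scikit-learn are assumed), the eigen-decomposition of the covariance matrix reproduces what a PCA routine reports:

```python
# Minimal sketch: PCA as an eigen-decomposition of the covariance matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)                      # centre the data

cov = np.cov(Xc, rowvar=False)               # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigen-decomposition
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                                          # variance per component
print(PCA(n_components=3).fit(X).explained_variance_)  # should match
```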

Deep learning in remote sensing image segmentation problems 1 — boundary delineation

Deep learning has been used for building footprint delineation, road extraction, and coastline delineation, among others, and the focus is on accurate boundary delineation. Below are the mainstream directions I am aware of; several of them appear in the state of the art, others in daily experiments. A convincing recommendation of the optimal method, or of a combination of them as an ultimate solution, may come soon.

Credit: figure from Su and Zhang (2017), ISPRS Journal of Photogrammetry and Remote Sensing.

1. Use Learning Attraction Field Representation.

The method was proposed for line segment detection in "Learning Attraction Field Representation for Robust Line Segment Detection", which reformulates the problem as a “coupled region colouring problem” [1].

2. Use a more boundary-specific loss function.
Loss functions play an essential role in machine learning. Many loss functions have been proposed, but a comprehensive evaluation of them is still needed:

As a first attempt, and for binary segmentation, one can try RMSE on a distance metric (a minimal sketch follows item 2.2 below). Boundary-specific loss functions are proposed in:

2.1 Boundary Loss for Remote Sensing Imagery Semantic Segmentation

2.2 Boundary loss for highly unbalanced segmentation
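
Below is a minimal sketch of the simple first attempt mentioned above, a distance-weighted squared error, which is only one possible reading of "RMSE on a distance metric" and is not the loss from either of the two papers; PyTorch and SciPy are assumed:

```python
# Illustrative sketch of a distance-based boundary penalty for binary
# segmentation (an assumption-level example, not the published losses).
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def boundary_distance_loss(pred_prob, target_mask):
    """pred_prob: (B, H, W) probabilities; target_mask: (B, H, W) tensor of 0/1."""
    # Distance (in pixels) of every pixel to the ground-truth object boundary.
    dist = np.stack([
        distance_transform_edt(m) + distance_transform_edt(1 - m)
        for m in target_mask.cpu().numpy()
    ])
    dist = torch.as_tensor(dist, dtype=pred_prob.dtype, device=pred_prob.device)
    # Errors far from the true boundary are penalised more heavily than errors
    # right at the boundary, discouraging large boundary drift.
    return torch.mean(dist * (pred_prob - target_mask.float()) ** 2)
```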

3. Extract the boundary first with a conventional edge detection algorithm and use it as a feature input for training.

This simple addition was proposed by a colleague, and he obtained an impressive improvement in IoU, from around 55% to 62%, in his building detection case study. This really calls for a comparison with all the other, more complex methods: what are the REAL reasons behind the improvements? Many people are increasingly disappointed by current publications, as new methods are published with improvements that seem a matter of chance and are not linked to other possibilities.
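
A minimal sketch of this idea, stacking a conventional edge map as an extra input channel (illustrative only; scikit-image is assumed, and the Sobel filter stands in for whatever edge detector the colleague actually used):

```python
# Illustrative sketch: append a Sobel edge map as an extra input channel
# before feeding the image to a segmentation network (assumes scikit-image).
import numpy as np
from skimage import color, filters

def add_edge_channel(rgb_image):
    """rgb_image: (H, W, 3) float array in [0, 1]; returns (H, W, 4)."""
    gray = color.rgb2gray(rgb_image)
    edges = filters.sobel(gray)                  # conventional edge detector
    edges = edges / (edges.max() + 1e-8)         # rescale to [0, 1]
    return np.concatenate([rgb_image, edges[..., None]], axis=-1)
```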

4. Binary segmentation as an edge detection problem

Current deep learning applications in remote sensing image classification mostly take the form of image segmentation. Vector labels are commonly rasterised into region masks for training, but this does NOT have to be the case! For binary problems such as building footprint delineation, one can turn the problem back into an edge detection one, which opens a new door of opportunities, for example the crisp edge detection below:

Credit: figure from Huan et al., Unmixing Convolutional Features for Crisp Edge Detection.
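
As a small illustration of point 4 (an assumption-level sketch, not the method of the cited paper), one can derive boundary labels from the usual rasterised footprint masks so that the same data can train an edge detector; SciPy is assumed:

```python
# Illustrative sketch: derive edge labels from a rasterised binary footprint
# mask so the task can be trained as edge detection (assumes SciPy).
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def mask_to_edges(mask, width=1):
    """mask: (H, W) boolean footprint mask; returns a boolean boundary mask."""
    structure = np.ones((3, 3), dtype=bool)
    dilated = binary_dilation(mask, structure, iterations=width)
    eroded = binary_erosion(mask, structure, iterations=width)
    return dilated & ~eroded   # morphological gradient = pixels near the boundary
```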

R, Python, or both in Machine Learning?

As an R user of almost ten years, I’m gradually switching to Python for machine learning, and for pretty much everything else. Deep learning aside, whose community is almost exclusively in Python, other data science methods also see a community growing faster in Python than in R. There are, no doubt, plenty of developments in R that are not yet available in Python, such as distributional forests and empirical fluctuation process-based structural change detection for time series (party, efp, bfast, etc.), and ggplot is extremely powerful. But such cases are becoming fewer and fewer. Relatively recent methods, such as CatBoost or XGBoost, have better Python APIs, and classical geospatial analysis methods such as GWR (geographically weighted regression) and Gaussian processes also see lots of development in Python. The tidyr-style tools also come naturally in Python (i.e. pipes are not really needed, as the programming is already fully object-based).

Python's array, data frame, and geodata handling are becoming more powerful every day, while R moves more slowly. I quite often implement things in R first, as I’m most accustomed to it, but I almost always find a Python solution that is neater and simpler.

I won’t say goodbye to R, though: there are still lots of complementary tools, it is sometimes more convenient for me, several of my collaborations are still based in R, and the R community keeps growing. Programming languages function as communication tools, and emerging ideas from R are as exciting as those from Python. Just to say: if you are serious about machine learning and are still a faithful R user, you will benefit from being bilingual.

How does spatial or spatiotemporal correlation affect cross-validation results in ensemble tree-based classification?

Ever since I attended the OpenGeoHub summer school (https://opengeohub.org/summer_school_2020) in 2018-2019, where a lecture stated that spatial correlation leads to overly optimistic cross-validation results, I have appreciated the questions it raised for me about the effects of spatial and spatiotemporal correlation in ensemble tree-based (random forest or boosting) classification methods. Is it OK to use them for spatial prediction? Is it a problem when the covariates (predictors) are spatially correlated, when the response is spatially correlated, or both?

The example in the OpenGeoHub lecture, however, is not very illustrative, as it used latitude and longitude as predictors. Although redundant predictor variables usually do not hurt random forest or boosting methods, these two variables are a common pitfall to throw into a model. The reason they can only be harmful is that they exactly register the area of each class: pixels of one class are commonly clustered together and far away from pixels of another class. The pixels that make up a cropland are in general far away from the pixels that make up a forest. When latitude and longitude are used, the model will simply say: OK, this area is crop and that one forest, no need to use any other predictor variables.

The solution it gives is spatial validation: divide the data into spatial blocks and, in each fold, hold out one block for validation. In this way the model has a harder time predicting, because it has never seen the test area before, and longitude and latitude of course cannot know what is there; they can only make the best guess based on nearby areas.
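
A minimal sketch of such spatial block validation (the grid-based blocks, the synthetic data, and the use of scikit-learn's GroupKFold are my assumptions, not the lecture's exact implementation):

```python
# Illustrative sketch: spatial block cross-validation by assigning each sample
# to a coarse grid cell and holding out whole cells together.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 1000
lon, lat = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
X = np.column_stack([lon, lat, rng.normal(size=n)])   # toy covariates incl. coordinates
y = (lat > 5).astype(int)                             # spatially clustered classes

# Assign every sample to a 2-degree grid cell; GroupKFold keeps cells intact.
groups = np.floor(lon / 2).astype(int) * 100 + np.floor(lat / 2).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
spatial_cv = cross_val_score(rf, X, y, groups=groups, cv=GroupKFold(n_splits=5))
random_cv = cross_val_score(rf, X, y, cv=5)
print("spatial-block CV:", spatial_cv.mean(), "random CV:", random_cv.mean())
```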

Then the lecture gives a solution that can automatically get rid of this kind of predictor: use spatial validation inside the model training process and select only two variables at a time for growing each tree. This, of course, has a high chance of dropping longitude and latitude, and also elevation, for the same reason.

However, so far we have not talked about the influence of correlated data. The key point is that we do not want the training and test datasets to overlap.

Let us first look at a spatially correlated response. There is another study, by Brenning (2012), implemented in https://mlr-org.com/docs/2018-07-25-visualize-spatial-cv/, that stresses the correlated response; notably, it also used spatial validation. This paper, as well as the implementation, again does not give a very good quantification of how exactly, and to what extent, correlation in the response causes biased error estimation.

To delve into the effects of spatiotemporal correlation on an ensemble tree-based method, say random forest for regression, we can break the process into three parts. The data are split into training and test sets; the training set is used in 1 and 2, and the test set in 3:

  1. model fitting using bootstrapped samples from the training set,
  2. obtaining the OOB (out of bag) error from the rest of the training set,
  3. cross validation (CV).
  • For 1: Think about growing a single tree with highly correlated data: what problems will it bring? 1) Universal kriging and spatial autoregressive models worry about spatially correlated residuals, but the tree model is very flexible, so this may turn out to be a smaller concern. 2) The tree model chooses the best splitting point by the mean of (y − y_mean)² within each chunk partitioned by the covariates; if many points are sampled close together, the best split point may be missed and the best variables may not be chosen. This means that, even though we grow hundreds or thousands of trees, it might still be beneficial for the data to be sampled more randomly over space.
  • For 2: If the bootstrap samples used for training are correlated with the OOB samples, the OOB error is not accurate. This, however, is less important, because in practice we rely on CV to evaluate the model rather than the OOB error; the OOB error is computed per tree, while CV evaluates the whole forest.
  • For 3: If there is correlation between what is used for training and what is used for testing in CV, we risk the training and test datasets overlapping; this is the point the OpenGeoHub summer school makes. However, spatial validation can itself cause a serious problem: the test and training data may come from different distributions! While the mentioned studies are happy that the CV result now looks more realistic because it drops, they may be unfairly asking the model to predict on things it has never seen before. Even though we do many folds, remember that we take the average. The harm that spatial validation can do may be much greater than that done by a spatially correlated response.

This article is to be continued.

References:

Brenning, A. (2012). Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest. In 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE. https://doi.org/10.1109/igarss.2012.6352393

About choosing a Ph.D. supervisor

I have been asked about my experience studying in different countries (I did my bachelor's in China, my MSc in the US, my PhD in Germany, and my postdoc in the Netherlands): which do I like best, and what are the differences? This, of course, depends on the university, the institute, and the group of people I worked with. There are some general patterns between countries, but that variability is far smaller than the variability between institutes or research groups within a country. I have worked closely with four professors, and the dissimilarities between them (in terms of how they do science and supervise students) are huge.

Many PhDs grow to be extremely similar to their PhD supervisors, if their supervisors treat them with respect; I have at least seen this in some of my PhD and postdoc colleagues. (As for not-so-good supervisors, their (former) students have told me they learned how not to become such a supervisor themselves and are now very conscious about being responsive.)

In short, positive or negative, a Ph.D. supervisor influences so many aspects of a researcher. So here are some tips I want to share. When choosing a Ph.D. supervisor, do:

  1. look at whether the supervisor is still leading the field: is he/she still active, still updating his/her blog and GitHub/GitLab, still reading the literature and watching lectures?
  2. talk to his/her current Ph.D. and M.Sc. students: how much time does he/she have, how bossy or caring is he/she, does he/she really have great insights or just a list of publications?
  3. know what the supervisor is best known for, and what extraordinary or interesting things he/she has done and is doing;
  4. be as critical as you can. This is hard for an M.Sc. student, but the research capability of professors varies greatly. Some of them are, as the Chinese saying goes, “masters of old science”, and worse, they stick to the piece of dying science that they have worked on for so long and that once brought them honour. Becoming their student will likely mean continuing to build on that piece of dying work.

Don’t

  1. be fooled by a supervisor’s long co-authored publication list; it means little more than NOTHING, if not something negative. If a supervisor has more than 15 publications a year, think about how that is possible. Is it possible to have 15 breakthroughs per year? I would prefer one really good paper every two years. I once saw someone with 5 first-authored papers in a year, and my first thought was that the papers may not be so good, rather than that this person is amazing. We know we need time to think in depth, to discuss, to implement, to analyse, and to refine.
  2. choose too quickly based on what looks interesting at a glance. Some research groups may have models or software under development, which is a very cool thing, but be careful not to become constrained to a particular model or piece of software.

How to prevent XGBoost from overfitting

Most people using XGBoost have experienced model overfitting. I wrote a blog earlier about how cross-validation can be misleading and about the importance of prediction patterns (https://wordpress.com/view/tomatofox.wordpress.com). This time I just want to note down some very practical tips, things that cross-validation (e.g. with grid search) can’t tell us.

  1. Overfitting is most likely caused by a learning rate that is too high. The default, 0.3, is usually too high; you can try 0.005 for a dataset of more than 300 observations.
  2. A very misleading statement in many publications and tutorials is that too many trees in XGBoost (or boosting in general) cause overfitting. This is ambiguous. Many trees will NOT decrease your model performance. Increasing the number of trees beyond the optimum only adds to your computational burden, but it is as safe as adding trees to a random forest! With as many trees as you can imagine, the model is as general as with fewer trees, because the gradient simply gets stuck: the cross-validation result won’t change, and neither will the prediction pattern.
  3. So the advice is to use a very low learning rate, say 0.001, and set as many trees as you like, say 3000, for the best fit. Then, to obtain results faster, reduce the number of trees. If you use a learning rate of 0.001, 1000 trees should be enough to find the minimum, as 0.001 × 1000 is already 1, equivalent to doing gradient descent once with a learning rate of 1. A small configuration sketch follows this list.
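
As a minimal sketch of these settings (the data are synthetic and the exact parameter values are only the starting points suggested above), using the scikit-learn API of XGBoost:

```python
# Illustrative sketch: a low learning rate with a generous number of trees
# (parameter values follow the tips above; the data here are synthetic).
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

model = XGBRegressor(
    learning_rate=0.005,   # much lower than the default of 0.3
    n_estimators=3000,     # many trees are safe once the learning rate is low
    max_depth=4,
    subsample=0.8,
)
print(cross_val_score(model, X, y, cv=5,
                      scoring="neg_root_mean_squared_error").mean())
```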

Bias in judgements of a grant system

The Veni is a Dutch science foundation grant for researchers within 3 years of obtaining their Ph.D. It is the first personal grant I have applied for. I spent about 3 months working day and night on the proposal, the application, and the rebuttal. The result: the three reviewer reports were overall very promising, raising great hopes, but the funding committee did not find the proposal (let’s call it proposal A) that appealing. I am completely fine with that, but I do want to voice my opinion about the bias in the evaluation process.

For comparison, I have a proposal B, written afterwards, with an extended methodology and more data sources, but a much more confined study area: global scale for the Veni proposal vs. a single European country for proposal B.

My collaborator told me that proposal B was ranked second, almost funded; only my relationship with the university was slightly weaker than that of the top-ranked applicant, and most people reading it found it strong.

Proposal A even fell out of the “very good” category, which was a shock. The most negative criticism of proposal A was that it is too ambitious and that the committee would prefer a more confined study that fully considers all uncertainties and interpretations. This “too ambitious” then followed through the entire evaluation, even to the point of saying that my academic performance does not match such an ambitious goal. That is a very unfair point, but it is not what I want to discuss.

I want to share my opinion on this comment of “too ambitious”. I acknowledge that in the beginning I shared the reviewers’ view, and even argued emotionally with my current supervisor. Being a data person at the core, I care about uncertainty and interpretability as much as any serious data scientist, which means I also preferred a small study area with lots of data, so that all the uncertainties and the physical meaning of the atmospheric processes could be fully evaluated. However, as time passed and my knowledge grew, I developed a convinced view of the global-scale study: besides the social and academic impact, it is the only way to appreciate the heterogeneity between regions, and importantly, with global-coverage satellite imagery and ever more geospatial data, it is only a matter of time before we make high-quality, high-resolution global maps. Then the next step? Better exposure assessment. I want to shout out that all of this will happen within the next three years! However, for researchers who have never given global-scale modelling deep thought, “too ambitious” will be the intuition, based on their many years of experience as professors.

The other point the committee raised is about the uncertainty in the data sources. This part was elaborated in an early draft, but my supervisor suggested removing it. This clearly comes down to researchers from different backgrounds caring about different aspects; my supervisor thought no one would care about the uncertainty. The proposal was limited to fewer than 1800 words, and I agreed purely so that I could elaborate on my plans, otherwise the criticism would have been that the plan is not clear.

Reflecting on the two points: before working on global mapping, I was completely in line with the committee. If I were on the committee, I would have reached the same comments and would also have preferred proposal B. But the key point is that my opinion changed after getting my hands on global mapping, and I now like both proposals equally! For the global-scale study, I was not claiming that the accuracy would be the same everywhere, but I did promise an uncertainty assessment everywhere. Importantly, if global mapping is never started, it will never start, because the committee will always find it too ambitious. And for the challenging areas there will never be an estimate, and the estimates will never improve!

Here comes my criticism of the grant system: for something no one has done before, the judgement may be greatly biased! The problem is exacerbated by the word limit; it is hard to both convince the committee and lay out a clear plan in so few words. Some changes are needed. At the very least, the rebuttal should be allowed to be longer. Secondly, the committee should analyse the opinions of all the reviewers carefully; the letter now clearly states that the committee is not obliged to follow the referees’ reports completely.