Reinstalling GDAL and the R packages raster, sf, and stars

As a geoscientist who works a lot with R, I just had another worst day ever installing GDAL and PROJ. Python is less problematic because you can always keep everything clean in a conda environment.

With R, you can use Docker, but it is not as convenient.

If GDAL itself works, e.g. sf, stars, and rgdal all load fine but raster does not, then you probably need to reinstall the package “terra”! I tried many things with my GDAL installation, but this is what came to the rescue.

The GDAL problem probably happens more often on Linux, because on Windows cleaning up and uninstalling is easier.

The problem was that I had GDAL 3.4, which is incompatible with some software, so I installed GDAL 2.4, and then of course PROJ complained. Then I started a loop of removing, reinstalling, manually fixing the gdal-config file, manually removing and downloading things, searching for solutions, and trying again and again.

But the problems were just endless, and things got all messed up. Well, you still learn something: you learn more about GDAL, but nothing useful.

All I needed was a clean wash of everything GDAL. This blog comes to the rescue.

But sudo apt-get --reinstall install gdal-bin never cleans things up.

So what you need to do is:
sudo apt-get purge --auto-remove libgdal
sudo apt-get purge --auto-remove libgdal29
sudo apt-get purge --auto-remove gdal-bin
sudo apt-get purge --auto-remove libkml-dev
sudo apt-get purge --auto-remove libproj-dev


Then
sudo apt-get install libgdal-dev gdal-bin libproj15 libproj19 libproj-dev

A trick for the "sf" package is
sudo ln -s /usr/lib/x86_64-linux-gnu/libproj.so.15 /usr/lib/libproj.so.12

I give full credit to this excellent post: https://bertelsen.ca/post/gdal-3-3-1-on-ubuntu/

I hope this is now the end of my GDAL headache.

news: Spatial modelling and efficient computation workshops by Joaquin Cavieres 

We will be hosting two very interesting workshops in spatial modelling and efficient computation, the first in spatiotemporal modelling with INLA and the second in Template Model Builder. The workshops are on 19-01 and 26-01, respectively, from 2 pm (Berlin time), lectured by Joaquin Cavieres, PhD candidate in statistics from Universidad de Valparaíso. 

You are welcome to join online! Please find the details below.

Workshops

Time: Jan 19, 2022 02:00 PM Amsterdam, Berlin, Rome, Stockholm, Vienna

Topic: Beyond the classical Kriging and the least square fit: Spatial and Spatio temporal modelling with INLA

Summary: INLA is a methodology implemented in an R package to fit statistical models in a specific class of latent models, called “latent Gaussian models”. This will be an introductory class to R-INLA, and the main idea is to show how we can fit different statistical (spatial) models with this methodology.

Zoom Link:

https://uni-bayreuth.zoom.us/j/63888521495?pwd=aFI2OEJ5UnpRQW4rOGdFWTB5aWhRZz09

Preparation for the workshop:

  1. Install R
  2. Install the packages below:

install.packages("devtools")

library(devtools)

install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

devtools::install_github("julianfaraway/brinla")

Install R-INLA:

install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/testing"), dep=TRUE)

For more information, please refer to: https://www.r-inla.org/download-install

Main topics:

  1. Install R-INLA
  2. Theoretical description of INLA
  3. Types of models to fit
  4. Examples in R

References to R-INLA web:

https://www.r-inla.org/

Materials:

Bayesian linear regression with INLA

https://julianfaraway.github.io/brinla/

Spatial modelling with INLA:

https://becarioprecario.bitbucket.io/spde-gitbook/ (advanced)

Time: Jan 26, 2022 02:00 PM Amsterdam, Berlin, Rome, Stockholm, Vienna

Topic: A short introduction to Template Model Builder (TMB)

Zoom Link:

https://uni-bayreuth.zoom.us/j/65600578121?pwd=MkhObmE4SDNXRmU4OU9PY2VmNTU4QT09

Summary: TMB (Template Model Builder) is an R package for fitting statistical latent variable models to data. Unlike most other R packages the model is formulated in C++. This provides great flexibility, but requires some familiarity with the C/C++ programming language.

It will be an introductory class to TMB and the main idea is to show how we can fit different statistical models through this software. 

Install TMB:

install.packages("TMB")

Please, refer to: https://github.com/kaskr/adcomp

Additionally, on Windows you need to install Rtools (in a version compatible with your R and RStudio versions!).

Main topics:

  1. Install TMB
  2. Theoretical description of TMB and how it works. 
  3. Types of models to fit
  4. Examples in R

References to TMB web:

https://kaskr.github.io/adcomp/Introduction.html

Materials:

https://github.com/kaskr/adcomp/tree/master/tmb_examples

Lecturer:

Joaquin Cavieres PhD (c) in Statistics

Universidad de Valparaíso

On hyperparameter optimisation of random forest and when we use Lasso for post-processing trees

In my OpenGeoHub2020 lecture https://github.com/mengluchu/OpenGeoHub2020, I talked about applying Lasso as a post-processing step for random forest to shrink the tree space. This approach was originally proposed in the famous book “The Elements of Statistical Learning” by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie, but I have not seen it implemented or used elsewhere. I implemented it in my air pollution spatial prediction model and in my R package APMtools. The function is here: https://github.com/mengluchu/APMtools/blob/master/R/prediction_with_pp_La.R My experience is that it improves prediction accuracy and reaches an accuracy almost as good as that of a meticulously tuned XGBoost.
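To make the idea concrete, here is a minimal Python sketch of Lasso post-processing of a random forest. It is not the APMtools function linked above: the synthetic data, the scikit-learn models, and the values of `alpha`, `max_features`, and `min_samples_leaf` are all assumptions chosen for illustration. The Lasso is fitted on the per-tree predictions for the training set, so trees with a zero coefficient are effectively dropped.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Synthetic regression data (a hypothetical stand-in for real predictors).
X = rng.uniform(size=(600, 8))
y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
     + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(scale=1.0, size=600))
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Step 1: grow a forest with a "safe" (large) number of trees.
n_trees = 200
rf = RandomForestRegressor(n_estimators=n_trees, max_features=3,
                           min_samples_leaf=5, random_state=0)
rf.fit(X_train, y_train)


def tree_predictions(forest, X):
    """One column per tree: the prediction of each individual tree."""
    return np.column_stack([tree.predict(X) for tree in forest.estimators_])


P_train = tree_predictions(rf, X_train)
P_test = tree_predictions(rf, X_test)

# Step 2: Lasso on the tree predictions; alpha is an arbitrary choice here
# and would normally be tuned by cross-validation.
lasso = Lasso(alpha=0.05, max_iter=50_000)
lasso.fit(P_train, y_train)

n_kept = int(np.count_nonzero(lasso.coef_))
pred_rf = rf.predict(X_test)
pred_rfla = lasso.predict(P_test)


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


print(f"trees kept by Lasso: {n_kept} of {n_trees}")
print(f"test RMSE, random forest:         {rmse(y_test, pred_rf):.3f}")
print(f"test RMSE, random forest + Lasso: {rmse(y_test, pred_rfla):.3f}")
```

Whether the post-processed forest beats the plain one depends on the data and on `alpha`, so the printed numbers should be read as an illustration only.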

I received a question about hyperparameter optimisation using this Random Forest + Lasso approach (let’s call it RFLA): if we use the Lasso for post-processing and reduce the number of trees, how do we fit this into the hyperparameter optimisation? In this blog I summarise hyperparameter optimisation for random forest and discuss this question.

In random forest hyperparameter optimisation (even though random forest is not terribly sensitive to it), we commonly optimise the number of variables to select from at each split (mtry, the most important) and the minimum node size (min.node.size, less important), but not the number of trees (ntree), as ntree can be set to a very safe (i.e. large) number and increasing it doesn’t deteriorate the model. Random forest is perfectly parallelisable, so we also don’t need to worry much about the computational burden. This leads to the logic that in practice we typically decide on ntree first, then optimise mtry and min.node.size.

Let’s first look at the effects of mtry and min.node.size. As we know, a lower mtry reduces model overfitting by increasing the heterogeneity of the candidate predictor variables used to build each tree, which reduces the correlations between trees. At the same time, a small mtry may reduce the strength of a single tree, as fewer predictor variables are considered. For the same reason, a large mtry is more likely to cause overfitting of a single tree.

Suppose we have sufficient observations. If ntree is small (say, 50), mtry may be better off small, because we want the trees to be more independent of each other. But when ntree is relatively large (say, 2000), mtry may be increased, as we have many trees to aggregate, which reduces model overfitting in the first place. However, as mentioned, this makes the trees more correlated with each other (high redundancy). In practice, we use cross-validation (CV) to find the optimal mtry.

Reducing min.node.size also increases the chance of a single tree overfitting. Especially when mtry is large, a small min.node.size makes for a more flexible tree. Anyway, this hyperparameter is not so important; in practice, it is commonly set to 5 or 10.
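The workflow described above (fix ntree first, then tune mtry and min.node.size by CV) could be sketched in Python as follows. This is only an illustration under assumptions: the data are synthetic, and scikit-learn's `max_features` and `min_samples_leaf` are used as rough counterparts of mtry and min.node.size, which they do not match exactly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic data with 8 predictors (hypothetical, for illustration only).
X = rng.uniform(size=(300, 8))
y = 10 * X[:, 0] + 5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=300)

# Decide on the number of trees first; it is not part of the search grid.
forest = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)

# max_features=8 uses all predictors at every split, i.e. bagging.
param_grid = {
    "max_features": [2, 4, 8],     # rough counterpart of mtry
    "min_samples_leaf": [5, 10],   # rough counterpart of min.node.size
}

search = GridSearchCV(forest, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

print("best setting:", search.best_params_)
print("cross-validated RMSE:", round(-search.best_score_, 3))
```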

As mtry and min.node.size both affect the final results, it sounds reasonable to optimise the hyperparameters for RFLA directly, instead of optimising them for random forest and then applying Lasso. However, is this worth the effort? The Lasso hyperparameter optimisation further reduces the effects of the hyperparameter setting. With Lasso controlling the tree correlations, mtry could be very large. When mtry is equal to the number of predictors, random forest becomes bagging. This means that maybe bagging + Lasso is no less effective than random forest or random forest + Lasso. Or, more generally speaking, is bagging with an efficient shrinkage-based tree aggregation strategy good enough?

Is the invention of random forest overrated?

HOW AN IMMIGRATION OFFICE IN BAYREUTH COULD DESTROY A FOREIGN HEART

Meng Lu

I spent more than five years of my best times in Göttingen, Köln, and Münster. I probably had the best supervisor and colleagues, and I thought I loved this country almost as much as my homeland. When I had just moved to the Netherlands for my postdoc, I missed many things about Germany. After my postdoc, I decided to settle in Germany, and I thought it felt a bit like going home. However, if someone tells me now that he or she loves Germany so much and has decided to settle there, I feel a bit differently.

I got a position at Bayreuth University in Dec. 2020 and accepted the offer in Feb. 2021. Everyone in Bayreuth has been super kind to me and I met really great professors. But one thing happened that completely changed my feelings towards Germany as a whole, even though I tried to stay objective. I know the negative…


Why won’t a very large number of trees overfit boosting?

We know that boosting fits each new tree to the residuals left by the previous trees, so here comes a question: does boosting then resemble a very large tree in some sense, as it is growing vertically? If yes, it would be affected by the number of trees, i.e. too many trees would cause overfitting. We can do a small experiment: I knew that 1,000 trees gave me an optimal model, then I grew 10,000 trees and found the results to be almost the same, just like with random forest.

If you think about the problem from its origin, “a gradient descent solution”, then it seems quite straightforward: boosting each time uses the residuals from all of the observations to build the next tree, so if the gradient does not descend any more (it gets stuck in a minimum), the predictions stay the same. This is the main difference from a very large tree, which does not descend a gradient but keeps splitting at each node using a “local optimiser”, i.e. it finds the split that leads to the least variance within each segment. The segments become smaller and smaller, until you completely overfit.
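The mechanism can be watched in a toy Python sketch: gradient boosting with squared loss and regression stumps on one-dimensional data, written from scratch with NumPy. The data, the learning rate of 0.1, and the 2,000 rounds are assumptions for illustration, not the experiment described above. The sketch only shows that each new tree changes the predictions less and less as the residuals (the negative gradient) shrink; it does not prove that boosting can never overfit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 10, size=n))          # sorted once, reused below
y = np.sin(x) + rng.normal(scale=0.3, size=n)    # hypothetical noisy signal


def fit_stump(x_sorted, residual):
    """Least-squares stump on sorted 1-D input.

    Returns (threshold, left_value, right_value).
    """
    m = len(x_sorted)
    left_n = np.arange(1, m)                     # points left of each split
    left_sum = np.cumsum(residual)[:-1]
    right_sum = residual.sum() - left_sum
    # Maximising this score is equivalent to minimising the squared error.
    score = left_sum ** 2 / left_n + right_sum ** 2 / (m - left_n)
    score[x_sorted[1:] == x_sorted[:-1]] = -np.inf   # cannot split ties
    i = int(np.argmax(score))
    threshold = 0.5 * (x_sorted[i] + x_sorted[i + 1])
    return threshold, left_sum[i] / left_n[i], right_sum[i] / (m - left_n[i])


learning_rate = 0.1
n_rounds = 2000
pred = np.full(n, y.mean())
train_mse, update_size = [], []

for _ in range(n_rounds):
    residual = y - pred                          # negative gradient of squared loss
    thr, left_val, right_val = fit_stump(x, residual)
    step = learning_rate * np.where(x <= thr, left_val, right_val)
    pred += step
    update_size.append(float(np.mean(np.abs(step))))
    train_mse.append(float(np.mean((y - pred) ** 2)))

for k in (1, 10, 100, 1000, 2000):
    print(f"round {k:5d}: mean |update| = {update_size[k - 1]:.5f}, "
          f"training MSE = {train_mse[k - 1]:.4f}")
```

On noisy data the training error keeps creeping down, so in practice the number of trees is still usually checked on held-out data.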

Niche in Geoscience? No.

“Finding a niche” is sort of a “holy grail” that a senior researcher passes on when mentoring a young researcher. Many professors believe that being able to find their niche led to their success. But they forget that this probably held decades ago; in the modern information age, a “niche” doesn’t exist in Geoscience and should not exist. Anyone can and should be able to build on top of others’ work. Open science has shown us that this trend leads to the fastest acceleration of science. Twitter (a pioneer in completely opening its development platform at the development stage) demonstrated it by becoming the giant it is today.

I was mentored by professors I truly trusted, respected, and appreciated to find my niche, and I wrote this short blog because I recently heard people telling others “you should find your niche” or “we should find our niche” a few times. I know they are sharing their precious experience with the utmost sincerity, but useful experience has an expiration date. Trying to find a niche, one may go to the extreme of doing things others won’t, invest in the opposite of open science, or simply be discouraged and lose motivation.

As for better ways, I just draw from what I have seen and want to say to myself: don’t be afraid to choose a very hot topic, sit down but keep your head up, keep your eyes open and keep moving on, surpass the years-long hard work of others, and let others do the same in return.

Deep learning resources

There is no better era to self-teach deep learning! Besides well-known resource platforms such as Kaggle and the machine learning roadmap, I recommend several resources that I found really amazing, and the course sequences to follow. The very first start is still the courses offered on Coursera, or the Stanford CS231n (see item 4 below) on YouTube, great accelerators!

  1. Dive into Deep Learning https://d2l.ai/: “Interactive deep learning book with code, math, and discussions.” This learning material is a classic! I got to know it late, but anyone can benefit from it at any stage. All the scripts can be run in Google Colab. The explanations are amazing. The Chinese version is a top seller in the Chinese bookstore market. It is great and reads as if it originates from Chinese authors, unlike many books with very rough translations. Many universities already use it for classes, and an AWS space can be applied for free for teaching purposes. This book, interactive as it suggests, may be a better start than the two classical deep learning books, namely DEEP LEARNING with PYTHON and DEEP LEARNING, as it is up to date, very practical with real-life scripts, and enables discussions.
  2. https://distill.pub/ A fantastic online journal with great visualisations.
  3. https://paperswithcode.com/: Papers and code, as the name suggests; this is the great trend being pushed by the field of machine learning. In the same vein is OpenReview.
  4. Courses on YouTube: The sequence I recommend watching is (1) Stanford CS231n (the winter or summer semester), which is the most detailed and classical course; (2) MIT 6.S191, which is quite a good introduction to the realm of deep learning, but less detailed; (3) Unsupervised deep learning by Pieter Abbeel at UC Berkeley, for people who would like to dive deeper; (4) DeepMind x UCL | Deep Learning Lectures, which is more fast-paced and advanced, and gives the audience a glimpse into the newest developments up to 2020.
  5. For people who can read Chinese, CSDN, for numerous insightful blogs and resources. CSDN has been around for ages, but I just got to know it, and the articles published there deepened my understanding greatly!! I am inspired by the enthusiasm of the community.
  6. Maybe needless to mention: follow people on ResearchGate, GitHub, Twitter, and LinkedIn, and subscribe to YouTube channels, so that you will always be up to date.

I will keep updating the list, enjoy learning!

Sharing a great explanation of PCA

PCA was the beginning of my spatiotemporal data analysis journey and went all the way through my PhD study. It can be understood simply as an orthogonal eigen-decomposition of the covariance matrix, with the variance of each component arranged in decreasing order. However, the links between it and linear regression, ANOVA, etc. never got imprinted in my mind, so I kept feeling that I did not understand it completely and kept trying to demystify it. Now I have found the best illustration, one that clears up my confusion. Enjoy reading!

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
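The “eigen-decomposition of the covariance matrix” view can be written down in a few lines of Python with NumPy. This is a minimal sketch on synthetic data (the mixing matrix below is an arbitrary assumption), not taken from the linked answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated synthetic data: 500 observations of 3 variables.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.1]])
X = rng.normal(size=(500, 3)) @ mixing

# 1. Centre the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2. Eigen-decomposition; eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so reorder them to be decreasing.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project the centred data onto the eigenvectors (principal components).
scores = Xc @ eigvecs
explained = eigvals / eigvals.sum()

print("variance of each component:", np.round(eigvals, 3))
print("share of total variance:   ", np.round(explained, 3))
```

The variance of each column of `scores` equals the corresponding eigenvalue, and the columns of `eigvecs` are mutually orthogonal, which is exactly the description given above.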

Deep learning in remote sensing image segmentation problems 1 — boundary delineation

Deep learning has been used in building footprint delineation, road extraction, and coastline delineation, among others, where the focus is on accurate boundary delineation. Below are the mainstream directions I am aware of; several of them appear in the state of the art, some of them in daily experiments. Convincing recommendations of the best method, or of a combination of methods, for an ultimate solution may come soon.

Credit: figure from Su and Zhange 2017, ISPRS Journal of Photogrammetry and Remote Sensing.

1. Use Learning Attraction Field Representation.

The method was proposed for line segment detection in “Learning Attraction Field Representation for Robust Line Segment Detection”, which reformulates the problem as a “coupled region colouring problem” [1].

2. Use more boundary-specific loss function.
Loss functions play an essential role in machine learning. Lots of loss functions have been proposed, but they still need to be evaluated comprehensively:

As a first attempt and for binary segmentation, one can try RMSE on distance metric. Boundary-specific loss functions are proposed in:

2.1 Boundary Loss for Remote Sensing Imagery Semantic

2.2 Boundary loss for highly unbalanced segmentation

3. Extract boundary first with a conventional edge detection algorithm, use it as a feature input for training.

This simple addition was proposed by a colleague, and he obtained an incredible improvement in IoU, from around 55% to 62%, in his study case of building detection. This really calls for a comparison with all the other, more complex methods: what are the REAL reasons behind the improvements? Many people are getting increasingly disappointed by current publications, as new methods are published with improvements that are seemingly a matter of chance and without links to other possibilities.
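As a rough Python illustration of this idea, the sketch below computes an edge map and stacks it onto the image as an extra input channel. The Sobel operator and the synthetic “building” are my assumptions; the text above only says “a conventional edge detection algorithm”, and this is not my colleague's actual pipeline.

```python
import numpy as np


def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel filters (edge-padded borders)."""
    p = np.pad(img.astype(float), 1, mode="edge")
    # Horizontal gradient: right column minus left column, weights 1-2-1.
    gx = ((p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:])
          - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2]))
    # Vertical gradient: bottom row minus top row, weights 1-2-1.
    gy = ((p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:])
          - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:]))
    return np.hypot(gx, gy)


# Hypothetical single-band image containing one square "building".
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0

edges = sobel_magnitude(image)

# Stack the normalised edge map as an additional channel for training.
features = np.stack([image, edges / edges.max()], axis=-1)

print("input shape for the network:", features.shape)
print("edge response inside the building:", edges[16, 16])
print("edge response on its boundary:   ", edges[8, 16])
```

A segmentation network would then take `features` (height x width x 2) instead of the single-band image as its input.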

4. Binary segmentation as an edge detection problem

Current deep learning applications in remote sensing image classification are mostly based on image segmentation. Vector labels are commonly rasterised for training, but this does NOT have to be the case! For binary problems such as building footprint delineation, one can turn the problem back into an edge detection problem, which opens a new door of opportunities. For example, crisp edge detection below:

Credit: figure from Huan et al., Unmixing convolutional features for crisp edge detection.

R, Python, or both in Machine Learning?

As an R user for almost ten years, I’m gradually switching to Python for machine learning, and for pretty much everything. Not to mention deep learning, whose community is almost exclusively in Python; for other data science methods, too, Python sees a community growing faster than R’s. There are no doubt lots of developments in R that are not yet available in Python, like distributional forests and empirical fluctuation process-based methods for detecting structural change in time series (party, efp, bfast, etc.), and ggplot is extremely powerful. But such cases are becoming fewer and fewer. Relatively more recent methods, such as CatBoost or XGBoost, have better Python APIs. Classical geospatial analysis methods such as GWR (geographically weighted regression) and Gaussian processes also see lots of developments in Python. The tidyr tools also come naturally in Python (i.e. the pipes are not really needed, as the programming is already fully object-based).

Python’s array, data frame, and geodata handling are becoming more powerful every day, while R’s are developing more slowly. I quite often implement things in R first, as I’m most accustomed to it, but I always find a solution in Python that is neater and simpler.

I won’t say bye to R though, as there are still lots of complementary tools, it is sometimes convenient for me, several of my collaborations are still based on R, and the R community is still growing and up-and-coming. Programming languages function as communication tools. Emerging ideas from R are as exciting as those from Python. I just want to say that anyone who is serious about machine learning and still a faithful R user will benefit from being bilingual.