Reinstalling GDAL, and the R packages raster, sf, and stars

As a geoscientist working a lot with R, I just had another worst-day-ever installing GDAL and PROJ. Python is less problematic because you can always keep everything clean in a conda environment.

With R you can use Docker, but it is not as convenient.

If GDAL works for e.g. sf, stars, and rgdal, but not for raster, then you probably need to reinstall the package "terra"! I tried many things with my GDAL installation, but this came to the rescue.
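
A minimal sketch of that fix, assuming your system GDAL is already working: reinstall terra from source so it compiles against the current GDAL, then check which version it picked up.

# reinstall terra from source so it links against the system GDAL/PROJ
install.packages("terra", type = "source")
library(terra)
gdal()   # reports the GDAL version terra is linked against
install.packages("raster")   # then reinstall raster, which builds on terra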

The GDAL problem probably happens more often on Linux, because on Windows cleaning up and uninstalling is easier.

The problem was that I had GDAL 3.4, which is incompatible with some software, so I installed GDAL 2.4; then of course PROJ complained. Then I started a loop of removing, reinstalling, manually fixing the gdal-config file, manually removing and downloading things, searching for solutions, and trying, trying, trying.

But the problems are just endless, and things got all messed up. Well, you still learn something (you learn more about GDAL), but nothing useful.

All I needed was a clean wash of all GDAL things. The blog post linked below came to the rescue.

But sudo apt-get --reinstall install gdal-bin never cleans things up.

So what you need to do is:
sudo apt-get purge --auto-remove libgdal
sudo apt-get purge --auto-remove libgdal29
sudo apt-get purge --auto-remove gdal-bin
sudo apt-get purge --auto-remove libkml-dev
sudo apt-get purge --auto-remove libproj-dev


Then:

sudo apt-get install libgdal-dev gdal-bin libproj15 libproj19 libproj-dev

A trick for the "sf" package is:
sudo ln -s /usr/lib/x86_64-linux-gnu/libproj.so.15 /usr/lib/libproj.so.12
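
After this, a quick sanity check in R is worthwhile (a sketch; sf needs to be reinstalled from source whenever the system GDAL/PROJ changes):

# reinstall sf from source and check which libraries it is linked against
install.packages("sf", type = "source")
library(sf)
sf_extSoftVersion()   # shows the GEOS, GDAL and PROJ versions sf actually uses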

Full credit to this excellent post: https://bertelsen.ca/post/gdal-3-3-1-on-ubuntu/

I hope this is now the end of my GDAL headache.

news: Spatial modelling and efficient computation workshops by Joaquin Cavieres 

We will be hosting two very interesting workshops on spatial modelling and efficient computation, the first on spatio-temporal modelling with INLA and the second on Template Model Builder. The workshops are on 19 January and 26 January 2022, respectively, from 2 pm (Berlin time), and will be taught by Joaquin Cavieres, PhD candidate in statistics at Universidad de Valparaíso.

You are welcome to join online! Please find the details below.

Workshops

Time: Jan 19, 2022 02:00 PM Amsterdam, Berlin, Rome, Stockholm, Vienna

Topic: Beyond classical Kriging and the least-squares fit: Spatial and spatio-temporal modelling with INLA

Summary: INLA is a methodology codified in an R package for solving different statistical models in a specific class of latent models called "latent Gaussian models". It will be an introductory class to R-INLA, and the main idea is to show how we can fit different statistical (spatial) models with this methodology.

Zoom Link:

https://uni-bayreuth.zoom.us/j/63888521495?pwd=aFI2OEJ5UnpRQW4rOGdFWTB5aWhRZz09

Preparation for the workshop:

  1. Install R
  2. Install the packages below:

install.packages("devtools")

library(devtools)

install.packages("INLA", repos = c(getOption("repos"), INLA = "https://inla.r-inla-download.org/R/stable"), dep = TRUE)

devtools::install_github("julianfaraway/brinla")

Install R-INLA (testing version):

install.packages("INLA", repos = c(getOption("repos"), INLA = "https://inla.r-inla-download.org/R/testing"), dep = TRUE)
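
As a quick check that the installation works, here is a minimal toy example (simulated data, just for illustration; not part of the official instructions):

library(INLA)
# a toy Gaussian regression on simulated data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- inla(y ~ x, family = "gaussian", data = data.frame(y = y, x = x))
summary(fit)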

For more information, please refer to: https://www.r-inla.org/download-install

Main topics:

  1. Install R-INLA
  2. Theoretical description of INLA
  3. Types of models to fit
  4. Examples in R

References to R-INLA web:

https://www.r-inla.org/

Materials:

Bayesian linear regression with INLA

https://julianfaraway.github.io/brinla/

Spatial modelling with INLA:

https://becarioprecario.bitbucket.io/spde-gitbook/ (advanced)

Time: Jan 26, 2022 02:00 PM Amsterdam, Berlin, Rome, Stockholm, Vienna

Topic: A short introduction to Template Model Builder (TMB)

Zoom Link:

https://uni-bayreuth.zoom.us/j/65600578121?pwd=MkhObmE4SDNXRmU4OU9PY2VmNTU4QT09

Summary: TMB (Template Model Builder) is an R package for fitting statistical latent variable models to data. Unlike most other R packages, the model is formulated in C++. This provides great flexibility but requires some familiarity with the C/C++ programming language.

It will be an introductory class to TMB, and the main idea is to show how we can fit different statistical models with this software.

Install TMB:

install.packages("TMB")

Please refer to: https://github.com/kaskr/adcomp

Additionally, you need to install Rtools (compatible with your RStudio and R versions!).
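
To give a taste of the workflow, here is a minimal sketch (the file name linreg.cpp and the simulated data are made up for illustration): the model is a C++ template, compiled and then optimised from R.

library(TMB)
# write a minimal linear regression template (C++) to a file
writeLines('
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(y);
  DATA_VECTOR(x);
  PARAMETER(a);
  PARAMETER(b);
  PARAMETER(logSigma);
  // negative log-likelihood of a Gaussian linear regression
  Type nll = -sum(dnorm(y, a + b * x, exp(logSigma), true));
  return nll;
}', "linreg.cpp")
compile("linreg.cpp")               # needs a working C++ compiler (Rtools on Windows)
dyn.load(dynlib("linreg"))
set.seed(1)
x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
obj <- MakeADFun(data = list(y = y, x = x),
                 parameters = list(a = 0, b = 0, logSigma = 0),
                 DLL = "linreg")
opt <- nlminb(obj$par, obj$fn, obj$gr)  # minimise the negative log-likelihood
opt$par                                 # estimated a, b, logSigma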

Main topics:

  1. Install TMB
  2. Theoretical description of TMB and how it works. 
  3. Types of models to fit
  4. Examples in R

References to TMB web:

https://kaskr.github.io/adcomp/Introduction.html

Materials:

https://github.com/kaskr/adcomp/tree/master/tmb_examples

Lecturer:

Joaquin Cavieres PhD (c) in Statistics

Universidad de Valparaíso

On hyperparameter optimisation of random forest when using the Lasso for post-processing trees

In my OpenGeoHub2020 lecture (https://github.com/mengluchu/OpenGeoHub2020), I talked about applying the Lasso as a post-processing step for random forest to shrink the tree space. This approach was originally proposed in the famous "The Elements of Statistical Learning" book by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie, but I had not seen it implemented or used. I implemented it in my air pollution spatial prediction model and in my R package APMtools. The function is here: https://github.com/mengluchu/APMtools/blob/master/R/prediction_with_pp_La.R. My experience is that this improves prediction accuracy, achieving an accuracy almost as good as a meticulously tuned XGBoost.
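
The idea can be sketched in a few lines; this is a hypothetical illustration with ranger and glmnet, not the exact APMtools code:

library(ranger)
library(glmnet)
# toy data, for illustration only
set.seed(1)
d <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
train <- d[1:150, ]
test  <- d[151:200, ]
rf <- ranger(y ~ ., data = train, num.trees = 500)
# per-tree predictions: an n x num.trees matrix
pt_train <- predict(rf, data = train, predict.all = TRUE)$predictions
# the Lasso selects and weights trees; ideally fit it on held-out or OOB predictions
cvfit <- cv.glmnet(pt_train, train$y, alpha = 1)
pt_test <- predict(rf, data = test, predict.all = TRUE)$predictions
pred <- predict(cvfit, newx = pt_test, s = "lambda.min")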

I received a question about hyperparameter optimisation for this random forest + Lasso approach (let's call it RFLA): if we use the Lasso for post-processing and reduce the number of trees, how do we fit that into the hyperparameter optimisation? In this blog I summarise hyperparameter optimisation for random forest and discuss this question.

In random forest hyperparameter optimisation (even though random forest is not terribly sensitive to it), we commonly optimise the number of variables to select from (mtry, the most important) and the minimum node size (min.node.size, less important), but not the number of trees (ntree), as ntree can be set to a safely large number and increasing it does not deteriorate the model. Random forest is perfectly parallelisable, so we also don't need to worry about the computational burden. This leads to the common practice of deciding on ntree first, then optimising mtry and min.node.size, as sketched below.
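
In practice this can be a small grid search over mtry and min.node.size with ntree fixed at a large value, using the out-of-bag error that, e.g., ranger reports (a minimal sketch on made-up data):

library(ranger)
set.seed(1)
d <- data.frame(y = rnorm(300), x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
grid <- expand.grid(mtry = 1:3, min.node.size = c(5, 10))
grid$oob_mse <- sapply(seq_len(nrow(grid)), function(i) {
  ranger(y ~ ., data = d,
         num.trees = 2000,                      # ntree fixed at a safely large value
         mtry = grid$mtry[i],
         min.node.size = grid$min.node.size[i])$prediction.error
})
grid[which.min(grid$oob_mse), ]                  # the OOB-optimal setting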

Let's first look at the effects of mtry and min.node.size. As we know, a lower mtry reduces model overfitting by increasing the heterogeneity of the candidate predictor variables used to build each tree, and therefore reduces the correlation between trees. At the same time, a small mtry may reduce the strength of a single tree, as fewer predictor variables are considered. For the same reason, a large mtry is more likely to cause overfitting of a single tree.

Suppose we have sufficient observations. If ntree is small (say, 50), mtry may be better off small, because we want the trees to be more independent of each other. But when ntree is relatively large (say, 2000), mtry may be increased, as we have many trees to aggregate, which reduces model overfitting in the first place. However, as mentioned, this makes the trees more correlated with each other (high redundancy). In practice, we use cross-validation (CV) to find the optimal mtry.

Reducing min.node.size also increases the chance of a single tree overfitting. Especially when mtry is large, a small min.node.size makes a tree more flexible. Anyway, this hyperparameter is not so important; in practice it is commonly set to 5 or 10.

As mtry and min.node.size both affect the final results, it sounds reasonable to optimise the hyperparameters for RFLA directly, instead of optimising them for random forest and then applying the Lasso. However, is this worth the effort? The Lasso post-processing further reduces the effect of the hyperparameter settings. With the Lasso controlling tree correlations, mtry could be very large. When mtry equals the number of predictors, random forest becomes bagging. This means that maybe bagging + Lasso is no less effective than random forest or random forest + Lasso. Or, more generally speaking, is bagging with an efficient shrinkage-based tree aggregation strategy good enough?

Is the invention of random forest overrated?

HOW AN IMMIGRATION OFFICE IN BAYREUTH COULD DESTROY A FOREIGN HEART

Meng Lu

I spent more than five years of my best times in Göttingen, Köln, and Münster. I probably had the best supervisor and colleagues, and thought I loved this country almost as much as my homeland. When I first moved to the Netherlands for my postdoc, I missed many things about Germany. After my postdoc, I decided to settle in Germany, and it felt a bit like going home. However, if someone told me now that he or she loves Germany so much and has decided to settle there, I would feel a bit differently.

I got a position at Bayreuth University in Dec. 2020 and accepted the offer in Feb. 2021. Everyone in Bayreuth has been super kind to me, and I have met really great professors. But one thing happened that completely changed my feelings towards Germany as a whole, despite my trying to stay objective. I know the negative…


Niche in Geoscience? No.

"Finding a niche" is a sort of "holy grail" that senior researchers offer as mentoring advice to young researchers. Many professors believe that finding their niche led to their success. But they forget that this probably held decades ago; in the modern information age, a "niche" doesn't exist in Geoscience, and it should not exist. Anyone can, and should be able to, build on top of others' work. Open science has taught us that this trend leads to the fastest acceleration of science. Twitter (a pioneer in completely opening its platform at the development stage) demonstrated it by how it became a giant today.

I was mentored by professors I truly trusted, respected, and appreciated, who told me I should find my niche, and I wrote this short blog because I have heard people telling others "you should find your niche" or "we should find our niche" a few times recently. I know they are sharing their precious experience with the utmost sincerity, but useful experience has an expiration date. Trying to find a niche, one may go to the extreme of doing things others won't, invest in the opposite of open science, or simply become discouraged and lose motivation.

Better ways (I just draw from what I saw, and want to say this to myself): don't be afraid to choose a very hot topic; sit down but keep your head up; keep your eyes open and keep moving; surpass the years-long hard work of others, and let others do the same in return.

Sharing a great explanation of PCA

PCA was the beginning of my spatiotemporal data analysis journey and stayed with me all the way through my PhD study. It can be understood simply as an orthogonal eigen-decomposition of the covariance matrix, with the variance of each component arranged in decreasing order. However, the links between it and linear regression, ANOVA, etc. never got imprinted in my mind, and it turned out I kept feeling I did not completely understand it and kept trying to demystify it. Now I have found the best illustration, which resolves my confusion. Enjoy reading!

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
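
A tiny R check captures the core point, namely that prcomp is exactly an eigen-decomposition of the covariance matrix:

# centred toy data
set.seed(1)
X <- scale(matrix(rnorm(200), 50, 4), center = TRUE, scale = FALSE)
e <- eigen(cov(X))             # eigen-decomposition of the covariance matrix
p <- prcomp(X, center = FALSE) # PCA on the already-centred data
all.equal(e$values, p$sdev^2)  # component variances = eigenvalues, in decreasing order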

R, Python, or both in Machine Learning?

As an R user for almost ten years, I'm gradually switching to Python for machine learning, and pretty much everything else. Not to mention deep learning, whose community is almost exclusively in Python; for other data science methods, too, Python's community is growing faster than R's. There are no doubt lots of developments in R that are not yet available in Python, such as distributional forests and empirical fluctuation process-based time series structural change detection (party, efp, bfast, etc.), and ggplot is extremely powerful. But these gaps are becoming fewer and fewer. Relatively recent methods, such as CatBoost or XGBoost, have better Python APIs. Classical geospatial analysis methods such as GWR (geographically weighted regression) and Gaussian processes also see lots of development in Python. The tidyr-style tools also come naturally in Python (i.e. pipes are not really needed, as the programming is already fully object-based).

Python's array, data frame, and geodata handling are becoming more powerful every day, while R moves more slowly. I quite often implement things in R first, as I'm most accustomed to it, but I always end up finding a solution in Python that is neater and simpler.

I won't say goodbye to R though, as there are still lots of complementary tools, it is sometimes convenient for me, and several of my collaborations are still based in R; the R community is still growing and coming up with new things. Programming languages function as communication tools, and emerging ideas from R are as exciting as those in Python. Just to say: anyone serious about machine learning who is still a faithful R user will benefit from being bilingual.

About choosing a Ph.D. supervisor

I have been asked about my experience studying in different countries (I did my bachelor's in China, my MSc in the US, my PhD in Germany, and my postdoc in the Netherlands): which did I like best, and what are the differences? This, of course, depends on the university, the institute, and the group of people I worked with. There are some general patterns, but the variability between countries is much lower than between institutes or research groups within a country. I have worked closely with four professors, and the dissimilarities between them (in terms of how they do science and supervise students) are huge.

Many PhDs grow to be extremely similar to their PhD supervisors, if their supervisors are respectful to them. I have at least seen this in some of my PhD and postdoc colleagues. (As for not-very-good supervisors, I have heard their former students say they learned how not to be a not-very-good supervisor, and that they are now really aware of being responsive.)

In short, positive or negative, the influence of a Ph.D. supervisor touches many aspects of a researcher. So here are some of the tips I want to share. When choosing a Ph.D. supervisor, do:

  1. look at whether the supervisor is still leading the field: is he/she still active, still updating his/her blog and GitHub/GitLab, and does he/she still read the literature or watch lectures?
  2. talk to his/her current Ph.D. and M.Sc. students: how much time does he/she have, how bossy and how caring is he/she, and does he/she really have great insights or just a list of publications?
  3. know what the supervisor is best known for: what extraordinary/interesting things has he/she done, and what is he/she doing now?
  4. be as critical as you can. This is hard for an M.Sc. student, but the research capability of professors varies greatly. Some of them are, in the Chinese saying, "masters of old science", and worse, they stick to the piece of dying science that they have worked on for so long and that once brought them honour. Their students will likely continue building on that piece of dying work.

Don’t

  1. be fooled by the supervisor's long co-authored publication list; that means little more than NOTHING, if not something negative. If a supervisor has more than 15 publications a year, think about how this is possible. Is it possible to have 15 breakthroughs per year? I'd prefer one really good paper every two years. I once saw someone with 5 first-authored papers in a year, and my first instinct was that the papers might not be so good, rather than that this person is amazing. We know we need time to think in depth, to discuss, to implement, to analyse, and to refine.
  2. choose too quickly what at a glance looks interesting. Some research groups may have models or software under development; this is a very cool thing, but be careful not to be constrained to a certain model or software.

Bias in judgements of a grant system

The Veni is a Dutch science foundation grant for researchers within 3 years of obtaining their Ph.D. It is the first personal grant I have applied for. I spent about 3 months working day and night on the proposal, application, and rebuttal. The result: the three reviewer reports are overall very positive, raising great hopes, but the funding committee didn't find the proposal (let's call it proposal A) that appealing. I am completely fine with it, but I do want to speak out about the bias in the evaluation process.

For comparison, I wrote a proposal B afterwards, with an extended methodology and data sources but a much more confined study area: global-scale for the Veni proposal vs. a single European country for proposal B.

My collaborator told me that proposal B was ranked second, almost in; only my relationship with the university was slightly weaker compared to the top-ranked applicant, and most people who read it found it strong.

Proposal A even fell out of the "very good" category, which was a shock. The most negative criticism of proposal A was that it is too ambitious and that they would prefer a more confined study that fully considers all the uncertainties and interpretations. This "too ambitious" then echoed through the entire evaluation, even to the point of saying my academic performance does not match up with this ambitious goal. That is a very unfair point, but it is not what I want to discuss.

I want to share my opinion on this comment of "too ambitious". I acknowledge that in the beginning I shared the same view as the reviewers, and even argued emotionally with my current supervisor. Being a data person at the core, I care about uncertainty and interpretability as much as any serious data scientist, which means I also preferred a small study area with lots of data, to fully evaluate all the uncertainties and the physical meanings of the atmospheric processes. However, as time passed and my knowledge grew, I developed a convinced view of the global-scale study: besides the social and academic impacts, it is the only way of appreciating the heterogeneities between regions. Importantly, with globally covering satellite imagery and more and more geospatial data, it is only a matter of time before we make high-quality, high-resolution global maps. And the next step? Better exposure assessment. I want to shout out that all these evolutions will happen within the next three years! However, for researchers who have never given global-scale modelling a deep thought, "too ambitious" is going to be the intuition, based on their many years of experience as professors.

The other point the committee raised is about the uncertainty from the data sources. This part was elaborated in an early draft, but my supervisor suggested removing it. This is clearly a matter of different perspectives: researchers from different backgrounds care about different aspects, and my supervisor thought no one would care about the uncertainty. The proposal is limited to fewer than 1800 words; I agreed purely because I needed the space to elaborate my plans, otherwise the criticism would have been that the plan is not clear.

Reflecting on the two points: before working on global mapping, I was completely in line with the committee. If I were the committee, I would have arrived at the same comments and would also have preferred proposal B. But the key point is that my opinion changed after getting my hands on global mapping: I like both proposals equally! For the global-scale study, I was not claiming that the accuracy would be the same everywhere, but I did promise an uncertainty assessment everywhere. Importantly, if global mapping is not started now, it won't ever start, because the committee will always find it too ambitious. And for the challenging areas there will never be an estimate, and the estimate will never be improved!

Here comes my criticism of the grant system: for something no one has done before, the judgement may be greatly biased! This problem is exacerbated by the word limit: it is hard to both convince the committee and write a clear plan in limited words. Some changes are needed. At the least, the rebuttal should be allowed to be longer. Secondly, the committee should analyse the opinions of all the reviewers carefully; the letter now clearly states that the committee is not obliged to follow the referees' reports completely.

Reducing the data size to 1/10, only by reshaping!

On the way to big data concepts and burgeoning software, don't forget to inspect whether your data frame or spreadsheet stores too many redundant records. Just by reshaping, I reduced my data size to 1/10!

Since I got my hands on global air pollution mapping, I have managed to gather a most comprehensive set of station measurements, through collaborations and by investigating the open science community. My ambition was to do it time-resolved, so I gathered hourly data for a year (8760 hours). I got 6 datasets: some dumped into 365 spreadsheets with several air pollutants, with a total size of 22 GB; some stored in 13 spreadsheets, sorted by space-time; some stored as wide tables, some as long; some using UTC time, some local time. In short, all of them needed to be wrangled into a consistent structure for querying and analysis.

Though I focused on array data management during my Ph.D., I had never thought point data would bring me trouble in storage. I thought it would be easy: I just need a relational database or HDF5-based tools.

I never expected a meeting with my software developer colleagues to be so useful, as I thought I knew a bit about HDF5 and relational databases, and just putting them to use shouldn't be hard.

But with a mere look at the column names I provided, my colleague said: "there is a lot of replication in the data, isn't there?"

I had the columns: time, longitude, latitude, sensor type, value, urban type, … The longitude, latitude, etc. are replicated many times as the time steps accumulate.

I had thought this was the format I am most familiar with for the subsequent analysis; this was the first time I started to care about how much space is wasted.

I went back and built a relational database with two tables, one (the data table) with:

time, value, UID (a unique ID linking to each pair of coordinates)

The other (the attribute table) with:

UID, longitude, latitude, …

I first investigated a small dataset with 70 stations; this reduced the file size from around 75 MB to 17 MB.

Now that I am more aware of storage, I got uneasy about the leftover replication: the UID repeats across time steps, and the time repeats across UIDs.

I then rearranged the long table into a wide table, with each column a station. This makes it almost a third of the size of the data table. Compared with the original data, it shrank to 1/10 of the size.
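
In R, the whole normalisation can be sketched in a few lines (hypothetical toy columns, mirroring my time/longitude/latitude/value layout):

library(tidyr)
# a long table: coordinates repeated for every time step (redundant)
long <- data.frame(
  time      = rep(1:3, each = 2),
  UID       = rep(c("s1", "s2"), times = 3),
  longitude = rep(c(5.1, 6.2), times = 3),
  latitude  = rep(c(52.0, 51.4), times = 3),
  value     = rnorm(6)
)
# attribute table: each station's coordinates stored only once
attr_tab <- unique(long[, c("UID", "longitude", "latitude")])
# data table, then reshaped wide: one row per time step, one column per station
data_tab <- long[, c("time", "UID", "value")]
wide <- pivot_wider(data_tab, names_from = UID, values_from = value)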

Before going for "sophisticated" big data solutions, think about "never repeating data".