About choosing a Ph.D. supervisor

I have been asked about my experience studying in different countries (I did my bachelor's in China, my MSc in the US, my PhD in Germany, and my postdoc in the Netherlands): which did I like best, and what are the differences? This, of course, depends on the university, the institute, and the group of people I worked with. There are some general patterns, but the variability between countries is much lower than the variability between institutes or research groups within a country. I have worked closely with four professors, and the differences between them (in terms of how they do science and supervise students) are huge.

Many PhD students grow to be extremely similar to their supervisors, if their supervisors are respectful to them. At least, this is what I have seen in some of my PhD and postdoc colleagues. (As for not-very-good supervisors, their (former) students have told me that they learned how not to be a not-very-good supervisor and are now really conscious of being responsive.)

In short, positive or negative, a Ph.D. supervisor influences so many aspects of a researcher. So here are some tips I want to share. When choosing a Ph.D. supervisor, do:

  1. look at whether the supervisor is still leading the field: is he/she still active, does he/she keep updating his/her blog or GitHub/GitLab, and does he/she still read the literature or watch lectures?
  2. talk to his/her current Ph.D. and M.Sc. students: how much time does he/she have, how bossy and how caring is he/she, and does he/she really have great insights or just a long list of publications?
  3. know what the supervisor is best known for, and what extraordinary or interesting things he/she has done and is doing.
  4. be as critical as you can. This is hard for an M.Sc. student, but the research capability of professors varies greatly. Some of them are, as the Chinese saying goes, “the master of old science”, and worse, they stick to the piece of dying science that they have worked on for so long and that once brought them honour. Being their student may well mean continuing to build on that piece of dying work.


Don't:

  1. be fooled by the supervisor’s long list of co-authored publications. That means only a bit more than NOTHING, and may even be a negative sign. If a supervisor has more than 15 publications a year, ask yourself how this is possible. Is it possible to have 15 breakthroughs per year? I’d prefer one really good paper every two years. I once saw someone with 5 first-authored papers in a year; my first thought was that these papers may not be so good, rather than that this person is amazing. We know we need time to think in depth, to discuss, to implement, to analyse, and to refine.
  2. choose too quickly based on what looks interesting at a glance. Some research groups have models or software under development. This is a very cool thing, but be careful not to become constrained to a certain model or piece of software.


How to prevent XGBoost from overfitting

Most people using XGBoost have experienced model overfitting. I earlier wrote a blog post about how cross-validation can be misleading and the importance of prediction patterns: https://wordpress.com/view/tomatofox.wordpress.com . This time I just want to note down some very practical tips, things that cross-validation (e.g. with grid search) can’t tell us.

  1. Overfitting is likely caused by a learning rate that is too high. The default, 0.3, is usually too high. You can try 0.005 on a dataset of more than 300 observations.
  2. A very misleading statement in many publications and tutorials is that too many trees in XGBoost (or boosting in general) cause overfitting. This is ambiguous. In my experience, many trees will NOT decrease your model performance. Increasing the number of trees beyond the optimum only adds to your computational burden, but it is as safe as adding trees in a random forest! With as many trees as you can imagine, the model is as general as with fewer trees, because the gradient just gets stuck. The cross-validation result won’t change, and neither will the prediction pattern.
  3. So the advice is to use a very low learning rate, say 0.001, and set as many trees as you like, say 3000, for the best fit. Then, to get results faster, reduce the number of trees. If you use a learning rate of 0.001, 1000 trees should be enough to find the global minimum, since 0.001 × 1000 is already 1, roughly comparable to doing gradient descent once with a learning rate of 1.
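
If you want to try these tips on a toy problem before touching a real model, here is a minimal sketch. It is not XGBoost: it is plain gradient boosting with decision stumps on squared error, written in pure Python, so it has none of XGBoost's regularisation. The function names (`fit_stump`, `boost`, `mse`, `make_data`), the noisy sine data, and the learning rates and tree counts in the loop are all assumptions made for illustration. It prints the training and held-out error for different learning rates and numbers of trees, so you can look at the pattern yourself.

```python
import math
import random


def fit_stump(xs, residuals):
    """Least-squares decision stump: returns (threshold, left_value, right_value)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n = len(xs)
    total = sum(residuals)
    left_sum = 0.0
    best_score, best_stump = None, None
    for k in range(1, n):
        prev, nxt = order[k - 1], order[k]
        left_sum += residuals[prev]
        if xs[prev] == xs[nxt]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximising this score is equivalent to minimising the squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best_score is None or score > best_score:
            best_score = score
            best_stump = ((xs[prev] + xs[nxt]) / 2, left_sum / k, right_sum / (n - k))
    return best_stump


def stump_value(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, learning_rate, n_trees):
    """Fit boosted stumps; returns (base prediction, list of stumps)."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared error (up to a constant).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x) for x, p in zip(xs, preds)]
    return base, stumps


def mse(model, learning_rate, xs, ys, n_trees=None):
    """Mean squared error using only the first n_trees stumps (all if None)."""
    base, stumps = model
    used = stumps if n_trees is None else stumps[:n_trees]
    total = 0.0
    for x, y in zip(xs, ys):
        pred = base + learning_rate * sum(stump_value(s, x) for s in used)
        total += (y - pred) ** 2
    return total / len(xs)


def make_data(n, seed):
    """Synthetic data (an assumption for this sketch): noisy sine curve."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


train_x, train_y = make_data(200, seed=0)
test_x, test_y = make_data(200, seed=1)

for learning_rate in (0.3, 0.01):
    model = boost(train_x, train_y, learning_rate, n_trees=300)
    for n_trees in (10, 100, 300):
        print(learning_rate, n_trees,
              round(mse(model, learning_rate, train_x, train_y, n_trees), 4),
              round(mse(model, learning_rate, test_x, test_y, n_trees), 4))
```

The third and fourth printed columns are the training and held-out errors. Whatever pattern shows up here is only a hint: with a real XGBoost model you would still want to check the held-out error and the prediction pattern on your own data.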