A few weeks ago, the Statistical Society of Canada posted its Case Studies (a grad data science competition) for the 2019 annual meeting, held in Calgary from May 26 to 29. One of the case studies is about counting cells in microscopic images, which look like this:
Unfortunately, the organizers forgot to remove the actual cell counts from the test set of images.
Ok, that’s not quite fair. Truth is that they tried to remove the true cell counts, but didn’t quite manage to do so.
So here’s what’s going on. The file names of the images in the training set take forms such as A01_C1_F1_s01_w2, and the number following the letter “C” in the name indicates the true cell count in the image. While they removed this number from the test set, they forgot to remove the number following the letter “A”, which is in a simple bijection with the true cell count… The file names in the test set look like this: A01_F1_s01_w1.
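To make the leak concrete, here is a minimal sketch of how one could read it off the file names. The two regular expressions are my assumption, inferred only from the two example names above, and the function names are made up for illustration. The idea is simply to learn the map from the “A” number to the “C” number on the training names and then apply it to the test names.

```python
import re

# Assumed name patterns, inferred from the two example names in the post.
TRAIN_RE = re.compile(r"^A(\d+)_C(\d+)_F(\d+)_s(\d+)_w(\d+)")
TEST_RE = re.compile(r"^A(\d+)_F(\d+)_s(\d+)_w(\d+)")


def build_lookup(train_names):
    """Map each 'A' number to the 'C' number (true count) seen in training names."""
    lookup = {}
    for name in train_names:
        m = TRAIN_RE.match(name)
        if m is None:
            continue  # skip names that do not follow the assumed pattern
        a, c = int(m.group(1)), int(m.group(2))
        # If the 'A' number really determines the count, each 'A' value
        # must map to a single 'C' value.
        if lookup.setdefault(a, c) != c:
            raise ValueError(f"'A' value {a} maps to more than one count")
    return lookup


def leaked_count(test_name, lookup):
    """Return the count implied by a test file name, or None if unknown."""
    m = TEST_RE.match(test_name)
    if m is None:
        return None
    return lookup.get(int(m.group(1)))
```

With the two names quoted above, `build_lookup(["A01_C1_F1_s01_w2"])` learns that “A01” goes with a count of 1, and `leaked_count("A01_F1_s01_w1", ...)` then recovers that count for the test image without looking at a single pixel.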
Now even if the number following the letter “A” were removed, there would still be other problems: the number following the letter “s” in the file name also carries quite a bit of information… I don’t know why they left all that in.
I’ve contacted the organizers about this, but they don’t see it as an important problem for the competition, even though 60% of each team’s score will be based on an RMSE prediction score.
Another fun fact about this case study: it is possible to get a root mean square error (RMSE) of about 1-2 cells through linear regression with only one covariate. Try to guess what predictor I used (hint: it’s roughly invariant under the type of blurring that they applied to some of the images.)
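For anyone who wants to try, the evaluation itself is easy to sketch. Below, `x` is a placeholder for whatever single per-image covariate you pick (I’m not giving mine away) and `y` is the true cell count; the function names are my own, and this is plain least squares with one covariate, not necessarily how the organizers score submissions.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted counts."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))
```

Fit on the training images, predict with `intercept + slope * x` on held-out images, and pass the true and predicted counts to `rmse`.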