Most papers in Artificial Intelligence (AI) research these days have accompanying source code. Sometimes authors release their code with the paper after publication (our research group always does), and sometimes, for various reasons, they don’t (e.g., the code was built in industry on top of other proprietary source code, or the code requires a lengthy internal review before it can be released).
Issues with using Released Source Code
Even when the code is released, you can run into the following problems:
- Code doesn’t compile or run.
You cannot run the code. You even email the authors, but they don’t reply. This is a common issue, and the main reason is that, unfortunately, writing high-quality code doesn’t align well with the incentives of a PhD student. You can spend 3 months improving your code, but your advisor may want to see another research paper. Further, the student may have graduated, may now work in quantitative finance, and may have no interest in replying to your email about code they wrote 10 years ago.
I strongly advocate for incentivizing code quality and for giving strong preference to papers with accompanying code when deciding which papers to accept. The standard argument against this is that it places an additional burden on reviewers, who don’t even have time to read the supplementary material, but I feel that we could automate parts of it (in any case, we need to address it more).
“Accepting a paper without source code is, to an approximation, like accepting a cheque (check) without a signature (it feels like a cheque, it has the right number, the memo looks beautiful, but there is no guarantee).”
Doing this will, however, slow down research, because for whatever reasons embedded in human nature most researchers end up submitting on the last day before the conference deadline, and oftentimes the code is just not ready to be submitted (unfortunately this has happened to me and to almost everyone I know). Discouraging such behaviour will be a good thing for the community. The plot below shows that most papers for NAACL 2018 were submitted on the last day (special thanks to program chairs for this analysis). Full analysis is available here.
- Code gives wrong results.
Even when you can compile and run the code, you may find that the results differ from what the paper reports, or that the code outputs several numbers without telling you which one to use. I argue that errors of this type are much more common and more difficult to deal with. They can happen for reasons that authors cannot foresee, e.g., the version of the underlying library changed (DyNet is particularly susceptible to giving different results across versions). Remember that most code is developed by PhD students who aren’t necessarily the best software developers (this is one skill that we need to incentivize as a community). This kind of dependency-hell situation is common in software development, and solutions have been found to handle it. Releasing Amazon AMI images or Docker files is one way to handle this situation. I have used both of these solutions in the past and they greatly helped me. For example, if you are running experiments across different machines, you can get different results even with the random seed fixed and with synchronous computation (due to different libraries or the way they compile on that system). Docker can help fix this issue, and it also ensures that the performance changes you see are not because you used a different machine but because of the changes you made.
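A lighter-weight complement to releasing an image is to save a record of the environment next to every set of results, so that a later mismatch can at least be traced to a version change. Below is a minimal sketch using only the Python standard library. The function name `environment_snapshot` and the package names in the usage lines are assumptions made for illustration; this does not replace Docker or an AMI.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Return the interpreter version, OS string, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# Hypothetical usage: the package names below are only examples.
snapshot = environment_snapshot(["numpy", "dynet"])
print(json.dumps(snapshot, indent=2))
```

Writing this dictionary into the same file as the reported numbers makes it possible to tell, months later, which library versions produced them.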
- Code performs poorly on a different but related task.
Most research papers using deep learning these days report increasingly complicated neural networks for smaller and more specific tasks. A main problem with this approach could be a form of community overfitting. I define this as:
Community Overfitting: “The phenomenon where a research community as a whole overfits on a single task or a few tasks by trying many solutions, even though an individual researcher may have evaluated on the test set only once.”
A colleague of mine, for example, recently found that several models released for visual question answering datasets perform around chance (i.e., 50% accuracy) when evaluated on a new dataset that she collected. This is in stark contrast to the general feeling (particularly on social media) that we have made great improvements on tasks requiring understanding of vision and language.
A solution to this might be to stop using very old datasets (like the Penn Treebank) and to maintain a held-out test set that is not revealed to the public (this does contradict the goal of keeping everything public) and on which authors cannot evaluate more than once a month (or some similarly long time frame). We also need to be careful when stating our results. Instead of saying “we present a model that improves on the visual question answering problem”, say “we present a model that improves on the visual question answering dataset XYZ (cite ABC)”. I believe it is generally better to be cautious, under-claim, and then enjoy being proven right on new datasets or tasks than to over-claim and find your claims proven wrong.
Three Main Crimes in AI Source Code
When developing big code bases for research, it is helpful to debug them carefully to avoid making wrong claims. In general, however, one cannot guard against all possible sorts of errors (e.g., an error in the underlying library is extremely hard to detect). I have a feeling that there is an underlying power-law-like relationship, where making sure that your code has no error with probability at least 1-p requires a number of debugging hours that grows polynomially with 1/p. There is a fair bit of research now on writing provably correct code (for example, see this paper from Selsam, Liang and Dill). However, today most research code doesn’t use these automated proving techniques. I have come up with the following quick checks to make sure that your code doesn’t contain the gravest types of errors:
- Train-Test Dilution.
This is the classic machine learning error where you use information from the test data when training. This could involve training on the test data itself or a more subtle leakage of information from the test data. The standard guard against it is to split your dataset into 4 disjoint files (train, tune, dev and test) and to verify that these files are indeed disjoint. Then use only the train file for training and the tune file for finding optimal hyperparameters. The dev file should only be used for occasionally checking performance, and the test file should only be used once (generally two weeks before the deadline). Using separate files is better than using indices into a single file, because with separate files an accidentally wrong index cannot expose the wrong split during training (or make you evaluate on train data when testing). I do want to point out that this error can happen in very subtle forms. E.g., I was once using a dataset with train, dev and test splits, together with a model for transfer learning that was publicly available for download from an existing code base. I was getting very high dev performance, but then I got very low test performance (generally you expect a ~1% drop in performance when running on the test set, but this was a ~15% drop). I then realized that the publicly available model had used labels from the dev set when training. Fortunately, we were able to fix this error and improve our model in other ways to gain back our performance before submitting our results.
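The disjointness check described above is easy to automate. The sketch below assumes that each split is stored as a text file with one example per line and that two examples are the same exactly when their stripped lines are equal; both are assumptions, and a real dataset may need a different notion of identity (for example, an example ID). The function names are hypothetical.

```python
from itertools import combinations


def read_examples(path):
    """Read one example per line into a set, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def find_overlaps(splits):
    """Return {(name_a, name_b): shared_examples} for every pair of splits that overlap."""
    overlaps = {}
    for (name_a, a), (name_b, b) in combinations(splits.items(), 2):
        shared = a & b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Small in-memory demonstration; with files one would build the dictionary as
# {name: read_examples(name + ".txt") for name in ("train", "tune", "dev", "test")}.
demo_splits = {
    "train": {"ex1", "ex2", "ex3"},
    "tune": {"ex4"},
    "dev": {"ex5"},
    "test": {"ex3", "ex6"},  # "ex3" also appears in train
}
print(find_overlaps(demo_splits))
```

Running a check like this once when the splits are created, and again before the final test run, catches the direct form of the error. It would not catch the subtler case described above, where a downloaded pretrained model had already seen the dev labels.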
- Label Leakage.
Label leakage is the phenomenon where your model uses the gold labels (or some other prohibited information) for an input when evaluating. For example, a classifier has to predict a label for an image, but that label is encoded in the image itself, say as a watermark, or, in the more direct case, is given as input to the classifier. The standard guard against label leakage is to isolate the prohibited information inside a class. E.g., if you are using Java, then all prohibited labels should be made private, and one is only allowed access to a scoring function that computes a score from the gold label and the predicted label.
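The same idea can be sketched in Python, with the caveat that Python has no truly private fields: a double-underscore attribute is only name-mangled, so this discourages accidental access rather than preventing it. The class name `Evaluator` and the choice of accuracy as the score are assumptions made for illustration.

```python
class Evaluator:
    """Holds the gold labels and exposes only an aggregate score."""

    def __init__(self, gold_labels):
        # The double underscore triggers name mangling, which keeps the labels
        # out of easy reach of model code (it is not a security boundary).
        self.__gold = list(gold_labels)
        if not self.__gold:
            raise ValueError("at least one gold label is required")

    def accuracy(self, predictions):
        """Return the fraction of predictions that match the gold labels."""
        predictions = list(predictions)
        if len(predictions) != len(self.__gold):
            raise ValueError("expected exactly one prediction per example")
        correct = sum(p == g for p, g in zip(predictions, self.__gold))
        return correct / len(self.__gold)


evaluator = Evaluator(["cat", "dog", "cat", "bird"])
print(evaluator.accuracy(["cat", "dog", "dog", "bird"]))  # 3 of 4 correct -> 0.75
```

The model code only ever receives the `evaluator` object, so the only thing it can learn about the gold labels is the returned score.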
- Wrong Evaluation Metric.
A wrong evaluation metric will give you wrong results. E.g., one can compute the F1 score incorrectly or write a buggy function of one’s own for the BLEU score. I remember that the BLEU scoring function in Python NLTK used to give wrong results. In general, when using a complicated evaluation metric, use the evaluation scripts that previous publications have used, or, when a dataset comes with an accompanying evaluation script, use that. When writing your own implementation of an evaluation metric, I suggest computing the score by hand for a set of randomly chosen inputs and then checking that your implementation gives the same values.
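As an illustration of checking an implementation against a hand computation, here is a small binary F1 function together with one example worked out manually. The function name `binary_f1` and the convention of returning 0.0 when there are no true positives are assumptions of this sketch, not a standard taken from any particular evaluation script.

```python
def binary_f1(gold, predicted, positive=1):
    """F1 score for the positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hand computation for this example:
#   true positives = 2, false positives = 1, false negatives = 1
#   precision = 2/3, recall = 2/3, so F1 = 2/3
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
score = binary_f1(gold, predicted)
assert abs(score - 2 / 3) < 1e-12
print(score)
```

A handful of such hand-checked cases, including degenerate ones such as no positive predictions at all, makes a useful regression test for the metric code.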
When supervising an intern, I ask them to make sure that error types 1 to 3 don’t exist in their code. This has the advantage of ensuring that their code is trustable, by which I roughly mean that the results from the code can be reported for that task. The following argument makes this clearer.
Theorem: If your code doesn’t contain errors 1 to 3, then the results from your code can be reported.
Proof: Say you train your model on this task and then evaluate it. By guarding against errors 1 and 2, you make sure that you haven’t seen the test data or the labels of the test examples when training. The model that you train is then evaluated on the test set and not on the train set, since you don’t have error 1. So now you are evaluating, on the test set, a model that is not using the test labels and hasn’t seen this test set before. Since your evaluation metric is correct (no error 3), your results are also correct.
This is only a proof against high-level errors. If you save the model wrongly, give wrong inputs to the evaluation metric, or use other information in the world that you are not allowed to use (even though you are not using the test set), then the above proof doesn’t hold. But I hope I am able to get the rough idea of “trustable results” across.
In this blog post, I talked about certain errors that arise when running released code and about how to make sure that your own code doesn’t have the gravest errors before submitting your results. I also talked about a rough idea behind trustable results. I haven’t talked about debuggability or about whether what one claims in the paper is what their code does. For example, one can avoid errors of type 1 to 3 and still have a model that is incorrectly implemented or different from what is described in the paper. In general, it is good to see that the community is talking more about these issues, and I hope that my blog post will help the discourse.