Why Can't I Reproduce Their Results?


At the start of my PhD I remember well the experience of reading state-of-the-art papers and trying to re-implement them to reproduce their results.

To say this was a frustrating experience would be an understatement, and I consistently achieved the same result: it didn't work. Whatever I tried, I couldn't reproduce their results, and even when I got something running it wasn't the same - the quality was worse and it broke all the time.

Along with this, I'd seen the skepticism which had destroyed the motivation of many of my friends and colleagues. I started to wonder if some of the things they were saying were true: "it's all cronyism and favors at the top" - "authors purposefully remove details to maintain their competitive advantage" - "results are all cherry picked" - "it's just nonsense hidden behind mathematical jargon". Well - it's hard to deny that sometimes some of these things do have an element of truth to them.

But what I didn't realize as a graduate student is this: reproducing papers is the learning-by-rote of academia. If you're feeling the pain it's because you're doing it right. The goal of the fake-it-until-you-make-it school of learning isn't to actually succeed - it's to build up enough familiarity with something so that the next time you attempt to do it, you're not blinded by fear or undermined by a lack of confidence.

Nonetheless, the pain and frustration you can feel as a graduate student is real and if I could go back in time to give myself some advice, at least to ease the pain a bit, this is what I might say:

Why doesn't it work?

I'm about to tell you something which can sometimes be harder to believe than conspiracy theories about academia: you've got a bug in your code.

I can't tell you what it is, because it could literally be one of a million things - ranging from the mundane to the fundamental. Perhaps you have a typo? Perhaps you used the wrong variable name? Perhaps you called a function with the arguments in the wrong order? Perhaps you're calling the wrong function? Perhaps you misunderstood what a particular function does? Perhaps you have an off by one error? Or you indexed an array incorrectly? Perhaps you have a bug in your data pre-processing? Perhaps your data isn't clean in the first place? Perhaps it has outliers or invalid entries? Perhaps there is a bug in your visualization? Perhaps you're visualizing the wrong thing? Perhaps you need to transpose your matrices? Perhaps you're using the path to the wrong file? Perhaps you have numerical issues? Perhaps you need to add a small epsilon value to some equation? The list is endless...

Debugging research code is extremely difficult, and requires a different mindset - the attention to detail you need to adopt will be beyond anything you've done before. With research code, and numerical or data-driven code in particular, bugs often will not manifest themselves as crashes. Buggy code will frequently run to completion and simply produce some kind of broken result. It is up to you to validate that every line of code you write is correct. No one is going to do that for you. And yes, sometimes that means doing things the hard way: examining data by hand, and staring at lines of code one by one until you spot what is wrong.
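As a small illustration of how quietly this can happen, here is a minimal Python sketch. The `mean` and `check_finite` helpers and the `readings` data are hypothetical, made up purely for this example: a single invalid entry turns the result into NaN without any crash, and only an explicit check makes the problem visible.

```python
import math


def mean(values):
    """Plain arithmetic mean; does no validation of its input."""
    return sum(values) / len(values)


def check_finite(name, values):
    """Raise if any value is NaN or infinite, rather than letting it propagate."""
    bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
    if bad:
        raise ValueError(
            f"{name}: {len(bad)} non-finite value(s), first at index {bad[0]}"
        )
    return values


# Hypothetical measurements containing one invalid entry.
readings = [0.5, 1.5, float("nan"), 2.0]

# The unchecked version runs without error, but the result is NaN.
silent = mean(readings)
print("unchecked mean is NaN:", math.isnan(silent))

# The checked version fails loudly and points at the offending entry.
try:
    mean(check_finite("readings", readings))
    caught = False
except ValueError as err:
    caught = True
    print("check failed:", err)
```

Checks like this won't find every bug, but sprinkling them through data loading and pre-processing is one way to turn some silent failures into loud ones.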

Why are my results worse?

One thing that I think attracts a lot of us to academia is the thought that it may be a domain where ideas, rather than other factors, can triumph. We all want to work in a field where a good idea, a good way of doing something, or solving something, or thinking about something, is all that is required for being recognized.

But we also work in the field of computer science, and that means people don't just want an idea - they want a proof - they want the idea implemented in code on a computer, with experiments, and evaluations, and comparisons.

And here is where it gets tricky, because programming is undeniably a skill to be practiced and improved, and no good idea will produce good results if you don't have the skill to implement it properly. Experience matters too. Things like how quickly you can iterate, how intuitively you can work out what is wrong, how easily you can fix it, how deeply you understand the concepts you're using, how many times you've programmed this sort of thing before, all make a massive difference in the manifestation of the idea.

In fact, most often research in computer science is not at all the meritocracy of ideas we imagine. An average idea executed well tends to produce better results than a good idea executed poorly. And I'm sorry to say, but this is most likely why your results are worse - the original authors just have more practice and experience doing this sort of thing than you - that's all.

So give it time, with experience and practice your results will improve, and eventually you will be ready to combine them with an excellent idea - ready for the perfect slam-dunk.

Why is the notation so difficult and imprecise?

Have you ever read a mathematical paper from before the time mathematical notation was invented? Take a look at this quote from a mathematical text of 1557 which introduces the equality sign:

"Howbeit, for easy alteration of equations, I will propound a few examples, because the extraction of their roots, may the more aptly be wrought. And to avoid the tedious repetition of these words: is equal to: I will set as I do often in work use, a pair of parallels, or Gemini lines of one length, thus =====, because no 2 things, can be more equal."

It's easy to forget that there was once a time when mathematics was communicated without notation, in prose, as if spoken out loud. Just imagine what a modern 13-page SIGGRAPH paper might look like under those constraints...

And history is important in this case, because the mistake most graduate students make when it comes to mathematical notation is in believing that its primary purpose is precision. Mathematical notation has always been, and always will be, for conciseness and understanding first. Precision and computation are secondary.

At the same time, most systems presented in modern computer science papers are vastly complex and difficult to understand - built upon countless other systems, decisions, and assumptions. What you read in a paper is just a fraction of the actual details that go into what you see. Authors are of course aware of this, and writing a paper requires a delicate balance between making sure all the details are included and avoiding it becoming tedious or impossible to read. It means that often in describing the algorithms and code used to produce results authors are left with a difficult choice:

A) To describe the implementation in a way which is more precise, but much more complex, lengthy, and difficult for the reader to understand.

Or

B) To describe the implementation in a way which is less precise, but simpler, shorter, and easier for the reader to understand.

Given this lose-lose situation I believe that most researchers (including myself) tend toward the second option (perhaps to the disappointment of many graduate students). The idea is that if a reader can be made to understand the concepts well, they will hopefully be able to fill in the less important details themselves.

So remember this: mathematics was once prose, and just because it is symbols now it does not mean you can treat it like code. Difficult as it is, try to use your intuition. Study notation conventions and the related work. Delve into the mathematics and the history as much as you can, because truly that will make writing the code easy.

Why do papers use inconsistent terminology?

One of the most basic rules for good, fluid, easy to read prose is to not repeat words too frequently. Ultimately, this doesn't change for academic writing and subconsciously many academics will switch terminology mid-flow to avoid the heavy, clinical feel that too much repetition can introduce. This is particularly true in the abstract and introduction section of papers which are meant to set the scene in an easy-to-read way.

But to graduate students, who will most likely not be familiar with every named variation of the same concept, this frequent switching in terminology can make papers almost impossible to read without tiresome cross referencing or frequent checks to indecipherable Wikipedia articles.

I would try to forget about precise definitions, cross referencing, and Wikipedia for now. Try to infer as much as you can from the context. More precise understanding will come later as you implement things, and many terms in academia are vague and overloaded anyway. Papers will typically tighten up their terminology use around the methodology section, so pay attention there, and don't hesitate to do a full context switch and invest significant time in understanding a tangential method, concept, or term if it looks important.

How can I catch up with everyone else?

Have you ever been trying to implement a paper just to find a line like this: "We start from [Smith et al. 2020]", only to find out that [Smith et al. 2020] is actually the latest paper in a chain of papers compounding five years of work done by a PhD student with an inaccessible code base, consisting of thousands of lines of code?

Yeah... that's frustrating. But research by definition is on the brink of human knowledge, and unfortunately that means that more often than not it cannot be completed in a few months and put nicely into a stand-alone accessible package.

The bad news is that depending on how much engineering work really went into their research, it might well be as impossible as it appears to reproduce their results without the code and data.

The good news is that you now have the chance to repeat what they've done, but with the renewed fearlessness of a fresh graduate student, five years of additional new developments in the field, and a fair amount of gained hindsight. You could well end up doing it better...

How did authors decide on the value of parameter X?

Anyone who has grown up in London will know that when riding the underground there is a simple way to determine who is a tourist and who is a Londoner. Simply wait until the train stops at a station and see who presses the bright orange "Open Doors" button. It probably won't be a Londoner, because most of them know this is just a placebo button - it doesn't actually do anything.

Of course, the tourists don't know that, and to them it must seem like everyone else is crazy, as the train rolls to a stop, and people queue to get out, no one reaches to press the big flashing orange button to open the doors. And unless someone tells them directly, (or they run a detailed scientific study of their own), there isn't really any easy way to find out by themselves - under normal circumstances pressing the button is almost always going to appear to make the doors open shortly after.

Now imagine that for some reason no member of the public really knew the exact, complex mechanism behind the Open Doors Button (an important Transport for London secret) - as a new arrival in London whose side might you be more sympathetic to? The Londoners who say investigating it is a waste of time because it doesn't do anything? Or the tourists who say that it requires more scientific investigation?

Most people who've tried to learn about neural networks will have faced the inevitable mystery over the way people choose all the parameters involved. There are hundreds of choices, from the number of hidden units, to the number of layers, the learning rate, the optimizer, the convolutional kernel size, the loss function, etc.

Deep learning has therefore often been called black magic, and many people say there is a lack of rigor which harms understanding and reproducibility. They say it is a field dominated by money, compute, and engineers optimizing for fancy results.

But if you conclude this you are left with an odd juxtaposition: on the one hand the field supposedly lacks rigor, and on the other deep learning consistently produces impressive results. So is it really being done by people who have no understanding of what they are doing?

Perhaps people are just optimizing for benchmarks? Slowly tweaking parameters until they get the results they want without any real understanding of the effect each parameter has? As a graduate student this answer is comforting because it explains why someone else might have gotten things working while you didn't - "Ah! The only reason my stuff doesn't work is because everyone else spent a long time tweaking all their magical numbers! I'm the only one concerned about the actual science!"

Well here is a Londoner's much more disappointing take: many of these parameters just don't matter that much.

If an author doesn't focus on a parameter in their paper it could be because they don't understand it and it requires a more detailed scientific investigation - it could be that it is task-specific and requires domain knowledge to set - or it could be simply that it just doesn't matter that much. Not to say you can't set it incorrectly - probably you can - and not to say the authors didn't optimize it - they probably did spend time tweaking all those parameters to squeeze out that final one or two percent of performance - just that this parameter being set to exactly what it is set to probably is not responsible for the functioning of the system as a whole.

So faced with a parameter whose effect you don't know, you have only a few options: find out what other people say, make an educated guess and try to forget about it, or invest the time in really understanding what this parameter does. Choose wisely.
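If you do decide to invest the time, one cheap way to start is a sweep: run the same experiment over a range of values and see how much the outcome actually moves. The sketch below is only a toy, assuming a made-up quadratic loss and plain gradient descent. The `final_loss` function and the list of learning rates are illustrative and not taken from any paper.

```python
def final_loss(learning_rate, steps=200):
    """Run gradient descent on the toy loss f(x) = (x - 3)^2; return the final loss."""
    x = 0.0
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)  # derivative of (x - 3)^2
        x -= learning_rate * grad
    return (x - 3.0) ** 2


# Hypothetical values to try, spread roughly evenly on a log scale.
learning_rates = [0.01, 0.03, 0.1, 0.3, 1.1]

# Same experiment, one parameter varied, everything else held fixed.
results = {lr: final_loss(lr) for lr in learning_rates}

for lr, loss in results.items():
    print(f"learning_rate={lr:<5} final_loss={loss:.3e}")
```

In this toy setting several values spanning an order of magnitude end up at a loss very close to zero, the smallest is merely slower to get there, and the largest is simply set incorrectly and diverges. Real systems are messier, but the same kind of sweep can hint at whether a parameter really matters or not.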

Why is the source code and data unavailable?

In many ways in graphics we're lucky to be so well connected to industry - people actually care about our research - funding is available - and progress is evident. But it can make the release of code and data close to impossible.

However, when it comes to code and data in the academic world the mantra is this: no matter how horrible, untidy, and shameful your code is, releasing any code is better than nothing. That may sound good, but have any of you ever released research code so bad it got someone kicked out of the country? I have.

In one case I was contacted by a student who'd been using some of my research code to try and complete his master's project. He'd got some very early results by hacking the code into an even worse state than it was already, but now he was stuck, and his project was in such a mess it was certain he could not finish it, nor would he have the time to start again from scratch. In his e-mail to me he was angry, stressed, confused, and ashamed. While I could answer a few things in response it was clear to me that his understanding of things was poor, and to undo the knots he had tied would be beyond an e-mail chain. At some point he stopped messaging me, and a while later I found out that the stress of completing all his other courses on top of his master's project had led him to make the very bad decision to cheat on one of his exams. He'd been caught, expelled from the university, had his visa revoked, and ultimately been kicked out of the country.

I still occasionally see my code appearing in other academic projects and it is never in a better state than when I released it. The same weird hacks, bugs, and errors are inevitably still there unfixed, and I can't help but cringe at the pain I know it will have undoubtedly caused so many poor students. It is impossible not to wonder if letting all those people start from scratch might have been better.

The ultimate problem with having the code available is that it gives the illusion it will be easy to "just do small experiment X", when in reality with code that is bad enough this can sometimes be even more difficult than starting from scratch.

Don't get me wrong, code is extremely valuable, and very important for unambiguously reproducing results or as a reference for the exact details of what was done - and in almost all cases I think it is an excellent idea to release it if you can - but you the graduate student should be careful, because it can sometimes be detrimental to understanding and even implementation. There is no substitute for doing it yourself.

Why are everyone else's results so polished?

Most conferences in graphics are single- or double-blind in the review process. However, when you see a Pixar, Disney, or DreamWorks character in the results you can be pretty sure which company some of the authors work for. Other groups re-use the same rendering environments and characters between publications, so while as a reviewer you can never be completely certain who the authors are, it often isn't difficult to make an educated guess.

As a graduate student I always wondered how much of this was done on purpose. Groups like Disney Research have an excellent reputation for research, so why wouldn't they make it obvious it was them? I remember being pleased one of the characters we were using in our paper looked cartoony - would reviewers think our paper was a Disney one?

But what I didn't understand as a graduate student is that using a Disney character isn't what gives Disney Research papers a good reputation - good presentation is what gives Disney Research (or any other well known group, I don't mean to pick on Disney Research!) papers a good reputation.

And good presentation matters in research, a lot. For most reviewers, care taken in presentation is a proxy for care taken in methodology. And why not? If you have sloppy rendering, visual artifacts, confusing graphs, poor visualization, or bad diagrams, who is to say you haven't made the same kinds of mistakes in your mathematics?

Perhaps in a perfect world everyone would be evaluated on their contribution alone, not if they had nice soft shadows and reflections in their rendering, but academic research is a process which takes place in the real world, and papers are judged not on their contributions alone, but on themselves as a whole package. Fundamentally people want to be excited by graphics research and beautiful presentation is one of the things that makes this field joyful to work in - don't undervalue that.

Why are my reviewers so lazy and unfair?

Look - there is and always will be a Reviewer 2 - but the secret about Reviewer 2 is that of all the people who will read your paper, there will be those who are even more bitter, lazy, stupid and unfair than Reviewer 2. People will read your paper and think worse, and have dumber ideas about it, and misunderstand it even more horribly. If you really want everyone to like your paper pleasing Reviewer 2 is just the first step on the path.

Reviewer 2 probably got to your paper after they had already reviewed seven others. Onto their third cup of coffee they skimmed the abstract, started reading the intro, decided they did not like the subject, and found some technical details to nit-pick in the methodology so they could reject it.

It's a good exercise to count how many reviews you've done, compared to how many you've received. If you are a net receiver of reviews it can be difficult to understand the mentality behind someone like Reviewer 2 who rejects papers without giving them a fair evaluation. By the time you've become a net giver of reviews you can start to see how a person might become like that...

So here is my advice to graduate students who have unwittingly become an enemy of Reviewer 2: become the thing you hate the most. Become Reviewer 2. Just for a little while. Write your paper as if you have zero patience, as if a missing citation drives you into an unfathomable rage. Write with the bitterness of a grapefruit, and the cynicism of a cyanide pill. Write your paper for the person who is angry simply at the fact that you would dare to try and write a paper. Dot every "i", cross every "t", and give Reviewer 2 nothing to say in their review but a caveman grunt of displeasure.

How does everyone else know so much?

Four years is a long time. The Manhattan Project took four years. And I bet at the beginning of that they didn't know much about how to make an atomic bomb. But you know what would have made working on the Manhattan Project completely intolerable? To be surrounded by people from four years in the future who already know everything about atomic bombs, and who speak very fast using lots of words and terminology you don't understand, and who can't give you much help because they are too busy on their own projects.

Does that sound familiar? Doing a PhD means constantly being surrounded by other very smart people who may be years ahead of you in their studies. Just try to remember that no one feels like they know anything when they first start on a PhD. No one understands all the words and terminology and what everyone else is saying - everyone just nods along until they get asked a direct question by their supervisor at which point they panic.

But equally, everyone picks it all up given time - and before you know it you will be the fast-talking, terminology-spewing person you hated a couple of years ago. Give it time, stay motivated, work hard, you're going to do great.