My Data Science Horror Story

Vincent Vanhoucke on 2018-11-05

Lessons I learned from a big text-to-speech model flub

Credit: RapidEye/E+/Getty

There is an apocryphal story in text-to-speech circles about a researcher spending months or years refining their speech-generation model, making the speech samples sound better and better, day after day, only to discover that they’d accidentally been listening to the same audio file the entire time and had merely been attuning themselves to it. That story still sends shivers down my spine when it is retold.

Picture this other horror story: You’re an intern, and you’re asked to build a “yes” versus “no” speech classifier. You have audio files: yes1.wav, no1.wav, yes2.wav, no2.wav, yes3.wav, and so on. You build your classifier and obtain great results. The moment you are about to present your work, you discover that the only thing your model is actually doing is reading the words “yes” or “no” in the filenames of your audio files to determine the answer, and not listening to the audio samples at all. So you cower in shame, cry a lot, and find the nearest exit.

Well. About that.

This is my (true) version of that story, and it shaped my career as a data scientist.

It was my first job as a researcher. I had a well-defined task, lots of data, and a good accuracy metric to evaluate the model against. I had a strong baseline, which was actually deployed in production with a customer. I was improving on the metric through what I thought was sheer cleverness and brilliance. It wasn’t perfect yet but was getting better every day. I could see a solid academic paper shaping up in my mind, and life was good.

This was industrial research, so before writing that paper, I had one final test to pass: evaluate it on real customer data, with the goal of deploying the improvement to production in short order. And on customer data, my model achieved exactly zero percent accuracy.

Maybe I had a bug, maybe the customer data was bad—I thought, “Oh well.” I had to move along because I had a paper to write. But I also couldn’t let it go, and I started digging. What I found was absolutely every data scientist’s worst nightmare: Zero percent accuracy was the correct figure. All my other accuracy figures were “phantom” numbers. I was in complete disbelief: These numbers looked so plausible. They were better than the baseline but not perfect. They were improving experiment after experiment like any well-behaved metric should. How could they be the incorrect figures and not the flat zero percent?

People often say that catastrophes happen not when something goes wrong, but when two things happen to go wrong at the same time because we’re generally pretty good at imagining and correcting single points of failure. To understand the sheer improbable sequence of things that had to happen for those plausible accuracy figures to come up, I had to go into the details of the task.

The goal was to improve the grammar data structure used to recognize people’s names. If your name is “Robert Moore,” a speech-recognition system will likely compile your name into a small phonetic graph that could roughly look like the regular expression “/(ˈɹɑb.əɹt|ˈbob|ˈɹɑb) mʊɹ/”—allowing for “Rob” or “Bob” as a nickname. My task was to come up with a better graph. My data was stored as records in a key->value database:

record {
  key (string): “robert_moore”
  value (Grammar): /(ˈɹɑb.əɹt|ˈbob|ˈɹɑb) mʊɹ/
}
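To make the “graph as regular expression” idea concrete, here is a minimal Python sketch (the `accepts` helper and the use of Python’s `re` module are illustrative, not part of the original system, which compiled to an actual phonetic graph):

```python
import re

# The name grammar for "Robert Moore" written as a regular expression
# over IPA phoneme strings, allowing "Rob" or "Bob" as nicknames.
grammar = r"(ˈɹɑb\.əɹt|ˈbob|ˈɹɑb) mʊɹ"

def accepts(pronunciation: str) -> bool:
    """True if the pronunciation is accepted by the name grammar."""
    return re.fullmatch(grammar, pronunciation) is not None

accepts("ˈbob mʊɹ")   # True: the "Bob" nickname is in the grammar
accepts("ˈdeɪv mʊɹ")  # False: a different first name is rejected
```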

I had a bug: Some of the phonetic symbols I used in my grammar data structure were not understood by the pronunciation engine. The system tried to compile the grammar data structure into a graph object, which was meant to represent the regular expression in graph form, and failed. Deep into the code, someone had tried to be “robust” to those failures: After all, you never want to crash in production if you can avoid it, right? The code looked like this:

Graph* graph = compile(record->value);
if (!graph) {  // Failed to compile.
  graph = compile(record->key);  // Huh?!?
}

That was utterly unexpected: Why would anyone think that if a database record is corrupted, the key by which this record was retrieved contained the actual payload? And how does that even work? “Value” was a serialized record of type grammar, “key” was a plain string. Digging further—oh look, more “robustness”:

Grammar* grammar = parse(record);
if (!grammar) {  // Failed to parse.
  grammar = parse(pronounce(record));  // Huh?!?
}

If the data wasn’t of the expected type, we would take whatever was contained in that record and try to pronounce it as if it were a word. Why not? We’re desperate, right? Incidentally, pronunciation generation is an expensive operation. Imagine what this means for that system if it were suddenly fed long strings of garbage for whatever reason (including hostile attempts at denial of service). Instead of “failing fast,” the system would be immediately overloaded.
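A fail-fast alternative can be sketched in a few lines of Python (the `Record`, `parse`, and `load_grammar` names, and the toy parser, are all illustrative, not from the original system):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    key: str    # database key, e.g. "robert_moore" or a random hash
    value: str  # serialized grammar payload

def parse(value: str) -> Optional[str]:
    # Toy stand-in for the real grammar parser: only values that look
    # like /.../ grammars parse; anything else is a corrupt record.
    return value if value.startswith("/") and value.endswith("/") else None

class GrammarParseError(ValueError):
    pass

def load_grammar(record: Record) -> str:
    """Fail fast and loudly: a corrupt record raises immediately,
    instead of silently pronouncing whatever bytes are in the record."""
    grammar = parse(record.value)
    if grammar is None:
        raise GrammarParseError(f"unparseable grammar for key {record.key!r}")
    return grammar
```

The point of the sketch is the shape of the failure path: the bad record surfaces at ingestion time, with its key in the error message, rather than propagating garbage downstream.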

You may already see where this is going. In my experimental data, the key of my records was the actual name of the user, such as “robert_moore,” which the pronunciation engine was happy to interpret as an approximation (not a perfect one, some names were mangled) of “/ˈɹɑb.əɹt mʊɹ/.” And there you go: I had plausible, imperfect data, derived directly from the ground truth I was scoring against.

This was basically the conceptual equivalent of looking up the file name to determine if “yes” or “no” was said in the scenario I mentioned earlier. The unexpected thing was that random experimentation with the pronunciation model did seem to improve it. However, that had to do with what fraction of the data failed to compile for any given experiment: The more my model failed, the more errors were generated, the more of the ground truth was used, and the better my metrics got. The production data that scored zero percent accuracy? The keys in the database were random hashes like “h4a7n6ks2l”—go ahead, clever code, try to pronounce that!

I got lucky. When I fixed the symbol-lookup problem, my improvements turned out to be real: The new system was better. But weeks of experiments were completely invalidated, and I had to explain to my colleagues how close I had come to launching something that could have corrupted every customer record, and why our offline scores got better the more records I corrupted. To their credit, they merely laughed.

Now for the lessons learned.

1. First and foremost, trust no one and trust nothing.

The world is out to get you, especially in data science. Most mistakes will thankfully make results look worse, but sometimes they make them just good enough to look plausible. In fact, that’s a common curse of the whole field of language modeling: Computing and comparing perplexities across experiments is fraught with pitfalls, and tiny mistakes tend to improve the experimental figures, not worsen them. People working in that field have developed very high standards of proof as a result, and embraced open-source evaluation tools before the practice was widely adopted elsewhere.

2. Trust yourself even less.

Early in my academic career, I learned to obsessively question every result I obtain, which was not something that came naturally to me. I now constantly seek ways to get independent confirmation of results, preferably using a different codebase.

3. Write dumb, defensive code.

Don’t be clever. Your code should be as paranoid as you are, and fail quickly and loudly at the smallest whiff of a contract not being respected. Every programmer has, at least once, had the experience of reading through a stack trace and landing in a block of code annotated with “/* this should never happen */.” Write enough data to disk, and even bit flips happen. I once had a production system crash due to XML parsing errors. Here is what its (programmatically generated) config file looked like on disk:

<item/>
<item/>
… 1 million lines like that …
<item/>
<itel/>
<item/>
<item/>
… a million more …
<item/>

Spot the difference? I can’t wait for the next CME (coronal mass ejection) event to make us all better programmers. (Friends don’t let friends store XML. Use protocol buffers.)

4. Don’t trust your code, but trust your own data processing even less.

There are many other ways in which data goes wrong, and even if 1 percent of your (large) dataset is wrong, you may be making completely incorrect A/B comparisons. For example, some images in popular web datasets are not readable by some image parsers. If you use a different parser or if you’ve included those images in your denominator counts, your figures will be different from everyone else’s. For a long time, I double-counted some test images in my own evaluations, and it gave me plausible-looking figures that were all wrong.
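As a toy illustration of the denominator trap (the numbers here are made up, not from any real dataset): suppose 10 of 1,000 test images fail to load, and the model classifies 900 of the 990 readable ones correctly.

```python
correct, readable, total = 900, 990, 1000

# A pipeline that silently drops unreadable images divides by 990...
acc_lenient = correct / readable
# ...while one that counts every test item divides by 1,000.
acc_strict = correct / total

# The gap is nearly a full percentage point on the same predictions:
# enough to "win" or "lose" an A/B comparison without changing the model.
gap = acc_lenient - acc_strict
```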

5. Try to actively break your experiments.

Randomize your labels, and make sure you get chance accuracy. Train on one percent of your data, and make sure you can overfit to that. Better yet: Give your model to someone, and have them use it. Every lab has that one person who will break perfectly valid code the minute they touch it. Find that person.
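The first of those checks takes only a few lines. Here is a self-contained Python sketch (the helper names are mine, not from any framework): hold the model’s predictions fixed, shuffle the ground-truth labels, and confirm that accuracy collapses to chance.

```python
import random

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def shuffled_label_accuracy(preds, labels, n_trials=200, seed=0):
    """Mean accuracy of fixed predictions against randomly shuffled labels.
    If this lands far above chance, information is leaking from somewhere:
    filenames, database keys, or the ground truth itself."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        total += accuracy(preds, shuffled)
    return total / n_trials

# Balanced binary "yes"/"no" task: chance accuracy is 0.5.
preds = ["yes", "no"] * 50
labels = ["yes"] * 50 + ["no"] * 50
```

On this balanced toy task, `shuffled_label_accuracy(preds, labels)` should hover near 0.5; a leaky pipeline like the one in this story would stay suspiciously high even after the shuffle.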

This healthy skepticism of one’s results is possibly the biggest trait difference that I see between people who have a doctorate and those who don’t and learn it on the job. We’ve all been burned. In retrospect, I was very lucky to get spooked early in my career in an appropriately embarrassing way that would ensure I’d always be on my toes. There is, unfortunately, no such thing as happy little accidents in data science.