“The world is full of obvious things which nobody by any chance ever observes.”
— Sir Arthur Conan Doyle, The Hound of the Baskervilles*
It was your average day at the InfluxData offices. The usual sounds filled the air: the clack of a ping pong ball, the ever-present drip of an overworked coffee pot. A day just like any other — or so I thought.
It all started when a new Chronograf ticket came in, where a user had reported their graphs behaving strangely.
I give you ‘Exhibit A’, a screenshot submitted by our affected user:
Though this graph is quite the visual cacophony, the issue was clear immediately: Chronograf was plotting this user’s data out of order. For a product like Chronograf, this is about as critical as a bug can be. Because we’re in the business of time series data, we’d better be sure we’re respecting the laws of physics and that time continues to function as it has since the beginning of, well, time.
At this point, Chronograf had seen several releases over the better part of a year. How had we not seen behavior like this before?
The user also provided us with the sample dataset that resulted in the above graph, hereby dubbed ‘Exhibit B’. The dataset itself was large so I won’t reproduce it in full, but here is a small sample we’ll use for investigative purposes:
Note: InfluxDB supports multiple time formats, though in this case we’re using Unix time. Not to spoil too much, but this will be highly relevant soon!
At first glance, the data seems reasonable. The values seem OK, and the timestamps are ordered chronologically. No cause for alarm, right?
“Wait, you said `sort` does WHAT??'”
Now I had exhibit A, a graph which explained how the bug affected the end-user. And I had exhibit B, a sample dataset I used to accurately reproduce the issue.
The next step was to ask myself a few questions which I’ve found helpful when beginning the debugging process:
1) What sections of the code does this bug touch? Who is the most likely offender?
I immediately thought of a piece of code that converts raw InfluxDB responses, which use JSON formatting, into data structures that our graphing library understands. As far as I could tell, the sample dataset was ordered correctly. It had to be something farther down the line.
2) Are there any obvious anomalies?
As far as plotting data chronologically, Chronograf seemed to function properly in all other known cases. Given the graph we saw earlier, it’s also the kind of bug you feel like you’d tend to notice.
Here is where it starts to get good: the test data was noteworthy because the first point starts on day one of what Unix considers to be the dawn of the universe. Unix timestamps represent the amount of time (usually in seconds or milliseconds) that has passed since 00:00:00 UTC, January 1, 1970.
For example, the ‘Unix time’ as of me finishing this sentence is `1458851975430`. On its own, maybe this isn’t a particularly exciting or interesting detail. It’s test data, who cares if we’re plotting points from 1970 or 2016? We do, as it turns out!
I took a hard look at the data after it had gone through the conversion process, immediately before it made its way into our graphing library. I confirmed it was out of order. Using the same small subset of data from the last section, here’s a view of the test data as it appeared to me:
Wait, what? Why on earth is `1000000000` (9 zeroes) appearing before `200000000` (8 zeroes)? This new insight was enough for me locate the perpetrator, a line that looked something like this:
`// allTimestamps is a list of unique timestamps,
// e.g. [100000000, 200000000, 300000000 …]
I remembered why this line was initially added: if for any reason the data came in to the converter out of order, or merging multiple series was required, we wanted to ensure that our data was chronological before trying to plot it on a graph.
And suddenly it all made sense. Let’s try something…
Does the result surprise you at all? I imagine it does, as this is the output you should have seen:
`[2, 10, 5].sort()
// => [10, 2, 5]`
The fix itself was simple. Instead of using `sort` without any arguments, we provided what Mozilla calls a `compareFunction`
If you ran the previous example in your browser’s console, give this one a try:
`[2, 10, 5].sort((a, b) => a – b)`
OK, now we now how `sort` works. We know what to fix, but we still haven’t fully answered our original question: Why did we have to wait so long for a bug like this to make an appearance? The answer lies in the number of significant digits for each timestamp. With the provided test data, we had multiple rollovers where the timestamps not only grew in numerical size, they grew enough to require another significant digit to represent them.
We can illustrate this by using the test data, and it only requires a few points:
Try sorting these timestamps without a `compareFunction` and see the results:
`[800000000, 900000000, 1000000000].sort()
// => [1000000000, 800000000, 900000000]`
One last experiment for the console, where the results should be identical:
`// Using Three realistic InfluxDB timestamps:
[1459444040000, 1459444050000, 1459444060000].sort()
[1459444040000, 1459444050000, 1459444060000].sort((a, b) => a – b)`
Chronograf was dealing with timestamps like this in virtually all cases, explaining why it took a unique dataset for us to track down this particular bug.
Voila! Another case in the books.
“History is not a burden on the memory but an illumination of the soul.”
— Lord Acton
As a developer, this experience was fantastic. It was a fun, unique, mildly infuriating bug with no shortage of teaching potential. Here were my big takeaways:
- Critical bugs lurk beneath the surface, and often manifest themselves under the strangest of circumstances.
For us to finally see `sort` behave in a way we didn’t expect, we needed a dataset that used a specific time format and spanned over two very specific years, 1970-1972. I’ve found that software never ceases to amaze with the sheer multitude of ways in which it can break. In this regard I have a begrudging respect for it’s raw and unbridled creativity.
What’s next? Get Started With Chronograf!
Which now sorts numerical timestamps with aplomb!
If you’d like to try Chronograf, the ‘C’ in the TICK stack, you can download the latest version here. We love reading and appreciate any and all feedback, so let us know what you think. Whether it’s bug reports, feature requests, general experience improvements, or any other thoughts or ideas you have about the product, don’t hesitate to email the team: [email protected]