Citymaps — “Vietnam”: Tracking Down An Elusive Bug

Bob Matsuoka posted November 22, 2016

In our experience a serious bug — particularly one pushed in production code — tends to get found and solved quickly, and is in fact relatively straightforward to reproduce and isolate.  We encountered one recently, however, that did not fit this model, and it ended up causing us a few weeks of frustration as well as some work probably could have been avoided.

Background

About a month ago, we pushed a fairly major update to our (citymaps.com) web client (Node/Express/React/Babel).  We were pretty excited about this release — 3.33 — as it included support for Accelerated Mobile Pages (AMP).  We’d received a big Google crawl boost recently by including JSON+LD review tags in our pages, and so we were interested in seeing how Googlebot would respond to these new pages.

The Push

Our typical process is to push in the mornings (if possible) weekdays through Thursdays (Friday for emergency hotfixes only!).  We pushed the build on Thursday using our typical canary model, where the client gets pushed out to 1 of 3 production servers after it’s been QA’d on staging.  For big changes, we leave the canary configuration running for a days or so before rolling it out to the other servers.

Everything seemed good the rest of Thursday and Friday, so we pushed the build to the rest of the servers towards the end of the day on Friday.

“Vietnam”

Saturday evening, shortly after 8pm, Bob started receiving notifications from our monitors that our web response times were starting to slow, and shortly after that web client services were no longer effectively responding.  Datadog told us that the CPUs on those servers were basically pegged:

bug CPU spike

The bug manifests itself

Because this affected all of our web client servers (pages cached by our CDN continued to function), it was all hands on deck.  Several of us worked on troubleshooting the service online, but weren’t able to find anything in the logs.  The behavior of this bug was odd — the service had run without problems for several days, and then despite no noticeable external changes, had manifested itself on all three servers.  The servers themselves seemed fine – other services (API, search, etc) weren’t affected.

Pavel was first able to trigger the problem by hitting the Phú Phong city page, so we called this the “Vietnam” bug.

After an hour or so of fruitless troubleshooting and restarting, we decided to roll back to the previous version (3.31).  Volia!  The services came back and were stable.  We knocked off to regroup on Monday morning.

Regroup

This bug was very elusive.  Ben spent a number hours unsuccessfully trying to reproduce it on a dev server.  We eventually gave up trying to reproduce it, and decided to re-launch it in canary mode in order see if it would re-occur.  Sometimes bugs go away on their own right?

We ran the rest of the week with version 3.33 running on one of the servers.  Nothing.  Then Saturday night — boom!  So the bug hadn’t fixed itself.  We spent another week trying to track it down in code and reproduce it.  No luck.

The following weekend, we decided to just leave it running in Canary even after the problem occurred.  Sure enough, Saturday evening at 8pm it started again.  Then surprisingly, after 24 hours, the problem went away “on it’s own”:

bug canary mode

Bug Canary Mode

A Clue To The Bug

Someone observed that 8pm is the GMT date cutover, and the bug lasted 24 hours until the following day (Monday GMT).  Eureka!  But that’s where the problem stood for a few weeks.  We tried a few other variations, including running via docker with a more recent OS, and a newer version of Node, but nothing helped.  Each weekend we saw the same thing:

weekly CPU spikes

The CPU spikes were persistent and elusive

After a few weeks of seeing this pattern, and lots of ‘interesting’ theories as to it’s cause, we finally had enough of Pavel’s time to set up a server with flexible system time, and set it to 00:00 GMT.  The problem quickly manifested itself.  Reproducible!  And a short time after that, Ben had found the issue.

As it turns out, we display hours of operation on a business detail page:

venue detail

Citymaps Venue Detail Page

And the issue turned out to be a simple off-by-one error in our JS code that’s executed by Node/React to generate that part of the page.   This code only triggers on Sunday:



Basically, the first business detail page requested after Sunday GMT would lock up the server running that service in an infinite loop.  And once it became Monday, the loop “fixed” itself.  Problem solved.

Lessons Learned

Finding and fixing this bug definitely took longer than it should have, but it did give us a couple of valuable lessons we’ll put to practice:

  • Canary rollouts are good! We’ve never regretted extending the canary period longer, and letting it run over the weekend would have saved several of us from spending hours fixing our production site late on a Saturday night.
  • We should have “walked like a duck”.  Rather than spending time trying to bypass bug by upgrading OS and Node versions, we should have immediately gone to reproducing the conditions under which the bug manifested itself, namely the GMT week cutover.  If we didn’t have the resources to do this, we should have just waited until we did rather than try to solve the problem with simpler fixes.

Thankfully, this issue happened at a time of low dev activity on our own website, so it didn’t create any major problems.  And the lessons we learned will hopefully prevent us from being stuck in a similar quagmire in the future.

PostScript

The fix for this bug ran in Canary mode over the weekend, and no more CPU spike!  We gave it another day just to be sure,  and now the fix is live on our site.

Bob Matsuoka
Ben Gundersen
Pavel Shub

One response to “Citymaps — “Vietnam”: Tracking Down An Elusive Bug”

  1. Nevertrump says:

    Fabulous write up and summary of the situation. Thanks for sharing the lessons learned

Leave a Reply

Your email address will not be published. Required fields are marked *