Taking notes while debugging and The Fifteen Minute Rule™

I recently had an issue at work that seemed simple but took me two days to solve: the historical values in a Looker dashboard had changed after we ran some migrations on our data warehouse.

Our Looker instance works off of a waterfall of derived tables, and this dashboard was part of that waterfall. The particular Look that was causing problems pulled data from a table called revenue_retention_and_contraction, which in turn pulled from several antecedent tables. I took the pre-migration and post-migration versions of revenue_retention_and_contraction's logic and changed table references in the post-migration version to their pre-migration counterparts until the two results aligned. That would tell me which antecedent table was the problem, and then I could move on to that table and repeat the process.
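The mechanics of that swap-and-compare loop are easy enough to script. Here is a rough sketch of the idea; the dataset names, the antecedent table names, the revenue column, and the SQL file are all placeholders rather than our real schema:

```python
# Hedged sketch: isolate the problematic antecedent of a derived table by
# swapping post-migration table references back to their pre-migration
# versions one at a time and comparing a summary aggregate after each swap.
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical antecedent tables of the derived table under investigation.
ANTECEDENTS = [
    "some_antecedent_table_a",
    "some_antecedent_table_b",
    "some_antecedent_table_c",
]

def total_revenue(sql: str) -> float:
    """Wrap the derived-table SQL and reduce its output to one number."""
    wrapped = f"SELECT SUM(revenue) AS total FROM ({sql})"
    return list(client.query(wrapped).result())[0]["total"]

# Hypothetical file holding the post-migration version of the derived table.
with open("revenue_retention_and_contraction_post.sql") as f:
    post_sql = f.read()

baseline = total_revenue(post_sql)

# Swap one antecedent at a time back to its pre-migration dataset and note
# which swap moves the result; that antecedent becomes the next suspect.
for table in ANTECEDENTS:
    patched = post_sql.replace(f"post_migration.{table}",
                               f"pre_migration.{table}")
    print(table, total_revenue(patched) - baseline)
```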

This sounds simple, and it likely should have been, but I completely failed to keep track of my steps as I worked. It’s tremendously tempting to just jump in and start hacking away when a problem with real business consequences arises, and sometimes that is the right call – problems that can be resolved in fifteen minutes or so hardly need notes.

Past the fifteen-minute barrier, though, one’s connection to reality starts to become tenuous. I couldn’t remember what I had and hadn’t tried, and I couldn’t keep my dozen BigQuery tabs straight. I felt crushed after my first day of troubleshooting.

The next morning, I opened Notes on my Mac and laid out the situation. I wrote down each piece of knowledge I had developed, even pasting in hyperlinks to the queries I had run to establish it. It was a simple exercise, but it focused me on the one thing that ended up mattering: two missing columns in an obscure (and gigantic) table called netsuite_transactions__base_with_prepay_spread. The row counts were identical in the pre- and post-migration tables, but when I used a Jupyter notebook to pull samples from each through the BigQuery API, I noticed that the pre-migration version had two more columns. Those columns, it turned out, produced a small fan-out that affected our aggregates, and that was the cause of our problems.
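The check itself is only a few lines against the BigQuery API. This sketch compares the table metadata directly instead of eyeballing samples the way I did; the dataset names are placeholders, and only the table name comes from the story above:

```python
# Hedged sketch: compare row counts and column sets of the pre- and
# post-migration copies of a table using the BigQuery client.
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset names; the table name is the one from the post.
pre = client.get_table("pre_migration.netsuite_transactions__base_with_prepay_spread")
post = client.get_table("post_migration.netsuite_transactions__base_with_prepay_spread")

print("row counts:", pre.num_rows, post.num_rows)

pre_cols = {field.name for field in pre.schema}
post_cols = {field.name for field in post.schema}
print("columns only in pre-migration:", pre_cols - post_cols)
print("columns only in post-migration:", post_cols - pre_cols)
```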

From now on, I’m operating on a Fifteen Minute Rule: if I can’t fix an issue in fifteen minutes, I’ll open up a fresh Note and jot down what I have learned so far. Even if it is a problem that I might be able to solve in twenty minutes, reiterating to myself what I have learned is invaluable for understanding the context of a problem and knowing for sure that I have quashed our bug.