How does a data-driven investigation come together? I spoke about the process at the virtual Government & Public Sector R Conference in December 2021. I titled my talk “data or it didn’t happen” because, as you might suspect, without the data necessary to back up your investigative findings, you don’t have any findings. But where do you look to get started? I look for what I call an “imprint of reality” left in data by the phenomenon that I’m researching. My mindset is that if it’s real and it exists, then there must be a way to measure it, similar to how physicists looking to confirm the existence of theoretical particles devise clever experiments to detect their presence.
I talked about some examples of how I’ve made use of this mindset over the years. One was my 2016 investigation of dodgy stock loans devised by banks to help large institutional investors avoid paying dividend withholding taxes. The idea behind the trades was to get shares of German companies off the books of investors subject to such taxes and to park them temporarily with dividend tax-exempt investors when the companies paid dividends, then reverse the trades afterwards and split the 15% tax saved on the dividend payment. I had documents that detailed how the loans were structured, but to show that it was a widespread phenomenon, I needed data. When I found it, the pattern of stock loans surging around dividend payment time was so unmistakable that we called the chart “Tax Avoidance Has a Heartbeat.” (It’s still one of my favorite data visuals I’ve had a chance to work on.)
Another, more recent example was my 2021 examination of unemployment insurance fraud. Fake unemployment insurance claims proliferated during the pandemic after Congress authorized a temporary boost to jobless aid with lax identity verification requirements. There had already been warnings from law enforcement and the Department of Labor’s Office of Inspector General about a surge in fake claims, along with news stories about victims of such fraud. I wanted to see if this phenomenon had left an imprint in the unemployment insurance claim data reported by states. During my talk, I walked through my R analysis of the data, which I used to spot anomalies that might suggest a surge in fake claims. I found that in five states, the initial jobless claims outnumbered the entire pool of civilian workers, which clearly didn’t make sense. And in state after state, the volume of initial jobless claims far exceeded the number of estimated job losses.
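To give a rough sense of what that kind of check can look like, here is a minimal base R sketch. It is not the analysis from the talk: the state labels, the column names and every number in it are made up for illustration, and the `flag_anomalies()` helper is hypothetical. It simply compares initial claims against the civilian workforce and against estimated job losses, and keeps the states where claims exceed either one.

```r
# Hypothetical toy data: these states and figures are invented for
# illustration only and do not come from any real dataset.
claims <- data.frame(
  state          = c("A", "B", "C"),
  initial_claims = c(120000, 450000, 80000),
  labor_force    = c(100000, 900000, 400000),
  est_job_losses = c(30000, 200000, 90000)
)

# Ratio of initial claims to the civilian workforce and to estimated job losses.
claims$claims_to_labor_force <- claims$initial_claims / claims$labor_force
claims$claims_to_job_losses  <- claims$initial_claims / claims$est_job_losses

# Hypothetical helper: keep states where claims exceed the whole workforce
# or exceed the estimated number of jobs lost.
flag_anomalies <- function(df) {
  df[df$claims_to_labor_force > 1 | df$claims_to_job_losses > 1, ]
}

anomalies <- flag_anomalies(claims)
print(anomalies[, c("state", "claims_to_labor_force", "claims_to_job_losses")])
```

A ratio above 1 in either column doesn’t prove fraud on its own; it only flags a state where the numbers don’t add up and the reporting should dig deeper.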
Those are just two examples; there are countless other situations where having a data-driven mindset can help you quantify your findings and back up your reporting. And R makes the process much easier, which is why I carve out a significant chunk of my data journalism classes at Hong Kong University to teach students how to embrace R and make it an essential tool in their toolkit.
It was an honor to speak at the R Gov conference alongside so many other talented data scientists. Thanks to Jared Lander for inviting me, and for writing what I still think is one of the best R books out there, R for Everyone.