A Data Scientist’s Guide to Fetching COVID-19 Data in 2022

Sometimes the best way to get something done is to do it yourself.

A Data Scientist’s Guide to Fetching COVID-19 Data in 2022
A custom Grafana dashboard for displaying counties’ data. (image by author)

During the early days of the pandemic, it was a popular exercise for data scientists and analysts to build COVID-tracking dashboards. A number of public APIs were spun up, but one by one these services are being taken offline. As of December 2021, the New York Times GitHub repository remains as one of the last high-quality COVID data sources, but even this source is not comprehensive and can be awkward to consume into your projects (it’s distributed through CSV files).

Despite aggregators ceasing to publish curated data, states’ health departments continue to diligently report counties’ COVID statistics. For example, The Weather Channel credits each state’s health department as its data source when displaying present COVID information.

Weather.com pulls its data directly from the states.
Weather.com pulls its data directly from the states. (image by author)

Rather than relying on middlemen like the New York Times, you can pretty easily write your own scripts to fetch directly from the states. In this article, I’ll show you how to quickly ingest counties’ COVID data.


Motivation

A few weeks ago, a client approached me for advice on how to automatically track COVID-19 data for specific locations. For two years, he had been manually searching and entering data into a large spreadsheet, and this process took considerable time away from his actual job.

I promised I would look into our options, and very quickly I realized most of the popular APIs had already been shut down. I began with scraping from the New York Times, though several of the counties we needed were not present in the data set. I then considered scraping The Weather Channel, but then it dawned on me:

Why can’t I just fetch data directly from the states? After all, it’s what aggregators like NYT are doing anyway.

So after a few hours of research, I wrote some plugins and threw together a Grafana dashboard (pictured above) to visualize my findings. My client was definitely pleased, so I decided to write up my findings in this article.

Fetching the Data

To get data into Pandas, let’s use the built-in SQLite database sql:local to store our data. You can always use your own database, e.g. if you want to build your own Grafana dashboards.

I’ve published my extractor scripts as the covid Meerschaum plugin ― a Python module format for extracting data.

The plugin lets you ingest COVID data by county. (image by author)

I provide usage information on the plugin’s GitHub repository, and below I go into more depth about how I decided to fetch the data. To get started, first install Meerschaum from PyPI and install the covid plugin.$ pip install -U --user meerschaum
$ mrsm install plugin covid

Register a pipe with the connector plugin:covid and on the instance sql:local, and when prompted, enter a list of FIPS codes corresponding to your counties.$ mrsm register pipe -c plugin:covid -m cases -i sql:local❓ Please enter a list of FIPS codes separated by commas: 08031,48113,45043,45045,45007,37107,37021,47157,47147

Now that we’ve set up our pipes, let’s get that data flowing. Run the sync pipes command to fetch data ― the first time may take a minute or so.mrsm sync pipes -i sql:local

Syncing nine counties’ data from Colorado, Texas, and the New York Times. (image by author)

That’s it! You can sync continuously with --loop and --min-seconds, and the -d or --daemon flag runs the command as a background job.mrsm sync pipes -i sql:local --loop --min-seconds 3600 -d

Tap into the Pipe

Here’s a snippet to demonstrate ways you can access your pipe’s data:

How It Works

This is the part of the article where I pull back the curtain and give out my secrets. The covid plugin itself only executes other plugins and combines their results; for example, the US-covid plugin fetches from the New York Times:

Fetching data from specific states varies in difficulty ― for example, California provides a CSV, so parsing is similar to the NYT:

But other states are not so straightforward. The TX-covid plugin downloads and parses an XLSX spreadsheet that contains comments and adds a new column for each day. Parsing this esoteric format was a huge headache, which demonstrates the reason why Meerschaum’s plugin system is so powerful.

Raw Texas COVID data in a spreadsheet
Texas distributes COVID data in a strangely formatted spreadsheet. (image by author)

Like California, Georgia distributes data in CSVs, but you first need to extract a ZIP file to get the data sets you’re looking for. Here’s my solution from the GA-covid plugin:

Finally, Colorado exposes a RESTful API, which feels much less jank than the other states (although the API does not seem to be entirely functional, so it is not without its share of jankiness). Below is an excerpt from the CO-covid plugin:

Conclusion

Meerschaum is already very modular, and the covid plugin works in a similar way. If you would like to contribute, you can write and publish a plugin for your state and open a PR in the covid plugin repository.

You may have noticed my use of duckdb when writing plugins. I mostly think in terms of SQL, and DuckDB beautifully bridges the gap between SQL and Pandas. To use third-party libraries in Meerschaum plugins, add the dependencies to a global required list (similar to requirements.txt), and requirements which begin with plugin: are treated as Meerschaum plugins.

I hope this article has given some insight into how to fetch COVID-19 data. There’s no telling how much longer states will maintain their COVID reporting, but going into 2022, these plugins can help power your projects, just like they did for my client.