This repo contains several utilities for wrangling COVID-19 data from the John Hopkins University COVID-19 repository.
- A working instance of TimescaleDB in PostgreSQL v10+
- csvkit
- git
- Unix/Linux operating system with bash
- for OpenRefine time-series automation
- OpenRefine - installed automatically
- openrefine-batch - included
- a Geocode.earth API key - install or free trial. Used to enrich geographic data.
- Docker with docker-compose in case you want to run it in containers (optional)
A note on cloning this repo, since the COVID19 directory is a git submodule:
- after cloning, you must initiate the submodule. In the top level directory for the project, run
git submodule init
andgit submodule update
to clone the JHU Repo as a submodule
The files in this directory and how they're used:
covid-19_ingest.sh
: script that converts the JHU COVID-19 daily-report data to a time-series database using TimescaleDB.covid-refine
: OpenRefine automation script that converts JHU COVID-19 time-series data into a normalized, enriched format and uploads it to TimescaleDB.schema.sql
: Data definition (DDL) to create the necessary tables & hypertables.environment
: Default environment values used in Docker containers.
- Create a TimescaleDB instance - download or signup
- Create a database named
covid_19
, and an application usercovid_19_user
psql
create database covid_19;
create user covid_19_user WITH PASSWORD 'your-password-here';
alter database covid_19 OWNER TO covid_19_user;
\quit
-
Run
schema.sql
as thecovid_19_user
. VACUUM/ANALYZE require owner privspsql -U covid_19_user -h <the.server.hostname> -f schema.sql covid_19
-
Install csvkit
- Ubuntu:
sudo apt-get install csvkit
- MacOS: Using homebrew run
brew install csvkit
- Ubuntu:
-
Using a text editor, replace the environment variables for
PGHOST
,PGUSER
andPGPASSWORD
incovid-19_ingest.sh
-
Run the script
bash covid-19_ingest.sh
-
(OPTIONAL) add shell script to crontab to run daily
-
Be able to slice-and-dice the data using the full power of PostgreSQL along with Timescale's time-series capabilities!
See the detailed README.
- Remember initiate the submodule, run
git submodule init
- Run
docker-compose build
- Run
docker-compose up
- That's all. You can go to Swagger or PostgREST
- the JHU COVID-19 repository is a git submodule. This was done to automate getting the latest data from their repo.
- the script will only work in *nix environment (Linux, Unix, MacOS)
- both scripts maintain a hidden directory called
~/.covid-19
in your home directory. -covid-19_ingest.sh
checkslastcsvprocessed
. Delete that file to process all daily-report files from the beginning, or change the date in the file to start processing files AFTER the entered date.
- use postgREST to add a REST API in front of TimescaleDB database
- create a Grafana dashboard
- create a Carto visualization
- create a Superset visualization
- thanks to Avtar Sewrathan (@avthars), Prashant Sridharan (@CoolAssPuppy) and Mike Freedman (@michaelfreedman) at Timescale for their help & support to implement this project from idea to implementation in 5 days!
- thanks to Julian Simioni (@orangejulius) at Geocode.earth for allowing us to use the Geocode.earth API!
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.