Commit
Refactor import script deduplication logic
Previously the import script only attempted to deduplicate admin2 records that were reported twice (sometimes with slight variations on the county name). However, there are also duplicate records at the province_state level. For example, the report for 22 March 2020 contains "District of Columbia" twice:

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/03-22-2020.csv

This commit reuses the new matching code to decide what is a duplicate and what isn't. It deduplicates cases like this one, where the numbers in the duplicate rows are sometimes the same and sometimes zeroed.

It also refactors the code so that we keep track of all these data quality issues as we go and summarise them after each import.
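The approach described above can be sketched roughly as follows. This is a minimal illustration only, not the code from this commit: `Record`, `ImportIssues`, `location_key` and `deduplicate` are hypothetical names, the normalisation in `location_key` merely stands in for the real matching code, and the rule of keeping the row with the largest counts is an assumption about how zeroed duplicates might be resolved.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    country_region: str
    province_state: str
    admin2: str  # empty string for province_state-level rows
    confirmed: int
    deaths: int


@dataclass
class ImportIssues:
    """Collects data quality issues during one import for a final summary."""
    duplicates: list = field(default_factory=list)

    def summary(self) -> str:
        return f"{len(self.duplicates)} duplicate record(s) dropped"


def location_key(record: Record) -> tuple:
    # Hypothetical stand-in for the real matching code: compare names
    # case-insensitively and ignore surrounding whitespace.
    return tuple(
        part.strip().lower()
        for part in (record.country_region, record.province_state, record.admin2)
    )


def deduplicate(records, issues: ImportIssues) -> list:
    # Group rows that refer to the same location, at any level.
    groups = defaultdict(list)
    for record in records:
        groups[location_key(record)].append(record)

    kept = []
    for group in groups.values():
        # Assumption: prefer the row with the largest counts, so a zeroed
        # duplicate never replaces a populated one.
        best = max(group, key=lambda r: (r.confirmed, r.deaths))
        kept.append(best)
        dropped = list(group)
        dropped.remove(best)  # removes a single occurrence only
        issues.duplicates.extend(dropped)
    return kept
```

Passing one `ImportIssues` instance through the whole import and calling `summary()` at the end mirrors the "keep track as we go, summarise after each import" idea, under the assumptions stated above.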
Showing 3 changed files with 184 additions and 77 deletions.