
Develop infrastructure for MVP version of PUDL integration #57

Open
1 task done
zschira opened this issue Aug 1, 2024 · 0 comments

zschira commented Aug 1, 2024

Overview

We've decided that for the MVP version of the SEC data integration into PUDL, we will keep this codebase separate, and the extracted SEC data will feed into PUDL as a raw input. To make sure that this code remains maintainable and well tested, and that the data is updated as it becomes available, we will need a certain amount of infrastructure/automation development.

Components

Archival

I developed an archiver in the pudl-archiver repo, but it uses a fairly experimental GCS backend for storage that functions differently from our Zenodo-backed archivers. This worked for initially populating the cloud bucket with filings, but may need to be changed to enable regular updates. For example, Zenodo provides versioning, staging, and a testing environment, which all help to make the archival process safe and reproducible. On GCS we will probably need to come up with our own strategy for these features.
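
To make the versioning question concrete, here's a minimal sketch of one possible strategy: write each archive run under a dated version prefix in the bucket, so the raw archives keep an explicit lineage roughly analogous to Zenodo versions. The bucket name, prefix layout, and helper function below are hypothetical, not existing pudl-archiver code.

```python
"""Sketch of a versioned GCS upload strategy (hypothetical, not current archiver code)."""
from datetime import date
from pathlib import Path

from google.cloud import storage


def upload_versioned_archive(local_dir: Path, bucket_name: str = "example-sec10k-archive") -> str:
    """Upload everything under ``local_dir`` beneath a dated version prefix.

    Keeping each run under its own prefix gives an explicit lineage of raw
    archives, similar in spirit to Zenodo's versioned depositions.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    version = f"v{date.today():%Y.%m.%d}"

    for path in local_dir.rglob("*"):
        if path.is_file():
            blob = bucket.blob(f"sec10k/{version}/{path.relative_to(local_dir).as_posix()}")
            blob.upload_from_filename(str(path))
    return version
```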

As an aside, we somewhat regularly recreate basic archiving infrastructure in various client projects and smaller "side" projects like this one. If we made the archiver a library that could be used as a dependency in any repo, we could transition to shared tooling rather than always reinventing the wheel. This is not strictly necessary for the Mozilla work, since the SEC data will become a PUDL input, so it would make sense to keep the archiver in the pudl-archiver repo like all the existing ones (with a separate backend). But if we have time, this project could be a good place to add that functionality and demonstrate its use.

Tasks:

  • Decide how to handle archive updates
    • Do we need a staging environment to avoid publishing a bad archive?
    • How do we handle versioning? Do we want a clear lineage of changes to the raw archives?
  • Implement backend changes based on the chosen update strategy
  • Decide whether the SEC archiver should live in the pudl-archiver repo, or whether we should attempt to transition to a library
  • Set the archiver to run at least once per year

Extraction infrastructure

We've mostly been doing rapid prototyping in this repo and developing tooling along the way. We should start deciding what we want the final design/infrastructure of the SEC extraction to look like and start working towards it. For example, we may want to transition to using Dagster in this codebase to keep tooling and design patterns consistent with the rest of our work. We also need to decide how frequently the various components of the extraction need to be run, and how automated that process needs to be.
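
As a concrete illustration of what the Dagster transition could look like, here's a minimal sketch of wrapping the cloud interface in a Dagster resource. The class name, bucket name, and method are hypothetical assumptions, not code that exists in this repo yet.

```python
from dagster import ConfigurableResource
from google.cloud import storage


class GCSArchiveResource(ConfigurableResource):
    """Hypothetical Dagster resource wrapping access to the GCS filing archive."""

    bucket_name: str = "example-sec10k-archive"  # assumed bucket name

    def get_bucket(self) -> storage.Bucket:
        """Return a handle to the archive bucket."""
        return storage.Client().bucket(self.bucket_name)
```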

Tasks:

  • Configure the repo to access cloud resources as necessary #58
  • Should we use Dagster? If yes, we'll need to do the following (see the sketch after this list):
    • Turn the cloud interface into a Dagster resource
    • Create an asset for loading raw filings
    • Create an asset for the trained model that can be loaded from a cache
    • Create assets for basic 10-K/Exhibit 21 extraction
  • Decide how frequently each element needs to be refreshed and implement automation
    • How often should the model be retrained?
    • How often should we rerun the extraction?
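
Building on the resource sketch above, here's roughly what the raw-filing and cached-model assets might look like. The asset names, object prefixes, and cache path are hypothetical placeholders, and the model-handling logic is deliberately oversimplified.

```python
from pathlib import Path

from dagster import Definitions, asset

# Assumes the hypothetical GCSArchiveResource from the sketch above.


@asset
def raw_sec10k_filings(gcs_archive: GCSArchiveResource) -> list[str]:
    """List the raw 10-K filing objects currently in the archive bucket."""
    bucket = gcs_archive.get_bucket()
    return [blob.name for blob in bucket.list_blobs(prefix="sec10k/")]


@asset
def ex21_extraction_model(gcs_archive: GCSArchiveResource) -> Path:
    """Fetch the trained Exhibit 21 extraction model from a local cache.

    A real implementation would retrain (or fail loudly) when the cached
    artifact is missing or stale; that logic is omitted here.
    """
    cache_path = Path("model_cache/ex21_model.pt")  # assumed cache location
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        gcs_archive.get_bucket().blob("models/ex21_model.pt").download_to_filename(
            str(cache_path)
        )
    return cache_path


defs = Definitions(
    assets=[raw_sec10k_filings, ex21_extraction_model],
    resources={"gcs_archive": GCSArchiveResource()},
)
```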

Sub-issues

  1. zschira
@jdangerx jdangerx added the epic label Aug 8, 2024
@jdangerx jdangerx moved this from Backlog to In progress in Catalyst Megaproject Aug 8, 2024