
Develop infrastructure for MVP version of PUDL integration #57

Open
1 task done
zschira opened this issue Aug 1, 2024 · 0 comments

zschira commented Aug 1, 2024

Overview

We've decided that for the MVP version of the SEC data integration into PUDL, we will keep this codebase separate, and the extracted SEC data will feed into PUDL as a raw input. To make sure that this code remains maintainable and well tested, and that the data is updated as it becomes available, we will need a certain amount of infrastructure/automation development.

Components

Archival

I developed an archiver in the pudl-archiver repo, but it uses a fairly experimental GCS backend for storage that functions differently from our Zenodo-backed archivers. This worked for initially populating the cloud bucket with filings, but may need to be changed to enable regular updates. For example, Zenodo provides versioning, staging, and a testing environment, which all help to make the archival process safe and reproducible. On GCS we will probably need to come up with our own strategy for these features.
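
To make the versioning question concrete, here's a minimal sketch of one possible strategy: write each archive run under a dated version prefix in the bucket, so the raw archives keep an explicit lineage roughly analogous to Zenodo versions. The bucket name, prefix layout, and helper function below are hypothetical, not existing pudl-archiver code.

```python
"""Sketch of a versioned GCS upload strategy (hypothetical, not current archiver code)."""
from datetime import date
from pathlib import Path

from google.cloud import storage


def upload_versioned_archive(local_dir: Path, bucket_name: str = "example-sec10k-archive") -> str:
    """Upload everything under ``local_dir`` beneath a dated version prefix.

    Keeping each run under its own prefix gives an explicit lineage of raw
    archives, similar in spirit to Zenodo's versioned depositions.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    version = f"v{date.today():%Y.%m.%d}"

    for path in local_dir.rglob("*"):
        if path.is_file():
            blob = bucket.blob(f"sec10k/{version}/{path.relative_to(local_dir).as_posix()}")
            blob.upload_from_filename(str(path))
    return version
```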

As an aside, we somewhat regularly recreate basic archiving infrastructure in various client projects and smaller "side" projects like this one. If we made the archiver a library that could be used as a dependency in any repo, we could transition to shared tooling rather than always reinventing the wheel. This is not strictly necessary for the Mozilla work, since the SEC data will become a PUDL input, so it would make sense to keep the archiver in the pudl-archiver repo like all the existing ones (with a separate backend). But if we have time, this project could be a good place to add that functionality and demonstrate its use.

Tasks:

  • Decide how to handle archive updates
    • Do we need a staging environment to avoid publishing a bad archive?
    • How do we handle versioning? Do we want a clear lineage of changes to the raw archives?
  • Implement backend changes based on the chosen update strategy
  • Decide whether the SEC archiver should live in the pudl-archiver repo, or whether we should attempt to transition to a library
  • Set the archiver to run at least once per year

Extraction infrastructure

We've mostly been doing rapid prototyping in this repo and developing tooling along the way. We should start deciding what we want the final design/infrastructure of the SEC extraction to look like and start working towards it. For example, we may want to transition to using Dagster in this codebase to keep tooling and design patterns consistent with the rest of our work. We also need to decide how frequently the various components of the extraction need to be run, and how automated that process needs to be.
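
As a concrete illustration of what the Dagster transition could look like, here's a minimal sketch of wrapping the cloud interface in a Dagster resource. The class name, bucket name, and method are hypothetical assumptions, not code that exists in this repo yet.

```python
from dagster import ConfigurableResource
from google.cloud import storage


class GCSArchiveResource(ConfigurableResource):
    """Hypothetical Dagster resource wrapping access to the GCS filing archive."""

    bucket_name: str = "example-sec10k-archive"  # assumed bucket name

    def get_bucket(self) -> storage.Bucket:
        """Return a handle to the archive bucket."""
        return storage.Client().bucket(self.bucket_name)
```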

Tasks:

  • Configure the repo to access cloud resources as necessary #58
  • Should we use Dagster? If yes, we'll need to do the following (see the sketch after this list):
    • Turn the cloud interface into a Dagster resource
    • Create an asset for loading raw filings
    • Create an asset for the trained model that can be loaded from a cache
    • Create assets for basic 10-K/Exhibit 21 extraction
  • Decide how frequently each element needs to be refreshed and implement automation
    • How often should the model be retrained?
    • How often should we rerun the extraction?
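
Building on the resource sketch above, here's roughly what the raw-filing and cached-model assets might look like. The asset names, object prefixes, and cache path are hypothetical placeholders, and the model-handling logic is deliberately oversimplified.

```python
from pathlib import Path

from dagster import Definitions, asset

# Assumes the hypothetical GCSArchiveResource from the sketch above.


@asset
def raw_sec10k_filings(gcs_archive: GCSArchiveResource) -> list[str]:
    """List the raw 10-K filing objects currently in the archive bucket."""
    bucket = gcs_archive.get_bucket()
    return [blob.name for blob in bucket.list_blobs(prefix="sec10k/")]


@asset
def ex21_extraction_model(gcs_archive: GCSArchiveResource) -> Path:
    """Fetch the trained Exhibit 21 extraction model from a local cache.

    A real implementation would retrain (or fail loudly) when the cached
    artifact is missing or stale; that logic is omitted here.
    """
    cache_path = Path("model_cache/ex21_model.pt")  # assumed cache location
    if not cache_path.exists():
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        gcs_archive.get_bucket().blob("models/ex21_model.pt").download_to_filename(
            str(cache_path)
        )
    return cache_path


defs = Definitions(
    assets=[raw_sec10k_filings, ex21_extraction_model],
    resources={"gcs_archive": GCSArchiveResource()},
)
```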

Sub-issues

  1. zschira
@jdangerx jdangerx added the epic label Aug 8, 2024
@jdangerx jdangerx moved this from Backlog to In progress in Catalyst Megaproject Aug 8, 2024