Overview
We've decided that for the MVP version of SEC data integration into PUDL, we will keep this codebase separate, and extracted SEC data will feed into PUDL as a raw input. To keep this code maintainable and well tested, and to ensure the data is updated as it becomes available, we will need a certain level of infrastructure/automation development.
Components
Archival
I developed an archiver in the pudl-archiver repo, but it uses a fairly experimental GCS backend for storage that functions differently from our Zenodo-backed archivers. This worked for initially populating the cloud bucket with filings, but may need to change to enable regular updates. For example, Zenodo provides versioning, staging, and a testing environment, all of which help make the archival process safe and reproducible. On GCS we will probably need to come up with our own strategy for these features; one possible approach is sketched below.
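As a minimal sketch of what versioning could look like on GCS (assuming the `google-cloud-storage` client; the bucket name and object keys here are hypothetical), enabling object versioning would preserve prior generations of each filing whenever the archiver re-runs:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("sec10k-filings")  # hypothetical bucket name

# Enable object versioning so re-uploads preserve prior generations,
# giving a lineage of changes roughly analogous to Zenodo's versions.
bucket.versioning_enabled = True
bucket.patch()

# Re-uploading an existing object archives the old generation rather
# than destroying it.
blob = bucket.blob("filings/example-accession.txt")  # hypothetical key
blob.upload_from_filename("local/example-accession.txt")

# Audit the archive's lineage by listing all generations under a prefix.
for version in client.list_blobs(bucket, prefix="filings/", versions=True):
    print(version.name, version.generation, version.time_created)
```

This wouldn't give us Zenodo-style staging on its own; a draft prefix or separate staging bucket could fill that gap.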
As an aside, we somewhat regularly recreate basic archiving infrastructure in various client projects and smaller "side" projects like this one. If we made the archiver a library that could be used as a dependency in any repo, we could rely on shared tooling rather than reinventing the wheel each time. This is not necessary for the Mozilla work, since the SEC data will become a PUDL input, so it would make some sense to keep the archiver in the pudl-archiver repo like all existing ones (with a separate backend); but if we have time, this project could be a good place to add that functionality and demonstrate its use. A rough interface sketch is below.
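A hypothetical sketch of what a shared-library interface might look like, with the storage backend abstracted away. All names here are illustrative, not existing pudl-archiver APIs:

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ArchiveBackend(ABC):
    """Storage backend a shared archiver library could plug in."""

    @abstractmethod
    def stage_version(self) -> str:
        """Open a new draft version of the archive; return its ID."""

    @abstractmethod
    def upload(self, local_path: Path, remote_key: str) -> None:
        """Upload one file into the current draft version."""

    @abstractmethod
    def publish(self, version_id: str) -> None:
        """Promote the draft to the published archive."""


class ZenodoBackend(ArchiveBackend):
    """Existing Zenodo behavior (drafts, DOIs) behind the interface."""


class GCSBackend(ArchiveBackend):
    """GCS behavior, e.g. versioned objects or staging prefixes."""
```

With something like this, each project would depend on the library and implement only dataset-specific download logic.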
Tasks:
Decide how to handle archive updates
Do we need a staging environment to avoid publishing a bad archive?
How do we handle versioning? Do we want a clear lineage of changes to raw archives?
Implement backend changes based on update strategy
Decide if we want the SEC archiver to live in the pudl-archiver repo or to transition it to a library
Set the archiver to run at least once per year
Extraction infrastructure
We've been mostly doing rapid prototyping in this repo and developing tooling along the way. We should start deciding what we want the final design/infrastructure of the SEC extraction to look like and begin working towards it. For example, we may want to transition to using Dagster in this codebase to maintain consistent tooling and design patterns with the rest of our work. We also need to decide how frequently the various components of the extraction need to run, and how automated that process needs to be.
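As a rough illustration of what a Dagster layout could give us (the asset names and yearly cadence are assumptions, not settled design), the extraction could be modeled as assets wired to a schedule:

```python
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job


@asset
def raw_sec10k_filings():
    """Pull newly archived filings from the GCS bucket (hypothetical)."""


@asset
def extracted_sec10k_data(raw_sec10k_filings):
    """Run the extraction over the raw filings (hypothetical)."""


extraction_job = define_asset_job("sec_extraction", selection="*")

defs = Definitions(
    assets=[raw_sec10k_filings, extracted_sec10k_data],
    jobs=[extraction_job],
    # Run yearly, after the archiver refreshes; the cadence is an open question.
    schedules=[ScheduleDefinition(job=extraction_job, cron_schedule="0 0 2 1 *")],
)
```

This would also make the "how automated" question concrete: anything we leave out of the schedule stays a manual, ad hoc run.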