Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Hypercane from MongoDB to PostgreSQL for storage and caching #65

Open
shawnmjones opened this issue Feb 3, 2022 · 0 comments
Open
Assignees
Labels
enhancement New feature or request

Comments

@shawnmjones
Copy link
Member

shawnmjones commented Feb 3, 2022

Hypercane uses MongoDB for caching memento content, headers, and derived data. It also uses PostgreSQL as part of its Web User Interface (WUI). Rather than having to install/maintain multiple databases for different purposes, we want to move Hypercane to PostgreSQL for the following reasons.

  1. MongoDB does not install "easily" for some users. I installed it on macOS with homebrew but reinstalled it after issues. On Ubuntu/RHEL, the admin needs to add a third-party yum/apt repository install it. Almost every distro includes PostgreSQL.
  2. I've had issues saving MongoDB data and restoring its data across versions and systems. Sometimes the BSON is corrupted. I'm sure I was supposed to do something on one end or another, but it seems like SQL databases have an easier time with this.
  3. As Hypercane has matured, I've saved more derived data in the database. The ability to query this with SQL is becoming more and more attractive over time. Such standardization may provide third-party tools with another interface for easy analysis.
  4. A point in favor of MongoDB is that I can shove any data we want into a record and not worry about creating standardized fields. We could achieve something similar with planned foreign keys and relations in SQL at the expense of planning time and schema changes. The truth is that function calls in the code have to correspond to database actions; hence, we will write some queries either way. Moving to PostgreSQL will require that we change the schema for each derived value that we want to store.
  5. For space reasons, a user may want to clear out the memento content and keep the derived data. With MongoDB, we have to save the parts we like, get rid of the whole record, and create a new record. With SQL and a decently designed schema, we can delete the records from the table storing the content.
  6. Another point in favor of MongoDB is its ability to expire records, which we do not currently use, but should. PostgreSQL does not natively support this as far as I can tell, but I can achieve something similar with triggers.
  7. MongoDB has a BSON size limit of 16MB unless I switch to GridFS. PostgreSQL has a maximum field size of 1GB. Currently, Hypercane discards anything over 16MB, which means that some images and other binary files are skipped rather than processed.
  8. Some claim that MongoDB is faster than PostgreSQL, but some studies show that PostgreSQL has caught up. (We need to add these links.) Performance depends on indexing, table structure, and the queries used in the test. We can likely get comparable performance with good database creation scripts.
  9. Some web archiving folks had suggested instead storing the data as WARC/WAT/etc. and maintaining a CDX. This was a good suggestion when we were only caching content, but it does not work as well for querying the derived data. If we store derived data in the CDX, it becomes a table.🙂
  10. The choice of MongoDB came from needing to handle concurrent writes. Writing a single WARC for each memento creates many files and addresses concurrency. Creating a CDX afterward must be timed well. Alternatives, like SQLite, don't handle concurrent writes well either. Database engines, like PostgreSQL or MongoDB, manage this with their own caching, checkpointing, and optimization.
  11. Thanks to the pilot, we have a better idea of the type of data we should store in the database, meaning that we have a better data model moving forward.

With all of this in mind, I will be using this issue to document ER diagrams and other insights as I experiment with this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant