
Migration

Adam Wead edited this page May 5, 2021 · 67 revisions

Migration from Scholarsphere, version 3, to version 4.

Overview

Drawing on our previous experience with migrating to Fedora 4, and with other migrations, the basic strategy is:

  • perform an inventory of works and collections in Scholarsphere 3
  • migrate data by pushing from the Scholarsphere 3 application into Scholarsphere 4, most likely via an API
  • check the inventory against the content in Scholarsphere 4 to verify all content has been transferred
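The final verification step amounts to a set difference between the two inventories. A minimal Ruby sketch, assuming both sides can be reduced to plain lists of identifiers; the `missing_from_target` helper and the sample ids are hypothetical and not part of either application:

```ruby
require 'set'

# Hypothetical helper: returns the ids present in the version 3 inventory
# that have no counterpart among the ids reported by version 4.
def missing_from_target(inventory_ids, migrated_ids)
  (inventory_ids.to_set - migrated_ids.to_set).to_a.sort
end

# Made-up identifiers, for illustration only.
inventory = %w[work-1 work-2 collection-1]
migrated  = %w[work-1 collection-1]

missing = missing_from_target(inventory, migrated)
puts "#{missing.size} of #{inventory.size} resources still to migrate: #{missing.join(', ')}"
```

An empty result means every inventoried resource has a counterpart in version 4.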

Testing

Local testing can be done by running instances of both SS v.3 and SS v.4 locally and creating sample records in one to be migrated over to the other. Once the initial steps are complete, and migration functions in a local context, we can move into the QA environment and use duplicated content from production. The procedures developed in the QA environment will then be replicated in Scholarsphere production, with version 4 resources that will ultimately become the version 4 production instance.

Phase 1: Before November 16th

Update Scholarsphere Client Config

Log in to the production jobs server and update the scholarsphere-client.yml file.

vim /scholarsphere/config_prod_new/scholarsphere/scholarsphere-client.yml

Verify the current default settings:

SHRINE_CACHE_PREFIX:   "cache"
S3_ENDPOINT:           "https://s3.amazonaws.com"
SS_CLIENT_SSL:         "false"
SS4_ENDPOINT:          "https://scholarsphere.k8s.libraries.psu.edu/api/v1"
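If checking these by eye is error-prone, the comparison could be scripted. The sketch below assumes the file is a flat YAML mapping of string keys to string values, which may not match the real layout; `EXPECTED_DEFAULTS` and `config_mismatches` are hypothetical names, and the sample text stands in for the contents of scholarsphere-client.yml:

```ruby
require 'yaml'

# The default values listed above.
EXPECTED_DEFAULTS = {
  'SHRINE_CACHE_PREFIX' => 'cache',
  'S3_ENDPOINT' => 'https://s3.amazonaws.com',
  'SS_CLIENT_SSL' => 'false',
  'SS4_ENDPOINT' => 'https://scholarsphere.k8s.libraries.psu.edu/api/v1'
}.freeze

# Hypothetical helper: returns the keys whose values differ from the defaults.
def config_mismatches(config, expected = EXPECTED_DEFAULTS)
  expected.reject { |key, value| config[key] == value }.keys
end

# Stand-in for the file contents (assumed flat layout).
sample = YAML.safe_load(<<~CONFIG)
  SHRINE_CACHE_PREFIX:   "cache"
  S3_ENDPOINT:           "https://s3.amazonaws.com"
  SS_CLIENT_SSL:         "false"
  SS4_ENDPOINT:          "https://scholarsphere.k8s.libraries.psu.edu/api/v1"
CONFIG

mismatches = config_mismatches(sample)
puts mismatches.empty? ? 'All defaults match' : "Check these keys: #{mismatches.join(', ')}"
```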

From another terminal session, log in to the current production version 4 deployment:

KUBECONFIG=~/.kube/config-oidc-prod && export KUBECONFIG
kubens scholarsphere
kubectl exec -it deployment/scholarsphere -- /app/bin/vaultshell

From the pod's shell, start a Rails console session

bundle exec rails c

Print out the AWS environment variables that correspond to the configuration settings in the client yaml file:

ENV['AWS_REGION']
ENV['AWS_SECRET_ACCESS_KEY']
ENV['AWS_ACCESS_KEY_ID']
ENV['AWS_BUCKET']

Verify that the SS_CLIENT_KEY in the yaml file matches the token in the version 4 application. If they do not match, update the application's token with the one from the yaml file. Do not change the client key in the yaml file.

To update the application's token, from the Rails console of the version 4 pod:

token = [paste client key from yaml file]
ExternalApp.find_by(name: 'Scholarsphere 4 Migration').api_tokens[0].update(token: token)

Write out any changes to scholarsphere-client.yml. If the file has changed, you will need to restart Resque; if not, you can proceed to deployment.

To restart the Resque process on the jobs server, first verify there are no currently running jobs in any queue: https://scholarsphere.psu.edu/admin/queues/overview

Exit out of any current Rails console sessions on the server, and get root access to restart the process:

sudo su -
systemctl restart resque

Verify the processes have successfully restarted:

ps -aux | grep resque

You should see multiple processes, one for each worker configured in every queue.

If needed, tag and deploy the latest code from the main branch of the Scholarsphere 4 repo by following Production Deployment. Use the SHA of the commit or use a 4.0.0.betaN where N is any number greater than zero.

Run any tests, including test migrations, by following the steps below under Migration.

Phase 2: On November 16th

Deploy 4.0.0

During normal business hours, tag and deploy version 4.0.0 to production according to Production Deployment.

Perform some minimal tests, including migrating a few sample records to ensure everything is operating correctly.

Now wait until 7 pm that evening

Prepare Version 4

Open a new terminal window or tmux session and obtain a list of production pods:

KUBECONFIG=~/.kube/config-oidc-prod && export KUBECONFIG
kubens scholarsphere
kubectl get pods

Log in to the Rails pod:

kubectl exec -it deployment/scholarsphere -- /app/bin/vaultshell

Remove any database data from any previous tests. This will completely delete all data from the database.

DISABLE_DATABASE_ENVIRONMENT_CHECK=1 bundle exec rake db:schema:load

Re-seed the database with the required groups

bundle exec rake db:seed

Start a new Rails console session:

bundle exec rails c

Verify the groups were created

Group.all

From the console, remove any Solr index data:

Blacklight.default_index.connection.delete_by_query('*:*')
Blacklight.default_index.connection.commit

Verify there are no records

Blacklight.default_index.connection.get 'select', :params => {:q => '*:*'}

Exit both the Rails console and the pod shell, returning to your local shell, and restart the application by deleting both of the running Rails pods:

kubectl delete po -l app.kubernetes.io/name=scholarsphere

Verify that the pods have been restarted:

kubectl get pods

You should see two new pods in the list with a recent value for AGE.

Log back into the newly restarted Rails pod:

kubectl exec -it deployment/scholarsphere -- /app/bin/vaultshell

Create a new Rails console session:

bundle exec rails c

Leave the Rails session open.

Remove Data from S3

  1. Log in to Penn State's AWS portal: http://login.aws.psu.edu/
  2. Go to Storage > S3
  3. Select the edu.psu.libraries.scholarsphere.prod bucket with the radio button
  4. Choose "Empty"

Make sure the empty operation completes before adding data to the bucket. Click into the bucket and make sure it is empty, then click "show versions" and make sure the versions are gone, too.

Prepare Version 3

Verify there are no currently running jobs in any queue https://scholarsphere.psu.edu/admin/queues/overview

Put Scholarsphere 3 into "read-only" mode: https://github.com/psu-stewardship/scholarsphere/wiki/Read-Only-Mode

Log in to the psu-access terminal, and begin a new tmux session to log in to the Scholarsphere 3 production jobs server.

Check the swap status. If there isn't the full amount, reset it:

sudo su -
swapoff -a
swapon -a

Start a new Rails console session on the jobs server:

sudo su - deploy
cd scholarsphere/current
bundle exec rails c production

Verify the Scholarsphere client configuration:

ENV['S3_ENDPOINT']
ENV['AWS_REGION']
ENV['SS4_ENDPOINT']
ENV['AWS_SECRET_ACCESS_KEY']
ENV['AWS_ACCESS_KEY_ID']
ENV['AWS_BUCKET']

Take note of the client key:

ENV['SS_CLIENT_KEY']

Return to the Rails console for the version 4 instance

Create an external application record using the token you copied from the version 3 console:

token = [paste token here]

ExternalApp.find_or_create_by(name: 'Scholarsphere 4 Migration') do |app|
  app.api_tokens.build(token: token)
  app.contact_email = '[email protected]'
end

Return to the Rails console for the version 3 instance, and update the list of works and collections to be migrated. Note: File sets are also included in this list even though their files are migrated with the works. They are added as records to the database so that we can verify their checksums at a later date.

ActiveFedora::SolrService.query('{!terms f=has_model_ssim}GenericWork,Collection,FileSet', rows: 1_000_000, fl: ['id', 'has_model_ssim']).map do |hit|
  Scholarsphere::Migration::Resource.find_or_create_by(pid: hit.id, model: hit.model)
end

Clear out the results from any previous runs:

Scholarsphere::Migration::Resource.update_all(client_status: nil, client_message: nil, exception: nil, started_at: nil, completed_at: nil)

Queue up the jobs, with works first, then collections:

Scholarsphere::Migration::Resource.where(model: 'GenericWork').map do |resource|
  Scholarsphere::Migration::Job.perform_later(resource)
end

Scholarsphere::Migration::Resource.where(model: 'Collection').map do |resource|
  Scholarsphere::Migration::Job.perform_later(resource)
end
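Works go first because a collection can only be created once the works it contains exist in version 4. The ordering can be sketched in plain Ruby outside the application; `QueuedResource`, `MIGRATION_ORDER`, and `migration_queue` are hypothetical stand-ins, not part of the Scholarsphere codebase:

```ruby
# Models are queued in this order; file sets are skipped because their
# files travel with the works.
MIGRATION_ORDER = %w[GenericWork Collection].freeze

# Stand-in for a Scholarsphere::Migration::Resource row (hypothetical).
QueuedResource = Struct.new(:pid, :model)

# Returns the resources to enqueue, works before collections, keeping the
# original relative order within each model.
def migration_queue(resources)
  resources
    .select { |resource| MIGRATION_ORDER.include?(resource.model) }
    .each_with_index
    .sort_by { |resource, position| [MIGRATION_ORDER.index(resource.model), position] }
    .map(&:first)
end

# Made-up rows, for illustration only.
rows = [
  QueuedResource.new('c1', 'Collection'),
  QueuedResource.new('w1', 'GenericWork'),
  QueuedResource.new('f1', 'FileSet'),
  QueuedResource.new('w2', 'GenericWork')
]

puts migration_queue(rows).map(&:pid).join(', ')
```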

Open a web page and visit the version 4 URL, but don't log in. You should begin to see works being migrated.

Verify the migration jobs are running on the version 3 instance https://scholarsphere.psu.edu/admin/queues/overview

Stop. Wait until the maintenance window the following day

Estimated time to complete the migration is 5 hours.

Phase 3: November 17th Maintenance Window

Verify Migration

Verify the migration completed by checking the Resque queue on the version 3 production jobs server. There should be no currently running jobs. Take note of any failures. We'll reprocess these later.

Log in to the production jobs server for version 3 and start a new Rails console session. Alternatively, you can reopen the tmux session from last night.

Get a listing of the status responses from the version 4 client:

Scholarsphere::Migration::Resource.select(:client_status).distinct.map(&:client_status)

Look at the messages for any failed migrations, or where the client status is nil, or anything other than 200, 201, or 303.

Scholarsphere::Migration::Resource.where(client_status: nil).where.not(model: 'FileSet').map(&:exception)
Scholarsphere::Migration::Resource.where(client_status: 500).where.not(model: 'FileSet').map(&:message).uniq
Scholarsphere::Migration::Resource.where(client_status: 422).where.not(model: 'FileSet').map(&:message)
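The triage rule above (anything other than 200, 201, or 303 needs a look, file sets excluded) can be illustrated without the application. This is a sketch with made-up rows; `MigratedResource` and `failures_by_status` are hypothetical stand-ins for the real model and are not part of the codebase:

```ruby
# Statuses that count as a successful migration.
SUCCESS_STATUSES = [200, 201, 303].freeze

# Stand-in for a Scholarsphere::Migration::Resource row (hypothetical).
MigratedResource = Struct.new(:pid, :model, :client_status)

# Groups the pids of unsuccessful, non-FileSet resources by client status.
def failures_by_status(resources)
  resources
    .reject { |r| r.model == 'FileSet' || SUCCESS_STATUSES.include?(r.client_status) }
    .group_by(&:client_status)
    .transform_values { |rows| rows.map(&:pid) }
end

# Made-up rows, for illustration only.
rows = [
  MigratedResource.new('w1', 'GenericWork', 201),
  MigratedResource.new('w2', 'GenericWork', nil),
  MigratedResource.new('c1', 'Collection', 422),
  MigratedResource.new('f1', 'FileSet', nil),
  MigratedResource.new('w3', 'GenericWork', 500)
]

failures_by_status(rows).each do |status, pids|
  puts "#{status.inspect}: #{pids.join(', ')}"
end
```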

Nil statuses are usually RSolr errors, and these can simply be rerun:

Scholarsphere::Migration::Resource.where(client_status: nil).where.not(model: 'FileSet').map do |resource|
  Scholarsphere::Migration::Job.perform_later(resource)
end

Any remaining nil statuses should be Ldp::Gone errors and can be ignored.

Re-run the 500 errors as well, but this will not fix all of them:

Scholarsphere::Migration::Resource.where(client_status: 500).where.not(model: 'FileSet').map do |resource|
  Scholarsphere::Migration::Job.perform_later(resource)
end

The remaining 500 errors should be ActionController::UrlGenerationError and will need to be fixed in post-migration. As of our most recent test migration, there are 16 of these.

422 errors are usually all collections that don't have all the works they need. First, try rerunning the jobs:

Scholarsphere::Migration::Resource.where(client_status: 422).where.not(model: 'FileSet').map do |resource|
  Scholarsphere::Migration::Job.perform_later(resource)
end

The works they are missing should all be among the 500 errors above. You can generate a report for the remaining 422 errors:

Scholarsphere::Migration::Resource.where(client_status: 422).where.not(model: 'FileSet').map do |resource|
  { model: resource.model, pid: resource.pid, message: resource.message }
end

File.write('422-report.json', _.to_json)

Create Featured Works

Using the existing featured works in version 3, create a list of featured works from the version 4 console:

LegacyIdentifier.where(old_id: ['j3t945s668', '41n79h518r', '6dj52w505v']).map do |id|
  FeaturedResource.create(resource_uuid: id.resource.uuid, resource: id.resource)
end

Re-index Works

bundle exec rake solr:reindex_works

Make 4.0.0 Live

Update DNS

scholarsphere.psu.edu IN CNAME ingress-prod.vmhost.psu.edu

Change DEFAULT_URL_HOST to the public-facing DNS name scholarsphere.psu.edu

vault kv patch secret/app/scholarsphere/prod DEFAULT_URL_HOST=scholarsphere.psu.edu

Perform a rolling restart to have the pods pickup the new config

kubectl rollout restart deployment/scholarsphere
kubectl rollout status deployment/scholarsphere

Verify https://scholarsphere.psu.edu is the new version 4.0.0 instance. Note: This process can take up to 5 minutes.

Configure Legacy Scholarsphere

On all three production VMs (web1, web2, jobs1), change service_instance to be scholarsphere-3.libraries.psu.edu

vim /opt/heracles/deploy/scholarsphere/current/config/application.yml

Restart Apache on the web VMs

sudo su -
systemctl restart httpd

Restart Resque on jobs

systemctl restart resque

Clear Cron on Version 3 Hosts

Using Capistrano, we can clear out the crontabs on our version 3 hosts.

bundle exec cap prod whenever:clear_crontab

Make Announcements

Send out appropriate emails and messages. TODO: More to come here.

Phase 4: Post-Release

At this point, 4.0.0 is officially released and is available to the public.

Updating Work Types

See https://github.com/psu-stewardship/scholarsphere-4/issues/670

Migrate Statistics

From the v.3 source application's console:

Scholarsphere::Migration::Statistics.call

This will take several minutes. Afterwards, copy the file to your local account and gzip it:

cp statistics.csv /tmp
exit
cp /tmp/statistics.csv .
gzip statistics.csv

You should now have a file named statistics.csv.gz in your local account. Copy that to your laptop via scp, then copy it up to one of the Scholarsphere pods:

kubectl get pods
kubectl cp ~/Downloads/statistics.csv.gz scholarsphere-xxxxxxxx-yyyyy:/app

Log into the pod:

kubectl exec -it scholarsphere-xxxxxxxx-yyyyy -- /bin/bash

Run the rake task to import the statistics:

source /vault/secrets/config
gunzip statistics.csv.gz
bundle exec rake migration:statistics[statistics.csv]

This will enqueue several thousand jobs to update each statistic. You can follow along from the Sidekiq queue.

Verify File Checksums

Migrated files were verified using the etags that were calculated by Amazon's S3 service upon upload. These were either md5 checksums, or multipart checksums. The details for the verification can be found here
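For reference, the two etag forms can be recomputed locally. This is a sketch of the commonly observed S3 behaviour, not the verification code used for the migration: a single-part upload's etag is the hex MD5 of the content (for unencrypted or SSE-S3 objects), while a multipart etag is the MD5 of the concatenated binary MD5 digests of each part, followed by a dash and the part count. The part size must match the one used at upload time, which is an assumption here, and the `simple_etag` and `multipart_etag` helpers are hypothetical:

```ruby
require 'digest'
require 'stringio'

# Etag of a single-part upload: the hex MD5 of the whole content.
def simple_etag(data)
  Digest::MD5.hexdigest(data)
end

# Etag of a multipart upload: MD5 over the concatenated binary MD5 digests
# of each part, suffixed with the number of parts.
# part_size must equal the size used by the uploader (assumed, not known).
def multipart_etag(io, part_size)
  digests = []
  while (chunk = io.read(part_size))
    digests << Digest::MD5.digest(chunk)
  end
  "#{Digest::MD5.hexdigest(digests.join)}-#{digests.size}"
end

# Tiny in-memory example; a real check would open the file in binary mode
# and use a part size of several megabytes.
puts simple_etag('abcdef')
puts multipart_etag(StringIO.new('abcdef'), 3)
```

A locally computed value that equals the etag reported by S3 indicates the stored bytes match the source file.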