2021 09 18 (Saturday) Deployment
This deployment includes new features from Batch 8 and re-attempts a previous deployment that included a few bug fixes.
That previous deployment failed because one segment had become overloaded with records, and the Lambda that migrated those records timed out at 15 minutes. We wrote a script to move those old, unused records into partitions based on the month in which they occurred. This lets us use them in the future for pagination and avoids a hot partition. It also had the unexpected bonus of cutting our Glue jobs from the Production to the Staging account to 2 hours and 16 minutes, down from 7 hours and 54 minutes on the most recent attempt.
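As a rough illustration of the month-based partitioning described above, the sketch below derives a partition key from a record's timestamp. The `archived-record` prefix, the `|` separator, and the function name are assumptions for illustration; they are not taken from the actual migration script.

```python
from datetime import datetime, timezone


def monthly_partition_key(prefix: str, occurred_at: datetime) -> str:
    """Return a key that buckets a record by the UTC month it occurred in.

    Hypothetical key layout: "<prefix>|YYYY-MM".
    """
    if occurred_at.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        occurred_at = occurred_at.replace(tzinfo=timezone.utc)
    utc = occurred_at.astimezone(timezone.utc)
    return f"{prefix}|{utc:%Y-%m}"


key = monthly_partition_key(
    "archived-record", datetime(2021, 9, 18, 22, 0, tzinfo=timezone.utc)
)
print(key)  # archived-record|2021-09
```

Spreading old records across one partition per month keeps any single partition from growing without bound, which is the hot-partition concern mentioned above.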
One thing we observed while deploying this in the staging environments is that the first deploy will fail. The deployment requires a zip file to be available, but that same deployment is what builds and prepares the file. After the failure, a script can be run to make the file available to the deployment. It's somewhat tedious: because this deployment involves a migration, you also need to re-delete the resources (DynamoDB and Elasticsearch) that Terraform created.
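To make the bootstrap step concrete: the output of `./setup-s3-maintenance-file.sh prod`, recorded in the timeline below, shows the existing green zip being copied to the missing blue key in each region's lambdas bucket. The sketch below only computes those source and destination URIs. The function name and parameters are hypothetical, and it does not reproduce the real script.

```python
from typing import List, Sequence, Tuple


def maintenance_copy_plan(
    domain: str,
    env: str,
    regions: Sequence[str] = ("us-east-1", "us-west-1"),
    source_color: str = "green",
    target_color: str = "blue",
) -> List[Tuple[str, str]]:
    """Return (source, destination) S3 URIs for the maintenance zip per region.

    The bucket naming pattern is inferred from the copy output in the log.
    """
    plan = []
    for region in regions:
        bucket = f"{domain}.efcms.{env}.{region}.lambdas"
        plan.append(
            (
                f"s3://{bucket}/maintenance_notify_{source_color}.js.zip",
                f"s3://{bucket}/maintenance_notify_{target_color}.js.zip",
            )
        )
    return plan


for source, destination in maintenance_copy_plan("dawson.ustaxcourt.gov", "prod"):
    print(f"copy: {source} to {destination}")
```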
A new feature of this deployment is readonly smoketests. For these to function, we needed to specify `USTC_ADMIN_USER` in addition to `USTC_ADMIN_PASS` in the CircleCI environment variables.
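A small pre-flight check can catch a missing variable before the smoketests start. This is a minimal sketch, not part of the actual pipeline; the function name is hypothetical and only the two variable names come from this log.

```python
from typing import List, Mapping, Sequence

REQUIRED_VARS = ("USTC_ADMIN_USER", "USTC_ADMIN_PASS")


def missing_env_vars(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]
```

For example, calling `missing_env_vars(os.environ)` at the top of the smoketest job and exiting when the list is non-empty would fail fast with a clear message.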
Until we can accomplish a more seamless user experience for deployments, we will be performing deployments on weekends or after midnight on weekdays.
- https://github.com/flexion/ef-cms/issues/8872
- https://github.com/flexion/ef-cms/issues/8588
- https://github.com/flexion/ef-cms/issues/8828
- https://github.com/flexion/ef-cms/issues/8920
- https://github.com/flexion/ef-cms/issues/8892
- https://github.com/flexion/ef-cms/issues/7095
- https://github.com/flexion/ef-cms/issues/8763
- https://github.com/flexion/ef-cms/issues/8824
- https://github.com/flexion/ef-cms/issues/8868
- https://github.com/flexion/ef-cms/issues/8509
- https://github.com/flexion/ef-cms/issues/8508
- 22:00 - deleted the efcms-prod-beta table (full of the previous migration attempt) in east & west
- 22:02 - deleted efcms-search-prod-beta cluster
- 22:03 - made the Pull Request
- 22:05 - ran account-specific
  NOTE: Using 8 nodes in prod for the Kibana cluster until we make snapshots and are comfortable running without all of the logs to search upon.
- 22:18 - added `USTC_ADMIN_USER` and `USTC_ADMIN_PASS` environment variables to CircleCI
- 22:46 - ran `./docker-to-ecr.sh latest`
- 22:47 - account specific finished! ✅
- 22:47 - run env specific
- 22:49 - need to add feature flag for maintenance mode
  $ ./scripts/set-maintenance-mode.sh false prod
- 22:51 - merged PR ✅; CircleCI build (this is expected to fail in the deploy step)
- 23:28 - deploy failed
  ```
  │ Error: failed getting S3 Bucket (*********************.efcms.****.us-east-1.lambdas) Object (maintenance_notify_blue.js.zip): NotFound: Not Found
  │ status code: 404, request id: H5JCKNCXRYRPWE4T, host id: XtGQYbjtV15imEL4gr2IxB1LKnFwyN6+01S+yIulSkuUmp2Dxan3zTVeIQri7rr/Gf5nIQPuthM=
  │
  │ with module.ef-cms_apis.data.aws_s3_bucket_object.maintenance_notify_blue_east_object,
  │ on ../template/main-east.tf line 182, in data "aws_s3_bucket_object" "maintenance_notify_blue_east_object":
  │ 182: data "aws_s3_bucket_object" "maintenance_notify_blue_east_object" {
  │
  ╵
  ╷
  │ Error: failed getting S3 Bucket (*********************.efcms.****.us-west-1.lambdas) Object (maintenance_notify_blue.js.zip): NotFound: Not Found
  │ status code: 404, request id: T7X3ZB5AHFNNVFHC, host id: pqlwSMjvpbpeTO4f9RO4lVLpQ711FjGd5cj1LVJPAvqe53jmVtTU7xLhZXzUt4gCAcoh6D53/AE=
  │
  │ with module.ef-cms_apis.data.aws_s3_bucket_object.maintenance_notify_blue_west_object,
  │ on ../template/main-west.tf line 137, in data "aws_s3_bucket_object" "maintenance_notify_blue_west_object":
  │ 137: data "aws_s3_bucket_object" "maintenance_notify_blue_west_object" {
  │
  ```
- 23:28 - need to run maintenance script
  $ ./setup-s3-maintenance-file.sh prod
  copy: s3://dawson.ustaxcourt.gov.efcms.prod.us-east-1.lambdas/maintenance_notify_green.js.zip to s3://dawson.ustaxcourt.gov.efcms.prod.us-east-1.lambdas/maintenance_notify_blue.js.zip
  copy: s3://dawson.ustaxcourt.gov.efcms.prod.us-west-1.lambdas/maintenance_notify_green.js.zip to s3://dawson.ustaxcourt.gov.efcms.prod.us-west-1.lambdas/maintenance_notify_blue.js.zip
- 23:29 - delete DynamoDB East & West tables
- 23:36 - retry from failed
- 23:59 - confirm that the deploy table was updated correctly, and the migration will be `alpha` => `beta`
- 00:10 - migration started
- 01:12 - migration finished! ✅
- 01:15 - re-indexing begins 📈
- 01:45 - updated admin pass and CircleCI env var
- 01:50 - re-ran from failed for readonly smoketests
  NOTE: Observed that the `USTC_ADMIN_USER` was still enabled after the failure. Need to gracefully handle a failure and disable the admin account.
- 01:55 - adjusted admin pass again
- 01:57 - re-ran smoketests
- 02:03 - maintenance:disengage
- 01:57 - re-ran smoketests
- 02:11 - they pass!
- 02:11 - observed the `USTC_ADMIN_USER` is disabled, as is the `testAdmissionsClerk` account.
- 02:15 - checked the health of the migration; indexing rate is 80k
  ┌─────────┬───────────────────────┬────────────┬───────────┬──────────┐
  │ (index) │ indexName             │ countAlpha │ countBeta │     diff │
  ├─────────┼───────────────────────┼────────────┼───────────┼──────────┤
  │    0    │ 'efcms-case'          │    1999766 │    532582 │  1467184 │
  │    1    │ 'efcms-case-deadline' │      17580 │       527 │    17053 │
  │    2    │ 'efcms-docket-entry'  │   18344494 │   5653610 │ 12690884 │
  │    3    │ 'efcms-message'       │     343346 │     93030 │   250316 │
  │    4    │ 'efcms-user'          │    2019112 │    341324 │  1677788 │
  │    5    │ 'efcms-user-case'     │    1995026 │     53493 │  1941533 │
  │    6    │ 'efcms-work-item'     │     891996 │    174096 │   717900 │
  └─────────┴───────────────────────┴────────────┴───────────┴──────────┘
  Total Difference: 18762658 (6848662/25611320) 26.74%
- 03:00 - re-indexing is progressing
  ┌─────────┬───────────────────────┬────────────┬───────────┬─────────┐
  │ (index) │ indexName             │ countAlpha │ countBeta │    diff │
  ├─────────┼───────────────────────┼────────────┼───────────┼─────────┤
  │    0    │ 'efcms-case'          │    1999766 │    997323 │ 1002443 │
  │    1    │ 'efcms-case-deadline' │      17580 │      3985 │   13595 │
  │    2    │ 'efcms-docket-entry'  │   18344494 │   9907253 │ 8437241 │
  │    3    │ 'efcms-message'       │     343346 │    123550 │  219796 │
  │    4    │ 'efcms-user'          │    2019114 │   1012755 │ 1006359 │
  │    5    │ 'efcms-user-case'     │    1995026 │    429675 │ 1565351 │
  │    6    │ 'efcms-work-item'     │     891996 │    339182 │  552814 │
  └─────────┴───────────────────────┴────────────┴───────────┴─────────┘
  Total Difference: 12797599 (12813723/25611322) 50.03%
- 04:45 - re-indexing is continuing
  ┌─────────┬───────────────────────┬────────────┬───────────┬────────┐
  │ (index) │ indexName             │ countAlpha │ countBeta │   diff │
  ├─────────┼───────────────────────┼────────────┼───────────┼────────┤
  │    0    │ 'efcms-case'          │    1999766 │   1841790 │ 157976 │
  │    1    │ 'efcms-case-deadline' │      17580 │      8170 │   9410 │
  │    2    │ 'efcms-docket-entry'  │   18344494 │  17687738 │ 656756 │
  │    3    │ 'efcms-message'       │     343346 │    259407 │  83939 │
  │    4    │ 'efcms-user'          │    2019114 │   1490314 │ 528800 │
  │    5    │ 'efcms-user-case'     │    1995026 │   1682098 │ 312928 │
  │    6    │ 'efcms-work-item'     │     891996 │    679024 │ 212972 │
  └─────────┴───────────────────────┴────────────┴───────────┴────────┘
  Total Difference: 1962781 (23648541/25611322) 92.33%
05:11 - Indexing is complete, but there appear to have been errors, which is why it took so long. There must be something up with the 6 missing user cases. I'm calling the deployment for now, and will investigate tomorrow.
┌─────────┬───────────────────────┬────────────┬───────────┬──────┐
│ (index) │ indexName             │ countAlpha │ countBeta │ diff │
├─────────┼───────────────────────┼────────────┼───────────┼──────┤
│    0    │ 'efcms-case'          │    1999766 │   1999766 │    0 │
│    1    │ 'efcms-case-deadline' │      17580 │     17580 │    0 │
│    2    │ 'efcms-docket-entry'  │   18344494 │  18344494 │    0 │
│    3    │ 'efcms-message'       │     343346 │    343346 │    0 │
│    4    │ 'efcms-user'          │    2019114 │   2019114 │    0 │
│    5    │ 'efcms-user-case'     │    1995026 │   1995020 │    6 │
│    6    │ 'efcms-work-item'     │     891996 │    891996 │    0 │
└─────────┴───────────────────────┴────────────┴───────────┴──────┘
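The `Total Difference` line under the earlier snapshots can be reproduced from the per-index counts. The sketch below does so for the 02:15 snapshot. The function names are hypothetical, and this is not the script that produced the tables.

```python
from typing import Dict, Tuple

# (countAlpha, countBeta) per index, copied from the 02:15 snapshot.
SNAPSHOT_0215: Dict[str, Tuple[int, int]] = {
    "efcms-case": (1999766, 532582),
    "efcms-case-deadline": (17580, 527),
    "efcms-docket-entry": (18344494, 5653610),
    "efcms-message": (343346, 93030),
    "efcms-user": (2019112, 341324),
    "efcms-user-case": (1995026, 53493),
    "efcms-work-item": (891996, 174096),
}


def migration_progress(
    counts: Dict[str, Tuple[int, int]]
) -> Tuple[int, int, int, float]:
    """Return (remaining, migrated, total, percent migrated) across all indices."""
    total = sum(alpha for alpha, _ in counts.values())
    migrated = sum(beta for _, beta in counts.values())
    percent = 100.0 * migrated / total if total else 0.0
    return total - migrated, migrated, total, percent


def format_progress(counts: Dict[str, Tuple[int, int]]) -> str:
    """Format the summary in the same shape as the log's last line."""
    remaining, migrated, total, percent = migration_progress(counts)
    return f"Total Difference: {remaining} ({migrated}/{total}) {percent:.2f}%"


print(format_progress(SNAPSHOT_0215))
# Total Difference: 18762658 (6848662/25611320) 26.74%
```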
We experienced a few hiccups in tonight's deployment, and further investigation is required. An outline of the items follows.
Six records were missing in the `efcms-user-case` index. So, I created an Elasticsearch script to identify them by querying that index on both clusters and finding the records that did not exist on the destination cluster. These were the six records:
user|6e9acd85-ea2d-40a3-9c66-d0c982eafdcf case|7275-19
user|cc90d791-6224-4bab-bf70-1c43327807a0 case|12103-19
user|cc90d791-6224-4bab-bf70-1c43327807a0 case|12106-19
user|f6ff6e98-4d8f-4695-8ccc-36a6afefb460 case|16612-21
user|be5a7e6f-c734-4e59-9890-02e341ed3e4d case|16960-21
user|5c7824f5-0120-4df3-924d-fd2de169190b case|15511-21
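The comparison itself is a set difference. The sketch below assumes the document identifiers from each cluster have already been fetched into plain collections of strings; the Elasticsearch queries are omitted, and the function name and identifier format are hypothetical.

```python
from typing import Iterable, List


def missing_in_destination(
    source_ids: Iterable[str], destination_ids: Iterable[str]
) -> List[str]:
    """Return identifiers present in the source index but absent from the destination."""
    destination = set(destination_ids)
    return sorted({doc_id for doc_id in source_ids if doc_id not in destination})


# Illustrative identifiers only; the real script compared the two clusters.
alpha_ids = ["user|a case|1-21", "user|b case|2-21", "user|c case|3-21"]
beta_ids = ["user|a case|1-21", "user|c case|3-21"]
print(missing_in_destination(alpha_ids, beta_ids))  # ['user|b case|2-21']
```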
Took a look in DynamoDB, and I found records for the first three in the source table, but could not find records for the last three. 🤔 All of the user records are IRS Practitioners.
So, I looked more closely at the application logs, and I was able to identify that these IRS Practitioners had recently been removed from these cases. And it would seem that the operation did not properly remove the mapping records from Elasticsearch after removing the records from DynamoDB:
- Sep 10, 2021 @ 15:50:09.843 - /case-parties/16612-21/counsel/f6ff6e98-4d8f-4695-8ccc-36a6afefb460 DELETE
- Sep 17, 2021 @ 15:29:58.983 - /case-parties/16960-21/counsel/be5a7e6f-c734-4e59-9890-02e341ed3e4d DELETE
- Sep 10, 2021 @ 15:09:36.463 - /case-parties/15511-21/counsel/5c7824f5-0120-4df3-924d-fd2de169190b DELETE
And these three, where the record exists in the source table and not the destination table, were removed Sunday or Monday. This explains why they were missing when running the query on Monday afternoon.
- Sep 20, 2021 @ 10:37:05.425 - /case-parties/7275-19/counsel/6e9acd85-ea2d-40a3-9c66-d0c982eafdcf DELETE
- Sep 20, 2021 @ 07:14:04.308 - /case-parties/12103-19/counsel/cc90d791-6224-4bab-bf70-1c43327807a0 DELETE
- Sep 19, 2021 @ 15:08:34.391 - /case-parties/12106-19/counsel/cc90d791-6224-4bab-bf70-1c43327807a0 DELETE
So, all of the records did get migrated -- eventually! An investigation into whether we are properly and reliably removing User Case records from ES is in order.
Re-indexing appeared to really slow down at around 3am. Compared with previous deployments that involved a blue-green migration, this one took about twice as long.
[Figure: Usual deployment]
[Figure: Saturday's deployment]
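The snapshot tables give a rough way to quantify the slowdown: the growth in the Beta document total between two snapshots, divided by the elapsed minutes. The sketch below uses the Beta totals from the 02:15, 03:00, 04:45, and 05:11 tables (the last one summed from the per-index `countBeta` values). These are averages over each window, not the cluster's own indexing-rate metric, and the function name is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

# (time, total documents in Beta), taken from the snapshot tables in this log.
SNAPSHOTS: List[Tuple[str, int]] = [
    ("02:15", 6848662),
    ("03:00", 12813723),
    ("04:45", 23648541),
    ("05:11", 25611316),
]


def indexing_rates(snapshots: List[Tuple[str, int]]) -> List[Tuple[str, str, float]]:
    """Return (start, end, average documents per minute) for consecutive snapshots.

    Assumes all timestamps fall on the same day and are in ascending order.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(snapshots, snapshots[1:]):
        elapsed = datetime.strptime(t1, "%H:%M") - datetime.strptime(t0, "%H:%M")
        minutes = elapsed.total_seconds() / 60
        rates.append((t0, t1, (c1 - c0) / minutes))
    return rates


for start, end, rate in indexing_rates(SNAPSHOTS):
    print(f"{start}-{end}: {rate:,.0f} docs/min")
```

On these numbers the average falls from roughly 133k documents per minute before 03:00 to roughly 103k between 03:00 and 04:45, and about 75k in the final window. Indexing may have finished before 05:11, so the last figure is a lower bound.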
- Try to replicate removing an IRS Practitioner from a case and see that the record doesn't get removed from Elasticsearch
- Work with AWS Support to identify issues during re-indexing.
- Add story to make sure that the `USTC_ADMIN_USER` gets disabled if the readonly smoketest script fails.
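One possible shape for that last item is to wrap the smoketest run so the account is disabled on both success and failure. This is a minimal sketch under stated assumptions: `enable` and `disable` stand in for whatever calls actually toggle the admin account, and none of these names come from the project.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def admin_account_enabled(
    enable: Callable[[], None], disable: Callable[[], None]
) -> Iterator[None]:
    """Enable the admin account for the duration of the block, then always disable it."""
    enable()
    try:
        yield
    finally:
        # Runs whether the smoketests pass, fail, or raise.
        disable()
```

Used as `with admin_account_enabled(enable_admin, disable_admin): run_smoketests()`, where all three callables are placeholders, a failing smoketest would still leave the account disabled before the error propagates.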