Alert Response
SimpleReport is actively monitored by Azure's Application Insights. In the event that abnormal application behavior is detected, an alert will automatically be sent to on-call engineering personnel for resolution.
Are you on-call? Lucky you! Here are some common alerts, and how to respond to them.
Don't forget the on-call engineer also needs to address any support requests during their shift. Here's more information on how to get set up and address support requests.
Are you getting a ton of alerts and not sure what to do? See our escalation policy.
Don't forget to put up the maintenance banner if needed.
- 10+ DB queries with durations over 1.25s in the past 5 minutes
- GraphQL query validation failures
- Prod alert when an ExperianAuthException is seen
- Prod HTTP Server 5xx Errors >= 10
- QueueBatchedReportStreamUploader failed to successfully complete
- QueueBatchedReportStreamUploader is not triggering on schedule
- Twilio Alert
- Prod deploy health alert
- CDC Redirect Alert
10+ DB queries with durations over 1.25s in the past 5 minutes
What Went Wrong?
A number of known issues can cause slow DB query responses. If no issue has been opened yet for the slow DB query, please feel free to open a new one.
What Should You Do?
These alerts usually resolve themselves. However, it is recommended to check the following:
- Check the logs (follow the second link in the PagerDuty alert, which starts with https://portal.azure.com, to explore the logs) to see if this is continuing.
- Check Application Insights (e.g. for prod, prime-simple-report-prod-insights) and click the "Overview" tab to explore the graphs for server response time, etc.
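When reviewing exported log rows, it can help to re-check the condition named in this alert's title by hand. The following is a minimal Python sketch of that condition only; the function name and the (timestamp, duration) data shape are hypothetical, and the authoritative rule is whatever is configured in the Azure alert.

```python
from datetime import datetime, timedelta

SLOW_THRESHOLD_SECONDS = 1.25   # "durations over 1.25s", from the alert title
MIN_SLOW_QUERIES = 10           # "10+ DB queries", from the alert title
WINDOW = timedelta(minutes=5)   # "in the past 5 minutes", from the alert title


def slow_query_alert_fires(samples, now):
    """Return True if at least MIN_SLOW_QUERIES samples inside the window
    ending at `now` took longer than SLOW_THRESHOLD_SECONDS.

    samples: iterable of (timestamp, duration_seconds) pairs (hypothetical shape).
    """
    window_start = now - WINDOW
    slow = [
        duration
        for timestamp, duration in samples
        if window_start <= timestamp <= now and duration > SLOW_THRESHOLD_SECONDS
    ]
    return len(slow) >= MIN_SLOW_QUERIES
```

If the function still returns True for the most recent five minutes of data, the slowdown is continuing rather than resolving itself.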
GraphQL query validation failures
What Went Wrong?
This most often occurs in a "SimpleReport - Non-Prod" environment as a result of a SimpleReport engineer testing on one of the lower environments.
What Should You Do?
If this is occurring on a lower environment, it does not need the urgency of other production-related pages. However, please make sure to:
- Check the logs (follow the second link in the PagerDuty alert, which starts with https://portal.azure.com, to explore the logs) for anything noteworthy.
- Check in with the engineer who is testing on the lower environment to confirm they are in fact testing a change, feature, etc.
Prod alert when an ExperianAuthException is seen
Affected Component: SimpleReport backend LiveExperianService
What Went Wrong?
We use Experian to verify users' identities during user signup. Before we submit a request to Experian, we must first fetch an activation token from Experian using our credentials. If we see this alert, it means there was a problem fetching the token and the identity verification steps couldn't be completed for the user. More context on how we use Experian to verify identity can be found here.
What Should You Do?
- View the alert in the Azure portal. Query the exceptions table for ExperianAuthExceptions in the time period of the alert to get the stack trace, which will include the response from Experian.
- Possible Experian API responses when fetching a token are documented here.
- Most exceptions have historically been caused by intermittent 500 responses from Experian, which are not actionable and resolve themselves.
- Query requests in Azure to see whether this is a one-off and we've since had successful requests to the /identity-verification/get-questions or /identity-verification/submit-answers endpoints, or whether all requests are failing.
- This alert can be triggered if Experian doesn't recognize our credentials, which has happened in the past when they expired the application password without notifying us. If this appears to be the cause, first verify that we haven't made any changes to our credentials or the LiveExperianService code. If not, the resolution is to contact Experian for help.
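The "one-off or all failing" check in the steps above can be sketched in Python. This is an illustration only: it assumes request rows have been exported from Azure as (timestamp, path, succeeded) tuples, and the function name, return labels, and data shape are all hypothetical.

```python
from datetime import datetime

# Endpoint paths taken from the steps above.
IDENTITY_ENDPOINTS = (
    "/identity-verification/get-questions",
    "/identity-verification/submit-answers",
)


def classify_experian_failures(requests, alert_time):
    """Classify identity-verification traffic seen at or after the alert.

    requests: iterable of (timestamp, path, succeeded) tuples (hypothetical shape).
    Returns "recovered" if any later request succeeded, "still-failing" if
    every later request failed, or "no-traffic" if there were none.
    """
    after_alert = [
        succeeded
        for timestamp, path, succeeded in requests
        if timestamp >= alert_time and path.endswith(IDENTITY_ENDPOINTS)
    ]
    if not after_alert:
        return "no-traffic"
    return "recovered" if any(after_alert) else "still-failing"
```

A "recovered" result is consistent with an intermittent Experian error. A "still-failing" result is the case where credentials are worth checking. "no-traffic" means there is not yet enough evidence either way.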
Prod HTTP Server 5xx Errors >= 10 (WIP)
What Went Wrong?
There are several reasons 500 errors can be thrown.
What Should You Do?
It is recommended to check the following:
- Check the logs (follow the second link in the PagerDuty alert, which starts with https://portal.azure.com, to explore the logs) to see if this is continuing.
- Check Application Insights (e.g. for prod, prime-simple-report-prod-insights):
  - Click the "Overview" tab to explore the graphs for failed requests, etc.
  - Click the "Failures" tab to explore the 500 failures. It may be helpful to get the call stack of the errors.
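As a complement to the portal tabs, the same question can be asked in the Logs view with a Kusto query. The sketch below is a small Python helper that builds such a query string. It assumes the standard Application Insights `requests` table and its `timestamp`, `name`, and `resultCode` columns; the helper name and the default lookback are hypothetical.

```python
def failed_requests_query(lookback_minutes=30):
    """Build a Kusto (KQL) query that counts 5xx responses per operation.

    Assumes the classic Application Insights `requests` table, where
    resultCode is stored as a string (hence the toint conversion).
    """
    if lookback_minutes <= 0:
        raise ValueError("lookback_minutes must be positive")
    return "\n".join([
        "requests",
        f"| where timestamp > ago({lookback_minutes}m)",
        "| where toint(resultCode) >= 500",
        "| summarize failures = count() by name, resultCode",
        "| order by failures desc",
    ])


if __name__ == "__main__":
    # Print a query that can be pasted into the Logs view.
    print(failed_requests_query(15))
```

Grouping by operation name shows quickly whether the errors are concentrated in one endpoint or spread across the application.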
QueueBatchedReportStreamUploader failed to successfully complete
Affected Component: rs-batch-publisher-prod function app
What Went Wrong?
The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. The function was successfully triggered, but failed either to pull messages from the queue or to properly perform an upload.
What Should You Do?
- Check the function history. You can see at a glance what the most recent set of runs looks like.
- For the failed run, take note of the Operation Id. You can cross-reference this value in Application Insights to get a better picture of what caused the failure.
- If this alert is being fired off repeatedly within a short timeframe, reach out to the ReportStream team. We will need to confirm whether the issue is on the SimpleReport side, or whether it originates from ReportStream.
Follow up
- If this failure caused a message to be added to either of the error queues, fhir-data-publishing-error or test-event-publishing-error, you will need to move these messages from the error queue to the appropriate queue for reprocessing (e.g. messages in test-event-publishing-error will need to be moved to test-event-publishing-queue). Please refer to how to do this here. [LINK TO BE ADDED]
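Until the linked procedure is available, the shape of the move can be sketched in Python. This is not the team's documented process. It assumes the queues are Azure Storage queues accessed through clients that expose `receive_messages()`, `send_message(content)`, and `delete_message(message)`, as the `QueueClient` in the azure-storage-queue package does. The `InMemoryQueue` class is a hypothetical stand-in so the sketch runs without touching Azure.

```python
def requeue_messages(error_queue, target_queue):
    """Copy every message from error_queue to target_queue, deleting each
    original only after its copy has been sent. Returns the number moved.
    """
    moved = 0
    for message in error_queue.receive_messages():
        target_queue.send_message(message.content)
        error_queue.delete_message(message)
        moved += 1
    return moved


class InMemoryQueue:
    """Hypothetical stand-in that mimics the three client methods used above."""

    class _Message:
        def __init__(self, content):
            self.content = content

    def __init__(self, contents=()):
        self._messages = [self._Message(c) for c in contents]

    def receive_messages(self):
        # Return a snapshot so deleting while iterating is safe.
        return list(self._messages)

    def send_message(self, content):
        self._messages.append(self._Message(content))

    def delete_message(self, message):
        self._messages.remove(message)

    def contents(self):
        return [m.content for m in self._messages]
```

Sending before deleting means a failure partway through leaves the message in the error queue rather than losing it. With real clients, the message encoding (for example Base64) would also need to match what the function app expects, which this sketch does not address.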
QueueBatchedReportStreamUploader is not triggering on schedule
Affected Component: rs-batch-publisher-prod function app
What Went Wrong?
The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. If this alert fires, chances are high that the code for the function is missing or corrupt.
What Should You Do?
- Check the function history. You can see at a glance what the most recent set of runs looks like. Runs should take place every two minutes; a gap of longer than this confirms that the fired alert is valid.
- Take a look at the Code + Test pane. Ensure that the files present here match what currently exists in the codebase.
- If there are discrepancies between what files should be present, and what files are present, re-deploy the functions using the corresponding GitHub Action.
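The gap check in the first step can be sketched in Python, assuming the run start times from the function history have been copied out as datetime values. The function name and the 30-second tolerance are assumptions for illustration.

```python
from datetime import datetime, timedelta

EXPECTED_INTERVAL = timedelta(minutes=2)  # "every two minutes", per the step above


def find_schedule_gaps(run_times, tolerance=timedelta(seconds=30)):
    """Return (previous_run, next_run) pairs spaced further apart than the
    expected interval plus a tolerance (the tolerance is an assumption).
    """
    ordered = sorted(run_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > EXPECTED_INTERVAL + tolerance
    ]
```

Appending the current time to the list before calling the function also catches the case where the most recent run is itself overdue.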
Prod deploy health alert
Affected Component: We set up a post-prod deploy health check workflow that fires up a Selenium browser to ping a frontend page at /app/health/deploy-smoke-test. That page returns a success / failure status based on the status of /api/actuator/health/backend-and-db-smoke-test to verify the frontend and backend can talk to each other after a deploy.
What Should You Do?
- Verify that the health pages load with the UP / success statuses. If they don't, check whether the deploy broke communication between the frontend and backend. The most likely cause is a change to the Terraform config in a recent commit or some sort of manual Azure change.
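When checking the backend health page by hand, the response body can be interpreted with a small helper like the Python sketch below. It assumes the endpoint returns JSON shaped like `{"status": "UP"}`, which is Spring Boot Actuator's default health format; the helper name is hypothetical, and fetching the page is left out so the sketch stays self-contained.

```python
import json


def backend_is_up(health_body):
    """Return True only when a health response body reports status "UP".

    Anything that is not parseable JSON (for example an HTML error page)
    is treated as unhealthy.
    """
    try:
        payload = json.loads(health_body)
    except (TypeError, ValueError):
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"
```

Treating unparseable bodies as unhealthy matters here because a broken route between frontend and backend often returns an HTML error page instead of JSON.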
CDC Redirect Alert
Affected Component: We have health checks set up in Azure for a number of URLs corresponding to SimpleReport. One of them is simplereport.cdc.gov (which redirects to simplereport.gov). The infrastructure for this endpoint/redirect sits outside of SimpleReport and is not something we can directly affect.
What Should You Do?
- Most of these alerts are intermittent and will be automatically resolved by PagerDuty/Azure.
- If the alert does not auto-resolve within a matter of minutes (most resolve within 5-10 minutes), escalate via email to the CDC team that owns this infrastructure. (TODO: get an email for escalation)
Twilio Alert
Impact: Twilio message sends
Issues: Twilio tracks errors created when trying to send messages. Twilio may be experiencing a high error rate related to, but not limited to, sending messages to landlines, unreachable carriers, HTTP errors, unknown handsets, or spam filtering of messages sent by SimpleReport.
Actions to take:
- Check the Twilio error logs to see what the problems are. It is possible to filter the results and narrow or expand the displayed time frame.
- Check the Twilio status page for an outage.
- Check individual errors by clicking into an error, then clicking the RESOURCE SID link to get more information. Navigating to this view will allow us to see the number we tried to send to, the body of the message, and a complete historical record of that message within Twilio.
- Possible corrective actions:
  - Update user records within SimpleReport.
  - Submit a Twilio support ticket. Twilio suggests that we do this if we have an example of three or more filtered messages that we believe were legitimate sends.
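To get a quick picture of which failure types dominate, the error codes from the Twilio error logs can be tallied. The Python sketch below assumes the codes have been copied out as a list of integers. The function name is hypothetical, and the code-to-label table covers only a few Twilio error codes relevant to the issues listed above, so anything else is reported as "other".

```python
from collections import Counter

# A few Twilio messaging error codes matching the failure types named above.
ERROR_LABELS = {
    30003: "unreachable destination handset",
    30005: "unknown destination handset",
    30006: "landline or unreachable carrier",
    30007: "message filtered",
}
FILTERED_CODE = 30007
SUPPORT_TICKET_THRESHOLD = 3  # "three or more filtered messages"


def summarize_twilio_errors(error_codes):
    """Return (counts by label, whether there are enough filtered messages
    to consider a support ticket).
    """
    counts = Counter(error_codes)
    summary = {
        ERROR_LABELS.get(code, f"other ({code})"): n
        for code, n in counts.items()
    }
    return summary, counts[FILTERED_CODE] >= SUPPORT_TICKET_THRESHOLD
```

The second return value only says that enough filtered examples exist. Whether those messages were legitimate sends still needs a human to review them before a ticket is filed.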