# AWS Lambda to query Dovetail CloudFront usage and insert into BigQuery

Requests to the Dovetail CDN are logged to an S3 bucket.
- This lambda BigQuery-queries for the `MAX(day) FROM dt_bytes`, and processes days >= the result (or all the way back to the S3 expiration date).
- Then we Athena-query for a day of logs, grouping by path and summing bytes sent.
- Paths are parsed and grouped as `/<podcast>/<episode>/...` or `/<podcast>/<feed>/<episode>/...` (see the sketch after this list). Unrecognized paths that use a bunch of bandwidth are warning-logged.
- Resulting bytes usage is inserted back into BigQuery: `{day: "2024-04-23", feeder_podcast: 123, feeder_episode: "abcd-efgh", feeder_feed: null, bytes: 123456789}`
Local development is dependency free! Just:

```sh
yarn install
yarn test
yarn lint
```
However, if you actually want to hit Athena/BigQuery, you'll need to `cp env-example .env` and fill in several values (see the example `.env` below):

- `ATHENA_DB` - the Athena database you're using
- `ATHENA_TABLE` - the Athena table that has been configured to query the Dovetail CDN S3 logs
  - NOTE: you must have your AWS credentials set up and configured locally to reach/query Athena
- `BQ_DATASET` - the BigQuery dataset to load the `dt_bytes` table into. You should use `development` or something locally (not `staging` or `production`).
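
For reference, a filled-in `.env` might look something like this (the values are placeholders; point them at your own resources):

```sh
# placeholder values for illustration only
ATHENA_DB=my_athena_database
ATHENA_TABLE=my_dovetail_cdn_logs
BQ_DATASET=development
```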
Then run `yarn start` and you're off!
This function's code is deployed as part of the usual PRX CI/CD process. The lambda zip is built via `yarn build`, uploaded to S3, and deployed into the wild.
While that's all straightforward, there are some gotchas setting up access:

1. AWS permissions (Athena, S3, Glue, etc.) are documented in the CloudFormation stack for this app.
2. Google is configured via the `BQ_CLIENT_CONFIG` ENV and Federated Access (see the sketch below).
3. In addition to the steps documented in (2), the Service Account you create must have the following permissions:
   - `BigQuery Job User` in your BigQuery project
   - Any role on the BigQuery dataset that provides `bigquery.tables.create`, so the table load jobs can execute. We have a custom role to provide this minimal access, but any role with that create permission will work.
   - `BigQuery Data Editor` only on the `dt_bytes` table in the dataset for this environment (click the table name in the BigQuery UI -> Share -> Manage Permissions)
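
To show roughly how that configuration is consumed, here is a hedged sketch of constructing a BigQuery client from `BQ_CLIENT_CONFIG`. It assumes the variable holds a JSON credentials object (for example, a federated/external-account config) that the Node.js client library accepts; it is not this app's actual bootstrapping code.

```js
// Hedged sketch: assumes BQ_CLIENT_CONFIG holds a JSON credentials object
// (e.g. a federated / external_account config) accepted by the client
// library. Not this app's actual bootstrapping code.
const { BigQuery } = require("@google-cloud/bigquery");

function bigQueryClient() {
  const credentials = JSON.parse(process.env.BQ_CLIENT_CONFIG || "{}");
  return new BigQuery({ credentials });
}

// The roles above matter when this client runs load jobs against the
// dt_bytes table in BQ_DATASET (hence bigquery.tables.create and the
// table-level Data Editor grant).
```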