The data pipeline project was created for the data analytics team at Data Services in the VT University Libraries (VTUL). It was designed for Ellie Kohler, head of the Library Data Analytics and Assessment team, to analyze library data. Each week, the project collects data from LibInsight in CSV format and stores it in the AWS S3 bucket analytics-datapipeline. The data is then queried with AWS Athena and uploaded to Tableau for analysis.

The script that collects the LibInsight data, libInsightData_ec2inst, is a Lambda function that is triggered weekly. In the libinsight Athena database, a table maps the original LibInsight data file. An Athena query runs against the original LibInsight data file and stores the results in a different S3 bucket, lib-insight-serialized-data... This Athena query is coded into a Lambda function, which is triggered every time the original LibInsight data file in S3 is updated, which happens weekly.
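A minimal sketch of how a Lambda function could start the Athena query when the source file changes. This is an illustration, not the project's actual code: the environment variable names (ATHENA_DATABASE, ATHENA_QUERY, ATHENA_OUTPUT_LOCATION) and the optional `athena` argument are assumptions made here so that no bucket name or SQL is hard-coded. The only AWS call used is boto3's Athena `start_query_execution`.

```python
import os


def build_query_request(database, query, output_location):
    """Assemble the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def lambda_handler(event, context, athena=None):
    """Start the Athena query; intended to run on an S3 update trigger."""
    if athena is None:
        # Imported lazily so the request builder can be used without AWS access.
        import boto3
        athena = boto3.client("athena")

    # Hypothetical configuration: all three values come from the environment.
    request = build_query_request(
        database=os.environ["ATHENA_DATABASE"],
        query=os.environ["ATHENA_QUERY"],
        output_location=os.environ["ATHENA_OUTPUT_LOCATION"],
    )
    response = athena.start_query_execution(**request)
    return {"QueryExecutionId": response["QueryExecutionId"]}
```

The query runs asynchronously, so the handler returns only the execution ID; Athena writes the results to the configured output location when the query finishes.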
The Tableau account is associated with the user data-analytics-team. The IAM policy on this user gives the Tableau account holder (Ellie Kohler) read and write access to Athena queries, read-only access to the original S3 bucket analytics-datapipeline, and read-write access to the Athena query results S3 bucket lib-insight-serialized-data.. The Athena query results are uploaded to Tableau, and the upload is refreshed automatically each week as the Athena query results are updated.
The script (Lambda function) is broken down into the following parts:
- Get query results from LibInsight using the LibInsight API with the parameters LibInsight ID, date range, and LibInsight token
- Append all pages of the LibInsight query results together as one dictionary, because the LibInsight API returns query results one page at a time
- Transform the query response parameters to fit the needs of the data analytics team.
- Serialize the data and upload the record to the S3 bucket as a CSV file
- Create the Athena query and store the query results in the S3 bucket
- Upload the Athena query results to Tableau for data analysis and automate the upload
- Add triggers to automate the upload on a weekly basis
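
The page-merging and CSV steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's script: `fetch_page` is a hypothetical callable that stands in for the LibInsight API request, and the `records` and `total_pages` keys are an assumed response shape. The pages are merged into a single list here for simplicity.

```python
import csv
import io


def collect_all_pages(fetch_page):
    """Call fetch_page(1), fetch_page(2), ... and merge every page's records.

    fetch_page is a hypothetical callable returning a dict with the assumed
    keys "records" (a list of dicts) and "total_pages" (an int).
    """
    records = []
    page = 1
    while True:
        response = fetch_page(page)
        records.extend(response["records"])
        if page >= response["total_pages"]:
            break
        page += 1
    return records


def to_csv(records, fieldnames):
    """Serialize records to CSV text, keeping only the listed fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Restricting `fieldnames` is one simple way to drop response fields the analytics team does not need. The returned text could then be passed as the `Body` of a boto3 S3 `put_object` call to upload the CSV file.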
The Lambda function that starts the EC2 instance is StartLibInsightEC2Instance, and the Lambda function that stops it is StopLibInsightEC2Instance. Triggers are also added to these Lambda functions.
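
A minimal sketch of what a start/stop pair of handlers could look like. It is not the code of StartLibInsightEC2Instance or StopLibInsightEC2Instance: the INSTANCE_ID environment variable, the handler names, and the optional `ec2` argument are assumptions made for illustration. The AWS calls are boto3's EC2 `start_instances` and `stop_instances`.

```python
import os


def set_instance_state(action, instance_id, ec2=None):
    """Start or stop one EC2 instance; action must be "start" or "stop"."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action!r}")
    if ec2 is None:
        # Imported lazily so the validation above works without AWS access.
        import boto3
        ec2 = boto3.client("ec2")
    if action == "start":
        ec2.start_instances(InstanceIds=[instance_id])
    else:
        ec2.stop_instances(InstanceIds=[instance_id])
    return {"action": action, "instance_id": instance_id}


def start_handler(event, context):
    """Hypothetical Lambda entry point that starts the instance."""
    return set_instance_state("start", os.environ["INSTANCE_ID"])


def stop_handler(event, context):
    """Hypothetical Lambda entry point that stops the instance."""
    return set_instance_state("stop", os.environ["INSTANCE_ID"])
```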
See the wiki for documentation. For more detailed documentation, see readme-notes.md.
This project is hosted in the Data Services AWS account. AWS components are identified by the following tags:
- Unit : AnalyticsAssessment
- Owner : DataServices
- Stack : Test
- User : elliek
- Application : DataPipeline
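
As an illustration of how these tags could be used to locate the project's components, the sketch below queries the AWS Resource Groups Tagging API through boto3's `get_resources`. The function names and the optional `client` argument are assumptions made here, and the tag values are copied from the list above.

```python
PROJECT_TAGS = {
    "Unit": "AnalyticsAssessment",
    "Owner": "DataServices",
    "Stack": "Test",
    "User": "elliek",
    "Application": "DataPipeline",
}


def build_tag_filters(tags):
    """Convert a tag dict into the TagFilters structure get_resources expects."""
    return [{"Key": key, "Values": [value]} for key, value in tags.items()]


def find_project_resources(client=None):
    """Return the ARNs of all resources that carry every project tag."""
    if client is None:
        # Imported lazily so build_tag_filters can be used without AWS access.
        import boto3
        client = boto3.client("resourcegroupstaggingapi")

    arns = []
    kwargs = {"TagFilters": build_tag_filters(PROJECT_TAGS)}
    while True:
        response = client.get_resources(**kwargs)
        arns.extend(m["ResourceARN"] for m in response["ResourceTagMappingList"])
        token = response.get("PaginationToken")
        if not token:  # an empty or missing token means there are no more pages
            return arns
        kwargs["PaginationToken"] = token
```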