Automation of data collection and cleaning to facilitate analysis and reporting of library interactions.

VTUL/Data-Pipeline

Data Pipeline Project

The Data Pipeline project was created for the data analytics team in Data Services at VT University Libraries (VTUL). It was designed for Ellie Kohler, head of the Library Data Analytics and Assessment team, to analyze library data. Each week the project collects data from LibInsight and stores it in CSV format in the AWS S3 bucket analytics-datapipeline. The data is then queried with AWS Athena and the results are uploaded to Tableau for analysis.

The script that collects the LibInsight data, libInsightData_ec2inst, is a Lambda function triggered on a weekly basis. In the libinsight Athena database, a table maps the original LibInsight data file. An Athena query runs against that data and stores its results in a separate S3 bucket, lib-insight-serialized-data... This query is coded into a Lambda function whose trigger fires every time the original LibInsight data file in S3 is updated, which happens weekly.
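The S3-triggered Athena step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes Python with boto3's Athena client (`start_query_execution`), and the function names, the table name `libinsight_raw`, and the SQL are hypothetical. The results location is passed in as a parameter rather than hard-coded.

```python
def build_query_request(database, sql, output_location):
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def updated_keys(event):
    """Extract (bucket, key) pairs from an S3 event notification."""
    return [
        (record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])
        for record in event.get("Records", [])
    ]


def handle_s3_update(event, athena_client, results_location,
                     database="libinsight",
                     sql="SELECT * FROM libinsight_raw"):  # hypothetical table
    """Start one Athena query if the event reports any updated object."""
    if not updated_keys(event):
        return None  # nothing changed, so no query is started
    request = build_query_request(database, sql, results_location)
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]
```

Inside a Lambda handler, `athena_client` would typically be `boto3.client("athena")` and `results_location` an `s3://` URI pointing at the query results bucket. Passing the client in keeps the logic testable without AWS credentials.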

The Tableau account is associated with the user data-analytics-team. The IAM policy on this user gives the Tableau account holder (Ellie Kohler) access to Athena queries (read and write), access to the original S3 bucket analytics-datapipeline (read-only), and access to the Athena query results S3 bucket lib-insight-serialized-data.. (read-write). The Athena query results are uploaded to Tableau, and this upload is also refreshed weekly, following the Athena query updates.
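The access split described above (Athena queries, read-only on the source bucket, read-write on the results bucket) might be expressed as an IAM policy document along these lines. This is an illustrative sketch only: the function name and the exact action lists are assumptions, the bucket names are parameters, and a working Athena setup likely needs further permissions (for example on the data catalog) that are not shown here.

```python
import json


def build_policy(source_bucket, results_bucket):
    """Return an IAM policy document as a dict (illustrative, not exhaustive)."""

    def arns(bucket):
        # Bucket-level ARN (for listing) plus object-level ARN (for get/put).
        return [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]

    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # run Athena queries and fetch their results
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": "*",
            },
            {   # read-only access to the original data bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": arns(source_bucket),
            },
            {   # read-write access to the query results bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket", "s3:PutObject"],
                "Resource": arns(results_bucket),
            },
        ],
    }


if __name__ == "__main__":
    # "example-results-bucket" is a placeholder, not the project's bucket name.
    print(json.dumps(build_policy("analytics-datapipeline", "example-results-bucket"), indent=2))
```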

The script (Lambda function) is broken down into the following parts:

  • Get query results from LibInsight using the LibInsight API with these parameters: LibInsight ID, date range, and LibInsight token.
  • Append all pages of the query results together into one dictionary, since the LibInsight API returns results one page at a time.
  • Transform the query response fields to fit the needs of the data analytics team.
  • Serialize the data and upload the record to the S3 bucket as a CSV file.
  • Create the Athena query and store the query results in the S3 bucket.
  • Upload the Athena query results to Tableau and automate their refresh for data analysis.
  • Add triggers to automate the upload on a weekly basis.
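The paging, transform, and CSV steps in the list above could look something like the sketch below. It is written under stated assumptions: `fetch_page` is a hypothetical callable standing in for the LibInsight API request, and the response shape (`records`, `total_pages`, an `id` field per record) is invented for illustration, so the real API response may differ.

```python
import csv
import io


def fetch_all_pages(fetch_page):
    """Collect records from every page into one dictionary keyed by record id.

    `fetch_page(page)` is a hypothetical callable returning a dict such as
    {"records": [...], "total_pages": N}.
    """
    combined = {}
    page = 1
    while True:
        payload = fetch_page(page)
        for record in payload.get("records", []):
            combined[record["id"]] = record
        if page >= payload.get("total_pages", 1):
            break
        page += 1
    return combined


def transform(record, keep_fields):
    """Keep only the fields the analysts need, filling gaps with ''."""
    return {field: record.get(field, "") for field in keep_fields}


def to_csv(records, fieldnames):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(transform(record, fieldnames))
    return buffer.getvalue()
```

The resulting CSV text could then be uploaded with boto3's S3 client, for example via `put_object(Bucket=..., Key=..., Body=...)`. Keying the combined dictionary by record id also means a record that appears on two pages is stored only once.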

The Lambda function that starts the EC2 instance is StartLibInsightEC2Instance, and the Lambda function that stops it is StopLibInsightEC2Instance. Triggers are also added to these Lambda functions.
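For orientation, start and stop functions of this kind are often thin wrappers around boto3's EC2 client. The sketch below assumes that approach; the helper names and the instance id are hypothetical, and the actual StartLibInsightEC2Instance and StopLibInsightEC2Instance code may be organized differently.

```python
def start_instance(ec2_client, instance_id):
    """Ask EC2 to start one instance and return the state it reports."""
    response = ec2_client.start_instances(InstanceIds=[instance_id])
    return response["StartingInstances"][0]["CurrentState"]["Name"]


def stop_instance(ec2_client, instance_id):
    """Ask EC2 to stop one instance and return the state it reports."""
    response = ec2_client.stop_instances(InstanceIds=[instance_id])
    return response["StoppingInstances"][0]["CurrentState"]["Name"]
```

In a Lambda handler, `ec2_client` would typically be `boto3.client("ec2")`, and the instance id would come from configuration such as an environment variable.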

Documentation

See the wiki for documentation. For more detailed documentation, see readme-notes.md.

Environment

This project is hosted in the Data Services AWS account. AWS components are identified by the following tags:

  • Unit : AnalyticsAssessment
  • Owner : DataServices
  • Stack : Test
  • User : elliek
  • Application : DataPipeline
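As a small illustration of how the tags above could be used to locate the project's resources, the sketch below converts them into the `Filters` format accepted by EC2 describe calls and checks a boto3-style tag list against them. The helper names are hypothetical; only the tag keys and values come from the list above.

```python
PIPELINE_TAGS = {
    "Unit": "AnalyticsAssessment",
    "Owner": "DataServices",
    "Stack": "Test",
    "User": "elliek",
    "Application": "DataPipeline",
}


def to_ec2_filters(tags):
    """Convert a tag dict into the Filters list used by EC2 describe calls."""
    return [{"Name": f"tag:{key}", "Values": [value]} for key, value in tags.items()]


def has_tags(resource_tags, required):
    """Check a tag list of {"Key": ..., "Value": ...} dicts for all required tags."""
    found = {tag["Key"]: tag["Value"] for tag in resource_tags}
    return all(found.get(key) == value for key, value in required.items())
```

For example, `ec2_client.describe_instances(Filters=to_ec2_filters(PIPELINE_TAGS))` would restrict results to instances carrying all five tags.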
