Scripts to generate institutional usage statistics for the Canadiana collections.
The workflow of the scripts is outlined below:
- **Logs Parser (`logs_parser.py`)**:
  - The process begins with `logs_parser.py`, which reads raw log files from the server.
  - This script performs a pre-filtering step to retain only the logs that correspond to page views (i.e., requests containing "/view").
  - Each individual page of a document is counted as a separate view.
  - Non-relevant logs (e.g., errors, requests other than page views) are excluded.
  - The output is a single filtered log file: a merged collection of all the log files, excluding the irrelevant "trash" lines.
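The pre-filtering step can be sketched as follows (a minimal illustration, not the actual `logs_parser.py` code; only the `"/view"` substring rule comes from this README, and the sample log lines are hypothetical):

```python
def filter_page_views(lines):
    """Keep only log lines that record a page view (contain "/view").

    Each page of a document is requested separately, so every matching
    line counts as one view; errors and other requests are dropped.
    """
    return [line for line in lines if "/view" in line]

# Hypothetical raw log lines:
raw_logs = [
    '10.0.0.1 - - [01/Jan/2024] "GET /view/doc1/1 HTTP/1.1" 200',
    '10.0.0.1 - - [01/Jan/2024] "GET /view/doc1/2 HTTP/1.1" 200',
    '10.0.0.2 - - [01/Jan/2024] "GET /favicon.ico HTTP/1.1" 404',
]
filtered = filter_page_views(raw_logs)  # two page views survive
```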
- **Usage Report Generator (`usage_report.py`)**:
  - Once the filtered logs are ready, they are passed to `usage_report.py`.
  - This script reads the filtered logs and generates an Excel usage report.
  - To generate the report, you must provide an institution name.
  - The `usage_report.py` script will then load the IP lookup table (using `ip_parser.py`) to identify the IP addresses and proxies associated with the chosen institution.
  - Finally, the script counts the number of matching entries per day from the filtered logs, providing detailed usage data.
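The per-day counting can be pictured with a sketch like this (simplified and hypothetical: the real `usage_report.py` gets its IP table from `ip_parser.py`, and the `"IP DATE ..."` line layout here is assumed for illustration):

```python
from collections import Counter

def count_views_per_day(log_lines, institution_ips):
    """Count page views per day for lines whose client IP belongs to
    the institution's IP set."""
    per_day = Counter()
    for line in log_lines:
        ip, date = line.split()[:2]  # assumed "IP DATE ..." layout
        if ip in institution_ips:
            per_day[date] += 1
    return per_day

logs = [
    "142.150.1.1 2024-01-01 GET /view/doc/1",
    "142.150.1.2 2024-01-01 GET /view/doc/2",
    "8.8.8.8 2024-01-01 GET /view/doc/1",
    "142.150.1.1 2024-01-02 GET /view/doc/3",
]
daily = count_views_per_day(logs, {"142.150.1.1", "142.150.1.2"})
```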
- **Clone the repository**: If you haven't already, clone the repository to your local machine:

  ```shell
  git clone <repository-url>
  ```

- **Set up your environment**:
  - Make a copy of the `.env.sample` file and rename it to `.env`:

    ```shell
    cp .env.sample .env
    ```

  - Open the newly created `.env` file and replace `<your-server-name>` with the appropriate server name.
- **Install dependencies**:
  - Ensure you have Python installed (recommended version: Python 3.x).
  - Install the required dependencies by running:

    ```shell
    pip install -r requirements.txt
    ```

    This will install all necessary Python packages listed in the `requirements.txt` file.
- **Prepare the required files**:
  - Obtain the latest version of the licensing team's IP Addresses Excel file and the `institutions.txt` file. Ensure that `institutions.txt` is up to date and matches the list of institutions in the IP Addresses Excel file.
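A quick way to verify that the two sources stay in sync is a set comparison like the one below (a hypothetical helper, assuming `institutions.txt` lists one institution per line; reading the actual Excel file is left out):

```python
def find_mismatches(txt_names, excel_names):
    """Report institutions present in one source but missing from the other."""
    txt, excel = set(txt_names), set(excel_names)
    return {
        "missing_from_excel": txt - excel,
        "missing_from_txt": excel - txt,
    }

# Hypothetical contents of institutions.txt and the Excel file:
mismatches = find_mismatches(
    ["University of Toronto", "McGill University"],
    ["University of Toronto"],
)
```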
Running the scripts individually can be helpful for debugging purposes or when you need to test usage data for a single institution at a time.
Before generating reports, you need to parse the raw logs by running `logs_parser.py`. This step filters out irrelevant data and creates a log file containing only page view requests.
```shell
python logs_parser.py <raw_logs_folder_path>
```

- `<raw_logs_folder_path>`: Path to the folder containing raw log files.
- The resulting filtered log file will be saved as `logs_YYYY-MM-DD.txt` in the `processed/` directory.
Run the `usage_report.py` script to create the Excel usage report based on the filtered logs and institution IP addresses, using the following command:

```shell
python usage_report.py <institution> <logs>
```

- `<institution>`: Name (or abbreviation) of the institution for which to generate the report.
- `<logs>`: Path to the filtered log file (output from `logs_parser.py`).
Note: An abbreviation column is used for institution names, based on their email domains (e.g., for University of Toronto the abbreviation is "utoronto"). This was introduced to work around Excel's 31-character limit for sheet names.
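The role of the abbreviation can be illustrated with a small sketch (a hypothetical helper, not the actual logic in `usage_report.py`):

```python
EXCEL_SHEET_NAME_LIMIT = 31  # Excel rejects sheet names longer than 31 characters

def sheet_name_for(institution, abbreviation):
    """Use the full institution name when it fits as a sheet name;
    otherwise fall back to the email-domain abbreviation."""
    if len(institution) <= EXCEL_SHEET_NAME_LIMIT:
        return institution
    return abbreviation

short = sheet_name_for("University of Toronto", "utoronto")
long_ = sheet_name_for("Memorial University of Newfoundland", "mun")
```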
`usage_report.py` uses `ip_parser.py` internally to load the IP lookup table. If needed, you can still run `ip_parser.py` manually for testing or data verification. One way to do this is to open a Python shell, import the script, and call its functions. For example:
```python
from ip_parser import *

# Path to the IP Addresses Excel file
file_path = "IP Addresses Excel file path"

# Process the IP data (skipping non-header rows)
ips_df = ips_to_df(file_path, 2)

# View the processed data
print(ips_df.head())
```
The `run_usage_reports.sh` script automates the process of generating reports for a list of institutions using a specified filtered log file.

```shell
run_usage_reports.sh <log_file_path>
```

- `<log_file_path>`: Path to a filtered log file generated by `logs_parser.py` (e.g., `logs_YYYY-MM-DD.txt`).
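Internally, such a wrapper boils down to a loop like the one below (a sketch under the assumption that `institutions.txt` holds one institution per line; `run_report` stands in for the real `python usage_report.py` call):

```shell
#!/bin/sh
# Stand-in for: python usage_report.py "$institution" "$LOG_FILE"
run_report() { echo "generating report for '$1' from $2"; }

LOG_FILE="logs_2024-01-01.txt"
printf 'University of Toronto\nMcGill University\n' > institutions.txt

# Generate one report per institution, skipping blank lines.
while IFS= read -r institution; do
    [ -n "$institution" ] && run_report "$institution" "$LOG_FILE"
done < institutions.txt > reports.log

cat reports.log
```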
- Filtered Logs: Generated by `logs_parser.py` (e.g., `logs_YYYY-MM-DD.txt`).
- Usage Report: Excel report generated by `usage_report.py` (e.g., `usage-report.xlsx`).
By default:
- Filtered logs are stored in the `processed/` folder.
- Final reports are saved in the `reports/` folder.
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.