Basic Usage

John Wieczorek edited this page Feb 18, 2019 · 13 revisions

  1. Overview
  2. Dataset information setup
    1. Example:
  3. Repository name consistency check
  4. Manual calculation of stats for a month
    1. Examples for April 2012 usage
      1. Default usage: No override, no testing, no GitHub processing
      2. Overriding an existing period
      3. Storing on GitHub and sending notification issue
      4. Using a custom table
  5. Restart after failure
    1. Scenario 1, failed in the middle of storing issues on GitHub
  6. Manual storing of report files on GitHub
  7. Creating issues on GitHub for stored reports

Overview

The usagestats service can be turned off to avoid abuse and keep usage costs down. It can be managed at:

https://console.cloud.google.com/appengine/versions?project=vertnet-portal&serviceId=tools-usagestats&versionssize=50

Base URL: http://<version>.tools-usagestats.vertnet-portal.appspot.com/

Currently available versions:

  • prod
  • dev (not deployed)

How to build URLs properly:

<Base URL>/<family>/<command>

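The URL pattern above can be sketched as a small shell helper. build_url is a hypothetical name introduced here for illustration; it is not part of the service. The examples on this page omit the version prefix, so the helper accepts an empty version for that case.

```shell
#!/bin/sh
# build_url: hypothetical helper that composes <Base URL>/<family>/<command>.
# Usage: build_url VERSION FAMILY COMMAND
#   VERSION  e.g. prod; pass "" to omit the version prefix
#   FAMILY   e.g. admin/parser
#   COMMAND  e.g. init
build_url() {
    version="$1"
    family="$2"
    cmd="$3"
    host="tools-usagestats.vertnet-portal.appspot.com"
    # Prepend the version only when one was given
    if [ -n "$version" ]; then
        host="${version}.${host}"
    fi
    printf 'http://%s/%s/%s\n' "$host" "$family" "$cmd"
}
```

For example, build_url prod admin/parser init prints the URL of the init command on the prod version.
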
Dataset information setup

This step is needed to ensure consistency and proper functioning. Report extraction relies on information about all datasets already being in the datastore.

Family: admin/setup

Command: datasets

Method: POST

Parameters (bold means mandatory): None

Example:

curl -i -X POST -d "" http://tools-usagestats.vertnet-portal.appspot.com/admin/setup/datasets

Repository name consistency check

This check is needed to ensure consistency and proper functioning. Resource repositories on GitHub must be properly and uniquely referenced in the Carto registry table (resource_staging). Otherwise, the application cannot reliably store the report or send a notification issue.

Family: admin/tools

Command: repo_checker

Method: GET or POST

Parameters (bold means mandatory): None

Example:

curl -i -X GET http://tools-usagestats.vertnet-portal.appspot.com/admin/tools/repo_checker

Result: If successful, a JSON-like message: {"result": "success"}. Otherwise, a list of all the mismatching repository/resource names.
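
A minimal sketch of how the result could be checked from a script. repo_check_ok is a hypothetical helper, and because the exact shape of the mismatch list is not specified here, the sketch assumes that any response without the success marker counts as a failure.

```shell
#!/bin/sh
# repo_check_ok: hypothetical helper; reads a repo_checker response body on stdin.
# Succeeds only if the body contains the success marker described above.
repo_check_ok() {
    # Strip whitespace so {"result": "success"} and {"result":"success"} both match
    tr -d ' \t\r\n' | grep -q '"result":"success"'
}

# Possible usage (performs a real request, so it is left commented out):
# curl -s -X GET http://tools-usagestats.vertnet-portal.appspot.com/admin/tools/repo_checker \
#     | repo_check_ok && echo "repositories consistent" || echo "mismatches found"
```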

Manual calculation of stats for a month

Family: admin/parser

Command: init

Method: POST

Parameters (bold means mandatory):

  • period: the period to process (i.e., the month to calculate) in the format 'YYYYMM'. For example, period=201811 for November 2018 usage statistics.
  • force: true/false, override existing data for the given period. Defaults to False.
  • testing: true/false, use the VertNet/statReports testing repository instead of the repositories of the publishers to store reports and send issues. Defaults to False.
  • github_store: true/false, store txt versions of the reports in GitHub repositories. Defaults to False.
  • github_issue: true/false, create new issues in GitHub to notify of report availability. Defaults to False.
  • table_name: the name of the Carto table to query in order to extract usage data. Defaults to query_log_master. Note: If a table_name other than the default is provided, no time constraints are applied to the content of the resulting reports. The named table must therefore contain all of the records for the reporting period and nothing else.
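
The parameters above are sent as a form-encoded body. The following sketch assembles such a body and checks the period format on the client side before anything is sent. valid_period and init_body are hypothetical helpers, and whether the service validates the period itself is not stated here.

```shell
#!/bin/sh
# valid_period: succeeds only for a six-digit YYYYMM string with month 01-12.
valid_period() {
    case "$1" in
        [0-9][0-9][0-9][0-9]0[1-9] | [0-9][0-9][0-9][0-9]1[0-2]) return 0 ;;
        *) return 1 ;;
    esac
}

# init_body: hypothetical helper printing the form body for admin/parser/init.
# Usage: init_body PERIOD [name=value ...]
init_body() {
    period="$1"
    shift
    if ! valid_period "$period"; then
        echo "period must be in YYYYMM format: $period" >&2
        return 1
    fi
    body="period=$period"
    # Append any optional parameters, e.g. force=true or testing=true
    for pair in "$@"; do
        body="$body&$pair"
    done
    printf '%s\n' "$body"
}
```

The output can be passed to curl with -d "$(init_body 201204 force=true)".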

Examples for April 2012 usage

Default usage: No override, no testing, no GitHub processing

curl -i -X POST -d "period=201204" http://tools-usagestats.vertnet-portal.appspot.com/admin/parser/init

Overriding an existing period

curl -i -X POST -d "period=201204&force=true" http://tools-usagestats.vertnet-portal.appspot.com/admin/parser/init

Storing on GitHub and sending notification issue

NOTE: This combination does not correctly pass the period on to the github_store step. Run init, github_store and github_issue separately instead.

curl -i -X POST -d "period=201204&github_store=true&github_issue=true" http://tools-usagestats.vertnet-portal.appspot.com/admin/parser/init
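
Running the three steps separately could look like the sketch below. print_separate_steps is a hypothetical helper that only prints the curl commands, one per step, so they can be run by hand. It assumes each step is started only after the previous one has finished; how to detect completion is not covered here.

```shell
#!/bin/sh
# print_separate_steps: hypothetical helper; prints (does not run) one curl
# command per step, each carrying the period explicitly.
# Usage: print_separate_steps PERIOD
print_separate_steps() {
    period="$1"
    base="http://tools-usagestats.vertnet-portal.appspot.com/admin/parser"
    for step in init github_store github_issue; do
        printf 'curl -i -X POST -d "period=%s" %s/%s\n' "$period" "$base" "$step"
    done
}

print_separate_steps 201204
```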

Using a custom table

curl -i -X POST -d "period=201204&table_name=query_log_201204" http://tools-usagestats.vertnet-portal.appspot.com/admin/parser/init

Restart after failure

Thanks to the modular structure of the application, the process can be restarted at almost any point. Each step has its own request handler, which can be called directly with the required parameters. In addition, long-running tasks such as storing all reports on GitHub run in a loop that processes batches of reports until the work is finished or App Engine raises a DeadlineExceededError. In the latter case, the task respawns and picks up the last (unfinished) batch of reports.

Scenario 1, failed in the middle of storing issues on GitHub

Let's assume you forgot to call the datasets setup process and launched the main task with an incomplete collection of dataset information. This is what would likely happen:

  • init would end successfully
  • get_events would end successfully, extracting the events to parse
  • process_events would end successfully, storing the reports in the datastore
  • github_store would eventually fail when trying to get information for a dataset that is not in the datastore. The process will halt here. NOTE: As of 2018-10-11, gbifdatasetids found in the logs but not in the datastore are skipped, and the 'stored' property of the corresponding Report entities is set to True. This avoids endless loops when an incorrect gbifdatasetid was stored in the logs and will never appear in the Dataset entity list.

At this point, you have all reports in the datastore, some of them with the stored property set to True and others (the ones after and including the one that failed) set to False, and all of them with the issue_sent property set to False (since this process did not execute yet). The correct way to proceed is as follows:

  1. Make a request to the datasets setup handler so that the missing data are uploaded to the datastore
  2. Optionally, remove Report Entities from the Datastore for Datasets that do not exist if those Datasets are in error (this happened for 456058db-f70b-4005-97ad-e08570cf0c56, which is an invalid gbifdatasetid that had been applied to some records in the portal index).
  3. Make a request to the github_store endpoint, passing these two parameters exactly as you wrote them for the init request:
    • period
    • github_issue

The task will restart from the github_store endpoint and will process only those reports with stored = False. After that, it will continue as usual depending on the value of the github_issue parameter.
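
The recovery steps for this scenario can be sketched as follows. print_restart_steps is a hypothetical helper that only prints the two requests (steps 1 and 3); the optional clean-up of Report entities in step 2 is a manual datastore operation and is not represented.

```shell
#!/bin/sh
# print_restart_steps: hypothetical helper; prints (does not run) the requests
# needed to recover from a github_store failure.
# Usage: print_restart_steps PERIOD GITHUB_ISSUE
#   Pass both values exactly as they were given in the original init request.
print_restart_steps() {
    period="$1"
    github_issue="$2"
    base="http://tools-usagestats.vertnet-portal.appspot.com"
    # Step 1: reload dataset information into the datastore
    printf 'curl -i -X POST -d "" %s/admin/setup/datasets\n' "$base"
    # Step 3: resume at github_store; only unstored reports are processed
    printf 'curl -i -X POST -d "period=%s&github_issue=%s" %s/admin/parser/github_store\n' \
        "$period" "$github_issue" "$base"
}

print_restart_steps 201204 true
```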

Manual storing of report files on GitHub

Family: admin/parser

Command: github_store

Method: POST

Parameters (bold means mandatory):

  • period: the period to process (i.e., the month to calculate) in the format 'YYYYMM'. For example, period=201811 for November 2018 usage statistics.
  • testing: true/false, use the VertNet/statReports testing repository instead of the repositories of the publishers to store reports and send issues. Defaults to False.
  • github_issue: true/false, create new issues in GitHub to notify of report availability. Defaults to False.
  • gbifdatasetid: process only the resource with the given gbifdatasetid.

Storing reports on GitHub with no issue generation

curl -i -X POST -d "period=201204" http://tools-usagestats.vertnet-portal.appspot.com/admin/parser/github_store
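
To store the report of a single resource, the gbifdatasetid parameter can be added to the request body. The sketch below assumes it is combined with period in the same form-encoded body as the other parameters. single_resource_body is a hypothetical helper, and the identifier in the usage comment is a placeholder, not a real dataset.

```shell
#!/bin/sh
# single_resource_body: hypothetical helper printing a form body that limits
# processing to one resource.
# Usage: single_resource_body PERIOD GBIFDATASETID
single_resource_body() {
    period="$1"
    gbifdatasetid="$2"
    printf 'period=%s&gbifdatasetid=%s\n' "$period" "$gbifdatasetid"
}

# Possible usage (placeholder identifier; performs a real request, so it is
# left commented out):
# curl -i -X POST \
#     -d "$(single_resource_body 201204 00000000-0000-0000-0000-000000000000)" \
#     http://tools-usagestats.vertnet-portal.appspot.com/admin/parser/github_store
```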

Manual creation of GitHub issues

Family: admin/parser

Command: github_issue

Method: POST

Parameters (bold means mandatory):

  • period: the period to process (i.e., the month to calculate) in the format 'YYYYMM'. For example, period=201811 for November 2018 usage statistics.
  • testing: true/false, use the VertNet/statReports testing repository instead of the repositories of the publishers to store reports and send issues. Defaults to False.
  • gbifdatasetid: process only the resource with the given gbifdatasetid.

Creating issues on GitHub for stored reports

curl -i -X POST -d "period=201204" http://tools-usagestats.vertnet-portal.appspot.com/admin/parser/github_issue