WIP: Enable duplicate detection via bag manifests #118

Open · wants to merge 16 commits into master

9 changes: 9 additions & 0 deletions reports/README.md
@@ -0,0 +1,9 @@
# Automation Tools Reports Module

A collection of reporting scripts that can be run independently of the
automation tools or in concert with them.

## Duplicates

The duplicates module identifies duplicate entries across all AIPs in your
AIP store. See its [README](duplicates/README.md) for details.
Empty file added reports/__init__.py
Empty file.
133 changes: 133 additions & 0 deletions reports/duplicates/README.md
@@ -0,0 +1,133 @@
# Duplicates

The duplicates module identifies duplicate entries across all AIPs in your
AIP store.

## Configuration

**Python**

The duplicates module has its own dependencies. To ensure it can run, please
install these first:

* `$ sudo pip install -r requirements.txt`

**Storage Service**

To configure your report, modify [config.json](config.json) with information
about how to connect to your Storage Service, e.g.
```json
{
"storage_service_url": "http://127.0.0.1:62081",
"storage_service_user": "test",
"storage_service_api_key": "test"
}
```
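
The same file also carries a handful of additional keys used by the accruals
script in this module (see [config.json](config.json) in this directory). A
sketch with illustrative placeholder values:
```json
{
    "docker": "true",
    "storage_service_url": "http://127.0.0.1:62081",
    "storage_service_user": "test",
    "storage_service_api_key": "test",
    "accruals_transfer_source": "accruals",
    "default_storage_space": "{storage-space-uuid}",
    "default_path": "{path-to-candidate-transfer-location}",
    "candidate_agent": "{agent-name}",
    "candidate_location": "{location-description}"
}
```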

## Running the script

Once configured, there are a number of ways to run the script:

* **From the duplicates directory:** `$ python duplicates.py`
* **From the report folder as a module:** `$ python -m duplicates.duplicates`
* **From the automation-tools folder as a module:** `$ python -m reports.duplicates.duplicates`

## Output

The tool has two outputs:

* `aipstore-duplicates.json`
* `aipstore-duplicates.csv`

A description of each follows:

* **JSON**: Reports the packages across which duplicates have been found and
lists the duplicate objects organized by checksum. This output may be useful
for developers creating other tooling around this work, e.g. visualizations,
as JSON is an easy-to-manipulate standard in most programming languages.

The JSON output is organised as follows:
```json
{
"manifest_data": {
"{matched-checksum-1}": [
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
}
],
"{matched-checksum-2}": [
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
}
]
},
"packages": {
"{package-uuid}": "{package-name}",
"{package-uuid}": "{package-name}"
}
}
```
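
As an illustration of how this output might be consumed (a minimal sketch,
assuming `aipstore-duplicates.json` is in the working directory and follows
the structure above):
```python
import json

# Load the duplicates report produced by the tool.
with open("aipstore-duplicates.json") as report:
    data = json.load(report)

# List every matched checksum and the packages it was found in.
for checksum, entries in data["manifest_data"].items():
    packages = sorted({entry["package_name"] for entry in entries})
    print("{}: {} copies in: {}".format(checksum, len(entries), ", ".join(packages)))
```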

* **CSV**: Reports the same information as a 2D representation. The CSV is
ready-made to be manipulated in tools such as
[OpenRefine](http://openrefine.org/). The width of the CSV varies dynamically,
as rows may have different numbers of duplicate files to report.
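
Because the row width varies, a plain `csv.reader` loop is a reasonable
starting point for scripted processing (a minimal sketch, assuming the file is
in the working directory; the columns carry the same fields as the JSON output
above):
```python
import csv

# Read the dynamically-sized CSV report row by row.
with open("aipstore-duplicates.csv") as report:
    for row in csv.reader(report):
        # Rows can have different lengths, so treat each row as a list.
        print(len(row), row)
```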

## Process followed

Much of the work done by this package relies on the
[amclient package](https://github.com/artefactual-labs/amclient). The process
used to create a report is as follows:

1. Retrieve a list of all AIPs across all pipelines.
2. For every AIP, download the bag manifest (all manifest permutations are
tested, so all duplicates are discovered whether you are using MD5, SHA1, or
SHA256 in your Archivematica instances).
3. For every entry in the bag manifest, record the checksum, package, and path.
4. Filter objects with matching checksums into a duplicates report (steps 3
and 4 are sketched below).
5. For every matched file in the duplicates report, download the package METS
file.
6. Using the METS file, augment the report with date_modified information.
(Other data might be added in future.)
7. Output the report as JSON to `aipstore-duplicates.json`.
8. Re-format the report as a 2D table and output it to `aipstore-duplicates.csv`.
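
A rough sketch of steps 3 and 4, grouping manifest entries by checksum and
keeping only checksums that occur more than once (the tuples below are
illustrative; the real report records the fields shown in the JSON output
above):
```python
from collections import defaultdict

# Step 3: record (checksum, package_uuid, filepath) for each manifest entry.
manifest_entries = [
    ("d41d8cd9...", "package-uuid-1", "data/objects/report.pdf"),
    ("d41d8cd9...", "package-uuid-2", "data/objects/report-copy.pdf"),
    ("9e107d9d...", "package-uuid-1", "data/objects/image.tif"),
]

# Step 4: group entries by checksum and keep only checksums seen more than once.
by_checksum = defaultdict(list)
for checksum, package_uuid, filepath in manifest_entries:
    by_checksum[checksum].append((package_uuid, filepath))

duplicates_report = {
    checksum: entries for checksum, entries in by_checksum.items() if len(entries) > 1
}
print(duplicates_report)
```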

## Future work

As a standalone module, the duplicates work could be developed in a number of
ways that might be desirable in an archival appraisal workflow.
Empty file added reports/duplicates/__init__.py
Empty file.
164 changes: 164 additions & 0 deletions reports/duplicates/accruals.py
@@ -0,0 +1,164 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Script to compare a source of new transfer material (accruals) with the
contents of an AIP store and output CSV files containing information about:

* True duplicates.
* Near duplicates.
* Non duplicates.

These files can be used as an input to generate new transfer material to be
transferred into Archivematica.
"""

from __future__ import print_function, unicode_literals

import copy
import logging
import os
import sys

try:
from .appconfig import AppConfig
from .digital_object import DigitalObject
from . import duplicates
from . import loggingconfig
from .serialize_to_csv import CSVOut
from . import utils
except (ValueError, ImportError):
from appconfig import AppConfig
from digital_object import DigitalObject
import duplicates
import loggingconfig
from serialize_to_csv import CSVOut
import utils

logging_dir = os.path.dirname(os.path.abspath(__file__))

logger = logging.getLogger("accruals")
logger.disabled = False

# Location purpose = Transfer Source (TS)
location_purpose = "TS"
default_location = AppConfig().accruals_transfer_source

# If Docker we need to work with paths differently...
DOCKER = AppConfig().docker

# Store our appraisal paths.
accrual_paths = []


def create_manifest(aip_index, accrual_objs):
    """Compare accrual objects against the AIP store index and sort them
    into true duplicates, near matches, and non-matches.
    """
dupes = []
near_matches = []
non_matches = []
aip_obj_hashes = aip_index.get(duplicates.MANIFEST_DATA)
for accrual_obj in accrual_objs:
for accrual_hash in accrual_obj.hashes:
if accrual_hash in aip_obj_hashes.keys():
for _, aip_items in aip_obj_hashes.items():
for aip_item in aip_items:
if accrual_obj == aip_item:
accrual_obj.flag = True
cp = copy.copy(accrual_obj)
cp.package_name = aip_item.package_name
dupes.append(cp)
else:
diff = accrual_obj % aip_item
                            if (
                                diff == "No matching components"
                                or "checksum match" not in diff
                            ):
                                # No matching components, or the match is not
                                # a checksum match; don't report this pair.
                                continue
accrual_obj.flag = True
cp1 = copy.copy(accrual_obj)
cp2 = copy.copy(aip_item)
near_matches.append([cp1, cp2])
# Only need one hash to match then break.
# May also be redundant as we only have one hash from the
# bag manifests...
break
for accrual_obj in accrual_objs:
if accrual_obj.flag is False:
cp = copy.copy(accrual_obj)
if cp not in non_matches:
non_matches.append(cp)
return dupes, near_matches, non_matches


def create_comparison_obj(transfer_path):
    """Walk a transfer path and return a list of DigitalObject instances,
    one per file, for comparison with the AIP store index.
    """
transfer_arr = []
for root, _, files in os.walk(transfer_path, topdown=True):
for name in files:
file_ = os.path.join(root, name)
if os.path.isfile(file_):
transfer_arr.append(DigitalObject(file_, transfer_path))
return transfer_arr


def stat_transfers(accruals_path, all_transfers):
"""Retrieve all transfer paths and make a request to generate statistics
about all the objects in that transfer path.
"""
aip_index = duplicates.retrieve_aip_index()
dupe_reports = []
near_reports = []
no_match_reports = []
transfers = []
for transfer in all_transfers:
transfer_home = os.path.join(accruals_path, transfer)
if DOCKER:
transfer_home = utils.get_docker_path(transfer_home)
objs = create_comparison_obj(transfer_home)
transfers.append(objs)
match_manifest, near_manifest, no_match_manifest = create_manifest(
aip_index, objs
)
if match_manifest:
dupe_reports.append({transfer: match_manifest})
if near_manifest:
near_reports.append({transfer: near_manifest})
if no_match_manifest:
no_match_reports.append({transfer: no_match_manifest})
CSVOut.output_reports(
aip_index, transfers, dupe_reports, near_reports, no_match_reports
)


def main(location=default_location):
"""Primary entry point for this script."""
am = AppConfig().get_am_client()
sources = am.list_storage_locations()
accruals = False
for source in sources.get("objects"):
if (
source.get("purpose") == location_purpose
and source.get("description") == location
        ):
            # Use this location as the accruals transfer source.
am.transfer_source = source.get("uuid")
am.transfer_path = source.get("path")
accruals = True
if not accruals:
logger.info("Exiting. No transfer source: %s", location)
sys.exit()
# All transfer directories. Assumption is the same as Archivematica that
# each transfer is organized into a single directory at this level.
all_transfers = am.transferables().get("directories")
stat_transfers(am.transfer_path, all_transfers)


if __name__ == "__main__":
    loggingconfig.setup("INFO", os.path.join(logging_dir, "report.log"))
    transfer_source = default_location
    try:
        # Allow the transfer source location to be overridden from the
        # command line, falling back to the configured default.
        transfer_source = sys.argv[1:][0]
        logger.info("Attempting to find transfers at: %s", transfer_source)
    except IndexError:
        pass
    sys.exit(main(transfer_source))
48 changes: 48 additions & 0 deletions reports/duplicates/appconfig.py
@@ -0,0 +1,48 @@
# -*- coding: utf-8 -*-

"""Class to help bring-together application configuration for the
de-duplication work.
"""


import json
import os

from amclient import AMClient


class AppConfig:
"""Application configuration class."""

def __init__(self):
"""Initialize class."""
config_file = os.path.join(os.path.dirname(__file__), "config.json")
self._load_config(config_file)

def _load_config(self, config_file):
"""Load our configuration information."""
with open(config_file) as json_config:
conf = json.load(json_config)

        self.docker = conf.get("docker", "").lower() == "true"

self.storage_service_user = conf.get("storage_service_user")
self.storage_service_api_key = conf.get("storage_service_api_key")
self.storage_service_url = conf.get("storage_service_url")
self.accruals_transfer_source = conf.get("accruals_transfer_source")

# Space to configure a new location in.
self.default_space = conf.get("default_storage_space")
self.default_path = conf.get("default_path")

# Information about the candidate transfer.
self.candidate_agent = conf.get("candidate_agent")
self.candidate_location = conf.get("candidate_location")

def get_am_client(self):
"""Return an Archivematica API client to the caller."""
am = AMClient()
am.ss_url = self.storage_service_url
am.ss_user_name = self.storage_service_user
am.ss_api_key = self.storage_service_api_key
return am
11 changes: 11 additions & 0 deletions reports/duplicates/config.json
@@ -0,0 +1,11 @@
{
"docker": "true",
"candidate_agent": "IISH",
"storage_service_url": "http://127.0.0.1:62081",
"storage_service_user": "test",
"storage_service_api_key": "test",
"accruals_transfer_source": "accruals",
"default_storage_space": "b57c0e2c-606a-47d4-a612-884444d9dda1",
"default_path": "/home/ross-spencer/.am/ss-location-data/candidate-transfers",
"candidate_location": "Automated candidate transfers"
}