WIP: Enable duplicate detection via bag manifests #118

Open · wants to merge 16 commits into master

9 changes: 9 additions & 0 deletions reports/README.md
@@ -0,0 +1,9 @@
# Automation Tools Reports Module

A collection of reporting scripts that can be run independently of the
automation tools or in concert with them.

## Duplicates

The duplicates module identifies duplicate entries across all AIPs in your
AIP store. See its [README](duplicates/README.md) for details.
Empty file added reports/__init__.py
Empty file.
133 changes: 133 additions & 0 deletions reports/duplicates/README.md
@@ -0,0 +1,133 @@
# Duplicates

The duplicates module identifies duplicate entries across all AIPs in your
AIP store.

## Configuration

**Python**

The duplicates module has its own dependencies. To ensure it can run, please
install these first:

* `$ sudo pip install -r requirements.txt`

**Storage Service**

To configure your report, modify [config.json](config.json) with information
about how to connect to your Storage Service, e.g.
```json
{
"storage_service_url": "http://127.0.0.1:62081",
"storage_service_user": "test",
"storage_service_api_key": "test"
}
```
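
The same file also carries a handful of additional keys used by the accruals
script in this module (see [config.json](config.json) in this directory). A
sketch with illustrative placeholder values:
```json
{
    "docker": "true",
    "storage_service_url": "http://127.0.0.1:62081",
    "storage_service_user": "test",
    "storage_service_api_key": "test",
    "accruals_transfer_source": "accruals",
    "default_storage_space": "{storage-space-uuid}",
    "default_path": "{path-to-candidate-transfer-location}",
    "candidate_agent": "{agent-name}",
    "candidate_location": "{location-description}"
}
```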

## Running the script

Once configured, there are a number of ways to run the script:

* **From the duplicates directory:** `$ python duplicates.py`
* **From the report folder as a module:** `$ python -m duplicates.duplicates`
* **From the automation-tools folder as a module:** `$ python -m reports.duplicates.duplicates`

## Output

The tool has two outputs:

* `aipstore-duplicates.json`
* `aipstore-duplicates.csv`

A description of each follows:

* **JSON**: Reports the packages across which duplicates have been found and
lists the duplicate objects organized by checksum. This output may be useful
for developers creating other tooling around this work, e.g. visualizations,
as JSON is an easy-to-manipulate standard in most programming languages.

The JSON output is organised as follows:
```json
{
"manifest_data": {
"{matched-checksum-1}": [
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
}
],
"{matched-checksum-2}": [
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
},
{
"basename": "{filename}",
"date_modified": "{modified-date}",
"dirname": "{directory-name}",
"filepath": "{relative-path}",
"package_name": "{package-name}",
"package_uuid": "{package-uuid}"
}
]
},
"packages": {
"{package-uuid}": "{package-name}",
"{package-uuid}": "{package-name}"
}
}
```
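
As an illustration of how this output might be consumed (a minimal sketch,
assuming `aipstore-duplicates.json` is in the working directory and follows
the structure above):
```python
import json

# Load the duplicates report produced by the tool.
with open("aipstore-duplicates.json") as report:
    data = json.load(report)

# List every matched checksum and the packages it was found in.
for checksum, entries in data["manifest_data"].items():
    packages = sorted({entry["package_name"] for entry in entries})
    print("{}: {} copies in: {}".format(checksum, len(entries), ", ".join(packages)))
```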

* **CSV**: Reports the same information as a 2D representation. The CSV is
ready-made to be manipulated in tools such as
[OpenRefine](http://openrefine.org/). The width of the CSV varies dynamically,
as rows may have different numbers of duplicate files to report.
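
Because the row width varies, a plain `csv.reader` loop is a reasonable
starting point for scripted processing (a minimal sketch, assuming the file is
in the working directory; the columns carry the same fields as the JSON output
above):
```python
import csv

# Read the dynamically-sized CSV report row by row.
with open("aipstore-duplicates.csv") as report:
    for row in csv.reader(report):
        # Rows can have different lengths, so treat each row as a list.
        print(len(row), row)
```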

## Process followed

Much of the work done by this package relies on the
[amclient package](https://github.com/artefactual-labs/amclient). The process
used to create a report is as follows:

1. Retrieve a list of all AIPs across all pipelines.
2. For every AIP, download the bag manifest (all manifest permutations are
tested, so all duplicates are discovered whether you are using MD5, SHA1, or
SHA256 in your Archivematica instances).
3. For every entry in the bag manifest, record the checksum, package, and path.
4. Filter objects with matching checksums into a duplicates report (steps 3
and 4 are sketched below).
5. For every matched file in the duplicates report, download the package METS
file.
6. Using the METS file, augment the report with date_modified information.
(Other data might be added in future.)
7. Output the report as JSON to `aipstore-duplicates.json`.
8. Re-format the report as a 2D table and output it to `aipstore-duplicates.csv`.
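
A rough sketch of steps 3 and 4, grouping manifest entries by checksum and
keeping only checksums that occur more than once (the tuples below are
illustrative; the real report records the fields shown in the JSON output
above):
```python
from collections import defaultdict

# Step 3: record (checksum, package_uuid, filepath) for each manifest entry.
manifest_entries = [
    ("d41d8cd9...", "package-uuid-1", "data/objects/report.pdf"),
    ("d41d8cd9...", "package-uuid-2", "data/objects/report-copy.pdf"),
    ("9e107d9d...", "package-uuid-1", "data/objects/image.tif"),
]

# Step 4: group entries by checksum and keep only checksums seen more than once.
by_checksum = defaultdict(list)
for checksum, package_uuid, filepath in manifest_entries:
    by_checksum[checksum].append((package_uuid, filepath))

duplicates_report = {
    checksum: entries for checksum, entries in by_checksum.items() if len(entries) > 1
}
print(duplicates_report)
```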

## Future work

As a standalone module, the duplicates work could be developed in a number of
ways that might be desirable in an archival appraisal workflow.
Empty file added reports/duplicates/__init__.py
Empty file.
164 changes: 164 additions & 0 deletions reports/duplicates/accruals.py
@@ -0,0 +1,164 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Script to compare a source of new transfer material (accruals) with the
contents of an AIP store and output CSV files containing information about:

* True duplicates.
* Near duplicates.
* Non duplicates.

These files can be used as an input to generate new transfer material to be
transferred into Archivematica.
"""

from __future__ import print_function, unicode_literals

import copy
import logging
import os
import sys

try:
from .appconfig import AppConfig
from .digital_object import DigitalObject
from . import duplicates
from . import loggingconfig
from .serialize_to_csv import CSVOut
from . import utils
except (ValueError, ImportError):
from appconfig import AppConfig
from digital_object import DigitalObject
import duplicates
import loggingconfig
from serialize_to_csv import CSVOut
import utils

logging_dir = os.path.dirname(os.path.abspath(__file__))

logger = logging.getLogger("accruals")
logger.disabled = False

# Location purpose = Transfer Source (TS)
location_purpose = "TS"
default_location = AppConfig().accruals_transfer_source

# If Docker we need to work with paths differently...
DOCKER = AppConfig().docker

# Store our appraisal paths.
accrual_paths = []


def create_manifest(aip_index, accrual_objs):
    """Compare accrual objects against the AIP store index and sort them
    into true duplicates, near matches, and non-matches.
    """
dupes = []
near_matches = []
non_matches = []
aip_obj_hashes = aip_index.get(duplicates.MANIFEST_DATA)
for accrual_obj in accrual_objs:
for accrual_hash in accrual_obj.hashes:
if accrual_hash in aip_obj_hashes.keys():
for _, aip_items in aip_obj_hashes.items():
for aip_item in aip_items:
if accrual_obj == aip_item:
accrual_obj.flag = True
cp = copy.copy(accrual_obj)
cp.package_name = aip_item.package_name
dupes.append(cp)
else:
diff = accrual_obj % aip_item
                            if (
                                diff == "No matching components"
                                or "checksum match" not in diff
                            ):
                                # No matching components, or the match is not
                                # a checksum match; don't report this pair.
                                continue
accrual_obj.flag = True
cp1 = copy.copy(accrual_obj)
cp2 = copy.copy(aip_item)
near_matches.append([cp1, cp2])
# Only need one hash to match then break.
# May also be redundant as we only have one hash from the
# bag manifests...
break
for accrual_obj in accrual_objs:
if accrual_obj.flag is False:
cp = copy.copy(accrual_obj)
if cp not in non_matches:
non_matches.append(cp)
return dupes, near_matches, non_matches


def create_comparison_obj(transfer_path):
    """Walk a transfer path and return a list of DigitalObject instances,
    one per file, for comparison with the AIP store index.
    """
transfer_arr = []
for root, _, files in os.walk(transfer_path, topdown=True):
for name in files:
file_ = os.path.join(root, name)
if os.path.isfile(file_):
transfer_arr.append(DigitalObject(file_, transfer_path))
return transfer_arr


def stat_transfers(accruals_path, all_transfers):
"""Retrieve all transfer paths and make a request to generate statistics
about all the objects in that transfer path.
"""
aip_index = duplicates.retrieve_aip_index()
dupe_reports = []
near_reports = []
no_match_reports = []
transfers = []
for transfer in all_transfers:
transfer_home = os.path.join(accruals_path, transfer)
if DOCKER:
transfer_home = utils.get_docker_path(transfer_home)
objs = create_comparison_obj(transfer_home)
transfers.append(objs)
match_manifest, near_manifest, no_match_manifest = create_manifest(
aip_index, objs
)
if match_manifest:
dupe_reports.append({transfer: match_manifest})
if near_manifest:
near_reports.append({transfer: near_manifest})
if no_match_manifest:
no_match_reports.append({transfer: no_match_manifest})
CSVOut.output_reports(
aip_index, transfers, dupe_reports, near_reports, no_match_reports
)


def main(location=default_location):
"""Primary entry point for this script."""
am = AppConfig().get_am_client()
sources = am.list_storage_locations()
accruals = False
for source in sources.get("objects"):
if (
source.get("purpose") == location_purpose
and source.get("description") == location
        ):
            # Use this location as the accruals transfer source.
am.transfer_source = source.get("uuid")
am.transfer_path = source.get("path")
accruals = True
if not accruals:
logger.info("Exiting. No transfer source: %s", location)
sys.exit()
# All transfer directories. Assumption is the same as Archivematica that
# each transfer is organized into a single directory at this level.
all_transfers = am.transferables().get("directories")
stat_transfers(am.transfer_path, all_transfers)


if __name__ == "__main__":
    loggingconfig.setup("INFO", os.path.join(logging_dir, "report.log"))
    transfer_source = default_location
    try:
        # Allow the transfer source location to be overridden from the
        # command line, falling back to the configured default.
        transfer_source = sys.argv[1:][0]
        logger.info("Attempting to find transfers at: %s", transfer_source)
    except IndexError:
        pass
    sys.exit(main(transfer_source))
48 changes: 48 additions & 0 deletions reports/duplicates/appconfig.py
@@ -0,0 +1,48 @@
# -*- coding: utf-8 -*-

"""Class to help bring-together application configuration for the
de-duplication work.
"""


import json
import os

from amclient import AMClient


class AppConfig:
"""Application configuration class."""

def __init__(self):
"""Initialize class."""
config_file = os.path.join(os.path.dirname(__file__), "config.json")
self._load_config(config_file)

def _load_config(self, config_file):
"""Load our configuration information."""
with open(config_file) as json_config:
conf = json.load(json_config)

        self.docker = conf.get("docker", "").lower() == "true"

self.storage_service_user = conf.get("storage_service_user")
self.storage_service_api_key = conf.get("storage_service_api_key")
self.storage_service_url = conf.get("storage_service_url")
self.accruals_transfer_source = conf.get("accruals_transfer_source")

# Space to configure a new location in.
self.default_space = conf.get("default_storage_space")
self.default_path = conf.get("default_path")

# Information about the candidate transfer.
self.candidate_agent = conf.get("candidate_agent")
self.candidate_location = conf.get("candidate_location")

def get_am_client(self):
"""Return an Archivematica API client to the caller."""
am = AMClient()
am.ss_url = self.storage_service_url
am.ss_user_name = self.storage_service_user
am.ss_api_key = self.storage_service_api_key
return am
11 changes: 11 additions & 0 deletions reports/duplicates/config.json
@@ -0,0 +1,11 @@
{
"docker": "true",
"candidate_agent": "IISH",
"storage_service_url": "http://127.0.0.1:62081",
"storage_service_user": "test",
"storage_service_api_key": "test",
"accruals_transfer_source": "accruals",
"default_storage_space": "b57c0e2c-606a-47d4-a612-884444d9dda1",
"default_path": "/home/ross-spencer/.am/ss-location-data/candidate-transfers",
"candidate_location": "Automated candidate transfers"
}