WIP: Enable duplicate detection via bag manifests #118

Open
ross-spencer wants to merge 16 commits into master from dev/issue-448-add-duplicate-reporting-mechanism
Conversation

@ross-spencer (Contributor) commented Jun 26, 2019

Compare an accruals location to an AIP store

This commit introduces an accruals->aips comparison capability.

Digital objects in an accruals folder can now be compared to the
contents of an AIP store.

Where file paths, checksums, and dates all match, the object is
considered to be identical (a true duplicate). Where they don't,
users can identify which components differ and so where the object
isn't in fact identical (see the sketch below).

Much of the benefit of this work is derived from the nature of the
AIP structure imposed on a digital transfer.

Once the comparison is complete, three reports are output in CSV
format:

  • True-duplicates.
  • Near-duplicates (checksums match, but other components might not).
  • Non-duplicates.

Additionally, a summary report is output in JSON.
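A minimal sketch of the comparison described above, assuming each object is represented by its relative path, checksum, and date; the record fields and function below are illustrative only, not the module's actual API:

```python
# Illustrative sketch only: the dataclass fields and the classify()
# helper are assumptions for demonstration, not the duplicates module's
# actual code.
from dataclasses import dataclass


@dataclass
class ObjectRecord:
    path: str      # path relative to the transfer/AIP data directory
    checksum: str  # checksum taken from the bag manifest
    date: str      # date recorded for the object


def classify(accrual: ObjectRecord, aip: ObjectRecord) -> str:
    """Compare an accrual object against an object in the AIP store."""
    if accrual.checksum != aip.checksum:
        return "non-duplicate"
    if accrual.path == aip.path and accrual.date == aip.date:
        # Every component matches: a true duplicate.
        return "true-duplicate"
    # Checksums match, but path and/or date differ.
    return "near-duplicate"
```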

Connected to archivematica/Issues#448

Configuration

API configuration and the transfer source location are set via this configuration file. Note that the "accruals_transfer_source" parameter describes a transfer source in the Storage Service with the description 'accruals', but it could equally be any other value more appropriate to your institution.
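As an illustration only, the configuration might look something like the values below; apart from "accruals_transfer_source", the key names and file name here are assumptions, not the module's documented schema:

```python
# Hypothetical configuration values for illustration; only
# "accruals_transfer_source" is named in the description above. The
# remaining keys and the "config.json" file name are assumptions.
import json

example_config = {
    "storage_service_url": "http://127.0.0.1:62081",
    "storage_service_user": "test",
    "storage_service_api_key": "test-api-key",
    # Matches the Description of a transfer source location in the
    # Storage Service; "accruals" is the value described above.
    "accruals_transfer_source": "accruals",
}

with open("config.json", "w") as config_file:
    json.dump(example_config, config_file, indent=4)
```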

The primary script will also accept a value for this transfer source on the command line, e.g.

  • python3 -m duplicates.accruals <my_transfer_source_description>
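A rough sketch of how such a command-line override could be wired up; the argument handling and config loading here are illustrative, and may differ from the actual entry point in duplicates/accruals.py:

```python
# Illustrative only: a transfer source description supplied on the
# command line takes precedence over the configured value.
import json
import sys


def get_transfer_source(config_path="config.json"):
    """Return the transfer source description to compare against."""
    with open(config_path) as config_file:
        config = json.load(config_file)
    # A value given on the command line overrides the configuration file.
    if len(sys.argv) > 1:
        return sys.argv[1]
    return config.get("accruals_transfer_source", "accruals")


if __name__ == "__main__":
    print("Using transfer source description:", get_transfer_source())
```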

With everything configured correctly, successful output on the command line may look as follows:

$ python3 -m duplicates.accruals
INFO      2019-07-08 17:11:16 duplicates.py:171  No result for algorithm: md5
INFO      2019-07-08 17:11:16 duplicates.py:171  No result for algorithm: sha1
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/METS.8a8e1cc5-82ec-491a-8bda-cc7d0223553f.xml
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/README.html
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/logs/fileFormatIdentification.log
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/logs/filenameCleanup.log
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/logs/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/logs/fileFormatIdentification.log
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/logs/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/logs/filenameCleanup.log
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/objects/metadata/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/directory_tree.txt
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/objects/submissionDocumentation/transfer-1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/METS.xml
INFO      2019-07-08 17:11:17 duplicates.py:171  No result for algorithm: sha512
INFO      2019-07-08 17:11:17 serialize_to_csv.py:41   Number of files in '1' AIPs in the AIP store: 4
INFO      2019-07-08 17:11:17 serialize_to_csv.py:44   Number of transfers: 3
INFO      2019-07-08 17:11:17 serialize_to_csv.py:47   Number of items in transfer 1: 4
INFO      2019-07-08 17:11:17 serialize_to_csv.py:47   Number of items in transfer 2: 5
INFO      2019-07-08 17:11:17 serialize_to_csv.py:47   Number of items in transfer 3: 2
{
    "count_of_files_across_aips": 4,
    "files_in_transfer-1": 4,
    "files_in_transfer-2": 5,
    "files_in_transfer-3": 2,
    "number_of_aips": 1,
    "numer_of_transfers": 3
}
ERROR     2019-07-08 17:11:17 serialize_to_csv.py:84   Outputting report to: true_duplicates_comparison.csv
ERROR     2019-07-08 17:11:17 serialize_to_csv.py:118  Outputting report to: near_matches_comparison.csv
ERROR     2019-07-08 17:11:17 serialize_to_csv.py:141  Outputting report to: non_matches_list.csv

The resulting CSV files can then be used to compile a list of files specifically selected for transfer into Archivematica.
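For example, a sketch of how the true-duplicates report might be post-processed to build such a list; the "accrual_path" column name is an assumption about the CSV layout, not its documented schema:

```python
# Illustrative sketch: filter accrual paths that are already preserved
# in the AIP store out of a candidate transfer list. The "accrual_path"
# column name is an assumption about the report layout.
import csv


def paths_already_preserved(report="true_duplicates_comparison.csv"):
    """Collect the accrual paths reported as true duplicates."""
    with open(report, newline="") as csv_file:
        return {row["accrual_path"] for row in csv.DictReader(csv_file)}


def compile_transfer_list(candidate_paths, report="true_duplicates_comparison.csv"):
    """Keep only the candidates that are not already in the AIP store."""
    duplicates = paths_already_preserved(report)
    return [path for path in candidate_paths if path not in duplicates]
```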

ross-spencer and others added 2 commits June 26, 2019 18:17

  • This commit enables duplicate detection via bag manifests in the AIP store: AIP comparison to other AIPs.
  • Begin to pull parts of the code out that need to be more generic. In this commit we're starting to test other AIP compression types.
@ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch 2 times, most recently from 25436fc to aaca9fa on July 3, 2019 15:18
@ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch 4 times, most recently from e3459b2 to aa8c4c3 on July 8, 2019 15:09
@ross-spencer force-pushed the dev/issue-448-add-duplicate-reporting-mechanism branch from aa8c4c3 to cab6f33 on July 8, 2019 15:11
@ross-spencer self-assigned this Jul 8, 2019