title | author | date | output |
---|---|---|---|
Document dimension preprocessing summary | Helsinki Computational History Group (COMHIS) | 2020-04-14 | markdown_document |

- Some dimension info is provided in the original raw data for altogether 471076 documents (97.9%), but it could not be interpreted for 6003 of them (i.e. dimension info was successfully estimated for 98.7% of the documents where this field was not empty).
- Document size (area) info was obtained in the final preprocessed data for altogether 466698 documents (97%). For the remaining documents, critical dimension information was not available or could not be interpreted; see the list of entries where document surface area could not be estimated.
- Document gatherings info is originally available for 464163 documents (96%), and this is extended by estimation to 465073 documents (97%) in the final preprocessed data.
- Document height info is originally available for 4649 documents (1%), and this is extended by estimation to 466698 documents (97%) in the final preprocessed data.
- Document width info is originally available for 0 documents (0%), and this is extended by estimation to 466698 documents (97%) in the final preprocessed data.

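
The 98.7% success rate quoted above follows directly from the two counts in the first item. The short check below reproduces that arithmetic; the variable names are illustrative only and are not taken from the preprocessing code.

```python
# Counts quoted in the summary above.
with_dimension_info = 471076  # documents whose raw dimension field is not empty
not_interpreted = 6003        # non-empty fields that could not be interpreted

# Documents for which the non-empty field was successfully interpreted.
interpreted = with_dimension_info - not_interpreted

# Share of non-empty dimension fields that were interpreted, in percent.
success_rate = 100 * interpreted / with_dimension_info

print(interpreted)             # 465073
print(round(success_rate, 1))  # 98.7
```

The resulting count of 465073 coincides with the number of documents reported above as having gatherings info in the final preprocessed data.
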
These tables can be used to verify the accuracy of the conversions from the raw data to final estimates:

The estimated dimensions are based on the following auxiliary information sheets:

- Document dimension estimates (used when information is partially missing)
- Discarded entries (curated): these entries have been curated and confirmed to contain no interpretable dimension information. They are discarded before any other processing.
- Discarded entries (non-curated): these entries have not been curated, and they could not be interpreted for dimension information.

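
As a rough illustration of how these sheets could fit together, the sketch below first drops discarded entries and then fills in missing width or height from a gatherings lookup table before computing the surface area. This is a minimal sketch under stated assumptions, not the project's implementation: the function name, the gatherings codes, the discard strings and all numeric values are hypothetical placeholders and are not taken from the actual sheets.

```python
from typing import Optional

# Hypothetical placeholder estimates in cm; NOT the values from the real sheet.
ESTIMATES = {
    "2fo": {"width": 30.0, "height": 45.0},
    "4to": {"width": 22.0, "height": 28.0},
    "8vo": {"width": 13.0, "height": 19.0},
}

# Hypothetical raw strings assumed to carry no dimension information.
DISCARDED = {"illus.", "maps"}


def estimate_dimensions(raw: str, gatherings: Optional[str],
                        width: Optional[float],
                        height: Optional[float]) -> Optional[dict]:
    """Return width, height and area, or None if they cannot be estimated."""
    # Step 1: discarded entries are removed before any other processing.
    if raw.strip().lower() in DISCARDED:
        return None

    # Step 2: fill whichever of width/height is missing from the lookup table.
    est = ESTIMATES.get(gatherings) if gatherings else None
    if width is None and est is not None:
        width = est["width"]
    if height is None and est is not None:
        height = est["height"]

    # Step 3: without both dimensions the surface area cannot be computed.
    if width is None or height is None:
        return None
    return {"gatherings": gatherings, "width": width,
            "height": height, "area": width * height}


only_gatherings = estimate_dimensions("8vo", "8vo", None, None)
with_height = estimate_dimensions("4to 25 cm", "4to", None, 25.0)
discarded = estimate_dimensions("illus.", None, None, None)
```

With these placeholder values, an entry that has only gatherings info receives both dimensions from the table, an entry with an original height keeps that height and only has its width filled in, and a discarded entry yields no estimate at all.
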
Left: final gatherings vs. final document dimension (width × height). Right: original gatherings vs. original heights where both are available. The point size indicates the number of documents for each case. The red dots indicate the estimated height that is used when only gathering information is available.

Left: document dimension histogram (surface area). Right: title count per gatherings.