---
title: "Document dimension preprocessing summary"
author: "Helsinki Computational History Group (COMHIS)"
date: "`r Sys.Date()`"
output: markdown_document
---
```{r init, echo=FALSE}
ntop <- 20
#opts_chunk$set(comment=NA, fig.width=6, fig.height=6)
opts_chunk$set(fig.path = paste0(output.folder, "figure/"))
theme_set(theme_bw(20))
```
## Document size comparisons
* Dimension information is provided in the original raw data for `r sum(!is.na(df.orig$value))` documents (`r round(100*mean(!is.na(df.orig$value)),1)`%) but could not be interpreted for `r sum(!is.na(df.orig$value) & (is.na(df$gatherings)))` of them (i.e. dimension information was successfully estimated for `r round(100 - 100 * sum(!is.na(df.orig$value) & (is.na(df$gatherings)))/sum(!is.na(df.orig$value)), 1)`% of the documents where this field was not empty).
* Document size (area) information was obtained in the final preprocessed data for `r sum(!is.na(df$area))` documents (`r round(100*mean(!is.na(df$area)))`%). For the remaining documents, critical dimension information was not available or could not be interpreted: [List of entries where document surface area could not be estimated](output.tables/physical_dimension_incomplete.csv)
* Document gatherings information is available in the original data for `r sum(!is.na(df$gatherings.original))` documents (`r round(100*mean(!is.na(df$gatherings.original)))`%), and after estimation for `r sum(!is.na(df$gatherings))` documents (`r round(100*mean(!is.na(df$gatherings)))`%) in the final preprocessed data.
* Document height information is available in the original data for `r sum(!is.na(df$height.original))` documents (`r round(100*mean(!is.na(df$height.original)))`%), and after estimation for `r sum(!is.na(df$height))` documents (`r round(100*mean(!is.na(df$height)))`%) in the final preprocessed data.
* Document width information is available in the original data for `r sum(!is.na(df$width.original))` documents (`r round(100*mean(!is.na(df$width.original)))`%), and after estimation for `r sum(!is.na(df$width))` documents (`r round(100*mean(!is.na(df$width)))`%) in the final preprocessed data.
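
The counts and percentages in the list above all follow the same pattern: count the non-missing entries of a column and express them as a share of all documents. The following is a minimal sketch on toy data. The data frame `toy` and the helper `coverage()` are hypothetical and are not part of the preprocessing pipeline.

```r
# Toy data with made-up values; NA marks a missing field
toy <- data.frame(
  height.original = c(20, NA, 15, NA, NA),
  height          = c(20, 18, 15, NA, 18)
)

# Hypothetical helper: number and percentage of non-missing entries
coverage <- function(x) {
  c(n = sum(!is.na(x)), percent = round(100 * mean(!is.na(x))))
}

coverage(toy$height.original)  # availability in the original field
coverage(toy$height)           # availability after estimation
```

Because `mean()` of a logical vector is the fraction of `TRUE` values, the percentage is always relative to all rows, including those where the field is empty.
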
These tables can be used to verify the accuracy of the conversions from the raw data to final estimates:
* [Dimension conversions from raw data to final estimates](output.tables/conversions_physical_dimension.csv)
* [Automated tests for dimension conversions](https://github.com/COMHIS/bibliographica/blob/master/inst/extdata/tests_dimension_polish.csv)
The estimated dimensions are based on the following auxiliary information sheets:
* [Document dimension abbreviations](https://github.com/COMHIS/bibliographica/blob/master/inst/extdata/document_size_abbreviations.csv)
* [Standard sheet size estimates](https://github.com/COMHIS/bibliographica/blob/master/inst/extdata/sheetsizes.csv)
* [Document dimension estimates](https://github.com/COMHIS/bibliographica/blob/master/inst/extdata/documentdimensions.csv) (used when information is partially missing)
* [Discarded entries (curated)](rejected_entries_curated.csv); these entries have been curated, and confirmed to contain no interpretable dimension information. These are discarded before other processing.
* [Discarded entries (non-curated)](rejected_entries_noncurated.csv); these entries have not been curated, and they could not be interpreted for dimension information.
Left: final gatherings vs. final document dimension (width x height). Right: original gatherings versus original heights where both are available. The point size indicates the number of documents for each case. The red dots indicate the estimated height that is used when only gathering information is available.
```{r summary, echo=FALSE, message=FALSE, warning=FALSE, fig.width=9, fig.height=7, fig.show="hold", out.width="420px"}
df <- df.preprocessed
dfs <- df %>% filter(!is.na(area) & !is.na(gatherings))
dfs <- dfs[, c("gatherings", "area")]
dfm <- melt(table(dfs)) # TODO switch to gather here
names(dfm) <- c("gatherings", "area", "documents")
dfm$gatherings <- factor(dfm$gatherings, levels = levels(df$gatherings))
p <- ggplot(dfm, aes(x = gatherings, y = area))
p <- p + scale_y_continuous(trans = "log2")
p <- p + geom_point(aes(size = documents))
p <- p + scale_size(trans="log10")
p <- p + ggtitle("Gatherings vs. area")
p <- p + xlab("Size (gatherings)")
p <- p + ylab("Size (area)")
p <- p + coord_flip()
print(p)
# Compare given dimensions to gatherings
# (not so much data with width so skip that)
df2 <- filter(df, !is.na(height) | !is.na(width))
df2 <- df2[!is.na(as.character(df2$gatherings)),]
df3 <- filter(df2, !is.na(height))
ss <- sheet_sizes()
df3$gathering.height.estimate <- ss[match(df3$gatherings, ss$gatherings),"height"]
df4 <- df3 %>% group_by(gatherings, height) %>% tally()
p <- ggplot(df4, aes(y = gatherings, x = height))
p <- p + geom_point(aes(size = n))
p <- p + geom_point(data = unique(df3), aes(y = gatherings, x = gathering.height.estimate), color = "red")
p <- p + ylab("Gatherings (original)") + xlab("Height (original)")
p <- p + ggtitle("Gatherings vs. height")
print(p)
```
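
The red dots in the gatherings vs. height figure come from a table lookup: each document's gatherings value is matched against a sheet size table and the corresponding height is read off. The sketch below illustrates that lookup on toy data. The tables `ss_toy` and `docs_toy`, their values, and the final fill-in step are assumptions made for illustration. They are not the output of `sheet_sizes()` and not necessarily how the pipeline applies the estimate.

```r
# Hypothetical stand-in for the sheet size table; heights are made-up numbers
ss_toy <- data.frame(
  gatherings = c("2fo", "4to", "8vo"),
  height     = c(30, 24, 19),
  stringsAsFactors = FALSE
)

# Toy documents: only the first one has a recorded height
docs_toy <- data.frame(
  gatherings = c("8vo", "2fo", "8vo", "12mo"),
  height     = c(18, NA, NA, NA),
  stringsAsFactors = FALSE
)

# match() gives the row of ss_toy for each document (NA when not listed)
idx <- match(docs_toy$gatherings, ss_toy$gatherings)
docs_toy$gathering.height.estimate <- ss_toy$height[idx]

# Assumed fill-in rule: use the estimate only where the height is missing
docs_toy$height.filled <- ifelse(is.na(docs_toy$height),
                                 docs_toy$gathering.height.estimate,
                                 docs_toy$height)
```

Documents whose gatherings value is absent from the lookup table (here `"12mo"`) keep a missing height, which is why some entries remain without an area estimate.
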
Left: Document dimension histogram (surface area);
Right: title count per gatherings.
```{r sizes, echo=FALSE, message=FALSE, warning=FALSE, fig.width=7, fig.height=5, fig.show="hold",out.width="420px"}
p <- ggplot(df, aes(x = area))
p <- p + geom_histogram()
p <- p + xlab("Document surface area (log10)")
p <- p + ggtitle("Document dimension (surface area)")
p <- p + scale_x_log10()
print(p)
p <- ggplot(df, aes(x = gatherings))
p <- p + geom_bar()
n <- nchar(max(na.omit(table(df$gatherings))))
p <- p + scale_y_log10(breaks=10^(0:n))
p <- p + ggtitle("Title count")
p <- p + xlab("Size (gatherings)")
p <- p + ylab("Title count")
p <- p + coord_flip()
print(p)
```
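
The title count axis above places its breaks at powers of ten, up to the number of digits in the largest count. A small sketch of that calculation on made-up counts follows. The vector `counts_toy` is hypothetical, and the `log10` variant is shown only as an alternative that does not rely on converting the number to text.

```r
# Made-up title counts per gatherings class
counts_toy <- c(folio = 3, quarto = 250, octavo = 12000)

# Number of digits in the largest count, via its character representation
n_digits <- nchar(max(counts_toy))

# Axis breaks at 1, 10, 100, ... up to 10^n_digits
breaks <- 10^(0:n_digits)

# Alternative digit count that avoids the number-to-text conversion
n_digits_alt <- floor(log10(max(counts_toy))) + 1
```
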
<!--
### Gatherings timelines
```{r ndef, echo=FALSE, message=FALSE, warning=FALSE}
nmin <- 15
```
Popularity of different document sizes over time. Left: absolute title
counts. Right: relative title counts. Gatherings with fewer than `r nmin`
documents in every decade are excluded:
```{r compbyformat, echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.height=7, fig.show="hold", out.width="430px"}
dfs <- df %>% filter(!is.na(gatherings))
res <- timeline(dfs, group = "gatherings", nmin = nmin, mode = "absolute")
print(res$plot)
res <- timeline(dfs, group = "gatherings", nmin = nmin, mode = "percentage")
print(res$plot)
```
## Average document dimensions
Here we use the original data only:
```{r avedimstime, echo=FALSE, message=FALSE, warning=FALSE, fig.width=12, fig.height=7}
# only include gatherings with sufficiently many documents
nmin <- 2000
top.gatherings <- setdiff(names(which(table(df$gatherings.original) > nmin)), "NA")
df2 <- filter(df, gatherings.original != "NA" &
                  (!is.na(height.original) | !is.na(width.original))) %>%
  filter(gatherings.original %in% top.gatherings) %>%
  select(publication_decade, gatherings.original, height.original, width.original)
df3 <- df2 %>% group_by(gatherings.original, publication_decade) %>%
  summarize(mean.height.original = mean(height.original, na.rm = TRUE),
            mean.width.original = mean(width.original, na.rm = TRUE),
            n = n())
p <- ggplot()
p <- p + geom_point(data = df3, aes(x = publication_decade,
                                    y = mean.height.original,
                                    size = n,
                                    group = gatherings.original,
                                    color = gatherings.original))
# Use mean height here to speed up
p <- p + geom_smooth(data = df3, method = "loess",
                     aes(x = publication_decade,
                         y = mean.height.original,
                         group = gatherings.original,
                         color = gatherings.original))
p <- p + ggtitle("Height")
print(p)
```
Only the most frequently occurring gatherings are listed here:
```{r avedims, echo=FALSE, message=FALSE, warning=FALSE}
df2 <- filter(df, !is.na(gatherings.original) & (!is.na(height.original) | !is.na(width.original))) %>%
  filter(gatherings.original %in% top.gatherings) %>%
  group_by(gatherings.original) %>%
  summarize(
    mean.width = mean(width.original, na.rm = TRUE),
    # The medians were previously computed with mean(); use median() instead
    median.width = median(width.original, na.rm = TRUE),
    mean.height = mean(height.original, na.rm = TRUE),
    median.height = median(height.original, na.rm = TRUE),
    n = n())
mean.dimensions <- as.data.frame(df2)
kable(mean.dimensions, caption = "Average document dimensions", digits = 2)
```
-->