forked from lmu-osc/code-publishing
data.qmd
---
title: "Add Data"
engine: knitr
---
## Add Data File
```{r}
#| include: false
invisible(loadNamespace("palmerpenguins")) # Tell renv that we need this package
stopifnot(packageDescription("palmerpenguins", fields = "License") == "CC0")
write.csv(palmerpenguins::penguins, file = "data.csv", row.names = FALSE)
```
You can now download the data set we have prepared for you
and put it into your project folder:
[data.csv](data.csv){.btn .btn-lg .btn-info download=""}
The data set is from the package [`palmerpenguins`][palmerpenguins]
(v`r packageVersion("palmerpenguins")`)
and contains the recorded bill lengths and sex of penguins
living on three islands in the Palmer Archipelago, Antarctica.
It was made available by Allison Horst, Alison Hill, and Kristen Gorman
under the license [CC0\ 1.0][cc0].
[palmerpenguins]: https://allisonhorst.github.io/palmerpenguins/
[cc0]: https://creativecommons.org/publicdomain/zero/1.0/
::: {#imp-legal-restrictions .callout-important}
### Consider Legal Restrictions Before Sharing
Everything you put into the project folder will be shared publicly.
For reasons of reproducibility, this should include the data you analyze.
Of course, you should only share them to the extent you are allowed to.
Besides copyright and similar rights, you need to take into account
applicable privacy laws (e.g., the GDPR for people in the European Union)
and contractual obligations (e.g., with your data provider).
Privacy laws and contractual obligations may require you
to create an anonymized or synthetic data set[^anon-synth-tools],
in which case you should provide a reference
to a repository from which the originally measured data can be obtained.
For further information, you can watch the talk "[Data anonymity](https://osf.io/j6fy8)"
by Felix Schönbrodt recorded during the LMU Open Science Center Summer School 2023
and have a look at [the accompanying slides](https://osf.io/z6gcu).
Another resource to look at is the presentation
"Creating reproducible packages when data are confidential" by @Vilhuber2024.
:::
[^anon-synth-tools]: For example, using [Amnesia](https://amnesia.openaire.eu/),
[ARX](https://arx.deidentifier.org/), [sdcTools](https://github.com/sdcTools),
or [Synthpop](https://www.synthpop.org.uk/).
## Add Data Dictionary
Whether or not you distribute the data set,
it is important to document the meaning (e.g., units)
and values of its variables.
This is typically done with a _data dictionary_ (also called a _codebook_).
In the following, we will demonstrate how to create a simple data dictionary
using the R package [`datawizard`][datawizard].
You can install it now using:
[datawizard]: https://easystats.github.io/datawizard/
```{.r filename="Console"}
renv::install("datawizard")
```
You can put the code that follows in a separate document.
Create it by clicking on _File_ > _New File_ > _Quarto Document..._.
Choose a title such as `Data Dictionary`,
select _HTML_ as format,
uncheck the use of the visual markdown editor, and click on _Create_.
Remove everything except the YAML header (between the `---`).
To make the HTML file self-contained,
also set `embed-resources: true` such that the YAML header looks as follows:
```{.yml filename="data_dictionary.qmd"}
---
title: "Data Dictionary"
format:
  html:
    embed-resources: true
---
```
Then, save it as `data_dictionary.qmd` by clicking on _File_ > _Save_.
To create the actual data dictionary, first write a description for all columns
so others can understand what the variable names mean.
Where necessary, also document their values
-- this is especially important if their meaning is non-obvious.
In the following, we demonstrate this by storing the penguins' binomial name
along with the English name.
``````{cat}
#| engine.opts = list(file = "_data_dictionary.qmd"),
#| class.source = "md",
#| filename = "data_dictionary.qmd"
```{r}
#| echo: false
# Store the description of variables
vars <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)
# Store the description of variable values
vals <- list(
  species = c(
    Adelie = "Pygoscelis adeliae",
    Gentoo = "Pygoscelis papua",
    Chinstrap = "Pygoscelis antarcticus"
  )
)
```
``````
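Before moving on, it can be worth checking that the descriptions and the data agree. The following is a minimal sketch in base R, not part of the data dictionary itself: the two-row data frame is a made-up stand-in for `read.csv("data.csv")`, `vars` and `vals` are shortened copies of the objects above, and the names `undescribed`, `orphaned`, and `unlabelled` are introduced here purely for illustration.

```r
# Stand-in for `dat <- read.csv("data.csv")` (two illustrative rows)
dat <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1)
)

# Shortened versions of the `vars` and `vals` objects from above
vars <- c(
  species = "a character string denoting penguin species",
  bill_length_mm = "a number denoting bill length (millimeters)"
)
vals <- list(
  species = c(Adelie = "Pygoscelis adeliae", Gentoo = "Pygoscelis papua")
)

# Columns without a description, and descriptions without a column
undescribed <- setdiff(names(dat), names(vars))
orphaned <- setdiff(names(vars), names(dat))

# Values occurring in the data that have no entry in `vals`
unlabelled <- lapply(names(vals), \(x) {
  setdiff(unique(na.omit(dat[[x]])), names(vals[[x]]))
})
names(unlabelled) <- names(vals)
```

If all three objects are empty, every column is described and every documented value label matches a value in the data.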
Then, load the data and use `datawizard`
to add the descriptions to the `data.frame`:[^not-permanent]
[^not-permanent]: Note that the code provided does not alter the data file
-- no description will be added to `data.csv`.
The descriptions are only added to a (temporary) copy of the data set within R
to create the data dictionary.
``````{cat}
#| engine.opts = list(file = "_data_dictionary.qmd", append = TRUE),
#| class.source = "md",
#| filename = "data_dictionary.qmd"
```{r}
#| echo: false
dat <- read.csv("data.csv")
for (x in names(vars)) {
  if (x %in% names(vals)) {
    dat <- datawizard::assign_labels(
      dat,
      select = I(x),
      variable = vars[[x]],
      values = vals[[x]]
    )
  } else {
    dat <- datawizard::assign_labels(
      dat,
      select = I(x),
      variable = vars[[x]]
    )
  }
}
```
``````
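The point made in the footnote, that the descriptions live only in the R session, can be illustrated in base R. This sketch rests on an assumption: it sets an attribute named `label` by hand as a stand-in for what a labelling function might attach, rather than calling `datawizard`, and it writes to a temporary file instead of `data.csv`.

```r
# Small stand-in data set with a hand-set description (assumed attribute name: "label")
dat <- data.frame(species = c("Adelie", "Gentoo"))
attr(dat$species, "label") <- "a character string denoting penguin species"

# Write to a temporary CSV file and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(dat, tmp, row.names = FALSE)
dat_reread <- read.csv(tmp)

# The description exists on the in-memory copy only
label_in_session <- attr(dat$species, "label")
label_after_csv <- attr(dat_reread$species, "label")
```

Because CSV is a plain-text format that stores only the values, the attribute is not carried over, which is why the data dictionary is kept as a separate document.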
Then, you can create the data dictionary containing the descriptions,
but also some other information about each variable
(e.g., the number of missing values) and print it.
``````{cat}
#| engine.opts = list(file = "_data_dictionary.qmd", append = TRUE),
#| class.source = "md",
#| filename = "data_dictionary.qmd"
```{r}
#| echo: false
#| column: "body-outset"
#| classes: plain
datawizard::data_codebook(dat) |>
  datawizard::data_select(exclude = ID) |>
  datawizard::data_filter(N != "") |>
  datawizard::print_md()
```
``````
```{r}
#| child: "_data_dictionary.qmd"
```
Depending on the type of data, it may also be necessary
to describe sampling procedures (e.g., selection criteria),
measurement instruments (e.g., questionnaires),
appropriate weighting,
already applied preprocessing steps, or contact information.
In our case, as the data has already been published,
we only store a reference to its source.
The data set is from the R package `palmerpenguins`.
If you had it installed,
you could use the function `citation()` to create such a reference:
```{r}
#| label: "data-citation"
#| eval: false
citation("palmerpenguins", auto = TRUE) |>
  format(bibtex = FALSE, style = "text")
```
Without the package `palmerpenguins` installed,
you can find a [suggested citation on its website][palmerpenguins-citation]
and add that to your data dictionary:
[palmerpenguins-citation]: https://allisonhorst.github.io/palmerpenguins/#citation
```{r}
#| ref.label = "data-citation",
#| render = function(x, options) gsub("\\n", " ", x = x),
#| echo = FALSE,
#| class.output = "md code-overflow-wrap",
#| attr.output = 'filename="data_dictionary.qmd"'
# This chunk takes the output from the chunk "data-citation"
# and renders it with all newlines replaced by whitespaces.
```
Finally, you can render the data dictionary by running the following:
```{.bash filename="Terminal"}
quarto render data_dictionary.qmd
```
This should create the file `data_dictionary.html`
which you can open and view in your web browser.
One could go even further by making the information machine-readable in a standardized way.
We provide an optional example of that in @nte-frictionless.
If you want to learn more about the sharing of research data,
have a look at the tutorial "[FAIR research data management][fair-tutorial]".
[fair-tutorial]: https://lmu-osc.github.io/FAIR-Data-Management/
::: {#nte-frictionless .callout-note collapse="true"}
### Create Machine-Readable Variable Documentation
This example demonstrates how the title and description of the data set,
the descriptions of the variables, and their valid values can be stored in a machine-readable way.
We'll reuse the descriptions we already created[^value-labels] and add a few others.
[^value-labels]: Unfortunately, the descriptions of values are not reused in this example,
as they are [not supported][enum-labels] by the specification we are using.
[enum-labels]: https://specs.frictionlessdata.io/patterns/#table-schema-enum-labels-and-ordering
First, store the title and description of the data set as a whole:
```{.r filename="Console"}
table_info <- c(
  title = "penguins data set",
  description = "Size measurements for adult foraging penguins near Palmer Station, Antarctica"
)
```
As before, also provide a reference to the source.
```{r}
#| echo: false
#| class-output: "r code-overflow-wrap"
#| attr-output: 'filename="Console"'
# We have provided the data set as CSV file to the readers.
# Therefore, we cannot assume or require that readers have
# the R package palmerpenguins installed. Instead, we create
# the citation on our end and hide how we obtained it.
citation("palmerpenguins", auto = TRUE)$url |>
  paste0("dat_source <- \"", ... = _, "\"") |>
  cat()
```
Next, create a list of the categorical variables' valid values:
```{.r filename="Console"}
valid_vals <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  sex = c("male", "female"),
  year = c(2007, 2008, 2009)
)
```
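A list of valid values is only useful if the data actually conform to it. As a minimal sketch, the following base R code collects any observed value that is not listed; the three-row data frame is a made-up stand-in for `read.csv("data.csv")`, `valid_vals` is shortened, and the name `invalid` is introduced here for illustration.

```r
# Shortened version of `valid_vals` from above
valid_vals <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  sex = c("male", "female")
)

# Stand-in for `dat <- read.csv("data.csv")`, including one missing value
dat <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  sex = c("male", NA, "female")
)

# For each variable, keep observed non-missing values that are not listed as valid
invalid <- lapply(names(valid_vals), \(x) {
  observed <- unique(dat[[x]])
  observed[!is.na(observed) & !(observed %in% valid_vals[[x]])]
})
names(invalid) <- names(valid_vals)
```

Missing values are deliberately ignored here, since they are documented separately from the valid values.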
Finally, store the descriptions of the variables we already created earlier:
```{.r filename="Console"}
# Store the description of variables
vars <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)
```
Generally, metadata are stored either embedded in the data or externally,
for example, in a separate file.
We will use the "[frictionless data](https://frictionlessdata.io/)" standard,
where metadata are stored separately.
Another alternative would be [RO-Crate](https://www.researchobject.org/ro-crate/).
Specifically, one can use the R package [`frictionless`][frictionless]
to create a _schema_ which describes the structure of the data.[^frictionless-note]
For the purpose of the following code,
it is just a nested list that we edit to include our own information.
We also explicitly record in the schema
that missing values are stored in the data file as `NA`
and that the data are licensed under CC0\ 1.0.
Finally, the package is used to create a metadata file that contains the schema.
[frictionless]: https://docs.ropensci.org/frictionless/
[^frictionless-note]: In June 2024, [version 2](https://datapackage.org/)
of the frictionless data standard was released.
As of November 2024, the R package `frictionless` only supports the first version,
though support for v2 is [planned](https://github.com/frictionlessdata/frictionless-r/labels/datapackage%3Av2).
```{.r filename="Console"}
# Install {frictionless} and the required dependency {stringi}
renv::install(c(
  "frictionless",
  "stringi"
))
# Read data and create schema
dat_filename <- "data.csv"
dat <- read.csv(dat_filename)
dat_schema <- frictionless::create_schema(dat)
# Add descriptions to the fields
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  c(x, description = vars[[x$name]])
})
# Record valid values
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  if (x[["name"]] %in% names(valid_vals)) {
    modifyList(x, list(constraints = list(enum = valid_vals[[x$name]])))
  } else {
    x
  }
})
# Define missing values
dat_schema$missingValues <- c("", "NA")
# Create package with license info and write it
dat_package <- frictionless::create_package() |>
  frictionless::add_resource(
    resource_name = "penguins",
    data = dat_filename,
    schema = dat_schema,
    title = table_info[["title"]],
    description = table_info[["description"]],
    licenses = list(list(
      name = "CC0-1.0",
      path = "https://creativecommons.org/publicdomain/zero/1.0/",
      title = "CC0 1.0 Universal"
    )),
    sources = list(list(
      title = "CRAN",
      path = dat_source
    ))
  )
frictionless::write_package(dat_package, directory = ".")
```
This creates the metadata file `datapackage.json` in the current directory.
Make sure it is located in the same folder as `data.csv`,
as together they comprise a [data package](https://specs.frictionlessdata.io/data-package/).
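To see why `lapply()` and `modifyList()` are enough to edit the schema, it may help to run the same two steps on a small hand-built list. This is only a sketch: the `fields` list below imitates the structure assumed for the `fields` element of a schema (a list of lists with `name` and `type`) and is not produced by `frictionless`; `vars` and `valid_vals` are shortened copies of the objects above.

```r
# Hand-built stand-in for the `fields` element of a schema (assumed structure)
fields <- list(
  list(name = "species", type = "string"),
  list(name = "body_mass_g", type = "integer")
)

# Shortened versions of `vars` and `valid_vals` from above
vars <- c(
  species = "a character string denoting penguin species",
  body_mass_g = "an integer denoting body mass (grams)"
)
valid_vals <- list(species = c("Adelie", "Gentoo", "Chinstrap"))

# Step 1: append a description to every field
fields <- lapply(fields, \(x) c(x, description = vars[[x$name]]))

# Step 2: add an enum constraint only where valid values are known
fields <- lapply(fields, \(x) {
  if (x$name %in% names(valid_vals)) {
    modifyList(x, list(constraints = list(enum = valid_vals[[x$name]])))
  } else {
    x
  }
})

# Inspect the resulting nested list
str(fields)
```

Fields without an entry in `valid_vals` are returned unchanged, so they end up with a description but no constraints.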
:::
Having added the data and its documentation,
one can view and record the utilized packages with `renv`,
thus bringing the project into a consistent state:
```{.r filename="Console"}
renv::status()
renv::snapshot()
```
## Add Data Citation and Attribution
All data relied upon should be cited in the manuscript
to allow for precise identification and access.
From the "eight core principles of data citation" by @Starr2015,
licensed under [CC0\ 1.0][cc0]:
> **Principle 1 – Importance**: "Data should be considered legitimate,
> citable products of research. Data citations should be accorded the same
> importance in the scholarly record as citations of other research objects,
> such as publications."
>
> **Principle 3 – Evidence**: "In scholarly literature, whenever and wherever a
> claim relies upon data, the corresponding data should be cited."
>
> **Principle 5 – Access**: "Data citations should facilitate access to the data
> themselves and to such associated metadata, documentation, code, and other
> materials, as are necessary for both humans and machines to make informed use
> of the referenced data."
>
> **Principle 7 – Specificity and Verifiability**: "Data citations should
> facilitate identification of, access to, and verification of the specific data
> that support a claim. Citations or citation metadata should include
> information about provenance and fixity sufficient to facilitate verifying
> that the specific time slice, version and/or granular portion of data
> retrieved subsequently is the same as was originally cited."
Now, it's your turn to add an appropriate citation for the data set to the manuscript.
Does your citation adhere to the principles above?
::: {#tip-cite-palmerpenguins .callout-tip collapse="true"}
#### Citing the Data Set (Solution)
You can find an appropriate BibTeX entry on [the package website][palmerpenguins-citation]
or with the function `citation()`:^[Note that this function requires
the respective package to be installed.]
```{r}
#| class.output: "bib code-overflow-scroll"
#| attr-output: 'filename="Bibliography.bib"'
citation("palmerpenguins", auto = TRUE) |>
  transform(key = "horst2020") |>
  toBibtex()
```
Copy the BibTeX entry to the file `Bibliography.bib`.
Then, find the line in the manuscript that says "cite data here"
and replace it with a sentence such as the following:
```{.md filename="Manuscript.qmd"}
The analyzed data are by @horst2020.
```
Render the document to check that the citation is displayed properly.
```{.bash filename="Terminal"}
quarto render Manuscript.qmd
```
:::
While citation happens in the manuscript for reasons of academic integrity and reproducibility,
to comply with any licenses you may also need to provide attribution within your project folder.
Even though the data file we use here does not require attribution,
we recommend adding a short paragraph to `LICENSE.txt`:
```{r}
#| echo: false
#| class-output: "txt code-overflow-wrap"
#| attr-output: 'filename="LICENSE.txt"'
cat(paste0(
  "The penguins data stored in \"data.csv\" by Allison Horst, Alison Hill, and Kristen Gorman available from <",
  citation("palmerpenguins", auto = TRUE)$url,
  "> are licensed under CC0 1.0: <https://creativecommons.org/publicdomain/zero/1.0/>"
))
```
As before, if the license required adding the full license text,
you would also need to copy it to the project folder (if not already in there).
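If you prefer to add the attribution paragraph from R rather than in a text editor, it can be appended to the existing file. The following is a minimal sketch under stated assumptions: a temporary file stands in for `LICENSE.txt`, its first line is a placeholder for whatever the file already contains, and the source URL is left out of the paragraph here, so you would add it from the citation of the data set.

```r
# Temporary file standing in for the project's LICENSE.txt (placeholder content)
license_file <- tempfile(fileext = ".txt")
writeLines("Existing license text of the project.", license_file)

# Attribution paragraph without the source URL (add it from the data citation)
attribution <- paste(
  "The penguins data stored in \"data.csv\" by Allison Horst, Alison Hill,",
  "and Kristen Gorman are licensed under CC0 1.0:",
  "<https://creativecommons.org/publicdomain/zero/1.0/>"
)

# Append the paragraph, preceded by an empty line, without overwriting the file
cat(paste0("\n", attribution, "\n"), file = license_file, append = TRUE)

# Read the file back to confirm the result
license_lines <- readLines(license_file)
```

Using `append = TRUE` matters here, as the default would replace the existing license text.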
Finally, you can go through the commit routine:
```{.bash filename="Terminal"}
git status
git add .
git commit -m "Add data"
```