diff --git a/chapters/02_primary_practices.md b/chapters/02_primary_practices.md
index 2f16230..20663cf 100644
--- a/chapters/02_primary_practices.md
+++ b/chapters/02_primary_practices.md
@@ -229,6 +229,94 @@ toy_data/ Very (very!) small pieces of data for dev testing
 If a directory contains many files or subdirectories, consider whether it's
 clearer to write a separate manifest specifically for that directory.
 
+### Document the Data
+
+In a perfect world, every data set would come with detailed documentation or
+**metadata** about how the data were collected, what assumptions were made,
+what biases might be present, any ethical concerns, the overall structure, what
+each observation means, what each feature means, and more. Good data
+documentation guides researchers towards appropriate, responsible use of the
+data.
+
+Collecting data as part of a project gives you and your collaborators control
+over how the data are documented, so you can ensure there are no gaps. If your
+project uses data collected earlier or by someone else, it's a good practice to
+fill gaps in the existing documentation with your own. Thorough documentation
+isn't just beneficial to other researchers, it's also beneficial to future
+you---small details you notice and document about features could be important
+later in the project.
+
+:::{seealso}
+See DataLab's [README, Write Me! workshop reader][datalab-readme] for more
+about how to document data.
+:::
+
+[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/
+
+(create-data-dictionary)=
+#### Create a Data Dictionary
+
+A **data dictionary**, part of your metadata, is a document that explains what
+every field or element in your dataset means as well as any restrictions on
+their values. This includes things like the data type (e.g., number, date, text,
+boolean), and whether that field can be missing. The more information you
+include, the more helpful it will be down the line (see [Captain
+Obvious][captain_o]). Data dictionaries are the most efficient way to
+communicate the structure and content of your data to other collaborators,
+including future you! A very basic one could look like this:
+
+|Field Name |Field Description                         |Data Type   |Notes     |
+|-----------|------------------------------------------|------------|----------|
+|person_id  |autogenerated by database                 |integer     |          |
+|name       |legal full name (family name, given name) |string      |          |
+|occupation |A person's job or vocation                |string      |Must come from the Bureau of Labor Statistics Occupation List |
+|...        |...                                       |...         |...       |
+
+
+
+If you aren't sure where to start with creating a data dictionary, DataLab has a
+[template][datalab_dd_template] you can use as a jumping-off point. If you
+prefer step-by-step instructions, Kristin Briney's [Create a Data Dictionary
+exercise][create_dd] might be for you. [Open Science Framework (OSF)][osf_dd] has
+resources on what details to add to your data dictionary, and the
+[USGS][usgs_dd] provides many examples of data dictionaries and how they are
+used in different contexts. If you are working with multiple data sets, make
+sure to clarify which data dictionary to use with each data set.
+
+If your dataset looks less like a series of rows and columns, and more like a
+long list of files, consider creating a **data inventory** instead. A data
+inventory should include the author or source, title, publication year and DOI
+(if published), and file name for each file, and can include more file metadata
+as necessary. For example, a data inventory for works of fiction in the public
+domain could look something like this:
+
+|Author              |Title               |Year |Filename                                  |
+|--------------------|--------------------|-----|------------------------------------------|
+|Bronte,Charlotte    |JaneEyre            |1847 |EN_1847_BronteCharlotte_JaneEyre.txt      |
+|Austen,Jane         |SenseandSensibility |1811 |EN_1811_AustenJane_SenseandSensibility.txt|
+|Wollstonecraft,Mary |Maria               |1798 |EN_1798_WollstonecraftMary_Maria.txt      |
+|...                 |...                 |...  |...                                       |
+
+
+If you also need to keep track of things like the provenance or license
+associated with each file or data set, DataLab's
+[data inventory template][datalab_di_template] provides a comprehensive
+starting point.
+
+[osf_dd]: https://help.osf.io/article/217-how-to-make-a-data-dictionary
+[usgs_dd]: https://www.usgs.gov/data-management/data-dictionaries
+[captain_o]: https://dataedo.com/blog/captain-obivous-guide-to-column-descriptions-data-dictionary-best-practices
+[datalab_dd_template]: https://docs.google.com/spreadsheets/d/12N0hKyeT0ndZnt7rVZsz7LTW--BHhbb6TOegXEKQoxE/edit?usp=sharing
+[datalab_di_template]: https://docs.google.com/spreadsheets/d/1nUb-eu82Q7VplDpk0np5rYuaN52mYHLdql18pRD0i4Y/edit?usp=sharing
+[create_dd]: https://caltechlibrary.github.io/RDMworkbook/documentation.html#data-dictionary
+
+:::{seealso}
+See [OSF][osf_dd] and the [Research Data Management Workbook][create_dd] for how
+to create a data dictionary, UC Davis DataLab for [data
+dictionary][datalab_dd_template] and [data inventory][datalab_di_template]
+templates, and [USGS][usgs_dd] for examples.
+:::
+
 (workflows)=
 #### Workflows
 
diff --git a/chapters/03_secondary_practices.md b/chapters/03_secondary_practices.md
index 6af33ea..6c6bbaa 100644
--- a/chapters/03_secondary_practices.md
+++ b/chapters/03_secondary_practices.md
@@ -12,31 +12,6 @@ relevant, and we recommend you do too.
 Documentation
 -------------
 
-### Document the Data
-
-In a perfect world, every data set would come with detailed documentation or
-**metadata** about how the data were collected, what assumptions were made,
-what biases might be present, any ethical concerns, the overall structure, what
-each observation means, what each feature means, and more. Good data
-documentation guides researchers towards appropriate, responsible use of the
-data.
-
-Collecting data as part of a project gives you and your collaborators control
-over how the data are documented, so you can ensure there are no gaps. If your
-project uses data collected earlier or by someone else, it's a good practice to
-fill gaps in the existing documentation with your own. Thorough documentation
-isn't just beneficial to other researchers, it's also beneficial to future
-you---small details you notice and document about features could be important
-later in the project.
-
-:::{seealso}
-See DataLab's [README, Write Me! workshop reader][datalab-readme] for more
-about how to document data.
-:::
-
-[datalab-readme]: https://ucdavisdatalab.github.io/workshop_how-to-data-documentation/
-
-
 (document-every-experiment)=
 ### Document Every Experiment
 