1_08_getting_help.Rmd (+1, -1)
@@ -100,7 +100,7 @@ demo()
When we use the `demo` function like this it only lists the demos associated with packages that have been loaded in the current session (via `library`). If we want to see all the demos we can run we need to use the somewhat cryptic `demo(package = .packages(all.available = TRUE))`.
In order to actually run a demo we use the `demo` function, setting the `topic` and `package` arguments. For example, to run the "colors" demo in the __grDevices__ package we would use:
-
```{r,echo=FALSE}
+
```{r,eval=FALSE}
demo(colors, package = "grDevices", ask = FALSE)
```
This particular demo shows off some of the pre-defined colours we might use to customise the appearance of a plot. We've suppressed the output though because so much is produced.
2_03_tidy_data_dplyr_intro.Rmd (+1, -1)
@@ -2,7 +2,7 @@
## Introduction
-
[Data wrangling]
+
Data wrangling refers to the process of manipulating raw data into the format that we want it in, for example for data visualisation or statistical analyses. There are a wide range of ways we may want to manipulate our data, for example by creating new variables, subsetting the data, or calculating summaries. Data wrangling is often a time consuming process. It is also not the most interesting part of any analysis - we are interested in answering biological questions, not in formatting data. However, it is a necessary step to go through to be able to conduct the analyses that we're really interested in. Learning how to manipulate data efficiently can save us a lot of time and trouble and is therefore a really important skill to master.
<meta name="description" content="Course book for Introduction to Exploratory Data Analysis with R (APS 135) in the Department of Animal and Plant Sciences, University of Sheffield.">
-
<meta name="generator" content="bookdown 0.3 and GitBook 2.6.7">
+
<meta name="generator" content="bookdown 0.5 and GitBook 2.6.7">
<meta property="og:title" content="APS 135: Introduction to Exploratory Data Analysis with R" />
<li class="chapter" data-level="16" data-path="building-piplines.html"><a href="building-piplines.html"><i class="fa fa-check"></i><b>16</b> Building piplines</a><ul>
-
<li class="chapter" data-level="16.1" data-path="building-piplines.html"><a href="building-piplines.html#why-do-we-need-pipes"><i class="fa fa-check"></i><b>16.1</b> Why do we need ‘pipes’?</a></li>
+
<li class="chapter" data-level="16" data-path="building-pipelines.html"><a href="building-pipelines.html"><i class="fa fa-check"></i><b>16</b> Building pipelines</a><ul>
+
<li class="chapter" data-level="16.1" data-path="building-pipelines.html"><a href="building-pipelines.html#why-do-we-need-pipes"><i class="fa fa-check"></i><b>16.1</b> Why do we need ‘pipes’?</a></li>
<p>Data wrangling refers to the process of manipulating raw data into the format that we want it in, for example for data visualisation or statistical analyses. There are a wide range of ways we may want to manipulate our data, for example by creating new variables, subsetting the data, or calculating summaries. Data wrangling is often a time consuming process. It is also not the most interesting part of any analysis - we are interested in answering biological questions, not in formatting data. However, it is a necessary step to go through to be able to conduct the analyses that we’re really interested in. Learning how to manipulate data efficiently can save us a lot of time and trouble and is therefore a really important skill to master.</p>
</div>
<div id="why-dplyr" class="section level2">
<h2><span class="header-section-number">11.2</span> The value of <strong>dplyr</strong></h2>
<meta name="description" content="Course book for Introduction to Exploratory Data Analysis with R (APS 135) in the Department of Animal and Plant Sciences, University of Sheffield.">
-
<meta name="generator" content="bookdown 0.3 and GitBook 2.6.7">
+
<meta name="generator" content="bookdown 0.5 and GitBook 2.6.7">
<meta property="og:title" content="APS 135: Introduction to Exploratory Data Analysis with R" />
<li class="chapter" data-level="16" data-path="building-piplines.html"><a href="building-piplines.html"><i class="fa fa-check"></i><b>16</b> Building piplines</a><ul>
-
<li class="chapter" data-level="16.1" data-path="building-piplines.html"><a href="building-piplines.html#why-do-we-need-pipes"><i class="fa fa-check"></i><b>16.1</b> Why do we need ‘pipes’?</a></li>
+
<li class="chapter" data-level="16" data-path="building-pipelines.html"><a href="building-pipelines.html"><i class="fa fa-check"></i><b>16</b> Building pipelines</a><ul>
+
<li class="chapter" data-level="16.1" data-path="building-pipelines.html"><a href="building-pipelines.html#why-do-we-need-pipes"><i class="fa fa-check"></i><b>16.1</b> Why do we need ‘pipes’?</a></li>
<li><p>to provide a foundation for further data collection.</p></li>
</ul>
<p>EDA involves a mix of both numerical and visual methods of analysis. Statistical methods are sometimes used to supplement EDA, but its main purpose is to facilitate understanding before diving into formal statistical modelling.</p>
-
<p>Even if we think we already know what kind of analysis we need to pursue, it’s always a good idea to <strong>explore a data set before diving into the analysis</strong>. At the very least, this will help us to determine whether or not our plans are sensible. Very often it uncovers new patterns and insights. In this chapter we’re going to examine some basic concepts that underpin EDA: 1) classifying different types data, and 2) distinguishing between populations and samples. This will set us up to learn how to explore our data in later chapters.</p>
+
<p>Even if we think we already know what kind of analysis we need to pursue, it’s always a good idea to <strong>explore a data set before diving into the analysis</strong>. At the very least, this will help us to determine whether or not our plans are sensible. Very often it uncovers new patterns and insights. In this chapter we’re going to examine some basic concepts that underpin EDA: 1) classifying different types of data, and 2) distinguishing between populations and samples. This will set us up to learn how to explore our data in later chapters.</p>
</div>
<div id="variables" class="section level2">
<h2><span class="header-section-number">17.2</span> Statistical variables and data</h2>
@@ -421,10 +420,10 @@ <h3><span class="header-section-number">17.2.2</span> Ratio vs. interval scales
<h2><span class="header-section-number">17.3</span> Populations, samples and distributions</h2>
-
<p>When we collect data of any kind, we are working a sample of objects (e.g. trees, insects, fields) from a wider population. We usually want to know something about the wider population, but since it’s impossible to study every member of the population, we study the properties of one or more samples instead.</p>
+
<p>When we collect data of any kind, we are working with a sample of objects (e.g. trees, insects, fields) from a wider population. We usually want to know something about the wider population, but since it’s impossible to study every member of the population, we study the properties of one or more samples instead.</p>
<p>The problem with samples is that they are ‘noisy’. If we were to repeat the same data collection protocol more than once we should expect to end up with a different sample each time, even if the wider population never changes. This results purely from chance variation in the sampling of different units. Picking apart the relationship between samples and populations is the basis of much of statistics. This topic is best dealt with in a dedicated statistics book, so we won’t develop these ideas in much detail here.</p>
<p>The reason we mention the distinction between a population and a sample is because EDA is primarily concerned with properties of samples—it aims to characterise the sample in hand without trying to say too much about the wider population from which it is derived.</p>
-
<p>When we talk about “exploring a variable” what we are really doing is exploring is the <strong>sample distribution</strong> of that variable. What is this? The sample distribution is a statement about the frequency with which different values occur in a particular sample. Imagine we took a a sample of undergraduates and measured their height. The majority of students would be round about 1.7m tall, even though there would obviously be some variation among students. Men would tend to be slightly taller than women, and very small or very tall people would be rare. We know from experience that no one in in this sample would be over 3 meters tall. These are all statements about a (hypothetical) sample distribution of undergraduate heights.</p>
+
<p>When we talk about “exploring a variable” what we are really doing is exploring is the <strong>sample distribution</strong> of that variable. What is this? The sample distribution is a statement about the frequency with which different values occur in a particular sample. Imagine we took a sample of undergraduates and measured their height. The majority of students would be round about 1.7m tall, even though there would obviously be some variation among students. Men would tend to be slightly taller than women, and very small or very tall people would be rare. We know from experience that no one in this sample would be over 3 meters tall. These are all statements about a (hypothetical) sample distribution of undergraduate heights.</p>
<p>Our goal when exploring the sample distribution of a variable is to answer questions such as: What are the most common values of the variable? How much do observations differ from one another? Rather than simply describing these properties in verbal terms, as we did above, we want to describe them in a more informative way. There are two ways to go about this:</p>
<ol style="list-style-type: decimal">
<li><p><strong>Calculate descriptive statistics</strong>. Descriptive statistics are used to quantify the basic features of a sample distribution. They provide simple summaries about the sample that can be used to make comparisons and draw preliminary conclusions. For example, we often use ‘the mean’ to summarise the ‘most likely’ values of a variable in a sample.</p></li>