improved readme
zackarno committed Mar 4, 2020
1 parent d711cc4 commit f0e544a
Showing 2 changed files with 73 additions and 126 deletions.
18 changes: 9 additions & 9 deletions README.Rmd
@@ -32,14 +32,14 @@ devtools::install_github("zackarno/butteR")

### Example using the stratified sampler function

The stratified sampler function can be useful if you want to generate random samples from spatial point data. It has been most useful for me when I have shelter footparint data that I want to sample. For now, the function only reads in point data. Therefore, if the footprint data you have is polygons it should first be converted to points (centroids).
The stratified sampler function can be useful if you want to generate random samples from spatial point data. It has been most useful for me when I have shelter footprint data that I want to sample. For now, the function only reads in point data. Therefore, if the footprint data you have is polygons it should first be converted to points (centroids).
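
The polygon-to-centroid step mentioned above could look roughly like the sketch below. It assumes the footprints are an `sf` polygon layer; the `footprints` object, the `shelter_id` column, the coordinates and the choice of EPSG:32646 are all made up for illustration and are not part of butteR.

```r
library(sf)

# Two hypothetical square shelter footprints (coordinates are invented)
sq1 <- st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))
sq2 <- st_polygon(list(rbind(c(10, 10), c(14, 10), c(14, 14), c(10, 14), c(10, 10))))

# A projected CRS is assumed so that centroids are computed on planar coordinates.
# agr = "constant" declares the attributes constant over each geometry, which
# avoids the attribute warning st_centroid() would otherwise raise.
footprints <- st_sf(
  shelter_id = c("s1", "s2"),
  geometry   = st_sfc(sq1, sq2, crs = 32646),
  agr        = "constant"
)

# Replace each polygon with its centroid; the attribute columns are kept
footprint_pts <- st_centroid(footprints)

# Two-column matrix (X, Y) of centroid coordinates
centroid_xy <- st_coordinates(footprint_pts)
```

The resulting point layer (or a plain data frame built from `centroid_xy`) is the kind of input the paragraph above says the sampler expects.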

I believe the most useful/powerful aspect of this function is the ability to write out well labelled kml/kmz files that can be loaded onto phone and opened with maps.me or other applications. To use this function properly it is important that you first familiarize yourself with some of the theory that underlies random sampling and that you learn how "seeds" can be used/set in R to make random sampling reproducible. The function generates randome seeds and stores it as a an attribute field of the spatial sample. There is also the option to write the seed to the working directory as text file. Understanding how to use the seeds becomes important if you want to reproduce your results, or if you need to do subsequent rounds of sampling where you want to exclude the previous sample without having to read in the previous samples.
I believe the most useful/powerful aspect of this function is the ability to write out well labelled kml/kmz files that can be loaded onto phone and opened with maps.me or other applications. To use this function properly it is important that you first familiarize yourself with some of the theory that underlies random sampling and that you learn how "seeds" can be used/set in R to make random sampling reproducible. The function generates random seeds and stores it as a an attribute field of the spatial sample. There is also the option to write the seed to the working directory as text file. Understanding how to use the seeds becomes important if you want to reproduce your results, or if you need to do subsequent rounds of sampling where you want to exclude the previous sample without having to read in the previous samples.


To show how the function can be used I will first simulate a spatial data set and sample frame

```{r example}
```{r example, warning= F , message=F}
library(butteR)
library(dplyr)
library(sf)
@@ -81,23 +81,23 @@ sampler_ouput<-butteR::stratified_sampler(sample.target.frame = sample_frame,
```
The output is stored in a list. Below is the first 6 results of each stratified sample. The results are stratified sample. They can be viewed collectively or one at a time.
The output is stored in a list of data frames. Each data frame consists of the sample for one strata. Below I have printed the table of the first 6 results for strata A,B, and C in our example.
```{r}
sampler_ouput$results %>% purrr:::map(head) %>% knitr::kable()
sampler_ouput$results[1:3] %>% purrr:::map(head) %>% knitr::kable()
sampler_ouput$results$D %>% head()
````

The random_seed is saved in the list as well as an attribute of each stratified sample. The random seed is very important for reproducibility which is quite useful for subsequent rounds of data collection
The random_seed is saved in the list as well as an attribute of each stratified sample. The random seed is very important to be able to reproduce you work. This is particularly useful when you need to perform additional rounds (sometimes unexpected) of sampling for an assessment.
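
To illustrate why a stored seed matters, here is a minimal base R sketch. It does not show how `stratified_sampler` uses its seed internally; `draw_sample()` and the toy `pts` data frame are hypothetical stand-ins, and only the seed value 643039 is borrowed from the example output.

```r
# Toy sampling frame: 20 points in two strata (invented data)
pts <- data.frame(uuid = 1:20, strata = rep(c("A", "B"), each = 10))

# Hypothetical helper: draw n_per_stratum points from every stratum
draw_sample <- function(data, seed, n_per_stratum) {
  set.seed(seed)  # fixing the seed makes the draw repeatable
  picked <- unlist(lapply(
    split(data$uuid, data$strata),
    function(ids) sample(ids, n_per_stratum)
  ))
  data[data$uuid %in% picked, ]
}

# The same seed on the same data returns the same sample
round1       <- draw_sample(pts, seed = 643039, n_per_stratum = 3)
round1_again <- draw_sample(pts, seed = 643039, n_per_stratum = 3)
identical(round1$uuid, round1_again$uuid)  # TRUE

# Second round: rebuild round 1 from its seed, drop it, then sample again
remaining <- pts[!pts$uuid %in% round1$uuid, ]
round2    <- draw_sample(remaining, seed = 1234, n_per_stratum = 3)
```

Because round 1 can be rebuilt from the seed alone, the second round can exclude it without reading the earlier sample files back in. This only holds if the input data and its row order are unchanged between runs.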

```{r}
sampler_ouput$random_seed
```

You can also view all of the remaining points which were not not randomly sampled. You can choose to have these written to a shape file. It is generally a good back up policy to write these as well.
The output of the stratified sampler object also stores the remaining sample as a separate data frame. It is often a good idea to write these to a shapefile or csv as back up, especially if you are not 100 % sure how to use the random seeds to reproduce your sampling.

```{r}
@@ -129,7 +129,7 @@ pt_sf2<- sf::st_as_sf(x = pt_data2, coords=coords, crs=4326)
```


Next I will show two spatial verification functions. The first one just finds the closest distance between points. It uses rTree spatial indexing so it will work quickly on fairly large datasets.
Next I will show two spatial verification functions. The first one just finds the closest distance between points. It uses rTree spatial indexing so it will work quickly on fairly large data sets.



@@ -140,7 +140,7 @@ closest_pts<- butteR::closest_distance_rtree(pt_sf1, pt_sf2)
closest_pts %>% head() %>% knitr::kable()
```

You could easily just filter the "closest_pts" ouput by a distance threshold of your choice. However to make it simpler I have wrapped this function in the function "check_distances_from_target" (I need to come up with a better name for this function). It will return all of the points in from "dataset"that are further than the set threshold from any point in the "target_points". It will also show you the distance to the closest target point. Obviously this is fake data so there are a ton of points returned (I will just display the first 6 rows). In your assessment dat there should obviously be much less.
You could easily just filter the "closest_pts" output by a distance threshold of your choice. However to make it simpler I have wrapped this function in the function "check_distances_from_target" (I need to come up with a better name for this function). It will return all of the points in from "data set"that are further than the set threshold from any point in the "target_points". It will also show you the distance to the closest target point. Obviously this is fake data so there are a ton of points returned (I will just display the first 6 rows). In your assessment data there should obviously be much less.
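
The manual filtering mentioned above might be sketched as follows. The distance column name is not visible in this diff, so `dist_m` is a hypothetical name, `closest_pts_demo` is an invented stand-in for the real output, and treating the distances as metres is an assumption.

```r
library(dplyr)

# Stand-in for the closest_pts output; `dist_m` is a hypothetical column name
closest_pts_demo <- data.frame(
  uuid   = c(1, 2, 3, 4),
  dist_m = c(12.5, 12776.161, 48.0, 13936.767)
)

threshold_m <- 50  # assumed threshold, in the same units as dist_m

# Keep only the points that are further than the threshold from their nearest target
too_far <- closest_pts_demo %>%
  filter(dist_m > threshold_m)
```

With the real output, the same `filter()` call would be applied to whichever column holds the distance.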


```{r}
181 changes: 64 additions & 117 deletions README.md
@@ -21,7 +21,7 @@ devtools::install_github("zackarno/butteR")

The stratified sampler function can be useful if you want to generate
random samples from spatial point data. It has been most useful for me
when I have shelter footparint data that I want to sample. For now, the
when I have shelter footprint data that I want to sample. For now, the
function only reads in point data. Therefore, if the footprint data you
have is polygons it should first be converted to points (centroids).

@@ -31,7 +31,7 @@ phone and opened with maps.me or other applications. To use this
function properly it is important that you first familiarize yourself
with some of the theory that underlies random sampling and that you
learn how “seeds” can be used/set in R to make random sampling
reproducible. The function generates randome seeds and stores it as a an
reproducible. The function generates random seeds and stores it as a an
attribute field of the spatial sample. There is also the option to write
the seed to the working directory as text file. Understanding how to use
the seeds becomes important if you want to reproduce your results, or if
@@ -44,17 +44,7 @@ data set and sample frame
``` r
library(butteR)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.1
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
lon<-runif(min=88.00863,max=92.68031, n=1000)
lat<-runif(min=20.59061,max=26.63451, n=1000)
strata_options<-LETTERS[1:8]
@@ -72,25 +62,25 @@ pt_data %>% head() %>% knitr::kable()

| lon | lat | strata |
| -------: | -------: | :----- |
| 90.14262 | 26.06148 | D |
| 91.21273 | 23.59155 | C |
| 90.19238 | 26.24277 | E |
| 90.02332 | 25.27046 | H |
| 89.53342 | 20.90264 | G |
| 88.85128 | 20.98232 | G |
| 88.21260 | 25.12462 | F |
| 88.38863 | 22.14892 | G |
| 91.14093 | 23.49458 | H |
| 89.72288 | 22.25252 | G |
| 89.90292 | 22.34828 | D |
| 89.84144 | 23.05792 | B |

``` r
sample_frame %>% head() %>% knitr::kable()
```

| strata | sample\_size |
| :----- | -----------: |
| A | 33 |
| B | 69 |
| C | 39 |
| A | 63 |
| B | 58 |
| C | 28 |
| D | 85 |
| E | 30 |
| F | 16 |
| E | 60 |
| F | 45 |

Next we will run the stratified\_sampler function using the two
simulated data sets as input.
@@ -111,125 +101,82 @@ sampler_ouput<-butteR::stratified_sampler(sample.target.frame = sample_frame,
)
```

The output is stored in a list. Below is the first 6 results of each
stratified sample. The results are stratified sample. They can be viewed
collectively or one at a time.
The output is stored in a list of data frames. Each data frame consists
of the sample for one strata. Below I have printed the table of the
first 6 results for strata A,B, and C in our example.

``` r
sampler_ouput$results %>% purrr:::map(head) %>% knitr::kable()
sampler_ouput$results[1:3] %>% purrr:::map(head) %>% knitr::kable()
```

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_A | 828005 | 27 |
| 2\_A | 828005 | 68 |
| 3\_A | 828005 | 83 |
| 4\_A | 828005 | 100 |
| 5\_A | 828005 | 101 |
| 6\_A | 828005 | 124 |
| 1\_A | 643039 | 17 |
| 2\_A | 643039 | 23 |
| 3\_A | 643039 | 58 |
| 4\_A | 643039 | 69 |
| 5\_A | 643039 | 77 |
| 6\_A | 643039 | 90 |

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_B | 828005 | 10 |
| 2\_B | 828005 | 41 |
| 3\_B | 828005 | 44 |
| 4\_B | 828005 | 62 |
| 5\_B | 828005 | 69 |
| 6\_B | 828005 | 92 |
| 1\_B | 643039 | 28 |
| 2\_B | 643039 | 43 |
| 3\_B | 643039 | 49 |
| 4\_B | 643039 | 84 |
| 5\_B | 643039 | 88 |
| 6\_B | 643039 | 116 |

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_C | 828005 | 2 |
| 2\_C | 828005 | 32 |
| 3\_C | 828005 | 36 |
| 4\_C | 828005 | 45 |
| 5\_C | 828005 | 110 |
| 6\_C | 828005 | 138 |

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_D | 828005 | 1 |
| 2\_D | 828005 | 12 |
| 3\_D | 828005 | 13 |
| 4\_D | 828005 | 17 |
| 5\_D | 828005 | 28 |
| 6\_D | 828005 | 51 |

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_E | 828005 | 33 |
| 2\_E | 828005 | 50 |
| 3\_E | 828005 | 66 |
| 4\_E | 828005 | 87 |
| 5\_E | 828005 | 109 |
| 6\_E | 828005 | 146 |

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_F | 828005 | 135 |
| 2\_F | 828005 | 153 |
| 3\_F | 828005 | 317 |
| 4\_F | 828005 | 381 |
| 5\_F | 828005 | 402 |
| 6\_F | 828005 | 462 |

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_G | 828005 | 5 |
| 2\_G | 828005 | 6 |
| 3\_G | 828005 | 14 |
| 4\_G | 828005 | 19 |
| 5\_G | 828005 | 20 |
| 6\_G | 828005 | 25 |

| Description | rnd\_seed | uuid |
| :---------- | --------: | ---: |
| 1\_H | 828005 | 23 |
| 2\_H | 828005 | 24 |
| 3\_H | 828005 | 30 |
| 4\_H | 828005 | 49 |
| 5\_H | 828005 | 75 |
| 6\_H | 828005 | 85 |
| 1\_C | 643039 | 12 |
| 2\_C | 643039 | 37 |
| 3\_C | 643039 | 48 |
| 4\_C | 643039 | 81 |
| 5\_C | 643039 | 107 |
| 6\_C | 643039 | 122 |

``` r


sampler_ouput$results$D %>% head()
#> Description rnd_seed uuid
#> 1 1_D 828005 1
#> 2 2_D 828005 12
#> 3 3_D 828005 13
#> 4 4_D 828005 17
#> 5 5_D 828005 28
#> 6 6_D 828005 51
#> 1 1_D 643039 10
#> 2 2_D 643039 29
#> 3 3_D 643039 33
#> 4 4_D 643039 38
#> 5 5_D 643039 68
#> 6 6_D 643039 71
```

The random\_seed is saved in the list as well as an attribute of each
stratified sample. The random seed is very important for reproducibility
which is quite useful for subsequent rounds of data collection
stratified sample. The random seed is very important to be able to
reproduce you work. This is particularly useful when you need to perform
additional rounds (sometimes unexpected) of sampling for an assessment.

``` r
sampler_ouput$random_seed
#> [1] 828005
#> [1] 643039
```

You can also view all of the remaining points which were not not
randomly sampled. You can choose to have these written to a shape file.
It is generally a good back up policy to write these as well.
The output of the stratified sampler object also stores the remaining
sample as a separate data frame. It is often a good idea to write these
to a shapefile or csv as back up, especially if you are not 100 % sure
how to use the random seeds to reproduce your sampling.

``` r

sampler_ouput$samp_remaining %>% head() %>% knitr::kable()
```

| | lon | lat | strata | uuid | rnd\_seed |
| -- | -------: | -------: | :----- | ---: | --------: |
| 3 | 90.19238 | 26.24277 | E | 3 | 828005 |
| 4 | 90.02332 | 25.27046 | H | 4 | 828005 |
| 7 | 90.77956 | 25.45381 | E | 7 | 828005 |
| 8 | 90.88944 | 22.56836 | G | 8 | 828005 |
| 9 | 90.76433 | 21.99042 | A | 9 | 828005 |
| 11 | 90.83148 | 25.57179 | E | 11 | 828005 |
| | lon | lat | strata | uuid | rnd\_seed |
| - | -------: | -------: | :----- | ---: | --------: |
| 2 | 88.38863 | 22.14892 | G | 2 | 643039 |
| 4 | 89.72288 | 22.25252 | G | 4 | 643039 |
| 5 | 89.90292 | 22.34828 | D | 5 | 643039 |
| 6 | 89.84144 | 23.05792 | B | 6 | 643039 |
| 8 | 90.72916 | 21.23794 | G | 8 | 643039 |
| 9 | 92.38461 | 22.17366 | A | 9 | 643039 |

### Example using the check\_distance\_from\_target function

@@ -257,7 +204,7 @@ pt_sf2<- sf::st_as_sf(x = pt_data2, coords=coords, crs=4326)

Next I will show two spatial verification functions. The first one just
finds the closest distance between points. It uses rTree spatial
indexing so it will work quickly on fairly large datasets.
indexing so it will work quickly on fairly large data sets.

``` r

@@ -277,15 +224,15 @@ closest_pts %>% head() %>% knitr::kable()
| 199 | B | c(91.9385484030803, 22.9929798167442) | A | c(92.0439420932042, 22.9314622797974) | 12776.161 |
| 419 | D | c(88.6396377435045, 22.2862520419468) | C | c(88.7253538271838, 22.3836231110146) | 13936.767 |

You could easily just filter the “closest\_pts” ouput by a distance
You could easily just filter the “closest\_pts” output by a distance
threshold of your choice. However to make it simpler I have wrapped this
function in the function “check\_distances\_from\_target” (I need to
come up with a better name for this function). It will return all of the
points in from “dataset”that are further than the set threshold from any
point in the “target\_points”. It will also show you the distance to the
closest target point. Obviously this is fake data so there are a ton of
points returned (I will just display the first 6 rows). In your
assessment dat there should obviously be much less.
points in from “data set”that are further than the set threshold from
any point in the “target\_points”. It will also show you the distance to
the closest target point. Obviously this is fake data so there are a ton
of points returned (I will just display the first 6 rows). In your
assessment data there should obviously be much less.

``` r

