@@ -90,13 +90,13 @@ In the remainder of this chapter, we will primarily focus on the data
 volume challenge, in particular exploring how different decisions about
 data storage formats and layouts enable (or constrain) us as we attempt
 to work with data at large scale. We'll formalize a couple of concepts
-we've alluded at throughout the course, introduce a few new
+we've alluded to throughout the course, introduce a few new
 technologies, and characterize the current state of best practice --
 with caveat that this is an evolving area!

 ## Cloud optimized data

-As we've seen, when it comes global and even regional environmental
+As we've seen, when it comes to global and even regional environmental
 phenomena observed using some form of remote-sensing technology, much of
 our data is now _cloud-scale_. In the most basic sense, this means many
 datasets -- and certainly relevant collections of datasets that you
@@ -117,7 +117,7 @@ be beneficial when the compute platform is "near" the data, reducing
 network and perhaps even I/O latency. However, to the extent that
 realizing these benefits may require committing to some particular
 commercial cloud provider (with its associated cost structure), this may
-or may not be desirable change.
+or may not be a desirable change.
 :::

 Practically speaking, accessing data in the cloud means using HTTP to
@@ -173,7 +173,7 @@ Geospatial Consortium
 (OGC)](https://www.ogc.org/announcement/cloud-optimized-geotiff-cog-published-as-official-ogc-standard/)
 in 2023. However, COGs _are_ GeoTIFFs, which have been around since the
 1990s, and GeoTIFFs are TIFFs, which date back to the 1980s. Let's work
-our way through this linage.
+our way through this lineage.

 First we have the original **TIFF** format, which stands for Tagged Image
 File Format. Although we often think of TIFFs as image files, they're
@@ -227,7 +227,7 @@ the same compressed tile.

 A set of overviews (lower resolution tile pyramids) are computed from
 the main full resolution data and stored in the file, again following a
-tiling scheme and and arranged in order. This allows clients to load a
+tiling scheme and arranged in order. This allows clients to load a
 lower resolution version of the data when appropriate, without needing
 to read the full resolution data itself.
 :::
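To make the overview mechanism concrete, here is a minimal sketch of a reduced-resolution read, assuming the `rasterio` library and a hypothetical COG URL; when overviews are present, a decimated read like this is typically satisfied from the stored overview tiles rather than from the full-resolution data.

```python
# Minimal sketch: reduced-resolution read of a COG (hypothetical URL).
import rasterio

url = "https://example.com/data/scene.tif"  # hypothetical COG location

with rasterio.open(url) as src:
    # Decimation factors of the stored overviews for band 1, e.g. [2, 4, 8, 16]
    print(src.overviews(1))

    # Ask for band 1 at roughly 1/8 of the original resolution; the reader
    # can serve this from the closest stored overview instead of the full data.
    low_res = src.read(1, out_shape=(src.height // 8, src.width // 8))
```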
@@ -261,7 +261,7 @@ particular desired subset of the image at a particular resolution
 the desired bounding box into pixel coordinates, then identifying which
 tile(s) in the COG intersect with the area of interest, then determining
 the associated byte ranges of the tile(s) based on the metadata read in
-teh first step. And the best part is that "client" here refers to the
+the first step. And the best part is that "client" here refers to the
 underlying software, which takes care of all of the details. As a user,
 typically all you need to do is specify the file location, area of
 interest, and desired overview level (if relevant)!
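As a sketch of the user-facing side of that workflow (assuming `rasterio` and a hypothetical COG URL, with the bounding box given in the dataset's own coordinate reference system), the header parsing, tile intersection, and HTTP range requests all happen behind a couple of calls:

```python
# Minimal sketch: read only the COG tiles that intersect an area of interest.
import rasterio
from rasterio.windows import from_bounds

url = "https://example.com/data/scene.tif"          # hypothetical COG location
bbox = (400000.0, 4400000.0, 410000.0, 4410000.0)   # hypothetical AOI, in the data's CRS

with rasterio.open(url) as src:                      # small read: header/metadata only
    window = from_bounds(*bbox, transform=src.transform)  # bbox -> pixel window
    subset = src.read(1, window=window)              # fetches only intersecting tiles
```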
@@ -286,7 +286,7 @@ configuration (shape) optimized for expected usage patterns. This
 enables a client interested in a subset of data to retrieve the relevant
 data without receiving too much additional unwanted data. In addition,
 chunk layout should be such that, under expected common usage patterns,
-proximal chunks are morely likely to be requested together. On average,
+proximal chunks are more likely to be requested together. On average,
 this will reduce the number of separate read requests a client must
 issue to retrieve and piece together any particular desired data subset.
 In addition, chunks should almost certainly be compressed with a
@@ -339,7 +339,7 @@ choice of how to break the data into separately compressed and
 addressable subsets is now decoupled from the choice of how to break the
 data into separate files; a massive dataset can be segmented into a very
 large number of small chunks without necessarily creating a
-correspondingingly large number of small individual files, which can
+correspondingly large number of small individual files, which can
 cause problems in certain contexts. In some sense, this allows a Zarr
 store to behave a little more like a COG, with its many small,
 addressable tiles contained in a single file.
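As a sketch of what this looks like in practice (assuming zarr-python version 3 or later, which exposes a `shards` argument, and using illustrative shapes and a hypothetical store path), many small chunks can be grouped into a much smaller number of shard objects at creation time:

```python
# Minimal sketch: small chunks grouped into larger shard objects (zarr-python >= 3).
import zarr

arr = zarr.create_array(
    store="example.zarr",     # hypothetical local store path
    shape=(7000, 14000),
    chunks=(100, 100),        # unit of compression and addressability
    shards=(1000, 1000),      # each shard object holds a 10 x 10 block of chunks
    dtype="float32",
)

# Writes land inside the appropriate shard(s); readers can still request
# individual 100 x 100 chunks via byte ranges within a shard.
arr[:1000, :1000] = 1.0
```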
@@ -640,8 +640,8 @@ our own custom multidimensional data array from a large collection of
 data resources that themselves be arbitrarily organized with respect to
 our specific use case.

-Interested in learning more about STAC? If so, head over the [STAC
-Index](https://stacindex.org/), and online resource listing many
+Interested in learning more about STAC? If so, head over to the [STAC
+Index](https://stacindex.org/), an online resource listing many
 published STAC catalogs, along with various related software and
 tooling.

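As a sketch of what this enables (assuming the `pystac-client` package and, purely as an example endpoint, the public Earth Search STAC API), a catalog can be searched by collection, area, and time, with each returned item pointing at its underlying data assets:

```python
# Minimal sketch: query a public STAC API for items over an area and time range.
from pystac_client import Client

catalog = Client.open("https://earth-search.aws.element84.com/v1")  # example STAC API
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-105.3, 39.9, -105.1, 40.1],     # hypothetical area of interest (lon/lat)
    datetime="2024-06-01/2024-06-30",
)

for item in search.items():
    # Each item carries metadata plus links (assets) to the actual data files,
    # typically cloud optimized GeoTIFFs for this kind of collection.
    print(item.id, list(item.assets))
```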
@@ -661,7 +661,7 @@ structured netCDF and GeoTIFF files -- historically successful and
 efficiently used in local storage, but often suboptimal at cloud scale.
 The second represents simple, ad hoc approaches to splitting larger data
 into smaller files, thrown somewhere on a network-accessible server, but
-without efficiently readable overaching metadata and without any optimal
+without efficiently readable overarching metadata and without any optimal
 structure. The next two represent cloud optimized approaches, with data
 split into addressable units described by up-front metadata that clients
 can use to efficiently access the data. The first of these resembles a
@@ -694,7 +694,7 @@ use of external metadata -- whether as Zarr metadata or STAC catalogs --
 that allows clients to issue data requests that "just work"
 regardless of the underlying implementation details.

-As a final takeway, insofar as there's a community consensus around the
+As a final takeaway, insofar as there's a community consensus around the
 best approaches for managing data today, it probably looks something
 like this:

@@ -703,7 +703,7 @@ like this:
   cloud
 - **Zarr stores** (with intelligent chunking and sharding), potentially
   referenced by STAC catalogs, as the go-to approach for storing and
-  provisioing multidimensional Earth array data in the cloud
+  provisioning multidimensional Earth array data in the cloud
 - **Virtual Zarr stores**, again potentially in conjunction with STAC
   catalogs, as a cost-effective approach for cloud-enabling many legacy
   data holdings in netCDF format