6. Data (Configuring Input Data for New Study Regions)

If you are only generating a policy checklist report, refer to the Policy checklist data section below.

When configuring new study region(s), some input data will be required. These data should be stored in sub-folders within the process/data folder. Examples of required data include a population grid, an excerpt from OpenStreetMap, and a study region boundary (either an administrative boundary or an urban region from the Global Human Settlements Urban Centres Database). Other data are optional: for example, custom boundaries for aggregation (see example), GTFS transit feed data, or a completed policy checklist.

The kinds of data that can be configured for use are summarised in the table below. We have provided examples of each for Las Palmas, with the exception (at the time of writing) of a completed policy checklist.

Usage note	Data sub-folder	Purpose
Required	OpenStreetMap	an OpenStreetMap .pbf file with coverage of the region (and time) of interest; this could be an historical planet file, or a region-specific excerpt
Required	population_grids	Population distribution raster grid or vector data with coverage of the urban region of interest. The GHS population grid (R2023) is recommended (for example, the 2020 Mollweide 100m grid tiles corresponding to your area of interest, saved and extracted to a folder like `process/data/GHS/R2023A/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0`, which may be specified in `process/configuration/datasets.yml`). Take care to select the correct Epoch for your analysis before downloading!
Conditional	region_boundaries	Vector boundary for identifying study region (e.g. geopackage, geojson or shp). If a geopackage is used, a specific layer can optionally be specified and queried, as per this example
Conditional	urban_regions	Global Human Settlements Layer Urban Centres database and/or administrative boundary for urban region of interest
Optional	policy_review	Summarising results of a policy review analysis
Optional	transit_feeds	Collections of zipped GTFS feeds to represent public transport service frequency
Optional	other_custom_data	Other custom data, such as points of interest

Specific paths to data are configurable in the configuration files. However, in general, it is recommended that input data be stored within subfolders of the data folder as per the provided example.
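As a minimal sketch of the recommended layout, the sub-folders listed in the table above could be created with a few lines of Python. The helper name `create_data_subfolders` is hypothetical and is not part of the software; only the sub-folder names come from the table.

```python
from pathlib import Path

# Sub-folder names taken from the table above
DATA_SUBFOLDERS = [
    "OpenStreetMap",
    "population_grids",
    "region_boundaries",
    "urban_regions",
    "policy_review",
    "transit_feeds",
    "other_custom_data",
]


def create_data_subfolders(data_dir):
    """Create any missing input data sub-folders and return their paths."""
    created = []
    for name in DATA_SUBFOLDERS:
        folder = Path(data_dir) / name
        # exist_ok=True leaves existing folders (and their contents) untouched
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

For example, `create_data_subfolders("process/data")` would create the folders beneath the data folder if they do not already exist.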

OpenStreetMap data

OpenStreetMap data are used to represent features of interest within regions of interest.

This can be retrieved in .pbf (recommended; smaller file size) or .osm format from sites including:

https://download.geofabrik.de/
http://download.openstreetmap.fr/extracts/
https://planet.openstreetmap.org/pbf/ (whole of planet data archives; very large file sizes, but can be useful for retrieving historical data)

It is recommended that a suffix indicating the publication date be added to the name of downloaded files, as a record of the time point at which the excerpt is considered representative of the region of interest, e.g. "las_palmas-latest.osm_20230210.pbf" or "oceania_yyyymmdd.pbf", where yyyy is the year, mm is the 2-digit month, and dd is the 2-digit day.
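The naming convention can be sketched as follows. The function name `add_date_suffix` is hypothetical; it simply inserts a yyyymmdd date before the final file extension.

```python
from datetime import date
from pathlib import Path


def add_date_suffix(filename, published):
    """Insert a yyyymmdd suffix before the final file extension."""
    path = Path(filename)
    # path.stem keeps everything before the last extension, e.g. "name.osm"
    return f"{path.stem}_{published:%Y%m%d}{path.suffix}"


print(add_date_suffix("las_palmas-latest.osm.pbf", date(2023, 2, 10)))
# las_palmas-latest.osm_20230210.pbf
```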

The main consideration is that the excerpt should be as small as possible while still having complete coverage of the region(s) of interest, plus a buffer of about 1600 metres beyond the boundary (or as otherwise configured). Using a smaller rather than a larger file will speed up processing.
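A rough way to reason about coverage is to compare bounding boxes. The sketch below is an illustration only and is not how the software itself checks coverage: it assumes both bounding boxes are already expressed in the same projected coordinate reference system with units of metres, and the function name `covers_with_buffer` is hypothetical.

```python
def covers_with_buffer(extract_bounds, region_bounds, buffer_m=1600):
    """Return True if extract_bounds encloses region_bounds expanded by buffer_m.

    Both bounds are (min_x, min_y, max_x, max_y) tuples in the same
    projected coordinate reference system, with units of metres.
    """
    e_min_x, e_min_y, e_max_x, e_max_y = extract_bounds
    r_min_x, r_min_y, r_max_x, r_max_y = region_bounds
    return (
        e_min_x <= r_min_x - buffer_m
        and e_min_y <= r_min_y - buffer_m
        and e_max_x >= r_max_x + buffer_m
        and e_max_y >= r_max_y + buffer_m
    )
```

A bounding box test is conservative for irregular shapes, so it is best treated as a quick sanity check before processing rather than a guarantee.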

For example, for a project considering multiple cities in Spain, the researchers could download an excerpt for the country of Spain or for specific sub-regions like Catalunya as required to ensure that the region of interest is encompassed within the extract. Spain also contains regions outside Europe, so if sourcing from Geofabrik as per the example links above, note that the excerpts are grouped by continent and some care should be taken. For example, Las Palmas de Gran Canaria is found under the Canary Islands in Africa on Geofabrik, or a smaller PBF specifically for Las Palmas can be retrieved from download.openstreetmap.fr under 'africa/spain/canarias'. Our software provides example data for Las Palmas, and to reduce file size we prepared a clipped portion of the data around this city using only the minimum area required.

Population grid data

Population grid or vector data are used to represent the population distribution within regions of interest.

We recommend using and configuring the R2023A or more recent population data from the Global Human Settlements Layer project, with a time point of 2020 (or as otherwise appropriate for your project needs, bearing in mind the limitations of the GHSL population model) and the Mollweide (equal area, 100m) projection. Note: take care when retrieving the GHSL population data to select the correct Epoch! The default at the time of writing is a population projection for 2030; however, the most relevant dataset is currently the 2020 estimates, so make sure to select 2020.

A whole-of-planet population dataset for 2020 (from the R2022A release) can be downloaded from the following link: https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GLOBE_R2022A/GHS_POP_E2020_GLOBE_R2022A_54009_100/V1-0/GHS_POP_E2020_GLOBE_R2022A_54009_100_V1_0.zip

This file is approximately 5 GB in size, so if you are only analysing one or a few cities, it may be worth identifying the specific tiles that relate to your study region(s) using the interactive map downloader provided by GHSL and storing them as suggested above.

The report describing this data is located here.

The citation for the data is: Schiavina M., Freire S., MacManus K. (2022): GHS-POP R2022A - GHS population grid multitemporal (1975-2030). European Commission, Joint Research Centre (JRC). https://doi.org/10.2905/D6D86A90-4351-4508-99C1-CB074B022C4A

Optionally, official population data can be used if this is judged to be more accurate or meaningful for your city's analysis and for the interpretation of results by your intended audience. For example, to use the official 1km Australian population grid for 2021 (instead of the 100m grid estimates from the GHS-POP R2022A dataset for 2020 or 2025), this could be specified as per the following code block:

population:
    name: "Australian Population Grid 2021 (ABS, 2023)"
    data_dir: population_grids/Australian_Population_Grid_2021_in_TIFF_format
    data_type: raster:Int64
    resolution: 1km
    raster_band: 1
    raster_nodata: 
    pop_min_threshold: 5
    crs_name: GDA94 / Australian Albers
    crs_standard: EPSG
    crs_srid: 3577
    source_url: https://www.abs.gov.au/statistics/people/population/regional-population/2021/GEOTIFF.zip
    year_published: 2022
    year_target: 2021
    date_acquired: 20230302
    licence: CC BY 4.0
    citation: "Australian Bureau of Statistics (2022), Australian Population Grid 2021. https://www.abs.gov.au/statistics/people/population/regional-population/2021"

Optionally, population estimates from a vector data source can also be configured, for example:


population: 
    alias: catalunya_2021
    name: "Població de Catalunya georeferenciada a 1 de gener de 2021"
    data_dir: population_grids/gridpoblacio01012021/gridpoblacio_01012021.shp
    vector_population_data_field: TOTAL
    population_denominator: TOTAL
    crs_name: ETRS89 / UTM zone 31N
    crs_standard: EPSG
    crs_srid: 25831
    source_url: https://www.idescat.cat/serveis/biblioteca/docs/bib/publicacions/gridpoblacio01012021.zip
    provider: Institut d’Estadística de Catalunya
    year_published: 2023
    year_target: 2021
    date_acquired: 20230608
    licence: CC BY 4.0
    licence_url: https://creativecommons.org/licenses/by/4.0/deed.ast
    data_type: vector
    pop_min_threshold: 1
    # urban sample points intersecting grid cells with an estimated population less than this will be excluded from the analysis
    citation: "Població De Catalunya Georeferenciada a 1 De Gener De .. Barcelona: Generalitat de Catalunya. Institut d'Estadística de Catalunya, 2016. https://biblio.idescat.cat/publicacions/Record/21104"

The vector dataset in the above example contains multiple fields with estimates by age group, which opens up additional possibilities for analysing indicators for specific population sub-groups (in this case, by age). To do this, one could change vector_population_data_field to the field containing the population of interest, e.g.

    vector_population_data_field: P_15_64

In this way, age-specific indicators could be calculated and compared. Indicators based on overall population density use the estimates from the population_denominator field; hence, if you are interested in the overall population for your indicators, vector_population_data_field and population_denominator should be set to the same field.

The choice of population data is an important methodological decision for analysts, as it determines the resolution at which results are output (for example, 100m estimated population, 1000m official population statistics, or administrative areas of varying dimensions). Selecting the data source deliberately helps ensure that your indicator results serve their intended purpose and audience.

Region boundary data

Region boundary data are used to identify the study region(s) of interest. These may be retrievable from a government open data portal, or the UN OCHA Humanitarian Data Exchange.

Additional custom region boundaries can be used for custom aggregations, as per the example:

###########
## Optional custom aggregation to additional areas of interest (e.g. neighbourhoods, suburbs, specific developments):
custom_aggregations:
    ## Name for this aggregation layer
    ## The name is followed by a colon, indicating that a list of details follows
    school_districts_grid_pop:
        ## path to data relative to the project data folder
        data: "region_boundaries/Example/Las Palmas excerpt- gobcan_educacion_areainfluenciacentrosecundaria.geojson"
        ## The field used as a unique identifier
        id: 'Codigo'
        ## A list of column field names to be retained
        keep_columns: Denominaci, cod_postal
        ## The indicator layer to be aggregated ("point" or "grid")
        ## Aggregation is based on the average of intersecting results
        ## unless the aggregate_within_distance parameter is defined (see alternative example below)
        aggregation_source: grid
        ## The variable used for weighting (e.g. 'pop_est' for population when using the grid; leave blank or "false" if using sample points)
        weight: pop_est
        note: "Example of aggregating indicators for high school catchment districts within Las Palmas, using the intersection with the population grid and taking the population weighted average of indicators.  Boundary data was derived from data sourced from the open data portal of the Gobierno de Canarias under CC BY 4.0 licence terms: https://opendata.sitcan.es/dataset/centros-educativos/resource/ea650255-c6ea-48c1-84e8-547735624017 (last updated 31 May 2023)."
    ## an example for aggregating for buildings represented in OpenStreetMap
    buildings_osm_30m:
        data: "OSM:building is not NULL"
        keep_columns: building
        ## Distance in metres within which to take the average when aggregating
        ## (see note)
        aggregate_within_distance: 30
        aggregation_source: point
        note: "Example of aggregating using buildings extracted from the configured OpenStreetMap data, taking the average of sample point estimates taken along the pedestrian network within 30m.  This has been done because the point indicators were sampled along the pedestrian network and are therefore unlikely to intersect with buildings.  By taking the average of points within some reasonable distance, the result is like a moving window average that should provide a reasonable representation of the immediate neighbourhood milieu surrounding the building."
###########
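The two aggregation approaches described in the notes above can be sketched in plain Python. This is an illustration of the arithmetic only, not the software's implementation: the function names are hypothetical, coordinates are assumed to be in a projected system with units of metres, and the distance is measured to a single target point rather than to a building footprint.

```python
from math import hypot


def weighted_average(values, weights):
    """Population-weighted mean of grid cell indicator values."""
    total_weight = sum(weights)
    if total_weight == 0:
        return None  # no population in the intersecting cells
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def average_within_distance(target_xy, points, distance_m):
    """Mean indicator value of sample points within distance_m of a target.

    points is a list of (x, y, value) tuples in projected metres.
    """
    nearby = [
        value
        for x, y, value in points
        if hypot(x - target_xy[0], y - target_xy[1]) <= distance_m
    ]
    return sum(nearby) / len(nearby) if nearby else None
```

For instance, two grid cells with indicator values 10 and 20 and populations 1 and 3 give a weighted average of (10 × 1 + 20 × 3) / 4 = 17.5, whereas the unweighted mean would be 15.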

Urban region data

Urban region data are used to optionally identify and restrict the analysis to the urban portion of a study region, and optionally can also be used to identify an urban region of interest (e.g. an urban agglomeration surrounding a major city). Additionally, city-level urban covariates contained within the database may be linked with the final estimates (e.g. air pollution estimates).

This is done using the Global Human Settlements Urban Centres Database, for example GHS_STAT_UCDB2015MT_GLOBE_R2019A_V1_2.gpkg (or a more recent release, if available), which may be retrieved from https://ghsl.jrc.ec.europa.eu/download.php?ds=ucdb.

The dataset linked above represents urban centres in 2015 and may be cited as: Florczyk A., Corbane C., Schiavina M., Pesaresi M., Maffenini L., Melchiorri, M., Politis P., Sabo F., Freire S., Ehrlich D., Kemper T., Tommasi P., Airaghi D., Zanchetta L. (2019) GHS Urban Centre Database 2015, multitemporal and multidimensional attributes, R2019A. European Commission, Joint Research Centre (JRC). PID: https://data.jrc.ec.europa.eu/dataset/53473144-b88c-44bc-b4a3-4583ed1f547e

Policy checklist data

The global indicators software has been designed to report on both policy and spatial indicator results. Once the policy checklist has been completed for a city, save it in the project data folder (e.g. within data/policy_review). Using the GHSCI web app interface, you can then select this completed file to view, query and generate a PDF summary report.

GTFS transit feed data

Collections of General Transit Feed Specification (GTFS) feeds are used to represent public transport service frequency. See the GTFS reference for details on what is required for GTFS feeds. More than one transit operator may operate in the region of interest, so it is important to aim to have full coverage of both the region and the operators, without overlap of services, to represent the service frequency of public transport stops in a city.

Transit feed data may be available from Government open data portals, or otherwise from aggregator sites such as https://transitfeeds.com/ and https://www.transit.land/.

We recommend that zipped feeds be stored in a region-specific GTFS parent folder, e.g. "Spain - Valencia", that contains one or more zipped GTFS feeds, e.g. gtfs_spain_valencia_EMT_20190403.zip and gtfs_spain_valencia_MetroValencia_20190403.zip. The parent folder and specific feeds are configured in the region configuration file. For example,

## Optional set-up for General Transit Feed Specification (GTFS) transit data.
## GTFS feed data is used to evaluate access to public transport stops with regular weekday daytime service
## For cities with no GTFS feeds identified, this may be left commented out.
gtfs_feeds:
    ## City-specific parent folder in the 'process/data/transit_feeds' directory
    folder: Example
    ## list of zipped GTFS feeds saved in the above folder
    gtfs_es_las_palmas_de_gran_canaria_guaguas_20230222.zip:
        ## Name of agency that published this data
        gtfs_provider: Guaguas
        ## Year the data was published
        gtfs_year: 2023
        ## Source URL for the data
        gtfs_url: http://www.guaguas.com/transit/google_transit.zip
        ## The start date of a representative period for analysis
        ## (outside school holidays and extreme weather events), e.g. Spring/Summer
        ## for Northern Hemisphere: 20230405
        ## for Southern Hemisphere: 20231008
        start_date_mmdd: 20230405
        ## The end date of a representative period for analysis
        ## (outside school holidays and extreme weather events), e.g. Spring/Summer
        ## for Northern Hemisphere: 20230605
        ## for Southern Hemisphere: 20231205
        end_date_mmdd: 20230605
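Because the start and end values in the example above are written as yyyymmdd, a quick check before running an analysis can catch typos. The following is a minimal sketch under that assumption; the function name `parse_analysis_period` is hypothetical and not part of the software.

```python
from datetime import datetime


def parse_analysis_period(start, end):
    """Parse yyyymmdd values and check that the period is valid."""
    # strptime raises ValueError for malformed or impossible dates
    start_date = datetime.strptime(str(start), "%Y%m%d").date()
    end_date = datetime.strptime(str(end), "%Y%m%d").date()
    if end_date < start_date:
        raise ValueError("end date precedes start date")
    return start_date, end_date


start, end = parse_analysis_period(20230405, 20230605)
print((end - start).days)  # 61
```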

Some GTFS feeds do not adhere to the reference standard, so some customisation is possible to accommodate the mapping of alternate route_types and agency_id values. The defaults for these fields can be seen in the datasets.yml file here. These can be overridden for each mode. For example, the feed below describes service schedules for a semi-Metro and uses a route type of 400; this was mapped as a value for the 'Tram' mode to allow analysis to progress:

gtfs_feeds:
    folder: Marc issues/Malaga
    20230519_130136_Metro_Malaga:
      gtfs_provider: Metro de Málaga
      gtfs_year: 2023
      start_date_mmdd: 20230519
      end_date_mmdd: 20230831
      modes:
        Tram: {'route_types': [400]}

Multiple values can also be specified, e.g.,

      modes:
        Bus: {'route_types': [700,712,714]}

However, in most cases, the route_type values for a feed (found in the routes.txt file) should correspond to the defaults of the specification, in which case the modes parameter can be omitted as per the example configuration. This will draw upon the default settings.
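The override behaviour can be pictured as a per-mode merge of feed settings over defaults. The sketch below is an assumption about the general idea, not the software's code: `resolve_modes` is a hypothetical name, and the `DEFAULT_MODES` values are illustrative placeholders based on the basic GTFS route_type codes (0 tram, 1 metro, 2 rail, 3 bus) rather than the actual contents of datasets.yml.

```python
# Illustrative defaults only; the real defaults live in datasets.yml
DEFAULT_MODES = {
    "Tram": {"route_types": [0]},
    "Metro": {"route_types": [1]},
    "Rail": {"route_types": [2]},
    "Bus": {"route_types": [3]},
}


def resolve_modes(feed_config, defaults=DEFAULT_MODES):
    """Return per-mode settings, with feed-level 'modes' overriding defaults."""
    # Copy each mode's settings so the defaults are never modified in place
    modes = {mode: dict(settings) for mode, settings in defaults.items()}
    for mode, settings in feed_config.get("modes", {}).items():
        modes.setdefault(mode, {}).update(settings)
    return modes


feed = {"gtfs_year": 2023, "modes": {"Tram": {"route_types": [400]}}}
print(resolve_modes(feed)["Tram"])  # {'route_types': [400]}
```

A feed with no `modes` entry simply receives the defaults unchanged.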

Data output folder

Generated resources are saved to study region-specific codename sub-folders within the data/_study_region_outputs directory.

The following provides an indicative list of the contents of this folder (words such as codename, resolution and yyyy-mm-dd are placeholders that will vary depending on configuration):

Item	Type	Description
_web reports	Folder	Contains generated policy and spatial indicator reports, optionally in multiple languages
figures	Folder	Contains generated maps and figures
__region name__codename_processing_log.txt	text file	A text file that is progressively appended to with the screen outputs for analyses that are not otherwise displayed. This contains a record of processing, and is useful when debugging if something has gone awry with a particular configuration or supplied data
_parameters.yml	text file	A text file containing records of the most recent configuration analysed (on detection of a new version, older versions will be retained with a suffix indicating the date at which it was current)
analysis_report_yyyy-mm-dd_hhmm.pdf	PDF file	A PDF report summarising approach taken for analysis of the core set of spatial indicators and generation of associated resources
compare_reference codename_comparison_codename_yyyy-mm-dd_hhmm.csv	CSV file	A CSV spreadsheet containing a comparison of summary indicator results for a reference study region and a comparison study region (generated as a result of running the compare process)
codename_1600m_buffer.gpkg	Geopackage file	A geopackage containing derived study region features of interest used in analyses, and including grid and overall summary results for this region
codenameindicators{resolution}_yyyy.csv	CSV file	Grid or small area summary results of indicator analysis for this region
codename_indicators_region.csv	CSV file	Overall summary results of indicator analysis for this region
codename_metadata.xml	XML file	XML metadata (ISO19115)
codename_metadata.yml	YML file	YML metadata
output_data_dictionary.csv	CSV file	CSV data dictionary
output_data_dictionary.xlsx	XLSX file	Formatted Excel data dictionary
codename_osm_yyyymmdd.pbf	PBF file	An excerpt from OpenStreetMap for this buffered study region as configured
poly_codename.poly	text file	A polygon boundary file; this is generated for the buffered urban region of interest as per configuration in regions/codename.yml, and is used to excerpt a portion of OpenStreetMap for this region from the configured input data
population_resolution_codename_project epsg code.tif	TIF file	A population raster for this buffered study region, excerpted from the input data, in the project's coordinate reference system
population_resolution_codename_source crs.tif	TIF file	A population raster for this buffered study region, excerpted from the input data, in the coordinate reference system of the input population data (e.g. Mollweide, in the case of the recommended GHS-POP data)
