Update documentation to include tsbs_load and timestream
Blagoj Atanasovski authored and atanasovskib committed Dec 1, 2020
1 parent 0cf8054 commit 354f236
Showing 25 changed files with 282 additions and 6 deletions.
42 changes: 38 additions & 4 deletions README.md
@@ -14,6 +14,7 @@ Current databases supported:
+ MongoDB [(supplemental docs)](docs/mongo.md)
+ SiriDB [(supplemental docs)](docs/siridb.md)
+ TimescaleDB [(supplemental docs)](docs/timescaledb.md)
+ Timestream [(supplemental docs)](docs/timestream.md)
+ VictoriaMetrics [(supplemental docs)](docs/victoriametrics.md)

## Overview
@@ -76,6 +77,7 @@ cases are implemented for each database:
|MongoDB|X|
|SiriDB|X|
|TimescaleDB|X|X|
|Timestream|X||
|VictoriaMetrics|||

¹ Does not support the `groupby-orderby-limit` query
@@ -88,7 +90,8 @@ query execution performance. (It currently does not measure
concurrent insert and query performance, which is a future priority.)
To accomplish this in a fair way, the data to be inserted and the
queries to run are pre-generated and native Go clients are used
wherever possible to connect to each database (e.g., `mgo` for MongoDB,
the AWS SDK for Timestream).

Although the data is randomly generated, TSBS data and queries are
entirely deterministic. By supplying the same PRNG (pseudo-random number
@@ -217,6 +220,37 @@ A full list of query types can be found in

### Benchmarking insert/write performance

TSBS has two ways to benchmark insert/write performance:
* On-the-fly simulation and loading with `tsbs_load`
* Pre-generating data to a file and loading it with either `tsbs_load` or the
database-specific executables `tsbs_load_*`

#### Using the unified `tsbs_load` executable

The `tsbs_load` executable can load data into any of the supported databases.
It can use a pre-generated data file as input, or simulate the data on the
fly.

Start by generating a config YAML file populated with the default
values for each property:
```shell script
$ tsbs_load config --target=<db-name> --data-source=[FILE|SIMULATOR]
```
For example, to generate a config for TimescaleDB that loads data from a file:
```shell script
$ tsbs_load config --target=timescaledb --data-source=FILE
Wrote example config to: ./config.yaml
```

You can then run tsbs_load with the generated config file with:
```shell script
$ tsbs_load load timescaledb --config=./config.yaml
```

For more details on how to use `tsbs_load`, check out the [supplemental docs](docs/tsbs_load.md).

#### Using the database specific `tsbs_load_*` executables

TSBS measures insert/write performance by taking the data generated in
the previous step and using it as input to a database-specific command
line program. To the extent that insert programs can be shared, we have
@@ -241,15 +275,15 @@ cat /tmp/timescaledb-data.gz | gunzip | tsbs_load_timescaledb \
```

For simpler testing, especially locally, we also supply
`scripts/load_<database>.sh` for convenience with many of the flags set
`scripts/load/load_<database>.sh` for convenience with many of the flags set
to a reasonable default for some of the databases.
So for loading into TimescaleDB, ensure that TimescaleDB is running and
then use:
```bash
# Will insert using 2 clients, batch sizes of 10k, from a file
# named `timescaledb-data.gz` in directory `/tmp`
$ NUM_WORKERS=2 BATCH_SIZE=10000 BULK_DATA_DIR=/tmp \
scripts/load_timescaledb.sh
scripts/load/load_timescaledb.sh
```

This will create a new database called `benchmark` where the data is
@@ -263,7 +297,7 @@ Example for writing to remote host using `load_timescaledb.sh`:
# named `timescaledb-data.gz` in directory `/tmp`
$ NUM_WORKERS=2 BATCH_SIZE=10000 BULK_DATA_DIR=/tmp DATABASE_HOST=remotehostname
DATABASE_USER=user DATABASE \
scripts/load_timescaledb.sh
scripts/load/load_timescaledb.sh
```

---
76 changes: 76 additions & 0 deletions docs/timestream.md
@@ -0,0 +1,76 @@
# TSBS Supplemental Guide: Timestream

Amazon Timestream is a serverless time series database service.
This supplemental guide explains how the data generated for TSBS is stored,
additional flags available when using the data importer (`tsbs_load load timestream`),
and additional flags available for the query runner (`tsbs_run_queries_timestream`). **This
should be read *after* the main README.**

## Data format

Data generated by `tsbs_generate_data` for Timestream is serialized in a
"pseudo-CSV" format, along with a custom header at the beginning. The
header is several lines long:
* one line composed of a comma-separated list of tag labels, with the literal string `tags` as the first value in the list
* one or more lines composed of a comma-separated list of field labels, with the table name as the first value in the list
* a blank line

An example for the `cpu-only` use case:
```text
tags,hostname,region,datacenter,rack,os,arch,team,service,service_version,service_environment
cpu,usage_user,usage_system,usage_idle,usage_nice,usage_iowait,usage_irq,usage_softirq,usage_steal,usage_guest,usage_guest_nice
```

Following this, each reading is composed of two rows:
1. a comma-separated list of tag values for the reading, with the literal string `tags` as the first value in the list
1. a comma-separated list of field values for the reading, with the table the reading belongs to as the first value and the timestamp as the second value

An example for the `cpu-only` use case:
```text
tags,host_0,eu-central-1,eu-central-1b,21,Ubuntu15.10,x86,SF,6,0,test
cpu,1451606400000000000,58.1317132304976170,2.6224297271376256,24.9969495069947882,61.5854484633778867,22.9481393231639395,63.6499207106198313,6.4098777048301052,44.8799140503027445,80.5028770761136201,38.2431182911542820
```
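As a minimal sketch, the two-row structure can be inspected with standard tools. The sample file below is a trimmed illustration (only a few tags and fields), not full TSBS output:

```shell
# Hypothetical sketch: inspect a TSBS pseudo-CSV file with standard tools.
# The sample below is a trimmed, illustrative file, not full TSBS output.
DATA_FILE=$(mktemp)
cat > "$DATA_FILE" <<'EOF'
tags,hostname,region
cpu,usage_user,usage_system

tags,host_0,eu-central-1
cpu,1451606400000000000,58.13,2.62
EOF

# The header ends at the first blank line.
HEADER_LINES=$(awk '/^$/{print NR-1; exit}' "$DATA_FILE")
# Each reading takes two rows (tags row + fields row), so
# readings = non-empty data rows / 2.
DATA_ROWS=$(awk -v h="$HEADER_LINES" 'NR > h + 1 && NF' "$DATA_FILE" | wc -l)
READINGS=$((DATA_ROWS / 2))
echo "header lines: $HEADER_LINES, readings: $READINGS"
rm -f "$DATA_FILE"
```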

---

## `tsbs_load load timestream` Additional Flags

#### loader.db-specific.aws-region (type: `string`, default `us-east-1`)

AWS region where the database is located.

#### loader.db-specific.use-common-attributes (type: `boolean`, default `true`)

If true, the Timestream client makes write requests using common attributes.
If false, each value is written as a separate record, and requests are sent
100 records at a time.

#### loader.db-specific.hash-property (type: `string`, default `hostname`)

Dimension to use when hashing points to different workers.

#### loader.db-specific.use-current-time (type: `boolean`, default: `false`)

Use the current local timestamp when creating the records to load.
Useful when you don't want to worry about the retention period vs. the simulated period.

#### loader.db-specific.mag-store-retention-in-days (type: `int`, default: `180`)

The duration for which data must be stored in the magnetic store.

#### loader.db-specific.mem-store-retention-in-hours (type: `int`, default: `12`)

The duration for which data must be stored in the memory store.

---
## `tsbs_generate_queries` required `--db-name` flag

Timestream requires the database name to be part of the WHERE clause
of every query, so the `--db-name` flag is required.

---
## `tsbs_run_queries_timestream` Additional Flags

#### `-aws-region` (type: `string`, default: `us-east-1`)

AWS region where the database is located.
109 changes: 109 additions & 0 deletions docs/tsbs_load.md
@@ -0,0 +1,109 @@
# Supplemental Guide for `tsbs_load`

The `tsbs_load` executable can benchmark data ingestion
for all the implemented databases.

## Generating a config file

`tsbs_load` uses YAML files to specify the configuration for
running the load benchmark.

The config file is separated into two top-level sections:
```yaml
data-source:
...
loader:
...
```
* `data-source` contains the configuration for where to
read the data from (`type: SIMULATOR` or `type: FILE`)
  * For `SIMULATOR` the configuration specifies the time range to be simulated,
  the use case, scale, and other properties of the data
  * For `FILE` the configuration only specifies the location of the file
  pre-generated with `tsbs_generate_data`
* `loader` contains the configuration for loading the data. Two sub-sections are
important here: `db-specific` and `runner`
  * The `db-specific` configuration varies depending on the target database;
  for TimescaleDB it contains information about user, password, and SSL mode, while
  for InfluxDB it covers backoff interval, replication factor, etc.
  * The `runner` configuration specifies the number of concurrent workers to use,
  batch size, hashing, and so on
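Putting the sections together, a trimmed config might look roughly like this (a sketch only; the exact property names and values vary by target database and TSBS version, so always start from a generated file):

```yaml
# Hypothetical, trimmed sketch of a tsbs_load config file.
data-source:
  type: FILE
  file:
    location: /tmp/timescaledb-data.gz
loader:
  db-specific:
    host: localhost
    user: postgres
    ssl-mode: disable
  runner:
    db-name: benchmark
    batch-size: 10000
    workers: 2
```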

To generate an example configuration file for a specific database, run
```shell script
$ tsbs_load config --target=<db-name> --data-source=[FILE|SIMULATOR]
```
setting `--target` to one of the implemented databases and `--data-source` to
`FILE` or `SIMULATOR`.

⚠️ **The generated config file will be populated with the default values for each property.**

The generated config file is saved as `./config.yaml`.

## On the fly simulation and load with `data-source: SIMULATOR`

When you run `tsbs_generate_data`, a simulator is created for
the selected use case and the simulated data points are serialized
to a file. `tsbs_load` utilizes the same simulators, but the
simulated points are piped directly to the worker clients that send batches
of data to the databases.

The properties you configure in the YAML file correspond to the
flags you would specify when running `tsbs_generate_data`.
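As an illustration, a simulator-backed `data-source` section might look like the sketch below. The key names mirror `tsbs_generate_data` flags and are assumptions; check a generated config for the exact ones:

```yaml
# Hypothetical sketch: simulator-backed data source.
data-source:
  type: SIMULATOR
  simulator:
    use-case: cpu-only
    scale: 100                           # number of hosts to simulate
    seed: 123                            # PRNG seed, for deterministic data
    timestamp-start: "2020-01-01T00:00:00Z"
    timestamp-end: "2020-01-02T00:00:00Z"
    log-interval: 10s                    # interval between readings per host
```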

You can run `tsbs_load` with
```shell script
$ tsbs_load load <db_name> --config=./path-to-config.yaml
```
where `<db_name>` is one of the implemented databases. You can run
```shell script
$ tsbs_load load --help
```
for a list of the available databases.

## Information about a property and overriding

The YAML file generated with `tsbs_load config` does not contain
information about what each property represents. You can easily discover
more details about each property by running:

```shell script
$ tsbs_load load --help
```
This will list all the available flags for all databases, including the flags
for `data-source` and `loader.runner`. The `--loader.runner.db-name` flag
corresponds to the property:
```yaml
loader:
runner:
db-name: some-db
```
in the YAML config file. The type, description, and default
value appear next to the flag name, e.g.:

```string, Name of database (default "benchmark")```

### Information about database specific flags

Some of the properties are only valid for specific databases. These
properties go under the `loader.db-specific` section. To view information
about them you can run:
```shell script
$ tsbs_load load <db_name> --help
```

For example, for TimescaleDB you can see the following:
```shell script
$ tsbs_load load timescaledb --help
...
      --loader.db-specific.chunk-time duration
            Duration that each chunk should represent, e.g., 12h (default 12h0m0s)
...
```

### Overriding values

* Each property has a default value, used if not otherwise overridden
* An entry in the config YAML file overrides the default value
* A flag passed at runtime overrides an entry in the YAML file
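This precedence (flag > config entry > default) can be illustrated with plain shell parameter expansion; the variable names below are illustrative, not part of `tsbs_load`:

```shell
# Hypothetical sketch of tsbs_load's precedence rule:
# runtime flag > config file entry > built-in default.
DEFAULT_DB_NAME="benchmark"      # built-in default
CONFIG_DB_NAME="from-config"     # value read from config.yaml (assumed)
FLAG_DB_NAME="from-flag"         # value passed as --loader.runner.db-name (assumed)

# With the flag set, it wins:
EFFECTIVE=${FLAG_DB_NAME:-${CONFIG_DB_NAME:-$DEFAULT_DB_NAME}}
echo "with flag:    $EFFECTIVE"

# Without the flag, the config entry wins over the default:
FLAG_DB_NAME=""
EFFECTIVE=${FLAG_DB_NAME:-${CONFIG_DB_NAME:-$DEFAULT_DB_NAME}}
echo "without flag: $EFFECTIVE"
```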
2 changes: 1 addition & 1 deletion pkg/targets/timestream/config.go
@@ -36,7 +36,7 @@ func targetSpecificFlags(flagPrefix string, flagSet *pflag.FlagSet) {
)
flagSet.Bool(
flagPrefix+"use-current-time",
true,
false,
"Use the local current timestamp when generating the records to load")
flagSet.Int64(
"mag-store-retention-in-days",
2 changes: 1 addition & 1 deletion pkg/targets/victoriametrics/implemented_target.go
@@ -31,7 +31,7 @@ func (vm vmTarget) Serializer() serialize.PointSerializer {
}

func (vm vmTarget) TargetSpecificFlags(flagPrefix string, flagSet *pflag.FlagSet) {
pflag.String(
flagSet.String(
flagPrefix+"urls",
"http://localhost:8428/write",
"Comma-separated list of VictoriaMetrics ingestion URLs(single-node or VMInsert)",
19 files renamed without changes.
57 changes: 57 additions & 0 deletions scripts/run_queries/run_queries_timestream.sh
@@ -0,0 +1,57 @@
#!/bin/bash

# Exit immediately if a command exits with a non-zero status.
set -e

# Ensure runner is available
EXE_FILE_NAME=${EXE_FILE_NAME:-$(which tsbs_run_queries_timestream)}
if [[ -z "$EXE_FILE_NAME" ]]; then
echo "tsbs_run_queries_timestream not available. It is not specified explicitly and not found in \$PATH"
exit 1
fi

# AWS region of database
AWS_REGION=${AWS_REGION:-"us-east-1"}

# Queries folder
BULK_DATA_DIR=${BULK_DATA_DIR:-"/tmp/bulk_queries"}

# How many queries to run (0 means no limit)
MAX_QUERIES=${MAX_QUERIES:-"0"}

# How many concurrent workers will run queries - match num of cores, or default to 4
NUM_WORKERS=${NUM_WORKERS:-$(grep -c ^processor /proc/cpuinfo 2> /dev/null || echo 4)}


for FULL_DATA_FILE_NAME in ${BULK_DATA_DIR}/queries_timestream*; do
# $FULL_DATA_FILE_NAME: /full/path/to/file_with.ext
# $DATA_FILE_NAME: file_with.ext
# $DIR: /full/path/to
# $EXTENSION: ext
# NO_EXT_DATA_FILE_NAME: file_with

DATA_FILE_NAME=$(basename -- "${FULL_DATA_FILE_NAME}")
DIR=$(dirname "${FULL_DATA_FILE_NAME}")
EXTENSION="${DATA_FILE_NAME##*.}"
NO_EXT_DATA_FILE_NAME="${DATA_FILE_NAME%.*}"

# Several options on how to name results file
#OUT_FULL_FILE_NAME="${DIR}/result_${DATA_FILE_NAME}"
OUT_FULL_FILE_NAME="${DIR}/result_${NO_EXT_DATA_FILE_NAME}.out"
#OUT_FULL_FILE_NAME="${DIR}/${NO_EXT_DATA_FILE_NAME}.out"

if [ "${EXTENSION}" == "gz" ]; then
GUNZIP="gunzip"
else
GUNZIP="cat"
fi

echo "Running ${DATA_FILE_NAME}"
cat $FULL_DATA_FILE_NAME \
| $GUNZIP \
| $EXE_FILE_NAME \
--max-queries $MAX_QUERIES \
--workers $NUM_WORKERS \
        --aws-region $AWS_REGION \
| tee $OUT_FULL_FILE_NAME
done
