Update documentation to include tsbs_load and timestream
Blagoj Atanasovski authored and atanasovskib committed Dec 1, 2020
1 parent 0cf8054 commit 354f236
Showing 25 changed files with 282 additions and 6 deletions.
42 changes: 38 additions & 4 deletions README.md
@@ -14,6 +14,7 @@ Current databases supported:
+ MongoDB [(supplemental docs)](docs/mongo.md)
+ SiriDB [(supplemental docs)](docs/siridb.md)
+ TimescaleDB [(supplemental docs)](docs/timescaledb.md)
+ Timestream [(supplemental docs)](docs/timestream.md)
+ VictoriaMetrics [(supplemental docs)](docs/victoriametrics.md)

## Overview
@@ -76,6 +77,7 @@ cases are implemented for each database:
|MongoDB|X|
|SiriDB|X|
|TimescaleDB|X|X|
|Timestream|X||
|VictoriaMetrics|||

¹ Does not support the `groupby-orderby-limit` query
@@ -88,7 +90,8 @@ query execution performance. (It currently does not measure
concurrent insert and query performance, which is a future priority.)
To accomplish this in a fair way, the data to be inserted and the
queries to run are pre-generated and native Go clients are used
wherever possible to connect to each database (e.g., `mgo` for MongoDB,
the AWS SDK for Timestream).

Although the data is randomly generated, TSBS data and queries are
entirely deterministic. By supplying the same PRNG (pseudo-random number
@@ -217,6 +220,37 @@ A full list of query types can be found in

### Benchmarking insert/write performance

TSBS has two ways to benchmark insert/write performance:
* On-the-fly simulation and loading with `tsbs_load`
* Pre-generating data to a file and loading it with either `tsbs_load` or the
database-specific executables `tsbs_load_*`

#### Using the unified `tsbs_load` executable

The `tsbs_load` executable can load data into any of the supported databases.
It can use a pre-generated data file as input, or simulate the data on the
fly.

Start by generating a config YAML file populated with the default
values for each property:
```shell script
$ tsbs_load config --target=<db-name> --data-source=[FILE|SIMULATOR]
```
For example, to generate a config for TimescaleDB that loads data from a file:
```shell script
$ tsbs_load config --target=timescaledb --data-source=FILE
Wrote example config to: ./config.yaml
```

You can then run tsbs_load with the generated config file with:
```shell script
$ tsbs_load load timescaledb --config=./config.yaml
```

For more details on how to use `tsbs_load`, check out the [supplemental docs](docs/tsbs_load.md).

#### Using the database specific `tsbs_load_*` executables

TSBS measures insert/write performance by taking the data generated in
the previous step and using it as input to a database-specific command
line program. To the extent that insert programs can be shared, we have
@@ -241,15 +275,15 @@ cat /tmp/timescaledb-data.gz | gunzip | tsbs_load_timescaledb \
```

For simpler testing, especially locally, we also supply
`scripts/load_<database>.sh` for convenience with many of the flags set
`scripts/load/load_<database>.sh` for convenience with many of the flags set
to a reasonable default for some of the databases.
So for loading into TimescaleDB, ensure that TimescaleDB is running and
then use:
```bash
# Will insert using 2 clients, batch sizes of 10k, from a file
# named `timescaledb-data.gz` in directory `/tmp`
$ NUM_WORKERS=2 BATCH_SIZE=10000 BULK_DATA_DIR=/tmp \
scripts/load_timescaledb.sh
scripts/load/load_timescaledb.sh
```

This will create a new database called `benchmark` where the data is
@@ -263,7 +297,7 @@ Example for writing to remote host using `load_timescaledb.sh`:
# named `timescaledb-data.gz` in directory `/tmp`
$ NUM_WORKERS=2 BATCH_SIZE=10000 BULK_DATA_DIR=/tmp DATABASE_HOST=remotehostname
DATABASE_USER=user DATABASE \
scripts/load_timescaledb.sh
scripts/load/load_timescaledb.sh
```

---
76 changes: 76 additions & 0 deletions docs/timestream.md
@@ -0,0 +1,76 @@
# TSBS Supplemental Guide: Timestream

Amazon Timestream is a serverless time series database service.
This supplemental guide explains how the data generated for TSBS is stored,
additional flags available when using the data importer (`tsbs_load load timestream`),
and additional flags available for the query runner (`tsbs_run_queries_timestream`). **This
should be read *after* the main README.**

## Data format

Data generated by `tsbs_generate_data` for Timestream is serialized in a
"pseudo-CSV" format, along with a custom header at the beginning. The
header is several lines long:
* one line composed of a comma-separated list of tag labels, with the literal string `tags` as the first value in the list
* one or more lines composed of a comma-separated list of field labels, with the table name as the first value in the list
* a blank line

An example for the `cpu-only` use case:
```text
tags,hostname,region,datacenter,rack,os,arch,team,service,service_version,service_environment
cpu,usage_user,usage_system,usage_idle,usage_nice,usage_iowait,usage_irq,usage_softirq,usage_steal,usage_guest,usage_guest_nice
```

Following this, each reading is composed of two rows:
1. a comma-separated list of tag values for the reading, with the literal string `tags` as the first value in the list
1. a comma-separated list of field values for the reading, with the table the reading belongs to as the first value and the timestamp as the second value

An example for the `cpu-only` use case:
```text
tags,host_0,eu-central-1,eu-central-1b,21,Ubuntu15.10,x86,SF,6,0,test
cpu,1451606400000000000,58.1317132304976170,2.6224297271376256,24.9969495069947882,61.5854484633778867,22.9481393231639395,63.6499207106198313,6.4098777048301052,44.8799140503027445,80.5028770761136201,38.2431182911542820
```
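As a minimal sketch, the two-row structure can be inspected with standard tools. The sample file below is a trimmed illustration (only a few tags and fields), not full TSBS output:

```shell
# Hypothetical sketch: inspect a TSBS pseudo-CSV file with standard tools.
# The sample below is a trimmed, illustrative file, not full TSBS output.
DATA_FILE=$(mktemp)
cat > "$DATA_FILE" <<'EOF'
tags,hostname,region
cpu,usage_user,usage_system

tags,host_0,eu-central-1
cpu,1451606400000000000,58.13,2.62
EOF

# The header ends at the first blank line.
HEADER_LINES=$(awk '/^$/{print NR-1; exit}' "$DATA_FILE")
# Each reading takes two rows (tags row + fields row), so
# readings = non-empty data rows / 2.
DATA_ROWS=$(awk -v h="$HEADER_LINES" 'NR > h + 1 && NF' "$DATA_FILE" | wc -l)
READINGS=$((DATA_ROWS / 2))
echo "header lines: $HEADER_LINES, readings: $READINGS"
rm -f "$DATA_FILE"
```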

---

## `tsbs_load load timestream` Additional Flags

#### loader.db-specific.aws-region (type: `string`, default `us-east-1`)

AWS region where the database is located.

#### loader.db-specific.use-common-attributes (type: `boolean`, default `true`)

If true, the Timestream client makes write requests using common attributes.
If false, each value is written as a separate record, and requests are sent
100 records at a time.

#### loader.db-specific.hash-property (type: `string`, default `hostname`)

Dimension to use when hashing points to different workers.

#### loader.db-specific.use-current-time (type: `boolean`, default: `false`)

Use the current local timestamp when creating the records to load.
Useful when you don't want to worry about the retention period vs. the simulated period.

#### loader.db-specific.mag-store-retention-in-days (type: `int`, default: `180`)

The duration for which data must be stored in the magnetic store.

#### loader.db-specific.mem-store-retention-in-hours (type: `int`, default: `12`)

The duration for which data must be stored in the memory store.

---
## `tsbs_generate_queries` required `--db-name` flag

Timestream requires the database name to be part of the WHERE clause
of every query, so the `--db-name` flag is required.

---
## `tsbs_run_queries_timestream` Additional Flags

#### `-aws-region` (type: `string`, default: `us-east-1`)

AWS region where the database is located.
109 changes: 109 additions & 0 deletions docs/tsbs_load.md
@@ -0,0 +1,109 @@
# Supplemental Guide for `tsbs_load`

The `tsbs_load` executable can benchmark data ingestion
for all the implemented databases.

## Generating a config file

`tsbs_load` uses YAML files to specify the configuration for
running the load benchmark.

The config file is separated into two top-level sections:
```yaml
data-source:
...
loader:
...
```
* `data-source` contains the configuration for where to
read the data from (`type: SIMULATOR` or `type: FILE`)
  * For `SIMULATOR` the configuration specifies the time range to be simulated,
  the use case, scale, and other properties of the data
  * For `FILE` the configuration only specifies the location of the file
  pre-generated with `tsbs_generate_data`
* `loader` contains the configuration for loading the data. Two sub-sections are
important here: `db-specific` and `runner`
  * The `db-specific` configuration varies depending on the target database;
  for TimescaleDB it contains information about user, password, and SSL mode, while
  for InfluxDB it covers backoff interval, replication factor, etc.
  * The `runner` configuration specifies the number of concurrent workers to use,
  batch size, hashing, and so on
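Putting the sections together, a trimmed config might look roughly like this (a sketch only; the exact property names and values vary by target database and TSBS version, so always start from a generated file):

```yaml
# Hypothetical, trimmed sketch of a tsbs_load config file.
data-source:
  type: FILE
  file:
    location: /tmp/timescaledb-data.gz
loader:
  db-specific:
    host: localhost
    user: postgres
    ssl-mode: disable
  runner:
    db-name: benchmark
    batch-size: 10000
    workers: 2
```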

To generate an example configuration file for a specific database, run
```shell script
$ tsbs_load config --target=<db-name> --data-source=[FILE|SIMULATOR]
```
setting `--target` to one of the implemented databases and `--data-source` to
`FILE` or `SIMULATOR`.

⚠️ **The generated config file will be populated with the default values for each property.**

The generated config file is saved as `./config.yaml`.

## On the fly simulation and load with `data-source: SIMULATOR`

When you run `tsbs_generate_data`, a simulator is created for
the selected use case and the simulated data points are serialized
to a file. `tsbs_load` utilizes the same simulators, but the
simulated points are piped directly to the worker clients that send batches
of data to the databases.

The properties you configure in the YAML file correspond to the
flags you would specify when running `tsbs_generate_data`.
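As an illustration, a simulator-backed `data-source` section might look like the sketch below. The key names mirror `tsbs_generate_data` flags and are assumptions; check a generated config for the exact ones:

```yaml
# Hypothetical sketch: simulator-backed data source.
data-source:
  type: SIMULATOR
  simulator:
    use-case: cpu-only
    scale: 100                           # number of hosts to simulate
    seed: 123                            # PRNG seed, for deterministic data
    timestamp-start: "2020-01-01T00:00:00Z"
    timestamp-end: "2020-01-02T00:00:00Z"
    log-interval: 10s                    # interval between readings per host
```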

You can run `tsbs_load` with
```shell script
$ tsbs_load load <db_name> --config=./path-to-config.yaml
```
where `<db_name>` is one of the implemented databases. You can run
```shell script
$ tsbs_load load --help
```
for a list of the available databases.

## Information about a property and overriding

The YAML file generated with `tsbs_load config` does not contain
information about what each property represents. You can easily discover
more details about each property by running:

```shell script
$ tsbs_load load --help
```
This will list all the available flags for all databases, including the flags
for `data-source` and `loader.runner`. The `--loader.runner.db-name` flag
corresponds to the property:
```yaml
loader:
runner:
db-name: some-db
```
in the YAML config file. The type, description, and default
value appear next to the flag name, e.g.:

```string, Name of database (default "benchmark")```

### Information about database specific flags

Some of the properties are only valid for specific databases. These
properties go under the `loader.db-specific` section. To view information
about them you can run:
```shell script
$ tsbs_load load <db_name> --help
```

For example, for TimescaleDB you can see the following:
```shell script
$ tsbs_load load timescaledb --help
...
      --loader.db-specific.chunk-time duration
            Duration that each chunk should represent, e.g., 12h (default 12h0m0s)
...
```

### Overriding values

* Each property has a default value, used if not otherwise overridden
* An entry in the config YAML file overrides the default value
* A flag passed at runtime overrides an entry in the YAML file
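This precedence (flag > config entry > default) can be illustrated with plain shell parameter expansion; the variable names below are illustrative, not part of `tsbs_load`:

```shell
# Hypothetical sketch of tsbs_load's precedence rule:
# runtime flag > config file entry > built-in default.
DEFAULT_DB_NAME="benchmark"      # built-in default
CONFIG_DB_NAME="from-config"     # value read from config.yaml (assumed)
FLAG_DB_NAME="from-flag"         # value passed as --loader.runner.db-name (assumed)

# With the flag set, it wins:
EFFECTIVE=${FLAG_DB_NAME:-${CONFIG_DB_NAME:-$DEFAULT_DB_NAME}}
echo "with flag:    $EFFECTIVE"

# Without the flag, the config entry wins over the default:
FLAG_DB_NAME=""
EFFECTIVE=${FLAG_DB_NAME:-${CONFIG_DB_NAME:-$DEFAULT_DB_NAME}}
echo "without flag: $EFFECTIVE"
```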
2 changes: 1 addition & 1 deletion pkg/targets/timestream/config.go
@@ -36,7 +36,7 @@ func targetSpecificFlags(flagPrefix string, flagSet *pflag.FlagSet) {
)
flagSet.Bool(
flagPrefix+"use-current-time",
true,
false,
"Use the local current timestamp when generating the records to load")
flagSet.Int64(
"mag-store-retention-in-days",
2 changes: 1 addition & 1 deletion pkg/targets/victoriametrics/implemented_target.go
@@ -31,7 +31,7 @@ func (vm vmTarget) Serializer() serialize.PointSerializer {
}

func (vm vmTarget) TargetSpecificFlags(flagPrefix string, flagSet *pflag.FlagSet) {
pflag.String(
flagSet.String(
flagPrefix+"urls",
"http://localhost:8428/write",
"Comma-separated list of VictoriaMetrics ingestion URLs(single-node or VMInsert)",
19 files renamed without changes.
57 changes: 57 additions & 0 deletions scripts/run_queries/run_queries_timestream.sh
@@ -0,0 +1,57 @@
#!/bin/bash

# Exit immediately if a command exits with a non-zero status.
set -e

# Ensure runner is available
EXE_FILE_NAME=${EXE_FILE_NAME:-$(which tsbs_run_queries_timestream)}
if [[ -z "$EXE_FILE_NAME" ]]; then
echo "tsbs_run_queries_timestream not available. It is not specified explicitly and not found in \$PATH"
exit 1
fi

# AWS region of database
AWS_REGION=${AWS_REGION:-"us-east-1"}

# Queries folder
BULK_DATA_DIR=${BULK_DATA_DIR:-"/tmp/bulk_queries"}

# How many queries to run (0 means no limit)
MAX_QUERIES=${MAX_QUERIES:-"0"}

# How many concurrent workers will run queries - match num of cores, or default to 4
NUM_WORKERS=${NUM_WORKERS:-$(grep -c ^processor /proc/cpuinfo 2> /dev/null || echo 4)}


for FULL_DATA_FILE_NAME in ${BULK_DATA_DIR}/queries_timestream*; do
# $FULL_DATA_FILE_NAME: /full/path/to/file_with.ext
# $DATA_FILE_NAME: file_with.ext
# $DIR: /full/path/to
# $EXTENSION: ext
# NO_EXT_DATA_FILE_NAME: file_with

DATA_FILE_NAME=$(basename -- "${FULL_DATA_FILE_NAME}")
DIR=$(dirname "${FULL_DATA_FILE_NAME}")
EXTENSION="${DATA_FILE_NAME##*.}"
NO_EXT_DATA_FILE_NAME="${DATA_FILE_NAME%.*}"

# Several options on how to name results file
#OUT_FULL_FILE_NAME="${DIR}/result_${DATA_FILE_NAME}"
OUT_FULL_FILE_NAME="${DIR}/result_${NO_EXT_DATA_FILE_NAME}.out"
#OUT_FULL_FILE_NAME="${DIR}/${NO_EXT_DATA_FILE_NAME}.out"

if [ "${EXTENSION}" == "gz" ]; then
GUNZIP="gunzip"
else
GUNZIP="cat"
fi

echo "Running ${DATA_FILE_NAME}"
cat $FULL_DATA_FILE_NAME \
| $GUNZIP \
| $EXE_FILE_NAME \
--max-queries $MAX_QUERIES \
--workers $NUM_WORKERS \
        --aws-region $AWS_REGION \
| tee $OUT_FULL_FILE_NAME
done
