diff --git a/scripts/tpcx-ai/README.md b/scripts/tpcx-ai/README.md
new file mode 100644
index 00000000000..c02284215ad
--- /dev/null
+++ b/scripts/tpcx-ai/README.md
@@ -0,0 +1,108 @@
+# Implementation of TPCx-AI on Apache SystemDS
+
+TPCx-AI is an express benchmark developed by the TPC (Transaction Processing Performance Council)
+specifically tailored to end-to-end machine learning systems.
+For further information, refer to the official documentation provided by the TPC:
+[TPCx-AI documentation](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPCX-AI_v1.0.3.1.pdf)
+
+To run the TPCx-AI benchmark on SystemDS:
+ * Download the TPCx-AI benchmark kit from [TPC's website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
+ * Install and build SystemDS.
+ * Build the Python package and copy the distribution to the TPCx-AI root directory.
+ * Copy the files from this directory (`tpcx-ai`) into the TPCx-AI benchmark kit root directory.
+ * Set the scale factor and Java paths in `setenv_sds.sh`.
+ * Set up TPCx-AI by running `setup_python_sds.sh`.
+ * Generate data with `generate_data.sh`.
+ * Lastly, execute the benchmark using `TPCx-AI_Benchmarkrun_sds.sh`.
+
+## Detailed Instructions
+
+### Prerequisites
+
+The following sections describe the system prerequisites and the steps to prepare and adapt the TPCx-AI benchmark kit to run on SystemDS.
+
+#### Downloading the TPCx-AI Benchmark Kit
+
+Go to [TPC's website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) and download the TPCx-AI benchmark kit.
+Extract the archived directory, which will from now on be referred to as the TPCx-AI root directory.
+
+#### Building SystemDS
+
+Go back to the SystemDS root directory and follow the SystemDS installation guide.
+Build the project with Maven:
+```bash
+mvn package -P distribution
+```
+
+#### Building the Python Package and Copying It to the TPCx-AI Root Directory
+
+- From `SYSTEMDS_ROOT/src/main/python`, run `create_python_dist.py`.
+
+```bash
+python3 create_python_dist.py
+```
+
+- The `./dist` directory now contains the source distribution `systemds-VERSION.tar.gz`
+  and the wheel distribution `systemds-VERSION-py3-none-any.whl`, with `VERSION` being the current version number.
+- Copy `systemds-VERSION-py3-none-any.whl` to the TPCx-AI benchmark kit root directory.
+
+#### Transferring Files to the TPCx-AI Root Directory
+
+The following files need to be copied from this directory into the TPCx-AI root directory:
+
+- generate_data.sh
+- setenv_sds.sh
+- setup_python_sds.sh
+- TPCx-AI_Benchmarkrun_sds.sh
+
+The following files and directories in the TPCx-AI benchmark kit directory need to be **replaced**:
+
+- Replace the `driver` directory with the `driver` directory provided here.
+- In the TPCx-AI root directory, navigate to `workload/python/workload` and replace the 10 use case files with the files in the `use_cases` directory.
+- Replace `tpcxai_fdr.py` and `tpcxai_fdr_template.html` in the `TPCx-AI_ROOT/tools` directory with the files in this directory.
+
+#### Setting Up TPCx-AI
+
+Now the benchmark kit is ready for setup and installation.
+Prerequisites for running are:
+* Java 8
+* Java 11
+* Python 3.6+
+* Anaconda3/Conda4+
+* The binaries "java", "sbt" and "conda" must be included (and have "priority") in the PATH environment variable.
+* Disk space: make sure you have enough disk space to store the test data that will be generated in the `output/raw_data` directory.
+  The value of "TPCxAI_SCALE_FACTOR" in the `setenv_sds.sh` file determines the approximate size (in GB) of the dataset that will be generated and used during the benchmark execution.
+
+For more detailed information and optional setup possibilities, refer to the official TPCx-AI documentation:
+[TPCx-AI documentation](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPCX-AI_v1.0.3.1.pdf).
+
+#### Setting up the Environment in setenv_sds.sh
+
+There are three variables that need to be set in the `setenv_sds.sh` file prior to setup:
+* JAVA8_HOME: Set this variable to the path of the home directory of your Java 8 installation.
+* JAVA11_HOME: Set this variable to the path of the home directory of your Java 11 installation.
+* TPCxAI_SCALE_FACTOR: Set to the desired scale factor of the dataset; the default is 1.
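+
+As an illustration, the relevant assignments in `setenv_sds.sh` might look like the following sketch. The variable names are the ones listed above; the JDK paths are placeholders for this example and must be adjusted to the actual install locations on your system, and the exact assignment style in the shipped file may differ.
+
+```bash
+# Example settings only -- adjust the paths to your local JDK installations.
+export JAVA8_HOME=/usr/lib/jvm/java-8-openjdk
+export JAVA11_HOME=/usr/lib/jvm/java-11-openjdk
+# Approximate size (in GB) of the dataset that will be generated.
+export TPCxAI_SCALE_FACTOR=1
+```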
+
+#### Running the Setup Script
+
+* This implementation is based on SystemDS version 3.3.0 (commit id: 5ad67e8). If you want to use a different version, make sure to set the correct filename for the SystemDS distribution in the `setup_python_sds.sh` file.
+  The filename should match the appropriate version and build of SystemDS for your environment.
+* Run `setup_python_sds.sh` to automatically set up the benchmark,
+  install all the necessary libraries, install SystemDS, and set up the virtual environments.
+
+## Benchmark Execution
+
+### Data Generation
+
+Before running the benchmark, data needs to be generated. To do so, run the `generate_data.sh` script. The size of the generated data is controlled by the
+TPCxAI_SCALE_FACTOR variable in the `setenv_sds.sh` file. The default value is 1, which generates a dataset of approximately 1 GB.
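+
+For example, a generation run at the configured scale factor could look like the following sketch; `output/raw_data` is the raw-data location used by the driver configuration, and the final listing is only illustrative.
+
+```bash
+# Generate the dataset for the scale factor configured in setenv_sds.sh.
+./generate_data.sh
+# The generated raw data is placed under output/raw_data.
+ls output/raw_data
+```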
+
+### Running the Benchmark
+
+Now the benchmark can be executed by running the `TPCx-AI_Benchmarkrun_sds.sh` script.
+
+```bash
+./TPCx-AI_Benchmarkrun_sds.sh
+```
\ No newline at end of file
diff --git a/scripts/tpcx-ai/TPCx-AI_Benchmarkrun_sds.sh b/scripts/tpcx-ai/TPCx-AI_Benchmarkrun_sds.sh
new file mode 100755
index 00000000000..b2de36bc379
--- /dev/null
+++ b/scripts/tpcx-ai/TPCx-AI_Benchmarkrun_sds.sh
@@ -0,0 +1,90 @@
+#!/bin/bash
+
+#
+# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors.
+# This file is part of a software package distributed by the TPC
+# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor
+# license agreements.
+# This file is subject to the terms and conditions outlined in the End-User
+# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL:
+# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt
+# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality
+# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details.
+#
+
+
+#
+# Copyright 2019 Intel Corporation.
+# This software and the related documents are Intel copyrighted materials, and your use of them
+# is governed by the express license under which they were provided to you ("License"). Unless the
+# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or
+# transmit this software or the related documents without Intel's prior written permission.
+#
+# This software and the related documents are provided as is, with no express or implied warranties,
+# other than those that are expressly stated in the License.
+#
+#
+
+# Stop if any command fails
+set -e
+
+. setenv_sds.sh
+
+LOG_DEST="tpcxai_benchmark_run"
+TPCxAI_CONFIG_FILE_PATH=${TPCxAI_BENCHMARKRUN_CONFIG_FILE_PATH}
+if [[ ${IS_VALIDATION_RUN} -eq "1" ]]; then
+  echo "Benchmark validation run. Setting scale factor value to 1..."
+  export TPCxAI_SCALE_FACTOR=1
+  TPCxAI_CONFIG_FILE_PATH=${TPCxAI_VALIDATION_CONFIG_FILE_PATH}
+  LOG_DEST="tpcxai_benchmark_validation"
+fi
+
+if [[ ${TPCx_AI_VERBOSE} == "True" ]]; then
+  VFLAG="-v"
+fi
+
+echo "TPCx-AI_HOME directory: ${TPCx_AI_HOME_DIR}";
+echo "Using configuration file: ${TPCxAI_CONFIG_FILE_PATH} and Scale factor ${TPCxAI_SCALE_FACTOR}..."
+
+PATH=$JAVA11_HOME/bin:$PATH
+export JAVA11_HOME
+export PATH
+echo "Using Java at $JAVA11_HOME"
+
+echo "Starting Benchmark run..."
+sleep 1;
+
+bash ${TPCxAI_ENV_TOOLS_DIR}/clock_check.sh start
+
+./bin/tpcxai.sh --phase {CLEAN,DATA_GENERATION,LOADING,TRAINING,SERVING,SERVING,SERVING_THROUGHPUT,SCORING,CHECK_INTEGRITY} -sf ${TPCxAI_SCALE_FACTOR} --streams ${TPCxAI_SERVING_THROUGHPUT_STREAMS} -c ${TPCxAI_CONFIG_FILE_PATH} ${VFLAG}
+
+BENCHMARK_RUN_EXIT_CODE=$?
+
+if [ ${BENCHMARK_RUN_EXIT_CODE} -eq "0" ]; then
+  echo "Generating report..."
+  lib/python-venv/bin/python tools/tpcxai_fdr.py -d logs/tpcxai.db -f logs/report.txt
+  lib/python-venv/bin/python tools/tpcxai_fdr.py -d logs/tpcxai.db -t html -f logs/report.html
+  echo "Finished generating report";
+
+  echo ""
+
+  echo "Saving output data. This may take a few minutes ..."
+  bash ${TPCxAI_ENV_TOOLS_DIR}/saveOutputData.sh
+  echo "Finished saving output data"
+  echo ""
+
+  echo "Saving execution environment details..."
+  bash ${TPCxAI_ENV_TOOLS_DIR}/getEnvInfo.sh
+  echo "Finished saving environment details"
+  echo ""
+  echo "Saving data redundancy information..."
+  bash ${TPCxAI_ENV_TOOLS_DIR}/dataRedundancyInformation.sh
+  echo "Finished saving data redundancy"
+fi
+
+bash ${TPCxAI_ENV_TOOLS_DIR}/clock_check.sh end
+LOG_DEST="${LOG_DEST}_$(date +"%m%d%Y_%H%M%S")"
+mkdir -p "logs/history/${LOG_DEST}"
+find logs/ -mindepth 1 -maxdepth 1 -path logs/history -prune -o -print | xargs -I file mv file "logs/history/${LOG_DEST}"
+zip -vr "logs/history/${LOG_DEST}.zip" "logs/history/${LOG_DEST}" -x logs/history/*
diff --git a/scripts/tpcx-ai/driver/config/default-spark.yaml b/scripts/tpcx-ai/driver/config/default-spark.yaml
new file mode 100644
index 00000000000..a63e93a6212
--- /dev/null
+++ b/scripts/tpcx-ai/driver/config/default-spark.yaml
@@ -0,0 +1,350 @@
+#
+# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors.
+# This file is part of a software package distributed by the TPC
+# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor
+# license agreements.
+# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# DEFAULT Configuration for the tpcxai Driver +# convenience configurations +# examples for different datastores + +# Local filesystem on Windows, Linux, or MacOS +local_fs: &LOCAL_FS + name: "local_fs" + create: "tools/python/create.sh $destination" + load: "tools/python/load.sh $destination $source" + copy: "cp -f $source $destination" + delete: "rm -rf $destination" + delete_parallel: "pssh -t 0 -P -h nodes rm -rf $destination" + download: "cp $source $destination" + +# HDFS = Hadoop Distributed Filesystem +hdfs: &HDFS_LOCAL + name: "hdfs" + create: "tools/spark/create_hdfs.sh $destination" + load: "tools/spark/load_hdfs.sh $destination $source" + copy: "hdfs dfs -cp -f $source $destination" + delete: "hdfs dfs -rm -r -f -skipTrash $destination" + download: 'hdfs dfs -cat $source/* | awk ''BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}'' > $destination/predictions.csv' + +hdfs_parallel: &HDFS + name: "hdfs" + create: "tools/spark/create_hdfs.sh $destination" + load: "tools/parallel-data-load.sh nodes 1 $destination $source" + copy: "hdfs dfs -cp -f $source $destination" + delete: "hdfs dfs -rm -r -f -skipTrash $destination" + download: 'hdfs dfs -cat $source/* | awk ''BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}'' > $destination/predictions.csv' + +workload: + # global definitions + engine_base : "spark-submit + --conf spark.executor.extraJavaOptions='-Xss128m' + --conf spark.executorEnv.NUMBA_CACHE_DIR=/tmp + --conf spark.kryoserializer.buffer.max=1g + --conf spark.rpc.message.maxSize=1024 + --deploy-mode client + --driver-java-options '-Xss128m' + --driver-memory 10g + --master yarn" + + engine_executable: &ENGINE "$engine_base + --num-executors 1 + --executor-cores 5 + --executor-memory 40g + --conf spark.executor.memoryOverhead=4g + --jars '$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.11/*.jar'" + + engine_executable_dl2: &ENGINE_DL2 "$engine_base + --num-executors 1 + --executor-cores 1 + --executor-memory 40g + --conf spark.executor.memoryOverhead=4g + --jars '$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.11/*.jar'" + + engine_executable_dl5: &ENGINE_DL5 "$engine_base + --num-executors 1 + --executor-cores 1 + --executor-memory 40g + --conf 
spark.executor.memoryOverhead=4g + --jars '$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.11/*.jar'" + + engine_executable_9: &SERVING_9 "$engine_base + --num-executors 1 + --executor-cores 5 + --executor-memory 40g + --conf spark.executor.memoryOverhead=4g + --conf spark.task.cpus=1 + --jars '$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.11/*.jar'" + + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: &TRAINING_TEMPLATE "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --workdir $output $input/$file" + serving_template: &SERVING_TEMPLATE "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$phase $input/$file" + serving_throughput_template: &SERVING_THROUGHPUT_TEMPLATE "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$stream $input/$file" + training_data_url: &TRAINING_DATA_URL "output/data/training" + serving_data_url: &SERVING_DATA_URL "output/data/serving" + scoring_data_url: &SCORING_DATA_URL "output/raw_data/scoring" + datagen_datastore: *LOCAL_FS + # general/ benchmark-wide configuration parameters + pdgf_node_parallel: True + pdgf_home: "lib/pdgf" + raw_data_url: "output/raw_data" + temp_dir: '/tmp/tpcxai' + usecases: + 1: + # general + name: "org.tpc.tpcxai.UseCase01" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --num_clusters 4 --workdir $output $input/order.csv $input/lineitem.csv $input/order_returns.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$phase $input/order.csv $input/lineitem.csv $input/order_returns.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$stream $input/order.csv $input/lineitem.csv $input/order_returns.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc01" + output_url: "output/output/uc01" + scoring_output_url: "output/scoring/uc01" + 2: + # general + name: "UseCase02.py" + # engines + training_engine: *ENGINE_DL2 + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage training --epochs 25 --batch 32 --executor_cores_horovod 1 --task_cpus_horovod 1 --workdir $output '$input/$file' $input/CONVERSATION_AUDIO.seq" + serving_template: "$serving_engine $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name 
--stage serving --batch 32 --workdir $model --output $output/$phase '$input/$file' $input/CONVERSATION_AUDIO.seq" + serving_throughput_template: "$serving_engine $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage serving --batch 32 --workdir $model --output $output/$stream '$input/$file' $input/CONVERSATION_AUDIO.seq" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc02" + output_url: "output/output/uc02" + scoring_output_url: "output/scoring/uc02" + working_dir: "/tmp" + 3: + # general + name: "org.tpc.tpcxai.UseCase03" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --workdir $output $input/order.csv $input/lineitem.csv $input/product.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$phase $input/store_dept.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$stream $input/store_dept.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc03" + output_url: "output/output/uc03" + scoring_output_url: "output/scoring/uc03" + 4: + # general + name: "org.tpc.tpcxai.UseCase04" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: *TRAINING_TEMPLATE + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc04" + output_url: "output/output/uc04" + scoring_output_url: "output/scoring/uc04" + 5: + # general + name: "UseCase05.py" + # engines + training_engine: *ENGINE_DL5 + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + # add namenode if necessary by specifying + # $training_engine [path]/$name --namenode [namenode.url:port] + training_template: &TRAINING_TEMPLATE_PY "$training_engine $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage training --epochs 15 --batch 512 --workdir $output $input/$file" + serving_template: "$serving_engine $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage serving --batch 512 --workdir $model --output $output/$phase $input/$file" 
+ serving_throughput_template: "$serving_engine $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage serving --batch 512 --workdir $model --output $output/$stream $input/$file" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc05" + output_url: "output/output/uc05" + scoring_output_url: "output/scoring/uc05" + working_dir: "/tmp" + 6: + # general + name: "org.tpc.tpcxai.UseCase06" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: *TRAINING_TEMPLATE + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc06" + output_url: "output/output/uc06" + scoring_output_url: "output/scoring/uc06" + 7: + # general + name: "org.tpc.tpcxai.UseCase07" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --num-blocks 20 --workdir $output $input/$file" + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc07" + output_url: "output/output/uc07" + scoring_output_url: "output/scoring/uc07" + 8: + # general + name: "org.tpc.tpcxai.UseCase08" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --num-workers 1 --num-threads 1 --num-rounds 100 --workdir $output $input/order.csv $input/lineitem.csv $input/product.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --num-workers 1 --num-threads 1 --workdir $model --output $output/$phase $input/order.csv $input/lineitem.csv $input/product.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --num-workers 1 --num-threads 1 --workdir $model --output $output/$stream $input/order.csv $input/lineitem.csv $input/product.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc08" + output_url: 
"output/output/uc08" + scoring_output_url: "output/scoring/uc08" + 9: + # general + name: "UseCase09.py" + # engines + training_engine: *ENGINE_DL2 + serving_engine: *SERVING_9 + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + # add namenode if necessary by specifying + # $training_engine [path]/$name --namenode [namenode.url:port] + training_template: "$training_engine --files $tpcxai_home/workload/spark/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage training --epochs_embedding=15 --batch=64 --executor_cores_horovod 1 --task_cpus_horovod 1 --workdir $output '$input/CUSTOMER_IMAGES_META.csv' '$input/CUSTOMER_IMAGES.seq'" + serving_template: "$serving_engine --files $tpcxai_home/workload/spark/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage serving --workdir $model --output $output/$phase '$input/CUSTOMER_IMAGES_META.csv' '$input/CUSTOMER_IMAGES.seq'" + serving_throughput_template: "$serving_engine --files $tpcxai_home/workload/spark/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat $tpcxai_home/workload/spark/pyspark/workload-pyspark/$name --stage serving --workdir $model --output $output/$stream '$input/CUSTOMER_IMAGES_META.csv' '$input/CUSTOMER_IMAGES.seq'" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc09" + output_url: "output/output/uc09" + scoring_output_url: "output/scoring/uc09" + working_dir: "/tmp" + 10: + # general + name: "org.tpc.tpcxai.UseCase10" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly-0.1.jar --stage training --workdir $output $input/financial_account.csv $input/financial_transactions.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly-0.1.jar --stage serving --workdir $model --output $output/$phase $input/financial_account.csv $input/financial_transactions.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly-0.1.jar --stage serving --workdir $model --output $output/$stream $input/financial_account.csv $input/financial_transactions.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc10" + output_url: "output/output/uc10" + scoring_output_url: "output/scoring/uc10" diff --git a/scripts/tpcx-ai/driver/config/default-spark3.yaml b/scripts/tpcx-ai/driver/config/default-spark3.yaml new file mode 100644 index 00000000000..1fd8ed53852 --- /dev/null +++ b/scripts/tpcx-ai/driver/config/default-spark3.yaml @@ -0,0 +1,350 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council 
(TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# DEFAULT Configuration for the tpcxai Driver +# convenience configurations +# examples for different datastores + +# Local filesystem on Windows, Linux, or MacOS +local_fs: &LOCAL_FS + name: "local_fs" + create: "tools/python/create.sh $destination" + load: "tools/python/load.sh $destination $source" + copy: "cp -f $source $destination" + delete: "rm -rf $destination" + delete_parallel: "pssh -t 0 -P -h nodes rm -rf $destination" + download: "cp $source $destination" + +# HDFS = Hadoop Distributed Filesystem +hdfs: &HDFS_LOCAL + name: "hdfs" + create: "tools/spark/create_hdfs.sh $destination" + load: "tools/spark/load_hdfs.sh $destination $source" + copy: "hdfs dfs -cp -f $source $destination" + delete: "hdfs dfs -rm -r -f -skipTrash $destination" + download: 'hdfs dfs -cat $source/* | awk ''BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}'' > $destination/predictions.csv' + +hdfs_parallel: &HDFS + name: "hdfs" + create: "tools/spark/create_hdfs.sh $destination" + load: "tools/parallel-data-load.sh nodes 1 $destination $source" + copy: "hdfs dfs -cp -f $source $destination" + delete: "hdfs dfs -rm -r -f -skipTrash $destination" + download: 'hdfs dfs -cat $source/* | awk ''BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}'' > $destination/predictions.csv' + +workload: + # global definitions + engine_base : "spark3-submit + --conf spark.executor.extraJavaOptions='-Xss128m' + --conf spark.executorEnv.NUMBA_CACHE_DIR=/tmp + --conf spark.kryoserializer.buffer.max=1g + --conf spark.rpc.message.maxSize=1024 + --deploy-mode client + --driver-java-options '-Xss128m' + --driver-memory 10g + --master yarn" + + engine_executable: &ENGINE "$engine_base + --num-executors 1 + --executor-cores 5 + --executor-memory 40g + --conf spark.executor.memoryOverhead=4g + --jars '$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.12/*.jar'" + + engine_executable_dl2: &ENGINE_DL2 "$engine_base + --num-executors 1 + --executor-cores 1 + --executor-memory 40g + --conf spark.executor.memoryOverhead=4g + --jars 
'$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.12/*.jar'" + + engine_executable_dl5: &ENGINE_DL5 "$engine_base + --num-executors 1 + --executor-cores 1 + --executor-memory 40g + --conf spark.executor.memoryOverhead=4g + --jars '$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.12/*.jar'" + + engine_executable_9: &SERVING_9 "$engine_base + --num-executors 1 + --executor-cores 5 + --executor-memory 40g + --conf spark.executor.memoryOverhead=4g + --conf spark.task.cpus=1 + --jars '$tpcxai_home/lib/config-1.4.2.jar,$tpcxai_home/lib/threeten-extra-0.9.jar,$tpcxai_home/lib/scala_2.12/*.jar'" + + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: &TRAINING_TEMPLATE "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --workdir $output $input/$file" + serving_template: &SERVING_TEMPLATE "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$phase $input/$file" + serving_throughput_template: &SERVING_THROUGHPUT_TEMPLATE "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$stream $input/$file" + training_data_url: &TRAINING_DATA_URL "output/data/training" + serving_data_url: &SERVING_DATA_URL "output/data/serving" + scoring_data_url: &SCORING_DATA_URL "output/raw_data/scoring" + datagen_datastore: *LOCAL_FS + # general/ benchmark-wide configuration parameters + pdgf_node_parallel: True + pdgf_home: "lib/pdgf" + raw_data_url: "output/raw_data" + temp_dir: '/tmp/tpcxai' + usecases: + 1: + # general + name: "org.tpc.tpcxai.UseCase01" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --num_clusters 4 --workdir $output $input/order.csv $input/lineitem.csv $input/order_returns.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$phase $input/order.csv $input/lineitem.csv $input/order_returns.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$stream $input/order.csv $input/lineitem.csv $input/order_returns.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc01" + output_url: "output/output/uc01" + scoring_output_url: "output/scoring/uc01" + 2: + # general + name: "UseCase02.py" + # engines + training_engine: *ENGINE_DL2 + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage 
training --epochs 25 --batch 32 --executor_cores_horovod 1 --task_cpus_horovod 1 --workdir $output '$input/$file' $input/CONVERSATION_AUDIO.seq" + serving_template: "$serving_engine $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage serving --batch 32 --workdir $model --output $output/$phase '$input/$file' $input/CONVERSATION_AUDIO.seq" + serving_throughput_template: "$serving_engine $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage serving --batch 32 --workdir $model --output $output/$stream '$input/$file' $input/CONVERSATION_AUDIO.seq" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc02" + output_url: "output/output/uc02" + scoring_output_url: "output/scoring/uc02" + working_dir: "/tmp" + 3: + # general + name: "org.tpc.tpcxai.UseCase03" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --workdir $output $input/order.csv $input/lineitem.csv $input/product.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$phase $input/store_dept.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --workdir $model --output $output/$stream $input/store_dept.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc03" + output_url: "output/output/uc03" + scoring_output_url: "output/scoring/uc03" + 4: + # general + name: "org.tpc.tpcxai.UseCase04" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: *TRAINING_TEMPLATE + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc04" + output_url: "output/output/uc04" + scoring_output_url: "output/scoring/uc04" + 5: + # general + name: "UseCase05.py" + # engines + training_engine: *ENGINE_DL5 + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + # add namenode if necessary by specifying + # $training_engine [path]/$name --namenode [namenode.url:port] + training_template: &TRAINING_TEMPLATE_PY "$training_engine $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage 
training --epochs 15 --batch 512 --workdir $output $input/$file" + serving_template: "$serving_engine $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage serving --batch 512 --workdir $model --output $output/$phase $input/$file" + serving_throughput_template: "$serving_engine $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage serving --batch 512 --workdir $model --output $output/$stream $input/$file" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc05" + output_url: "output/output/uc05" + scoring_output_url: "output/scoring/uc05" + working_dir: "/tmp" + 6: + # general + name: "org.tpc.tpcxai.UseCase06" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: *TRAINING_TEMPLATE + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc06" + output_url: "output/output/uc06" + scoring_output_url: "output/scoring/uc06" + 7: + # general + name: "org.tpc.tpcxai.UseCase07" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --num-blocks 20 --workdir $output $input/$file" + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc07" + output_url: "output/output/uc07" + scoring_output_url: "output/scoring/uc07" + 8: + # general + name: "org.tpc.tpcxai.UseCase08" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage training --num-workers 1 --num-threads 1 --num-rounds 100 --workdir $output $input/order.csv $input/lineitem.csv $input/product.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --num-workers 1 --num-threads 1 --workdir $model --output $output/$phase $input/order.csv $input/lineitem.csv $input/product.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly_2.11-0.1.jar --stage serving --num-workers 1 --num-threads 1 --workdir $model --output 
$output/$stream $input/order.csv $input/lineitem.csv $input/product.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc08" + output_url: "output/output/uc08" + scoring_output_url: "output/scoring/uc08" + 9: + # general + name: "UseCase09.py" + # engines + training_engine: *ENGINE_DL2 + serving_engine: *SERVING_9 + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + # add namenode if necessary by specifying + # $training_engine [path]/$name --namenode [namenode.url:port] + training_template: "$training_engine --files $tpcxai_home/workload/spark3/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage training --epochs_embedding=15 --batch=64 --executor_cores_horovod 1 --task_cpus_horovod 1 --workdir $output '$input/CUSTOMER_IMAGES_META.csv' '$input/CUSTOMER_IMAGES.seq'" + serving_template: "$serving_engine --files $tpcxai_home/workload/spark3/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage serving --workdir $model --output $output/$phase '$input/CUSTOMER_IMAGES_META.csv' '$input/CUSTOMER_IMAGES.seq'" + serving_throughput_template: "$serving_engine --files $tpcxai_home/workload/spark3/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat $tpcxai_home/workload/spark3/pyspark/workload-pyspark/$name --stage serving --workdir $model --output $output/$stream '$input/CUSTOMER_IMAGES_META.csv' '$input/CUSTOMER_IMAGES.seq'" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc09" + output_url: "output/output/uc09" + scoring_output_url: "output/scoring/uc09" + working_dir: "/tmp" + 10: + # general + name: "org.tpc.tpcxai.UseCase10" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *HDFS # for storing the training data + model_datastore: *HDFS # for storing the trained models + serving_datastore: *HDFS # for storing the serving data + output_datastore: *HDFS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$training_engine --class $name lib/workload-assembly-0.1.jar --stage training --workdir $output $input/financial_account.csv $input/financial_transactions.csv" + serving_template: "$serving_engine --class $name lib/workload-assembly-0.1.jar --stage serving --workdir $model --output $output/$phase $input/financial_account.csv $input/financial_transactions.csv" + serving_throughput_template: "$serving_engine --class $name lib/workload-assembly-0.1.jar --stage serving --workdir $model --output $output/$stream $input/financial_account.csv $input/financial_transactions.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc10" + output_url: "output/output/uc10" + scoring_output_url: "output/scoring/uc10" diff --git a/scripts/tpcx-ai/driver/config/default.yaml 
b/scripts/tpcx-ai/driver/config/default.yaml new file mode 100644 index 00000000000..0bf8d5aa87d --- /dev/null +++ b/scripts/tpcx-ai/driver/config/default.yaml @@ -0,0 +1,300 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# DEFAULT Configuration for the TPCx-AI Driver +# convenience configurations +# examples for different datastores + +# Local filesystem on Windows, Linux, or MacOS + +local_fs: &LOCAL_FS + name: "local_fs" + create: "tools/python/create.sh $destination" + load: "tools/python/load.sh $destination $source" + copy: "cp -f $source $destination" + delete: "rm -rf $destination" + delete_parallel: "pssh -t 0 -P -h nodes rm -rf $destination" + download: "cp $source $destination" + +# HDFS = Hadoop Distributed Filesystem +hdfs: &HDFS + name: "hdfs" + create: "tools/spark/create_hdfs.sh $destination" + load: "tools/spark/load_hdfs.sh $destination $source" + copy: "hdfs dfs -cp -f $source $destination" + delete: "hdfs dfs -rm -r -f -skipTrash $destination" + download: 'hdfs dfs -cat $source/* | awk ''BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}'' > $destination/predictions.csv' + +workload: + # global definitions + engine_executable: &ENGINE "lib/python-venv/bin/python" + engine_executable_ks: &ENGINE_KS "lib/python-venv-ks/bin/python" + + training_template: &TRAINING_TEMPLATE "$engine $name --stage training --workdir $output $input/$file" + serving_template: &SERVING_TEMPLATE "$engine $name --stage serving --workdir $model --output $output/$phase $input/$file" + serving_throughput_template: &SERVING_THROUGHPUT_TEMPLATE "$engine $name --stage serving --workdir $model --output $output/$stream $input/$file" + training_data_url: &TRAINING_DATA_URL "output/data/training" + serving_data_url: &SERVING_DATA_URL "output/data/serving" + scoring_data_url: &SCORING_DATA_URL "output/data/scoring" + datagen_datastore: *LOCAL_FS + include_datagen_in_tload: False + # general/ benchmark-wide configuration parameters + pdgf_node_parallel: False + pdgf_home: "lib/pdgf" + raw_data_url: "output/raw_data" + temp_dir: '/tmp/tpcxai' + 
usecases: + 1: + # general + name: "-m workload.UseCase01" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$engine $name --stage training --num_clusters 4 --workdir $output $input/order.csv $input/lineitem.csv $input/order_returns.csv" + serving_template: "$engine $name --stage serving --workdir $model --output $output/$phase $input/order.csv $input/lineitem.csv $input/order_returns.csv" + serving_throughput_template: "$engine $name --stage serving --workdir $model --output $output/$stream $input/order.csv $input/lineitem.csv $input/order_returns.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc01" + output_url: "output/output/uc01" + scoring_output_url: "output/scoring/uc01" + 2: + # general + name: "-m workload.UseCase02" + # engines + training_engine: *ENGINE_KS + serving_engine: *ENGINE_KS + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$engine $name --stage training --epochs 25 --batch 32 --workdir $output $input/$file" + serving_template: "$engine $name --stage serving --batch 32 --workdir $model --output $output/$phase $input/$file" + serving_throughput_template: "$engine $name --stage serving --batch 32 --workdir $model --output $output/$stream $input/$file" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc02" + output_url: "output/output/uc02" + scoring_output_url: "output/scoring/uc02" + 3: + # general + name: "-m workload.UseCase03" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$engine $name --stage training --workdir $output $input/order.csv $input/lineitem.csv $input/product.csv" + serving_template: "$engine $name --stage serving --workdir $model --output $output/$phase $input/store_dept.csv" + serving_throughput_template: "$engine $name --stage serving --workdir $model --output $output/$stream $input/store_dept.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc03" + output_url: "output/output/uc03" + scoring_output_url: "output/scoring/uc03" + 4: + # general + name: "-m workload.UseCase04" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the 
trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: *TRAINING_TEMPLATE + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc04" + output_url: "output/output/uc04" + scoring_output_url: "output/scoring/uc04" + 5: + # general + name: "-m workload.UseCase05" + # engines + training_engine: *ENGINE_KS + serving_engine: *ENGINE_KS + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$engine $name --stage training --epochs 15 --batch 512 --workdir $output $input/$file" + serving_template: "$engine $name --stage serving --batch 512 --workdir $model --output $output/$phase $input/$file" + serving_throughput_template: "$engine $name --stage serving --batch 512 --workdir $model --output $output/$stream $input/$file" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc05" + output_url: "output/output/uc05" + scoring_output_url: "output/scoring/uc05" + 6: + # general + name: "-m workload.UseCase06" + # engines + training_engine: *ENGINE_KS + serving_engine: *ENGINE_KS + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: *TRAINING_TEMPLATE + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc06" + output_url: "output/output/uc06" + scoring_output_url: "output/scoring/uc06" + 7: + # general + name: "-m workload.UseCase07" + # engines + training_engine: *ENGINE_KS + serving_engine: *ENGINE_KS + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: *TRAINING_TEMPLATE + serving_template: *SERVING_TEMPLATE + serving_throughput_template: *SERVING_THROUGHPUT_TEMPLATE + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc07" + output_url: "output/output/uc07" + scoring_output_url: "output/scoring/uc07" + 8: + # general + name: "-m workload.UseCase08" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *LOCAL_FS # for storing the training data + 
model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$engine $name --stage training --num-rounds 100 --workdir $output $input/order.csv $input/lineitem.csv $input/product.csv" + serving_template: "$engine $name --stage serving --workdir $model --output $output/$phase $input/order.csv $input/lineitem.csv $input/product.csv" + serving_throughput_template: "$engine $name --stage serving --workdir $model --output $output/$stream $input/order.csv $input/lineitem.csv $input/product.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc08" + output_url: "output/output/uc08" + scoring_output_url: "output/scoring/uc08" + 9: + # general + name: "-m workload.UseCase09" + # engines + training_engine: *ENGINE_KS + serving_engine: *ENGINE_KS + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$engine $name --stage training --epochs_embedding=15 --batch=64 --workdir $output $input/CUSTOMER_IMAGES" + serving_template: "$engine $name --stage serving --batch=64 --workdir $model --output $output/$phase $input/CUSTOMER_IMAGES" + serving_throughput_template: "$engine $name --stage serving --batch=64 --workdir $model --output $output/$stream $input/CUSTOMER_IMAGES" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc09" + output_url: "output/output/uc09" + scoring_output_url: "output/scoring/uc09" + 10: + # general + name: "-m workload.UseCase10" + # engines + training_engine: *ENGINE + serving_engine: *ENGINE + # data stores + training_datastore: *LOCAL_FS # for storing the training data + model_datastore: *LOCAL_FS # for storing the trained models + serving_datastore: *LOCAL_FS # for storing the serving data + output_datastore: *LOCAL_FS # for storing the final output + # templates + datagen_template: "java -jar $pdgf -ns -sf $scale_factor -s $table" + training_template: "$engine $name --stage training --workdir $output $input/financial_account.csv $input/financial_transactions.csv" + serving_template: "$engine $name --stage serving --workdir $model --output $output/$phase $input/financial_account.csv $input/financial_transactions.csv" + serving_throughput_template: "$engine $name --stage serving --workdir $model --output $output/$stream $input/financial_account.csv $input/financial_transactions.csv" + # URLs + training_data_url: *TRAINING_DATA_URL + serving_data_url: *SERVING_DATA_URL + scoring_data_url: *SCORING_DATA_URL + model_url: "output/model/uc10" + output_url: "output/output/uc10" + scoring_output_url: "output/scoring/uc10" diff --git a/scripts/tpcx-ai/driver/dat/streams.yaml b/scripts/tpcx-ai/driver/dat/streams.yaml new file mode 100644 index 00000000000..581b4fcbe05 --- /dev/null +++ b/scripts/tpcx-ai/driver/dat/streams.yaml @@ -0,0 +1,204 @@ +workload: + # throughput streams + # any number of streams can be defined + # the naming scheme is 
`serving_througput_stream_[id] + serving_throughput_stream_1: [ 3, 5, 10, 6, 1, 7, 4, 8, 9, 2 ] + serving_throughput_stream_2: [ 1, 4, 5, 10, 3, 2, 9, 6, 7, 8 ] + serving_throughput_stream_3: [ 9, 5, 2, 6, 4, 10, 1, 7, 8, 3 ] + serving_throughput_stream_4: [ 4, 1, 3, 9, 5, 2, 10, 6, 7, 8 ] + serving_throughput_stream_5: [ 9, 8, 5, 4, 10, 1, 7, 3, 6, 2 ] + serving_throughput_stream_6: [ 5, 8, 9, 4, 1, 3, 10, 7, 2, 6 ] + serving_throughput_stream_7: [ 1, 8, 5, 2, 10, 9, 6, 7, 3, 4 ] + serving_throughput_stream_8: [ 3, 8, 7, 9, 4, 6, 1, 2, 10, 5 ] + serving_throughput_stream_9: [ 3, 5, 1, 6, 9, 2, 8, 7, 10, 4 ] + serving_throughput_stream_10: [ 6, 5, 3, 9, 10, 2, 4, 1, 8, 7 ] + serving_throughput_stream_11: [ 1, 5, 6, 4, 10, 8, 3, 7, 2, 9 ] + serving_throughput_stream_12: [ 4, 6, 2, 9, 8, 1, 10, 7, 5, 3 ] + serving_throughput_stream_13: [ 9, 1, 10, 5, 8, 7, 3, 4, 2, 6 ] + serving_throughput_stream_14: [ 10, 5, 4, 6, 7, 2, 9, 3, 1, 8 ] + serving_throughput_stream_15: [ 10, 3, 8, 5, 9, 4, 7, 1, 2, 6 ] + serving_throughput_stream_16: [ 10, 7, 5, 1, 6, 9, 8, 4, 3, 2 ] + serving_throughput_stream_17: [ 1, 8, 2, 5, 9, 10, 3, 7, 6, 4 ] + serving_throughput_stream_18: [ 4, 10, 6, 9, 3, 1, 2, 5, 7, 8 ] + serving_throughput_stream_19: [ 3, 5, 7, 6, 1, 4, 10, 2, 9, 8 ] + serving_throughput_stream_20: [ 9, 10, 1, 8, 3, 7, 6, 2, 4, 5 ] + serving_throughput_stream_21: [ 9, 2, 8, 7, 5, 10, 6, 4, 3, 1 ] + serving_throughput_stream_22: [ 9, 2, 7, 10, 5, 1, 3, 4, 6, 8 ] + serving_throughput_stream_23: [ 5, 7, 6, 4, 8, 3, 9, 10, 1, 2 ] + serving_throughput_stream_24: [ 6, 9, 2, 3, 7, 5, 1, 4, 8, 10 ] + serving_throughput_stream_25: [ 5, 10, 1, 3, 6, 4, 7, 2, 9, 8 ] + serving_throughput_stream_26: [ 5, 8, 3, 9, 4, 1, 10, 7, 6, 2 ] + serving_throughput_stream_27: [ 1, 6, 9, 7, 10, 4, 3, 2, 5, 8 ] + serving_throughput_stream_28: [ 3, 8, 7, 4, 5, 2, 1, 10, 9, 6 ] + serving_throughput_stream_29: [ 4, 7, 5, 9, 2, 10, 8, 1, 6, 3 ] + serving_throughput_stream_30: [ 7, 9, 5, 10, 1, 6, 8, 3, 2, 4 ] + serving_throughput_stream_31: [ 9, 1, 8, 4, 6, 10, 3, 7, 5, 2 ] + serving_throughput_stream_32: [ 4, 5, 9, 10, 2, 1, 3, 6, 8, 7 ] + serving_throughput_stream_33: [ 4, 1, 5, 10, 3, 8, 6, 2, 7, 9 ] + serving_throughput_stream_34: [ 2, 9, 1, 3, 5, 7, 4, 6, 10, 8 ] + serving_throughput_stream_35: [ 2, 6, 4, 10, 7, 5, 1, 3, 9, 8 ] + serving_throughput_stream_36: [ 10, 7, 1, 4, 5, 3, 9, 6, 2, 8 ] + serving_throughput_stream_37: [ 5, 4, 1, 3, 7, 9, 8, 2, 10, 6 ] + serving_throughput_stream_38: [ 4, 2, 5, 7, 8, 3, 10, 1, 9, 6 ] + serving_throughput_stream_39: [ 5, 9, 7, 1, 3, 4, 6, 10, 2, 8 ] + serving_throughput_stream_40: [ 1, 8, 7, 6, 4, 5, 2, 3, 10, 9 ] + serving_throughput_stream_41: [ 10, 5, 4, 8, 3, 7, 9, 2, 1, 6 ] + serving_throughput_stream_42: [ 5, 10, 6, 2, 1, 7, 8, 9, 4, 3 ] + serving_throughput_stream_43: [ 4, 7, 2, 6, 9, 3, 10, 8, 1, 5 ] + serving_throughput_stream_44: [ 5, 10, 4, 1, 8, 3, 7, 9, 2, 6 ] + serving_throughput_stream_45: [ 7, 5, 9, 6, 8, 2, 3, 4, 1, 10 ] + serving_throughput_stream_46: [ 3, 1, 6, 10, 8, 9, 4, 7, 5, 2 ] + serving_throughput_stream_47: [ 1, 4, 10, 2, 3, 5, 7, 8, 6, 9 ] + serving_throughput_stream_48: [ 7, 5, 1, 9, 8, 4, 6, 3, 10, 2 ] + serving_throughput_stream_49: [ 10, 8, 6, 1, 4, 2, 5, 3, 9, 7 ] + serving_throughput_stream_50: [ 8, 7, 5, 3, 6, 2, 10, 4, 9, 1 ] + serving_throughput_stream_51: [ 7, 3, 6, 9, 4, 8, 2, 5, 10, 1 ] + serving_throughput_stream_52: [ 6, 8, 5, 3, 10, 9, 2, 4, 7, 1 ] + serving_throughput_stream_53: [ 2, 5, 9, 10, 3, 8, 6, 4, 1, 7 ] + serving_throughput_stream_54: [ 
6, 2, 3, 5, 10, 4, 7, 9, 8, 1 ] + serving_throughput_stream_55: [ 8, 4, 5, 1, 6, 7, 3, 9, 10, 2 ] + serving_throughput_stream_56: [ 8, 2, 7, 4, 5, 9, 6, 3, 1, 10 ] + serving_throughput_stream_57: [ 4, 10, 8, 5, 3, 7, 1, 6, 2, 9 ] + serving_throughput_stream_58: [ 2, 3, 1, 4, 6, 9, 5, 7, 8, 10 ] + serving_throughput_stream_59: [ 10, 4, 7, 1, 2, 3, 5, 8, 6, 9 ] + serving_throughput_stream_60: [ 10, 4, 3, 7, 5, 1, 6, 9, 8, 2 ] + serving_throughput_stream_61: [ 4, 6, 3, 7, 9, 5, 8, 10, 1, 2 ] + serving_throughput_stream_62: [ 5, 8, 9, 6, 10, 1, 7, 3, 4, 2 ] + serving_throughput_stream_63: [ 3, 9, 10, 7, 8, 2, 1, 5, 6, 4 ] + serving_throughput_stream_64: [ 8, 7, 3, 4, 2, 1, 5, 10, 6, 9 ] + serving_throughput_stream_65: [ 7, 5, 8, 4, 1, 3, 10, 6, 2, 9 ] + serving_throughput_stream_66: [ 1, 2, 6, 8, 7, 10, 3, 5, 4, 9 ] + serving_throughput_stream_67: [ 7, 9, 4, 5, 8, 2, 10, 3, 6, 1 ] + serving_throughput_stream_68: [ 5, 2, 4, 7, 9, 1, 8, 3, 10, 6 ] + serving_throughput_stream_69: [ 6, 8, 5, 9, 3, 4, 7, 2, 10, 1 ] + serving_throughput_stream_70: [ 3, 10, 7, 5, 1, 2, 9, 8, 6, 4 ] + serving_throughput_stream_71: [ 6, 10, 1, 4, 2, 5, 9, 7, 8, 3 ] + serving_throughput_stream_72: [ 1, 4, 3, 5, 7, 10, 9, 6, 8, 2 ] + serving_throughput_stream_73: [ 6, 2, 1, 3, 5, 7, 4, 10, 8, 9 ] + serving_throughput_stream_74: [ 2, 5, 3, 7, 10, 1, 9, 4, 6, 8 ] + serving_throughput_stream_75: [ 3, 8, 7, 10, 5, 1, 4, 9, 6, 2 ] + serving_throughput_stream_76: [ 4, 9, 8, 5, 1, 6, 10, 3, 2, 7 ] + serving_throughput_stream_77: [ 3, 4, 2, 9, 1, 6, 10, 7, 8, 5 ] + serving_throughput_stream_78: [ 9, 6, 8, 5, 4, 2, 7, 10, 3, 1 ] + serving_throughput_stream_79: [ 9, 3, 4, 2, 7, 1, 5, 8, 10, 6 ] + serving_throughput_stream_80: [ 8, 2, 3, 1, 7, 9, 6, 4, 10, 5 ] + serving_throughput_stream_81: [ 10, 5, 6, 2, 3, 8, 1, 9, 7, 4 ] + serving_throughput_stream_82: [ 6, 7, 2, 9, 8, 5, 10, 4, 3, 1 ] + serving_throughput_stream_83: [ 2, 9, 7, 5, 8, 6, 3, 10, 4, 1 ] + serving_throughput_stream_84: [ 5, 4, 10, 3, 6, 8, 7, 2, 9, 1 ] + serving_throughput_stream_85: [ 6, 5, 3, 9, 2, 10, 4, 8, 1, 7 ] + serving_throughput_stream_86: [ 3, 7, 9, 5, 2, 8, 10, 6, 1, 4 ] + serving_throughput_stream_87: [ 2, 6, 9, 8, 5, 7, 4, 1, 10, 3 ] + serving_throughput_stream_88: [ 8, 7, 1, 4, 10, 2, 3, 5, 6, 9 ] + serving_throughput_stream_89: [ 4, 9, 7, 8, 6, 10, 1, 2, 3, 5 ] + serving_throughput_stream_90: [ 7, 5, 1, 9, 8, 2, 10, 6, 3, 4 ] + serving_throughput_stream_91: [ 10, 1, 5, 7, 8, 6, 4, 3, 2, 9 ] + serving_throughput_stream_92: [ 8, 3, 4, 5, 9, 6, 7, 10, 1, 2 ] + serving_throughput_stream_93: [ 8, 5, 7, 10, 1, 2, 4, 6, 9, 3 ] + serving_throughput_stream_94: [ 9, 7, 2, 8, 6, 10, 5, 3, 1, 4 ] + serving_throughput_stream_95: [ 1, 10, 7, 3, 9, 4, 8, 2, 5, 6 ] + serving_throughput_stream_96: [ 8, 2, 1, 5, 10, 7, 4, 3, 9, 6 ] + serving_throughput_stream_97: [ 6, 2, 10, 7, 5, 4, 8, 9, 1, 3 ] + serving_throughput_stream_98: [ 2, 4, 3, 9, 5, 6, 1, 10, 7, 8 ] + serving_throughput_stream_99: [ 3, 5, 1, 7, 2, 4, 8, 9, 6, 10 ] + serving_throughput_stream_100: [ 9, 1, 6, 8, 10, 4, 7, 3, 5, 2 ] + serving_throughput_stream_101 : [1, 10, 3, 6, 9, 8, 2, 5, 7, 4] + serving_throughput_stream_102 : [3, 5, 4, 1, 10, 6, 2, 9, 7, 8] + serving_throughput_stream_103 : [6, 9, 8, 2, 1, 5, 10, 7, 4, 3] + serving_throughput_stream_104 : [2, 8, 10, 6, 1, 3, 7, 9, 5, 4] + serving_throughput_stream_105 : [5, 7, 9, 10, 3, 4, 2, 8, 1, 6] + serving_throughput_stream_106 : [3, 10, 2, 8, 6, 9, 5, 4, 7, 1] + serving_throughput_stream_107 : [1, 3, 5, 2, 10, 8, 4, 7, 6, 9] + 
serving_throughput_stream_108 : [10, 4, 2, 5, 1, 6, 7, 3, 9, 8] + serving_throughput_stream_109 : [6, 5, 9, 7, 3, 2, 10, 1, 8, 4] + serving_throughput_stream_110 : [7, 2, 6, 10, 3, 1, 4, 5, 9, 8] + serving_throughput_stream_111 : [6, 4, 8, 9, 10, 3, 7, 2, 5, 1] + serving_throughput_stream_112 : [2, 9, 10, 6, 7, 8, 3, 5, 1, 4] + serving_throughput_stream_113 : [7, 6, 3, 2, 5, 4, 10, 9, 1, 8] + serving_throughput_stream_114 : [3, 5, 2, 4, 7, 6, 1, 10, 8, 9] + serving_throughput_stream_115 : [1, 10, 2, 5, 9, 6, 7, 3, 4, 8] + serving_throughput_stream_116 : [5, 6, 7, 10, 9, 4, 1, 3, 8, 2] + serving_throughput_stream_117 : [2, 5, 6, 10, 9, 1, 8, 3, 4, 7] + serving_throughput_stream_118 : [6, 8, 5, 10, 9, 1, 2, 4, 7, 3] + serving_throughput_stream_119 : [3, 8, 7, 5, 2, 10, 9, 4, 1, 6] + serving_throughput_stream_120 : [3, 7, 6, 10, 1, 8, 4, 2, 9, 5] + serving_throughput_stream_121 : [10, 1, 2, 8, 6, 4, 3, 7, 9, 5] + serving_throughput_stream_122 : [4, 3, 6, 8, 9, 5, 10, 2, 1, 7] + serving_throughput_stream_123 : [8, 9, 10, 4, 5, 7, 6, 1, 3, 2] + serving_throughput_stream_124 : [2, 3, 7, 10, 8, 5, 6, 4, 9, 1] + serving_throughput_stream_125 : [10, 6, 2, 9, 8, 7, 4, 5, 1, 3] + serving_throughput_stream_126 : [2, 4, 6, 5, 1, 3, 7, 8, 9, 10] + serving_throughput_stream_127 : [9, 2, 8, 4, 1, 10, 3, 6, 7, 5] + serving_throughput_stream_128 : [10, 1, 7, 8, 5, 4, 9, 3, 2, 6] + serving_throughput_stream_129 : [1, 9, 7, 10, 3, 4, 2, 6, 5, 8] + serving_throughput_stream_130 : [2, 5, 1, 6, 4, 10, 7, 9, 8, 3] + serving_throughput_stream_131 : [3, 8, 1, 9, 7, 5, 4, 6, 2, 10] + serving_throughput_stream_132 : [5, 1, 3, 7, 6, 9, 10, 8, 2, 4] + serving_throughput_stream_133 : [2, 10, 9, 5, 6, 3, 4, 8, 7, 1] + serving_throughput_stream_134 : [3, 5, 10, 7, 1, 6, 2, 9, 8, 4] + serving_throughput_stream_135 : [4, 10, 5, 6, 9, 7, 2, 3, 8, 1] + serving_throughput_stream_136 : [5, 7, 6, 3, 1, 8, 10, 4, 9, 2] + serving_throughput_stream_137 : [3, 7, 1, 6, 10, 2, 9, 8, 5, 4] + serving_throughput_stream_138 : [4, 6, 9, 8, 5, 3, 1, 2, 10, 7] + serving_throughput_stream_139 : [7, 4, 8, 2, 10, 3, 1, 6, 9, 5] + serving_throughput_stream_140 : [2, 9, 8, 6, 7, 5, 10, 3, 4, 1] + serving_throughput_stream_141 : [2, 5, 6, 10, 1, 8, 3, 9, 7, 4] + serving_throughput_stream_142 : [10, 9, 2, 5, 3, 7, 6, 4, 1, 8] + serving_throughput_stream_143 : [10, 9, 3, 6, 2, 5, 1, 8, 4, 7] + serving_throughput_stream_144 : [1, 5, 6, 10, 9, 7, 2, 8, 4, 3] + serving_throughput_stream_145 : [6, 2, 3, 9, 8, 7, 5, 10, 4, 1] + serving_throughput_stream_146 : [2, 10, 6, 3, 5, 4, 1, 7, 8, 9] + serving_throughput_stream_147 : [10, 2, 5, 1, 4, 8, 3, 6, 9, 7] + serving_throughput_stream_148 : [1, 7, 8, 4, 3, 9, 2, 5, 6, 10] + serving_throughput_stream_149 : [4, 3, 1, 5, 9, 6, 7, 2, 10, 8] + serving_throughput_stream_150 : [3, 4, 7, 8, 9, 2, 6, 1, 10, 5] + serving_throughput_stream_151 : [4, 3, 9, 2, 7, 8, 10, 1, 6, 5] + serving_throughput_stream_152 : [6, 5, 4, 2, 9, 3, 8, 1, 10, 7] + serving_throughput_stream_153 : [10, 9, 1, 6, 4, 3, 2, 7, 8, 5] + serving_throughput_stream_154 : [10, 8, 3, 1, 4, 6, 9, 2, 7, 5] + serving_throughput_stream_155 : [5, 1, 2, 10, 9, 7, 8, 4, 3, 6] + serving_throughput_stream_156 : [6, 9, 10, 2, 1, 4, 8, 3, 5, 7] + serving_throughput_stream_157 : [10, 2, 5, 4, 7, 8, 1, 9, 3, 6] + serving_throughput_stream_158 : [7, 10, 1, 5, 4, 6, 3, 2, 8, 9] + serving_throughput_stream_159 : [8, 9, 7, 10, 6, 3, 2, 1, 5, 4] + serving_throughput_stream_160 : [3, 10, 9, 4, 2, 8, 5, 7, 6, 1] + serving_throughput_stream_161 : [3, 9, 4, 7, 8, 1, 5, 
10, 6, 2] + serving_throughput_stream_162 : [1, 10, 9, 2, 8, 4, 5, 7, 6, 3] + serving_throughput_stream_163 : [6, 5, 10, 8, 4, 7, 2, 1, 9, 3] + serving_throughput_stream_164 : [10, 1, 8, 4, 7, 9, 6, 2, 3, 5] + serving_throughput_stream_165 : [7, 6, 3, 4, 1, 2, 5, 8, 10, 9] + serving_throughput_stream_166 : [9, 3, 2, 8, 10, 4, 5, 6, 7, 1] + serving_throughput_stream_167 : [10, 4, 2, 3, 6, 8, 9, 1, 5, 7] + serving_throughput_stream_168 : [6, 10, 3, 2, 9, 7, 4, 1, 8, 5] + serving_throughput_stream_169 : [4, 10, 9, 2, 1, 6, 3, 8, 7, 5] + serving_throughput_stream_170 : [2, 10, 9, 7, 6, 4, 8, 1, 5, 3] + serving_throughput_stream_171 : [10, 7, 6, 3, 8, 5, 9, 4, 1, 2] + serving_throughput_stream_172 : [1, 3, 9, 8, 5, 7, 10, 2, 6, 4] + serving_throughput_stream_173 : [10, 8, 4, 1, 3, 7, 2, 6, 9, 5] + serving_throughput_stream_174 : [10, 8, 6, 2, 7, 5, 4, 1, 3, 9] + serving_throughput_stream_175 : [4, 8, 2, 7, 3, 1, 9, 10, 6, 5] + serving_throughput_stream_176 : [2, 5, 3, 8, 10, 4, 6, 1, 7, 9] + serving_throughput_stream_177 : [2, 9, 4, 1, 7, 5, 6, 3, 10, 8] + serving_throughput_stream_178 : [7, 8, 1, 5, 10, 4, 6, 9, 3, 2] + serving_throughput_stream_179 : [8, 4, 1, 6, 7, 5, 10, 2, 3, 9] + serving_throughput_stream_180 : [6, 7, 8, 1, 2, 5, 9, 10, 3, 4] + serving_throughput_stream_181 : [2, 9, 4, 6, 10, 7, 8, 3, 1, 5] + serving_throughput_stream_182 : [7, 8, 3, 10, 6, 1, 5, 4, 9, 2] + serving_throughput_stream_183 : [1, 3, 2, 5, 9, 7, 4, 10, 8, 6] + serving_throughput_stream_184 : [5, 3, 10, 9, 8, 2, 1, 7, 4, 6] + serving_throughput_stream_185 : [8, 9, 7, 6, 4, 5, 10, 3, 1, 2] + serving_throughput_stream_186 : [10, 5, 2, 4, 9, 6, 1, 8, 7, 3] + serving_throughput_stream_187 : [4, 7, 9, 3, 10, 5, 1, 2, 8, 6] + serving_throughput_stream_188 : [9, 10, 8, 3, 2, 6, 5, 4, 7, 1] + serving_throughput_stream_189 : [7, 2, 10, 1, 8, 5, 9, 6, 4, 3] + serving_throughput_stream_190 : [1, 4, 9, 3, 8, 5, 6, 10, 2, 7] + serving_throughput_stream_191 : [3, 1, 8, 9, 6, 2, 5, 10, 7, 4] + serving_throughput_stream_192 : [5, 10, 2, 7, 9, 8, 1, 6, 4, 3] + serving_throughput_stream_193 : [4, 6, 7, 10, 8, 2, 3, 9, 5, 1] + serving_throughput_stream_194 : [4, 6, 7, 9, 10, 1, 8, 3, 5, 2] + serving_throughput_stream_195 : [10, 3, 1, 7, 9, 2, 4, 8, 6, 5] + serving_throughput_stream_196 : [5, 8, 4, 2, 3, 6, 7, 9, 10, 1] + serving_throughput_stream_197 : [5, 2, 9, 1, 4, 7, 3, 6, 10, 8] + serving_throughput_stream_198 : [8, 9, 10, 3, 6, 5, 1, 4, 2, 7] + serving_throughput_stream_199 : [7, 1, 3, 2, 9, 8, 4, 10, 5, 6] + serving_throughput_stream_200 : [7, 4, 1, 2, 3, 6, 10, 9, 5, 8] diff --git a/scripts/tpcx-ai/driver/dat/tables.yaml b/scripts/tpcx-ai/driver/dat/tables.yaml new file mode 100644 index 00000000000..352413f96e1 --- /dev/null +++ b/scripts/tpcx-ai/driver/dat/tables.yaml @@ -0,0 +1,88 @@ +workload: + delimiter: ',' # default delimiter + usecases: + 1: + tables: + - "customer" + - "order" + - "lineitem" + - "order_returns" + raw_data_files: + - "customer.csv" + - "order.csv" + - "lineitem.csv" + - "order_returns.csv" + label_column: 'c_cluster_id' + 2: + tables: + - "CONVERSATION_AUDIO" + raw_data_files: + - "CONVERSATION_AUDIO.csv" + raw_data_folder: "CONVERSATION_AUDIO" + label_column: 'transcript' + delimiter: '|' + 3: + tables: + - "order" + - "lineitem" + - "product" + - "store_dept" + - "order_weekly_sales" + raw_data_files: + - "store_dept.csv" + - "order.csv" + - "lineitem.csv" + - "product.csv" + label_column: 'weekly_sales' + 4: + tables: + - "Review" + raw_data_files: + - "Review.psv" + label_column: "spam" + 
delimiter: "|" + 5: + tables: + - "marketplace" + raw_data_files: + - "marketplace.csv" + label_column: 'price' + delimiter: '|' + 6: + tables: + - "failures" + raw_data_files: + - "failures.csv" + label_column: "failure" + 7: + tables: + - "ProductRating" + raw_data_files: + - "ProductRating.csv" + label_column: 'rating' + 8: + tables: + - "order" + - "lineitem" + - "product" + raw_data_files: + - "order.csv" + - "lineitem.csv" + - "product.csv" + label_column: "trip_type" + 9: + tables: + - "CUSTOMER_IMAGES" + - "CUSTOMER_IMAGES_META" + raw_data_files: + - "CUSTOMER_IMAGES_META.csv" + raw_data_folder: "CUSTOMER_IMAGES" + label_column: 'identity' + 10: + tables: + - "financial_transactions" + - "financial_account" + raw_data_files: + - "financial_transactions.csv" + - "financial_account.csv" + label_column: "isFraud" diff --git a/scripts/tpcx-ai/driver/dat/thresholds.yaml b/scripts/tpcx-ai/driver/dat/thresholds.yaml new file mode 100644 index 00000000000..935bbb68391 --- /dev/null +++ b/scripts/tpcx-ai/driver/dat/thresholds.yaml @@ -0,0 +1,52 @@ +workload: + usecases: + 1: + quality_metric: "" + quality_metric_kvargs: {} + quality_metric_threshold: -1 + quality_metric_larger_is_better: True + 2: + quality_metric: "word_error_rate" + quality_metric_kvargs: {} + quality_metric_threshold: 0.5 + quality_metric_larger_is_better: False + 3: + quality_metric: "mean_squared_log_error" + quality_metric_kvargs: {} + quality_metric_threshold: 5.4 + quality_metric_larger_is_better: False + 4: + quality_metric: "f1_score" + quality_metric_kvargs: {} + quality_metric_threshold: 0.65 + quality_metric_larger_is_better: True + 5: + quality_metric: "mean_squared_log_error" + quality_metric_kvargs: {} + quality_metric_threshold: 0.5 + quality_metric_larger_is_better: False + 6: + quality_metric: "matthews_corrcoef" + quality_metric_kvargs: {} + quality_metric_threshold: 0.19 + quality_metric_larger_is_better: True + 7: + quality_metric: "median_absolute_error" + quality_metric_kvargs: {} + quality_metric_threshold: 1.8 + quality_metric_larger_is_better: False + 8: + quality_metric: "accuracy_score" + quality_metric_kvargs: {} + quality_metric_threshold: 0.65 + quality_metric_larger_is_better: True + 9: + quality_metric: "accuracy_score" + quality_metric_kvargs: {} + quality_metric_threshold: 0.9 + quality_metric_larger_is_better: True + 10: + quality_metric: "accuracy_score" + quality_metric_kvargs: {} + quality_metric_threshold: 0.7 + quality_metric_larger_is_better: True diff --git a/scripts/tpcx-ai/driver/dat/timeseries.yaml b/scripts/tpcx-ai/driver/dat/timeseries.yaml new file mode 100644 index 00000000000..0958c050c86 --- /dev/null +++ b/scripts/tpcx-ai/driver/dat/timeseries.yaml @@ -0,0 +1,2 @@ +workload: + timeseries: [] diff --git a/scripts/tpcx-ai/driver/dat/tpcxai.sha256 b/scripts/tpcx-ai/driver/dat/tpcxai.sha256 new file mode 100644 index 00000000000..2a397135ec6 --- /dev/null +++ b/scripts/tpcx-ai/driver/dat/tpcxai.sha256 @@ -0,0 +1,430 @@ +bin/tpcxai.sh, dba99a764c079bede94e9a5bdee23a8c4920d7b324eed6ec63c020dfcc1a55a8 +driver/dat/thresholds.yaml, 129e9c473fa39c8807baa64acc7fca777bbf5a5c4e0b0ae7121ab4b2586ea5c6 +driver/dat/tables.yaml, d7839169da5a1819b9042cf9742cb7a2ca4c4384d30ecb9f99af5cb123d08a15 +driver/dat/streams.yaml, 8b60e6ae975c30ec7c60c6ef66ce44c4aeb585afd168d19bd762e7dd0b818a60 +driver/dat/timeseries.yaml, 438f535f7767ba7a39483cf4a0002e9b2a242dcb9c8d987a4ede14c1853bae52 +driver/tpcxai-driver/data.py, 14f1312d86cd0e9b695962e5767af0bb547e494b2bd43d668374f1a8ee4354c7 
+driver/tpcxai-driver/database.py, 7aa8faa72201dc8221e9fd53348525126bc7b9f33446cf1bf0cfc05c0436b643 +driver/tpcxai-driver/data_generation.py, 105c8fb3862e890d96ff49621ccf7e742b468c077d6655bbda61e49894b4f8cd +driver/tpcxai-driver/logger.py, de510f2ac07f2a02b77d248925527fae199e0938b8ed43a994627678e5ae726c +driver/tpcxai-driver/__init__.py, e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +driver/tpcxai-driver/subprocess_util.py, 1740ae1dae699185891c00c0a01a90b013969c19c5a1d0d81f6cb104ce92297d +driver/tpcxai-driver/metrics.py, 068a5b9ed5de3e931248b978b3d04d62adb9ba1d31cf568bf938a97a6c7a1e45 +driver/tpcxai-driver/usecase.py, 03adb18bf2e2d7bc883833e97931fd1daa9c1302ffbdd0975c2f8e3af4041f6d +driver/tpcxai-driver/__main__.py, 96e6a90009d866f1bc609d9b9007d48fe2ec4d96ee2417ccbb0ac55d199a2a17 +data-gen/dicts/tpcxai/markovspam.bin, 775c32c7b1a35456609832d96e6d311bf9dfb144d4a5086a2967a6ac54b20d77 +data-gen/dicts/tpcxai/weekdays.dict, 7988849b81927984be05a6ae08e09e0926e8b03b88c2d18029a8aa61cbdb27f0 +data-gen/dicts/tpcxai/mail_provider.dict, 1ae89851301b2911f4ee0167df807ded2caf624adca3d49bdbaab74b5dcdfb95 +data-gen/dicts/tpcxai/scancount.dict, c2930967a01da3cdbe8c9c0b1f0f1d64b519c9715135693cc4390c8d53a49c47 +data-gen/dicts/tpcxai/department_list.dict, 416a1706dc275c74c957b3791c74aff99ff3f4f58295cefb9d2950b7c2e6ec61 +data-gen/dicts/tpcxai/Given-Names.dict, 50f50d5c4478d8c8412726da030b1ba6b618fe487cc021827df89de31ad023d3 +data-gen/dicts/tpcxai/bmfm_2020.h5, ba5b1955acb1a978e6fd0189c38d3b4cd0cdf7d9d4d8667a9cbd6f69124a56d0 +data-gen/dicts/tpcxai/Family-Names.dict, 1fcbb9e8762c266f3fe499cf7e8478325ec0b40c9860992df4e32fae76b9380c +data-gen/dicts/tpcxai/ds-genProbabilities.txt, 75813099a2dc9095f5a6887108f7bbf9cbdc8c22c1000da581a1386b1fe22128 +data-gen/dicts/tpcxai/categories.dict, 1868bd67de69f26abd086dad7f28f0b85c26528ac3fa16b64f2b5185014a1805 +data-gen/dicts/tpcxai/categoriesWeightedList.txt, b7703648f8ca1aec7ee3a050a1ddd410acbf45f70a75cfe9568eb9a8befc2996 +data-gen/dicts/tpcxai/Given-Names-Random.dict, 6edd05d6c16fad172d999cd292d171fd2fdc0bd1d3bd928c15b7de68b1cd9c14 +data-gen/dicts/tpcxai/Family-Names-Random.dict, 3eb32d8a9e51be81e2748f41d6769f60d7b50f495d54406e4f50827d87969d02 +data-gen/dicts/tpcxai/question.bin, 88a04d1d91969be7ce32f5e4d69907fb04acefbf1c6abc8394cd760054789d66 +data-gen/dicts/tpcxai/answer.bin, 208e73a4a7da061a659bf7bd43855abb1b73648b12abe15c21c6f1b89089a725 +data-gen/dicts/tpcxai/trip_type.dict, ac61234e489dded30db2661a54af06101161ef9d884c0f2a972acb75cf37ef0e +data-gen/dicts/tpcxai/markovham.bin, 6f7b566fd017059fccd4b129c20f167d9eb907e9b1dae6d9961105304d212a0b +data-gen/dicts/tpcxai/department.dict, e35612bc93f1bbca45fea16921d36f793167d9f0ac29e2ecf6888c9b23807ef1 +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Crewneck_3.bin, 889c1725829476491d608ceaf93d89da1a7f1e417e55f7f24cb4d837c0146dd2 +data-gen/dicts/tpcxai/models/product_description/Men_Shoes--Fashion Sneakers_1.bin, af45fde6630734366c2882ba88e61ee5ea3298cccb57b474fa5d90d47a1b57a1 +data-gen/dicts/tpcxai/models/product_description/Women_Jeans--Slim, Skinny_1.bin, e66cb462477bb33e917995125e3d7e015186213c578b1ce715de24c3bff6593b +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Tracksuits & Sweats_3.bin, 248bde1f0fc550ed2bd16c25af982ab214213cdca58727d44a0344c5852b747e +data-gen/dicts/tpcxai/models/product_description/Women_Underwear--Bras_2.bin, 74c8b1a595bec54cd506782c82876b0240630742fcbaa0ce03b77fe7d4f449d0 
+data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Cardigan_2.bin, db52f005d50f5fabd1828cdcb24a5254237258ee66ea5549a6673ee860f9987f +data-gen/dicts/tpcxai/models/product_description/Kids_Toys--Dolls & Accessories_1.bin, aefb5dab68d49f998e556afb9e1ecc6ea5cd99b6dd41d1805a07dea2bc9df419 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Pumps_2.bin, 9f1646573b5ee461e3a24e58a1aa14053295021eccdb549a34c356d309948664 +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Messenger & Crossbody_3.bin, 0bb07df12c87a6a71cc0c2542b1e9fc723756b4f2576111aa67d7240d774fdaf +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Shoulder Bag_2.bin, 0bcf4941ce8cdf34b386cf5bbabea63fd1204ebcd7a376135330c99a5f691cce +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Blouse_3.bin, bfe954e9b217296bb8d2ac0aba778733273f7909cc77ada00dd6998498318b30 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Tracksuits & Sweats_2.bin, ad3a6785ee946f5667001d401b01df40418374a8d392e543f4e9b7931a51a9a5 +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Wallets_3.bin, c64248474d068855f432987b3264d83bbfcd3ab8da36da1a1049da278b881e82 +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Makeup Palettes_1.bin, f3dfc2a8a3636d7d9ea35b7b286106172c87bb7cb22e8f0edc7159e3d7993bbd +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Sunglasses_1.bin, 8106b02c8de9ebd050fbed8e596cf4ccba6ecdb468ed95067420475cf5806483 +data-gen/dicts/tpcxai/models/product_description/Women_Swimwear--Two-Piece_2.bin, 417e3c1c358398815bde92ba781ce33eb30232a3870824ff7b5f41ab4f1ca2b0 +data-gen/dicts/tpcxai/models/product_description/Electronics_Video Games & Consoles--Games_2.bin, 98d993d662c987a70aed702979c38cc2ad19da22dc6c9e46204c31d70fcf7cc4 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tank, Cami_3.bin, 76279ca2f8568baab546cd8d47990a81c433ee1dfe099d3169b0a776dc7f2ad8 +data-gen/dicts/tpcxai/models/product_description/Kids_Toys--Action Figures & Statues_2.bin, ae01a2e7e9bf83dc5a712545abeecb9604e358f8032910c5e0c60cfee89c8ccf +data-gen/dicts/tpcxai/models/product_description/Beauty_Fragrance--Women_2.bin, 87e761faef0a0fb74e2950ed63c5d27110006d7815f1218f1cc780743530f483 +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Hooded_3.bin, 9806005c0c848c3634721a7ad36ee1b7d37635bd69dfab7164392b23e320901e +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Cardigan_3.bin, 8db6bd1f8129919338a46e5b469c92ff6244652e1d116ee2f7b63d9fbc100581 +data-gen/dicts/tpcxai/models/product_description/Women_Underwear--Bras_3.bin, cd52cb5d86513230ece9428bf6894283caebf7c3668e2f074782c6b16fc99881 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Jackets_2.bin, 56a93ccdffecd36d7e0093bc8ac942f563cd4cf50952242a1a11bbe4584a9856 +data-gen/dicts/tpcxai/models/product_description/Electronics_Video Games & Consoles--Games_1.bin, 8ed442f935cd6fc0ae359c92fd354fbf6154fc5e1d659e1c74003baa3860eb5d +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Totes & Shoppers_3.bin, 0faa8e283b0be40eea872e3acdb76666b5cd93815a3ab1acc73a967590823bbd +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Crewneck_1.bin, a2d3d6299f0a1f60c87e8dc19ad12a9f5df4cc72e972d2a9e9af65727e585cb4 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Pants, Tights, Leggings_1.bin, 
c78da99cfa3166420e2001d1164182ad75a1325e04ba0deb81abf53678eb00e6 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tunic_1.bin, 4bd7bbf25de49b22a78335a0b786d1d8e45ddc94165c2ba080e3126aefbca23f +data-gen/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_3.bin, 52b79a96b0c4f6f812163e3900b8f02acf74254cca84e7bbbda190564fd210ea +data-gen/dicts/tpcxai/models/product_description/Women_Dresses--Knee-Length_3.bin, 5ef35f57c5459a294227a89ed67cd7d3e4127dd6e3f1bbfecfe6d9f5d0a84a21 +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Lips_2.bin, 5419c85bf0924a6b2fc8da80af4a827bdafbbbda64e38a484eea76f49f3d6fe7 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Jackets_1.bin, 6c8df5114dba993f22d12e167d6498956cf2d1d73a115962bdb82476980fb533 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shirts & Tops_2.bin, 460828ae6cee916829a90bbb18ff86d9200fd36bef7a7f396d4fba78b8c0cb15 +data-gen/dicts/tpcxai/models/product_description/Women_Dresses--Above Knee, Mini_3.bin, 38cf07954a2d72112aed5e11d03fd37d9c8d99ff702d30652ab06965acbed265 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Fashion Sneakers_3.bin, 03374ee02b70e8f95fb2889882008ff6eae52675cede5e42f98cef2cd1ec4f22 +data-gen/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cases, Covers & Skins_3.bin, f3dc34720195b8ebfb7a1636ebddddaeb13db09dcd1f4e5d8b1a105d79d54096 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Sandals_3.bin, 7a05de62abb29dc60c5d797959fef2ca872ed4c35e0e5d38a8c17a3c33c3de50 +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Crewneck_2.bin, 9b4546557d9e313ede70d5a5738cf00315592465a6d75f629002515c47583e9f +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Sunglasses_2.bin, 27e1bfeb3ebb79905b02c82aac469b698ad7169b727a6ea4f27127e4905ebf1f +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Pants, Tights, Leggings_3.bin, 7bd6a79ecc809a57fc7646e3786cb790b9cba5f9979f7abfda072e19d46b8bd8 +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Face_3.bin, 2d6f991f814645afa96fd7215d2f829cf24f079a05ee6b6b893a0949f0ead940 +data-gen/dicts/tpcxai/models/product_description/Women_Swimwear--Two-Piece_3.bin, 2ba228a2167704091dce5284aca0aa74f7d09562be236f60a93a85878fe8518a +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shirts & Tops_1.bin, df03b19b6885840c6f17480efeae90ef81ae542f83da7ad70e489a358d907793 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Blouse_2.bin, a1d4437b51ec69eeeeafa5c63d87003527b46f34fd3ba76849eb037e5c64d1be +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Shoulder Bag_1.bin, dcb80c77583e5c8c67e8f5c8fa047d87f898b263dd28795a75b4b569e9dabec0 +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Wallets_1.bin, 8cc110e191494f2432561fa2b5290c6e0c84467aed6a29e02cbf067701ee5a6a +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Eyes_2.bin, f72140a10d287cb68b7239e2433ac32a5841b2937cc342fb9fb14d31ade8d7d1 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Pants, Tights, Leggings_2.bin, a9a5d6b4af3f65b0867618ba720574a9eed07782e6f73200da163a507ffa5822 +data-gen/dicts/tpcxai/models/product_description/Men_Shoes--Fashion Sneakers_2.bin, 1035f933468ab3cb42d943d0addeade79bd74b2606f95493fe936e7debbe8801 +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Necklaces_3.bin, 
3ff955b7914bd13390adf0fae56d96e224506e8cf644c85bd5ce786f771018d7 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Boots_2.bin, 9d42e8f7b97591eb01ea0ac20337bb679801b5eb0cac8f85c10c75ffae48d69c +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shorts_3.bin, ea39172be511d3a04af7a2425adbcb4f4dd355727a72f2a9eda8932827036b39 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--T-Shirts_1.bin, cf42dbd6032e540e0a39bdb85ebd9032492f80d22f6c032a47b6a10958c20f60 +data-gen/dicts/tpcxai/models/product_description/Women_Jeans--Slim, Skinny_2.bin, ccd43fc745452b94946345819f83d5b82c73d6e3900bb989cb990b832d8f63e2 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Fashion Sneakers_1.bin, 89cf4cc3709d7c8a53e0cefb759f3b9148d5818d393a533bcdd1ec9fd2f6c176 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shirts & Tops_3.bin, fa53cc55e9a57260396121f1c9212c8a283c85182dc3535d87ceb84b6f5ae4fc +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--T-Shirts_2.bin, bb831f81dcaa02147b363ad25fe6507384a877c5f4cc7158f1524e3a3618102e +data-gen/dicts/tpcxai/models/product_description/Women_Jeans--Leggings_2.bin, 53e9531de9c84998ac24d74bcdeab12705f830e074a203b389dc9273a67e44ab +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Makeup Palettes_2.bin, 90e9a14832f5e1122e0f17a809e18d9767cba2afa8e451d0a9ae9ead6155bee0 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Fashion Sneakers_2.bin, 02f9327f5f43264dec93fa4fceeacc7814cb547ed0cd1e2337971c3200eab034 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Athletic_3.bin, 9f53438d0f0dcd6baf1286a4c21bd517bc149915cad654b5556d799283f0f985 +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Necklaces_2.bin, 09df70771d6115d83a65aed70bbf6781030c75846667dbc390b7400943a152d5 +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Makeup Palettes_3.bin, 0cf4c5c5ee3bec759e26441d011b7a4465b7e8f3a929372bdea65eb62cf5d7a0 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Button Down Shirt_2.bin, f06ccad9693a0d3c0c05ebaa0471979d1883699c6f6a44368aa790b3a9e0a87d +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Cardigan_1.bin, 57b59c51c267c74b4e6cb283272e46139f9ba3de4a96a829035688ffbbed34a5 +data-gen/dicts/tpcxai/models/product_description/Women_Jeans--Boot Cut_2.bin, 3085ebb6a317113dbd10267374bc8b8fee07ac420b4fc1e39fa365decaece577 +data-gen/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_1.bin, 1b8eeee5aef4b01fb5f7f98a48204a6bc99882ccd8bda75653cef23ed995cd39 +data-gen/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_4.bin, 1d6bb5200effe60646fb9e45bbba214d2a9c81d97e1277382b1e409f0e9b014a +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Face_2.bin, 2d091ee25189048b6b8e0b6a6a432b9e362bfe3e29a9c2d3b77a53ccff50fb7a +data-gen/dicts/tpcxai/models/product_description/Women_Dresses--Above Knee, Mini_2.bin, 8b7940b72fa6fd150dc70c30b621d894226d55165aa70dc55a110a28881b0e42 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Boots_1.bin, ce1b932d30e0092820e2a7851024bf95f94f7c0e57552322fe9c5789a5f37ff3 +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Hats_2.bin, efdd68c5364cc548b7e99fa40b13fa163fb3de379e4ca63c81f51e8021f094b8 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Button Down Shirt_3.bin, d45b7a7918f89d941790c126358709601f5dba3800ff285ac45a8b71e0457209 
+data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Sports Bras_1.bin, e036298b79d3c0930eb0dde930331b41edee6973cc7a73dc222d70ac67ab94ab +data-gen/dicts/tpcxai/models/product_description/Women_Jeans--Slim, Skinny_3.bin, 6e3d9014bd7cad5fab3d44a31c6a1b18e31dce925e290f51a4cbf279bd5ef169 +data-gen/dicts/tpcxai/models/product_description/Women_Dresses--Knee-Length_2.bin, 9e6f5cbebdaba3315432f48dc51d9888c79f0d2e6545fc002a7b3d89a362848f +data-gen/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cell Phones & Smartphones_3.bin, 638334950c98f797924e8162061417c0fbf0319e676c03177bd230610421f6ad +data-gen/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cases, Covers & Skins_1.bin, 6839dc9425d1d6f6c3e3732bcf28b1a458478adc54e1baf0f9b2e12e0a566c85 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Pumps_3.bin, d1c1d0c60c3d03d0e4ea6b9d436b68679e5fc7588aed42a29db806cea27ee752 +data-gen/dicts/tpcxai/models/product_description/Beauty_Fragrance--Women_1.bin, bf1f481558c8081fd50daa2d889987ddac42af7043d0dbb2acfb12cd76137e45 +data-gen/dicts/tpcxai/models/product_description/Women_Dresses--Above Knee, Mini_1.bin, b4b0e770407f12f57dc418a4b7b67bec66028771c72b3844a13b0d82b718db34 +data-gen/dicts/tpcxai/models/product_description/Kids_Toys--Dolls & Accessories_2.bin, 0b793fd20e7e85090dbec9fbe2b502f396ca8c306221193d98c1570e9e7180e9 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Boots_3.bin, 6a6caa6edb59b76536a6bdf0bff8092f6bf91d8c6c9698a53500d4282855e57d +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Jackets_3.bin, 6c19006ea0d0e6421da3d0fac02463b84d5bbe4ba2adb66be5d14a846ddaf7cf +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Messenger & Crossbody_1.bin, 5c0703b62c6789a619f402ec804fad0cea473f024866dcc35573435118e58b24 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Sandals_1.bin, e32acac40077a453d68d8f33bc2cbda446ea26b811d28cd3203521c230ed8602 +data-gen/dicts/tpcxai/models/product_description/Women_Jeans--Leggings_1.bin, bd048b39002c6ec591f091446482b436fa94a6bdf751ad958f308bae7ad9f74e +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Lips_3.bin, 322bbda079b06f41bc4c37eda7867021a25cfe1a57da82a1d52b6619e6960735 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Sports Bras_2.bin, fbf5b55b4b833712c9fb741acc21e89eb931954a3a4d82e4e3283c3a66e57899 +data-gen/dicts/tpcxai/models/product_description/Men_Tops--T-shirts_2.bin, 833712d756bae4e475f5e898b180ef1f1cf646cfbb2e75c7e07700c5c365c1a0 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shorts_2.bin, 7eb5edf13942869c41ef9031fbb68685e3d7f54842c33d4d2a824674fe82a16a +data-gen/dicts/tpcxai/models/product_description/Men_Men's Accessories--Hats_1.bin, 21a8f4a417360ffb821b7f999658335e382479aaca9b043c8a32b39a9e56c07b +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Earrings_2.bin, 4d7b7004dbc27b346a3246baa529f74dce3191df8a416c18810919c9403c8ef6 +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Hats_1.bin, 3bb4c7f76523e427f5511296815997efdeb3de1ba83ad8637454439979c6251e +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Messenger & Crossbody_2.bin, e0f9daead028e61c045ed20497f7f001d44920f2c4d740ea1ebe269231aed9b4 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Blouse_1.bin, eb7b18cc42e2c8d5949bc16eacc91d89af22afd8ccf44cb22ca8cbd208bc2294 
+data-gen/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cell Phones & Smartphones_2.bin, a5545bba08356cc8f5c94379d5c47d6b21d40af577809e72c8cf40aa48f28025 +data-gen/dicts/tpcxai/models/product_description/Men_Men's Accessories--Hats_2.bin, ae2eadab3d68f580f2247065daf9c3084c56c87c71d2d5100150ea671fec8349 +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Hooded_2.bin, 086c0f35e64dbe71f340644df93ead3a9f8631292c47ce3123a887befbe28dfc +data-gen/dicts/tpcxai/models/product_description/Women_Dresses--Knee-Length_1.bin, 4b95ad85011829740106e914aebb1ca3d54b0850463b5439bf16e715b2e6ea0a +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Athletic_1.bin, c2e918890173732fac3a54a2cb2999df1d027cfcd99e063ee9b492be7f5656b7 +data-gen/dicts/tpcxai/models/product_description/Electronics_Video Games & Consoles--Games_3.bin, 626616534b1d8bb4695394b1d4d6d9fbf8fed03c06f861005d8c840a2c3dbd5b +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Bracelets_1.bin, c1ace64f7041d2d1fb009c43d1280d1f9a38663bdf426aabb4dfe11bab2b9d90 +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Eyes_1.bin, 0ab809e3c1b2129bc3a80d35d40f325aa79fb55a4c6a514ac312534e10b445b4 +data-gen/dicts/tpcxai/models/product_description/Women_Sweaters--Hooded_1.bin, f19af0271dcfc692cdb4f8513b07f9f8bbf6cd6c30eaecce6fa0725b9eb98694 +data-gen/dicts/tpcxai/models/product_description/Women_Underwear--Bras_1.bin, 91bcbb6f86f08849b06346f15b542956c1f7fb0f6c8f27b7cec00323eb40c767 +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Sunglasses_3.bin, 7d7370140c37cfa01a77c0a5afdeff95f98834bc8f5b650068ff3dbb5154bf2b +data-gen/dicts/tpcxai/models/product_description/Men_Tops--T-shirts_3.bin, c2ee3d82f94c4f8f84fe0c7a0304d14aafc126189e46937e98a9f9943d7474aa +data-gen/dicts/tpcxai/models/product_description/Women_Jeans--Boot Cut_3.bin, 1bfa8012fafe94c346c1f742dae2aeb38bf854f6348090c62739b7177f10e8b5 +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Shoulder Bag_3.bin, f2b718ebbdf5d4062de896cf2293bd5e73d1e98ed861e42b62ae13a3fcfd03a9 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tunic_2.bin, 1f558f72c56e1c4ef286a89f91f84f727f7bdcd46611e8fbb1abfeb44984bbbc +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Eyes_3.bin, 66d28fba3a22a9dc786f2a2e7f31361b3385b15bbb27beb6d9d799a6c6157d55 +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Bracelets_3.bin, d3f6fe1fda3d5ce3140479b050c8ba5687865a6c662f0e29090b9e8206877a3f +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Tracksuits & Sweats_1.bin, 68c87a67d73cae2571ec259ca2dea0bd2fd501ff1ac4b1d7c64f080f6b968179 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Athletic_2.bin, 82fe8b7d88f32daae759bff72cfdd94dfc141ff7e93dffce303e0155b73b644b +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Totes & Shoppers_2.bin, 6cd88f6a3184df31246478ad585213b7d3f36f76304de949a9927c9d65c0bbf2 +data-gen/dicts/tpcxai/models/product_description/Beauty_Fragrance--Women_3.bin, 22cf502cc2e5f0d41e2a69d12bf17ab903fdf9786c661ffe0fe2d65d86d18a8d +data-gen/dicts/tpcxai/models/product_description/Kids_Toys--Dolls & Accessories_3.bin, e64b3d3c36920499d1d4c1dd4c340868632b5f8687c616a02707689a4c4e9be6 +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Lips_1.bin, 64d6748884ca29016375a200be1f0e24db4afa8138c55d58789cb4ea6f5a9fad +data-gen/dicts/tpcxai/models/product_description/Electronics_Cell Phones & 
Accessories--Cases, Covers & Skins_2.bin, 8dd74956db41fe7b6111298d1e1fe44cc5872325c7de7832ec3f16eb42bf874b +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tank, Cami_2.bin, 30e141ecd9a84855a117f2484669a147bd724ad597ad56ed3fd76fa62eb5fded +data-gen/dicts/tpcxai/models/product_description/Women_Women's Accessories--Wallets_2.bin, 16b98a300ef92f2060c3bbddb2ad8d1435ae3d306ab5c91649f64432dab640e9 +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Earrings_1.bin, a55681d66a6b7e799f51963c01834b3ef1282c1cfc011ee7155723825a85f391 +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Bracelets_2.bin, 08f8a7c32a010a5140a8a96e1ebd68d551182f0d7a7bd06e311b11c03e5e6a74 +data-gen/dicts/tpcxai/models/product_description/Kids_Toys--Action Figures & Statues_1.bin, 949c4665d04dad214452192072d334a161217c07e9e74c1cc1c673e3a6de6b76 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Sports Bras_3.bin, 48bfc53ba92d410c9c0208d1f4f5776e95682ed26b248deba0d4bcad1a6b1ccb +data-gen/dicts/tpcxai/models/product_description/Women_Jewelry--Necklaces_1.bin, 997dd85333bd0bc2f6d6f42d738a412694ff8795842f0d26ddedfee0a70e34b4 +data-gen/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shorts_1.bin, c7d76ab3d6289ce34093137882ea56387326c724b24aaf12fe85d14e51d0c756 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tank, Cami_1.bin, c63ecc6d6b1d394448124802620fbf74f6adbf4c4d3ed9a79c8cb101ed893db8 +data-gen/dicts/tpcxai/models/product_description/Women_Swimwear--Two-Piece_1.bin, f67c88e71ff800c524b2a678b42890facaf720383ce7a87f87b0428477087436 +data-gen/dicts/tpcxai/models/product_description/Men_Shoes--Fashion Sneakers_3.bin, 6dbbfae988fc25e80d72d0393f82b8f372a9c056204bcd6db78801451fa72f8a +data-gen/dicts/tpcxai/models/product_description/Beauty_Makeup--Face_1.bin, e6bde3402cf4bc70019a7433dbe72c75a01f637f4e5d0ac27b3231996a10ee6b +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--T-Shirts_3.bin, b275e4f514fc6152da2487cc4f940762426fc7456c3aecef5c8c75189dce5142 +data-gen/dicts/tpcxai/models/product_description/Men_Tops--T-shirts_1.bin, 561d5171f131b5b20af54dce23eed116561e7b3dd20c2a01033a0d65039e4f1f +data-gen/dicts/tpcxai/models/product_description/Kids_Toys--Action Figures & Statues_3.bin, da6817a84833aa9d9a114fc3d2bf718e2e26e34fbfc84447146eca2930b97e85 +data-gen/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_2.bin, 20de55826e379dba1bc514531c6ac0f0985c2f596200929e4cde9e882c4e4eb6 +data-gen/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tunic_3.bin, 9f78da8ec3dddfa8f91c20aefbb7ec61a1d2219e0185e30e615ba6f215db8244 +data-gen/dicts/tpcxai/models/product_description/Women_Shoes--Sandals_2.bin, 290d89a0d4f431d4e25ca46aeec18ea28c241341c1355447d51416dcb41999df +data-gen/dicts/tpcxai/models/product_description/Women_Women's Handbags--Totes & Shoppers_1.bin, a9f64197d6be80d002d9d018416611cae07870bdbf0ffe4bf24b2b60b3583a55 +data-gen/plugins/adabench/FaceImage.jar, c31a3c40976617fd8ca050e6d5eb31d79c45acbea1e932be3a9edf7da241d01a +data-gen/plugins/adabench/SpeechPlugin.jar, bd2d8beecdf928fe598d86ee9f0018d70f5081ab4f71b09f57d48a5e67fc0cc7 +data-gen/config/tpcxai-schema.xml, 3f23f70acfaeb8c5df8de4325f30b42dbb3ced8a720e181e8598c1c1d64b5249 +data-gen/config/tpcxai-generation.xml, ea2dfb523d0e80e7ea1e123b7d89c7a574aa84e38e6e86d8162f5f049b4fa550 +lib/pdgf/log4j2.xml, 99dead9f8eb5f5b10d554d4c6d6e67796147963f22f6df921530d46c9af47343 +lib/pdgf/pdgf.jar, 
b0779380b9ea46ee76446340ad927afe88feabe157aaff501f91880b9a8995d3 +lib/pdgf/LICENSE.txt, 00ba38395357b47281646cdac89c43ab1d1838d3346419c27a90991629785122 +lib/pdgf/THIRD-PARTY-LICENSE.txt, fc5c7061490de09a82fb7edec93546444aeff3af80a3e43fa3891c954a0ef70d +lib/pdgf/extlib/xercesImpl.jar, 175bbbd9def7a22a30e05f9c8db629c88fa5bdfc57d5658791901b54222abe1c +lib/pdgf/extlib/log4j-1.2-api-2.17.1.jar, f652ea6f4f47be136d7a0248106758082b480e829ca6f340529e6a6bea046930 +lib/pdgf/extlib/log4j-api-2.17.1.jar, 8ccb7074016565f0aeaa5981b63efa5bd0391bc996a86b51a4775584947464ad +lib/pdgf/extlib/javassist.jar, ef762cab76812291d01b152c5bb5f972bd8185c0c678f7a96598a5d530150fcd +lib/pdgf/extlib/commons.lang3.time.jar, 77cf983703ba0edb952b71436d0fea1108ecda3aa82b2970936c3c4041192d3b +lib/pdgf/extlib/FaceImage.jar, c31a3c40976617fd8ca050e6d5eb31d79c45acbea1e932be3a9edf7da241d01a +lib/pdgf/extlib/log4j-core-2.17.1-sources.jar, 88346c73859d9f4b7c6c0545bf1ecb0e9c3a31e27c4293563597fe2b7da2a62e +lib/pdgf/extlib/commons-cli-1.3.1.jar, 3a2f057041aa6a8813f5b59b695f726c5e85014a703d208d7e1689098e92d8c0 +lib/pdgf/extlib/SpeechPlugin.jar, bd2d8beecdf928fe598d86ee9f0018d70f5081ab4f71b09f57d48a5e67fc0cc7 +lib/pdgf/extlib/commons-net-3.3.jar, b35ad597f17a6f221575df2f729a9de8f70390509e047680771e713bad713fb9 +lib/pdgf/extlib/jgraphx.jar, ff109bc72f86aa977eee39fab51bf9dbc5c8493a4ea218b011daaa17c7aeae2b +lib/pdgf/extlib/log4j-core-2.17.1.jar, 7e9ee383f6c730557c133bb7a840b7a4225c14e786d543aeae079b3173b58017 +lib/pdgf/extlib/log4j-core-2.17.1-javadoc.jar, ba4fac686eafcb3fe95deaad6aee54c1de91f9dc786d9f58a468f702c9b160e0 +lib/pdgf/extlib/xml-apis.jar, a840968176645684bb01aed376e067ab39614885f9eee44abe35a5f20ebe7fad +lib/pdgf/dicts/pseudoTextGenerator/nouns.dict, 6324efecb6daeae33108c19543af3e3eb2ebf762cec8e23a4127cfb521d3eb16 +lib/pdgf/dicts/pseudoTextGenerator/auxiliaries.dict, 4bf06e53ea97dcdf7eff3018b602c17ba3f850d763d7c2b72472884165beede6 +lib/pdgf/dicts/pseudoTextGenerator/verbs.dict, 8822ebd6a45e4058211525fa9b2f00d6d5a571c35dc218bbff88e7f27a60796b +lib/pdgf/dicts/pseudoTextGenerator/PseudoTextGenerator.properties, 66e5bbc43142edc7555b0af3ec73187d85d508c03625499195aad77f20b53d21 +lib/pdgf/dicts/pseudoTextGenerator/adverbs.dict, 88d66d692a6d0d00e89588019df727e0e7816e79e245eca76a006b26e020012f +lib/pdgf/dicts/pseudoTextGenerator/prepositions.dict, 4e32dfad41720797e87b748962a8ff4b7732538ff1cb994892c44c03cf641062 +lib/pdgf/dicts/pseudoTextGenerator/adjectives.dict, e01d94e261ce5ebc5f226e6d330cadc6c1a850f61a7b8d6aa72f33c526b7be91 +lib/pdgf/dicts/pseudoTextGenerator/terminators.dict, c1a0f116e970014c2af04cfcf9acb6a6be14d418691ba1df1e1e46b877a55e4d +lib/pdgf/dicts/tpcxai/markovspam.bin, 775c32c7b1a35456609832d96e6d311bf9dfb144d4a5086a2967a6ac54b20d77 +lib/pdgf/dicts/tpcxai/weekdays.dict, 7988849b81927984be05a6ae08e09e0926e8b03b88c2d18029a8aa61cbdb27f0 +lib/pdgf/dicts/tpcxai/mail_provider.dict, 1ae89851301b2911f4ee0167df807ded2caf624adca3d49bdbaab74b5dcdfb95 +lib/pdgf/dicts/tpcxai/scancount.dict, c2930967a01da3cdbe8c9c0b1f0f1d64b519c9715135693cc4390c8d53a49c47 +lib/pdgf/dicts/tpcxai/department_list.dict, 416a1706dc275c74c957b3791c74aff99ff3f4f58295cefb9d2950b7c2e6ec61 +lib/pdgf/dicts/tpcxai/Given-Names.dict, 50f50d5c4478d8c8412726da030b1ba6b618fe487cc021827df89de31ad023d3 +lib/pdgf/dicts/tpcxai/bmfm_2020.h5, ba5b1955acb1a978e6fd0189c38d3b4cd0cdf7d9d4d8667a9cbd6f69124a56d0 +lib/pdgf/dicts/tpcxai/Family-Names.dict, 1fcbb9e8762c266f3fe499cf7e8478325ec0b40c9860992df4e32fae76b9380c +lib/pdgf/dicts/tpcxai/ds-genProbabilities.txt, 
75813099a2dc9095f5a6887108f7bbf9cbdc8c22c1000da581a1386b1fe22128 +lib/pdgf/dicts/tpcxai/categories.dict, 1868bd67de69f26abd086dad7f28f0b85c26528ac3fa16b64f2b5185014a1805 +lib/pdgf/dicts/tpcxai/categoriesWeightedList.txt, b7703648f8ca1aec7ee3a050a1ddd410acbf45f70a75cfe9568eb9a8befc2996 +lib/pdgf/dicts/tpcxai/Given-Names-Random.dict, 6edd05d6c16fad172d999cd292d171fd2fdc0bd1d3bd928c15b7de68b1cd9c14 +lib/pdgf/dicts/tpcxai/Family-Names-Random.dict, 3eb32d8a9e51be81e2748f41d6769f60d7b50f495d54406e4f50827d87969d02 +lib/pdgf/dicts/tpcxai/question.bin, 88a04d1d91969be7ce32f5e4d69907fb04acefbf1c6abc8394cd760054789d66 +lib/pdgf/dicts/tpcxai/answer.bin, 208e73a4a7da061a659bf7bd43855abb1b73648b12abe15c21c6f1b89089a725 +lib/pdgf/dicts/tpcxai/trip_type.dict, ac61234e489dded30db2661a54af06101161ef9d884c0f2a972acb75cf37ef0e +lib/pdgf/dicts/tpcxai/markovham.bin, 6f7b566fd017059fccd4b129c20f167d9eb907e9b1dae6d9961105304d212a0b +lib/pdgf/dicts/tpcxai/department.dict, e35612bc93f1bbca45fea16921d36f793167d9f0ac29e2ecf6888c9b23807ef1 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Crewneck_3.bin, 889c1725829476491d608ceaf93d89da1a7f1e417e55f7f24cb4d837c0146dd2 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Shoes--Fashion Sneakers_1.bin, af45fde6630734366c2882ba88e61ee5ea3298cccb57b474fa5d90d47a1b57a1 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jeans--Slim, Skinny_1.bin, e66cb462477bb33e917995125e3d7e015186213c578b1ce715de24c3bff6593b +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Tracksuits & Sweats_3.bin, 248bde1f0fc550ed2bd16c25af982ab214213cdca58727d44a0344c5852b747e +lib/pdgf/dicts/tpcxai/models/product_description/Women_Underwear--Bras_2.bin, 74c8b1a595bec54cd506782c82876b0240630742fcbaa0ce03b77fe7d4f449d0 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Cardigan_2.bin, db52f005d50f5fabd1828cdcb24a5254237258ee66ea5549a6673ee860f9987f +lib/pdgf/dicts/tpcxai/models/product_description/Kids_Toys--Dolls & Accessories_1.bin, aefb5dab68d49f998e556afb9e1ecc6ea5cd99b6dd41d1805a07dea2bc9df419 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Pumps_2.bin, 9f1646573b5ee461e3a24e58a1aa14053295021eccdb549a34c356d309948664 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Messenger & Crossbody_3.bin, 0bb07df12c87a6a71cc0c2542b1e9fc723756b4f2576111aa67d7240d774fdaf +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Shoulder Bag_2.bin, 0bcf4941ce8cdf34b386cf5bbabea63fd1204ebcd7a376135330c99a5f691cce +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Blouse_3.bin, bfe954e9b217296bb8d2ac0aba778733273f7909cc77ada00dd6998498318b30 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Tracksuits & Sweats_2.bin, ad3a6785ee946f5667001d401b01df40418374a8d392e543f4e9b7931a51a9a5 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Accessories--Wallets_3.bin, c64248474d068855f432987b3264d83bbfcd3ab8da36da1a1049da278b881e82 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Makeup Palettes_1.bin, f3dfc2a8a3636d7d9ea35b7b286106172c87bb7cb22e8f0edc7159e3d7993bbd +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Accessories--Sunglasses_1.bin, 8106b02c8de9ebd050fbed8e596cf4ccba6ecdb468ed95067420475cf5806483 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Swimwear--Two-Piece_2.bin, 417e3c1c358398815bde92ba781ce33eb30232a3870824ff7b5f41ab4f1ca2b0 +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Video 
Games & Consoles--Games_2.bin, 98d993d662c987a70aed702979c38cc2ad19da22dc6c9e46204c31d70fcf7cc4 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tank, Cami_3.bin, 76279ca2f8568baab546cd8d47990a81c433ee1dfe099d3169b0a776dc7f2ad8 +lib/pdgf/dicts/tpcxai/models/product_description/Kids_Toys--Action Figures & Statues_2.bin, ae01a2e7e9bf83dc5a712545abeecb9604e358f8032910c5e0c60cfee89c8ccf +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Fragrance--Women_2.bin, 87e761faef0a0fb74e2950ed63c5d27110006d7815f1218f1cc780743530f483 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Hooded_3.bin, 9806005c0c848c3634721a7ad36ee1b7d37635bd69dfab7164392b23e320901e +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Cardigan_3.bin, 8db6bd1f8129919338a46e5b469c92ff6244652e1d116ee2f7b63d9fbc100581 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Underwear--Bras_3.bin, cd52cb5d86513230ece9428bf6894283caebf7c3668e2f074782c6b16fc99881 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Jackets_2.bin, 56a93ccdffecd36d7e0093bc8ac942f563cd4cf50952242a1a11bbe4584a9856 +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Video Games & Consoles--Games_1.bin, 8ed442f935cd6fc0ae359c92fd354fbf6154fc5e1d659e1c74003baa3860eb5d +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Totes & Shoppers_3.bin, 0faa8e283b0be40eea872e3acdb76666b5cd93815a3ab1acc73a967590823bbd +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Crewneck_1.bin, a2d3d6299f0a1f60c87e8dc19ad12a9f5df4cc72e972d2a9e9af65727e585cb4 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Pants, Tights, Leggings_1.bin, c78da99cfa3166420e2001d1164182ad75a1325e04ba0deb81abf53678eb00e6 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tunic_1.bin, 4bd7bbf25de49b22a78335a0b786d1d8e45ddc94165c2ba080e3126aefbca23f +lib/pdgf/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_3.bin, 52b79a96b0c4f6f812163e3900b8f02acf74254cca84e7bbbda190564fd210ea +lib/pdgf/dicts/tpcxai/models/product_description/Women_Dresses--Knee-Length_3.bin, 5ef35f57c5459a294227a89ed67cd7d3e4127dd6e3f1bbfecfe6d9f5d0a84a21 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Lips_2.bin, 5419c85bf0924a6b2fc8da80af4a827bdafbbbda64e38a484eea76f49f3d6fe7 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Jackets_1.bin, 6c8df5114dba993f22d12e167d6498956cf2d1d73a115962bdb82476980fb533 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shirts & Tops_2.bin, 460828ae6cee916829a90bbb18ff86d9200fd36bef7a7f396d4fba78b8c0cb15 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Dresses--Above Knee, Mini_3.bin, 38cf07954a2d72112aed5e11d03fd37d9c8d99ff702d30652ab06965acbed265 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Fashion Sneakers_3.bin, 03374ee02b70e8f95fb2889882008ff6eae52675cede5e42f98cef2cd1ec4f22 +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cases, Covers & Skins_3.bin, f3dc34720195b8ebfb7a1636ebddddaeb13db09dcd1f4e5d8b1a105d79d54096 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Sandals_3.bin, 7a05de62abb29dc60c5d797959fef2ca872ed4c35e0e5d38a8c17a3c33c3de50 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Crewneck_2.bin, 9b4546557d9e313ede70d5a5738cf00315592465a6d75f629002515c47583e9f +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's 
Accessories--Sunglasses_2.bin, 27e1bfeb3ebb79905b02c82aac469b698ad7169b727a6ea4f27127e4905ebf1f +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Pants, Tights, Leggings_3.bin, 7bd6a79ecc809a57fc7646e3786cb790b9cba5f9979f7abfda072e19d46b8bd8 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Face_3.bin, 2d6f991f814645afa96fd7215d2f829cf24f079a05ee6b6b893a0949f0ead940 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Swimwear--Two-Piece_3.bin, 2ba228a2167704091dce5284aca0aa74f7d09562be236f60a93a85878fe8518a +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shirts & Tops_1.bin, df03b19b6885840c6f17480efeae90ef81ae542f83da7ad70e489a358d907793 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Blouse_2.bin, a1d4437b51ec69eeeeafa5c63d87003527b46f34fd3ba76849eb037e5c64d1be +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Shoulder Bag_1.bin, dcb80c77583e5c8c67e8f5c8fa047d87f898b263dd28795a75b4b569e9dabec0 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Accessories--Wallets_1.bin, 8cc110e191494f2432561fa2b5290c6e0c84467aed6a29e02cbf067701ee5a6a +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Eyes_2.bin, f72140a10d287cb68b7239e2433ac32a5841b2937cc342fb9fb14d31ade8d7d1 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Pants, Tights, Leggings_2.bin, a9a5d6b4af3f65b0867618ba720574a9eed07782e6f73200da163a507ffa5822 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Shoes--Fashion Sneakers_2.bin, 1035f933468ab3cb42d943d0addeade79bd74b2606f95493fe936e7debbe8801 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Necklaces_3.bin, 3ff955b7914bd13390adf0fae56d96e224506e8cf644c85bd5ce786f771018d7 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Boots_2.bin, 9d42e8f7b97591eb01ea0ac20337bb679801b5eb0cac8f85c10c75ffae48d69c +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shorts_3.bin, ea39172be511d3a04af7a2425adbcb4f4dd355727a72f2a9eda8932827036b39 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--T-Shirts_1.bin, cf42dbd6032e540e0a39bdb85ebd9032492f80d22f6c032a47b6a10958c20f60 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jeans--Slim, Skinny_2.bin, ccd43fc745452b94946345819f83d5b82c73d6e3900bb989cb990b832d8f63e2 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Fashion Sneakers_1.bin, 89cf4cc3709d7c8a53e0cefb759f3b9148d5818d393a533bcdd1ec9fd2f6c176 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shirts & Tops_3.bin, fa53cc55e9a57260396121f1c9212c8a283c85182dc3535d87ceb84b6f5ae4fc +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--T-Shirts_2.bin, bb831f81dcaa02147b363ad25fe6507384a877c5f4cc7158f1524e3a3618102e +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jeans--Leggings_2.bin, 53e9531de9c84998ac24d74bcdeab12705f830e074a203b389dc9273a67e44ab +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Makeup Palettes_2.bin, 90e9a14832f5e1122e0f17a809e18d9767cba2afa8e451d0a9ae9ead6155bee0 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Fashion Sneakers_2.bin, 02f9327f5f43264dec93fa4fceeacc7814cb547ed0cd1e2337971c3200eab034 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Athletic_3.bin, 9f53438d0f0dcd6baf1286a4c21bd517bc149915cad654b5556d799283f0f985 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Necklaces_2.bin, 
09df70771d6115d83a65aed70bbf6781030c75846667dbc390b7400943a152d5 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Makeup Palettes_3.bin, 0cf4c5c5ee3bec759e26441d011b7a4465b7e8f3a929372bdea65eb62cf5d7a0 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Button Down Shirt_2.bin, f06ccad9693a0d3c0c05ebaa0471979d1883699c6f6a44368aa790b3a9e0a87d +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Cardigan_1.bin, 57b59c51c267c74b4e6cb283272e46139f9ba3de4a96a829035688ffbbed34a5 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jeans--Boot Cut_2.bin, 3085ebb6a317113dbd10267374bc8b8fee07ac420b4fc1e39fa365decaece577 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_1.bin, 1b8eeee5aef4b01fb5f7f98a48204a6bc99882ccd8bda75653cef23ed995cd39 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_4.bin, 1d6bb5200effe60646fb9e45bbba214d2a9c81d97e1277382b1e409f0e9b014a +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Face_2.bin, 2d091ee25189048b6b8e0b6a6a432b9e362bfe3e29a9c2d3b77a53ccff50fb7a +lib/pdgf/dicts/tpcxai/models/product_description/Women_Dresses--Above Knee, Mini_2.bin, 8b7940b72fa6fd150dc70c30b621d894226d55165aa70dc55a110a28881b0e42 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Boots_1.bin, ce1b932d30e0092820e2a7851024bf95f94f7c0e57552322fe9c5789a5f37ff3 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Accessories--Hats_2.bin, efdd68c5364cc548b7e99fa40b13fa163fb3de379e4ca63c81f51e8021f094b8 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Button Down Shirt_3.bin, d45b7a7918f89d941790c126358709601f5dba3800ff285ac45a8b71e0457209 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Sports Bras_1.bin, e036298b79d3c0930eb0dde930331b41edee6973cc7a73dc222d70ac67ab94ab +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jeans--Slim, Skinny_3.bin, 6e3d9014bd7cad5fab3d44a31c6a1b18e31dce925e290f51a4cbf279bd5ef169 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Dresses--Knee-Length_2.bin, 9e6f5cbebdaba3315432f48dc51d9888c79f0d2e6545fc002a7b3d89a362848f +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cell Phones & Smartphones_3.bin, 638334950c98f797924e8162061417c0fbf0319e676c03177bd230610421f6ad +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cases, Covers & Skins_1.bin, 6839dc9425d1d6f6c3e3732bcf28b1a458478adc54e1baf0f9b2e12e0a566c85 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Pumps_3.bin, d1c1d0c60c3d03d0e4ea6b9d436b68679e5fc7588aed42a29db806cea27ee752 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Fragrance--Women_1.bin, bf1f481558c8081fd50daa2d889987ddac42af7043d0dbb2acfb12cd76137e45 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Dresses--Above Knee, Mini_1.bin, b4b0e770407f12f57dc418a4b7b67bec66028771c72b3844a13b0d82b718db34 +lib/pdgf/dicts/tpcxai/models/product_description/Kids_Toys--Dolls & Accessories_2.bin, 0b793fd20e7e85090dbec9fbe2b502f396ca8c306221193d98c1570e9e7180e9 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Boots_3.bin, 6a6caa6edb59b76536a6bdf0bff8092f6bf91d8c6c9698a53500d4282855e57d +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Jackets_3.bin, 6c19006ea0d0e6421da3d0fac02463b84d5bbe4ba2adb66be5d14a846ddaf7cf +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Messenger & Crossbody_1.bin, 
5c0703b62c6789a619f402ec804fad0cea473f024866dcc35573435118e58b24 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Sandals_1.bin, e32acac40077a453d68d8f33bc2cbda446ea26b811d28cd3203521c230ed8602 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jeans--Leggings_1.bin, bd048b39002c6ec591f091446482b436fa94a6bdf751ad958f308bae7ad9f74e +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Lips_3.bin, 322bbda079b06f41bc4c37eda7867021a25cfe1a57da82a1d52b6619e6960735 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Sports Bras_2.bin, fbf5b55b4b833712c9fb741acc21e89eb931954a3a4d82e4e3283c3a66e57899 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Tops--T-shirts_2.bin, 833712d756bae4e475f5e898b180ef1f1cf646cfbb2e75c7e07700c5c365c1a0 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shorts_2.bin, 7eb5edf13942869c41ef9031fbb68685e3d7f54842c33d4d2a824674fe82a16a +lib/pdgf/dicts/tpcxai/models/product_description/Men_Men's Accessories--Hats_1.bin, 21a8f4a417360ffb821b7f999658335e382479aaca9b043c8a32b39a9e56c07b +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Earrings_2.bin, 4d7b7004dbc27b346a3246baa529f74dce3191df8a416c18810919c9403c8ef6 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Accessories--Hats_1.bin, 3bb4c7f76523e427f5511296815997efdeb3de1ba83ad8637454439979c6251e +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Messenger & Crossbody_2.bin, e0f9daead028e61c045ed20497f7f001d44920f2c4d740ea1ebe269231aed9b4 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Blouse_1.bin, eb7b18cc42e2c8d5949bc16eacc91d89af22afd8ccf44cb22ca8cbd208bc2294 +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cell Phones & Smartphones_2.bin, a5545bba08356cc8f5c94379d5c47d6b21d40af577809e72c8cf40aa48f28025 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Men's Accessories--Hats_2.bin, ae2eadab3d68f580f2247065daf9c3084c56c87c71d2d5100150ea671fec8349 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Hooded_2.bin, 086c0f35e64dbe71f340644df93ead3a9f8631292c47ce3123a887befbe28dfc +lib/pdgf/dicts/tpcxai/models/product_description/Women_Dresses--Knee-Length_1.bin, 4b95ad85011829740106e914aebb1ca3d54b0850463b5439bf16e715b2e6ea0a +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Athletic_1.bin, c2e918890173732fac3a54a2cb2999df1d027cfcd99e063ee9b492be7f5656b7 +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Video Games & Consoles--Games_3.bin, 626616534b1d8bb4695394b1d4d6d9fbf8fed03c06f861005d8c840a2c3dbd5b +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Bracelets_1.bin, c1ace64f7041d2d1fb009c43d1280d1f9a38663bdf426aabb4dfe11bab2b9d90 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Eyes_1.bin, 0ab809e3c1b2129bc3a80d35d40f325aa79fb55a4c6a514ac312534e10b445b4 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Sweaters--Hooded_1.bin, f19af0271dcfc692cdb4f8513b07f9f8bbf6cd6c30eaecce6fa0725b9eb98694 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Underwear--Bras_1.bin, 91bcbb6f86f08849b06346f15b542956c1f7fb0f6c8f27b7cec00323eb40c767 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Accessories--Sunglasses_3.bin, 7d7370140c37cfa01a77c0a5afdeff95f98834bc8f5b650068ff3dbb5154bf2b +lib/pdgf/dicts/tpcxai/models/product_description/Men_Tops--T-shirts_3.bin, c2ee3d82f94c4f8f84fe0c7a0304d14aafc126189e46937e98a9f9943d7474aa 
+lib/pdgf/dicts/tpcxai/models/product_description/Women_Jeans--Boot Cut_3.bin, 1bfa8012fafe94c346c1f742dae2aeb38bf854f6348090c62739b7177f10e8b5 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Shoulder Bag_3.bin, f2b718ebbdf5d4062de896cf2293bd5e73d1e98ed861e42b62ae13a3fcfd03a9 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tunic_2.bin, 1f558f72c56e1c4ef286a89f91f84f727f7bdcd46611e8fbb1abfeb44984bbbc +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Eyes_3.bin, 66d28fba3a22a9dc786f2a2e7f31361b3385b15bbb27beb6d9d799a6c6157d55 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Bracelets_3.bin, d3f6fe1fda3d5ce3140479b050c8ba5687865a6c662f0e29090b9e8206877a3f +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Tracksuits & Sweats_1.bin, 68c87a67d73cae2571ec259ca2dea0bd2fd501ff1ac4b1d7c64f080f6b968179 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Athletic_2.bin, 82fe8b7d88f32daae759bff72cfdd94dfc141ff7e93dffce303e0155b73b644b +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Totes & Shoppers_2.bin, 6cd88f6a3184df31246478ad585213b7d3f36f76304de949a9927c9d65c0bbf2 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Fragrance--Women_3.bin, 22cf502cc2e5f0d41e2a69d12bf17ab903fdf9786c661ffe0fe2d65d86d18a8d +lib/pdgf/dicts/tpcxai/models/product_description/Kids_Toys--Dolls & Accessories_3.bin, e64b3d3c36920499d1d4c1dd4c340868632b5f8687c616a02707689a4c4e9be6 +lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Lips_1.bin, 64d6748884ca29016375a200be1f0e24db4afa8138c55d58789cb4ea6f5a9fad +lib/pdgf/dicts/tpcxai/models/product_description/Electronics_Cell Phones & Accessories--Cases, Covers & Skins_2.bin, 8dd74956db41fe7b6111298d1e1fe44cc5872325c7de7832ec3f16eb42bf874b +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tank, Cami_2.bin, 30e141ecd9a84855a117f2484669a147bd724ad597ad56ed3fd76fa62eb5fded +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Accessories--Wallets_2.bin, 16b98a300ef92f2060c3bbddb2ad8d1435ae3d306ab5c91649f64432dab640e9 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Earrings_1.bin, a55681d66a6b7e799f51963c01834b3ef1282c1cfc011ee7155723825a85f391 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Bracelets_2.bin, 08f8a7c32a010a5140a8a96e1ebd68d551182f0d7a7bd06e311b11c03e5e6a74 +lib/pdgf/dicts/tpcxai/models/product_description/Kids_Toys--Action Figures & Statues_1.bin, 949c4665d04dad214452192072d334a161217c07e9e74c1cc1c673e3a6de6b76 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Sports Bras_3.bin, 48bfc53ba92d410c9c0208d1f4f5776e95682ed26b248deba0d4bcad1a6b1ccb +lib/pdgf/dicts/tpcxai/models/product_description/Women_Jewelry--Necklaces_1.bin, 997dd85333bd0bc2f6d6f42d738a412694ff8795842f0d26ddedfee0a70e34b4 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Athletic Apparel--Shorts_1.bin, c7d76ab3d6289ce34093137882ea56387326c724b24aaf12fe85d14e51d0c756 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tank, Cami_1.bin, c63ecc6d6b1d394448124802620fbf74f6adbf4c4d3ed9a79c8cb101ed893db8 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Swimwear--Two-Piece_1.bin, f67c88e71ff800c524b2a678b42890facaf720383ce7a87f87b0428477087436 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Shoes--Fashion Sneakers_3.bin, 6dbbfae988fc25e80d72d0393f82b8f372a9c056204bcd6db78801451fa72f8a 
+lib/pdgf/dicts/tpcxai/models/product_description/Beauty_Makeup--Face_1.bin, e6bde3402cf4bc70019a7433dbe72c75a01f637f4e5d0ac27b3231996a10ee6b +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--T-Shirts_3.bin, b275e4f514fc6152da2487cc4f940762426fc7456c3aecef5c8c75189dce5142 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Tops--T-shirts_1.bin, 561d5171f131b5b20af54dce23eed116561e7b3dd20c2a01033a0d65039e4f1f +lib/pdgf/dicts/tpcxai/models/product_description/Kids_Toys--Action Figures & Statues_3.bin, da6817a84833aa9d9a114fc3d2bf718e2e26e34fbfc84447146eca2930b97e85 +lib/pdgf/dicts/tpcxai/models/product_description/Men_Shoes--Athletic_2.bin, 20de55826e379dba1bc514531c6ac0f0985c2f596200929e4cde9e882c4e4eb6 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Tops & Blouses--Tunic_3.bin, 9f78da8ec3dddfa8f91c20aefbb7ec61a1d2219e0185e30e615ba6f215db8244 +lib/pdgf/dicts/tpcxai/models/product_description/Women_Shoes--Sandals_2.bin, 290d89a0d4f431d4e25ca46aeec18ea28c241341c1355447d51416dcb41999df +lib/pdgf/dicts/tpcxai/models/product_description/Women_Women's Handbags--Totes & Shoppers_1.bin, a9f64197d6be80d002d9d018416611cae07870bdbf0ffe4bf24b2b60b3583a55 +lib/pdgf/config/tpcxai-schema.xml, 3f23f70acfaeb8c5df8de4325f30b42dbb3ced8a720e181e8598c1c1d64b5249 +lib/pdgf/config/tpcxai-generation.xml, ea2dfb523d0e80e7ea1e123b7d89c7a574aa84e38e6e86d8162f5f049b4fa550 +lib/scala_2.12/akka-actor_2.12-2.5.32.jar, 5f2dc9cd737aa76b6559dd4ee5e7ecd25eea6b1a39dfdbc9614cf6d6ad4d90e8 +lib/scala_2.12/scallop_2.12-4.1.0.jar, 8855dfe6074699e62b03030339d095aea5974bf23ab06acc059a29356c37d2e7 +lib/scala_2.12/sparkts_2.12-0.4.1.jar, ca604c05eccb29f454cb2b1f9eed435832de2e7119202b0fb77f5f28a128cf89 +lib/scala_2.12/xgboost4j_2.12-1.5.0.jar, 595e468e90deb41099ca71daf02b00e640162aebe98b9676534b63a308704254 +lib/scala_2.12/xgboost4j-spark_2.12-1.5.0.jar, ea6b72daa6de780f50b0ee031e0756a1f51951e8df66c01679ea22d462fd560c +TPCx-AI_Benchmarkrun.sh, ee72869c5c3a9236abb67dfd908569e84c16eb60a88bf2ee799e8005233d6bfd +TPCx-AI_Validation.sh, 212be073a2246df92897247757e607c9399c910d7f33b06bc2e3aca067e47a37 +workload/spark3/spark-log4j.properties, d98e355efc34ee70de829374cb15df1d52dc58969e873add75f6554a05a25bee +workload/spark3/pyspark/setup.py, 3989a8f320dccf22f2f33d9bbeda0d87ff006c68744e9ca31a776e409d1f836b +workload/spark3/pyspark/workload-pyspark/UseCase09.py, 6c6b69699e392476a67ba1355f3488834318cc6b29181a8a331909c47ea04dd6 +workload/spark3/pyspark/workload-pyspark/__init__.py, e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +workload/spark3/pyspark/workload-pyspark/UseCase05.py, 82d78f4567a6420e9da72ca0fd6fc9a22f068801379b812be45c32047fd4711b +workload/spark3/pyspark/workload-pyspark/UseCase02.py, 8c510ffb5848996260fc4964d719efa953e3907defd46d2cef675509290e77b3 +workload/spark3/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat, c4b1e9804792707d3a405c2c16a80a20269e6675021f64a41d30fffafbc41888 +workload/spark3/pyspark/workload-pyspark/resources/uc09/nn4.small2.v1.h5, ef11d7c45a0d52304a073eba21f99786150ac2f499f1a0e0068e8209cab8bd8f +workload/python/setup.py, 0139927b1cc95f3f7b0c59bf8fa7cf9e13593d916cf95d40b952a614e4d52ade +workload/python/workload/UseCase09.py, 68300f1436753bd6a4259fe7f61c36b697493490bda0caa541c51a387ea55b98 +workload/python/workload/UseCase01.py, 3282cef020855ab91400a485bd78bf744f1740f8a033f3f91e09815afd11f501 +workload/python/workload/UseCase08.py, 8be84ea57fc0edb0d8f7aedcc19b83f50d6aa7848debe825331c122ce87f8db3 +workload/python/workload/UseCase04.py, 
3961d979bb50a8315714242a074ed44d42c29a2f60614b1f5f27d4bb81000c88 +workload/python/workload/__init__.py, e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +workload/python/workload/UseCase03.py, 494faa4ca462621c6465b3e770eaf280327d88f57dd4fd35f6842b626cf06959 +workload/python/workload/UseCase06.py, de50195a850dad51fc02b2a3bb19faba0b18b0a4e2e90efe30284755ed3a7886 +workload/python/workload/UseCase10.py, 8e1eb4b641e5db2edc5b6c777f0ccdcca5720a1e345bc85ac786f4b9a04f378a +workload/python/workload/UseCase07.py, 796d488791488f77412ddd854b1b1fff08ea962b885e2bbdba6c7aa0929f6a24 +workload/python/workload/UseCase05.py, 887d98ba4f8709a61d4da0ee8ee3850758ad82c420352c2f7192342f822a17de +workload/python/workload/UseCase02.py, 601131bcc1b98f13c56c63df063af66fac5a8c81da5e40f7ab1805f3849ebdbb +workload/python/workload/resources/uc09/shape_predictor_5_face_landmarks.dat, c4b1e9804792707d3a405c2c16a80a20269e6675021f64a41d30fffafbc41888 +workload/python/workload/resources/uc09/nn4.small2.v1.h5, ef11d7c45a0d52304a073eba21f99786150ac2f499f1a0e0068e8209cab8bd8f +workload/python/workload/openface/model.py, df9569284c1fc3b519c4e0b8b54d7611a63a8bb5f28acf6bfe5f386e37e59995 +workload/python/workload/openface/align.py, 99f7fa986ea5c77e0da27a0c45401837f37f083ae62f380b2fa87d88edf3d6e1 +workload/python/workload/openface/utils.py, 2f6597d45be3f88b347c78f74be8e780614b4bd660a13bf0d355a05e76a76ec1 +workload/python/workload/openface/__init__.py, e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +workload/spark/spark-log4j.properties, d98e355efc34ee70de829374cb15df1d52dc58969e873add75f6554a05a25bee +workload/spark/pyspark/setup.py, 3989a8f320dccf22f2f33d9bbeda0d87ff006c68744e9ca31a776e409d1f836b +workload/spark/pyspark/workload-pyspark/UseCase09.py, 961e9b2184a2f79c8ff38de3b03da985a2c177bc1b799bb61643321e3176f50c +workload/spark/pyspark/workload-pyspark/__init__.py, e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +workload/spark/pyspark/workload-pyspark/UseCase05.py, 82d78f4567a6420e9da72ca0fd6fc9a22f068801379b812be45c32047fd4711b +workload/spark/pyspark/workload-pyspark/UseCase02.py, 56dc11993103b9be816e8dd232e5017e3759ba27aca26de50f12aef08a62f129 +workload/spark/pyspark/workload-pyspark/resources/uc09/shape_predictor_5_face_landmarks.dat, c4b1e9804792707d3a405c2c16a80a20269e6675021f64a41d30fffafbc41888 +workload/spark/pyspark/workload-pyspark/resources/uc09/nn4.small2.v1.h5, ef11d7c45a0d52304a073eba21f99786150ac2f499f1a0e0068e8209cab8bd8f +tools/dir2seq.jar, f1f089a991d77869cb96493d20d3f1a6a873931e04ca66530cbfdf56cada8a22 +tools/tpcxai_fdr_template.html, e83ccffe2cb3742a0f42a65b457c549cbfdc2ba19e14c2cd39c88fb340195f3d +tools/parallel-data-gen.sh, 35919bafbe0b262eda4bb49a6a09c643a680c0cab2f22b9afb28208f79287459 +tools/dir2seq.sh, 62b3369efee51d77260c1d2ae362f5038bd6f3dab38c1a9c48c5153ea6840f02 +tools/check_environment.sh, 4e939b64a2cc47d09089c437859e73b1ee3aa2fb07a3efbc5cf84dfa86e28daf +tools/mkhash.py, 9173883c50b072cc0794bfd7d73427cacfa66a32fcd8a3b9fa44ba902bbc4309 +tools/tpcxai_fdr.py, f33f2fd540462f3eef28e222522bc309d264f431e42b296622d732815c9020e8 +tools/enable_parallel_datagen.sh, 443f048e26327d475b6dc77d2efcd7f3df1100a889f296ca6a430f4550dd04e4 +tools/parallel-data-load.sh, 42ffbee65fbd9125ad4fd879d496b4604e6ccd5f69935e050f8741f38cadc18a +tools/python/clock_check.sh, c224c6f0f1744136a3bd4454072d663ef4870d1868e1c0420b49d84e3ce86dd7 +tools/python/getEnvInfo.sh, 9b42ba9da6da6aeadb61fbb510fd0a91478f360aed1de15ad37486a04b050867 +tools/python/dataRedundancyInformation.sh, 
002b406335ba0b3b82720eef47d547027ff8b0e8e3e34058adfea7eb5851ddfc +tools/python/create.sh, 2f7444e5dadf7b2d8d03fa44251640b5ee27c5eefb33dc0baa28eb1c15200c49 +tools/python/saveOutputData.sh, 2b045c8ce61c451c6df67fad546e0e798ff055ea04aee422ae7c6c36c29b7dd6 +tools/python/load.sh, 1f98d6f81898208f4269dac94149ae0e47eed0b48cd2513b1e1c389371cb3d7c +tools/spark/clock_check.sh, 87d3e857e1304b9f4249a9e7ba815ce6a9da3c5bdca12c7014ad7f96076aff9f +tools/spark/inventory, 30720a1aa750b3825b0118a13285714de49d0f3648cafe07964d724c4e0327f3 +tools/spark/getEnvInfo.sh, e42a17a0f0097a64caa40408639ab70af3b521d242127397988af10a46972ef1 +tools/spark/getEnvInfoWorker.sh, 673d5a9f852cddf64b1c130bbba88af0ac3f4556cb7a8478e9464adee94c8011 +tools/spark/horovod_test.py, 0c0d5349b90397d8cefcf0b2d6cb90624b8e764fc2ac7145f6507a2930a08ba1 +tools/spark/create_hdfs.sh, a715341d0cd822932cd28462a918937980df875bfe9c7ef79d019ddaf6a27c45 +tools/spark/load_hdfs.sh, 563e334b7638aeb5978cb02d3dbd66fc5ac07fcb6c0f0168169899e89234c93d +tools/spark/build_dl-spark3.yml, a5c90a2e615401b53827056c460c49eaeaf3dd94129c8305c8d66c81bb85dcba +tools/spark/horovod_test_yarn.py, 5b13b5d3b1e80d9b9c65cb6c2c7cac3abcc511fb4a16c3d0d0a0653eab2025ac +tools/spark/ansible_site.yml, 6fc56d1c5d95e0c229e4869c5d219440bd256e76d457b664f99dc6ab434a1851 +tools/spark/dataRedundancyInformation.sh, 4ed6ee0058766c7fe54374ae1c5f41544689e1bdf1cda6ddb929e56d15b758f8 +tools/spark/saveOutputData.sh, 5be7b6f87754cd7d74c9182744b31d00195286bdde28926b6d8ed2a105e8295f diff --git a/scripts/tpcx-ai/driver/setup.py b/scripts/tpcx-ai/driver/setup.py new file mode 100644 index 00000000000..eec28d4dddd --- /dev/null +++ b/scripts/tpcx-ai/driver/setup.py @@ -0,0 +1,34 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. 
+# +# + + +from setuptools import setup, find_packages +setup( + name="tpcxai-driver", + version="0.8", + packages=find_packages() + #install_requires=['pyyaml', 'numpy', 'pandas', 'scikit-learn', 'matplotlib', 'jinja2', 'python-benedict'] +) diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/__init__.py b/scripts/tpcx-ai/driver/tpcxai-driver/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/__main__.py b/scripts/tpcx-ai/driver/tpcxai-driver/__main__.py new file mode 100644 index 00000000000..222bd4485c4 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/__main__.py @@ -0,0 +1,1385 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# +import argparse +import collections +import copy +import logging +import math +import os +import re +import shutil +import socket +import sqlite3 +import sys +import threading +import time +from hashlib import sha256 +# from Crypto.Hash import SHA256 +from pathlib import Path +from itertools import groupby +import pdb +import numpy as np +import pandas as pd +import sklearn.metrics +import yaml + +from . 
import metrics +from .data import * +from .data_generation import DataGeneration +from .database import DatabaseQueue +from .logger import FileAndDBLogger, LogDbHandler, LogPerformanceDbHandler +from .subprocess_util import run_and_capture, run_as_daemon, Stream, stop_daemons, state # nosec - see the subprocess_util module for justification + +TPCXAI_VERSION = "1.0.3.1" +#Phase.DATA_GENERATION,Phase.SCORING_DATAGEN, +DEFAULT_PHASES = [Phase.CLEAN, Phase.DATA_GENERATION, Phase.SCORING_DATAGEN,Phase.LOADING, Phase.TRAINING, Phase.SERVING, Phase.SERVING_THROUGHPUT, + Phase.SCORING_LOADING, Phase.SCORING, Phase.VERIFICATION, Phase.CHECK_INTEGRITY] +# +CHOICES_PHASES = [Phase.CLEAN, Phase.DATA_GENERATION, Phase.SCORING_DATAGEN,Phase.LOADING, Phase.TRAINING, Phase.SERVING, + Phase.SCORING_LOADING, Phase.SCORING, Phase.VERIFICATION, Phase.SERVING_THROUGHPUT, + Phase.CHECK_INTEGRITY] +#DEFAULT_USE_CASES = [1,11,3,4,6,16,7,17,9,19,10,20] +#CHOICES_USE_CASES = [1,11,3,4,6,16,7,17,9,19,10,20] +DEFAULT_USE_CASES = [1] +CHOICES_USE_CASES = [1] +DEFAULT_CONFIG_PATH = Path('config/default.yaml') + +FILE_PATH_REGEX = r'[A-Za-z0-9/\_\-\.]+' +PHASE_PLACEHOLDER = "_PHASE_PLACEHOLDER_" +STREAM_PLACEHOLDER ="_STREAM_PLACEHOLDER_" + + +def load_configuration(dat_path: Path, config_path: Path): + """ + Loads all configuration file and merges them, the config files from the dat directory always take precedence + :param dat_path: Directory containing the config files the must not be change by the average user + :param config_path: File containing all properties defined by the user + :return: + """ + with open(config_path, 'r') as config_file: + config = yaml.safe_load(config_file) + #config = benedict(config) + for c in dat_path.glob('*.yaml'): + with open(c, 'r') as dat_file: + dat_config = yaml.safe_load(dat_file) + if len(dat_config.items()) > 0: + config=merge_dict(config, dat_config) + + return config + +# based on python-benedict's merge (https://github.com/fabiocaccamo/python-benedict: MIT License) +def merge_dict(d1, d2): + for key, value in d2.items(): + merge_item(d1, key, value) + return d1 + +# based on python-benedict's merge (https://github.com/fabiocaccamo/python-benedict: MIT License) +def merge_item(d1, key, value): + if key in d1: + item = d1.get(key, None) + if isinstance(item,dict) and isinstance(value,dict): + merge_dict(item, value) + elif isinstance(item,list) and isinstance(value,list): + item += value + else: + d1[key] = value + else: + d1[key] = value + + +def get_estimated_sf_for_size(desired_value, config_dir): + def calculate_size(scaling_factor): + # number of rows + customer_size = math.sqrt(scaling_factor) * 100000 + weeks = (math.log(scaling_factor, 10) + 2) * 52 + product_size = math.sqrt(scaling_factor) * 1000 + order_size = customer_size * 0.5 * weeks + lineitem_size = order_size * 6.5 + order_returns_size = lineitem_size * 0.1 / 2 + financial_customer_size = customer_size * 0.1 + financial_transactions_size = financial_customer_size * weeks * 10 + failure_samples = math.sqrt(scaling_factor) * 100 + disk_size = math.sqrt(scaling_factor) * 1000 + failures_size = failure_samples * disk_size + marketplace_size = customer_size * 0.1 * 10 + identities = customer_size * 0.0001 + images_per_identity = math.sqrt(scaling_factor) * 10 + images_size = identities * images_per_identity + ratings_per_customer = math.log(scaling_factor, 10) + 20 + ratings_size = customer_size * ratings_per_customer * 0.1 + conversation_size = math.sqrt(30.0*scaling_factor) * 100 + + # size estimates in kB + 
customer_kb = customer_size * 0.12 + product_kb = product_size * 0.034 + order_kb = order_size * 0.04 + lineitem_kb = lineitem_size * 0.021 + order_returns_kb = order_returns_size * 0.016 + financial_customer_kb = financial_customer_size * 0.0137 + financial_transactions_kb = financial_transactions_size * 0.078 + failures_kb = failures_size * 0.167 + marketplace_kb = marketplace_size * 0.176 + images_kb = images_size * 160.914 + ratings_kb = ratings_size * 0.013 + conversation_kb = conversation_size * 68.207 + + esitmated_size = math.fsum( + [customer_kb, product_kb, order_kb, lineitem_kb, order_returns_kb, financial_customer_kb, + financial_transactions_kb, failures_kb, marketplace_kb, images_kb, ratings_kb, conversation_kb]) + + return esitmated_size / 1024 / 1024 + + init_range = 1e17 + found = False + current_sf = init_range // 2 + last_three_sf = collections.deque([-1, -1, -1], maxlen=3) + correction = current_sf // 2 + # save the best sf to prevent oscillation + tolerance = 1e-5 + i = 0 + + while not found: + i += 1 + current_size = calculate_size(current_sf) + # terminate if current size is within tolerance + if abs(current_size - desired_value) / desired_value < tolerance: + found = True + # or worse than the best value the has been found the latter means oscillation + elif current_sf == last_three_sf[-2]: + found = True + elif current_size < desired_value: + last_three_sf.append(current_sf) + current_sf = current_sf + correction + correction = correction // 2 + if correction < 0.125: + correction = 0.125 + else: + last_three_sf.append(current_sf) + current_sf = current_sf - correction + correction = correction // 2 + if correction < 0.125: + correction = 0.125 + + def sf_size_error_tuple(sf, target_value): + size = calculate_size(sf) + error = abs(size - desired_value) + return sf, size, error + + best_lst = list(sorted(map(lambda sf: sf_size_error_tuple(sf, desired_value), last_three_sf), key=lambda t: t[2])) + best_sf = best_lst[0][0] + return best_sf + + +def validate_arg_is_printable(value, regex): + if value.isprintable() is False or re.match(regex, value) is None: + print("Argument contains invalid values") + exit(1) + return True + + +def validate_argument(value, regex): + if type(value) is list: + for element in value: + validate_arg_is_printable(str(element), regex) + elif type(value) is str: + validate_arg_is_printable(value, regex) + else: + print("Argument contains invalid values") + exit(1) + return True + + +def mangle_url(datastore, url): + if datastore.name == 'local_fs': + return Path(url) + else: + return url + + +def guess_prediction_label(columns, label_column, + common_names=None): + if common_names is None: + common_names = ['prediction', 'predictions', 'pred', 'preds', 'forecast', 'forecasts'] + if label_column in columns: + return label_column + else: + # search all columns for a column with a common name used for predictions + # it is expected that only one such column exists + column_candidates = [c for c in columns if c in common_names] + if len(column_candidates) > 0: + return column_candidates[0] + else: + return None + + +def load_metric(modules, metric_name): + metric_found = False + metric = None + for module in modules: + try: + metric = getattr(module, metric_name) + metric_found = True + except AttributeError: + continue + + if metric_found: + return metric + else: + raise AttributeError(f"No {metric_name} found in ") + + +def scoring(true_labels, pred_labels, label_column, metric_name, labels=None, delimiter=',', sort_predictions=False, + 
**kwargs): + if not metric_name: + return -1.0 + csv_engine = 'python' if len(delimiter.encode('utf-8')) > 1 else 'c' + true_labels = pd.read_csv(true_labels, delimiter=delimiter, engine=csv_engine) + pred_labels = pd.read_csv(pred_labels, delimiter=delimiter, engine=csv_engine) + prediction_label = guess_prediction_label(pred_labels.columns, label_column) + if sort_predictions: + sort_columns = [c for c in pred_labels.columns if c != prediction_label] + pred_labels = pred_labels.sort_values(by=sort_columns) + metric_fun = load_metric([metrics, sklearn.metrics], metric_name) + + if labels is not None: + kwargs['labels'] = labels + if prediction_label is not None: + cols_in_common = set(true_labels.columns).intersection(pred_labels.columns) + join_cols = list(cols_in_common.difference([prediction_label])) + data = true_labels.merge(pred_labels, on=join_cols) + metric = metric_fun(data[f"{label_column}_x"], data[f"{prediction_label}_y"], **kwargs) + else: + metric = metric_fun(true_labels[label_column], pred_labels, **kwargs) + return metric + + +def init_db(database_path): + database = sqlite3.connect(str(database_path)) + + # create schema + database.execute(''' + CREATE TABLE IF NOT EXISTS benchmark ( + benchmark_sk INTEGER NOT NULL, -- UUID of the benchmark + version TEXT, -- version of the benchmark kit + hostname TEXT NOT NULL, -- hostname of the tpcxai-driver (where the benchmark was started) + start_time FLOAT NOT NULL, -- timestamp when the benchmark was initiated + end_time FLOAT, -- timestamp when the benchmark was stopped/finished + scale_factor INT NOT NULL, -- scale factor for the data generation + tpcxai_home TEXT NOT NULL, -- home directory of the benchmark + config_path TEXT NOT NULL, -- path to the config file + config TEXT NOT NULL, -- content of the config file + cmd_flags TEXT NOT NULL, -- flags that were used to start the benchmark `bin/tpcxai.sh ...` + benchmark_name TEXT, -- user specified name for the benchmark + successful INTEGER, -- was the benchmark successful (all use-cases and phases were run) + PRIMARY KEY (benchmark_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS command ( + command_sk INTEGER NOT NULL, + benchmark_fk INTEGER NOT NULL, + use_case INTEGER NOT NULL, + phase TEXT NOT NULL, + phase_run INTEGER, + sub_phase TEXT NOT NULL, + command TEXT NOT NULL, + start_time FLOAT, + end_time FLOAT, + runtime INTEGER, + return_code INTEGER, -- command finished successfully (0) or failed (!= 0) + PRIMARY KEY (command_sk), + FOREIGN KEY (benchmark_fk) REFERENCES benchmark(benchmark_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS log_std_out ( + use_case_fk INTEGER NOT NULL, -- part of key + part INT, -- which part of the log file + log TEXT, -- actual content of this part of the log + PRIMARY KEY (use_case_fk, part), + FOREIGN KEY (use_case_fk) REFERENCES command(command_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS log_std_err ( + use_case_fk INTEGER NOT NULL, -- part of key + part INT, -- which part of the log file + log TEXT, -- actual content of this part of the log + PRIMARY KEY (use_case_fk, part), + FOREIGN KEY (use_case_fk) REFERENCES command(command_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS stream ( + use_case_fk INTEGER NOT NULL, + stream TEXT, + PRIMARY KEY (use_case_fk, stream), + FOREIGN KEY (use_case_fk) REFERENCES command(command_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS quality_metric ( + use_case_fk INTEGER NOT NULL, + metric_name TEXT, 
+ metric_value FLOAT, + PRIMARY KEY (use_case_fk, metric_name), + FOREIGN KEY (use_case_fk) REFERENCES command(command_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS performance_metric ( + benchmark_fk INTEGER NOT NULL, + metric_name TEXT, + metric_value FLOAT, + metric_time FLOAT, + PRIMARY KEY (benchmark_fk, metric_name), + FOREIGN KEY (benchmark_fk) REFERENCES benchmark(benchmark_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS benchmark_files ( + benchmark_fk INTEGER NOT NULL, -- benchmark id + relative_path TEXT, -- path to the file, relative to TPCxAI_HOME + absolute_path TEXT, -- absolute path of the file + sha256 TEXT, -- sha256 checksum of the file + PRIMARY KEY (benchmark_fk, relative_path), + FOREIGN KEY (benchmark_fk) REFERENCES benchmark(benchmark_sk) + ); + ''') + database.execute(''' + CREATE TABLE IF NOT EXISTS timeseries ( + benchmark_fk INTEGER NOT NULL, -- benchmark id + hostname TEXT, -- hostname where this event occured + name TEXT, -- name of the timeseries + timestamp INTEGER, -- instant (timestamp) of the event + value TEXT, -- value of the event + unit TEXT, -- unit of the timeseries + PRIMARY KEY (benchmark_fk, hostname, name, timestamp), + FOREIGN KEY (benchmark_fk) REFERENCES benchmark(benchmark_sk) + ); + ''') + + database.commit() + cursor = database.cursor() + res = cursor.execute(f"SELECT * FROM pragma_table_info('command') WHERE name == 'start_time'") + is_empty = True + if res.fetchone(): + is_empty = False + if is_empty: + database.execute('ALTER TABLE command ADD COLUMN start_time FLOAT NOT NULL DEFAULT 0.0') + is_empty = True + res = cursor.execute(f"SELECT * FROM pragma_table_info('command') WHERE name == 'end_time'") + if res.fetchone(): + is_empty = False + if is_empty: + database.execute('ALTER TABLE command ADD COLUMN end_time FLOAT NOT NULL DEFAULT 0.0') + res = cursor.execute(f"SELECT * FROM pragma_table_info('command') WHERE name == 'phase_run'") + if not res.fetchone(): + database.execute('ALTER TABLE command ADD COLUMN phase_run INTEGER') + res = cursor.execute(f"SELECT * FROM pragma_table_info('benchmark') WHERE name == 'version'") + if not res.fetchone(): + database.execute('ALTER TABLE benchmark ADD COLUMN version TEXT') + database.commit() + + return database + + +def compute_tpcxai_metric(logs,SF,aiucpm_logger,datagen_in_tload=False,num_streams=2): + TLD = TPTT = TPST1 = TPST2 = TPST = TTT = AICUPM = -1 + TOTAL_USE_CASES = 10 + + #DATAGEN + TDATA_GEN=0.0 + datagen_times=logs[(logs['sub_phase']==SubPhase.WORK) & (logs['phase']==Phase.DATA_GENERATION)][['phase','sub_phase','metric', 'start_time', 'end_time']] + TDATA_GEN=np.sum(datagen_times[['metric']].values) + if TDATA_GEN>0: + aiucpm_logger.info(f'DATAGEN: {round(TDATA_GEN,3)}') + + #TLD - LOADING + Datagen(if included) + loading_times = logs[(logs['sub_phase'] == SubPhase.WORK) & (logs['phase'] == Phase.LOADING)][['phase', 'sub_phase', 'metric', 'start_time', 'end_time']] + TLOAD = loading_times['end_time'].max() - loading_times['start_time'].min() + if TLOAD>0: + aiucpm_logger.info(f'TLOAD: {round(TLOAD,3)}') + + if datagen_in_tload == True and (TLOAD+TDATA_GEN)>0: + TLOAD += TDATA_GEN + aiucpm_logger.info(f'TLOAD(TLOAD+TDATA_GEN): {round(TLOAD,3)}') + + + TLD=1.0*TLOAD + if TLOAD>0: + aiucpm_logger.info(f'TLD: {round(TLD,3)}') + + #TPTT - TRAINING + training_times=logs[(logs['sub_phase']==SubPhase.WORK) & (logs['phase']==Phase.TRAINING)][['metric']].values + if len(training_times) == TOTAL_USE_CASES: + 
TPTT=(np.prod(training_times)**(1.0/len(training_times))) + if TPTT>0: + aiucpm_logger.info(f'TPTT: {round(TPTT,3)}') + + #TPST1 - SERVING1 + serving_times1=logs[(logs['sub_phase']==SubPhase.WORK) & (logs['phase']==Phase.SERVING) & (logs['phase_run']==1)][['metric']].values + if len(serving_times1) == TOTAL_USE_CASES: + TPST1=(np.prod(serving_times1)**(1.0/len(serving_times1))) + if TPST1>0: + aiucpm_logger.info(f'TPST1: {round(TPST1,3)}') + + #TPST2 - SERVING2 + serving_times2=logs[(logs['sub_phase']==SubPhase.WORK) & (logs['phase']==Phase.SERVING) & (logs['phase_run']==2)][['metric']].values + if len(serving_times2) == TOTAL_USE_CASES: + TPST2=(np.prod(serving_times2)**(1.0/len(serving_times2))) + if TPST2>0: + aiucpm_logger.info(f'TPST2: {round(TPST2,3)}') + + #TPST + TPST=max(TPST1,TPST2) + if TPST>0: + aiucpm_logger.info(f'TPST: {round(TPST,3)}') + + #THROUGHPUT + throughput_results = logs[(logs['sub_phase'] == SubPhase.WORK) & (logs['phase'] == Phase.SERVING_THROUGHPUT)] + max_throughput_results = throughput_results['end_time'].max() - throughput_results['start_time'].min() + throughput_results = throughput_results.groupby(['stream'])['metric'].sum() + TTT=-1 + if len(throughput_results) == num_streams: + TTT = max_throughput_results / (len(throughput_results) * TOTAL_USE_CASES) + if TTT > 0: + aiucpm_logger.info(f'TTT: {round(TTT, 3)}') + + #AIUCPM + if datagen_in_tload==True and TDATA_GEN<=0: + return -1 + + if TLD>0 and TPTT>0 and TPST>0 and TTT>0 and SF>0: + AIUCPM_numerator=60.0*SF*TOTAL_USE_CASES + AIUCPM_denominator=(TLD*TPTT*TPST*TTT)**(1.0/4.0) + if AIUCPM_denominator>0 and AIUCPM_numerator>0: + AIUCPM = 1.0*(AIUCPM_numerator/AIUCPM_denominator) + return round(AIUCPM,3) + + return -1 + + +def merge_actions(actions: List[Action], phase_to_merge: Phase) -> List[Action]: + filtered_actions = list(filter(lambda a: a.phase.value == phase_to_merge.value, actions)) + unified_actions = {} + for action in filtered_actions: + if isinstance(action, SubprocessAction) and action.command not in unified_actions: + new_action = copy.deepcopy(action) + new_action.use_case = 0 + unified_actions[action.command] = new_action + + new_actions = [copy.deepcopy(a) for a in actions if a not in filtered_actions] + new_actions.extend(unified_actions.values()) + return new_actions + + +def hash_file(file: Path, buffer_size=64*1024): + if not file.exists(): + print(f"{file} does not exist") + return '-1' + done = False + checksum = sha256() + with open(file, 'rb', buffering=0) as bin_file: + while not done: + data = bin_file.read(buffer_size) + if len(data) > 0: + checksum.update(data) + else: + done = True + + return checksum.hexdigest() + + +def main(): + # driver_home + driver_home = Path('.').resolve() + + # adabench_home + tpcxai_home = Path("..").resolve() + + # adabench data generator + datagen_home = tpcxai_home / 'data-gen' + datagen_config = datagen_home / 'config' + + parser = argparse.ArgumentParser() + parser.add_argument('-c', '--config', required=False, default=DEFAULT_CONFIG_PATH) + parser.add_argument('--dat', required=False, default=Path('dat')) + parser.add_argument('--phase', required=False, nargs='*', default=DEFAULT_PHASES, + choices=list(map(lambda p: p.name, CHOICES_PHASES))) + parser.add_argument('-sf', '--scale-factor', metavar='scale_factor', required=False, default=1, type=float) + parser.add_argument('-uc', '--use-case', metavar='use_case', type=int, required=False, nargs='*', + default=DEFAULT_USE_CASES, choices=CHOICES_USE_CASES) + parser.add_argument('-eo', 
'--execution-order', metavar='order', required=False, default='phase', + choices=['phase', 'use-case']) + parser.add_argument('--ttvf', required=False, metavar='factor', default=0.1, type=float) + parser.add_argument('--streams', required=False, metavar='N', default=2, type=int) + #parser.add_argument('--data-gen', required=False, metavar='data_gen', default=False, type=bool) + parser.add_argument('--data-gen', required=False, action='store_true', help='Enable data generation') + # flags + parser.add_argument('-v', '--verbose', required=False, default=False, action='store_true') + parser.add_argument('--dry-run', required=False, default=False, action='store_true') + parser.add_argument('--no-unified-load', required=False, default=False, action='store_true') + + args = parser.parse_args() + + # validate arguments + # validate_argument(args.config, FILE_PATH_REGEX) + + # handle arguments + config_path = Path(args.config) if isinstance(args.config, str) else args.config + dat_path = Path(args.dat) if isinstance(args.dat, str) else args.dat + phases = list(map(lambda p: Phase[p] if type(p) is str else p, args.phase)) + if Phase.SCORING in phases and Phase.SCORING_DATAGEN not in phases: + phases.append(Phase.SCORING_DATAGEN) + phases.append(Phase.SCORING_LOADING) + data_gen = args.data_gen + print(f"data_gen: {data_gen}") + + # add clean before data generation + if Phase.DATA_GENERATION in phases and Phase.CLEAN not in phases: + phases.append(Phase.CLEAN) + + # phases = args.phase + scale_factor = args.scale_factor + internal_scale_factor = get_estimated_sf_for_size(args.scale_factor, driver_home / 'config') + ttvf = args.ttvf + use_cases = args.use_case + execution_order = args.execution_order + num_streams = args.streams + + # handle flags + verbose = args.verbose + dry_run = args.dry_run + no_unified_load = args.no_unified_load + + config = load_configuration(dat_path, config_path) + + # FIXME check if default.yaml is loaded and replace all slashes in URLs with backslashes if driver is run on Windows + + workload = config['workload'] + engine_base = workload.get('engine_base','') + defaul_delimiter = workload['delimiter'] + + streams_mapping_keys = [x for x in workload.keys() if x.startswith('serving_throughput_stream')] + if num_streams < 2: + print(f'number of desired streams ({num_streams}) has to be at least 2') + exit(1) + if num_streams > len(streams_mapping_keys): + print(f'number of desired streams ({num_streams}) exceeds the number of available streams ({len(streams_mapping_keys)})') + exit(1) + streams_mapping_keys = streams_mapping_keys[:num_streams] + streams_mapping = [workload[s] for s in streams_mapping_keys] + usecases_config = workload['usecases'] + + pdgf_home = Path(workload['pdgf_home']) + if not pdgf_home.is_absolute(): + pdgf_home = tpcxai_home / pdgf_home + pdgf_java_opts = os.getenv('TPCxAI_PDGF_JAVA_OPTS', '') + datagen_output = Path(workload['raw_data_url']) + if not datagen_output.is_absolute(): + # if the url is relative make it relative to tpcxai_home + datagen_output = tpcxai_home / datagen_output + + temp_dir = Path(workload['temp_dir']) + + timeseries = [DaemonAction(0, ts, Phase.INIT, SubPhase.WORK, working_dir=tpcxai_home) for ts in workload['timeseries']] + + # print benchmark info + print(f"tpcxai_home: {tpcxai_home}") + + is_datagen_parallel = bool(workload['pdgf_node_parallel']) + if data_gen: + print("flas works") + datagen = DataGeneration(1234, tpcxai_home, pdgf_home, pdgf_java_opts, datagen_home, datagen_config, datagen_output, + 
scale_factor=internal_scale_factor, ttvf=ttvf, parallel=is_datagen_parallel) + # make sure that the scoring data is locally available for the tpcxai-driver + # this is only necessary if the node parallel data gen is used, + # since it's not guranteed to be run on the tpcxai-driver node. + # the local scoring data resides in the specified temp directory, `/tmp` by default + + if is_datagen_parallel: + datagen_output_local = temp_dir / 'scoring' + pdgf_home_local = tpcxai_home / 'lib' / 'pdgf' + if not datagen_output_local.is_absolute(): + # if the url is relative make it relative to tpcxai_home + datagen_output_local = tpcxai_home / datagen_output_local + datagen_local = DataGeneration(1234, tpcxai_home, pdgf_home_local, pdgf_java_opts, datagen_home, datagen_config, datagen_output_local, + scale_factor=internal_scale_factor, ttvf=ttvf, parallel=False) + + + actions = [] + collected_tables = set() + loaded_files_set = set() + + # read the configuration for each use case + # create the appropriate actions for all phases and sub-phases + # relevant actions are filtered later, when creating an execution plan + for uc_key in use_cases: + uc = usecases_config[uc_key] + name = uc['name'] + tables = uc['tables'] + raw_files = uc['raw_data_files'] + if 'raw_data_folder' in uc: + raw_folder = uc['raw_data_folder'] + else: + raw_folder = None + label_column = uc['label_column'] + if 'delimiter' in uc: + delimiter = uc['delimiter'] + else: + delimiter = defaul_delimiter + if 'labels' in uc.keys(): + labels = uc['labels'] + else: + labels = None + + # quality metrics + if 'scoring_sort' in uc.keys(): + scoring_sort = uc['scoring_sort'] + else: + scoring_sort = False + scoring_metric = uc['quality_metric'] + scoring_kvargs = uc['quality_metric_kvargs'] if 'quality_metrics_kvargs' in uc.keys() else {} + quality_metric_threshold = uc['quality_metric_threshold'] + quality_metric_larger_is_better = uc['quality_metric_larger_is_better'] + + # engines + training_engine = Template(uc['training_engine']).substitute(tpcxai_home=tpcxai_home, engine_base=engine_base) + serving_engine = Template(uc['serving_engine']).substitute(tpcxai_home=tpcxai_home, engine_base=engine_base) + + # data stores + datagen_datastore = datastore_from_dict(workload['datagen_datastore']) + training_datastore = datastore_from_dict(uc['training_datastore']) + serving_datastore = datastore_from_dict(uc['serving_datastore']) + model_datastore = datastore_from_dict(uc['model_datastore']) + output_datastore = datastore_from_dict(uc['output_datastore']) + + # templates + training_template = Template(uc['training_template']) + serving_template = Template(uc['serving_template']) + serving_throughput_template = Template(uc['serving_throughput_template']) + if 'postwork_training_template' in uc: + postwork_training_template = Template(uc['postwork_training_template']) + else: + postwork_training_template = None + + # URLs + training_data_url = mangle_url(training_datastore, uc['training_data_url']) + serving_data_url = mangle_url(serving_datastore, uc['serving_data_url']) + scoring_data_url = mangle_url(serving_datastore, uc['scoring_data_url']) + model_url = mangle_url(model_datastore, uc['model_url']) + output_url = mangle_url(output_datastore, uc['output_url']) + scoring_output_url = mangle_url(output_datastore, uc['scoring_output_url']) + working_dir = None + if 'working_dir' in uc: + working_dir = uc['working_dir'] + + # generate data for this use case, i.e. 
add relevant tables + collected_tables.update(tables) + if data_gen: + # clean up generated data <<--- + for raw_file in raw_files: + file_path = datagen_output / 'training' / raw_file + rm_training_data_gen = datagen_datastore.delete.substitute(destination=file_path) + actions.append(SubprocessAction(uc_key, rm_training_data_gen, Phase.CLEAN, SubPhase.WORK)) + + file_path = datagen_output / 'serving' / raw_file + rm_serving_data_gen = datagen_datastore.delete.substitute(destination=file_path) + actions.append(SubprocessAction(uc_key, rm_serving_data_gen, Phase.CLEAN, SubPhase.WORK)) + + file_path = datagen_output / 'scoring' / raw_file + rm_scoring_data_gen = datagen_datastore.delete.substitute(destination=file_path) + actions.append(SubprocessAction(uc_key, rm_scoring_data_gen, Phase.CLEAN, SubPhase.WORK)) + + if raw_folder: + folder_path = datagen_output / 'training' / raw_folder + folder_path = f"{folder_path} {folder_path}.seq" + rm_training_data_gen = datagen_datastore.delete.substitute(destination=folder_path) + actions.append(SubprocessAction(uc_key, rm_training_data_gen, Phase.CLEAN, SubPhase.WORK)) + + folder_path = datagen_output / 'serving' / raw_folder + folder_path = f"{folder_path} {folder_path}.seq" + rm_serving_data_gen = datagen_datastore.delete.substitute(destination=folder_path) + actions.append(SubprocessAction(uc_key, rm_serving_data_gen, Phase.CLEAN, SubPhase.WORK)) + + folder_path = datagen_output / 'scoring' / raw_folder + folder_path = f"{folder_path} {folder_path}.seq" + rm_scoring_data_gen = datagen_datastore.delete.substitute(destination=folder_path) + actions.append(SubprocessAction(uc_key, rm_scoring_data_gen, Phase.CLEAN, SubPhase.WORK)) + + # clean up parallel-generated data in other nodes + if is_datagen_parallel: + if datagen_datastore.delete_parallel: + folder_path = datagen_output + rm_parallel_data_gen = datagen_datastore.delete_parallel.substitute(destination=folder_path) + #print(rm_parallel_data_gen) + actions.append(SubprocessAction(uc_key, rm_parallel_data_gen, Phase.CLEAN, SubPhase.WORK)) + else: + print('The delete_parallel configuration for the datagen_datastore object does not exist.',file=sys.stderr) + sys.exit(1) + + # clean loaded data + for raw_file in raw_files: + file_path = f"{training_data_url}/{raw_file}" + rm_training_data = training_datastore.delete.substitute(destination=file_path) + actions.append(SubprocessAction(uc_key, rm_training_data, Phase.CLEAN, SubPhase.WORK)) + + file_path = f"{serving_data_url}/{raw_file}" + rm_serving_data = training_datastore.delete.substitute(destination=file_path) + actions.append(SubprocessAction(uc_key, rm_serving_data, Phase.CLEAN, SubPhase.WORK)) + + file_path = f"{scoring_data_url}/{raw_file}" + rm_scoring_data = training_datastore.delete.substitute(destination=file_path) + actions.append(SubprocessAction(uc_key, rm_scoring_data, Phase.CLEAN, SubPhase.WORK)) + + if raw_folder: + folder_path = f"{training_data_url}/{raw_folder}" + folder_path = f"{folder_path} {folder_path}.seq" + rm_training_data = training_datastore.delete.substitute(destination=folder_path) + actions.append(SubprocessAction(uc_key, rm_training_data, Phase.CLEAN, SubPhase.WORK)) + + folder_path = f"{serving_data_url}/{raw_folder}" + folder_path = f"{folder_path} {folder_path}.seq" + rm_serving_data = training_datastore.delete.substitute(destination=folder_path) + actions.append(SubprocessAction(uc_key, rm_serving_data, Phase.CLEAN, SubPhase.WORK)) + + folder_path = f"{scoring_data_url}/{raw_folder}" + folder_path = 
f"{folder_path} {folder_path}.seq" + rm_scoring_data = training_datastore.delete.substitute(destination=folder_path) + actions.append(SubprocessAction(uc_key, rm_scoring_data, Phase.CLEAN, SubPhase.WORK)) + + # clean scoring labels + raw_file_name, raw_file_extension = str.split(raw_files[0], '.')[0:2] + data_dir = datagen_output / 'scoring' #if not is_datagen_parallel #else datagen_output_local / 'scoring' + file_path = data_dir / (raw_file_name + '_labels.' + raw_file_extension) + rm_scoring_data = training_datastore.delete.substitute(destination=file_path) + actions.append(SubprocessAction(uc_key, rm_scoring_data, Phase.CLEAN, SubPhase.WORK)) + + # clean models and predictions + rm_model = training_datastore.delete.substitute(destination=model_url) + actions.append(SubprocessAction(uc_key, rm_model, Phase.CLEAN, SubPhase.WORK)) + rm_output = output_datastore.delete.substitute(destination=output_url) + actions.append(SubprocessAction(uc_key, rm_output, Phase.CLEAN, SubPhase.WORK)) + + + + # load training and serving data + create_training_cmd = training_datastore.create.substitute(destination=training_data_url) + actions.append(SubprocessAction(uc_key, create_training_cmd, Phase.LOADING, SubPhase.PREPARATION)) + + raw_files_set = set() + for raw_file in raw_files: + if datagen_output / 'training' / raw_file not in loaded_files_set: + raw_files_set.add(datagen_output / 'training' / raw_file) + if raw_folder: + if datagen_output / 'training' / raw_folder not in loaded_files_set: + raw_files_set.add(datagen_output / 'training' / raw_folder) + + raw_files_as_posix_str=' '.join([x.as_posix() for x in list(raw_files_set)]) + load_training_cmd = training_datastore.load.substitute(destination=training_data_url,source=raw_files_as_posix_str) + actions.append(SubprocessAction(uc_key, load_training_cmd, Phase.LOADING, SubPhase.WORK)) + loaded_files_set.update(raw_files_set) + + create_serving_cmd = serving_datastore.create.substitute(destination=serving_data_url) + actions.append(SubprocessAction(uc_key, create_serving_cmd, Phase.LOADING, SubPhase.PREPARATION)) + + raw_files_set.clear() + for raw_file in raw_files: + if datagen_output / 'serving' / raw_file not in loaded_files_set: + raw_files_set.add(datagen_output / 'serving' / raw_file) + if raw_folder: + if datagen_output / 'serving' / raw_folder not in loaded_files_set: + raw_files_set.add(datagen_output / 'serving' / raw_folder) + raw_files_as_posix_str = ' '.join([x.as_posix() for x in list(raw_files_set)]) + load_serving_cmd = serving_datastore.load.substitute(destination=serving_data_url,source=raw_files_as_posix_str) + actions.append(SubprocessAction(uc_key, load_serving_cmd, Phase.LOADING, SubPhase.WORK)) + loaded_files_set.update(raw_files_set) + + # training phase + training_store_cmd = model_datastore.create.substitute(destination=model_url) + actions.append(SubprocessAction(uc_key, training_store_cmd, Phase.TRAINING, SubPhase.PREPARATION)) + training_cmd = training_template.substitute(training_engine=training_engine, tpcxai_home=tpcxai_home, name=name, + engine=training_engine, + input=training_data_url, file=raw_files[0], output=model_url) + actions.append(SubprocessAction(uc_key, training_cmd, Phase.TRAINING, SubPhase.WORK, working_dir=working_dir)) + if postwork_training_template is not None: + postwork_training_cmd = postwork_training_template.substitute(tpcxai_home=tpcxai_home, name=name, output=model_url) + actions.append(SubprocessAction(uc_key, postwork_training_cmd, Phase.TRAINING, SubPhase.POSTWORK, 
working_dir=working_dir)) + + # serving phase + serving_store_cmd = output_datastore.create.substitute(destination=output_url) + actions.append(SubprocessAction(uc_key, serving_store_cmd, Phase.SERVING, SubPhase.PREPARATION)) + serving_cmd = serving_template.substitute(serving_engine=serving_engine, tpcxai_home=tpcxai_home, name=name, + engine=serving_engine, + input=serving_data_url, file=raw_files[0], model=model_url, + output=output_url, phase=PHASE_PLACEHOLDER) + actions.append(SubprocessAction(uc_key, serving_cmd, Phase.SERVING, SubPhase.WORK, working_dir=working_dir)) + + # serving throughput phase + serving_throughput_store_cmd = output_datastore.create.substitute(destination=output_url) + actions.append(SubprocessAction(uc_key, serving_throughput_store_cmd, Phase.SERVING_THROUGHPUT, SubPhase.PREPARATION)) + serving_throughput_cmd = serving_throughput_template.substitute( + serving_engine=serving_engine, tpcxai_home=tpcxai_home, name=name, + engine=serving_engine, + input=serving_data_url, file=raw_files[0], model=model_url, output=output_url, + stream=STREAM_PLACEHOLDER, phase=PHASE_PLACEHOLDER + ) + actions.append(SubprocessAction(uc_key, serving_throughput_cmd, Phase.SERVING_THROUGHPUT, SubPhase.WORK, working_dir=working_dir)) + + # scoring + # loading for scoring + create_scoring_cmd = serving_datastore.create.substitute(destination=scoring_data_url) + actions.append(SubprocessAction(uc_key, create_scoring_cmd, Phase.SCORING_LOADING, SubPhase.PREPARATION)) + + raw_files_set.clear() + datagen_output_scoring = datagen_output + for raw_file in raw_files: + if datagen_output_scoring / 'scoring' / raw_file not in loaded_files_set: + raw_files_set.add(datagen_output_scoring / 'scoring' / raw_file) + if raw_folder: + if datagen_output_scoring / 'scoring' / raw_folder not in loaded_files_set: + raw_files_set.add(datagen_output_scoring / 'scoring' / raw_folder) + raw_files_as_posix_str = ' '.join([x.as_posix() for x in list(raw_files_set)]) + load_scoring_cmd = serving_datastore.load.substitute(source=raw_files_as_posix_str, destination=scoring_data_url) + actions.append(SubprocessAction(uc_key, load_scoring_cmd, Phase.SCORING_LOADING, SubPhase.WORK)) + loaded_files_set.update(raw_files_set) + + # serving for scoring + scoring_store_cmd = output_datastore.create.substitute(destination=output_url) + actions.append(SubprocessAction(uc_key, scoring_store_cmd, Phase.SCORING, SubPhase.INIT)) + scoring_serving_cmd = serving_template.substitute(serving_engine=serving_engine, tpcxai_home=tpcxai_home, name=name, + engine=serving_engine, + input=scoring_data_url, file=raw_files[0], model=model_url, + output=output_url, phase=PHASE_PLACEHOLDER) + actions.append(SubprocessAction(uc_key, scoring_serving_cmd, Phase.SCORING, SubPhase.INIT, working_dir=working_dir)) + + # copy serving output to local filesystem (download) + mod_path = tpcxai_home / Path(output_url) + mod_path.mkdir(exist_ok=True, parents=True) + scoring_source = str(output_url) + '/' + PHASE_PLACEHOLDER + '/predictions.csv' + scoring_download_cmd = serving_datastore.download.safe_substitute(source=scoring_source, destination=output_url) + actions.append(SubprocessAction(uc_key, scoring_download_cmd, Phase.SCORING, SubPhase.PREPARATION)) + + # calculate the metric + raw_file_name, raw_file_extension = str.split(raw_files[0], '.')[0:2] + data_dir = datagen_output / 'scoring' #if not is_datagen_parallel else datagen_output_local / 'scoring' <--- + true_labels = data_dir / (raw_file_name + '_labels.' 
+ raw_file_extension) + pred_labels = tpcxai_home / output_url / 'predictions.csv' + scoring_params = {'true_labels': true_labels, 'pred_labels': pred_labels, 'label_column': label_column, + 'metric_name': scoring_metric, 'labels': labels, 'delimiter': delimiter, + 'sort_predictions': scoring_sort, **scoring_kvargs} + scoring_cmd = ScoringAction(uc_key, scoring_params, Phase.SCORING, SubPhase.WORK) + # verification + if Phase.SCORING in phases: + scoring_cmd.add_verification(scoring_metric, quality_metric_threshold, quality_metric_larger_is_better) + actions.append(scoring_cmd) + if data_gen: + # add data generation actions <--- + action_datagen_train = datagen.run(Phase.TRAINING, collected_tables) + actions.append(action_datagen_train) + action_datagen_serve = datagen.run(Phase.SERVING, collected_tables) + actions.append(action_datagen_serve) + action_datagen_scoring = datagen.run(Phase.SCORING, collected_tables) + actions.append(action_datagen_scoring) + if is_datagen_parallel: + action_datagen_scoring_local = datagen_local.run(Phase.SCORING, collected_tables) + actions.append(action_datagen_scoring_local) + + # unified load + # just keep a single load command per file + # otherwise use-cases with same data would individually load, i.e. overwrite, the data + if not no_unified_load: + actions = merge_actions(actions, Phase.LOADING) + + # unify CLEAN + actions = merge_actions(actions, Phase.CLEAN) + + # filter actions + # only keep actions for the phases that were specified + actions = list(filter(lambda a: a.phase in phases, actions)) + # duplicate phases if they were specified multiple times + tmp_actions = [] + # rename duplicate phases + # e.g. TRAINING, TRAINING => TRAINING_1, TRAINING_2 + duplicate_phases = set([x for x in phases if phases.count(x) > 1]) + phase_counter = {} + for phase in phases: + if phase in phase_counter: + phase_counter[phase] += 1 + else: + phase_counter[phase] = int(1) + phase_num = phase_counter[phase] + acs = list(filter(lambda a: a.phase == phase, actions)) + renamed_acs = [] + for a in acs: + a = copy.deepcopy(a) + a.run = phase_num + renamed_acs.append(a) + tmp_actions.extend(renamed_acs) + actions = tmp_actions + + stream_names = [] + ucs = [] + phases_col = [] + phases_run_col = [] + sub_phases = [] + commands = [] + start_times = [] + end_times = [] + timings = [] + std_logs = [] + err_logs = [] + qualities = {} + + # TODO define order of execution / run rules here + # TODO for now execute one stage after another, e.g. 
+ # generating (uc1, uc2, ..., ucn) + # loading (uc1, uc2, ..., ucn) + # training (uc1, uc2, ..., ucn) + # serving (uc1, uc2, ..., ucn) + if execution_order == 'phase': + actions_in_order = [] + for key, group in groupby(actions, key=lambda a: a.use_case): + actions_in_order.extend(list(group)) + #actions_in_order = list(sorted(actions, key=lambda a: (a.phase.value[1], a.phase.value[0], a.run, a.subphase.value, a.use_case))) + elif execution_order == 'use-case': + actions_in_order = [] + for key, group in groupby(actions, key=lambda a: a.use_case): + actions_in_order.extend(list(group)) + #actions_in_order = list(sorted(actions, key=lambda a: (a.use_case, a.phase.value[1], a.run, a.subphase.value))) + else: + actions_in_order = [] + exit(1) + if data_gen: + if Phase.DATA_GENERATION in phases or Phase.SCORING in phases: + actions_in_order.insert(0, datagen.prepare()) + #extract all actions related to SERVING_THROUGHPUT + #keep THROUGHPUT and non-THROUGHPUT actions separate + actions_in_order_serving_throuhgput = list(filter(lambda a: a.phase.value == Phase.SERVING_THROUGHPUT.value, actions_in_order)) + actions_in_order = list(filter(lambda a: a.phase.value != Phase.SERVING_THROUGHPUT.value, actions_in_order)) + + # prepare logging + log_dir = tpcxai_home / 'logs' + if not log_dir.exists(): + log_dir.mkdir() + + current_time = time.localtime() + current_timestamp = time.time() + log_suffix = time.strftime('%Y%m%d-%H%M', current_time) + log_file = log_dir / f"tpcxai-sf{scale_factor}-{log_suffix}.csv" + i = 1 + while log_file.exists(): + log_file = log_dir / f"tpcxai-sf{scale_factor}-{log_suffix}-{i}.csv" + i += 1 + + # setup and initialize database + benchmark_sk = None + db_cursor = None + if not dry_run: + database_path = log_dir / 'tpcxai.db' + database = init_db(database_path) + + # log benchmark meta data + hostname = socket.gethostname() + db_cursor = database.cursor() + db_cursor.execute( + ''' + INSERT INTO benchmark (hostname, start_time, scale_factor, tpcxai_home, config_path, config, cmd_flags, benchmark_name, version) + VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?) 
+ ''', + (hostname, current_timestamp, scale_factor, str(tpcxai_home), str(config_path), yaml.dump(config), ' '.join(sys.argv), '', TPCXAI_VERSION) + ) + benchmark_sk = db_cursor.lastrowid + database.commit() + # close the current connection + # all future writes should happen through the DatabaseQueue + database.close() + db_queue = DatabaseQueue(database_path) + + # start the resource stat collection + logger = LogPerformanceDbHandler(benchmark_sk, db_queue) + daemon_stop_event = threading.Event() + + daemon_threads = [ + run_as_daemon(action.command, logger, daemon_stop_event, cwd=action.working_dir) + for action in timeseries + ] + for thread in daemon_threads: + thread.start() + + # run the benchmark + if Phase.CLEAN in phases and not dry_run: + # remove the temp dir + if temp_dir.exists(): + shutil.rmtree(temp_dir, ignore_errors=True) + + if Phase.CHECK_INTEGRITY in phases and not dry_run: + files_changed = 0 + hashes_file = Path(tpcxai_home, 'driver', 'dat', 'tpcxai.sha256') + if not hashes_file.exists(): + print(f"Hashes file {hashes_file} was not found") + stop(daemon_stop_event, db_queue) + exit(2) + with open(hashes_file, 'r') as f: + for line in f: + parts = line.rstrip().split(', ') + if len(parts) == 2: + file = parts[0] + checksum = parts[1] + check_file_path = Path(tpcxai_home, file) + #if check_file_path.exists()==False: + # print(f"{check_file_path} does not exist") + file_hash = hash_file(check_file_path) + if file_hash != checksum: + print(f"{file} was changed") + files_changed += 1 + # add if file was already logged to db + # if not log it with its relative path, absolute path, and sha256 hash + res = db_queue.query( + "SELECT COUNT(1) FROM benchmark_files WHERE benchmark_fk = ? AND relative_path = ?", + (benchmark_sk, file) + ) + if res.fetchone()[0] == 0: + db_queue.insert("INSERT INTO benchmark_files VALUES (?, ?, ?, ?)", + (benchmark_sk, file, check_file_path.as_posix(), file_hash)) + if files_changed > 0: + print(f"{files_changed} files have been changed") + time.sleep(2) + #stop(daemon_stop_event, db_queue) + #exit(2) + + threshold_checks = list() + for action in actions_in_order: + phase_run = action.run + phase_run_str = phase_run + phase_name = f"{str(action.phase).replace('Phase.', '')}_{phase_run}" + if not phase_run_str: + phase_run_str = '' + else: + phase_run_str = f'({phase_run_str})' + if action.subphase.value == SubPhase.INIT.value: + print(f"initializing {action.phase} {phase_run_str} for uc {action.use_case}") + elif action.subphase.value == SubPhase.PREPARATION.value: + print(f"preparing {action.phase} {phase_run_str} for uc {action.use_case}") + elif action.subphase.value == SubPhase.CLEANUP.value: + print(f"cleaning up {action.phase} {phase_run_str} for uc {action.use_case}") + else: + print(f"running {action.phase} {phase_run_str} for uc {action.use_case}") + try: + # get the command or scoring parameters as command + command = '' + if isinstance(action, SubprocessAction): + command = action.command + command = command.replace(PHASE_PLACEHOLDER, f"{phase_name}") + action.command = command + elif isinstance(action, ScoringAction): + command = action.scoring_params + + if dry_run: + print(command) + else: + query_last_use_case_sk = "SELECT max(command_sk) FROM command" + use_case_sk = db_queue.query(query_last_use_case_sk).fetchone()[0] + use_case_sk = use_case_sk + 1 if use_case_sk else 1 + query_command = ''' + INSERT INTO command (command_sk, benchmark_fk, use_case, phase, phase_run, sub_phase, command) + VALUES (?, ?, ?, ?, ?, ?, ?) 
+ ''' + query_command_params = (use_case_sk, benchmark_sk, action.use_case, str(action.phase), action.run, str(action.subphase), str(command)) + db_queue.insert(query_command, query_command_params) + db_queue.insert('INSERT INTO stream (use_case_fk, stream) VALUES (?, ?)', (use_case_sk, 'POWER_TEST')) + action_log_file = log_dir / f"{log_file.stem}-{action.phase}-{action.run}-{action.use_case}.out" + if verbose or action.phase.value == Phase.CLEAN.value: + print(command) + if action.phase.value == Phase.DATA_GENERATION.value or action.phase.value == Phase.SCORING_DATAGEN.value: + uses_parallel_data_gen = 'parallel-data-gen.sh' in action.command + current_wd = datagen_home if not uses_parallel_data_gen else tpcxai_home + if action.subphase.value == SubPhase.INIT.value: + with FileAndDBLogger(use_case_sk, action_log_file, db_queue) as logger: + run_and_capture(action.command.split(), logger, verbose=True, cwd=current_wd) + else: + if isinstance(action, SubprocessAction) and action.working_dir: + current_wd = action.working_dir + else: + current_wd = tpcxai_home + + start = time.perf_counter() + start_time = time.time() + if isinstance(action, ScoringAction): + metric_name = action.scoring_params['metric_name'] + print(metric_name) + result = scoring(**action.scoring_params) + metric_msg = f"Metric ({metric_name}): {result}" + print(metric_msg) + db_queue.insert('INSERT INTO quality_metric VALUES (?, ?, ?)', (use_case_sk, metric_name, result)) + if action.verification: + st = time.time() + s = time.perf_counter() + a_ver = action.verification + a_ver.add_metric(result) + meets_quality_threshold = a_ver.meets_quality_threshold() + e = time.perf_counter() + et = time.time() + rt = e - s + rc = 0 if meets_quality_threshold else 1 + threshold_checks.append(rc) + verification_usecase_sk = use_case_sk + 1 + qry = '''INSERT INTO command (command_sk, benchmark_fk, use_case, phase, phase_run, sub_phase, command, + start_time, end_time, runtime, return_code) + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''' + qry_params = (verification_usecase_sk, benchmark_sk, a_ver.use_case, str(a_ver.phase), 1, + str(a_ver.subphase), a_ver.get_metric_command(), st, et, rt, rc) + db_queue.insert(qry, qry_params) + db_queue.insert('INSERT INTO stream (use_case_fk, stream) VALUES (?, ?)', + (verification_usecase_sk, 'POWER_TEST')) + print(f"running {a_ver.phase} for uc {a_ver.use_case}") + if meets_quality_threshold: + print(f"checking threshold: {a_ver.get_metric_command()} OK") + else: + print(f"checking threshold: {a_ver.get_metric_command()} FAILURE") + # set return code to 0 if scoring was successful + db_queue.insert('UPDATE command SET return_code = ? WHERE command_sk = ?', (0, use_case_sk)) + std_logs.append(metric_msg) + err_logs.append('') + qualities[action.use_case] = (metric_name, result) + else: + with FileAndDBLogger(use_case_sk, action_log_file, db_queue) as logger: + result = run_and_capture(action.command, logger, verbose=verbose, cwd=current_wd, shell=True) + end = time.perf_counter() + end_time = time.time() + duration = round(end - start, 3) + action.duration = duration + db_queue.insert(''' + UPDATE command + SET start_time = ?, end_time = ?, runtime = ? 
+ WHERE command_sk = ?''', (start_time, end_time, duration, use_case_sk)) + print(f"time: {duration}s") + + stream_names.append('POWER_TEST') + ucs.append(action.use_case) + phases_col.append(action.phase) + phases_run_col.append(phase_run) + sub_phases.append(action.subphase) + if isinstance(action, SubprocessAction): + commands.append(action.command) + else: + commands.append(action.scoring_params) + start_times.append(start_time) + end_times.append(end_time) + timings.append(duration) + if not isinstance(action, PythonAction) and not isinstance(action, DaemonAction): + err_logs.append(result.stderr) + std_logs.append(result.stdout) + db_queue.insert('UPDATE command SET return_code = ? WHERE command_sk = ?', + (result.returncode, use_case_sk)) + + if result.returncode == 0: + print('SUCCESS') + else: + print('FAILURE') + print(result.stdout) + print(result.stderr, file=sys.stderr) + print(f"command was: {action.command}") + stop(daemon_stop_event, db_queue) + exit(2) + + except Exception as e: + print(f"An error occured while running the action for use-case {action.use_case} in phase {action.phase}.{action.subphase}:") + print(action) + print(e) + benchmark_end_time = time.time() + db_queue.insert('UPDATE benchmark SET successful = ?, end_time = ? WHERE benchmark_sk = ?', + ('FALSE', benchmark_end_time, benchmark_sk)) + stop(daemon_stop_event, db_queue) + exit(1) + + # run the throughput test + if Phase.SERVING_THROUGHPUT in phases: + # assemble the streams + streams = [] + for stream in streams_mapping: + stream_int = list(map(lambda s: int(s), stream)) + acs = list(filter(lambda a: a.use_case in stream_int, actions_in_order_serving_throuhgput)) + acs = list(sorted(acs, key=lambda a: (a.phase, a.subphase, stream_int.index(a.use_case)))) + streams.append(copy.deepcopy(acs)) + + # run the streams + stream_threads = [] + killall_event = threading.Event() + i = 0 + index = 0 + for stream in streams: + name = streams_mapping_keys[i] + stream_commands = [] + stream_actions = [] + verboses = [] + for action in stream: + # append stream name to output + action.command = action.command.replace(STREAM_PLACEHOLDER, name) + if dry_run: + print(action.command) + else: + stream_commands.append(action.command) + stream_actions.append(action) + verboses.append(verbose) + if not dry_run: + s = Stream(index, name, stream_actions, db_queue, benchmark_sk, log_file, verboses, tpcxai_home, + killall_event, shell=True) + stream_threads.append(s) + i += 1 + index += len(stream_actions) + + if dry_run: + exit(0) + else: + for thread in stream_threads: + thread.start() + + stream_exceptions = {} + for thread in stream_threads: + thread.join() + if thread.exc: + stream_exceptions[thread.name] = thread.exc + + if len(stream_exceptions) > 0: + print(f"phase STREAMING_THROUGHPUT failed") + for stream_name, exc in stream_exceptions.items(): + print(f"stream {stream_name} failed") + print(f"{exc}") + stop(daemon_stop_event, db_queue) + exit(1) + + for thread, stream in zip(stream_threads, streams): + for result, action, start_time, end_time, duration in zip(thread.results, stream, thread.start_times, thread.end_times, thread.timings): + if result: + stream_names.append(thread.name) + ucs.append(action.use_case) + phases_col.append(action.phase) + phases_run_col.append(action.run) + sub_phases.append(action.subphase) + commands.append(action.command) + start_times.append(start_time) + end_times.append(end_time) + timings.append(duration) + std_logs.append(result.stdout) + err_logs.append(result.stderr) + + if dry_run: 
+ exit(0) + + benchmark_end_time = time.time() + db_queue.insert('UPDATE benchmark SET end_time = ? WHERE benchmark_sk = ?', + (benchmark_end_time, benchmark_sk)) + + # Extract and print timings + logs = pd.DataFrame({'stream': stream_names, 'use_case': ucs, 'phase': phases_col, 'phase_run': phases_run_col, 'sub_phase': sub_phases, + 'command': commands, 'metric': timings, 'std_out': std_logs, 'std_err': err_logs, + 'start_time': start_times, 'end_time': end_times}) + # change name of phase in presence of multiple runs of said this phase + logs['phase_name'] = logs['phase'].astype(str) + '_' + logs['phase_run'].astype(str) + + print('========== RESULTS ==========') + tmp = logs[(logs['sub_phase'] == SubPhase.WORK) & + ((logs['phase'] == Phase.TRAINING) | (logs['phase'] == Phase.SERVING))] + has_metric = tmp.size != 0 + if has_metric: + output = tmp.pivot(index='use_case', columns='phase_name', values='metric') + for uc in ucs: + if uc == 0: + continue + try: + quality = qualities[uc] + qn = quality[0] + qv = quality[1] + output.loc[uc, 'qualitity_metric_name'] = qn + output.loc[uc, 'qualitity_metric_value'] = qv + except KeyError: + continue + print(output) + + + # Include datagen as part of TLOAD? + datagen_time = logs[(logs['sub_phase'] == SubPhase.WORK) & (logs['phase'] == Phase.DATA_GENERATION)] + datagen_in_tload = workload.get('include_datagen_in_tload',False)==True + # + + throughput_results = logs[(logs['sub_phase'] == SubPhase.WORK) & (logs['phase'] == Phase.SERVING_THROUGHPUT)] + throughput_results = throughput_results.groupby(['stream'])['metric'].sum().reset_index() + if len(throughput_results) > 0: + print('SERVING THROUGHPUT') + print(throughput_results) + + i = 1 + while log_file.exists(): + log_file = log_dir / f"tpcxai-sf{scale_factor}-{log_suffix}-{i}.csv" + i += 1 + logs.to_csv(log_file, index=False) + + if has_metric: + metrics_file = log_dir / f"tpcxai-metrics-sf{scale_factor}-{log_suffix}.csv" + i = 1 + while metrics_file.exists(): + metrics_file = log_dir / f"tpcxai-metrics-sf{scale_factor}-{log_suffix}-{i}.csv" + i += 1 + output.to_csv(metrics_file, index=True) + + aiucpm_metrics_file = log_dir / f"adabench-aiucpm-metrics-sf{scale_factor}-{log_suffix}.csv" + i=1 + while aiucpm_metrics_file.exists(): + aiucpm_metrics_file = log_dir / f"adabench-aiucpm-metrics-sf{scale_factor}-{log_suffix}-{i}.csv" + i += 1 + + aiucpm_logger = logging.getLogger() + aiucpm_logger.setLevel(logging.INFO) + aiucpm_logger.addHandler(logging.FileHandler(aiucpm_metrics_file)) + aiucpm_logger.addHandler(logging.StreamHandler(sys.stdout)) + if not dry_run: + aiucpm_logger.addHandler(LogDbHandler(benchmark_sk, db_queue)) + AIUCpm_metric = compute_tpcxai_metric(logs,scale_factor,aiucpm_logger,datagen_in_tload,num_streams) + + if AIUCpm_metric > 0: + aiucpm_logger.info(f'AIUCpm@{scale_factor}={AIUCpm_metric}') + db_queue.insert('UPDATE benchmark SET successful = ? WHERE benchmark_sk = ?', + ('TRUE', benchmark_sk)) + elif not data_gen: + aiucpm_logger.info(f'Unable to compute AIUCpm@{scale_factor}. One or more required phases or values couldn\'t be executed or computed for this benchmark run.') + db_queue.insert('UPDATE benchmark SET successful = ? 
WHERE benchmark_sk = ?', + ('FALSE', benchmark_sk)) + + stop(daemon_stop_event, db_queue) + + +def stop(stop_event, db_queue): + stop_daemons(stop_event) + db_queue.stop_gracefully() + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/data.py b/scripts/tpcx-ai/driver/tpcxai-driver/data.py new file mode 100644 index 00000000000..821ac9cab59 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/data.py @@ -0,0 +1,174 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. 
+# +# + + +from enum import Enum +from functools import total_ordering +from string import Template +from typing import List, Dict, Union + + +@total_ordering +class Phase(Enum): + CHECK_INTEGRITY = (-2, -2) + CLEAN = (-1, 1) + INIT = (0, 2) + DATA_GENERATION = (1, 3) + LOADING = (2, 4) + TRAINING = (3, 5) + SERVING = (4, 6) + SERVING_THROUGHPUT = (5, 6) + SCORING_DATAGEN = (6, 6) + SCORING_LOADING = (7, 7) + SCORING = (8, 8) + VERIFICATION = (9, 9) + + def __lt__(self, other): + return self.value < other.value + + +@total_ordering +class SubPhase(Enum): + NONE = -1 + INIT = 0 + PREPARATION = 1 + WORK = 2 + POSTWORK = 3 + CLEANUP = 4 + + def __lt__(self, other): + return self.value < other.value + + +class Action: + + def __init__(self, use_case: int, phase: Phase, + sub_phase: SubPhase = SubPhase.NONE, metadata=None, run=1): + self.use_case = use_case + self.phase = phase + self.subphase = sub_phase + self.duration = None + self.metadata = metadata + self.run = run + + +class SubprocessAction(Action): + + def __init__(self, use_case: int, command: str, phase: Phase, sub_phase: SubPhase = SubPhase.NONE, + working_dir=None, metadata=None): + super().__init__(use_case, phase, sub_phase, metadata) + self.command = command + self.working_dir = working_dir + + +class DaemonAction(SubprocessAction): + + def __init__(self, use_case: int, command: str, phase: Phase, sub_phase: SubPhase = SubPhase.NONE, + working_dir=None, metadata=None): + super().__init__(use_case, command, phase, sub_phase, working_dir, metadata) + + +class PythonAction(Action): + + def __init__(self, use_case: int, phase: Phase, sub_phase: SubPhase = SubPhase.NONE, metadata=None): + super().__init__(use_case, phase, sub_phase, metadata) + + +class ScoringAction(PythonAction): + + def __init__(self, use_case: int, scoring_params: Dict, phase: Phase, sub_phase: SubPhase = SubPhase.NONE, + metadata=None): + super().__init__(use_case, phase, sub_phase, metadata) + self.scoring_params = scoring_params + self.verification: Union[None, VerificationAction] = None + + def add_verification(self, metric_name: str, metric_threshold: float, metric_higher_is_better: bool): + self.verification = VerificationAction(self.use_case, metric_name, metric_threshold, metric_higher_is_better, + Phase.VERIFICATION, SubPhase.WORK, self.metadata) + + +class VerificationAction(PythonAction): + + def __init__(self, use_case: int, metric_name: str, metric_threshold: float, metric_larger_is_better: bool, + phase: Phase, sub_phase: SubPhase = SubPhase.NONE, metadata=None): + super().__init__(use_case, phase, sub_phase, metadata) + self.metric_name = metric_name + self.metric_threshold = metric_threshold + self.metric_larger_is_better = metric_larger_is_better + self.metric = None + + def add_metric(self, metric: float): + self.metric = metric + + def meets_quality_threshold(self): + if self.metric: + if self.metric_larger_is_better: + return self.metric >= self.metric_threshold + else: + return self.metric <= self.metric_threshold + else: + raise RuntimeError('No metric has been set. 
Please set a metric with `add_metric`.') + + def get_metric_command(self): + comparator = '>=' if self.metric_larger_is_better else '<=' + return f"{self.metric} {comparator} {self.metric_threshold}" + + +class DataStore: + + def __init__(self, name: str, create_template: Template, copy_template: Template, + load_template: Template, load_dir_template: Template, + delete_template: Template, download_template: Template, delete_parallel_template: Template = None): + self.name = name + self.create = create_template + self.copy = copy_template + self.load = load_template + self.load_dir = load_dir_template + self.delete = delete_template + self.download = download_template + self.delete_parallel = delete_parallel_template + + +def datastore_from_dict(dictionary: dict): + load_dir = dictionary['load_directory'] if 'load_directory' in dictionary else dictionary['load'] + load_dir = Template(load_dir) + if 'delete_parallel' in dictionary: + return DataStore(dictionary['name'], + Template(dictionary['create']), Template(dictionary['copy']), + Template(dictionary['load']), load_dir, + Template(dictionary['delete']), Template(dictionary['download']), + Template(dictionary['delete_parallel'])) + else: + return DataStore(dictionary['name'], + Template(dictionary['create']), Template(dictionary['copy']), + Template(dictionary['load']), load_dir, + Template(dictionary['delete']), Template(dictionary['download'])) + + +class Metadata: + + def __init__(self, num_records): + self.num_records = num_records diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/data_generation.py b/scripts/tpcx-ai/driver/tpcxai-driver/data_generation.py new file mode 100644 index 00000000000..934fa9976b2 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/data_generation.py @@ -0,0 +1,133 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. 
+# +# + +from string import Template + +from .data import Phase, SubprocessAction, SubPhase + + +class DataGeneration: + _pdgf_init_template = Template("java -jar $pdgf_path -ns") + _pdgf_training_template = \ + Template("java -Djava.awt.headless=true $pdgf_java_opts -jar $pdgf_path -l $schema -l $generation -ns -sf $scale_factor -sp MY_SEED $seed " + "-sp includeLabels 1.0 -sp TTVF 1.0 -s $tables -output '\"$output/\"'") + + _pdgf_training_template_parallel =\ + Template("$tpcxai_home/tools/parallel-data-gen.sh -p $pdgf_path -h nodes -o \"-l $schema -l $generation -ns " + "-sf $scale_factor -sp MY_SEED $seed -sp includeLabels 1.0 -sp TTVF 1.0 " + "-s $tables -output \\'\\\"$output/\\\"\\'\"") + + _pdgf_serving_template = \ + Template("java -Djava.awt.headless=true $pdgf_java_opts -jar $pdgf_path -l $schema -l $generation -ns " + "-sf $scale_factor -sp SF_TRAINING $scale_factor_training -sp MY_SEED $seed " + "-sp includeLabels 0.0 -sp TTVF $ttvf -s $tables -output '\"$output/\"'") + + _pdgf_serving_template_parallel = \ + Template("$tpcxai_home/tools/parallel-data-gen.sh -p $pdgf_path -h nodes -o \"-l $schema -l $generation -ns " + "-sf $scale_factor -sp SF_TRAINING $scale_factor_training -sp MY_SEED $seed " + "-sp includeLabels 0.0 -sp TTVF $ttvf -s $tables -output \\'\\\"$output/\\\"\\'\"") + + _pdgf_scoring_template = \ + Template("java -Djava.awt.headless=true $pdgf_java_opts -jar $pdgf_path -l $schema -l $generation -ns " + "-sf $scale_factor -sp SF_TRAINING $scale_factor_training -sp MY_SEED $seed " + "-sp includeLabels 2.0 -sp TTVF $ttvf -s $tables -output '\"$output/\"'") + + _pdgf_scoring_template_parallel = \ + Template("$tpcxai_home/tools/parallel-data-gen.sh -p $pdgf_path -h nodes -o \"-l $schema -l $generation -ns " + "-sf $scale_factor -sp SF_TRAINING $scale_factor_training -sp MY_SEED $seed " + "-sp includeLabels 2.0 -sp TTVF $ttvf -s $tables -output \\'\\\"$output/\\\"\\'\"") + + def __init__(self, seed, tpcxai_home, pdgf_home, pdgf_java_opts, datagen_home, datagen_config, datagen_output, scale_factor=0.1, ttvf=0.01, + parallel=False): + self.seed = seed + self.tpcxai_home = tpcxai_home + self.pdgf_home = pdgf_home + self.pdgf_java_opts = pdgf_java_opts + self.datagen_home = datagen_home + self.scale_factor = scale_factor + self.output = datagen_output + self.datagen_schema = datagen_config / "tpcxai-schema.xml" + self.datagen_generation = datagen_config / "tpcxai-generation.xml" + self.ttvf = ttvf + self.parallel = parallel + self.meta_data = {} + + def prepare(self): + pdgf = self.pdgf_home / 'pdgf.jar' + cmd = self._pdgf_init_template.substitute(pdgf_path=pdgf) + return SubprocessAction(0, cmd, Phase.DATA_GENERATION, SubPhase.INIT, metadata=None) + + def run(self, phase, tables): + if not self.parallel: + pdgf = self.pdgf_home / 'pdgf.jar' + else: + pdgf = self.pdgf_home + + # training + training_template = self._pdgf_training_template if not self.parallel else self._pdgf_training_template_parallel + pdgf_train = training_template.substitute( + pdgf_java_opts=self.pdgf_java_opts, + tpcxai_home=self.tpcxai_home, pdgf_path=pdgf, seed=self.seed, + schema=self.datagen_schema, generation=self.datagen_generation, + scale_factor=self.scale_factor, has_labels=1.0, ttvf=1, + tables=' '.join(tables), output=self.output / 'training' + ) + + # serving + serving_template = self._pdgf_serving_template if not self.parallel else self._pdgf_serving_template_parallel + pdgf_serve = serving_template.substitute( + pdgf_java_opts=self.pdgf_java_opts, + tpcxai_home=self.tpcxai_home, 
pdgf_path=pdgf, + schema=self.datagen_schema, generation=self.datagen_generation, + seed=self.seed + 1, scale_factor=self.scale_factor, has_labels=0.0, ttvf=self.ttvf, + scale_factor_training=self.scale_factor, + tables=' '.join(tables), output=self.output / 'serving' + ) + + # scoring + scoring_template = self._pdgf_scoring_template if not self.parallel else self._pdgf_scoring_template_parallel + pdgf_score_seeded = scoring_template.substitute( + pdgf_java_opts=self.pdgf_java_opts, + tpcxai_home=self.tpcxai_home, pdgf_path=pdgf, + schema=self.datagen_schema, generation=self.datagen_generation, + seed=self.seed + int(self.scale_factor), scale_factor=1, has_labels=1.0, ttvf=self.ttvf, + scale_factor_training=self.scale_factor, + tables=' '.join(tables), output=self.output / 'scoring' + ) + + new_phase = Phase.DATA_GENERATION + + if phase.value == Phase.TRAINING.value: + cmd = pdgf_train + elif phase.value == Phase.SERVING.value: + cmd = pdgf_serve + elif phase.value == Phase.SCORING.value: + cmd = pdgf_score_seeded + new_phase = Phase.SCORING_DATAGEN + else: + cmd = '' + + return SubprocessAction(0, cmd, new_phase, SubPhase.WORK, metadata=None) diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/database.py b/scripts/tpcx-ai/driver/tpcxai-driver/database.py new file mode 100644 index 00000000000..c79a70c1125 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/database.py @@ -0,0 +1,124 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +import queue +import sqlite3 +from threading import Thread, BoundedSemaphore + + +class DatabaseQueue: + """Single point of contact with the SQLite database. A background thread is started that will handle access to the + database to prevent concurrency issues. The queries are queued and handled in a FIFO fashion. Internally there are + two queue that handle different types of database requests. 
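+
+    Illustrative usage (editor's sketch; the database path below is hypothetical,
+    the statements mirror how the driver itself uses this class):
+
+        db_queue = DatabaseQueue(Path('logs/tpcxai.db'))
+        db_queue.insert('INSERT INTO stream (use_case_fk, stream) VALUES (?, ?)', (1, 'POWER_TEST'))
+        last_sk = db_queue.query('SELECT max(command_sk) FROM command').fetchone()[0]
+        db_queue.stop_gracefully()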
+ """ + + def __init__(self, database_path): + self.database_path = database_path + self.queue = queue.Queue() + self.insert_queue = queue.Queue() + self.stop_signaled = False + self.db_semaphore = BoundedSemaphore(2) + self.db_thread = Thread(target=self._db_worker, args=(self.queue, ), daemon=True) + self.db_thread.start() + self.db_thread_insert = Thread(target=self._db_worker, args=(self.insert_queue, )) + self.db_thread_insert.start() + db_uri = database_path.as_uri() + '?mode=ro' + self.query_connection = sqlite3.connect(db_uri, uri=True, check_same_thread=False) + + def query(self, query, *args) -> sqlite3.Cursor: + """ + Read only access to the underlying database, blocks and returns the result cursor. + :param query: The SELECT query to be executed + :param args: The parameters for the query + :return: Cursor with the result of the given query + """ + try: + self.insert_queue.join() + res = self.query_connection.execute(query, *args) + self.query_connection.commit() + return res + except Exception as e: + raise RuntimeError(f"Error running {query} with parameters: {args}") from e + + def insert(self, query, params=[]): + """ + Executes a DB write. All writes are queued and executed by a single background thread to prevent concurrency + issues. Statements given are guaranteed to be executed eventually. That is, the main thread waits until all + request are fulfilled. + :param query: + :param params: + """ + query_item = (query, params) + self.insert_queue.put(query_item) + + def dump(self, query, params=None): + """ + Executes a DB write (non-guaranteed). All writes are queued and executed by an exclusive background thread. + Statements given are *not* guaranteed to be executed. That is the background thread is killed when the main + thead is killed and all remaining statements are left unfinished. + :param query: + :param params: + """ + query_item = (query, params) + self.queue.put(query_item) + + def stop_gracefully(self): + """ + Wait for the insert queue to finish. + :return: + """ + self.insert_queue.join() + self.stop_signaled = True + + def stop(self): + """ + Kill all background threads immediately. + :return: + """ + self.stop_signaled = True + + def wait_until_finished(self): + self.insert_queue.join() + + def _db_worker(self, sql_queue: queue.Queue): + connection = sqlite3.connect(str(self.database_path)) + while not self.stop_signaled: + try: + query, params = sql_queue.get(timeout=1) + except queue.Empty: + continue + try: + # guard the db with a semaphore to only allow one write at a time + with self.db_semaphore: + connection.execute(query, params) + connection.commit() + except sqlite3.Error as e: + raise RuntimeError(f"An error occurred when running {query} with params: {params}") from e + finally: + sql_queue.task_done() + + connection.close() diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/logger.py b/scripts/tpcx-ai/driver/tpcxai-driver/logger.py new file mode 100644 index 00000000000..a021b3c638d --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/logger.py @@ -0,0 +1,139 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. 
+# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2021 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# +import logging +import sqlite3 +import time +from pathlib import Path +from typing import Union + + +class LogDbHandler(logging.Handler): + + def __init__(self, benchmark_id, db_queue): + logging.Handler.__init__(self) + self.benchmark_id = benchmark_id + self.db_queue = db_queue + + def emit(self, record: logging.LogRecord) -> None: + rec_time = time.time() + if self.db_queue: + splits = record.msg.split(":") + # get metric parts in the form: TLOAD:0.01 + if len(splits) >= 2: + name, value = splits + query = 'INSERT INTO performance_metric(benchmark_fk, metric_name, metric_value, metric_time) ' \ + 'VALUES(?, ?, ?, ?)' + self.db_queue.insert(query, (self.benchmark_id, name, value, rec_time)) + # get final metric in the form: AIUCpm@1.0=10.3 + splits = record.msg.split("=") + if len(splits) >= 2: + name, value = splits + if "@" in name: + query = 'INSERT INTO performance_metric(benchmark_fk, metric_name, metric_value, metric_time) ' \ + 'VALUES(?, ?, ?, ?)' + self.db_queue.insert(query, (self.benchmark_id, name, value, rec_time)) + + +class LogPerformanceDbHandler(): + """A logging Handler to continuously log performance metrics""" + + # the number of parts in a message that are necessary + NUM_OF_PARTS = 5 + + def __init__(self, benchmark_id, db_queue, level=logging.INFO): + self.benchmark_id = benchmark_id + self.db_queue = db_queue + + def emit(self, record: str) -> None: + parts = list(map(lambda s: s.strip(), record.split(','))) + if len(parts) >= self.NUM_OF_PARTS: + q = "INSERT INTO timeseries(benchmark_fk, hostname, name, timestamp, value, unit) VALUES(?, ?, ?, ?, ?, ?)" + host, timestamp, name, value, unit = parts[:self.NUM_OF_PARTS] + self.db_queue.dump(q, (self.benchmark_id, host, name, timestamp, value, unit)) + + +class FileAndDBLogger: + + def __init__(self, usercase, log_dir: Union[str, Path], db_queue, max_lines=10000): + """ + + :param log_dir: + :param db_connection: + :param max_lines: Maximum number of lines to keep before flushing the logs + """ + self.log_dir = log_dir + self.db_queue = db_queue + self.max_lines = max_lines + self.std_out = [] + self.std_err = [] + self.usecase = usercase + + def __enter__(self): + self.file_handle = open(self.log_dir, 'a') + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + self.close() + + def close(self): + self.flush() 
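+        # flush any remaining buffered stdout to the database before closing the log file handle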
+ self.file_handle.close() + + def log_out(self, msg): + """ + Collects the log entries until max_bytes threshold is reached + :param msg: The message or text to write to the log + :return: + """ + self.std_out.append(msg) + self.file_handle.write(f"{msg}") + self.file_handle.flush() + if len(self.std_out) >= self.max_lines: + # flush the log + self.flush() + + def flush(self): + text_out = ''.join(self.std_out) + + # write to database + try: + last_part = self.db_queue.query('SELECT max(part) FROM log_std_out WHERE use_case_fk = ?', + (self.usecase, ) + ).fetchone()[0] + if last_part: + current_part = last_part + 1 + else: + current_part = 1 + self.db_queue.insert('INSERT INTO log_std_out VALUES(?, ?, ?)', (self.usecase, current_part, text_out)) + except sqlite3.Error as e: + print(e) + + # clear the log + self.std_out = [] + + def last_out(self): + text = ''.join(self.std_out) + return text diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/metrics.py b/scripts/tpcx-ai/driver/tpcxai-driver/metrics.py new file mode 100644 index 00000000000..c3350a07b3b --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/metrics.py @@ -0,0 +1,76 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2021 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +import re +def levenshtein(a, b): + """ + Calculates the Levenshtein distance between a and b. 
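+
+    For example, levenshtein('kitten', 'sitting') returns 3 (two substitutions and
+    one insertion). The arguments may also be lists of words, which is how
+    word_error_rate below computes word-level edit distances.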
+ """ + # The following code is from: https://folk.idi.ntnu.no/mlh/hetland_org/coding/python/levenshtein.py + n, m = len(a), len(b) + if n > m: + # Make sure n <= m, to use O(min(n,m)) space + a, b = b, a + n, m = m, n + + current = list(range(n+1)) + for i in range(1, m+1): + previous, current = current, [i]+[0]*n + for j in range(1, n+1): + add, delete = previous[j]+1, current[j-1]+1 + change = previous[j-1] + if a[j-1] != b[i-1]: + change = change + 1 + current[j] = min(add, delete, change) + + return current[n] + + +def word_error_rate(y_true, y_pred, pre_process=True): + """ + Calculates the word error rate + :param pre_process: If the texts should be pre-processed prior to being used to calculate the WER + :param y_true: ndarray of true texts + :param y_pred: ndarray of predicted texts + :return: the word-error-rate as a float + """ + def pre_proc(string): + return re.sub('[^a-z \']', '', string.lower()) + + def lev(first_str, second_str): + first_str = str(first_str) + second_str = str(second_str) + if pre_process: + first_norm = pre_proc(first_str) + second_norm = pre_proc(second_str) + else: + first_norm = first_str.lower() + second_norm = second_str.lower() + return levenshtein(first_norm.split(), second_norm.split()) + distances = map(lev, y_true, y_pred) + lengths = map(lambda txt: len(str(txt).split()), y_true) + return sum(distances) / sum(lengths) diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/subprocess_util.py b/scripts/tpcx-ai/driver/tpcxai-driver/subprocess_util.py new file mode 100644 index 00000000000..ac03ee65143 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/subprocess_util.py @@ -0,0 +1,178 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# +import os +import queue +import signal +import subprocess # nosec - the subprocess module is a crucial component to enable the flexibility to add other implementations of this benchmark. 
+import threading +import time +from pathlib import Path + +from .data import SubPhase +from .logger import FileAndDBLogger + +state = queue.Queue() + + +def run_and_capture(command, logger, verbose=False, **kvargs): + proc_out = {'stdout': [], 'stderr': []} + with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, bufsize=1, + universal_newlines=True, **kvargs) as p: + for line in p.stdout: + logger.log_out(line) + proc_out['stdout'].append(line) + if verbose: + print(line, end='', flush=True) + + result = subprocess.CompletedProcess(p.args, p.returncode, ''.join(proc_out['stdout']), ''.join(proc_out['stderr'])) + return result + + +def run_and_log(command, logger, stop_event: threading.Event, verbose=False, **kvargs): + os.environ['PYTHONUNBUFFERED'] = '1' + with subprocess.Popen(command, start_new_session=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, bufsize=1, + universal_newlines=True, **kvargs) as p: + # append process id to global state + # this is used to kill the process/ process group from a different thread + state.put(p.pid) + forbidden_words = ['b\'', '\'', r'\n', r'\r'] + for line in p.stdout: + for w in forbidden_words: + line = line.replace(w, '') + line = line.strip() + if stop_event.is_set(): + break + else: + logger.emit(line) + + +def run_as_daemon(command, logger, stop_event, verbose=False, **kvargs): + thread_kvargs = {'command': command.split(), 'logger': logger, 'stop_event': stop_event, 'verbose': verbose, + **kvargs} + daemon_thread = threading.Thread(target=run_and_log, kwargs=thread_kvargs, daemon=True) + return daemon_thread + + +def stop_daemons(stop_event): + stop_event.set() + while True: + try: + pid = state.get(block=False) + try: + pg_pid = os.getpgid(pid) + os.killpg(pg_pid, signal.SIGHUP) + except ProcessLookupError: + pass + # the process was already killed, move on + state.task_done() + except queue.Empty: + break + + +class Stream(threading.Thread): + + def __init__(self, index, name, actions, db_queue, benchmark_sk, log_file: Path, verboses, tpcxai_home, + killall_event, **kvargs): + if not (len(actions) == len(verboses)): + raise ValueError(f'lengths of the parameters must match: ' + f'len(actions)={len(actions)}, len(verboses)={len(verboses)}') + threading.Thread.__init__(self) + self.index = index + self.name = name + self.actions = actions + self.db_queue = db_queue + self.benchmark_sk = benchmark_sk + self.verboses = verboses + self.kvargs = kvargs + self.results = [None] * len(actions) + self.last_idx = 0 + self.exc = None + self.killall_event = killall_event + self.start_times = [] + self.end_times = [] + self.timings = [] + self.log_file = log_file + self.adabench_home = tpcxai_home + use_case_sk = self.db_queue.query("SELECT max(command_sk) FROM command").fetchone()[0] + use_case_sk = use_case_sk + 1 if use_case_sk else 1 + self.base_usecase_sk = use_case_sk + + def run(self) -> None: + i = 0 + + for action, verbose in zip(self.actions, self.verboses): + if self.killall_event.is_set(): + break + else: + msg = [] + if action.subphase.value == SubPhase.INIT.value: + msg.append(f"stream {self.name} initializing {action.phase} for uc {action.use_case}") + elif action.subphase.value == SubPhase.PREPARATION.value: + msg.append(f"stream {self.name} preparing {action.phase} for uc {action.use_case}") + else: + msg.append(f"stream {self.name} running {action.phase} for uc {action.use_case}") + + use_case_sk = self.base_usecase_sk + self.index + i + self.db_queue.insert(''' + INSERT INTO command (command_sk, 
benchmark_fk, use_case, phase, phase_run, sub_phase, command) + VALUES (?, ?, ?, ?, ?, ?, ?) + ''', + (use_case_sk, self.benchmark_sk, action.use_case, + str(action.phase), action.run, str(action.subphase), str(action.command))) + self.db_queue.insert('INSERT INTO stream (use_case_fk, stream) VALUES (?, ?)', (use_case_sk, self.name)) + log_dir = self.log_file.parent + action_log_file = log_dir / f"{self.log_file.stem}-{action.phase}-{action.run}-{self.name}-{action.use_case}.out" + + start = time.perf_counter() + start_time = time.time() + with FileAndDBLogger(use_case_sk, action_log_file, self.db_queue) as logger: + if action.working_dir is not None: + result = run_and_capture(action.command, logger, verbose, cwd=action.working_dir, **self.kvargs) + else: + result = run_and_capture(action.command, logger, verbose, cwd=self.adabench_home, **self.kvargs) + end = time.perf_counter() + end_time = time.time() + duration = round(end - start, 3) + self.db_queue.insert('UPDATE command SET return_code = ?, start_time = ?, end_time = ?, runtime = ? WHERE command_sk = ?', + (result.returncode, start_time, end_time, duration, use_case_sk)) + self.start_times.append(start_time) + self.end_times.append(end_time) + self.timings.append(duration) + msg.append(f"in {duration}") + print('\n'.join(msg)) + self.results[i] = result + self.last_idx = i + if result.returncode != 0: + self.exc = RuntimeError(f"command {action.command} return {result.returncode}") + self.killall_event.set() + i += 1 + + def last_result(self): + return self.results[self.last_idx] + + def last_action(self): + return self.actions[self.last_idx] + diff --git a/scripts/tpcx-ai/driver/tpcxai-driver/usecase.py b/scripts/tpcx-ai/driver/tpcxai-driver/usecase.py new file mode 100644 index 00000000000..e10f4ba936b --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai-driver/usecase.py @@ -0,0 +1,34 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. 
+# +# + + +class Usecase: + def __init__(self, number, datagen_cmd, loading_cmd, training_cmd, serving_cmd): + self.number = number + self.datagen_cmd = datagen_cmd + self.loading_cmd = loading_cmd + self.training_cmd = training_cmd + self.serving_cmd = serving_cmd diff --git a/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/PKG-INFO b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/PKG-INFO new file mode 100644 index 00000000000..75aa6ec2ab9 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/PKG-INFO @@ -0,0 +1,10 @@ +Metadata-Version: 2.1 +Name: tpcxai-driver +Version: 0.8 +Summary: UNKNOWN +Home-page: UNKNOWN +License: UNKNOWN +Platform: UNKNOWN + +UNKNOWN + diff --git a/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/SOURCES.txt b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/SOURCES.txt new file mode 100644 index 00000000000..44ad77a4ba5 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/SOURCES.txt @@ -0,0 +1,14 @@ +setup.py +tpcxai-driver/__init__.py +tpcxai-driver/__main__.py +tpcxai-driver/data.py +tpcxai-driver/data_generation.py +tpcxai-driver/database.py +tpcxai-driver/logger.py +tpcxai-driver/metrics.py +tpcxai-driver/subprocess_util.py +tpcxai-driver/usecase.py +tpcxai_driver.egg-info/PKG-INFO +tpcxai_driver.egg-info/SOURCES.txt +tpcxai_driver.egg-info/dependency_links.txt +tpcxai_driver.egg-info/top_level.txt \ No newline at end of file diff --git a/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/dependency_links.txt b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/dependency_links.txt new file mode 100644 index 00000000000..8b137891791 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/top_level.txt b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/top_level.txt new file mode 100644 index 00000000000..c0f00aa6522 --- /dev/null +++ b/scripts/tpcx-ai/driver/tpcxai_driver.egg-info/top_level.txt @@ -0,0 +1 @@ +tpcxai-driver diff --git a/scripts/tpcx-ai/generate_data.sh b/scripts/tpcx-ai/generate_data.sh new file mode 100755 index 00000000000..4b7ac1bb910 --- /dev/null +++ b/scripts/tpcx-ai/generate_data.sh @@ -0,0 +1,28 @@ +#!/bin/bash + + +# Stop if any command fails +set -e +. setenv_sds.sh + +LOG_DEST="tpcxai_benchmark_run" +TPCxAI_CONFIG_FILE_PATH=${TPCxAI_BENCHMARKRUN_CONFIG_FILE_PATH} + + +if [[ ${TPCx_AI_VERBOSE} == "True" ]]; then + VFLAG="-v" +fi + +echo "TPCx-AI_HOME directory: ${TPCx_AI_HOME_DIR}"; +echo "Using configuration file: ${TPCxAI_CONFIG_FILE_PATH} and scale factor ${TPCxAI_SCALE_FACTOR}..." +echo "Starting data generation..." +sleep 1; + +PATH=$JAVA8_HOME/bin:$PATH +export JAVA8_HOME +export PATH +echo "Using Java at $JAVA8_HOME" +DATA_GEN_FLAG="--data-gen" +./bin/tpcxai.sh --phase {CLEAN,DATA_GENERATION,SCORING_DATAGEN,SCORING_LOADING} -sf ${TPCxAI_SCALE_FACTOR} --streams ${TPCxAI_SERVING_THROUGHPUT_STREAMS} -c ${TPCxAI_CONFIG_FILE_PATH} ${VFLAG} ${DATA_GEN_FLAG} + +echo "Successfully generated data with scale factor ${TPCxAI_SCALE_FACTOR}." \ No newline at end of file diff --git a/scripts/tpcx-ai/setenv_sds.sh b/scripts/tpcx-ai/setenv_sds.sh new file mode 100755 index 00000000000..4a096649aca --- /dev/null +++ b/scripts/tpcx-ai/setenv_sds.sh @@ -0,0 +1,77 @@ +#!/bin/bash + +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. 
+# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2021 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# Find benchmark kit base directory +#export TPCx_AI_HOME_DIR=$(cd `dirname -- $0` && pwd); +cd "$(dirname ${BASH_SOURCE[0]})" +export TPCx_AI_HOME_DIR="$PWD" +cd "$OLDPWD" +# + + +### Common variables for Python and Spark ### + + # Verbosity: Set to True for verbose execution +export TPCx_AI_VERBOSE=False + +# The scale factor is the dataset size in GB that will be generated and used to run the benchmark. +export TPCxAI_SCALE_FACTOR=1 + +# Number of current streams to use in the SERVING_THROUGHPUT test +export TPCxAI_SERVING_THROUGHPUT_STREAMS=2 + +# The absolute path to the configuration file used to run the validation test +export TPCxAI_VALIDATION_CONFIG_FILE_PATH=${TPCx_AI_HOME_DIR}/driver/config/default.yaml + +# The absolute path to the configuration file used for the benchmark run +export TPCxAI_BENCHMARKRUN_CONFIG_FILE_PATH=${TPCx_AI_HOME_DIR}/driver/config/default.yaml + +# Location of the subdirectory containing scripts to collect system configuration information +export TPCxAI_ENV_TOOLS_DIR=${TPCx_AI_HOME_DIR}/tools/python + +# Binary for Parallel SSH used for parallel data gen and getEnvInfo +export TPCxAI_PSSH=pssh +export TPCxAI_PSCP=pscp.pssh + +# Java options for PDGF +export TPCxAI_PDGF_JAVA_OPTS="" + +# Set path to Java 8 for data generation +export JAVA8_HOME= # TODO set the path to Java 8 home directory + +# Set path to Java 11 for benchmark run +export JAVA11_HOME= # TODO set the path to Java 11 home directory + +### Configuration variables for Spark only ### + +# export YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn + +# Location of the Python binary of the virtual environment used to run the DL use cases +# export PYSPARK_PYTHON=/usr/envs/adabench_dl/bin/python +# export PYSPARK_DRIVER_PYTHON=/usr/envs/adabench_dl/bin/python diff --git a/scripts/tpcx-ai/setup_python_sds.sh b/scripts/tpcx-ai/setup_python_sds.sh new file mode 100755 index 00000000000..9c57271673c --- /dev/null +++ b/scripts/tpcx-ai/setup_python_sds.sh @@ -0,0 +1,58 @@ +#!/bin/sh + +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. 
+# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + +# fail fast +# return on any error +set -e + +# create virtual env for python projects +echo "Create virtual env" +conda env create --prefix ./lib/python-venv --file tools/python/python.yaml +conda env create --prefix ./lib/python-venv-ks --file tools/python/python-ks.yaml + +# TODO: Determine and set the correct filename for the systemds distribution +# The filename should match the appropriate version and build of systemds for your environment. +lib/python-venv/bin/pip install systemds-3.3.0.dev0-py3-none-any.whl +lib/python-venv-ks/bin/pip install systemds-3.3.0.dev0-py3-none-any.whl + +#lib/python-venv/bin/pip install -U pip +#lib/python-venv/bin/pip install git+https://github.com/evanhempel/python-flamegraph.git + +# build the workload (pYthon) +echo "Build Workload (Python)" +lib/python-venv/bin/pip install -e workload/python +lib/python-venv-ks/bin/pip install -e workload/python + + + +# build the driver +echo "Build driver" +lib/python-venv/bin/pip install -e driver +lib/python-venv-ks/bin/pip install -e driver + +#conda activate lib/python-venv diff --git a/scripts/tpcx-ai/tpcxai_fdr.py b/scripts/tpcx-ai/tpcxai_fdr.py new file mode 100644 index 00000000000..23b7ae4f902 --- /dev/null +++ b/scripts/tpcx-ai/tpcxai_fdr.py @@ -0,0 +1,673 @@ +#!/usr/bin/env python + +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. 
+# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2021 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +import argparse +import base64 +import io +import json +import os +import sqlite3 +import sys +import time +from datetime import datetime +from pathlib import Path,PurePath + +from datetime import timedelta +import jinja2 +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd + + + +def parse_args(args=None): + if not args: + args = sys.argv + parser = argparse.ArgumentParser() + parser.add_argument('-d', '--database', metavar='FILE', type=str, required=True) + parser.add_argument('-b', '--benchmark', metavar='ID', type=int, required=False) + parser.add_argument('-f', '--file', metavar='OUTPUT FILE', type=str, required=False) + parser.add_argument('-t', '--type', metavar='TYPE OF REPORT', choices=['txt', 'html', 'json'], default='txt') + parser.add_argument('--include-clean', action='store_true', default=False) + + args = parser.parse_args(args[1:]) + if check_args(args) ==False: + return None + return args + +def check_args(args): + db_path = Path(args.database) + try: + db_path.resolve().relative_to(Path().resolve().parent) + if db_path.exists()==False: + print(args.database, 'is not a valid path') + return False + except: + print(args.database, 'is not a valid path') + return False + ## + out_path=Path(args.file) + try: + out_path.resolve().relative_to(Path().resolve().parent) + except: + print(args.file, 'is not a valid path inside the benchmark top level directory') + return False + +def get_benchmark_id(connection: sqlite3.Connection, benchmark_id=None): + """ + Returns given benchmark_id if exists or the id of the last benchmark run + :param connection: Database connection to the tpcxai.db file + :param benchmark_id: ID of the benchmark to check or None + :return: + """ + if benchmark_id: + return benchmark_id + else: + query = 'SELECT benchmark_sk FROM benchmark ORDER BY start_time DESC LIMIT 1' + result = connection.execute(query) + return result.fetchone()[0] + + +def format_time(timestamp: float) -> str: + return datetime.fromtimestamp(timestamp).isoformat(sep=' ', timespec='milliseconds') + + +def rename_stream(stream, phase): + if phase in ['Phase.DATA_GENERATION', 'Phase.SCORING_DATAGEN']: + new_stream = 'DATA_GENERATION' + elif phase in ['Phase.CLEAN']: + new_stream = 'CLEAN' + elif phase in ['Phase.LOADING']: + new_stream = 'LOAD_TEST' + elif phase in 
['Phase.SCORING', 'Phase.VERIFICATION', 'Phase.SCORING_LOADING']: + new_stream = 'SCORING' + else: + new_stream = stream + + return new_stream + + +def make_report(connection: sqlite3.Connection, benchmark_id, include_clean=True): + report = {'benchmark_id': benchmark_id} + + connection.row_factory = sqlite3.Row + cursor = connection.cursor() + meta_query = ''' + SELECT benchmark_name, start_time, end_time, cmd_flags, successful + FROM benchmark + WHERE benchmark_sk = ? + ''' + result = cursor.execute(meta_query, (benchmark_id,)) + benchmark_meta = result.fetchone() + report['benchmark_name'] = benchmark_meta['benchmark_name'] + report['start'] = benchmark_meta['start_time'] + report['end'] = benchmark_meta['end_time'] + if report['start'] and report['end']: + report['duration'] = "{}".format(str(timedelta(seconds=round(benchmark_meta['end_time'] - benchmark_meta['start_time'],3)))) + else: + report['duration'] = -1 + report['cmd_args'] = benchmark_meta['cmd_flags'] + report['successful'] = benchmark_meta['successful'] + # detailed failure object + report['failures'] = [] + # work around for old benchmark runs where not all return_codes are reported + # that means for some command the return_code is NULL which does not indicate failure but success + if not report['successful']: + success_query = 'SELECT total(return_code) AS successful FROM command WHERE benchmark_fk = ?' + result = cursor.execute(success_query, (benchmark_id, )).fetchone() + report['successful'] = True if result['successful'] == 0 else False + + if not bool(report['successful']): + qry = ''' + SELECT command_sk, use_case, phase, sub_phase, command, return_code, group_concat(log, '\n') AS log + FROM command LEFT JOIN log_std_out ON command_sk = use_case_fk + WHERE return_code != 0 AND benchmark_fk = ? + GROUP BY command_sk + ''' + result = cursor.execute(qry, (benchmark_id, )) + for rec in result.fetchall(): + failure = {'use_case': rec['use_case'], 'phase': rec['phase'], 'sub_phase': rec['sub_phase'], + 'command': rec['command'], 'return_code': rec['return_code'], 'log': rec['log']} + report['failures'].append(failure) + + metric_query = 'SELECT metric_name, metric_value ' \ + 'FROM performance_metric ' \ + 'WHERE benchmark_fk = ? ' \ + 'ORDER BY metric_time' + result = cursor.execute(metric_query, (benchmark_id, )) + metrics = [] + for rec in result.fetchall(): + metrics.append(dict(rec)) + report['metric'] = metrics + + phase_runtime_query = ''' + SELECT stream, phase, phase_run, + min(start_time) AS start_time, max(end_time) AS end_time, + sum(runtime) AS runtime, CASE sum(return_code) WHEN 0 THEN 'True' ELSE 'False' END successful + FROM command JOIN stream ON command_sk = use_case_fk + WHERE benchmark_fk = ? AND + (sub_phase = 'SubPhase.WORK' OR (phase = 'Phase.DATA_GENERATION' AND sub_phase = 'SubPhase.NONE')) + GROUP BY stream, phase, phase_run + ORDER BY start_time + ''' + phases_with_runs = cursor.execute( + 'SELECT phase, COUNT(DISTINCT phase_run) AS runs FROM command WHERE benchmark_fk = ? 
GROUP BY phase HAVING runs > 1', + (benchmark_id, ) + ).fetchall() + phases_with_runs = list(map(lambda r: r['phase'], phases_with_runs)) + result = cursor.execute(phase_runtime_query, (benchmark_id,)) + phases = [] + for rec in result.fetchall(): + phase_name = rec['phase'] + phase_name_formatted = f"{rec['phase']}_{rec['phase_run']}" if rec['phase'] in phases_with_runs else rec['phase'] + phase_name_formatted = phase_name_formatted.replace('Phase.', '') + phase_use_case_query = ''' + SELECT command.use_case, command.start_time, command.end_time, command.runtime, + command.command, + quality_metric.metric_name, printf("%.5f",quality_metric.metric_value) as metric_value, stream.stream, + return_code + FROM (command + JOIN stream ON command.command_sk = stream.use_case_fk) + LEFT JOIN quality_metric ON command.command_sk = quality_metric.use_case_fk + WHERE benchmark_fk = ? AND phase = ? AND sub_phase = 'SubPhase.WORK' AND stream = ? AND phase_run = ? + ORDER BY command.start_time + ''' + uc_result = cursor.execute(phase_use_case_query, (benchmark_id, phase_name, rec['stream'], rec['phase_run'])) + use_cases = [] + for uc_rec in uc_result.fetchall(): + start_time = uc_rec['start_time'] + end_time = uc_rec['end_time'] + uc = {'use_case': uc_rec['use_case'], + 'start_time': start_time, 'end_time': end_time, 'runtime': uc_rec['runtime'], 'command': uc_rec['command'], + 'metric_name': uc_rec['metric_name'], 'metric_value': uc_rec['metric_value'], + 'stream': rename_stream(rec['stream'], phase_name), + 'return_code': uc_rec['return_code'], 'successful': True if uc_rec['return_code'] == 0 else False} + use_cases.append(uc) + phase = {'stream': rename_stream(rec['stream'], phase_name), + 'phase': phase_name, 'phase_run': rec['phase_run'], 'phase_name_formatted': phase_name_formatted, + 'runtime': round(rec['runtime'], 3) if rec['runtime'] else rec['runtime'], + 'start_time': rec['start_time'], 'end_time': rec['end_time'], + 'successful': rec['successful'], 'use_cases': use_cases} + # add all phase except for CLEAN, which is only added when include_clean is True + # this means CLEAN is only add if explicitly specified + if not phase['phase'].startswith('Phase.CLEAN') or include_clean: + phases.append(phase) + + report['phases'] = phases + + # per use-case table + use_case_res = cursor.execute("SELECT * FROM command WHERE benchmark_fk = ? 
AND sub_phase = 'SubPhase.WORK'", + (benchmark_id, )) + records = [] + for rec in use_case_res.fetchall(): + records.append(dict(rec)) + + df = pd.DataFrame.from_records(records) + df['phase_name'] = np.where(df['phase'].isin(phases_with_runs), + df['phase'] + '_' + df['phase_run'].astype(str), + df['phase']) + df['phase_name'] = df['phase_name'].str.replace('Phase.', '', regex=False) + df['phase_name'] = df['phase_name'].str.replace('_nan', '').str.replace('.0', '', regex=False) + phase_sorting = df.groupby(['phase_name'])['start_time'].min().sort_values().index.values + df_mean = df.pivot_table(values='runtime', index='use_case', columns='phase_name').reset_index() + df_sum = df.pivot_table(values='runtime', index='use_case', columns='phase_name', aggfunc='sum').reset_index() + df = df_sum + if 'SERVING_THROUGHPUT' in df.columns: + df['SERVING_THROUGHPUT'] = df_mean['SERVING_THROUGHPUT'] + new_index = list(set(df.columns) - set(phase_sorting)) + list(phase_sorting) + df = df.reindex(new_index, axis='columns') + df = df.round({'CLEAN': 3, 'DATA_GENERATION': 3, 'LOADING': 3, 'TRAINING': 3, 'SCORING_DATAGEN': 3, 'SERVING_1': 3, + 'SERVING_2': 3, 'SCORING_LOADING': 3, 'SCORING': 3, 'SERVING_THROUGHPUT': 3}) + df = df.rename(columns={'SERVING_THROUGHPUT': 'SERVING_THROUGHPUT (AVG)'}) + df = df.drop(columns='VERIFICATION') + report['use_cases'] = df.to_dict(orient='records') + + return report + + +def write_report_json(report, path=None): + opened = False + if not path: + file = sys.stdout + elif isinstance(path, (str, Path)): + file = open(path, 'w') + opened = True + else: + file = path + + json.dump(report, file, indent=4) + file.write('\n') + if opened: + file.close() + + +def get_table(phases, phase, phase_name=None, phase_run=None, aggregate_only=False, format_datetime=False): + """ + Create dataframe of the form + Stream Phase Use Case Runtime Successful Comment + POWER_TEST DATA_GENERATION 0.0 192.166866 True complete phase + POWER_TEST DATA_GENERATION 0.0 119.355436 True + POWER_TEST DATA_GENERATION 0.0 72.811429 True + :param phases: + :param phase: + :param phase_name: + :param phase_run: + :param aggregate_only: + :return: + """ + if not phase_name: + phase_name = phase + + if phase_run: + filtered_phases = list(filter(lambda p: p['phase'] == phase and p['phase_run'] == phase_run, phases)) + else: + filtered_phases = list(filter(lambda p: p['phase'] == phase, phases)) + table = pd.DataFrame({'Stream': [], 'Phase': [], 'Use Case': [], 'Start Time': [], 'End Time': [], 'Runtime': [], + 'Successful': [], 'Comment': []}) + table['Use Case'] = table['Use Case'].astype(int) + for contents in filtered_phases: + sts = [] + ps = [] + pr = [] + ucs = [] + ts = [] + ss = [] + cs = [] + st = [] + et = [] + sts.append(contents['stream']) + phase_name_formatted = contents['phase_name_formatted'] + ps.append(phase_name_formatted) + # pr.append(contents['phase_run']) + ucs.append(0) + ts.append(contents['runtime']) + ss.append(contents['successful']) + cs.append('complete phase') + if format_datetime: + st.append(format_time(contents['start_time'])) + et.append(format_time(contents['end_time'])) + else: + st.append(contents['start_time']) + et.append(contents['end_time']) + if not aggregate_only: + for uc in contents['use_cases']: + sts.append(contents['stream']) + ps.append(phase_name_formatted) + pr.append(contents['phase_run']) + ucs.append(uc['use_case']) + ts.append(uc['runtime']) + ss.append(uc['successful']) + if phase == 'Phase.SCORING': + cs.append(f"{uc['metric_name']}: 
{uc['metric_value']}") + elif phase in ['Phase.CLEAN', 'Phase.LOADING']: + cs.append(uc['command']) + else: + cs.append('') + if format_datetime: + st.append(format_time(uc['start_time'])) + et.append(format_time(uc['end_time'])) + else: + st.append(uc['start_time']) + et.append(uc['end_time']) + t = pd.DataFrame({'Stream': sts, 'Phase': ps, 'Use Case': ucs, + 'Start Time': st, 'End Time': et, 'Runtime': ts, + 'Successful': ss, 'Comment': cs}) + table = table.append(t) + return table + + +def output_image(path=None, ax=None, format='png'): + img = None + if path: + if ax: + ax.figure.savefig(path, bbox_inches='tight') + else: + plt.savefig(path, format=format, bbox_inches='tight') + else: + img_bytes = io.BytesIO() + if ax: + ax.figure.savefig(img_bytes, format=format, bbox_inches='tight') + else: + plt.savefig(img_bytes, format=format, bbox_inches='tight') + # img_bytes.seek(0) + img = base64.b64encode(img_bytes.getvalue()).decode("utf-8").replace("\n", "") + + return img + + +def make_graphs(report, output=None): + label_mapping = {1: '1_tpc', + 2: '2_tpc', + 3: '3_tpc', + 4: '4_tpc', + 5: '5_tpc', + 6: '6_tpc', + 7: "7_tpc", + 8: "8_tpc", + 9: "9_tpc", + 10: '10_tpc', + 11: "1_sds", + 12: "2:sds", + 13: "3_sds", + 14: "4_sds", + 15: "5_sds", + 16: "6_sds", + 17: "7_sds", + 18: "8_sds", + 19: "9_sds", + 20: "10_sds"} + + result = {} + color_cycle = plt.rcParams['axes.prop_cycle'].by_key()['color'] + # phases + phases_runtimes = [{ + 'phase': p['phase'].replace('Phase.', ''), + 'phase_name_formatted': p['phase_name_formatted'], + 'runtime': p['runtime'] + } for p in report['phases']] + phases_runtimes = pd.DataFrame(phases_runtimes) + use_cases_df = pd.DataFrame.from_records(report['use_cases']) + + # pie chart + figsize = (32, 8) + figure = plt.figure(1, (32, 8)) + plt.rcParams.update({'font.size': 22}) + plt.pie(phases_runtimes['runtime'], labels=phases_runtimes['phase_name_formatted'], autopct='%1.1f%%') + if output: + img = f"{output}/pie.png" + output_image(path=img) + else: + img = output_image() + result['phases_pie'] = img + + # bar chart + ax = phases_runtimes.plot.bar(x='phase_name_formatted', y='runtime', color=color_cycle, figsize=figsize, rot=0) + ax.set_ylabel('runtime (s)') + ax.legend('') + if output: + img = f"{output}/phases.png" + output_image(path=img) + else: + img = output_image(ax=ax) + result['phases_bar'] = img + + ax = phases_runtimes.plot.bar(x='phase_name_formatted', y='runtime', color=color_cycle, logy=True, figsize=figsize, rot=0) + ax.set_ylabel('runtime (s)') + ax.legend('') + if output: + img = f"{output}/phases_log.png" + output_image(path=img) + else: + img = output_image(ax=ax) + result['phases_bar_log'] = img + + # training + lst = [ucs['use_cases'] for ucs in report['phases'] if ucs['phase'] == 'Phase.TRAINING'] + if len(lst) > 0: + + lst = lst[0] + lst = list(map(lambda uc: {'use_case': uc['use_case'], 'runtime': uc['runtime']}, lst)) + training_runtimes = pd.DataFrame(lst) + + training_runtimes['use_case'] = training_runtimes['use_case'].map(label_mapping) + + # bar chart + ax = training_runtimes.plot.bar(x='use_case', y='runtime', color=color_cycle, figsize=figsize, logy=True, rot=0) + ax.set_ylabel('runtime (s)') + ax.legend('') + if output: + img = f"{output}/training_bar.png" + output_image(path=img) + else: + img = output_image(ax=ax, format='svg') + result['training_bar'] = img + + # serving + # new_order = [0, 4, 1, 5, 2, 6, 3, 7] + serving_times = use_cases_df[use_cases_df['use_case'] != 0] + #serving_times = 
serving_times.iloc[new_order].reset_index(drop=True) + serving_times['use_case'] = serving_times['use_case'].map(label_mapping) + serving_times = serving_times.set_index('use_case') + serving_cols = [col for col in serving_times.columns if col.startswith("SERVING")] + serving_times = serving_times[serving_cols] + + if len(serving_times) > 0: + # bar chart + ax = serving_times.plot.bar(color=color_cycle, figsize=figsize, rot=0) + ax.set_ylabel('runtime (s)') + # ax.legend('') + if output: + img = f"{output}/serving_bar_grouped.png" + output_image(path=img) + else: + img = output_image(ax=ax, format='svg') + result['serving_bar_grouped'] = img + + # serving throughput error bar + serving_throughput_times = [] + for phase in report['phases']: + if phase['phase_name_formatted'] == 'SERVING_THROUGHPUT': + stream = phase['stream'] + for use_case in phase['use_cases']: + rec = { + 'use_case': use_case['use_case'], + 'stream': stream, + 'runtime': use_case['runtime'] + } + serving_throughput_times.append(rec) + + serving_throughput_times = pd.DataFrame.from_records(serving_throughput_times) + #new_order = [0, 4, 1, 5, 2, 6, 3, 7] + + serving_throughput_times['use_case'] = serving_throughput_times['use_case'].map(label_mapping) + serving_throughput_times = pd.pivot_table(serving_throughput_times, + values='runtime', index='stream', columns='use_case') + if len(serving_throughput_times) > 0: + ax = serving_throughput_times.plot.box(figsize=figsize, rot=0) + ax.set_ylabel('runtime (s)') + if output: + img = f"{output}/serving_throughput_error.png" + output_image(path=img) + else: + img = output_image(ax=ax, format='svg') + result['serving_throughput_error'] = img + + # use_cases + use_cases_phases = use_cases_df[use_cases_df['use_case'] != 0] + #use_cases_phases = use_cases_phases.iloc[new_order].reset_index(drop=True) + use_cases_phases = use_cases_phases.set_index('use_case') + use_cases_cols = [col for col in use_cases_phases.columns if col.startswith("TRAINING") or col.startswith("SERVING")] + use_cases_phases = use_cases_phases[use_cases_cols] + + + + use_cases_phases['use case'] = use_cases_phases.index.map(label_mapping) + use_cases_phases.set_index('use case', inplace=True) + if len(use_cases_phases) > 0: + # bar chart + ax = use_cases_phases.plot.barh(color=color_cycle, stacked=True, figsize=figsize, rot=0) + ax.set_xlabel('runtime (s)') + # ax.legend('') + if output: + img = f"{output}/use_cases_stacked.png" + output_image(path=img) + else: + img = output_image(ax=ax, format='svg') + result['use_cases_stacked'] = img + + return result + + +def write_report_html(report, path=None, **kwargs): + opened = False + if not path: + file = sys.stdout + elif isinstance(path, (str, Path)): + file = open(path, 'w') + opened = True + else: + file = path + + script_path = os.path.abspath(__file__) + script_dir = os.path.dirname(script_path) + with open(f'{script_dir}/tpcxai_fdr_template.html') as template_file: + template = jinja2.Template(template_file.read()) + report = update_use_case_labels(report) + html = template.render(report, **kwargs) + file.write(html) + file.write('\n') + if opened: + file.close() + +def update_use_case_labels(report): + label_mapping2 = {1: '1_tpc', + 2: '2_tpc', + 3: '3_tpc', + 4: '4_tpc', + 5: '5_tpc', + 6: '6_tpc', + 7: "7_tpc", + 8: "8_tpc", + 9: "9_tpc", + 10: '10_tpc', + 11: "1_sds", + 12: "2:sds", + 13: "3_sds", + 14: "4_sds", + 15: "5_sds", + 16: "6_sds", + 17: "7_sds", + 18: "8_sds", + 19: "9_sds", + 20: "10_sds"} + + for phase in report['phases']: + for uc in 
phase['use_cases']: + if uc['use_case'] in label_mapping2: + uc['use_case'] = label_mapping2[uc['use_case']] + for uc in report['use_cases']: + if uc['use_case'] in label_mapping2: + uc['use_case'] = label_mapping2[uc['use_case']] + return report + +def write_report_txt(report, path=None, exclude_clean=False): + opened = False + if not path: + file = sys.stdout + elif isinstance(path, (str, Path)): + file = open(path, 'w') + opened = True + else: + file = path + + perf_metric_tbl = pd.DataFrame.from_records(report['metric']) + + tables = pd.DataFrame({'Stream': [], 'Phase': [], 'Use Case': [], 'Start Time': [], 'End Time': [], 'Runtime': [], + 'Successful': [], 'Comment': []}) + tables = tables.astype({'Use Case': int}) + phases = report['phases'] + + clean_tbl = get_table(phases, 'Phase.CLEAN', 'CLEAN', format_datetime=True) + tables = tables.append(clean_tbl) + data_gen_tbl = get_table(phases, 'Phase.DATA_GENERATION', 'DATA_GENERATION', format_datetime=True) + tables = tables.append(data_gen_tbl) + loading_tbl = get_table(phases, 'Phase.LOADING', 'LOADING', format_datetime=True) + tables = tables.append(loading_tbl) + training_tbl = get_table(phases, 'Phase.TRAINING', 'TRAINING', format_datetime=True) + tables = tables.append(training_tbl) + serving_tbl = get_table(phases, 'Phase.SERVING', 'SERVING', format_datetime=True) + tables = tables.append(serving_tbl) + scoring_tbl = get_table(phases, 'Phase.SCORING', 'SCORING', format_datetime=True) + tables = tables.append(scoring_tbl) + serving_throughput_tbl = get_table(phases, 'Phase.SERVING_THROUGHPUT', 'SERVING_THROUGHPUT', format_datetime=True) + tables = tables.append(serving_throughput_tbl) + + uc_table = pd.DataFrame.from_records(report['use_cases']) + + file.write(f"BENCHMARK ID: {report['benchmark_id']}\n") + file.write("\n") + file.write(f"BENCHMARK START: {format_time(report['start'])}\n") + file.write(f"BENCHMARK END: {format_time(report['end'])}\n") + file.write(f"BENCHMARK DURATION: {report['duration']}\n") + file.write(f"BENCHMARK NAME: {report['benchmark_name']}\n") + file.write(f"BENCHMARK METRIC:\n") + file.write(perf_metric_tbl.to_string(index=False, header=True)) + file.write("\n") + file.write(f"CMD ARGS: {report['cmd_args']}\n") + file.write("\n") + file.write(tables.to_string(index=False, header=True)) + file.write("\n\n") + file.write(uc_table.to_string(index=False, header=True, na_rep='')) + file.write("\n\n") + if bool(report['successful']): + file.write("Benchmark run is valid\n") + else: + file.write("Benchmark run is NOT valid\n") + for failure in report['failures']: + file.write("FAILURE: ") + file.write(f"use case {failure['use_case']} failed during phase {failure['phase'], failure['sub_phase']}\n") + file.write(f"command was: {failure['command']}\n") + file.write(f"command returned: {failure['return_code']}\n") + file.write("\n") + file.write(failure['log']) + file.write("\n") + + file.write("\n") + + if opened: + file.close() + + +def main(): + args = parse_args() + if args is None: + return 1 + + file = sys.stdout if not args.file else args.file + include_clean = args.include_clean + + connection = sqlite3.connect(args.database) + benchmark_id = get_benchmark_id(connection, args.benchmark) + + report = make_report(connection, benchmark_id, include_clean) + + if args.type.lower() == 'json': + write_report_json(report, file) + elif args.type.lower() == 'html': + graphs = make_graphs(report, None) + write_report_html(report, file, **graphs) + elif args.type.lower() == 'txt': + write_report_txt(report, file) + 
else: + print(f"report type {args.type} is invalid", sys.stderr) + exit(1) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/tpcxai_fdr_template.html b/scripts/tpcx-ai/tpcxai_fdr_template.html new file mode 100644 index 00000000000..1eec28f071d --- /dev/null +++ b/scripts/tpcx-ai/tpcxai_fdr_template.html @@ -0,0 +1,337 @@ + + + + + + + + + TPCx-AI Benchmark Run Report + + + + +
+
+

TPCx-AI Benchmark Run Report

+
+ +
+

Benchmark Meta Data

+
    +
  • benchmark ID: {{benchmark_id}}
  • +
  • benchmark start: {{start}}
  • +
  • benchmark end: {{end}}
  • +
  • benchmark duration: {{duration}}
  • +
  • benchmark name: {{benchmark_name}}
  • +
  • cmd args: {{cmd_args}}
  • +
+
+ +
+

TPCx-AI Metric

+

Definitions

+
    +
  • SF: user-defined scale factor, which defines the size of the data set in GB.
  • +
  • N: number of use cases (i.e., 10 in the current version).
  • +
  • S: user-defined number of concurrent streams to run in the Serving Throughput Test.
  • +
  • TLD: loading factor, which is the overall time it takes to ingest the datasets into the data store used for training and serving.
  • +
  • TPTT: Power Training Test factor, defined as the geometric mean of the training times tt_i of all use cases, i.e. (tt_1 * tt_2 * ... * tt_N)^(1/N).
  • +
  • TPST: Power Serving Test factor, defined as the geometric mean of the serving times ts_i of all use cases, i.e. (ts_1 * ts_2 * ... * ts_N)^(1/N); the higher result of the two serving runs is taken.
  • +
  • TTT: Serving Throughput Test factor, defined as the total time spent running the Serving Throughput Test divided by both the number of use cases N and the number of streams S.
  • +
+ +

Performance Metric

+

The performance metric AIUCpm@SF computed from these values is defined as:

+
+
 AIUCpm@SF = SF * N * 60 / (TLD * TPTT * TPST * TTT)^(1/4)
+
+
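For illustration, the following minimal Python sketch (not part of the benchmark kit) evaluates this formula for hypothetical factor values; all numbers are made up and only serve to show how the fourth root combines the four timing factors.

```python
# Hypothetical inputs (illustrative only); in a real run these come from the report above.
SF = 1          # scale factor in GB
N = 10          # number of use cases
TLD = 12.5      # load test time in seconds
TPTT = 340.2    # geometric mean of the training times
TPST = 28.7     # geometric mean of the serving times (higher of the two serving runs)
TTT = 45.1      # throughput test time divided by N and by the number of streams S

# AIUCpm@SF = SF * N * 60 / fourth root of (TLD * TPTT * TPST * TTT)
aiucpm = SF * N * 60 / (TLD * TPTT * TPST * TTT) ** 0.25
print(f"AIUCpm@SF = {aiucpm:.3f}")
```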

Values measured in the current benchmark run:

+ + + + + + + + {% for m in metric %} + + + + + + + {% endfor %} +
Metric NameValue in s
{{m.metric_name}}{{m.metric_value}}
+
+ + +
+

Evaluation

+

Summary of Evaluation Metrics

+
    +
  • Word Error Rate (WER): Indicates transcription accuracy, how many words are correctly transcribed; lower is better, 0 is perfect, and 1+ indicates complete inaccuracy.
  • +
  • Mean Squared Log Error (MSLE): Used in regression, measures error scale; MSLE ≤ 1.0 suggests predictions and true values are of similar scale.
  • +
  • F1 Score: Combines precision and recall; higher is better with 1 as perfect.
  • +
  • Matthews Correlation Coefficient (MCC): Measures correlation between predictions and true values; ranges from -1 (perfect anti-correlation) to 1 (perfect correlation), with 0 indicating no correlation.
  • +
  • Median Absolute Error (MAE): Measures median of absolute deviations from true values; 0 is best, higher values indicate larger average errors.
  • +
  • Accuracy Score: Measures correct classifications; ranges from 0 (worst) to 1 (perfect), sensitive to class imbalance.
  • +
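As a point of reference only, the quality metrics listed above correspond to standard functions in scikit-learn, which already appears in the use case implementations; the snippet below is a small illustration with made-up values, not code from the benchmark kit. WER is omitted here because it is typically computed with a dedicated speech-processing library.

```python
# Illustrative only: tiny made-up label/prediction vectors.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             mean_squared_log_error, median_absolute_error)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("F1:      ", f1_score(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))

# Regression-style metrics use continuous values (MSLE additionally requires non-negative values).
y_true_reg = [2.5, 0.5, 2.0, 8.0]
y_pred_reg = [3.0, 0.4, 2.1, 7.0]
print("MSLE: ", mean_squared_log_error(y_true_reg, y_pred_reg))
print("MedAE:", median_absolute_error(y_true_reg, y_pred_reg))
```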
+ {% for phase in phases%} + {% if phase.phase_name_formatted == "VERIFICATION" %} +
+

{{phase.phase_name_formatted}}

+ + + + + + + + + + + + + + + {% for uc in phase.use_cases %} + + + + + + + + {% if phase.phase == 'Phase.SCORING' %} + + {% elif phase.phase in ['Phase.VERIFICATION', 'Phase.CLEAN', 'Phase.LOADING'] %} + + {% else %} + + {% endif %} + + {% endfor %} + + + + + + + + + + + + +
StreamUse CaseStart TimeEnd TimeRuntimeSuccessfulComment
{{uc.stream}}{{uc.use_case}}{{uc.start_time}}{{uc.end_time}}{{uc.runtime}}{{uc.successful}}{{uc.metric_name}}: {{uc.metric_value}}{{uc.command}}
Total{{phase.start_time}}{{phase.end_time}}{{phase.runtime}}{{phase.successful}}complete phase
+
+ {% endif %} + {% endfor %} + +
+ +

Use Cases

+

The following table shows the elapsed time in seconds of each executed phase for every use case.

+

Stages for Each Use Case

+
    +
  • Data Generation: All data sets including training, serving, and scoring data sets are generated. This stage is not timed.
  • +
  • Load Test: Data is loaded into the system where it will be used. This can involve simple copying or more complex ingestion processes including data transformations, format changes, compression, and encoding.
  • +
  • Power Training Test: Runs the training pipelines of the ten use cases sequentially, including all preprocessing, model training, and postprocessing tasks.
  • +
  • Power Serving Test I and II: Runs the serving pipeline of all use cases sequentially using the same model for both tests. For the TPST value used in AIUCpm@SF, the higher of the two values will be used.
  • +
  • Scoring Test: Predictions of each use case serving pipeline are measured against predefined thresholds. This test is part of the model validation stage and measures the quality of output, though the measured time is not used for the primary metric.
  • +
  • Serving Throughput Test: Concurrent streams of the ten serving pipelines are defined and run. Each stream consists of the sequential execution of a permutation of the serving pipelines of the ten use cases.
  • +
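The Serving Throughput Test streams described above can be pictured with the following toy sketch; the actual permutation scheme and stream scheduling are defined by the TPCx-AI driver and specification, so this is only an illustration of the idea.

```python
import random

# Illustrative only: each of S streams runs the ten serving pipelines in its own order.
use_cases = list(range(1, 11))
streams = 2  # e.g. TPCxAI_SERVING_THROUGHPUT_STREAMS=2

for s in range(streams):
    order = random.Random(s).sample(use_cases, k=len(use_cases))  # deterministic toy permutation
    print(f"stream {s + 1}: serving pipelines in order {order}")
```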
+ + + + + {% for key in use_cases[0].keys() %} + + {% endfor %} + + + + {% for uc in use_cases %} + + {% for value in uc.values() %} + + {% endfor %} + + {% endfor %} + +
{{key}}
{{value}}
+
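The per-use-case table above is produced by the report generator (`tpcxai_fdr.py`) as a pivot of the command log stored in the tpcxai.db SQLite results database, with one row per use case and one column per phase. The following standalone pandas sketch with made-up rows shows the same idea; it is not the report generator's exact code.

```python
import pandas as pd

# Made-up command records; the real report reads them from the tpcxai.db SQLite file.
records = pd.DataFrame({
    'use_case': [1, 1, 2, 2],
    'phase':    ['TRAINING', 'SERVING_1', 'TRAINING', 'SERVING_1'],
    'runtime':  [120.4, 8.2, 310.9, 12.6],
})

# One row per use case, one column per phase, summing runtimes of repeated commands.
table = records.pivot_table(values='runtime', index='use_case',
                            columns='phase', aggfunc='sum').reset_index()
print(table.round(3))
```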
+ +
+

Graphs

+

Phases

+
+

X-axis: Runtime per use case in seconds

+

Y-axis: use case

+
+
+ +
+ +

Training

+
+

X-axis: use case

+

Y-axis: Runtime per use case in seconds

+
+ +
+ +
+

Serving

+
+

X-axis: use case

+

Y-axis: Runtime per use case in seconds

+
+
+ +
+

Serving Throughput

+
+

X-axis: use case

+

Y-axis: Runtime per use case in seconds

+
+
+ +
+ + + + diff --git a/scripts/tpcx-ai/use_cases/UseCase01.py b/scripts/tpcx-ai/use_cases/UseCase01.py new file mode 100644 index 00000000000..0a5c6c11ca9 --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase01.py @@ -0,0 +1,172 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# +import argparse +import os +import timeit + +# numerical computing +from pathlib import Path + +# data frames +import joblib +import numpy as np +import pandas as pd +from sklearn.pipeline import Pipeline +from sklearn.preprocessing import MinMaxScaler + +from systemds.operator.algorithm import kmeans, kmeansPredict +from systemds.context import SystemDSContext + +def load_data(order_path: str, lineitem_path: str, order_returns_path: str) -> pd.DataFrame: + order_data = pd.read_csv(order_path, parse_dates=['date']) + lineitem_data = pd.read_csv(lineitem_path) + order_returns_data = pd.read_csv(order_returns_path) + returns_data = lineitem_data.merge(order_returns_data, + how='left', + left_on=['li_order_id', 'li_product_id'], + right_on=['or_order_id', 'or_product_id']) + raw_data = returns_data.merge(order_data, left_on=['li_order_id'], right_on=['o_order_id']) + raw_data = raw_data.fillna(0.0) + return raw_data[['o_order_id', 'o_customer_sk', 'date', 'li_product_id', 'price', 'quantity', 'or_return_quantity']] + + +def pre_process(data: pd.DataFrame): + data['invoice_year'] = data['date'].dt.year + data['row_price'] = data['quantity'] * data['price'] + data['return_row_price'] = data['or_return_quantity'] * data['price'] + + groups = data.groupby(['o_customer_sk', 'o_order_id']).agg({ + 'row_price': np.sum, + 'return_row_price': np.sum, + 'invoice_year': np.min + }).reset_index() + + groups['ratio'] = groups['return_row_price'] / groups['row_price'] + + ratio = groups.groupby(['o_customer_sk']).agg( + # mean order ratio + return_ratio=('ratio', np.mean) + ) + + frequency_groups = groups.groupby(['o_customer_sk', 'invoice_year'])['o_order_id'].nunique().reset_index() + frequency = 
frequency_groups.groupby(['o_customer_sk']).agg(frequency=('o_order_id', np.mean)) + + return pd.merge(frequency, ratio, left_index=True, right_index=True) + + +def train(featurevector: pd.DataFrame, num_clusters) -> Pipeline: + mms = MinMaxScaler() + sds = SystemDSContext() + scaled_features = mms.fit_transform(featurevector[['return_ratio', 'frequency']]) + feature_vector_sds = sds.from_numpy(scaled_features) + [centroids, _] = kmeans(feature_vector_sds, k=num_clusters, max_iter=300, runs=10, seed=-1).compute() + sds.close() + return centroids + + +def serve(centroids, data: pd.DataFrame): + mms = MinMaxScaler() + sds = SystemDSContext() + C = sds.from_numpy(centroids) + data_sds_scaled = mms.fit_transform(data[['return_ratio', 'frequency']]) + X = sds.from_numpy(data_sds_scaled) + prediction_sds = kmeansPredict(X, C).compute() + prediction_sds = np.squeeze(prediction_sds).astype(np.int32) + data['c_cluster_id'] = prediction_sds + sds.close() + return data.reset_index()[['o_customer_sk', 'c_cluster_id']] + + +def main(): + print("main") + model_file_name = "uc01.python.model" + + parser = argparse.ArgumentParser() + parser.add_argument('--num_clusters', metavar='N', type=int, default=4) + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving', 'scoring'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("order") + parser.add_argument('lineitem') + parser.add_argument('order_returns') + + args = parser.parse_args() + order_path = args.order + lineitem_path = args.lineitem + order_returns_path = args.order_returns + num_clusters = args.num_clusters + stage = args.stage + work_dir = Path(args.workdir) + if args.output: + output = Path(args.output) + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + start = timeit.default_timer() + raw_data = load_data(order_path, lineitem_path, order_returns_path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + start = timeit.default_timer() + preprocessed_data = pre_process(raw_data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + centroids = train(preprocessed_data, num_clusters) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + joblib.dump(centroids, work_dir / model_file_name) + + if stage == 'serving': + centroids = joblib.load(work_dir / model_file_name) + start = timeit.default_timer() + prediction = serve(centroids, preprocessed_data) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + out_data = prediction + out_data.to_csv(output / 'predictions.csv', index=False) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase02.py b/scripts/tpcx-ai/use_cases/UseCase02.py new file mode 100644 index 00000000000..fdadb4f0620 --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase02.py @@ -0,0 +1,378 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. 
+# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2021 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +import argparse +import io +import os +import re +import string +import subprocess # nosec - The subprocess module is needed to call an external tool (sox) +import sys +import timeit +from pathlib import Path +from typing import Dict + +import librosa +import numpy as np +import pandas as pd +import scipy.io.wavfile as wav +import tensorflow as tf +from tensorflow.keras.backend import ctc_batch_cost, ctc_decode, expand_dims, squeeze +from tensorflow.keras.layers import Input, Masking, TimeDistributed, Dense, ReLU, Dropout, Bidirectional, LSTM, \ + Lambda, ZeroPadding2D, Conv2D +from tensorflow.keras.models import Model, load_model +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.sequence import pad_sequences +from tqdm import tqdm + +# GLOBALE DEFINITIONS +INPUT_SEPARATOR = '|' +OUTPUT_SEPARATOR = '|' +ALPHABET = string.ascii_lowercase + '\' ' +ALPHABET_DICT = {k: v for v, k in enumerate(ALPHABET)} + +PAUSE_IN_MS = 20 +SAMPLE_RATE = 16000 +WINDOW_SIZE = 32 +WINDOW_STRIDE = 20 +N_MFCC = 26 +N_HIDDEN = 1024 +DROPOUT_RATE = 0.005 +CONTEXT = 9 +MAX_RELU = 20 + +# DEFAULTS +BATCH_SIZE_DEFAULT = 32 +EPOCHS_DEFAULT = 5 + + +def resample_audio(audio, desired_sample_rate): + if audio[0] == desired_sample_rate: + return audio + cmd = f"sox - --type raw --bits 16 --channels 1 --rate {desired_sample_rate} --encoding signed-integer --endian little --compression 0.0 --no-dither - " + f = io.BytesIO() + wav.write(f, audio[0], audio[1]) + result = subprocess.run(cmd.split(), input=f.read(), stdout=subprocess.PIPE) + if result.returncode == 0: + return desired_sample_rate, np.frombuffer(result.stdout, dtype=np.int16) + else: + print(result.stdout, file=sys.stdout) + print(result.stderr, file=sys.stderr) + return 0, "" + + +def add_silence(audio, duration=20): + audio_sig = audio[1] + sample_rate = audio[0] + num_samples = int(duration / 1000 * sample_rate) + five_ms_silence = np.zeros(num_samples, dtype=audio_sig.dtype) + audio_with_silence = np.concatenate((five_ms_silence, audio_sig)) + return sample_rate, audio_with_silence + + +def 
decode_sequence(sequence, alphabet: Dict): + def lookup(k): + try: + return alphabet[k] + except KeyError: + return '' + + decoded_sequence = list(map(lookup, sequence)) + result = ''.join(decoded_sequence) + if len(result) == 0: + return ' ' + else: + return result + + +def load_data(path) -> pd.DataFrame: + data = pd.read_csv(path, sep=INPUT_SEPARATOR, dtype={'transcript': str}) + basedir = Path(os.path.dirname(os.path.realpath(path))) + + def make_abs_path(f): + if os.path.exists(f): + return f + elif os.path.exists(basedir / f): + return basedir / f + else: + raise FileNotFoundError(f"Neither {f} nor {basedir / f} are exist") + + data['filepath'] = data['wav_filename'].apply(make_abs_path) + data['audio'] = data['filepath'].apply(wav.read) + return data + + +def clean_data(data: pd.DataFrame) -> pd.DataFrame: + if 'transcript' not in data.columns: + return data + # remove samples with no transcript + pattern = re.compile(f"[^{ALPHABET}]") + + def normalize_transcript(trans): + try: + return pattern.sub('', trans.lower()) + except AttributeError: + return pattern.sub('', str(trans).lower()) + + # print('normalizing transcripts', file=sys.stderr) + tqdm.pandas(desc='normalizing transcripts') + data['transcript_norm'] = data['transcript'].progress_apply(normalize_transcript) + data = data[~data['transcript_norm'].isnull()] + data = data[~data['transcript_norm'].str.isspace()] + return data + + +def preprocess_data(data: pd.DataFrame): + # resampling + tqdm.pandas(desc='resampling') + data['audio'] = data['audio'].progress_apply(resample_audio, desired_sample_rate=SAMPLE_RATE) + # adding silence + # tqdm.pandas(desc='adding silence') + # data['audio'] = data['audio'].progress_apply(add_silence, duration=PAUSE_IN_MS) + + # LABELS + if 'transcript' in data.columns: + def text_to_seq(text, max_length=None): + seq = [] + for c in text: + seq.append(ALPHABET_DICT[c]) + seq = np.asarray(seq) + if max_length: + if len(seq) >= max_length: + # truncate if necessary + return seq[:max_length] + else: + # fill with zeros in the end + zeros = np.zeros(max_length) + zeros[:seq.shape[0]] = seq + return zeros + else: + return seq + + data['transcript_seq'] = data['transcript_norm'].progress_apply(text_to_seq, max_length=None) + data['labels_len'] = data['transcript_seq'].apply(len) + data = data[(data['labels_len'] <= 100) & (data['labels_len'] > 0)] + + # FEATURES + # calculate the mel spectograms windows_size 32 * 16, window_stride 20 * 16 + def to_melspectograms(audio, win_size=WINDOW_SIZE, win_stride=WINDOW_STRIDE): + # convert to float32 (-1.0 to 1.0) + min_value = np.iinfo(audio[1].dtype).min + max_value = np.iinfo(audio[1].dtype).max + factor = 1 / np.max(np.abs([min_value, max_value])) + y = audio[1] * factor + sr = audio[0] + return librosa.feature.melspectrogram(y, sr, n_fft=win_size*sr//1000, hop_length=win_stride*sr//1000) + + # calculate 20 mfcc on mel spectograms + def to_mfcc(spectograms, max_timesteps=None): + mfccs = librosa.feature.mfcc(S=spectograms, sr=SAMPLE_RATE, n_mfcc=N_MFCC).transpose() + if max_timesteps: + return np.zeros((max_timesteps, N_MFCC), mfccs.dtype) + else: + return mfccs + + tqdm.pandas(desc='calculating mel spectograms') + data['features'] = data['audio'].progress_apply(to_melspectograms) + # get the mfcc's (20 coeefficients) for the mel spectograms + tqdm.pandas(desc='calculating mfcc') + data['features'] = data['features'].progress_apply(to_mfcc, max_timesteps=None) + tqdm.pandas(desc='creating sequences from transcripts') + + tqdm.pandas(desc='padding transcripts 
and features') + data['features_len'] = data['features'].apply(len) + print(f"pre-processed data contains {len(data)} rows") + if 'transcript' in data.columns: + data = data[data['features_len'] > data['labels_len']] + x = pad_sequences(data['features'].values, dtype=np.float32, padding='post') + y = pad_sequences(data['transcript_seq'].values, padding='post') + return data['wav_filename'], x, y, data['features_len'].values, data['labels_len'].values + else: + x = pad_sequences(data['features'].values, dtype=np.float32, padding='post') + return data['wav_filename'], x, None, data['features_len'].values, None + + +def make_model_func(with_convolution=True): + x = Input((None, N_MFCC), name="X") + y_true = Input((None,), name="y") + seq_lengths = Input((1,), name="sequence_lengths") + time_steps = Input((1,), name="time_steps") + + masking = Masking(mask_value=0)(x) + + if with_convolution: + conv_layer = Lambda(lambda val: expand_dims(val, axis=-1))(masking) + conv_layer = ZeroPadding2D(padding=(CONTEXT, 0))(conv_layer) + conv_layer = Conv2D(filters=N_HIDDEN, kernel_size=(2 * CONTEXT + 1, N_MFCC))(conv_layer) + conv_layer = Lambda(squeeze, arguments=dict(axis=2))(conv_layer) + conv_layer = ReLU(max_value=20)(conv_layer) + conv_layer = Dropout(DROPOUT_RATE)(conv_layer) + + layer_1 = TimeDistributed(Dense(N_HIDDEN))(conv_layer) + else: + layer_1 = TimeDistributed(Dense(N_HIDDEN))(masking) + + layer_1 = ReLU(max_value=MAX_RELU)(layer_1) + layer_1 = Dropout(DROPOUT_RATE)(layer_1) + + layer_2 = TimeDistributed(Dense(N_HIDDEN))(layer_1) + layer_2 = ReLU(max_value=MAX_RELU)(layer_2) + layer_2 = Dropout(DROPOUT_RATE)(layer_2) + + lstm = Bidirectional(LSTM(N_HIDDEN, return_sequences=True), merge_mode='sum')(layer_2) + softmax = TimeDistributed(Dense(len(ALPHABET) + 1, activation='softmax'), name='prediction_softmax')(lstm) + + def myloss_layer(args): + y_true, y_pred, time_steps, label_lengths = args + return ctc_batch_cost(y_true, y_pred, time_steps, label_lengths) + + ctc_loss_layer = Lambda(myloss_layer, output_shape=(1,), name='ctc')([y_true, softmax, time_steps, seq_lengths]) + + model = Model(inputs=[x, y_true, time_steps, seq_lengths], outputs=ctc_loss_layer) + + return model + + +def ctc_dummy(y_true, y_pred): + mean = tf.reduce_mean(y_pred) + return mean + + +def train(x, y, features_len, labels_len, batch_size, epochs, learning_rate=None): + model = make_model_func(with_convolution=False) + model.summary(line_length=200) + # train / validation split (80/20) + num_samples = len(x) + train_idx = int(num_samples * 0.8) + # training data + x_train = x[:train_idx] + f_len_train = features_len[:train_idx] + y_train = y[:train_idx] + y_len_train = labels_len[:train_idx] + training_data = [x_train, y_train, f_len_train, y_len_train] + + # validation data + x_val = x[train_idx:] + f_len_val = features_len[train_idx:] + y_val = y[train_idx:] + y_len_val = labels_len[train_idx:] + validation_data = [x_val, y_val, f_len_val, y_len_val] + + lr = learning_rate if learning_rate else 0.001 + optimizer = Adam(learning_rate=lr, beta_1=0.9, beta_2=0.999, epsilon=1e-8) + + model.compile(optimizer=optimizer, loss=ctc_dummy) + hist = model.fit(training_data, y_train, validation_data=(validation_data, y_val), + batch_size=batch_size, epochs=epochs) + return hist + + +def serve(model, x, features_len, batch_size): + input_layer = model.get_layer(name='X') + output_layer = model.get_layer(name='prediction_softmax') + predict_model = Model([input_layer.input], output_layer.output) + pred = predict_model.predict(x, 
batch_size=batch_size) + result = ctc_decode(pred, features_len, greedy=False) + inv_alphabet = {v: k for k, v in ALPHABET_DICT.items()} + transcripts = result[0][0].numpy().tolist() + transcripts = map(lambda s: decode_sequence(s, inv_alphabet), transcripts) + transcripts = pd.DataFrame(data=transcripts, columns=['transcript']) + return transcripts + + +def main(): + tqdm.pandas() + + model_file_name = "uc02.python.model" + + parser = argparse.ArgumentParser() + parser.add_argument('--batch', metavar='SIZE', type=int, default=BATCH_SIZE_DEFAULT) + parser.add_argument('--epochs', metavar='N', type=int, default=EPOCHS_DEFAULT) + parser.add_argument('--learning_rate', '-lr', required=False, type=float) + + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("filename") + + args = parser.parse_args() + batch_size = args.batch + epochs = args.epochs + learning_rate = args.learning_rate if args.learning_rate else None + + path = args.filename + stage = args.stage + work_dir = Path(args.workdir) + if args.output: + output = Path(args.output) + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + start = timeit.default_timer() + raw_data = load_data(path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + start = timeit.default_timer() + cleaned_data = clean_data(raw_data) + wav_filenames, x, y, features_len, labels_len = preprocess_data(cleaned_data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + hist = train(x, y, features_len, labels_len, batch_size, epochs, learning_rate) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + model = hist.model + + model.save(work_dir / model_file_name, save_format='h5') + + if stage == 'serving': + model = load_model(work_dir / model_file_name, custom_objects={'ctc_dummy': ctc_dummy}) + start = timeit.default_timer() + prediction = serve(model, x, features_len, batch_size) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + prediction['wav_filename'] = wav_filenames.reset_index(drop=True) + prediction.to_csv(output / 'predictions.csv', sep=INPUT_SEPARATOR, header=True, index=False) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase03.py b/scripts/tpcx-ai/use_cases/UseCase03.py new file mode 100644 index 00000000000..5674843aba4 --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase03.py @@ -0,0 +1,280 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. 
+# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# +import argparse +import datetime +import os +import timeit +import warnings +from pathlib import Path + +import numpy as np +import pandas as pd +from statsmodels.tools.sm_exceptions import ValueWarning +from statsmodels.tsa.holtwinters import ExponentialSmoothing +import joblib +from tqdm import tqdm + + +class UseCase03Model(object): + + def __init__(self, use_store=False, use_department=True): + if not use_store and not use_department: + raise ValueError(f"use_store = {use_store}, use_department = {use_department}: at least one must be True") + + self._use_store = use_store + self._use_department = use_department + self._models = {} + self._min = {} + self._max = {} + + def _get_key(self, store, department): + if self._use_store and self._use_department: + key = (store, department) + elif self._use_store: + key = store + else: + key = department + + return key + + def store_model(self, store: int, department: int, model, ts_min, ts_max): + key = self._get_key(store, department) + self._models[key] = model + self._min[key] = ts_min + self._max[key] = ts_max + + def get_model(self, store: int, department: int): + key = self._get_key(store, department) + model = self._models[key] + ts_min = self._min[key] + ts_max = self._max[key] + return model, ts_min, ts_max + + +def load(order_path: str, lineitem_path: str, product_path: str) -> pd.DataFrame: + order_data = pd.read_csv(order_path, parse_dates=['date']) + lineitem_data = pd.read_csv(lineitem_path) + product_data = pd.read_csv(product_path) + data = order_data.merge(lineitem_data, left_on='o_order_id', right_on='li_order_id') + data = data.merge(product_data, left_on='li_product_id', right_on='p_product_id') + + return data[['store', 'department', 'li_order_id', 'date', 'price', 'quantity']] + + +def pre_process(data: pd.DataFrame) -> pd.DataFrame: + data['year'] = data['date'].dt.year + data['week'] = data['date'].dt.week + data['month'] = data['date'].dt.month + # reset year in cases where one the last days of the week is in the new year + # e.g. 
the last day of week 52 of 2011 is actual 2012-01-01 + # to get the proper year the year has to be reduced by 1 to 2011 + # in general whenever the month of the date is in january but the week is above 50 the year needs to be reduced + data['year'] = np.where((data['week'] > 50) & (data['month'] == 1), data['year'] - 1, data['year']) + data['row_price'] = data['quantity'] * data['price'] + + grouped = data.groupby(['store', 'department', 'year', 'week'])['row_price'].sum().reset_index() + + def make_date(year, week, weekday): + date_str = "{}-W{:02d}-{}".format(year, week, weekday) + date = datetime.datetime.strptime(date_str, '%G-W%V-%u') + return date + + grouped['date'] = grouped[['week', 'year']].apply(lambda r: make_date(r['year'], r['week'], 5), axis=1) + grouped = grouped.rename(index=str, columns={'store': 'Store', 'department': 'Dept', 'date': 'Date', 'row_price': 'Weekly_Sales'}) + print("pp done") + return grouped[['Store', 'Dept', 'Date', 'Weekly_Sales']] + +def load_data_in_chunks(order_path: str, lineitem_path: str, product_path: str, + chunksize: int = 10000000) -> pd.DataFrame: + all_chunks = [] + product_data = pd.read_csv(product_path) + + # Initialize iterators for lineitem and order data + lineitem_iter = pd.read_csv(lineitem_path, chunksize=chunksize) + order_data = pd.read_csv(order_path, parse_dates=['date']) + + start_order_idx = 0 # To keep track of the starting index for orders + + for lineitem_chunk in lineitem_iter: + # Determine the range of li_order_id in the current lineitem_chunk + min_li_order_id = lineitem_chunk['li_order_id'].min() + max_li_order_id = lineitem_chunk['li_order_id'].max() + + # Find corresponding orders in the order data + order_chunk = order_data[ + (order_data['o_order_id'] >= min_li_order_id) & (order_data['o_order_id'] <= max_li_order_id)] + + # Merge the chunks + data = lineitem_chunk.merge(order_chunk, left_on='li_order_id', right_on='o_order_id') + data = data.merge(product_data, left_on='li_product_id', right_on='p_product_id') + + # Select the required columns + processed_chunk = data[['store', 'department', 'li_order_id', 'date', 'price', 'quantity']] + + # Append the processed chunk to the list + all_chunks.append(processed_chunk.reset_index(drop=True)) + + final_data = pd.concat(all_chunks, ignore_index=True) + print("loaddone") + return final_data + + +def train(data: pd.DataFrame) -> UseCase03Model: + """ + Trains an ARIMA or SARIMAX model for each department at each store. The best model for each department is found + using the `auto_arima` method from `pmdarima`. 
+ :param data: Pandas DataFrame with columns=[Store, Dept, Date, Weekly_Sales] + :return: A UseCase03Models + """ + models = UseCase03Model(use_store=True, use_department=True) + combinations = np.unique(data[['Store', 'Dept']].apply(lambda r: (r[0], r[1]), axis=1)) + for c in tqdm(combinations, desc='Training'): + store = c[0] + dept = c[1] + ts_data = data[(data.Store == store) & (data.Dept == dept)] + ts_min = ts_data.Date.min() + ts_max = ts_data.Date.max() + ts_data = ts_data.set_index('Date')['Weekly_Sales'].sort_index() + print(c) + print(ts_data.shape) + with warnings.catch_warnings(): + warnings.simplefilter('ignore') + # add freq='W-Fri' would fail + model = ExponentialSmoothing(ts_data, seasonal='add', seasonal_periods=52).fit() + print(f"{store},{dept},{ts_min},{ts_max}") + models.store_model(store, dept, model, ts_min, ts_max) + + return models + + +def serve(model: UseCase03Model, data: pd.DataFrame) -> pd.DataFrame: + """ + Create forecasts using the given model for the given deprartments and stores. + :param model: The trained models for each department and each store + :param data: Pandas DataFrame containing the stores, departments, and the desired number of periods for the forecast + :return: The forecasts for each department and each store of the desired length + """ + # compute forecast for all store/department combinations in the data set + forecasts = pd.DataFrame(columns=['store', 'department', 'date', 'weekly_sales']) + # combinations = np.unique(data[['Store', 'Dept']].values, axis=0) + for index, row in tqdm(data.iterrows(), desc='Forecasting', total=len(data)): + store = row.store + dept = row.department + periods = int(row.periods) + try: + current_model, ts_min, ts_max = model.get_model(store, dept) + except KeyError: + continue + # disable warnings that non-date index is returned from forecast + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", category=ValueWarning) + forecast = current_model.forecast(periods) + forecast = np.clip(forecast, a_min=0.0, a_max=None) # replace negative forecasts + start = pd.date_range(ts_max, periods=2)[1] + forecast_idx = pd.date_range(start, periods=periods, freq='W-FRI') + df = pd.DataFrame({'store': store, 'department': dept, 'date': forecast_idx, 'weekly_sales': forecast}) + forecasts = forecasts.append(df) + + return forecasts + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving', 'scoring'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("path", nargs='+') + + # configuration parameters + args = parser.parse_args() + order_path = args.path[0] + + stage = args.stage + work_dir = Path(args.workdir) + if args.output: + output = Path(args.output) + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + # derivative configuration parameters + model_file = work_dir / 'uc03.python.model' + + if stage == 'training': + lineitem_path = args.path[1] + product_path = args.path[2] + start = timeit.default_timer() + data = load_data_in_chunks(order_path, lineitem_path, product_path) + print(data.head()) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + start = timeit.default_timer() + data = 
pre_process(data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + start = timeit.default_timer() + models = train(data) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + joblib.dump(models, model_file) + + elif stage == 'serving': + start = timeit.default_timer() + data = pd.read_csv(order_path) + print(data.head()) + print("s: " + order_path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + models = joblib.load(model_file) + + start = timeit.default_timer() + forecasts = serve(models, data) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + forecasts.to_csv(output / 'predictions.csv', index=False) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase04.py b/scripts/tpcx-ai/use_cases/UseCase04.py new file mode 100644 index 00000000000..bab4191f773 --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase04.py @@ -0,0 +1,133 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. 
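+# --- Editorial note (not part of the original TPC kit) ---------------------
+# UseCase04 classifies review text as spam / not-spam with a bag-of-words
+# pipeline (CountVectorizer -> TfidfTransformer -> MultinomialNB), see
+# train() below.  The following self-contained sketch shows the same pipeline
+# on a tiny, made-up corpus; the function name, data and labels are
+# illustrative assumptions only.
+def _uc04_pipeline_sketch():
+    """Minimal sketch of the spam pipeline built in train(); toy data only."""
+    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
+    from sklearn.naive_bayes import MultinomialNB
+    from sklearn.pipeline import Pipeline
+    texts = ["win money now", "meeting at noon", "cheap pills online", "lunch tomorrow?"]
+    labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (hypothetical labels)
+    clf = Pipeline([
+        ('cv', CountVectorizer(stop_words='english', ngram_range=(1, 2), decode_error='replace')),
+        ('tf-idf', TfidfTransformer()),
+        ('mnb', MultinomialNB()),
+    ])
+    clf.fit(texts, labels)
+    return clf.predict(["free money pills"])  # expected to be flagged as spam on this toy corpus
+# ---------------------------------------------------------------------------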
+# +# + + +# +import argparse +import os + +# numerical computing +import timeit + +import numpy as np + +# data frames +import pandas as pd + +# Naive Bayes +from sklearn import feature_extraction, naive_bayes +from sklearn.pipeline import Pipeline +from sklearn.feature_extraction.text import TfidfTransformer + +import joblib + + +def load_data(path: str) -> pd.DataFrame: + raw_data = pd.read_csv(path, delimiter='|', encoding='utf8') + raw_data['text'] = raw_data['text'].astype(str) + return raw_data + + +def clean_data(data: pd.DataFrame) -> pd.DataFrame: + # drop duplicates + data.drop_duplicates(inplace=True) + return data + + +def train(data:pd.DataFrame) -> naive_bayes: + bayesTfIDF = Pipeline([ + ('cv', feature_extraction.text.CountVectorizer(stop_words='english', ngram_range=(1, 2), decode_error='replace')), + ('tf-idf', TfidfTransformer()), + ('mnb', naive_bayes.MultinomialNB()) + ]) + + return bayesTfIDF.fit(data["text"], data['spam'].values) + + +def serve(model, data: pd.DataFrame) -> np.array: + predictions = model.predict(data["text"]) + return predictions + + +def main(): + model_file_name = "uc04.python.model" + + parser = argparse.ArgumentParser() + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving', 'scoring'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("filename") + + args = parser.parse_args() + path = args.filename + stage = args.stage + work_dir = args.workdir + if args.output: + output = args.output + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + start = timeit.default_timer() + raw_data = load_data(path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + start = timeit.default_timer() + cleaned_data = clean_data(raw_data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + model = train(cleaned_data) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + joblib.dump(model, work_dir + '/' + model_file_name) + + if stage == 'serving': + model = joblib.load(work_dir + '/' + model_file_name) + start = timeit.default_timer() + prediction = serve(model, cleaned_data) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + out_data = pd.DataFrame(prediction, columns=['spam']) + out_data['ID'] = cleaned_data['ID'] + out_data.to_csv(output + '/predictions.csv', index=False, sep='|') + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase05.py b/scripts/tpcx-ai/use_cases/UseCase05.py new file mode 100644 index 00000000000..94469a55846 --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase05.py @@ -0,0 +1,180 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. 
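+# --- Editorial note (not part of the original TPC kit) ---------------------
+# UseCase05 turns free-text product descriptions into fixed-length integer
+# sequences (Tokenizer + pad_sequences, maxlen=200) and regresses a price
+# with a small Embedding/GRU network, see pre_process() and make_bi_lstm()
+# below.  The sketch shows only the tokenisation step on made-up descriptions;
+# the function name and data are illustrative assumptions.
+def _uc05_tokenize_sketch():
+    """Minimal sketch of the text preprocessing in pre_process(); toy data only."""
+    from tensorflow.keras.preprocessing.sequence import pad_sequences
+    from tensorflow.keras.preprocessing.text import Tokenizer
+    descriptions = ["red cotton shirt", "vintage leather bag"]
+    tok = Tokenizer()
+    tok.fit_on_texts(descriptions)
+    seqs = tok.texts_to_sequences(descriptions)
+    # pad/truncate every sequence to the fixed length expected by the network
+    return pad_sequences(seqs, maxlen=200)  # shape (2, 200)
+# ---------------------------------------------------------------------------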
+# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# +import argparse +import os +import timeit +from pathlib import Path + +import pandas as pd +import tensorflow as tf +import joblib +from tensorflow.keras import Sequential +from tensorflow.keras.layers import Embedding, Dense, GRU +from tensorflow.keras.losses import mean_squared_error, mean_squared_logarithmic_error +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.sequence import pad_sequences +from tensorflow.keras.preprocessing.text import Tokenizer + + +def load(path) -> pd.DataFrame: + f = open(path, 'r', encoding="utf8") + list_of_lines = f.readlines() + f.close() + list_of_lines[0] = list_of_lines[0].replace('"', '') + f = open(path, 'w', encoding="utf8") + f.writelines(list_of_lines) + f.close() + data = pd.read_csv(path, sep='|', quoting=3) + return data + + +def pre_process(data: pd.DataFrame, tokenizer=None): + data['description'] = data.description.str[1:-1] + text_data = data.description + if 'price' in data.columns: + labels = data.price + else: + labels = None + if tokenizer is None: + tokenizer = Tokenizer() + tokenizer.fit_on_texts(text_data) + data_seq = tokenizer.texts_to_sequences(text_data) + data_seq_pad = pad_sequences(data_seq, maxlen=200) + return labels, data_seq_pad, tokenizer + + +def train(architecture, labels, features, loss=mean_squared_error, epochs=10, batch_size=4096, + learning_rate=None) -> tf.keras.callbacks.History: + lr = learning_rate if learning_rate else 0.001 + architecture.compile(optimizer=Adam(learning_rate=lr), loss=loss) + print(architecture.summary()) + return architecture.fit(features, labels, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.3) + + +def make_bi_lstm(tokenizer_len): + rnn_model = Sequential() + rnn_model.add(Embedding(tokenizer_len, 300, input_length=200)) + rnn_model.add(GRU(16)) + rnn_model.add(Dense(128)) + rnn_model.add(Dense(64)) + rnn_model.add(Dense(1, activation='linear')) + return rnn_model + + +def serve(model, data, batch_size=4096): + return model.predict(data, batch_size=batch_size) + + +def main(): + model_file_name = 'uc05.python.model' + tokenizer_file_name = f"{model_file_name}.tok" + + parser = argparse.ArgumentParser() + + # use-case specific parameters + parser.add_argument('--loss', choices=['mse', 'msle'], default='mse') + parser.add_argument('--epochs', 
metavar='N', type=int, default=15) + parser.add_argument('--batch', metavar='N', type=int, default=4096) + parser.add_argument('--learning_rate', '-lr', required=False, type=float) + + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving', 'scoring'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("filename") + + # configuration parameters + args = parser.parse_args() + loss = mean_squared_error if args.loss == 'mse' else mean_squared_logarithmic_error + epochs = args.epochs + batch = args.batch + learning_rate = args.learning_rate if args.learning_rate else None + + path = args.filename + stage = args.stage + work_dir = Path(args.workdir) + if args.output: + output = Path(args.output) + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + # derivative configuration parameters + model_file = work_dir / model_file_name + tokenizer_file = work_dir / tokenizer_file_name + + start = timeit.default_timer() + data = load(path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + if stage == 'training': + start = timeit.default_timer() + (labels, features, tokenizer) = pre_process(data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + start = timeit.default_timer() + tok_len = len(tokenizer.word_index) + 1 + architecture = make_bi_lstm(tok_len) + history = train(architecture, labels, features, loss, epochs, batch, learning_rate) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + history.model.save(str(model_file)) + joblib.dump(tokenizer, tokenizer_file) + + elif stage == 'serving': + tokenizer = joblib.load(tokenizer_file) + model = tf.keras.models.load_model(str(model_file)) + + start = timeit.default_timer() + (labels, features, tokenizer) = pre_process(data, tokenizer) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + start = timeit.default_timer() + price_suggestions = serve(model, features) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + # negative price suggestions need to be changed to 0: .clip(min=0) + df = pd.DataFrame({'id': data['id'], 'price': price_suggestions.ravel().clip(min=0)}) + df.to_csv(output / 'predictions.csv', index=False, sep='|') + + +if __name__ == "__main__": + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase06.py b/scripts/tpcx-ai/use_cases/UseCase06.py new file mode 100644 index 00000000000..3cadf95df5f --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase06.py @@ -0,0 +1,216 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. 
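+# --- Editorial note (not part of the original TPC kit) ---------------------
+# UseCase06 predicts imminent disk failures from SMART attributes.  Because
+# failures are rare, train() below first oversamples the minority class with
+# ADASYN and then fits an L2-regularised SVM via SystemDS (l2svm).  The
+# sketch shows only the resampling step on tiny, made-up data; the function
+# name and data are illustrative assumptions.
+def _uc06_adasyn_sketch():
+    """Minimal sketch of the ADASYN oversampling used in train(); toy data only."""
+    import numpy as np
+    from imblearn.over_sampling import ADASYN
+    rng = np.random.default_rng(0)
+    X = rng.normal(size=(100, 7))        # 7 features, like the SMART columns
+    y = np.array([1] * 10 + [0] * 90)    # heavily imbalanced labels
+    X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
+    return X_res.shape, np.bincount(y_res)  # classes are roughly balanced afterwards
+# ---------------------------------------------------------------------------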
+# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# +import argparse +import os +import timeit + +import numpy as np +import pandas as pd + +from imblearn.over_sampling import ADASYN +import joblib +from sklearn import metrics +from sklearn.metrics import classification_report, confusion_matrix, f1_score +from systemds.operator.algorithm import l2svm, l2svmPredict, scale +from systemds.context import SystemDSContext + +_random_state = 0xDEADBEEF + +label_column = '' + +feature_columns = ['smart_5_raw', + 'smart_10_raw', + 'smart_184_raw', + 'smart_187_raw', + 'smart_188_raw', + 'smart_197_raw', + 'smart_198_raw'] + + +def load_data(path: str) -> pd.DataFrame: + raw_data = pd.read_csv(path, parse_dates=['date']) + return raw_data + + +def pre_process(data: pd.DataFrame, failures_only=True) -> (np.array, pd.DataFrame): + if 'failure' in data.columns: + if failures_only: + # only get the hdd's that fail eventually + failures_raw = data.groupby(['serial_number', 'model']).filter(lambda x: (x['failure'] == 1).any()) + else: + failures_raw = data + failures_raw['ttf'] = pre_label(failures_raw).fillna(pd.Timedelta.max) + failures_raw = failures_raw[failures_raw['ttf'] >= pd.Timedelta('0 days')] + + training_data = failures_raw[feature_columns] + labels = failures_raw.ttf.apply(lambda x: label(x, thresholds=pd.Timedelta('1 days'))) + print("pp done") + return labels, training_data[feature_columns] + else: + return None, data[feature_columns] + + +def train(training_data: pd.DataFrame, labels: np.array): + sds = SystemDSContext() + resampled_features, resampled_labels = ADASYN(random_state=_random_state).fit_resample(training_data, labels) + resampled_features_sds= sds.from_numpy(resampled_features) + scaled_features_sds = scale(resampled_features_sds).compute() + scaled_features_sds = sds.from_numpy(scaled_features_sds[0]) + resampled_labels_sds = sds.from_numpy(resampled_labels) + model = l2svm(X=scaled_features_sds, Y=resampled_labels_sds, epsilon=0.1, maxIterations=20, maxii=5).compute() + sds.close() + return model + + +def serve(model, data): + sds = SystemDSContext() + data_sds = sds.from_numpy(data) + scaled_data_sds = scale(data_sds).compute() + scaled_data_sds = sds.from_numpy(scaled_data_sds[0]) + model_sds = sds.from_numpy(model) + [YRaw, _] = l2svmPredict(scaled_data_sds, model_sds).compute() + predictions = 
np.squeeze(YRaw).astype(np.int32) + predictions = (predictions >= 2).astype(int) + sds.close() + return predictions + + +def score(model, data, labels): + predictions = serve(model, data) + + f_score = f1_score(labels, predictions, average='weighted') + + tn, fn, fp, tp = confusion_matrix(labels, predictions).ravel() + + print(confusion_matrix(labels, predictions)) + print(classification_report(labels, predictions)) + + fpr, tpr, thresholds = metrics.roc_curve(labels, predictions, pos_label=1) + auc = metrics.auc(fpr, tpr) + + false_positive_rate = fp / (fp + tn) + return {'f1': f_score, 'fpr': false_positive_rate, 'auc': auc} + + +def pre_label(df, absolute_time='date', failure_indicator='failure', grouping_key=['model', 'serial_number']): + tmp = df.copy() + tmp['last'] = df.apply(lambda x: x[absolute_time] if x[failure_indicator] == 1 else np.NaN, axis='columns') + tmp['last'] = tmp.groupby(grouping_key)['last'].transform(lambda x: np.max(x)) + return tmp['last'] - df['date'] + + +def label(x, thresholds=pd.Timedelta('1 days')): + if x <= thresholds: + return 1 + else: + return 0 + + +def main(): + model_file_name = 'uc06.python.model' + + wallclock_start = timeit.default_timer() + + parser = argparse.ArgumentParser() + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving', 'scoring'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("filename") + + args = parser.parse_args() + path = args.filename + stage = args.stage + work_dir = args.workdir + if args.output: + output = args.output + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + failures_only = stage == 'training' + + start = timeit.default_timer() + raw_data = load_data(path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + start = timeit.default_timer() + (labels, data) = pre_process(raw_data, failures_only) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + model = train(data, labels) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + joblib.dump(model, work_dir + '/' + model_file_name) + + + if stage == 'serving': + model = joblib.load(work_dir + '/' + model_file_name) + start = timeit.default_timer() + predictions = serve(model, data) + end = timeit.default_timer() + serve_time = end - start + + out_data = pd.DataFrame({'model': raw_data['model'], 'serial_number': raw_data['serial_number'], + 'date': raw_data['date'], 'failure': predictions}) + out_data.to_csv(output + '/predictions.csv', index=False) + + print('serve time:\t', serve_time) + + if stage == 'scoring': + model = joblib.load(work_dir + '/' + model_file_name) + + scores = score(model, data, labels) + + print(scores) + + wallclock_end = timeit.default_timer() + wallclock_time = wallclock_end - wallclock_start + + if stage == 'serving': + throughput = len(predictions) / wallclock_time + print('throughput:\t{} samples/s'.format(throughput)) + print('wallclock time:\t', wallclock_time) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase07.py b/scripts/tpcx-ai/use_cases/UseCase07.py new file mode 100644 
index 00000000000..d9fba16f1b4 --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase07.py @@ -0,0 +1,196 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. +# +# + + +# +import argparse +import os + +# numerical computing +import timeit +from pathlib import Path + +import numpy as np + +# data frames +import pandas as pd + +import joblib + +from systemds.context import SystemDSContext + +def load_data(path: str) -> pd.DataFrame: + raw_data = pd.read_csv(path) + # raw_data.columns = ['userID', 'productID', 'rating'] + return raw_data + +def split_matrix(matrix, chunk_size): + num_chunks = int(np.ceil(matrix.shape[0] / chunk_size)) + return [matrix[i*chunk_size:(i+1)*chunk_size] for i in range(num_chunks)] + +def train_in_chunks(data: pd.DataFrame, save_path: str): + sds = SystemDSContext() + matrix, user_means, item_means, global_mean, user_index, item_index = compute_mean_centered_matrix(data) + matrix_chunks = split_matrix(matrix, 1000) + user_features_list = [] + max_length = matrix.shape[1] + item_features = np.zeros(max_length) + for chunk in matrix_chunks: + sds_matrix = sds.from_numpy(chunk) + U, S, V = sds_matrix.svd().compute() + S_diagonal = np.diag(S) + user_features_chunk = np.dot(U, S_diagonal) + item_features_chunk = np.dot(S_diagonal, V.T).T + user_features_list.append(user_features_chunk.flatten()) + if len(item_features_chunk) < max_length: + item_features_chunk = np.pad(item_features_chunk, (0, max_length - len(item_features_chunk)), 'constant', constant_values=0) + item_features = item_features + item_features_chunk + user_features = np.concatenate(user_features_list) + sds.close() + + joblib.dump({ + 'user_features': user_features, + 'item_features': item_features, + 'user_means': user_means, + 'item_means': item_means, + 'global_mean': global_mean, + 'user_index': user_index, + 'item_index': item_index + }, save_path) + return + +def serve(load_path: str, users, data, n=None) -> pd.DataFrame: + + model_data = joblib.load(load_path) + + user_features = model_data['user_features'] + item_features = 
model_data['item_features'] + user_means = model_data['user_means'] + item_means = model_data['item_means'] + global_mean = model_data['global_mean'] + user_index = model_data['user_index'] + item_index = model_data['item_index'] + + user_recommendations = [] + + user_item_interactions = data.groupby('userID')['productID'].apply(list).to_dict() + for user_id in users: + if user_id not in user_index: + continue + user_idx = user_index[user_id] + ratings = [] + if user_id in user_item_interactions: + for item_id in user_item_interactions[user_id]: + if item_id not in item_index: + continue + item_idx = item_index[item_id] + predicted_rating = np.dot(user_features[user_idx], item_features[item_idx]) + predicted_rating += user_means[user_id] + item_means[item_id] - global_mean + predicted_rating = min(max(predicted_rating, 1), 5) + ratings.append((user_id, item_id, predicted_rating)) + if n: + ratings = sorted(ratings, key=lambda t: t[2], reverse=True)[:n] + user_recommendations.extend(ratings) + return pd.DataFrame(user_recommendations, columns=['userID', 'productID', 'rating']) + + +def compute_mean_centered_matrix(data: pd.DataFrame) -> (np.ndarray, dict, dict, float, dict, dict): + user_means = data.groupby('userID')['rating'].mean().to_dict() + item_means = data.groupby('productID')['rating'].mean().to_dict() + global_mean = data['rating'].mean() + users = data['userID'].unique() + items = data['productID'].unique() + user_index = {user: idx for idx, user in enumerate(users)} + item_index = {item: idx for idx, item in enumerate(items)} + num_users = len(users) + num_items = len(items) + matrix = np.zeros((num_users, num_items)) + for _, row in data.iterrows(): + user = row['userID'] + item = row['productID'] + rating = row['rating'] + user_mean = user_means[user] + item_mean = item_means[item] + matrix[user_index[user], item_index[item]] = rating - user_mean - item_mean + global_mean + return matrix, user_means, item_means, global_mean, user_index, item_index + +def main(): + model_file_name = "uc07.python.model" + + parser = argparse.ArgumentParser() + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving', 'scoring'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("filename") + + args = parser.parse_args() + path = args.filename + stage = args.stage + work_dir = Path(args.workdir) + if args.output: + output = Path(args.output) + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + start = timeit.default_timer() + raw_data = load_data(path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + start = timeit.default_timer() + end = timeit.default_timer() + users = raw_data.userID.unique() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + train_in_chunks(raw_data, work_dir / model_file_name) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + + if stage == 'serving': + start = timeit.default_timer() + recommendations = serve(work_dir / model_file_name, users, raw_data) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + out_data = 
pd.DataFrame(recommendations) + out_data.to_csv(output / 'predictions.csv', index=False) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase08.py b/scripts/tpcx-ai/use_cases/UseCase08.py new file mode 100644 index 00000000000..1cb841e3afa --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase08.py @@ -0,0 +1,240 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2019 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. 
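+# --- Editorial note (not part of the original TPC kit) ---------------------
+# UseCase08 classifies shopping trips into trip types.  pre_process() below
+# pivots the line items into per-order weekday/department features, and
+# train() fits a multi-class XGBoost model on a sparse CSR matrix of those
+# features.  The sketch shows the classifier call on tiny random data; the
+# function name, data and class count are illustrative assumptions.
+def _uc08_xgboost_sketch():
+    """Minimal sketch of the XGBClassifier training used in train(); toy data only."""
+    import numpy as np
+    from scipy.sparse import csr_matrix
+    from xgboost.sklearn import XGBClassifier
+    rng = np.random.default_rng(0)
+    X = csr_matrix(rng.integers(0, 3, size=(50, 10)).astype(float))
+    y = rng.integers(0, 4, size=50)  # 4 hypothetical trip-type classes
+    clf = XGBClassifier(tree_method='hist', objective='multi:softprob', n_estimators=10)
+    clf.fit(X, y)
+    return clf.predict(X[:5])
+# ---------------------------------------------------------------------------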
+# +# + + +# +import argparse +import os +import timeit + +import joblib +import numpy as np +import pandas as pd +from scipy.sparse import csr_matrix +from xgboost.sklearn import XGBClassifier + +department_columns = [ + "FINANCIAL SERVICES", "SHOES", "PERSONAL CARE", "PAINT AND ACCESSORIES", "DSD GROCERY", "MEAT - FRESH & FROZEN", + "DAIRY", "PETS AND SUPPLIES", "HOUSEHOLD CHEMICALS/SUPP", "IMPULSE MERCHANDISE", "PRODUCE", + "CANDY, TOBACCO, COOKIES", "GROCERY DRY GOODS", "BOYS WEAR", "FABRICS AND CRAFTS", "JEWELRY AND SUNGLASSES", + "MENS WEAR", "ACCESSORIES", "HOME MANAGEMENT", "FROZEN FOODS", "SERVICE DELI", "INFANT CONSUMABLE HARDLINES", + "PRE PACKED DELI", "COOK AND DINE", "PHARMACY OTC", "LADIESWEAR", "COMM BREAD", "BAKERY", "HOUSEHOLD PAPER GOODS", + "CELEBRATION", "HARDWARE", "BEAUTY", "AUTOMOTIVE", "BOOKS AND MAGAZINES", "SEAFOOD", "OFFICE SUPPLIES", + "LAWN AND GARDEN", "SHEER HOSIERY", "WIRELESS", "BEDDING", "BATH AND SHOWER", "HORTICULTURE AND ACCESS", + "HOME DECOR", "TOYS", "INFANT APPAREL", "LADIES SOCKS", "PLUS AND MATERNITY", "ELECTRONICS", + "GIRLS WEAR, 4-6X AND 7-14", "BRAS & SHAPEWEAR", "LIQUOR,WINE,BEER", "SLEEPWEAR/FOUNDATIONS", + "CAMERAS AND SUPPLIES", "SPORTING GOODS", "PLAYERS AND ELECTRONICS", "PHARMACY RX", "MENSWEAR", "OPTICAL - FRAMES", + "SWIMWEAR/OUTERWEAR", "OTHER DEPARTMENTS", "MEDIA AND GAMING", "FURNITURE", "OPTICAL - LENSES", "SEASONAL", + "LARGE HOUSEHOLD GOODS", "1-HR PHOTO", "CONCEPT STORES", "HEALTH AND BEAUTY AIDS" +] + +weekday_columns = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] + +featureColumns = ['scan_count', 'scan_count_abs'] + weekday_columns + department_columns + +label_column = 'trip_type' + +# deleted label 14, since only 4 samples existed in the sample data set +label_range = [3, 4, 5, 6, 7, 8, 9, 12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, + 37, 38, 39, 40, 41, 42, 43, 44, 999] +sorted_labels = sorted(label_range, key=str) +label_to_index = {k: v for v, k in enumerate(sorted_labels)} + + +def load_data(order_path: str, lineitem_path: str, product_path: str) -> pd.DataFrame: + order_data = pd.read_csv(order_path, parse_dates=['date']) + lineitem_data = pd.read_csv(lineitem_path) + product_data = pd.read_csv(product_path) + data = order_data.merge(lineitem_data, left_on='o_order_id', right_on='li_order_id') + data = data.merge(product_data, left_on='li_product_id', right_on='p_product_id') + print("merge loading done") + if 'trip_type' in data.columns: + return data[['o_order_id', 'date', 'department', 'quantity', 'trip_type']] + else: + return data[['o_order_id', 'date', 'department', 'quantity']] + +def pre_process(raw_data: pd.DataFrame) -> (np.array, pd.DataFrame): + has_labels = label_column in raw_data.columns + + def scan_count(x): + return np.sum(x) + + def scan_count_abs(x): + return np.sum(np.abs(x)) + + def weekday(x): + return np.min(x) + + def trip_type(x): + return np.min(x) + + if has_labels: + agg_func = { + 'scan_count': [scan_count, scan_count_abs], + 'weekday': weekday, + 'trip_type': trip_type + } + else: + agg_func = { + 'scan_count': [scan_count, scan_count_abs], + 'weekday': weekday + } + + raw_data['scan_count'] = raw_data['quantity'] + raw_data['weekday'] = raw_data['date'].dt.day_name() + features_scan_count: pd.DataFrame = raw_data.groupby(['o_order_id']).agg(agg_func) + + features_scan_count.columns = features_scan_count.columns.droplevel(0) + + def grper(x): + return int(pd.Series.count(x) > 0) + + weekdays = 
raw_data.pivot_table(index='o_order_id', columns='weekday', values='scan_count', + aggfunc=grper).fillna(0.0) + + missing_weekdays = set(weekday_columns) - set(weekdays.columns) + for c in missing_weekdays: + weekdays.insert(1, c, 0.0) + + departments = raw_data.pivot_table(index='o_order_id', columns='department', values='scan_count', + aggfunc='sum') + + missing_cols = set(department_columns) - set(departments.columns) + for c in missing_cols: + departments.insert(1, c, 0.0) + + final_data: pd.DataFrame = features_scan_count.drop(columns=['weekday']) \ + .join(weekdays) \ + .join(departments) \ + .fillna(0.0) + + if label_column in final_data.columns: + # remove tiny classes + final_data = final_data[final_data['trip_type'] != 14] + final_data[label_column] = final_data['trip_type'].apply(encode_label) + return final_data[label_column].values.ravel(), final_data[featureColumns] + else: + return None, final_data[featureColumns] + + +def train(training_data: pd.DataFrame, labels, num_rounds): + xgboost_clf = XGBClassifier(tree_method='hist', objective='multi:softprob', n_estimators=num_rounds) + + features = csr_matrix(training_data[featureColumns]) + model = xgboost_clf.fit(features, labels) + return model + + +def serve(model, data: pd.DataFrame) -> pd.DataFrame: + + sparse_data = csr_matrix(data) + predictions = model.predict(sparse_data) + dec_fun = np.vectorize(decode_label) + predictions_df = pd.DataFrame({'o_order_id': data.index, 'trip_type': dec_fun(predictions)}) + return predictions_df + + +def encode_label(label): + return label_to_index[label] + + +def decode_label(label): + return sorted_labels[label] + + +def main(): + wallclock_start = timeit.default_timer() + model_file_name = 'uc08.python.model' + + parser = argparse.ArgumentParser() + parser.add_argument('--stage', choices=['training', 'serving', 'scoring'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument('--num-rounds', metavar='num-rounds', required=False, type=int, dest='num_rounds') + parser.add_argument("order") + parser.add_argument("lineitem") + parser.add_argument("product") + + args = parser.parse_args() + order_path = args.order + lineitem_path = args.lineitem + product_path = args.product + stage = args.stage + work_dir = args.workdir + if args.output: + output = args.output + else: + output = work_dir + + num_rounds = args.num_rounds if args.num_rounds else 100 + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + start = timeit.default_timer() + raw_data = load_data(order_path, lineitem_path, product_path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + start = timeit.default_timer() + (labels, data) = pre_process(raw_data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + model = train(data, labels, num_rounds) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + joblib.dump(model, work_dir + '/' + model_file_name) + + if stage == 'serving': + model = joblib.load(work_dir + '/' + model_file_name) + + start = timeit.default_timer() + predictions = serve(model, data) + end = timeit.default_timer() + serve_time = end - start + + predictions['o_order_id'] = data.index + 
predictions.to_csv(output + '/predictions.csv', index=False) + + print('serve time:\t', serve_time) + + wallclock_end = timeit.default_timer() + wallclock_time = wallclock_end - wallclock_start + print('wallclock time:\t', wallclock_time) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase09.py b/scripts/tpcx-ai/use_cases/UseCase09.py new file mode 100644 index 00000000000..7e2798c0fdd --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase09.py @@ -0,0 +1,332 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# Copyright 2021 Intel Corporation. +# This software and the related documents are Intel copyrighted materials, and your use of them +# is governed by the express license under which they were provided to you ("License"). Unless the +# License provides otherwise, you may not use, modify, copy, publish, distribute, disclose or +# transmit this software or the related documents without Intel's prior written permission. +# +# This software and the related documents are provided as is, with no express or implied warranties, +# other than those that are expressly stated in the License. 
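+# --- Editorial note (not part of the original TPC kit) ---------------------
+# UseCase09 identifies faces: images are aligned, mapped to 128-D embeddings
+# by a pre-trained network, and the embeddings are classified either with a
+# small Keras head (--nosvm) or with SystemDS multinomial logistic regression,
+# see train_classifier_svm() and serve_svm() below.  The sketch mirrors those
+# SystemDS calls on random 128-D vectors; the function name, data and label
+# values are illustrative assumptions only.
+def _uc09_multilogreg_sketch():
+    """Minimal sketch of the SystemDS classifier calls used below; toy data only."""
+    import numpy as np
+    from systemds.context import SystemDSContext
+    from systemds.operator.algorithm import multiLogReg, multiLogRegPredict
+    rng = np.random.default_rng(0)
+    x = rng.normal(size=(20, 128))      # 20 fake embeddings
+    y = np.array([1] * 10 + [2] * 10)   # two hypothetical identities
+    sds = SystemDSContext()
+    X, Y = sds.from_numpy(x), sds.from_numpy(y)
+    betas = multiLogReg(X=X, Y=Y).compute()
+    preds = multiLogRegPredict(X=X, B=sds.from_numpy(betas)).compute()[1]
+    sds.close()
+    return np.squeeze(preds).astype(np.int32)
+# ---------------------------------------------------------------------------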
+# +# + + +import argparse +import math +import os +import tarfile +import timeit +import zipfile +from pathlib import Path + +import cv2 +import joblib +import numpy as np +import pandas as pd +from sklearn.preprocessing import LabelEncoder +from tensorflow.keras import optimizers +from tensorflow.keras.callbacks import EarlyStopping +from tensorflow.keras.layers import Dense +from tensorflow.keras.models import Sequential, load_model, save_model +from tensorflow.keras.optimizers import Adadelta +from tensorflow.keras.utils import to_categorical +from tensorflow_addons.losses import TripletHardLoss +from tqdm import tqdm +from systemds.operator.algorithm import multiLogReg, multiLogRegPredict +from systemds.context import SystemDSContext + +from .openface.align import AlignDlib +from .openface.model import create_model + +BATCH_SIZE_DEFAULT = 64 +EPOCHS_EMBEDDING_DEFAULT = 15 +EPOCHS_CLASSIFIER_DEFAULT = 10000 + +IMAGE_SIZE = 96 + + +def load_data(path) -> pd.DataFrame: + if path.endswith('.zip'): + # the given path is a zip file + z = zipfile.ZipFile(path) + getnames = z.namelist + read = z.read + elif path.endswith('.tgz') or path.endswith('.tar.gz'): + # the given path is a compressed tarball + z = tarfile.open(path) + getnames = z.getnames + + def read(p): + z.extractfile(p).read() + z.close() + + else: + # the given path is a directory + new_path = Path(path) + if not new_path.exists(): + raise NotADirectoryError(f"The given path {new_path.absolute()} is not a directory") + + def getnames(): + files = map(str, new_path.rglob("*")) + # files = os.listdir() + # root, dirs, files = os.walk(newPath) + return list(files) + + def read(p): + b = None + with open(p, 'rb') as f: + b = f.read() + return b + + i = 0 + images = [] + identities = [] + paths = [] + names = getnames() + for name in names: + if name.endswith('.jpg') or name.endswith('.png'): + i += 1 + data = read(name) + img = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR) + identity = os.path.dirname(name).split('/')[-1] # get the last directory from the path + img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) + images.append(img) + identities.append(identity) + paths.append(name) + + return pd.DataFrame({'identity': identities, 'path': paths, 'image': images}) + + +def clean_data(data: pd.DataFrame): + data.groupby(['identity']).filter(lambda rows: len(rows) >= 10) + + +def preprocess_data(data): + res_path = Path(__file__).parent + res_path = res_path / 'resources/uc09/shape_predictor_5_face_landmarks.dat' + aligner = AlignDlib(str(res_path)) + + def align_l(img): return align_image(aligner, img, IMAGE_SIZE) + data['image_aligned'] = data['image'].progress_apply(align_l) + zero = np.ndarray((IMAGE_SIZE, IMAGE_SIZE, 3)) + zero.fill(0.0) + data['image_aligned'] = data['image_aligned'].map(lambda img: zero if img is None else img) + data['image_aligned'] = data['image_aligned'] / 255 + print("pp done") + return data + + +def train_embedding(architecture, data, epochs, batch_size, loss, learning_rate=None): + lr = learning_rate if learning_rate else 0.000001 + opt = optimizers.Adam(learning_rate=lr) + architecture.compile(loss=loss, optimizer=opt) + x = np.stack(data['image_aligned']) + y = np.stack(data['identity']) + encoder = LabelEncoder() + y = encoder.fit_transform(y) + history = architecture.fit(x, y, epochs=epochs, batch_size=batch_size, validation_split=0.1, verbose=2) + model_trained = history.model + return model_trained + + +def train_classifier(data, epochs): + # prepare data + shape = (len(data), 128) + x = 
np.stack(data.embedding).reshape(shape) + + label_enc = LabelEncoder() + label_enc.fit(data.identity) + num_classes = len(label_enc.classes_) + y = label_enc.transform(data.identity) + + # create keras model that is equivalent of a SVM + model = Sequential() + model.add(Dense(math.log2(num_classes), input_shape=(128,))) + model.add(Dense(num_classes, input_shape=(128, ), kernel_regularizer='l2', activation='linear')) + model.summary(line_length=120) + opt = Adadelta(learning_rate=0.1) + model.compile(optimizer=opt, loss='categorical_hinge') + early_stop = EarlyStopping(monitor='loss', patience=10, verbose=1) + model = model.fit(x, to_categorical(y), batch_size=32, epochs=epochs, callbacks=[early_stop]) + return model.model, label_enc + + +def train_classifier_svm(data): + sds = SystemDSContext() + shape = (len(data), 128) + x = np.stack(data.embedding).reshape(shape) + label_enc = LabelEncoder() + label_enc.fit(data.identity) + y = label_enc.transform(data.identity) + X = sds.from_numpy(x) + Y = sds.from_numpy(y) + betas = multiLogReg(X=X, Y=Y).compute() + sds.close() + return betas, label_enc + + +def serve(model, label_encoder, data): + shape = (len(data), 128) + x = np.stack(data.embedding).reshape(shape) + predictions = model.predict(x) + predictions_label = np.argmax(predictions, axis=1) + predictions_encoded = label_encoder.inverse_transform(predictions_label) + # convert path to integer, e.g. /path/to/file/01.png to 1 + samples = data.path.map(lambda s: int(os.path.splitext(os.path.split(s)[1])[0])) + return pd.DataFrame({'sample': samples, 'prediction': predictions_label, 'identity': predictions_encoded}) + + +def serve_svm(betas, label_encoder, data): + sds = SystemDSContext() + shape = (len(data), 128) + x = np.stack(data.embedding).reshape(shape) + X = sds.from_numpy(x) + B = sds.from_numpy(betas) + prediction_sds = multiLogRegPredict(X=X, B=B).compute() + prediction_sds = np.squeeze(prediction_sds[1]).astype(np.int32) + highest_label = np.max(prediction_sds) + prediction_sds[prediction_sds == highest_label] = 0 + pred_enc_sds = label_encoder.inverse_transform(prediction_sds) + + # convert path to integer, e.g. 
/path/to/file/01.png to 1 + samples = data.path.map(lambda s: int(os.path.splitext(os.path.split(s)[1])[0])) + sds.close() + return pd.DataFrame({'sample': samples, 'prediction': prediction_sds, 'identity': pred_enc_sds}) + + +def align_image(aligner: AlignDlib, img, image_size): + bb = aligner.getLargestFaceBoundingBox(img) + if not bb: + return None + landmarks = aligner.findLandmarks(img, bb) + new_landmarks = 68 * [(0, 0)] + new_landmarks[33] = landmarks[4] + new_landmarks[36] = landmarks[2] + new_landmarks[45] = landmarks[0] + return aligner.align(image_size, img, bb, landmarks=new_landmarks, + landmarkIndices=AlignDlib.OUTER_EYES_AND_NOSE) + + +def to_embedding(embedding, img): + emb = embedding.predict(np.expand_dims(img, axis=0)) + return emb + + +def main(): + tqdm.pandas() + + model_file_name = "uc09.python.model" + + parser = argparse.ArgumentParser() + parser.add_argument('--nosvm', action='store_true', default=False) + parser.add_argument('--batch', metavar='SIZE', type=int, default=BATCH_SIZE_DEFAULT) + parser.add_argument('--epochs_embedding', metavar='N', type=int, default=EPOCHS_EMBEDDING_DEFAULT) + parser.add_argument('--epochs_classifier', metavar='N', type=int, default=EPOCHS_CLASSIFIER_DEFAULT) + parser.add_argument('--learning_rate', '-lr', required=False, type=float) + + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("filename") + + args = parser.parse_args() + nosvm = args.nosvm + batch_size = args.batch + epochs_embedding = args.epochs_embedding + epochs_classifier = args.epochs_classifier + learning_rate = args.learning_rate if args.learning_rate else None + + model_file_name = f"{model_file_name}.dnn" if nosvm else f"{model_file_name}.svm" + path = args.filename + stage = args.stage + work_dir = Path(args.workdir) + if args.output: + output = Path(args.output) + else: + output = work_dir + + if not os.path.exists(work_dir): + os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + start = timeit.default_timer() + raw_data = load_data(path) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + loss = TripletHardLoss(margin=0.2) + + start = timeit.default_timer() + preprocessed_data = preprocess_data(raw_data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + embedding_pretrained = create_model() + res_path = Path(__file__).parent + weights_path = res_path / 'resources/uc09/nn4.small2.v1.h5' + embedding_pretrained.load_weights(str(weights_path)) + embedding = train_embedding(embedding_pretrained, preprocessed_data, epochs_embedding, batch_size, loss, learning_rate) + preprocessed_data['embedding'] = preprocessed_data['image_aligned'].apply( + lambda img: to_embedding(embedding, img)) + if nosvm: + model, label_enc = train_classifier(preprocessed_data, epochs_classifier) + else: + model, label_enc = train_classifier_svm(preprocessed_data) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + save_model(embedding, work_dir / f"{model_file_name}.embedding", save_format='h5') + if nosvm: + save_model(model, work_dir / model_file_name, save_format='h5') + else: 
+ joblib.dump(model, work_dir / model_file_name) + joblib.dump(label_enc, work_dir / f"{model_file_name}.enc") + + if stage == 'serving': + embedding = load_model(work_dir / f"{model_file_name}.embedding", compile=False, + custom_objects={'TripletHardLoss': loss}) + if nosvm: + model = load_model(work_dir / model_file_name) + else: + model = joblib.load(work_dir / model_file_name) + label_enc = joblib.load(work_dir / f"{model_file_name}.enc") + start = timeit.default_timer() + # get the 128-D embedding for each aligned image + preprocessed_data['embedding'] = preprocessed_data['image_aligned'].apply( + lambda img: to_embedding(embedding, img)) + if nosvm: + prediction = serve(model, label_enc, preprocessed_data) + else: + prediction = serve_svm(model, label_enc, preprocessed_data) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + out_data = prediction + out_data[['sample', 'identity']].sort_values('sample').to_csv(output / 'predictions.csv', index=False) + + +if __name__ == '__main__': + main() diff --git a/scripts/tpcx-ai/use_cases/UseCase10.py b/scripts/tpcx-ai/use_cases/UseCase10.py new file mode 100644 index 00000000000..578d3e33f70 --- /dev/null +++ b/scripts/tpcx-ai/use_cases/UseCase10.py @@ -0,0 +1,182 @@ +# +# Copyright (C) 2021 Transaction Processing Performance Council (TPC) and/or its contributors. +# This file is part of a software package distributed by the TPC +# The contents of this file have been developed by the TPC, and/or have been licensed to the TPC under one or more contributor +# license agreements. +# This file is subject to the terms and conditions outlined in the End-User +# License Agreement (EULA) which can be found in this distribution (EULA.txt) and is available at the following URL: +# http://www.tpc.org/TPC_Documents_Current_Versions/txt/EULA.txt +# Unless required by applicable law or agreed to in writing, this software is distributed on an "AS IS" BASIS, WITHOUT +# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, and the user bears the entire risk as to quality +# and performance as well as the entire cost of service or repair in case of defect. See the EULA for more details. +# + + +# +# "INTEL CONFIDENTIAL" Copyright 2019 Intel Corporation All Rights +# Reserved. +# +# The source code contained or described herein and all documents related +# to the source code ("Material") are owned by Intel Corporation or its +# suppliers or licensors. Title to the Material remains with Intel +# Corporation or its suppliers and licensors. The Material contains trade +# secrets and proprietary and confidential information of Intel or its +# suppliers and licensors. The Material is protected by worldwide copyright +# and trade secret laws and treaty provisions. No part of the Material may +# be used, copied, reproduced, modified, published, uploaded, posted, +# transmitted, distributed, or disclosed in any way without Intel's prior +# express written permission. +# +# No license under any patent, copyright, trade secret or other +# intellectual property right is granted to or conferred upon you by +# disclosure or delivery of the Materials, either expressly, by +# implication, inducement, estoppel or otherwise. Any license under such +# intellectual property rights must be express and approved by Intel in +# writing. 
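+# --- Editorial note (not part of the original TPC kit) ---------------------
+# UseCase10 flags fraudulent transactions.  To keep memory bounded, train()
+# and serve() below split the feature matrix into fixed-size row chunks, run
+# SystemDS multiLogReg / multiLogRegPredict per chunk, and (for training)
+# average the per-chunk beta matrices.  The sketch shows just the row-chunking
+# helper on made-up data; the function name and sizes are illustrative only.
+def _uc10_chunking_sketch():
+    """Minimal sketch of the row-chunking used by train()/serve(); toy data only."""
+    import numpy as np
+    x = np.arange(25, dtype=float).reshape(5, 5)
+    chunk_size = 2
+    chunks = [x[i:i + chunk_size] for i in range(0, len(x), chunk_size)]
+    # three chunks with 2, 2 and 1 rows; per-chunk results can then be
+    # computed independently and combined afterwards
+    return [c.shape for c in chunks]  # [(2, 5), (2, 5), (1, 5)]
+# ---------------------------------------------------------------------------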
+# + +import argparse +import os +import timeit + +# data frames +from pathlib import Path +import pandas as pd +import numpy as np + +#logistic regression +from sklearn.linear_model import LogisticRegression +from systemds.operator.algorithm import multiLogReg, multiLogRegPredict +from systemds.context import SystemDSContext +import joblib + + +def load_data(path_customers: str, path_transactions: str) -> pd.DataFrame: + customer_data = pd.read_csv(path_customers) + transaction_data = pd.read_csv(path_transactions) + customer_data['senderID'] = customer_data['fa_customer_sk'] + data = pd.merge(transaction_data, customer_data, on="senderID") + return (data) + + +def hour_func(ts): + return ts.hour + + +def pre_process(data: pd.DataFrame) -> pd.DataFrame: + data_pre = data + data_pre['time'] = pd.to_datetime(data_pre['time']) + data_pre['business_hour'] = data_pre['time'].apply(hour_func) + data_pre['amount_norm'] = data_pre['amount'] / data_pre['transaction_limit'] + data_pre['business_hour_norm'] = data_pre['business_hour'] / 23 + print("pp done") + if 'isFraud' in data_pre.columns: + return data_pre[['transactionID', 'amount_norm', 'business_hour_norm', 'isFraud']] + else: + return data_pre[['transactionID', 'amount_norm', 'business_hour_norm']] + +# Function to split numpy array into smaller chunks +def split_array(array, chunk_size): + return [array[i:i + chunk_size] for i in range(0, len(array), chunk_size)] + +def train(data: pd.DataFrame): + sds = SystemDSContext() + training_data = data[["business_hour_norm", "amount_norm"]] + x = training_data.to_numpy() + y = data['isFraud'] + y = y.to_numpy() + + # Split data into smaller chunks + X_chunks = split_array(x, 1000000) + Y_chunks = split_array(y, 1000000) + X_sds_chunks = [sds.from_numpy(chunk) for chunk in X_chunks] + Y_sds_chunks = [sds.from_numpy(chunk) for chunk in Y_chunks] + + betas_chunks = [multiLogReg(X=chunk_X, Y=chunk_Y).compute() for chunk_X, chunk_Y in zip(X_sds_chunks, Y_sds_chunks)] + betas_aggregated = np.mean(betas_chunks, axis=0) + + sds.close() + return betas_aggregated + + +def serve(B, data): + sds = SystemDSContext() + serving_data = data[["business_hour_norm", "amount_norm"]] + X = serving_data.to_numpy() + X_chunks = split_array(X, 1000000) + + X_sds_chunks = [sds.from_numpy(chunk) for chunk in X_chunks] + B = sds.from_numpy(B) + prediction_chunks = [multiLogRegPredict(X=chunk_X, B=B).compute() for chunk_X in X_sds_chunks] + sds_predict = [] + for elem in prediction_chunks: + prediction = np.squeeze(elem[1]).astype(np.int32) + sds_predict.append(prediction) + sds_predict_combined = np.concatenate(sds_predict) + sds_predict_combined[sds_predict_combined == 2] = 0 + data['isFraud'] = sds_predict_combined + sds.close() + return data[["transactionID", "isFraud"]] + +def main(): + model_file_name = "uc10.python.model.svm" + + parser = argparse.ArgumentParser() + parser.add_argument('--debug', action='store_true', required=False) + parser.add_argument('--stage', choices=['training', 'serving'], metavar='stage', required=True) + parser.add_argument('--workdir', metavar='workdir', required=True) + parser.add_argument('--output', metavar='output', required=False) + parser.add_argument("customers") + parser.add_argument("transactions") + + args = parser.parse_args() + path_customers = args.customers + path_transactions = args.transactions + stage = args.stage + work_dir = Path(args.workdir) + if args.output: + output = Path(args.output) + else: + output = work_dir + + if not os.path.exists(work_dir): + 
os.makedirs(work_dir) + + if not os.path.exists(output): + os.makedirs(output) + + start = timeit.default_timer() + (raw_data) = load_data(path_customers, path_transactions) + end = timeit.default_timer() + load_time = end - start + print('load time:\t', load_time) + + start = timeit.default_timer() + preprocessed_data = pre_process(raw_data) + end = timeit.default_timer() + pre_process_time = end - start + print('pre-process time:\t', pre_process_time) + + if stage == 'training': + start = timeit.default_timer() + model = train(preprocessed_data) + end = timeit.default_timer() + train_time = end - start + print('train time:\t', train_time) + + joblib.dump(model, work_dir / model_file_name) + + if stage == 'serving': + model = joblib.load(work_dir / model_file_name) + start = timeit.default_timer() + prediction = serve(model, preprocessed_data) + end = timeit.default_timer() + serve_time = end - start + print('serve time:\t', serve_time) + + out_data = pd.DataFrame(prediction) + out_data.to_csv(output / 'predictions.csv', index=False) + + +if __name__ == '__main__': + main()
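+# --- Editorial usage note (not part of the original TPC kit) ---------------
+# Each use-case script shares the same CLI shape (see the argparse setup in
+# main() above): a required --stage and --workdir, an optional --output, and
+# one or more positional input paths.  A hypothetical standalone invocation
+# of this file, outside the TPCx-AI driver, would look roughly like:
+#
+#   python UseCase10.py --stage training --workdir ./work \
+#       <customers.csv> <transactions.csv>
+#
+# The actual input paths are normally supplied by the benchmark driver; the
+# placeholders above are shown only for illustration.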