- Download TPCx-AI project
- Install Anaconda, Java, sbt (build tool for Scala and Java)
1~2 for Vessl datacenter workspace
# 1. Install Anaconda
cd tpcx-ai-v1.0.3.1/
apt update
apt install zip libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6 -y
wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
chmod +x ./Anaconda3-2019.07-Linux-x86_64.sh
./Anaconda3-2019.07-Linux-x86_64.sh
# Installing Anaconda takes time...

# I guess java & sbt required for multi-node only (spark)
# 2. java (optional)
apt update
apt install openjdk-8-jre
apt install openjdk-8-jdk-headless

# 3. sbt (optional)
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee /etc/apt/sources.list.d/sbt.list
echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | tee /etc/apt/sources.list.d/sbt_old.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | gpg --no-default-keyring --keyring gnupg-ring:/etc/apt/trusted.gpg.d/scalasbt-release.gpg --import
chmod 644 /etc/apt/trusted.gpg.d/scalasbt-release.gpg
apt update
apt install sbt
For Windows & Mac, you need to install Anaconda (and openjdk-8-jre and sbt) manually.
- Install python virtual environment and run the benchmark and validation
chmod +x *.sh bin/*.sh tools/**/*.sh
source ./setenv.sh
./setup-python.sh
./TPCx-AI_Benchmarkrun.sh > benchmark.log
./TPCx-AI_Validation.sh > validation.log
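Before running the setup scripts, it can help to confirm that the tools from the install step are actually on PATH. The sketch below is not part of the TPCx-AI kit: missing_tools is a hypothetical helper name, and the tool list simply mirrors the install steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the TPCx-AI kit): print every tool
# from the argument list that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# conda is needed for the single-node setup; the notes above guess that
# java and sbt are only needed for multi-node (Spark) runs.
missing=$(missing_tools conda java sbt)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All tools found"
fi
```

If java or sbt show up as missing on a single-node machine, that is probably fine given the note above that they are optional.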
Project folder structure
- workload: main use case script
  - python: workload scripts for each use case for single-node system
  - spark: workload scripts for each use case for multi-node system
  - spark3: workload scripts for each use case for multi-node system
- data-gen
  - config
    - tpcxai-generation.xml: dataset download specification
    - tpcxai-schema.xml: dataset type schema specification
- tools
  - python: python setup scripts for single-node system
  - spark: spark setup scripts & ansible playbook for multi-node system
- lib: runtime and libraries
- driver: project config & source code
  - config: config for benchmark (commands and output folder)
  - dat: data information for benchmark
  - tpcxai-driver: main python driver for benchmark
    - __main__: main script for benchmark
Project scripts
- setenv.sh: set necessary environment variables and scale factor
- setup-python.sh: create virtual python environment for single-node system
- setup-spark.sh: create virtual python environment for multi-node system
- TPCx-AI_Benchmarkrun.sh: run benchmark
- TPCx-AI_Validation.sh: run validation (validation enforces scale factor 1; there is no other meaningful difference from the benchmark run)
- Full_TPCx-AI_Benchmarkrun: run validation and benchmark
Additional folders generated during benchmark
- output
  - raw_data: data before preprocessing
  - data: data after preprocessing
    - training: training data with labels
    - serving: serving data without labels (no overlap with the training data)
    - scoring: scoring data with labels (when the global SF is 1, it is a subset of the training dataset because of the internal SF)
  - model: trained model output
  - output: serving and scoring output
- log: log folder for each run with SQLite database
- Enable GPU for TF (optional, because it does not improve default performance)
conda activate lib/python-venv
conda install cudnn -y
export LD_LIBRARY_PATH=/root/tpcx-ai-v1.0.3.1/lib/python-venv/lib:$LD_LIBRARY_PATH
# check that tensorflow detects the gpu
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
- Multi-node requirement installation
- libXxf86vm, libXxf86vm-devel, mesa-libGL-devel
- Hadoop (HDFS, YARN)
- Spark 2.4+
# Install libs
apt install libxxf86vm-dev libgl1-mesa-dev

# Install OpenMPI
apt install openmpi-bin openmpi-common openssh-client openssh-server libopenmpi-dev

# Install Spark
wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar xf spark-3.5.1-bin-hadoop3.tgz
rm spark-3.5.1-bin-hadoop3.tgz
mv spark-3.5.1-bin-hadoop3 ~/spark3
pyspark

# Install pssh
pip install parallel-ssh
You can use Ansible to automate the installation with tools/spark/ansible_site.yml, except for the Spark and Hadoop cluster.
cd tools/spark
# Edit inventory file
ansible-playbook -i inventory -e "spark_version=3" ansible_site.yml
After installing Hadoop, you can finally run the benchmark.
export PIP_NO_CACHE_DIR=1
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_WITH_MPI=1
export HOROVOD_CPU_OPERATIONS="MPI"
export HOROVOD_WITHOUT_GLOO=1
export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_WITHOUT_MXNET=1

conda create -y -p /home/ubuntu/adabench_dl -c conda-forge python=3.7 gcc_linux-64=10 gxx_linux-64=10 cmake=3 openmpi-mpicc=4 pip=22 setuptools=59
conda activate /home/ubuntu/adabench_dl
conda env update -p /home/ubuntu/adabench_dl --file tools/spark/build_dl.yml
Enable Parallel DataGen
./tools/enable_parallel_datagen.sh
Environment variables (place them before the interactivity check in .bashrc)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=~/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export YARN_CONF_DIR=~/hadoop/etc/hadoop
export HADOOP_CONF_DIR=~/hadoop/etc/hadoop
export HDFS_NAMENODE_USER="ubuntu"
export HDFS_DATANODE_USER="ubuntu"
export HDFS_SECONDARYNAMENODE_USER="ubuntu"
export YARN_RESOURCEMANAGER_USER="ubuntu"
export YARN_NODEMANAGER_USER="ubuntu"
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:~/spark3/bin
# PYSPARK_PYTHON must point at the interpreter inside the conda env, not the env directory
export PYSPARK_PYTHON=/home/ubuntu/adabench_dl/bin/python
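Why the position in .bashrc matters: the default Ubuntu .bashrc returns early when the shell is not interactive, and commands launched over ssh (which is how the Hadoop and Spark start scripts reach the workers) typically run in such a shell, so exports placed after that check never take effect there. A minimal demonstration using a throwaway file instead of the real .bashrc; the variable names BEFORE_GUARD and AFTER_GUARD are made up for illustration:

```shell
#!/bin/sh
# Build a throwaway rc file that mimics the guard in Ubuntu's default .bashrc.
demo_rc=$(mktemp)
cat > "$demo_rc" <<'EOF'
export BEFORE_GUARD=visible
# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return ;;
esac
export AFTER_GUARD=visible
EOF

# bash -c starts a non-interactive shell, similar to a command run over ssh.
# The rc file path is passed as $1 and sourced inside that shell.
result=$(bash -c '. "$1"; echo "${BEFORE_GUARD:-unset} ${AFTER_GUARD:-unset}"' demo "$demo_rc")
echo "$result"
rm -f "$demo_rc"
```

Only the variable exported above the guard survives in the non-interactive shell, which is the behaviour the Hadoop and Spark exports need to avoid.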
AWS
- 172.31.15.227
- 172.31.13.72
Ports
- 9870: HDFS
- 8088: Hadoop cluster
- 19888: Job history
- 8080: Spark
- 8042: NodeManager
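To open these web UIs, combine a node address with the port. A small sketch, assuming the UIs are served over plain http and that the first node listed above is the master; web_ui_url and the service names are hypothetical, chosen only for this example:

```shell
#!/bin/sh
# Hypothetical helper: map a service name to its web UI port
# (ports as listed above) and print the URL for a given host.
web_ui_url() {
    host=$1
    case $2 in
        hdfs)        port=9870 ;;
        yarn)        port=8088 ;;
        jobhistory)  port=19888 ;;
        spark)       port=8080 ;;
        nodemanager) port=8042 ;;
        *) echo "unknown service: $2" >&2; return 1 ;;
    esac
    echo "http://$host:$port"
}

# Assumption: 172.31.15.227 is the master node; adjust to your cluster.
for service in hdfs yarn jobhistory spark nodemanager; do
    web_ui_url 172.31.15.227 "$service"
done
```

Since these are private AWS addresses, the URLs are only reachable from inside the VPC or through an ssh tunnel.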
Run a single phase
cd ~/tpcx-ai-v1.0.3.1
source ./setenv.sh
./bin/tpcxai.sh -c /home/ubuntu/tpcx-ai-v1.0.3.1/driver/config/default-spark.yaml -uc 10 --phase SERVING_THROUGHPUT
# or
./TPCx-AI_Benchmarkrun.sh
# start Hadoop (HDFS and YARN); start-all.sh is on PATH via $HADOOP_HOME/sbin
start-all.sh
# start Spark
cd ~/spark3/sbin
./start-all.sh
Node count changing test
- spark config execution: --num-executors
- nodes file
- stop or start worker
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
/home/ubuntu/spark3/sbin/stop-slave.sh
Scale factor test
- setenv.sh: scale factor
- set TOTAL_USE_CASES
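Repeating the benchmark at several scale factors means editing one export line in setenv.sh between runs. The sketch below shows one way to script that with sed, demonstrated on a scratch file. The variable name TPCxAI_SCALE_FACTOR and the helper set_export are assumptions for illustration, so check the actual name in your setenv.sh first.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the line "export NAME=..." in FILE
# so that it assigns VALUE; all other lines are left untouched.
set_export() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp)
    sed "s|^export $name=.*|export $name=$value|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demonstrate on a scratch file rather than the real setenv.sh.
# TPCxAI_SCALE_FACTOR is an assumed variable name.
scratch=$(mktemp)
printf 'export TPCxAI_SCALE_FACTOR=1\nexport OTHER=keep\n' > "$scratch"
set_export "$scratch" TPCxAI_SCALE_FACTOR 10
updated=$(cat "$scratch")
echo "$updated"
rm -f "$scratch"
```

Pointing set_export at the real setenv.sh inside a loop over scale factors, followed by the benchmark script, would give one run per setting.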
TPC Benchmarks Overview
The Transaction Processing Performance Council (TPC) defines Transaction Processing and Database Benchmarks and delivers trusted results to the industry.
https://www.tpc.org/information/benchmarks5.asp
https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPCX-AI_v1.0.3.1.pdf
Seonglae Cho