YSU Big Data TPCxAI

Created: 2024 Mar 12 14:30
Creator: Seonglae Cho
Edited: 2024 May 26 16:59
TPCx-AI
  1. Download TPCx-AI project
  1. Install Anaconda, Java, and sbt (build tool for Scala and Java)
 
Steps 1 and 2 on a Vessl datacenter workspace:
# 1. Install Anaconda
cd tpcx-ai-v1.0.3.1/
apt update
apt install zip libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6 -y
wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
chmod +x ./Anaconda3-2019.07-Linux-x86_64.sh
./Anaconda3-2019.07-Linux-x86_64.sh
# Installing Anaconda takes time...

# I guess java & sbt required for multi-node only (spark)
# 2. java (optional)
apt update
apt install openjdk-8-jre
apt install openjdk-8-jdk-headless

# 3. sbt (optional)
echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee /etc/apt/sources.list.d/sbt.list
echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | tee /etc/apt/sources.list.d/sbt_old.list
curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | gpg --no-default-keyring --keyring gnupg-ring:/etc/apt/trusted.gpg.d/scalasbt-release.gpg --import
chmod 644 /etc/apt/trusted.gpg.d/scalasbt-release.gpg
apt update
apt install sbt
On Windows and macOS, you need to install Anaconda (and, optionally, openjdk-8-jre and sbt) manually.
 
 
 
  1. Set up the Python virtual environment and run the benchmark and validation
chmod +x *.sh bin/*.sh tools/**/*.sh
./setenv.sh
./setup-python.sh
./TPCx-AI_Benchmarkrun.sh > benchmark.log
./TPCx-AI_Validation.sh > validation.log
 
 
 

Project folder structure

  • workload: main use case scripts
    • python: workload scripts for each use case on a single-node system
    • spark: workload scripts for each use case on a multi-node system
    • spark3: workload scripts for each use case on a multi-node system
  • data-gen
    • config
      • tpcxai-generation.xml: dataset generation specification
      • tpcxai-schema.xml: dataset type schema specification
  • tools
    • python: Python setup scripts for a single-node system
    • spark: Spark setup scripts and Ansible playbook for a multi-node system
  • lib: runtime and libraries
  • driver: project config and source code
    • config: config for the benchmark (commands and output folder)
    • dat: data information for the benchmark
    • tpcxai-driver: main Python driver for the benchmark
      • __main__: main script for the benchmark

Project scripts

  • setenv.sh: sets the necessary environment variables and the scale factor
  • setup-python.sh: creates the virtual Python environment for a single-node system
  • setup-spark.sh: creates the virtual Python environment for a multi-node system
  • TPCx-AI_Benchmarkrun.sh: runs the benchmark
  • TPCx-AI_Validation.sh: runs validation (validation enforces scale factor 1)
    • Apart from the enforced scale factor, it does not differ meaningfully from the benchmark run.
  • Full_TPCx-AI_Benchmarkrun: runs validation and benchmark
 

Additional folder generated during benchmark

  • output
    • raw_data: data before preprocessing
    • data: data after preprocessing
      • training: training data with labels
      • serving: serving data without labels (no overlap with the training data)
      • scoring: scoring data with labels (when the global SF is 1, the internal SF makes it a subset of the training dataset)
    • model: trained model output
    • output: serving and scoring output
  • log: log folder for each run, with an SQLite database
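To see which of these folders a run actually produced, a small shell function can walk the expected layout. This is only a sketch: the folder names and nesting are read off the list above, and the function name `check_output_layout` is made up for illustration.

```shell
#!/usr/bin/env bash
# Report which of the expected benchmark folders exist under a project root.
# Folder names follow the list above; the function name is hypothetical.
check_output_layout() {
  local root="${1:-.}"
  local dir
  for dir in \
    output/raw_data \
    output/data/training \
    output/data/serving \
    output/data/scoring \
    output/model \
    output/output \
    log
  do
    if [ -d "$root/$dir" ]; then
      echo "ok      $dir"
    else
      echo "missing $dir"
    fi
  done
}
```

For example, `check_output_layout ~/tpcx-ai-v1.0.3.1` prints one line per expected folder.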
 
 
 
 
  1. Enable GPU for TensorFlow (optional, because it does not improve performance with the default settings)
conda activate lib/python-venv
conda install cudnn -y
export LD_LIBRARY_PATH=/root/tpcx-ai-v1.0.3.1/lib/python-venv/lib:$LD_LIBRARY_PATH
# check tensorflow detect gpu
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
 
 
  1. Multi-node requirement installation
  • libXxf86vm, libXxf86vm-devel, mesa-libGL-devel
  • Hadoop (HDFS, YARN)
  • Spark 2.4+
# Install libs
apt install libxxf86vm-dev libgl1-mesa-dev
# Install OpenMPI
apt install openmpi-bin openmpi-common openssh-client openssh-server libopenmpi-dev
# Install Spark
wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar xf spark-3.5.1-bin-hadoop3.tgz
rm spark-3.5.1-bin-hadoop3.tgz
mv spark-3.5.1-bin-hadoop3 ~/spark3
pyspark
# Install pssh
pip install parallel-ssh
You can use Ansible with tools/spark/ansible_site.yml to automate the installation, except for the Spark and Hadoop cluster setup.
cd tools/spark
# Edit inventory file
ansible-playbook -i inventory -e "spark_version=3" ansible_site.yml
After Hadoop Install, you can finally run the benchmark.
export PIP_NO_CACHE_DIR=1
export HOROVOD_WITH_TENSORFLOW=1
export HOROVOD_WITH_MPI=1
export HOROVOD_CPU_OPERATIONS="MPI"
export HOROVOD_WITHOUT_GLOO=1
export HOROVOD_WITHOUT_PYTORCH=1
export HOROVOD_WITHOUT_MXNET=1
conda create -y -p /home/ubuntu/adabench_dl -c conda-forge python=3.7 gcc_linux-64=10 gxx_linux-64=10 cmake=3 openmpi-mpicc=4 pip=22 setuptools=59
conda activate /home/ubuntu/adabench_dl
conda env update -p /home/ubuntu/adabench_dl --file tools/spark/build_dl.yml

Enable Parallel DataGen

./tools/enable_parallel_datagen.sh
 

Environment variables to set before the .bashrc interactivity check

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=~/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export YARN_CONF_DIR=~/hadoop/etc/hadoop
export HADOOP_CONF_DIR=~/hadoop/etc/hadoop
export HDFS_NAMENODE_USER="ubuntu"
export HDFS_DATANODE_USER="ubuntu"
export HDFS_SECONDARYNAMENODE_USER="ubuntu"
export YARN_RESOURCEMANAGER_USER="ubuntu"
export YARN_NODEMANAGER_USER="ubuntu"
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:~/spark3/bin
export PYSPARK_PYTHON=/home/ubuntu/adabench_dl
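The placement probably matters because the stock Ubuntu ~/.bashrc returns early for non-interactive shells, so exports placed after that guard would not reach commands launched over ssh. The sketch below assumes that stock guard; the helper name `exports_before_guard` is made up for illustration. It checks that no `export` line comes after the guard.

```shell
#!/usr/bin/env bash
# The stock Ubuntu ~/.bashrc contains this guard near the top:
#
#   case $- in
#       *i*) ;;
#         *) return;;
#   esac
#
# Succeeds when no `export` line appears after that guard
# (or when the file has no guard at all). Hypothetical helper.
exports_before_guard() {
  awk '
    /^[ \t]*case[ \t]+\$-[ \t]+in/ { guard = 1; next }
    guard && /^[ \t]*export[ \t]/  { late = 1 }
    END { exit (late ? 1 : 0) }
  ' "$1"
}
```

For example, `exports_before_guard ~/.bashrc && echo "exports precede the guard"`.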
 

AWS

  1. 172.31.15.227
  1. 172.31.13.72
 
 
 

Ports

  • 9870 - HDFS NameNode web UI
  • 8088 - Hadoop cluster (YARN ResourceManager web UI)
  • 19888 - Job history server web UI
  • 8080 - Spark master web UI
  • 8042 - NodeManager web UI
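A quick way to check whether these ports are reachable from another machine is to attempt a TCP connection to each one. The sketch below uses bash's `/dev/tcp` redirection and the coreutils `timeout` command; the function names are made up for illustration, and the host is whatever address your cluster node has.

```shell
#!/usr/bin/env bash
# Map each port from the list above to a short label.
port_label() {
  case "$1" in
    9870)  echo "HDFS" ;;
    8088)  echo "Hadoop cluster" ;;
    19888) echo "Job history" ;;
    8080)  echo "Spark" ;;
    8042)  echo "NodeManager" ;;
    *)     echo "unknown" ;;
  esac
}

# Try a TCP connection to every listed port on the given host.
# Relies on bash's /dev/tcp redirection, so it needs bash (not plain sh).
check_ports() {
  local host="$1" port
  for port in 9870 8088 19888 8080 8042; do
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "open   $port ($(port_label "$port"))"
    else
      echo "closed $port ($(port_label "$port"))"
    fi
  done
}
```

For example, `check_ports 172.31.15.227` prints one line per port. A port reported as closed may also be blocked by a firewall or security group rather than not listening.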
 
 
 

Run a single phase

cd ~/tpcx-ai-v1.0.3.1
source ./setenv.sh
./bin/tpcxai.sh -c /home/ubuntu/tpcx-ai-v1.0.3.1/driver/config/default-spark.yaml -uc 10 --phase SERVING_THROUGHPUT
# or
./TPCx-AI_Benchmarkrun.sh
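When repeating one phase for several use cases, it can help to generate the command lines first and review them before running anything. The sketch below only prints commands. It reuses the flags from the command above (`-c`, `-uc`, `--phase`) and assumes `-uc` takes a use case number; `TPCXAI_HOME` and `phase_cmd` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Print the driver command for one use case and one phase (dry run).
# TPCXAI_HOME is a hypothetical variable; it defaults to the kit path used above.
phase_cmd() {
  local use_case="$1" phase="$2"
  local home="${TPCXAI_HOME:-$HOME/tpcx-ai-v1.0.3.1}"
  printf '%s -c %s -uc %s --phase %s\n' \
    "$home/bin/tpcxai.sh" \
    "$home/driver/config/default-spark.yaml" \
    "$use_case" "$phase"
}

# Print the same phase for a few use cases.
for uc in 1 5 10; do
  phase_cmd "$uc" SERVING_THROUGHPUT
done
```

Run the printed commands in a shell where `setenv.sh` has already been sourced.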
 
# Start Hadoop (HDFS and YARN); start-all.sh is on PATH via $HADOOP_HOME/sbin
start-all.sh
# Start Spark
cd ~/spark3/sbin
./start-all.sh
 
 

Node count changing test

  • Change --num-executors in the Spark execution config
  • Update the nodes file
  • Stop or start a worker
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
/home/ubuntu/spark3/sbin/stop-slave.sh
 
 
 

Scale factor test

  • Set the scale factor in setenv.sh
  • Set TOTAL_USE_CASES
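Changing the scale factor between runs can be scripted by rewriting the matching `export` line. The sketch below is generic: the default variable name `TPCxAI_SCALE_FACTOR` is an assumption and should be checked against your copy of setenv.sh, and the function name `set_env_value` is made up for illustration.

```shell
#!/usr/bin/env bash
# Rewrite `export NAME=...` in a setenv-style file to a new value.
# The default variable name is an assumption; verify it in your setenv.sh.
set_env_value() {
  local file="$1" value="$2" name="${3:-TPCxAI_SCALE_FACTOR}"
  # -i.bak edits in place and keeps a backup copy next to the file.
  sed -i.bak -E "s|^(export[[:space:]]+${name}=).*|\\1${value}|" "$file"
}
```

For example, `set_env_value setenv.sh 10` followed by `./TPCx-AI_Benchmarkrun.sh > benchmark_sf10.log` keeps one log per scale factor. A third argument selects a different variable.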
 
 
 
 
 
TPC Benchmarks Overview
The Transaction Processing Performance Council (TPC) defines Transaction Processing and Database Benchmarks and delivers trusted results to the industry.
www.tpc.org
 
 
 
