YSU Big Data Milestone 1

Todo

Spark, Spark3 with HDFS, Horovod

Report architecture

What is TPCx-AI?

TPCx-AI is a benchmark developed by the Transaction Processing Performance Council (TPC). It is a standardized performance measurement tool for evaluating how well a system handles Artificial Intelligence (AI) workloads. Because every system is measured within the same framework, results can be compared fairly, which helps organizations make informed decisions when investing in AI infrastructure.

Benchmark architecture

folder structure, data input, data output and dependencies

Folder Structure: Project folder structure

workload                        main use case scripts
    python                      Python workload scripts for each use case (single-node system)
    spark                       Spark workload scripts for each use case (multi-node system)
    spark3                      Spark3 workload scripts for each use case (multi-node system)
data-gen
    config
        tpcxai-generation.xml   dataset download specification
        tpcxai-schema.xml       dataset type schema specification
tools
    python                      Python setup scripts for single-node system
    spark                       Spark setup scripts for multi-node system
lib                             runtime and libraries
driver                          project config & source code
    config                      config for benchmark (commands and output folder)
    dat                         data information for benchmark
    tpcxai-driver               main Python driver for benchmark
        __main__                main script for benchmark
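
As a quick sanity check of the layout described above, the following is a minimal sketch that reports which of the listed directories are missing from a kit checkout. The nesting of the directories is inferred from the listing, and the helper name `missing_dirs` and the constant `EXPECTED_DIRS` are hypothetical; a given kit version may be organized differently.

```python
from pathlib import Path

# Directory names come from the folder listing above. Their nesting is an
# assumption inferred from that listing, not something confirmed by the kit.
EXPECTED_DIRS = [
    "workload/python",
    "workload/spark",
    "workload/spark3",
    "data-gen/config",
    "tools/python",
    "tools/spark",
    "lib",
    "driver/config",
    "driver/dat",
]


def missing_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs("/path/to/kit")` returns an empty list when every listed directory is present.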

Project scripts

setenv.sh set necessary environment variables and scale factor

setup-python.sh create virtual python environment for single-node system

setup-spark.sh create virtual python environment for multi-node system

TPCx-AI_Benchmarkrun.sh run benchmark

TPCx-AI_Validation.sh run validation

Full_TPCx-AI_Benchmarkrun run validation and benchmark

Additional folder generated during benchmark

output
    raw_data        data before preprocessing
    data            data after preprocessing
        training    training data with labels
        serving     serving data without labels (no overlap with the training data)
        scoring     scoring data with labels (when the global scale factor (SF) is 1, the internal SF makes this a subset of the training dataset)
    model           trained model output
    output          serving and scoring output
log                 log folder for each run, with an SQLite database
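
Since each run leaves an SQLite database in the log folder, a first step when inspecting results is to see which tables it contains. The sketch below uses only Python's standard `sqlite3` module. The database file name and its schema are not described in this report, so the function takes the path as a parameter and makes no assumption about table names; `list_tables` is a hypothetical helper.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

From there, individual tables can be explored with ordinary `SELECT` queries once their names are known.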

Dependencies: TensorFlow, Java, SBT, Anaconda

The benchmark starts by setting up an execution environment with the necessary hardware (mainly disk space) and software configuration.

The benchmark setup includes installing dependencies such as Java, SBT, TensorFlow and Conda. The benchmark workflow first cleans up existing data, then runs data generation, the load test, the power training test, and scoring.
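
The environment requirements above can be checked before a run with a short script. This is a minimal sketch, assuming the dependencies are exposed as the commands `java`, `sbt` and `conda` on the `PATH` and as the importable Python module `tensorflow`. The function name `check_environment` and the `min_free_gb` threshold are hypothetical, and the disk space actually required depends on the scale factor.

```python
import importlib.util
import shutil

# Assumed command and module names for the dependencies listed above.
REQUIRED_COMMANDS = ["java", "sbt", "conda"]
REQUIRED_MODULES = ["tensorflow"]


def check_environment(path=".", min_free_gb=0.0):
    """Report which dependencies are found and whether enough disk space is free."""
    report = {}
    for cmd in REQUIRED_COMMANDS:
        # shutil.which returns None when the command is not on the PATH.
        report[cmd] = shutil.which(cmd) is not None
    for mod in REQUIRED_MODULES:
        # find_spec returns None when the module cannot be imported.
        report[mod] = importlib.util.find_spec(mod) is not None
    free_gb = shutil.disk_usage(path).free / 1e9
    report["disk"] = free_gb >= min_free_gb
    return report
```

Any entry that comes back `False` points to a dependency to install, or to a disk that needs more free space, before starting the benchmark.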

Dataset chosen

Why we chose it

We chose the Customer Conversation Transcription dataset because speech-to-text is used more and more in the real world, for example to analyze voice phishing. The dataset is a mix of audio and text, offering a diverse range of data types. Moreover, because it consists of customer conversations, it reflects customer service interactions, which are inherently diverse and complex. This makes it well suited to developing robust speech-to-text transcription.

Dataset schema:

Model type to train

Vanilla Single-node benchmark result

CPU, GPU result

Vanilla Multi-node benchmark result (optional)

System improvement plan

MLFlow
S3

Recommendations