YSU Big Data Milestone 1

Creator
Seonglae Cho
Created
2024 Mar 19 5:38
Edited
2024 Mar 25 9:20

Todo

Spark, Spark3 with HDFS, Horovod
 
 
 
 
 

Report architecture

  1. What is TPCx-AI?
      • TPCx-AI is a benchmark developed by the Transaction Processing Performance Council (TPC). It is a standardized tool for measuring how well a system handles Artificial Intelligence (AI) workloads. Because every system runs the same framework, results can be compared fairly, which helps organizations make informed decisions when investing in AI infrastructure.
  1. Benchmark architecture
    1. Folder structure, data input, data output and dependencies
    2. Folder structure: project folder structure
      • workload: main use case scripts
        • python: workload scripts for each use case on a single-node system
        • spark: workload scripts for each use case on a multi-node system
        • spark3: workload scripts for each use case on a multi-node system
      • data-gen
        • config
          • tpcxai-generation.xml: dataset download specification
          • tpcxai-schema.xml: dataset type schema specification
      • tools
        • python: Python setup scripts for a single-node system
        • spark: Spark setup scripts for a multi-node system
      • lib: runtime and libraries
      • driver: project config and source code
        • config: config for the benchmark (commands and output folder)
        • dat: data information for the benchmark
        • tpcxai-driver: main Python driver for the benchmark
          • __main__: main script for the benchmark

      Project scripts

      • setenv.sh: set the necessary environment variables and the scale factor
      • setup-python.sh: create a virtual Python environment for a single-node system
      • setup-spark.sh: create a virtual Python environment for a multi-node system
      • TPCx-AI_Benchmarkrun.sh: run the benchmark
      • TPCx-AI_Validation.sh: run validation
      • Full_TPCx-AI_Benchmarkrun: run validation and the benchmark
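The scripts listed above suggest a setup, validate, then benchmark sequence. A minimal Python sketch of that ordering follows. It only builds the commands and does not run them. The script names come from the list; the helper name `plan_commands`, the assumption that all scripts sit in the kit's root folder, and the assumption that `setenv.sh` should be sourced in the same shell as each script are illustrative and not confirmed by the kit.

```python
from pathlib import PurePosixPath

# Script names are taken from the list above; the mapping itself is an assumption.
SETUP_SCRIPTS = {"python": "setup-python.sh", "spark": "setup-spark.sh"}


def plan_commands(kit_root, mode="python", validate_first=True):
    """Return the shell commands for one run, in order, without executing them."""
    if mode not in SETUP_SCRIPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    root = PurePosixPath(kit_root)
    steps = [SETUP_SCRIPTS[mode]]
    if validate_first:
        steps.append("TPCx-AI_Validation.sh")
    steps.append("TPCx-AI_Benchmarkrun.sh")
    # Assumption: setenv.sh only exports variables, so it is sourced before each script.
    return [
        ["bash", "-c", f"source '{root / 'setenv.sh'}' && '{root / step}'"]
        for step in steps
    ]


if __name__ == "__main__":
    # "/path/to/tpcx-ai" is a placeholder for wherever the kit was unpacked.
    for cmd in plan_commands("/path/to/tpcx-ai"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths have been checked against the actual kit.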
       

      Additional folder generated during benchmark

      • output
        • raw_data: data before preprocessing
        • data: data after preprocessing
          • training: training data with labels
          • serving: serving data without labels (no overlap with the training data)
          • scoring: scoring data with labels (when the global SF is 1, the internal SF makes it a subset of the training dataset)
        • model: trained model output
        • output: serving and scoring output
      • log: log folder for each run, with a SQLite database
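After a run, it can be useful to confirm that the generated folders described above actually exist. The sketch below checks a given output folder for those sub-folders. The folder names come from the outline; the helper name `missing_output_dirs` is hypothetical, and the real kit may nest these folders differently (for example per use case), so treat the layout as an assumption.

```python
from pathlib import Path

# Sub-folders named in the outline above, relative to the generated output folder.
EXPECTED_DIRS = [
    "raw_data",
    "data/training",
    "data/serving",
    "data/scoring",
    "model",
    "output",
]


def missing_output_dirs(output_root):
    """Return the expected sub-folders that do not exist under output_root."""
    root = Path(output_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # "output" is a placeholder path; point it at the folder the benchmark wrote to.
    for name in missing_output_dirs("output"):
        print(f"missing: {name}")
```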
       
    3. Dependencies: TensorFlow, Java, SBT, Anaconda
    4. The benchmark architecture starts by setting up an execution environment with the necessary hardware (mainly disk space) and software configuration.
      The benchmark setup includes installing dependencies such as Java, SBT, TensorFlow and Conda. The benchmark workflow starts with a cleanup step, followed by data generation, the load test, the power training test, the serving test and scoring.
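The environment requirements above (disk space plus Java, SBT, Conda and TensorFlow) can be checked before starting a run. The following is a minimal preflight sketch that uses only the Python standard library. The function name `preflight`, the executable names `java`, `sbt` and `conda`, and the `min_free_gb` threshold are assumptions chosen for illustration; the source does not state how much disk space the benchmark needs.

```python
import importlib.util
import shutil


def preflight(path=".", min_free_gb=10.0,
              tools=("java", "sbt", "conda"), modules=("tensorflow",)):
    """Report which prerequisites are visible from the current environment."""
    # Free space on the filesystem that contains `path`, in gigabytes.
    free_gb = shutil.disk_usage(path).free / 1e9
    report = {"disk_ok": free_gb >= min_free_gb}
    for tool in tools:
        # True if the executable is found on PATH.
        report[tool] = shutil.which(tool) is not None
    for module in modules:
        # True if the module can be located, without importing it.
        report[module] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The check only reports presence; it does not verify versions, which the benchmark may also constrain.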
  1. Dataset chosen
    1. Why we chose it
    2. We chose the Customer Conversation Transcription dataset because speech-to-text is increasingly used in the real world, for example to analyze voice phishing. The dataset mixes audio and text, so it offers a diverse range of data types. Moreover, because it consists of customer conversations, it captures customer service interactions, which are inherently diverse and complex; this makes it well suited to developing robust speech-to-text transcription.
    3. Dataset schema:
    4.  
    5. Model type to train
  1. Vanilla Single-node benchmark result
    1. CPU, GPU result
  1. Vanilla Multi-node benchmark result (optional)
  1. System improvement plan
    1. MLflow
    2. S3
 
 
 
 
 
 
 
 

Recommendations