SIMD

Created : 2019 Nov 5 5:17
Edited : 2024 Jan 13 6:28
Refs : BLAS

Single Instruction Multiple Data

Vector Processor
Needs a new array processor, a controller, and a new Instruction Set Architecture
  • SIMD - a single instruction operates on multiple data elements. Vector processors pair one controller with multiple processing elements (ex. array processor). This reduces instruction memory and data memory accesses. Best for loops, weak for branch-heavy code (switch)

Vectorizable Loops - iterations do not access the same index

A loop in which no iteration depends on another is vectorizable

  • no control flow in the loop body = once vectorized, the static instruction count equals the dynamic instruction count
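A minimal sketch of the distinction, written in Python for illustration only (all function names are hypothetical). The first loop has independent iterations, so running it backwards gives the same result. The second carries a value from iteration i - 1 into iteration i, so it is not vectorizable as written.

```python
def independent(a, b):
    """c[i] depends only on a[i] and b[i]: iterations can run in any order."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def independent_reversed(a, b):
    """Same loop as `independent`, run backwards, to show order does not matter."""
    c = [0] * len(a)
    for i in reversed(range(len(a))):
        c[i] = a[i] + b[i]
    return c


def loop_carried(a):
    """s[i] needs s[i - 1] from the previous iteration: a loop-carried dependence."""
    s = list(a)
    for i in range(1, len(s)):
        s[i] = s[i - 1] + a[i]
    return s
```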
Only vectorizable loops can be efficiently executed by vector processors
  • Loading / storing vector values - needs vector registers and memory support
  • Vectors of different lengths in the same operation - needs a vector length register (VLEN - controls the length of the vector) or strip-mining (break the loop into pieces that fit the hardware - adds overhead)
  • Elements stored apart from each other - needs a vector stride register (VSTR - distance between consecutive vector elements)
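Strip-mining can be sketched as follows. This is an illustration under stated assumptions, not a model of any specific machine: `MVL` (maximum vector length) is a hypothetical value, and `vector_add` stands in for one vector instruction whose length is set by VLEN.

```python
MVL = 64  # assumed maximum vector length of the hardware (hypothetical value)


def vector_add(x, y, vlen):
    """Stand-in for one vector instruction operating on `vlen` elements."""
    assert vlen <= MVL
    return [x[i] + y[i] for i in range(vlen)]


def strip_mined_add(x, y):
    """Add two arrays of any length by processing strips of at most MVL elements."""
    result, strip_lengths = [], []
    for low in range(0, len(x), MVL):
        vlen = min(MVL, len(x) - low)  # value that would be written to VLEN
        result += vector_add(x[low:low + vlen], y[low:low + vlen], vlen)
        strip_lengths.append(vlen)
    return result, strip_lengths


x = list(range(150))
y = [1] * 150
total, strips = strip_mined_add(x, y)  # strips == [64, 64, 22]
```

The loop around the vector instruction is the overhead the notes mention: 150 elements need three strips instead of one instruction.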
Stride (also called increment, pitch, or step size) of an array (vector): the distance in memory between consecutive elements
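A small sketch of a strided load, assuming a row-major matrix stored in a flat list. `load_strided` is a hypothetical helper, not a real instruction. Reading a row uses stride 1, while reading a column uses a stride equal to the number of columns.

```python
def load_strided(memory, base, stride, vlen):
    """Load `vlen` elements starting at `base`, `stride` positions apart."""
    return [memory[base + i * stride] for i in range(vlen)]


ROWS, COLS = 3, 4
matrix = list(range(ROWS * COLS))  # row-major: row r starts at index r * COLS

row1 = load_strided(matrix, base=1 * COLS, stride=1, vlen=COLS)  # [4, 5, 6, 7]
col2 = load_strided(matrix, base=2, stride=COLS, vlen=ROWS)      # [2, 6, 10]
```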
 

VMIPS

DAXPY - Double precision a times X Plus Y : Y = aX + Y
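For reference, a plain scalar version of DAXPY in Python (a sketch; the function name simply follows the BLAS routine name). Each iteration touches only index i, so the loop is vectorizable.

```python
def daxpy(a, x, y):
    """Update y in place with y[i] = a * x[i] + y[i] and return it."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y
```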
 
ADDVV.D : add two vectors of doubles
ADDVS.D : add a scalar to each element of a vector
Execution time depends on the vector length and on dependencies between instructions, which force them to run in sequence
 

When compute and memory operations are out of balance → one becomes the bottleneck (usually memory)

Memory banking can be one solution
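A rough sketch of why banking interacts with stride. The mapping is an assumption the notes do not specify: low-order interleaving, where word address w lives in bank w mod B. Under that assumption, a strided stream touches B / gcd(stride, B) distinct banks, so a stride that shares a factor with the bank count loses parallelism.

```python
from math import gcd


def banks_touched(stride, num_banks, accesses):
    """Banks hit by `accesses` strided word accesses, assuming bank = address mod num_banks."""
    return sorted({(i * stride) % num_banks for i in range(accesses)})


unit_stride = banks_touched(stride=1, num_banks=8, accesses=64)  # all 8 banks
worst_case = banks_touched(stride=8, num_banks=8, accesses=64)   # only bank 0
```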
 
dynamic instruction count : number of instructions actually executed at run time (instructions that never execute are not counted)
static instruction count : number of lines of assembly code
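The difference is simple arithmetic, sketched below. The instruction counts are hypothetical, chosen only to illustrate a scalar loop versus straight-line vector code.

```python
def count_instructions(setup, loop_body, iterations):
    """Return (static, dynamic) counts for `setup` instructions plus a loop."""
    static = setup + loop_body                # lines of assembly
    dynamic = setup + loop_body * iterations  # instructions actually executed
    return static, dynamic


# Hypothetical scalar loop: 2 setup instructions, 9 in the body, 64 iterations.
scalar = count_instructions(setup=2, loop_body=9, iterations=64)  # (11, 578)

# Hypothetical vector code: 6 straight-line instructions and no loop.
vector = count_instructions(setup=6, loop_body=0, iterations=0)   # (6, 6)
```

With no loop left in the vector version, the static and dynamic counts coincide.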
 

Scatter-Gather

If elements are not accessed in a strided manner (ex. through an index vector) → indirect access (scatter-gather)
  • LVI/SVI instructions : load/store vector indexed (gather/scatter) ex. LVI Va, (Ra+Vk) ; load A[K[i]]
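A minimal sketch of what gather and scatter do, using a flat Python list as memory. `gather` and `scatter` are hypothetical helpers that mirror the roles of LVI and SVI; they are not a model of the actual instructions.

```python
def gather(memory, base, index_vector):
    """Like LVI: load memory[base + K[i]] for every index K[i]."""
    return [memory[base + k] for k in index_vector]


def scatter(memory, base, index_vector, values):
    """Like SVI: store values[i] to memory[base + K[i]]."""
    for k, v in zip(index_vector, values):
        memory[base + k] = v


A = [0, 10, 20, 30, 40, 50]
K = [5, 0, 3]                                 # index vector

gathered = gather(A, 0, K)                    # [50, 0, 30]
scatter(A, 0, K, [v + 1 for v in gathered])   # A[K[i]] = A[K[i]] + 1
```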
 
strided manner : indices run from 0 to the vector length in fixed steps
 

Variables for Vector processor

Convoy

A set of vector instructions that could potentially execute together (no structural hazards among them; instructions that use the same vector can share a convoy through chaining)

Chime

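Unit of time taken to execute one convoy. A sequence of m convoys takes m chimes; for a vector length of n, that is approximately m × n clock cycles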

Chaining

Allows a vector operation to start as soon as the individual elements of its vector source operands become available (with chaining, instructions with a RAW dependence can share a convoy: an element is available once its execution finishes, without waiting for write-back)
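The three terms can be tied together with a small sketch. It assumes a simplified machine with one functional unit of each kind and in-order grouping, and it ignores start-up overhead. `build_convoys` and the unit names are hypothetical; the instruction sequence is DAXPY as it is commonly written for VMIPS.

```python
def build_convoys(instructions, chaining=True):
    """Group instructions greedily, in order, into convoys.

    A new convoy starts on a structural hazard (functional unit already in use)
    or, without chaining, on a RAW dependence inside the current convoy.
    """
    convoys = []
    current, used_units, written = [], set(), set()
    for name, unit, dest, sources in instructions:
        structural_hazard = unit in used_units
        raw_hazard = not chaining and any(s in written for s in sources)
        if current and (structural_hazard or raw_hazard):
            convoys.append(current)
            current, used_units, written = [], set(), set()
        current.append(name)
        used_units.add(unit)
        if dest is not None:
            written.add(dest)
    if current:
        convoys.append(current)
    return convoys


# (mnemonic, functional unit, destination register, source registers)
DAXPY = [
    ("LV", "load_store", "V1", []),
    ("MULVS.D", "multiply", "V2", ["V1"]),
    ("LV", "load_store", "V3", []),
    ("ADDVV.D", "add", "V4", ["V2", "V3"]),
    ("SV", "load_store", None, ["V4"]),
]

chained = build_convoys(DAXPY, chaining=True)
unchained = build_convoys(DAXPY, chaining=False)

n = 64                                # vector length
approx_cycles = len(chained) * n      # m chimes * n elements
```

With chaining the sequence forms 3 convoys, so roughly 3 × 64 = 192 cycles. Without chaining the RAW dependences split it into 4.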
 

Real Application

Intel defines a word as 16 bits
  • MMX - bad decision : the MMX registers are aliased onto the FPU registers for compatibility
    • MMX data types - all 64 bits
      1. packed byte : 8 bytes packed into 64 bits
      2. packed word : 4 words packed into 64 bits
      3. packed double-word : 2 double-words packed into 64 bits
      4. packed quad-word : one 64-bit quantity
  • SSE(1~3) - no aliasing
      1. Adds eight 128-bit registers
      2. Allows SIMD operations on packed single-precision floating-point numbers
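The packed layouts can be sketched with plain Python integers. The helpers below are hypothetical and only emulate the idea: several narrow lanes share one wide register, and a packed add works on each lane independently (here with wraparound on overflow). `struct` is used to show that four single-precision floats fill 128 bits.

```python
import struct


def pack_lanes(values, lane_bits):
    """Pack integers into one wide integer, lane 0 in the lowest bits."""
    mask = (1 << lane_bits) - 1
    packed = 0
    for i, v in enumerate(values):
        packed |= (v & mask) << (i * lane_bits)
    return packed


def unpack_lanes(packed, lane_bits, total_bits=64):
    """Split a `total_bits`-wide integer back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(packed >> shift) & mask for shift in range(0, total_bits, lane_bits)]


def packed_add(a, b, lane_bits, total_bits=64):
    """Add lane by lane; a carry never crosses into the neighbouring lane."""
    mask = (1 << lane_bits) - 1
    lanes_a = unpack_lanes(a, lane_bits, total_bits)
    lanes_b = unpack_lanes(b, lane_bits, total_bits)
    return pack_lanes([(x + y) & mask for x, y in zip(lanes_a, lanes_b)], lane_bits)


# Packed word: four 16-bit words in 64 bits.
a = pack_lanes([1, 2, 3, 0xFFFF], 16)
b = pack_lanes([10, 20, 30, 1], 16)
words = unpack_lanes(packed_add(a, b, 16), 16)  # last lane wraps around to 0

# Packed single precision: four 32-bit floats in one 128-bit register.
sse_register = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
```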