SIMD

Single Instruction Multiple Data

Vector Processor

Needs a new array processor, a controller, and a new Instruction Set Architecture (ISA)

SIMD - a single instruction operates on multiple data

Vector processors - one controller and multiple processing elements (ex. array processor)

Reduces instruction memory and data memory accesses

Best for loops, weak for switch (branch-heavy) code

Vectorizable loops - different iterations do not access the same index

A loop in which no iteration depends on another is vectorizable

No control flow in the loop = the static instruction count equals the dynamic instruction count

Only vectorizable loops can be efficiently executed by vector processors
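To make the dependence rule concrete, here is a minimal sketch contrasting a loop whose iterations are independent with one that carries a dependence between iterations. The function names are hypothetical and chosen only for illustration.

```python
def scale_elementwise(x, a):
    # Iteration i reads x[i] and writes y[i] only. No iteration depends on
    # another, so all elements could be handled by one vector operation.
    y = [0.0] * len(x)
    for i in range(len(x)):
        y[i] = a * x[i]
    return y


def prefix_sum(x):
    # Iteration i reads y[i - 1], which iteration i - 1 has just written.
    # This loop-carried dependence prevents straightforward vectorization.
    y = list(x)
    for i in range(1, len(x)):
        y[i] = y[i - 1] + x[i]
    return y
```

The first loop is the kind a vector processor executes efficiently; the second would need to be restructured before it could be.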

Loading / storing vector values - needs vector registers and memory support

Vectors of different lengths in the same operation - need a Vector Length Register (VLEN - controls the length of the vector) or strip-mining (break the loop into pieces - adds overhead)
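Strip-mining can be sketched as follows. This is a software model only: `MVL` (maximum vector length) is an assumed hardware limit with a hypothetical value of 64, and each slice operation stands in for one vector instruction executed with VLEN set to the strip length.

```python
MVL = 64  # assumed maximum vector length of the hardware (hypothetical value)


def strip_mined_add(a, b, mvl=MVL):
    """Add two equal-length lists in strips of at most mvl elements."""
    n = len(a)
    c = [0.0] * n
    strip_lengths = []  # the values VLEN would take, kept for inspection
    start = 0
    while start < n:
        vlen = min(mvl, n - start)  # value that would be written to VLEN
        # One "vector instruction" operating on vlen elements.
        c[start:start + vlen] = [
            x + y for x, y in zip(a[start:start + vlen], b[start:start + vlen])
        ]
        strip_lengths.append(vlen)
        start += vlen
    return c, strip_lengths
```

For 150 elements and `MVL = 64` the loop runs three strips of 64, 64 and 22 elements; the extra loop bookkeeping is the overhead the note refers to.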

Elements stored apart from each other - need a Vector Stride Register (VSTR - distance between consecutive vector elements)

Stride is also called the increment, pitch or step size of an array (vector)
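A strided load can be modelled in a few lines. The sketch below assumes memory is a flat Python list addressed in element units; `load_strided` is a hypothetical helper, with `stride` playing the role of VSTR and `vlen` the role of VLEN.

```python
def load_strided(memory, base, stride, vlen):
    """Return vlen elements starting at base, spaced stride elements apart."""
    return [memory[base + i * stride] for i in range(vlen)]


# A 3x4 matrix stored row by row in flat memory: values 0..11.
rows, cols = 3, 4
matrix = list(range(rows * cols))

row_1 = load_strided(matrix, base=cols, stride=1, vlen=cols)     # unit stride
column_1 = load_strided(matrix, base=1, stride=cols, vlen=rows)  # stride = cols
```

Reading a row uses unit stride, while reading a column of a row-major matrix needs a stride equal to the row length.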

VMIPS

DAXPY - Double precision a times X Plus Y : Y = aX + Y
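As a reference point, a scalar version of DAXPY might look like the sketch below. The comment listing the vector instruction sequence follows the usual textbook VMIPS example and is an assumption, not something stated in these notes.

```python
def daxpy(a, x, y):
    """Compute Y = a*X + Y in place and return Y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Scalar loop: one multiply and one add per element.
    # On a vector machine the whole loop would become roughly:
    #   LV, MULVS.D, LV, ADDVV.D, SV   (assumed textbook sequence)
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y
```

Every iteration touches only element i of X and Y, so the loop is vectorizable.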

ADDVV.D : add two vectors of doubles

ADDVS.D : add a scalar to a vector of doubles

Execution time depends on vector length and on dependencies between instructions (their sequence)

If the balance between compute and memory operations is broken → bottleneck (usually memory)

Memory banking can be one solution
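Why banking helps, and when it does not, can be seen with a small model. The sketch assumes low-order interleaving (bank = address mod number of banks) and element-granular addresses; both are assumptions, and `banks_touched` is a hypothetical helper.

```python
def banks_touched(base, stride, vlen, num_banks):
    """Count the distinct banks hit by one strided vector access."""
    return len({(base + i * stride) % num_banks for i in range(vlen)})


unit_stride = banks_touched(base=0, stride=1, vlen=8, num_banks=8)
stride_two = banks_touched(base=0, stride=2, vlen=8, num_banks=8)
stride_eight = banks_touched(base=0, stride=8, vlen=8, num_banks=8)
```

With 8 banks, unit stride spreads 8 accesses over all 8 banks, a stride of 2 uses only 4 of them, and a stride of 8 sends every access to the same bank, so the memory bottleneck returns.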

Dynamic instruction count : number of instructions executed at run time (instructions that never execute are not counted)

Static instruction count : number of lines of assembly code

Scatter-Gather

If access is not strided (ex. through an index vector) → indirect access (Scatter-Gather)

LVI/SVI instructions : load vector indexed (gather) / store vector indexed (scatter)

ex. LVI Va, (Ra+Vk) ; load A[K[]]
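Indexed access can be modelled as below. The helpers `gather` and `scatter` are hypothetical names that mimic what LVI and SVI do, with a Python list standing in for memory.

```python
def gather(memory, indices):
    """Model of LVI: load memory[k] for every k in the index vector."""
    return [memory[k] for k in indices]


def scatter(memory, indices, values):
    """Model of SVI: store values[i] to memory[indices[i]]."""
    for k, v in zip(indices, values):
        memory[k] = v


# Increment only the elements of A selected by the index vector K.
A = [10, 20, 30, 40, 50]
K = [4, 0, 2]
loaded = gather(A, K)                   # A[K[]]
scatter(A, K, [v + 1 for v in loaded])  # A[K[]] = A[K[]] + 1
```

Elements not named in the index vector stay untouched, which is what makes this pattern useful for sparse data.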

Strided access : indices run from 0 to the vector length at a fixed step

Variables for Vector processor

Convoy

A set of vector instructions that could potentially execute together (no structural hazards between them)

Chime

Unit of time taken to execute one convoy (m convoys take m chimes, about m × n clock cycles for vector length n)

Chaining

Allows a vector operation to start as soon as the individual elements of its vector source operands become available (with chaining, a RAW dependence can occur inside one convoy - an element is available when execution finishes, without waiting for write-back to finish)
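The three terms fit together in a simple timing estimate. The sketch below is a rough model under stated assumptions: a new convoy starts only when an instruction needs a functional unit already in use (a structural hazard), chaining lets dependent instructions share a convoy, and start-up latency is ignored. The unit names and the instruction list for DAXPY are illustrative assumptions.

```python
def build_convoys(instructions):
    """Greedily group (name, unit) pairs; a reused unit starts a new convoy."""
    convoys = []
    current, used_units = [], set()
    for name, unit in instructions:
        if unit in used_units:  # structural hazard: close the current convoy
            convoys.append(current)
            current, used_units = [], set()
        current.append(name)
        used_units.add(unit)
    if current:
        convoys.append(current)
    return convoys


def estimate_cycles(convoys, vector_length):
    """One chime per convoy, about vector_length cycles per chime."""
    return len(convoys) * vector_length


daxpy_code = [
    ("LV", "load_store"),
    ("MULVS.D", "multiply"),
    ("LV", "load_store"),
    ("ADDVV.D", "add"),
    ("SV", "load_store"),
]
convoys = build_convoys(daxpy_code)
cycles = estimate_cycles(convoys, vector_length=64)
```

Under these assumptions the five instructions form three convoys, so the loop takes 3 chimes, or about 192 cycles for 64 elements. The single load/store unit is what forces the split.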

Real Application

Intel defines a word as 16 bits

MMX - bad decision : aliasing the MMX registers onto the FPU registers for compatibility

- MMX data types - all 64 bits

1. packed byte : 8 bytes packed into 64 bits
2. packed word : 4 words packed into 64 bits
3. packed double-word : 2 double-words packed into 64 bits
4. packed quad-word : one 64-bit quantity
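The packed-byte layout can be emulated with plain integers. The sketch below assumes lane 0 sits in the least significant byte and uses wraparound arithmetic, similar in spirit to the MMX PADDB instruction; the helper names are hypothetical.

```python
def pack_bytes(values):
    """Pack 8 byte values into one 64-bit integer, lane 0 in the low byte."""
    if len(values) != 8:
        raise ValueError("a packed-byte value holds exactly 8 bytes")
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack_bytes(word):
    """Split a 64-bit integer back into its 8 byte lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(8)]


def packed_add_bytes(a, b):
    """Add lane by lane with wraparound; no carry crosses a lane boundary."""
    sums = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(sums)


a = pack_bytes([1, 2, 3, 4, 5, 6, 7, 255])
b = pack_bytes([1] * 8)
result = unpack_bytes(packed_add_bytes(a, b))
```

The last lane wraps from 255 to 0 without disturbing its neighbour, which is the difference between a packed add and an ordinary 64-bit add.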

SSE (1 to 3) - no aliasing

Adds eight 128-bit registers

Allows SIMD operations on packed single-precision floating-point numbers