Foundation AI Final

Sear 469, Blue, S9

1c i) Typo: the answer is 21/5, as many of you obtained. I mixed up 2 and 4 when typing.

Week 1

iid: Independent means that one event's outcome provides no information about another event's outcome. Identically distributed means that subsets sampled from different parts of the dataset follow the same distribution (identical parameters). A deterministic relationship between events means they are not independent.

MNAR (Missing Not At Random)

MAR (Missing At Random)

MCAR (Missing Completely At Random)

Data Imputation

Models make a lot of assumptions and many of these assumptions are not met on test data.

The variable types define how we conduct data pre-processing

Data Scaling

Data Format: JSON is schema-flexible, semi-structured data

Week 2

A one-unit change in a feature multiplies the odds by a constant factor (the odds ratio; for a logistic regression coefficient β this factor is e^β).
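The odds statement above can be checked numerically. This is a minimal sketch assuming a one-feature logistic regression; the intercept and coefficient values are hypothetical and chosen only for illustration.

```python
import math

def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)

def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters (assumptions, not from the course material).
intercept = -1.0
beta = 0.7

x = 2.0
p_before = sigmoid(intercept + beta * x)
p_after = sigmoid(intercept + beta * (x + 1.0))

# The ratio of the odds after and before a one-unit increase equals exp(beta),
# whatever the starting value x is.
odds_factor = odds(p_after) / odds(p_before)
print(odds_factor, math.exp(beta))
```

The probability itself does not change by a constant amount; only the odds are multiplied by the same factor at every starting point.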

Unsupervised learning

Data Clustering

Dimension Reduction

Week 3

K-means Clustering

Week 4

Garbage in, Garbage out

Data Processing

Text Preprocessing

(Text Parsing: paragraph level)

Text Tokenizer: word level

Text Normalization: generalization

(Stop word removal: information level)

(Text Lemmatization: information level)

(Text Stemming: information level)
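The steps listed above can be sketched as a small pipeline. This is an illustrative toy under stated assumptions: the stop-word list is a tiny made-up sample, `naive_stem` is a hypothetical suffix stripper (not the Porter algorithm), and lemmatization is left out because it needs a dictionary.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def parse_paragraphs(text):
    """Text parsing (paragraph level): split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def tokenize(paragraph):
    """Tokenization (word level): extract runs of letters, digits, apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", paragraph)

def normalize(tokens):
    """Normalization: lowercase so 'The' and 'the' become the same token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Stop word removal: drop very common, low-information words."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Toy stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run the steps in order and return one token list per paragraph."""
    result = []
    for paragraph in parse_paragraphs(text):
        tokens = remove_stop_words(normalize(tokenize(paragraph)))
        result.append([naive_stem(t) for t in tokens])
    return result

sample = "The cats are sleeping.\n\nA dog barked in the garden."
print(preprocess(sample))
```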

Bag of words

BPE (Byte Pair Encoding)

Data Augmentation

Week 5

For better generalization performance, we often prefer slightly biased models under the Bias-Variance Trade-off, to avoid the high variance that can occur in unbiased models.

Bias-Variance Trade-off
Overfitting
Underfitting
Risk

Data Leakage

k-fold cross validation: Divide the training data into k mutually exclusive folds. Train on k-1 folds and validate on the remaining fold, rotating so that each fold serves as the validation set once. This gives multiple estimates, from which the standard deviation of the performance metric can be computed, so the overall estimate is more statistically robust. However, it is more computationally expensive than using a single validation set.
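The procedure can be sketched in plain Python. This is a minimal illustration, assuming a toy "model" that predicts the mean of the training targets and mean absolute error as the metric; the helper names are hypothetical, and the data is not shuffled, which a real setup would usually do first.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k mutually exclusive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_validate(ys, k):
    """Return one validation score (MAE) per fold for a toy mean predictor."""
    scores = []
    for val_idx in k_fold_indices(len(ys), k):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(ys)) if i not in held_out]
        # "Training": the toy model memorises the mean of the training targets.
        prediction = statistics.mean([ys[i] for i in train_idx])
        # "Validation": mean absolute error on the held-out fold.
        scores.append(statistics.mean([abs(ys[i] - prediction) for i in val_idx]))
    return scores

ys = [2.0 * x + 1.0 for x in range(10)]
scores = cross_validate(ys, k=5)
print("mean MAE:", statistics.mean(scores))
print("std of MAE across folds:", statistics.stdev(scores))
```

The k scores are what make the standard deviation available; a single validation split would give only one number.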

LOOCV (Leave-one-out cross-validation): Each data point is used as a test instance exactly once, and the remaining data points are used for training. The resulting estimate has low bias but high variance.
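Leave-one-out is the special case of k-fold where k equals the number of data points. A minimal sketch, again assuming a toy model that predicts the mean of the training targets (the function name is hypothetical):

```python
def loocv_errors(ys):
    """Hold out each point exactly once and record the absolute error."""
    errors = []
    for i in range(len(ys)):
        train = ys[:i] + ys[i + 1:]           # every point except the i-th
        prediction = sum(train) / len(train)  # toy model: mean of training targets
        errors.append(abs(ys[i] - prediction))
    return errors

ys = [1.0, 2.0, 3.0, 4.0]
errors = loocv_errors(ys)
print(len(errors), sum(errors) / len(errors))
```

With n data points the model is fit n times, which is why the cost grows quickly on large datasets.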

Confusion Matrix

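F1-score: 2 × Precision × Recall / (Precision + Recall)

Accuracy: diagonal / all (sum of the confusion-matrix diagonal divided by the sum of all entries)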

Precision = TP / (TP + FP) (note: numerator fixed)

Recall score = TP / (TP + FN) (note: denominator fixed)
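The metrics above can be computed directly from the four cells of a binary confusion matrix. This is a small sketch with made-up counts; returning 0.0 when a denominator is zero is a convention assumed here, not something the notes specify.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total                        # diagonal / all
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # among predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # among actual positives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
m = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

The recall denominator TP + FN is the number of actual positives, so it stays the same as the prediction threshold moves.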

Recall Precision Tradeoff

ROC Curve

For a random classifier, the precision-recall tradeoff shows that precision converges to the positive-class ratio of the dataset, while recall can range from 0 to 1 depending on how the prediction threshold changes the proportions in the confusion matrix.
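This behaviour can be checked with expected counts. The sketch assumes a classifier that labels each sample positive with probability q, independently of the true label; the dataset sizes are hypothetical, and the function works with expected confusion-matrix entries rather than a random simulation.

```python
def random_classifier_expected(n_pos, n_neg, q):
    """Precision and recall from the expected counts of a random classifier."""
    tp = q * n_pos          # positives that happen to be predicted positive
    fp = q * n_neg          # negatives that happen to be predicted positive
    fn = (1 - q) * n_pos    # positives that happen to be predicted negative
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical dataset: 200 positives, 800 negatives (positive ratio 0.2).
for q in (0.1, 0.5, 0.9):
    print(q, random_classifier_expected(200, 800, q))
```

Precision stays at the positive ratio for every q, while recall equals q, so moving the threshold only moves recall.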

t-test

p-value

MRR (Mean Reciprocal Rank)

Mean Average Precision (MAP)

NDCG (Normalized Discounted Cumulative Gain)