UCL Applied AI Test

LFPY4

Part 1

Multiple choice. One or more options might be right.

Part 2

Conceptual questions (based on the slides and lectures)

Ex. Explain the difference between underfitting and overfitting.

Part 3

Code questions (small pieces of code based on the lab sessions)

Ex. Check the piece of code below and correct any mistakes you might find (note that the code might be correct or have mistakes)

Lecture 1

The goal of supervised learning is to learn a function that maps feature vectors (input) to labels (output) based on example input-output pairs (training data). The labels can be discrete categories (classification) or continuous values (regression).

Unsupervised learning is used with unlabeled data, with the aim of finding some underlying structure/pattern in the data that can be used for tasks such as clustering, outlier/anomaly detection and dimensionality reduction.

Reinforcement learning is concerned with how intelligent agents take actions in an environment in order to maximize the notion of cumulative reward.

Feature extraction (feature engineering) → Pre-processing

Feature extraction: the process of transforming raw data into numerical features that can be processed while preserving the information in the original data set

Ex. Bag-of-words, spectrogram of a signal using the short-time Fourier transform

Features: also called variables

Examples: also called samples, data points

Data pre-processing usually refers to the addition, deletion or transformation of training set data. Different models have different sensitivity to the characteristics of the data they are given:

Tree-based models are usually insensitive to the data characteristics

Regression models might be sensitive to the data characteristics

Ex. Centering, Scaling, Data transformations (Rank transformation, Non-linear transformations)

Avoid leaking information during pre-processing: the pre-processing steps must be included inside cross-validation, so that their parameters are estimated on the training folds only.
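A minimal sketch of what "pre-processing inside cross-validation" means for standardization. The helper names (`fit_scaler`, `transform`, `k_fold_indices`) and the toy data are assumptions made for illustration, not code from the labs. The centering and scaling parameters are learned from the training fold only and then applied to the held-out fold.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn centering and scaling parameters from training values only."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # guard against zero variance
    return mu, sigma

def transform(column, mu, sigma):
    """Apply previously learned parameters to any split."""
    return [(x - mu) / sigma for x in column]

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds."""
    fold_size = n // k
    for f in range(k):
        start = f * fold_size
        stop = n if f == k - 1 else start + fold_size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test

values = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0]  # toy single-feature data (hypothetical)
scaled_folds = []
for train_idx, test_idx in k_fold_indices(len(values), k=3):
    # Parameters come from the training fold only, never from the test fold.
    mu, sigma = fit_scaler([values[i] for i in train_idx])
    scaled_test = transform([values[i] for i in test_idx], mu, sigma)
    scaled_folds.append((mu, scaled_test))
```

Computing the mean and standard deviation once on all six values before splitting would let the outlier in the test fold influence the training data, which is the leak the note warns about.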

Estimators fit/memorize and scalers manage data scale

One-hot encoding is redundant (possible to drop one) and can introduce co-linearity.

The choice of which column to drop matters for penalized models. Keeping all columns can make the model more interpretable, but the resulting co-linearity then needs additional treatment.
For sparse data, only scaling is recommended, not centering. Be careful with sparse data: subtracting anything will make the data dense.
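A small sketch of one-hot encoding with and without dropping one column. The function `one_hot` and the colour data are hypothetical. With all columns kept, every row sums to 1, which is the redundancy that introduces co-linearity with an intercept.

```python
def one_hot(values, drop_first=False):
    """Return (column_names, rows) for a one-hot encoding of a categorical list."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

colours = ["red", "green", "blue", "green"]  # hypothetical categorical feature

full_cols, full = one_hot(colours)
# Every row of the full encoding sums to 1: the columns are linearly dependent.

reduced_cols, reduced = one_hot(colours, drop_first=True)
# "blue" is now the reference category, encoded as all zeros.
```

Which category becomes the reference (here simply the first in sorted order) is an arbitrary choice in this sketch.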

Target Encoding encodes based on an estimate of the average target values for observations belonging to the category

Useful for high cardinality categorical features
Instead of many one-hot variables, one response encoded variable
Ex. zipcode = 12345 → encoded as the average house price in that area
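The zipcode example can be sketched as follows. The function names, the prices and the fallback to the global mean for unseen categories are assumptions made for illustration. The encoding should be fitted on training data only, since it uses the target.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Map each category to the mean target observed for it in the training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {c: sums[c] / counts[c] for c in sums}
    return encoding, global_mean

def apply_target_encoding(categories, encoding, fallback):
    """Replace categories by their encoded value; unseen categories get the fallback."""
    return [encoding.get(c, fallback) for c in categories]

# Hypothetical training data: zipcode and house price.
zipcodes = ["12345", "12345", "67890"]
prices = [300.0, 500.0, 200.0]

encoding, global_mean = fit_target_encoding(zipcodes, prices)
encoded = apply_target_encoding(["12345", "67890", "99999"], encoding, global_mean)
```

One numeric column replaces what would otherwise be one indicator column per zipcode.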

Informative missing: missing data is related to the outcome -> bias in the model

Data imputation uses information in the training set features to estimate missing values, for example with:
Median imputation, or
K-nearest neighbor imputation
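A minimal sketch of median imputation, assuming missing values are represented by `None` (a choice made here for illustration). The fill value is estimated from the training column and reused unchanged on the test column.

```python
from statistics import median

def fit_median(train_column):
    """Estimate the fill value from the observed training values only."""
    observed = [x for x in train_column if x is not None]
    return median(observed)

def impute(column, fill_value):
    """Replace missing entries (None) with the fill value."""
    return [fill_value if x is None else x for x in column]

train = [2.0, None, 4.0, 10.0]  # hypothetical feature with a missing value
test = [None, 3.0]

fill = fit_median(train)
train_imputed = impute(train, fill)
test_imputed = impute(test, fill)
```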

Lecture 2

The aim of machine learning is building a model that works well on novel data. Training error is not a good estimator of the test error.

Model complexity and overfitting are related to the bias-variance trade-off. Expected test error (risk) can be decomposed into three terms: noise, bias and variance.
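For reference, the standard form of this decomposition under squared loss is sketched below. The notation is assumed rather than taken from the slides: data are generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\hat{f}$ is the model fitted on a random training set.

```latex
\underbrace{\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]}_{\text{risk}}
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
```

The expectations are taken over training sets and noise, at a fixed test input $x$.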

High variance (test error is much higher than training error) → More training data, Reduce model complexity,
Bagging

High bias (training error is higher than the desired error) → Use a more complex model, Add features,
Boosting

Model selection and model assessment should be done using different partitions of the data

Model selection estimates the performance of different models in order to choose the best one.

Model assessment (model evaluation): having chosen a final model, estimating its generalization.

Cross-validation is an option when the sample size is limited, although it can be slow. The CV score used for model selection does not give a true estimate of generalization performance. Common mistake: the dataset was not split into train and test sets before CV.

Nested cross-validation does not yield a single model. Model performance is the mean performance across the outer folds.

Stratified K-fold ensures that the relative class frequencies in each fold reflect the relative class frequencies in the whole dataset. Grouped data have a group structure: we can assume that data within a group are more correlated than data across groups. In GroupKFold, each group is entirely in either the training set or the test set. To account for group structure, matching/stratification should be done across classes (in classification problems) and across folds.
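A simplified sketch of the group-wise splitting idea: every sample of a given group lands on the same side of the split. The function `group_k_fold` and the subject labels are hypothetical. The round-robin assignment of groups to folds is an assumption of this sketch and is not how scikit-learn's `GroupKFold` balances fold sizes.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) so that no group is split across train and test."""
    unique = sorted(set(groups))
    for f in range(k):
        test_groups = set(unique[f::k])  # every k-th group goes to fold f
        test = [i for i, g in enumerate(groups) if g in test_groups]
        train = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train, test

# Hypothetical group labels, e.g. one letter per subject.
subjects = ["a", "a", "b", "b", "c", "c", "d"]
splits = list(group_k_fold(subjects, k=2))

for train_idx, test_idx in splits:
    train_groups = {subjects[i] for i in train_idx}
    test_groups = {subjects[i] for i in test_idx}
    assert train_groups.isdisjoint(test_groups)  # no group appears on both sides
```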

Accuracy misleads for imbalanced classes, does not take into account uncertainty and cannot be optimized with gradient-based methods. Balanced accuracy is an average of the accuracy per class.

Precision: “How many of those predicted positive are actually positive?”

Recall: “How many of those which are actually positive are correctly predicted as positive?”

Specificity: “How many of those which are actually negative are correctly predicted as negative?”
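The metrics above can be computed from the four confusion-matrix counts. The sketch below uses hypothetical labels with a 2:8 class imbalance, and takes balanced accuracy for the binary case as the mean of recall and specificity. Plain accuracy looks good here even though half of the positives are missed.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical imbalanced example: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted positives, how many are positive
recall = tp / (tp + fn)             # of actual positives, how many were found
specificity = tn / (tn + fp)        # of actual negatives, how many were found
balanced_accuracy = (recall + specificity) / 2
```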

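Soft classifiers output a score or probability rather than a hard label. Varying the decision threshold on that score gives the ROC curve, summarized by the ROC AUC (a multiclass ROC AUC also exists).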

Common metrics for regression include mean squared error (MSE), mean absolute error (MAE) and R².

Lecture 3

Simply minimizing the training loss can lead to overfitting. A regularization term is usually added to the objective function to prevent overfitting and to solve ill-posed problems (p > N).
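A penalized objective of this kind is commonly written in the following generic form. The notation is assumed here rather than taken from the lecture: $L$ is the loss, $f(x; w)$ the model with weights $w$, $\Omega$ the penalty and $\lambda \ge 0$ its strength.

```latex
\min_{w} \; \sum_{i=1}^{N} L\bigl(y_i, f(x_i; w)\bigr) \; + \; \lambda \, \Omega(w),
\qquad
\Omega(w) = \lVert w \rVert_2^2 \;\; \text{(L2)}
\quad \text{or} \quad
\Omega(w) = \lVert w \rVert_1 \;\; \text{(L1)}
```

Setting $\lambda = 0$ recovers plain training-loss minimization; larger $\lambda$ trades training fit for smaller weights.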

The L2-norm penalty heavily penalizes large weights but only slightly penalizes small weights. It is differentiable everywhere, while the L1-norm is non-differentiable at zero.
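A small numeric sketch of the contrast between the two penalties. The function names are hypothetical, and the value 0 returned at a zero weight is just one valid choice of subgradient for the L1 penalty.

```python
def l1_penalty(w):
    """Sum of absolute weights."""
    return sum(abs(x) for x in w)

def l2_penalty(w):
    """Sum of squared weights."""
    return sum(x * x for x in w)

def l2_gradient(w):
    """Gradient of the squared L2 penalty: shrinks towards zero as the weight does."""
    return [2 * x for x in w]

def l1_subgradient(w):
    """Sign of each weight; at exactly zero any value in [-1, 1] is valid, 0 is used here."""
    return [(x > 0) - (x < 0) for x in w]

small, large = [0.1], [10.0]

# A weight 100 times larger costs 100 times more under L1 but 10,000 times more under L2.
l1_ratio = l1_penalty(large) / l1_penalty(small)
l2_ratio = l2_penalty(large) / l2_penalty(small)
```

The constant-magnitude L1 subgradient keeps pushing small weights all the way to zero, which is why L1 tends to produce sparse solutions.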

The SVM is known as a maximum-margin classifier. The kernel methodology provides a way of investigating general types of relationships in the data.

The dual representation, with proper regularization and the kernel method, enables an efficient solution when p > N, as the complexity of the problem depends on the number of examples N instead of on the number of input dimensions p.

Lecture 4

Decision tree. A key advantage of the recursive binary tree is that it is interpretable and intuitive.

How do we find the splitting variables and splitting points? Finding the best binary partition in terms of minimizing the sum of the squares is generally computationally infeasible. Solution: greedy partitioning.
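One greedy step can be sketched as an exhaustive search over candidate thresholds of a single variable, choosing the one that minimizes the summed squared error of the two child nodes. The function names and toy data are hypothetical. A full tree would repeat this over all variables and then recurse into each child.

```python
def sum_of_squares(values):
    """Squared deviation from the node mean (the node's prediction)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, cost) minimizing the children's total sum of squares."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        cost = sum_of_squares(left) + sum_of_squares(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical one-feature regression data with an obvious step.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, cost = best_split(x, y)
```

The search is greedy because each split is chosen to be best locally, without checking whether a different split now would allow better splits later.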

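Tree size is a hyper-parameter determining the model complexity. Pre-pruning: stop splitting when a split does not sufficiently improve the sum of squares or when the minimum node size is reached. Post-pruning: grow a large tree and then merge nodes back (for example with cost-complexity pruning).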

The importance of a feature is computed as the (normalized) total reduction of the impurity criterion brought by that feature. It is also known as the Gini importance (or mean decreased impurity)

Big Challenge: Instability, High-variance, Non-linear

Ensemble methods: voting, bagging (averaging) and boosting (combining the outputs of many weak learners to produce a strong learner). Bagging can dramatically reduce the variance of unstable procedures like trees. AdaBoost reweights the training data at each iteration.

Bagging trees leads to a reduction in variance but not in bias. The idea in random forests is to improve the variance reduction of bagging by reducing the correlation between trees. This is achieved through random selection of the input variables/features. 1) For each tree: pick a bootstrap sample of the data. 2) For each split: pick a random sample of the features. The method builds a large collection of de-correlated trees and averages them.
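The two sources of randomness (a bootstrap sample per tree, a random feature subset per split) can be sketched without building any trees. All names and sizes below are hypothetical, and the tree fitting itself is deliberately left out.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Sample row indices with replacement, same size as the data."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, m, rng):
    """Sample m distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)  # seeded only to make the sketch reproducible
n_samples, n_features, m = 8, 6, 2
n_trees, n_splits = 3, 2  # hypothetical sizes

forest_plan = []
for _ in range(n_trees):
    rows = bootstrap_indices(n_samples, rng)  # step 1: one bootstrap sample per tree
    # step 2: a fresh feature subset is drawn for every split of that tree
    split_features = [random_feature_subset(n_features, m, rng) for _ in range(n_splits)]
    forest_plan.append((rows, split_features))
```

Because each split sees only m of the features, different trees are forced to use different variables, which lowers the correlation between their predictions.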

Gradient boosting is a generalization of boosting to arbitrary loss functions. It usually outperforms random forests but needs careful tuning.

Lecture 5

Semantic segmentation: classifying each pixel in an image into a predefined category

Instance segmentation: semantic segmentation that also distinguishes different instances of the same class

Traditional segmentation methods are sensitive to noise and illumination variations, have difficulty handling complex scenes and overlapping objects, and are mostly based on low-level features (color, intensity, texture, …).

A sliding-window approach is highly inefficient, which motivated fully convolutional neural networks; these are memory-intensive and computationally expensive, however. An encoder-decoder architecture works with a compact representation.

Resolution enhancement, Depth estimation

Lecture 6

Model Interpretation

global explanation: identifying features relevant for the prediction

local explanation: interpretability refers to the ability to extract information about specific predictions to justify why a model produced certain output

Intrinsic interpretability: Sparse linear models are considered interpretable due to their structure.

Post hoc interpretability: Complex models, such as neural networks, need post hoc methods applied after training to enable interpretability

What features are relevant? Linear models: coefficients/weights, Tree models: feature importance.

Structured sparse models allow the incorporation of domain knowledge through additional spatial and temporal constraints. For example, when each coefficient of a linear model is associated with a location, structured sparse models use this 3D structure to define particular types of regularization that take into account the location of the features.

Total variation (TV) regularization favors solutions that have a constant value in contiguous regions (piece-wise constant clusters).

Correlation among features might make coefficients uninterpretable. L1 regularization might pick up a random feature from a correlated group

Feature selection is the process of selecting a subset of relevant features for use in model construction. Its aims are data reduction, data understanding and improved performance.

Filter methods (measure to score/rank the features) / Wrapper methods (predictive model to score/rank the features) / Embedded methods (as part of the model construction process)

Forward selection: starts with an empty set and progressively adds the features that improve performance

Backward elimination: starts with all the features and progressively eliminate the least useful ones
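Forward selection can be sketched as a greedy loop around any scoring function. The function `forward_selection`, the feature names and the toy scorer are hypothetical. In practice the score would be a cross-validated performance estimate, which makes this a wrapper method.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedily add the feature that most improves score(subset); stop when none helps."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score - best_score <= min_gain:
            break  # no remaining feature improves performance
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical additive scorer: two useful features and one that hurts slightly.
CONTRIBUTION = {"age": 0.3, "income": 0.2, "noise": -0.05}

def toy_score(subset):
    return 0.5 + sum(CONTRIBUTION[f] for f in subset)

selected, final_score = forward_selection(["noise", "income", "age"], toy_score)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly remove the feature whose removal costs the least.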

Explaining the model ≠ explaining the data. Model inspection only provides information about the model, and the model might not accurately reflect the data.

Lecture 7

PCA is an unsupervised machine learning method, which allows exploring covariance of features in the data

The number of PCs can be decided based on the singular values, the explained variance, or cross-validation when there is a downstream task.
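A sketch of choosing the number of components from the explained variance. It relies on the fact that the variance captured by each component is proportional to the squared singular value of the centered data matrix. The singular values and the 90% threshold are hypothetical.

```python
def explained_variance_ratio(singular_values):
    """Fraction of total variance captured by each component."""
    variances = [s * s for s in singular_values]
    total = sum(variances)
    return [v / total for v in variances]

def n_components_for(singular_values, threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio(singular_values), start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    return len(singular_values)

# Hypothetical singular values, sorted in decreasing order.
singular_values = [4.0, 2.0, 1.0, 1.0]

ratios = explained_variance_ratio(singular_values)
k_90 = n_components_for(singular_values, threshold=0.90)
```

With a downstream supervised task, the number of components would instead be treated as a hyper-parameter and chosen by cross-validation.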

transforms the original variables into new, uncorrelated features while PCA retains the variance of the original data

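Non-negative matrix factorization (NMF) components can be interpreted (the signs are meaningful, whereas the signs of PCA components carry no meaning). The optimization procedure is non-convex and requires initialization. The learned components are neither orthogonal nor naturally ordered.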

Independent component analysis (ICA) decomposes X into independent non-Gaussian components. ICA finds the independent components by maximizing the statistical independence of the estimated components (minimization of mutual information, maximization of non-Gaussianity).

Manifold learning: the axes do not have an interpretation in the input space, and the methods cannot transform new data (except ). t-SNE has a cost function that is not convex.

Cross-decomposition methods (PLS, CCA) find associations across multi-view data. PLS and CCA extract directions of maximum covariance and maximum correlation respectively. PLS/CCA models can be used to identify latent dimensions of cross-modality association.

If one view has much higher variance, it can dominate the PLS solution.

CCA is more sensitive to the direction of the relationships across views.

When p > N, CCA degenerates because the within-view covariance matrix cannot be inverted.

Regularized cross-decomposition methods provide solutions when p > N that are robust to noise and can be sparse. However, a common assumption is that X^T X and Y^T Y are identity matrices.

Lecture 8

Clustering for Data exploration, Data partition and Unsupervised feature extraction.

Choice of similarity (or dissimilarity) measure is somewhat similar to the specification of a loss function in supervised learning. It depends on domain knowledge.

Searching over all possible cluster assignments is a combinatorial problem that is infeasible in practice, so iterative greedy descent methods such as K-means clustering are used instead.
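A minimal one-dimensional sketch of the K-means iteration, alternating an assignment step and a centroid update until nothing changes. The function name, the data and the initial centroids are hypothetical. The result depends on the initialization, since the procedure only reaches a local optimum.

```python
def k_means_1d(points, centroids, n_iter=10):
    """Alternate assignment and update steps; stop early once centroids are stable."""
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical data with two obvious groups
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
```

Each of the two steps can only decrease the within-cluster sum of squares, which is why the iteration is a descent method.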

Hierarchical clustering: K does not need to be specified. Two strategies: 1) agglomerative (merge) and 2) divisive (split). Linkage criteria include single linkage (smallest distance), average linkage (mean distance), complete linkage (smallest maximum distance) and Ward (variance increase). Some linkage criteria can lead to very imbalanced cluster sizes.

GMMs assign points to clusters softly, but may still reach only a locally optimal solution.

The Calinski-Harabasz index is higher for small within-cluster variance and large between-cluster variance.

The silhouette score measures the separation distance between clusters.