## Search This Blog

### Machine Learning Notes 3

Let $J(w,b)$ be any cost function of two variables. If we want to be specific, we can assume that $J(w,b)$ is the usual mean squared error cost function from the univariate linear regression model.

1) The algorithm first sets two arbitrary values into $w$ and $b$ (e.g. $w := 0,\ b := 0$)

2) Then a loop begins. At each step of that loop the algorithm applies these two updates simultaneously

2.1) $w := w - \alpha \cdot \frac{\partial J}{\partial w}(w,b)$

2.2) $b := b - \alpha \cdot \frac{\partial J}{\partial b}(w,b)$

Here $\alpha$ is some small positive number e.g. $\alpha=0.01$

This number $\alpha$ is called the learning rate. It controls the size of the step which the algorithm takes downhill in each iteration of the loop.

3) The loop repeats step 2) until the values of $(w,b)$ eventually settle at some point $(w_0, b_0)$. Settle means that after a number of steps, the values of $w$ and $b$ no longer change much with each additional step of the algorithm. At that point, the algorithm exits the loop. The point $(w_0, b_0)$ is the point where $J(w,b)$ has a local minimum. We also say that the algorithm converges to the point $(w_0, b_0)$ after a number of steps.

Note 1: A correct implementation ensures that the updates in step 2) are done simultaneously. What does that mean? That means in the two right hand sides of the assignments 2.1) and 2.2), we should make sure we are using "the old values" of $w$ and $b$. Usually this is implemented in the following way.

$tmp\_w := w - \alpha \cdot \frac{\partial J}{\partial w}(w,b)$

$tmp\_b := b - \alpha \cdot \frac{\partial J}{\partial b}(w,b)$

$w := tmp\_w$

$b := tmp\_b$

Note 2: Choosing a good value for $\alpha$ is very important. If $\alpha$ is too small, the gradient descent algorithm will work correctly but might be too slow. If $\alpha$ is too big, the gradient descent algorithm may overshoot (as they call it), i.e. it may miss the local minimum, i.e. the algorithm may diverge instead of converge (this is usually detected by the condition that after several iterations of the algorithm, the cost $J(w,b)$ actually gets bigger instead of getting smaller).

### Real Analysis - Basic Differentiation Rules

$C' = 0$, where $C$ is a constant

$(C \cdot x)' = C$, where $C$ is a constant

$(x^n)' = n \cdot x^{n-1}$, $x \in \mathbb{R}$, $n \in \mathbb{Z}$

$(x^a)' = a \cdot x^{a-1}$, $x \in \mathbb{R}$, $x \gt 0$, $a \in \mathbb{R}$

$(\sin{x})' = \cos{x}$

$(\cos{x})' = -\sin{x}$

$(\tan{x})' = (tg\ {x})' = 1 / \cos^2{x}$

$(\cot{x})' = (cotg\ {x})' = -1 / \sin^2{x}$

$(\arcsin{x})' = 1 / \sqrt{1-x^2}$

$(\arccos{x})' = -1 / \sqrt{1-x^2}$

$(arctan\ {x})' = (arctg\ {x})' = 1 / (1 + x^2)$

$(arccot\ {x})' = (arccotg\ {x})' = -1 / (1 + x^2)$

$(e^x)' = e^x$

$(a^x)' = a^x \cdot \ln{a}\ \ \$ ($a \gt 0$ is a constant)

$(\ln{x})' = 1 / x$

$(\log_{a}x)' = 1 / (x \cdot \ln{a})\ \ \$ ($a \gt 0$ is a constant)

$$(u + v)' = u' + v'$$

$$(u - v)' = u' - v'$$

$$(uv)' = u' \cdot v + v' \cdot u$$

$$\left(\frac{u}{v}\right)' = \frac{u'v - v'u}{v^2}$$

$$(f(g(x)))' = f'(g(x)) \cdot g'(x)$$

### Machine Learning Notes 2

Supervised learning terminology/notation

• Training set is the dataset used to train the model
• $x$ - the input variable, the input feature
• $y$ - the output variable, the target variable
• $m$ - the total number of the training examples (training data points)
• $(x^{(i)}, y^{(i)})$ - the $i$-th training example (for $i=1,2,\dots, m$); here the expression $(i)$ is just a superscript, it does not denote exponentiation
• $f$ - hypothesis function, prediction function; also $f$ is called the model
• $\widehat{y} = f(x)$, where $x$ is the input variable and $\widehat{y}$ is the prediction for the target variable $y$
• $\widehat{y}^{(i)} = f(x^{(i)})$, where $x^{(i)}$ is the $i$-th training example, and $\widehat{y}^{(i)}$ is the prediction value corresponding to $x^{(i)}$
• Cost function is a function which estimates how well a given model predicts the values for the target variable $y$; this is a function which measures how well the model fits the training data. Cost function is a general concept which is applicable to any supervised learning model.

Linear regression with one variable (univariate linear regression)

When we have only one input variable we call this model/algorithm univariate linear regression. In a univariate linear regression model, we have $f(x) = wx + b$, where $x$ is the input variable, and $w, b$ are numbers which are called parameters of the model. So $f(x)$ is a linear function of $x$. It's also written sometimes as $f_{w,\ b}(x) = wx + b$. By varying $w$ and $b$ we get different linear models. The parameter $w$ is called the slope, and $b$ is called the y-intercept because $b=f(0)$ and so $b$ is the point where the graph of $f(x)$ intercepts the $y$ axis.

When we have more than one input variable (more than one feature) the model is called multivariate linear regression. In this case we try to predict the values of the target variable $y$ based on several input variables $x_1, x_2, \dots, x_k$. Here we have $k \in \mathbb{N}, k \ge 2$.

In a univariate linear regression model the most commonly used cost function is the mean squared error cost function.

$$J(w,b) = \frac{1}{m} \cdot \sum_{i=1}^m (\widehat{y}^{(i)} - y^{(i)})^2$$

In the last formula

$\widehat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$

for $i=1,2,\dots,m$.

So for the cost function we also get the following expression

$$J(w,b) = \frac{1}{m} \cdot \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})^2$$

Note that $J(w,b) \ge 0$

In practice an additional division by 2 is performed in the formulas given above. This is done just for practical reasons, to make further computations simpler. In this way we get the final version of our cost function.

$$J(w,b) = \frac{1}{2m} \cdot \sum_{i=1}^m (\widehat{y}^{(i)} - y^{(i)})^2$$

$$J(w,b) = \frac{1}{2m} \cdot \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})^2$$

Then the task is to find values of $w,b$ such that the value of $J(w,b)$ is as small as possible.
The smaller the value of $J(w,b)$, the better the model.

All this info about univariate linear regression can be summarized with the following picture
(pictures credits go to Andrew Ng's Coursera ML Course).

### An application of the AM-GM inequality

I found this problem and its solution in a FB group dedicated to maths. Here they are.

Problem

If $a,b,c \gt 0$ prove that

$$\frac{a^2 + b^2}{c} + \frac{b^2 + c^2}{a} + \frac{c^2 + a^2}{b} \ge 2(a+b+c)$$

Solution:

By the AM-GM inequality we have that

$$\frac{a^2}{c} + c + \frac{b^2}{c} + c \ge 2\sqrt{a^2} + 2\sqrt{b^2} = 2(a+b)$$

$$\frac{b^2}{a} + a + \frac{c^2}{a} + a \ge 2\sqrt{b^2} + 2\sqrt{c^2} = 2(b+c)$$

$$\frac{c^2}{b} + b + \frac{a^2}{b} + b \ge 2\sqrt{c^2} + 2\sqrt{a^2} = 2(c+a)$$

When we sum up these 3 inequalities we get the desired inequality.

Note: This solution is due to Ghimisi Dumitrel, Professor at Colegiul National Ionita Asan, Caracal.

### Machine Learning Notes 1

1) Machine learning (ML) related definitions

Definition A (Athur Samuel, 1959): Machine learning is the field of study which gives computers the ability to learn without being explicitly programmed.

Definition B (Tom Mitchell, 1998): Well-posed machine learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

2) Types of ML algorithms

In general any ML algorithm can be classified into one of the following two categories

• Supervised learning (the "right answers" are given in the initial data set; i.e. the data set is labeled)
• Regression: trying to predict the values of some quantity which we view as a continuous function
• Classification: trying to predict the values of some quantity whose possible values form a small finite discrete set (that set could be e.g. {Yes, No}; {0,1,2,3}; {Cat, Dog, Other}; etc.); trying to classify a set of data points into a small finite number of classes
• Unsupervised learning (the "right answers" are not given in the initial data set; i.e. the data set is unlabeled); in unsupervised learning the task is to automatically find some structure in the unlabeled data set that is given
• Clustering: a data set is given, the task is to break it down into several separate clusters (i.e. into several separate disjoint data sets). Examples: 1) Google news (news articles from various sources are grouped together if they are about the same story); 2) Market segmentation: find market segments/clusters in a dataset which contains data about all customers of a given company.
• Non-clustering:
• Independent component analysis (used in signal processing). Example: separating different voices (or say different audio tracks) out of one (or a few) given chaotic audio recordings (see e.g. (1) cocktail party effect, (2) cocktail party problem and algorithm). The so-called "cocktail party algorithm" allows you to find structure in a chaotic environment (identifying individual voices and music from a mesh of sounds at a cocktail party).
• Anomaly detection (detect unusual data points in a given dataset). Anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior (Source: Wikipedia).
• Dimensionality reduction (compress data points using fewer numbers). Dimensionality reduction, or dimension reduction, is the transformation of data from a higher-dimensional space into a lower-dimensional space so that the lower-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension (Source: Wikipedia).
3) GNU Octave

This is a powerful framework for quick prototyping of solutions/algorithms for ML problems (see here: GNU Octave)

### NumPy operators and their corresponding NumPy universal functions (ufuncs)

NumPy operators and their corresponding NumPy universal functions (ufuncs)

This table was compiled from the 3 tables shown on pages 53, 72, 75 of the book Python Data Science Handbook by Jake VanderPlas. I put it here just for my own reference.

 Arithmetic Operator Equivalent ufunc Description + np.add Addition (e.g., 1 + 1 = 2) - np.subtract Subtraction (e.g., 3 - 2 = 1) - np.negative Unary negation (e.g., -2) * np.multiply Multiplication (e.g., 2 * 3 = 6) / np.divide Division (e.g., 3 / 2 = 1.5) // np.floor_divide Floor division (e.g., 3 // 2 = 1) ** np.power Exponentiation (e.g., 2 ** 3 = 8) % np.mod Modulus/remainder (e.g., 9 % 4 = 1) Comparison Operator Equivalent ufunc == np.equal != np.not_equal < np.less <= np.less_equal > np.greater >= np.greater_equal Logical Operator Equivalent ufunc See also & np.bitwise_and np.logical_and | np.bitwise_or np.logical_or ^ np.bitwise_xor np.logical_xor ~ np.bitwise_not np.logical_not

### NumPy rules of broadcasting

These rules are listed in the book Python Data Science Handbook by Jake VanderPlas (see page 65).

Broadcasting in NumPy is simply a set of rules for applying binary universal functions (like addition, subtraction, multiplication, etc.) on NumPy arrays of different sizes.

Here are the NumPy broadcasting rules.

• Rule 1: If the two arrays differ in their number of dimensions, the shape of the
one with fewer dimensions is
padded with ones on its leading (left) side.

• Rule 2: If the shape of the two arrays does not match in any dimension, the array
with shape equal to 1 in that dimension is stretched to match the other shape.

• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is
raised.