1. Introduction
1.1 Welcome
1.2 What is Machine Learning
Machine Learning definition
- Tom Mitchell: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
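- Example: for a spam filter, T is classifying emails as spam or not spam, P is the fraction of emails correctly classified, and E is watching the user label emails as spam or not spam.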
1.3 Supervised Learning
(We're given the "right answer" for each example in the data set.)
regression problems and classification problems
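(e.g., predicting house prices from square footage is a regression problem, since the output is continuous-valued; predicting whether a tumor is malignant or benign is a classification problem, since the output is discrete-valued)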
1.4 Unsupervised Learning
(We're given data that doesn't have any labels, or where every example has the same label, i.e., effectively no labels. Given such a data set, an unsupervised learning algorithm might decide that the data lives in two different clusters and break it into two separate clusters. This is called a clustering algorithm.)
SVD function (Singular Value Decomposition)
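As a minimal illustration of the clustering idea (not part of the lecture), the sketch below assumes NumPy and scikit-learn are available and lets a clustering algorithm split unlabeled points into two groups on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points (no y labels are given)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=3.0, scale=0.5, size=(50, 2))])

# The clustering algorithm decides by itself that the data lives in two clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])  # cluster assignments, e.g. [0 0 0 0 0] [1 1 1 1 1]
```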
2. Linear regression with one variable
2.1 Model representation
Linear regression with one variable (univariate linear regression)
Notation:
m = Number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
(x,y) = one training example
$(x^{(i)},y^{(i)})$ = the $i^{th}$ training example
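A tiny concrete example of this notation in NumPy (the numbers form a hypothetical housing data set with m = 4; note that code arrays are 0-indexed while the math above is 1-indexed):

```python
import numpy as np

# m = 4 training examples: x = house size (feet^2), y = price (in 1000s of dollars)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # "input" variable / feature
y = np.array([460.0, 232.0, 315.0, 178.0])      # "output" / "target" variable
m = len(x)                                      # number of training examples

# (x^(i), y^(i)) is the i-th training example; the 1st example (i = 1 in the math)
# lives at index 0 in the arrays.
i = 1
print(f"({x[i-1]}, {y[i-1]})")  # -> (2104.0, 460.0)
```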
2.2 Cost function
Hypothesis: $h_\theta(x)=\theta_0+\theta_1 x$
Parameters: $\theta_0,\theta_1$
Cost function (squared error cost function): $J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$ (the most commonly used cost function for regression problems)
Goal: minimize $J(\theta_0,\theta_1)$
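To make the formula concrete, here is a small NumPy sketch of the squared-error cost (the function name and toy numbers are illustrative, not from the lecture):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) for univariate linear regression."""
    m = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for all i at once
    squared_errors = (predictions - y) ** 2
    return squared_errors.sum() / (2 * m)

# Example: cost of the hypothesis h(x) = 0 + 0.2*x on the toy housing data
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(compute_cost(x, y, 0.0, 0.2))
```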
2.3 Cost function intuition (1)
2.4 Cost function intuition (2)
2.5 Gradient descent
Gradient descent algorithm
repeat until convergence {
$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ (for $j=0$ and $j=1$; update $\theta_0$ and $\theta_1$ simultaneously)
}
$\alpha$: the learning rate
Correct: Simultaneous update
temp0$:=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)$
temp1$:=\theta_1-\alpha\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)$
$\theta_0:=$temp0
$\theta_1:=$temp1
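In code, the point of the simultaneous update is that both temporary values are computed from the old $(\theta_0,\theta_1)$ before either parameter is overwritten. A minimal sketch, where `dJ_dtheta0` and `dJ_dtheta1` stand for hypothetical functions returning the two partial derivatives:

```python
def step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One gradient-descent step with a correct simultaneous update."""
    # Both temporaries use the old theta0 and theta1.
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # Only after both temps are computed do we overwrite the parameters.
    return temp0, temp1

# An incorrect version would assign theta0 first, so the theta1 update would
# use the *new* theta0; that is not a simultaneous update.
```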
2.6 Gradient descent intuition
$\theta_1:=\theta_1-\alpha\frac{\mathrm{d}}{\mathrm{d}\theta_1}J(\theta_1)$
Gradient descent can converge to a local minimum even with the learning rate $\alpha$ held fixed.
As we approach a local minimum, the derivative term gets smaller, so gradient descent automatically takes smaller steps; there is no need to decrease $\alpha$ over time.
2.7 Gradient descent for linear regression
Gradient descent algorithm
repeat until convergence {
$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$
(for $j=0$ and $j=1$)
}
Linear Regression Model
$h_\theta(x)=\theta_0+\theta_1 x$
$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$
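Substituting the model into $J$ and differentiating gives the concrete update rules used in each step:
$\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})$
$\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})\,x^{(i)}$
so the updates become:
$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})$
$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$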
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
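Putting these pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression in NumPy (values such as `alpha` and `num_iters` are illustrative):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.

    Every iteration uses all m training examples ("batch")."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        predictions = theta0 + theta1 * x
        errors = predictions - y
        # Simultaneous update via temporaries
        temp0 = theta0 - alpha * errors.sum() / m
        temp1 = theta1 - alpha * (errors * x).sum() / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Example on a tiny data set where the true relationship is y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
print(batch_gradient_descent(x, y, alpha=0.1, num_iters=2000))  # ~ (1.0, 2.0)
```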
3. Linear Algebra review (optional)
3.1 Matrices and vectors
A vector is a matrix that has only one column, i.e., an $n\times 1$ matrix.
$y_i = i^{th}$ element
Math: 1-indexed vectors
Machine learning: 0-indexed vectors
(By convention, most people use upper case to refer to matrices and lower case to refer to scalars, i.e., raw numbers, or to vectors.)
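A quick NumPy illustration of these conventions (note that code, unlike the math notation above, is 0-indexed):

```python
import numpy as np

A = np.array([[1, 2],      # A is a 4x2 matrix (4 rows, 2 columns)
              [3, 4],
              [5, 6],
              [7, 8]])
y = np.array([460, 232, 315, 178])   # y is a 4-dimensional vector (a 4x1 matrix)

print(A.shape)   # (4, 2)
print(A[0, 1])   # row 1, column 2 in 1-indexed math is A[0, 1] in code -> 2
print(y[0])      # y_1 in 1-indexed math is y[0] in code -> 460
```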