Hubert Wang

I am
Hubert Wang

Wechat Official Account
Find fun things here!

Machine Learning 07: Kernel Regression and SVM

Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017

Regression

Wish to learn f: X—>Y, where Y is real, given {}

Approach:

  1. choose some parameterized form for P(Y|X; θ)

  ( θ is the vector of parameters)

  1. derive learning algorithm as MCLE or MAP estimate for θ

1. Choose parameterized form for P(Y|X; θ)

Let’s assume Y is some deterministic f(X), plus random noise

y = f(x) + ε, where ε ~ N(0, σ)

Therefore Y is a random variable that follows the distribution

p(y|x) = N(f(x), σ)

and the expected value of y for any given x is f(x)

2. Derive learning algorithm as MCLE or MAP estimate for θ

Consider Linear Regression

p(y|x; W) = N(w0 + w1x, σ^2)

How can we learn W from the training data?

Learn Maximum Conditional Likelihood Estimate!

How about MAP instead of MLE estimate?

If we assume zero-mean Gaussian prior on each w_i:

This is also called Hinge Loss.

But what if y=f(x) is non-linear?

Let's start by learning Kernel Functions

Kernel Functions

  • Kernel functions provide a way to manipulate data as though it were projected into a higher dimensional space, by operating on it in its original space
  • This leads to efficient algorithms that learn
    • linear regression in implicit high dimension spaces, corresponding to non-linear regression in their actual input space
    • linear decision surfaces in implicit high dimension spaces, corresponding to nonlinear classifiers in their actual input space
  • And is a key component of algorithms such as
    • Support Vector Machines
    • kernel regression
    • kernel PCA
    • kernel CCA

Regularized Linear Regression

Wish to learn f: X —> Y, where X = , Y real-valued

Note that in this course, we use to represent inner product of matrix (ie. dot product).

Vectors, Data Points, Inner Products

We draw the x and w in coordinate system. if U1 = unit vector [1, 0], and U2 = unit vector [0, 1], then w can be represented by U1 and U2. In the same way, if we use two training samples x1 and x2 to replace U1 and U2, w can be represented by two training examples x1 and x2: w = x1α1 + x2α2.

It means that the weight we are going to learn can be represented by training examples' variables.

It also means that we can represent the weight in the sapce spanned by the variables <x1 … xn>.

Primary form and Dual form

With the idea above, we can transform the primary form to the dual form, which is only represented by x without w appears.

Key idea: Dual form expresses linear regression f(x) in terms of dot products with training examples.

-> more efficient computation when features per example > # examps

-> turns training and application of learned model into dot products, which allows using kernel functions

[]

The derive of dual form is very simple: take the truth that w can be represented by x, and put this into f(x) = <x, w>.

Note that XX^T is actually is a matrix of m x m.

Kernel

  • Many learning tasks are framed as optimization problems
  • Primal and Dual formulations of optimization problems
  • Dual version framed in terms of dot products between x’s
  • Kernel functions k(x,y) allow calculating dot products <Φ(x),Φ(y)> without bothering to project x into Φ(x)
  • Leads to major efficiencies, and ability to use very high dimensional (virtual) feature spaces

Example: Quadratic Kernel

Suppose we have data originally in 2D, but project it into 3D using Φ(x).

But we can use the following kernel function to calculate inner products in the projected 3D space, in terms of operations in the 2D space. (It actually avoid transforming into 3D calculation!)

And use it to train and apply our regression function, never leaving 2D space:

Some Common Mercer Kernels

Properties of Kernels

Theorem (Mercer):

K is a kernel if and only if:

  • K is symmetrics
  • For any sey of training points x1, x2, … , xm, and for any a1, a2, … , am ∈ R, we have:

Kernel Closure Properties

Easily create new kernels using basic ones!

Simple Kernel Based Classifier

Consider finding the centres of mass of positive and negative examples and classifying a test point by measuring which is closest

We can express as a function of kernel evaluations.

Support Vector Machines

Two key ideas:

  1. new loss function (maximize margin)
  2. use kernels to achieve non-linear boundaries

The decision surface is decided by the margin. The margin is the distance from decision surface to the closest points.

How did we got the value of margin in the following diagram?

Note that the W is always perpendicular to the decision surface! Shown as below.

The margin is gotton by maximize the margin as shown in the blue rectangle above.

Linear hyperplane is fully defined by “support vectors” *

Moving other points a little doesn’t effect the decision boundary

only need to store the support vectors to predict labels of new points

How many support vectors in linearly separable case, given d dimensions? <= d+1

SVM Decision Surface using Gaussian Kernel

Although SVM can only get linear decision surface in 2D space, but with kernel, we can map it to other spaces (like the Gaussian space above), in these cases, the decision surface can be very complex.

SVM with Soft Margins

What if the examples are not perfectly splitable? For example:

Allow "error" in classification.

It seems like to add a "penalty" item to the minimize goal if the example is within hyperplane or even in the other side.

SVM Soft Margin Decision Surface using Gaussian Kernel

Comparing the performance of SVM and Logistic Regression. If no kernel is used, SVM and LR performs very similar because both of them give linear decision boundary. But if you use kernel with SVM, it usually performs better since non-linear decision surface is given.

SVM Summary

  • Objective: maximize margin between decision surface and data
  • Primal and dual formulations
    • dual represents classifier decision in terms of support vectors
  • Kernel SVM’s
    • learn linear decision surface in high dimension space, working in original low dimension space
  • Handling noisy data: soft margin “slack variables”
    • again primal and dual forms
  • SVM algorithm: Quadratic Program optimization
    • single global optimum
2590
TOC
Comments
Write a Comment