
Commit 874d6d1

update week 13
1 parent f5df0b2 commit 874d6d1


6 files changed: +951 -488 lines changed

doc/pub/week13/html/week13-bs.html

Lines changed: 171 additions & 69 deletions
Large diffs are not rendered by default.

doc/pub/week13/html/week13-reveal.html

Lines changed: 121 additions & 54 deletions
@@ -473,6 +473,10 @@ <h2 id="classical-support-vector-machines-overarching-aims-best-visualized-with-
 intuitive way in terms of lines in a two-dimensional space separating
 the two classes (see figure below).
 </p>
+</section>
+
+<section>
+<h2 id="basic-mathematics">Basic mathematics </h2>
 
 <p>The basic mathematics behind the SVM is however less familiar to most of us.
 It relies on the definition of hyperplanes and the
@@ -601,10 +605,15 @@ <h2 id="what-is-a-hyperplane">What is a hyperplane? </h2>
 distinctly classifies the data points.
 </p>
 
-<p>In a \( p \)-dimensional space, a hyperplane is what we call an affine subspace of dimension of \( p-1 \).
-As an example, in two dimension, a hyperplane is simply as straight line while in three dimensions it is
-a two-dimensional subspace, or stated simply, a plane.
+<p>In a \( p \)-dimensional space, a hyperplane is what we call an affine
+subspace of dimension of \( p-1 \). As an example, in two dimension, a
+hyperplane is simply as straight line while in three dimensions it is
+a two-dimensional subspace, or stated simply, a plane.
 </p>
+</section>
+
+<section>
+<h2 id="two-dimensional-case">Two-dimensional case </h2>
 
 <p>In two dimensions, with the variables \( x_1 \) and \( x_2 \), the hyperplane is defined as</p>
 <p>&nbsp;<br>
@@ -647,6 +656,10 @@ <h2 id="a-p-dimensional-space-of-features">A \( p \)-dimensional space of featur
 \boldsymbol{x}_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \dots \\ \dots \\ x_{ip} \end{bmatrix}.
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="more-details">More details </h2>
 
 <p>If the above condition is not met for a given vector \( \boldsymbol{x}_i \) we have </p>
 <p>&nbsp;<br>
@@ -675,7 +688,10 @@ <h2 id="a-p-dimensional-space-of-features">A \( p \)-dimensional space of featur
 $$
 <p>&nbsp;<br>
 
-<p>When we try to separate hyperplanes, if it exists, we can use it to construct a natural classifier: a test observation is assigned a given class depending on which side of the hyperplane it is located.</p>
+<p>When we try to separate hyperplanes, if it exists, we can use it to
+construct a natural classifier: a test observation is assigned a given
+class depending on which side of the hyperplane it is located.
+</p>
 </section>
 
 <section>
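
The classifier described in the rewrapped paragraph above amounts to checking the sign of \( \boldsymbol{w}^T\boldsymbol{x}+b \). A minimal Python sketch of that decision rule, with illustrative values for \( \boldsymbol{w} \) and \( b \) (they are assumptions, not fitted parameters from the slides):

import numpy as np

# Decision rule sketched above: assign a class by the side of the
# hyperplane w^T x + b = 0 on which the observation falls.
w = np.array([1.0, -2.0])   # illustrative normal vector, not a fitted one
b = 0.5                     # illustrative intercept

def classify(x):
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([2.0, 0.0])))   # -> 1
print(classify(np.array([-2.0, 1.0])))  # -> -1
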
@@ -690,6 +706,10 @@ <h2 id="the-two-dimensional-case">The two-dimensional case </h2>
 some reinforcement so that future data points can be classified with
 more confidence.
 </p>
+</section>
+
+<section>
+<h2 id="linear-classifier">Linear classifier </h2>
 
 <p>What a linear classifier attempts to accomplish is to split the
 feature space into two half spaces by placing a hyperplane between the
@@ -740,7 +760,11 @@ <h2 id="first-attempt-at-a-minimization-approach">First attempt at a minimizatio
 $$
 <p>&nbsp;<br>
 
-<p>We could now for example define all values \( y_i =1 \) as misclassified in case we have \( \boldsymbol{w}^T\boldsymbol{x}_i+b < 0 \) and the opposite if we have \( y_i=-1 \). Taking the derivatives gives us</p>
+<p>We could now for example define all values \( y_i =1 \) as misclassified
+in case we have \( \boldsymbol{w}^T\boldsymbol{x}_i+b < 0 \) and the opposite if we have
+\( y_i=-1 \). Taking the derivatives gives us
+</p>
+
 <p>&nbsp;<br>
 $$
 \frac{\partial C}{\partial b} = -\sum_{i\in M} y_i,
@@ -776,41 +800,15 @@ <h2 id="solving-the-equations">Solving the equations </h2>
 </section>
 
 <section>
-<h2 id="code-example">Code Example </h2>
+<h2 id="problems-with-the-simpler-approach">Problems with the Simpler Approach </h2>
 
 <p>The equations we discussed above can be coded rather easily (the
-framework is similar to what we developed for logistic
-regression). We are going to set up a simple case with two classes only and we want to find a line which separates them the best possible way.
+framework is similar to what has been developed for say logistic
+regression).
 </p>
 
-<!-- code=python (!bc pycod) typeset with pygments style "perldoc" -->
-<div class="cell border-box-sizing code_cell rendered">
-<div class="input">
-<div class="inner_cell">
-<div class="input_area">
-<div class="highlight" style="background: #eeeedd">
-<pre style="font-size: 80%; line-height: 125%;">
-</pre>
-</div>
-</div>
-</div>
-</div>
-<div class="output_wrapper">
-<div class="output">
-<div class="output_area">
-<div class="output_subarea output_stream output_stdout output_text">
-</div>
-</div>
-</div>
-</div>
-</div>
-</section>
-
-<section>
-<h2 id="problems-with-the-simpler-approach">Problems with the Simpler Approach </h2>
-
 <p>There are however problems with this approach, although it looks
-pretty straightforward to implement. When running the above code, we see that we can easily end up with many diffeent lines which separate the two classes.
+pretty straightforward to implement. When running such a code, we see that we can easily end up with many diffeent lines which separate the two classes.
 </p>
 
 <p>For small
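
The hunk above drops the empty code cell but keeps the claim that these equations are easy to code. A minimal sketch of the misclassification-driven updates implied by the derivative \( \partial C/\partial b=-\sum_{i\in M}y_i \) quoted earlier (together with the analogous gradient with respect to \( \boldsymbol{w} \)); the synthetic data and step size are my own assumptions, not material from the repository. Rerunning it with different random initializations illustrates the "many different separating lines" problem the new slide points out.

import numpy as np

rng = np.random.default_rng()

# Synthetic, linearly separable two-class data (illustrative only).
n = 50
X = np.vstack([rng.normal(loc=[-2, -2], size=(n, 2)),
               rng.normal(loc=[ 2,  2], size=(n, 2))])
y = np.hstack([-np.ones(n), np.ones(n)])

# Gradient-descent sketch of the misclassification cost discussed above:
# dC/db = -sum_{i in M} y_i and dC/dw = -sum_{i in M} y_i x_i,
# where M is the set of currently misclassified points.
w = rng.normal(size=2)
b = 0.0
eta = 0.01
for _ in range(1000):
    margins = y * (X @ w + b)
    M = margins < 0                 # misclassified points
    if not M.any():
        break
    w += eta * (y[M, None] * X[M]).sum(axis=0)
    b += eta * y[M].sum()

print("separating line:", w, b)
# A different random initialization typically ends in a different,
# equally valid separating line, which is the ambiguity noted above.
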
@@ -839,8 +837,12 @@ <h2 id="a-better-approach">A better approach </h2>
 <p>&nbsp;<br>
 
 <p>All points are thus at a signed distance from the decision boundary defined by the line \( L \). The parameters \( b \) and \( w_1 \) and \( w_2 \) define this line. </p>
+</section>
 
-<p>We seek thus the largest value \( M \) defined by</p>
+<section>
+<h2 id="largest-value-m">Largest value \( M \) </h2>
+
+<p>We seek the largest value \( M \) defined by</p>
 <p>&nbsp;<br>
 $$
 \frac{1}{\vert \vert \boldsymbol{w}\vert\vert}y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b) \geq M \hspace{0.1cm}\forall i=1,2,\dots, n,
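
A step worth spelling out after the hunk above (standard support-vector-machine material, stated here for clarity rather than quoted from the unshown lines): the constraint is unchanged if \( \boldsymbol{w} \) and \( b \) are rescaled, so one may fix \( \vert\vert\boldsymbol{w}\vert\vert=1/M \), and maximizing the margin \( M \) becomes

$$
\min_{\boldsymbol{w},b}\frac{1}{2}\vert\vert\boldsymbol{w}\vert\vert^2 \quad \mathrm{subject\ to}\quad y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b) \geq 1 \hspace{0.1cm}\forall i=1,2,\dots,n.
$$
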
@@ -895,6 +897,10 @@ <h2 id="a-quick-reminder-on-lagrangian-multipliers">A quick Reminder on Lagrangi
 df = \frac{\partial f}{\partial x}dx+\frac{\partial f}{\partial y}dy+\frac{\partial f}{\partial z}dz.
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="not-all-variables-are-indepenent-of-each-other">Not all variables are indepenent of each other </h2>
 
 <p>In many problems the variables \( x,y,z \) are often subject to constraints (such as those above for the margin)
 so that they are no longer all independent. It is possible at least in principle to use each
@@ -918,6 +924,10 @@ <h2 id="a-quick-reminder-on-lagrangian-multipliers">A quick Reminder on Lagrangi
 d\phi = \frac{\partial \phi}{\partial x}dx+\frac{\partial \phi}{\partial y}dy+\frac{\partial \phi}{\partial z}dz =0.
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="only-two-independent-variables">Only two independent variables </h2>
 
 <p>Now we cannot set anymore</p>
 <p>&nbsp;<br>
@@ -958,6 +968,10 @@ <h2 id="adding-the-multiplier">Adding the Multiplier </h2>
 \frac{\partial f}{\partial z}+\lambda\frac{\partial \phi}{\partial z} =0.
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="more-details">More details </h2>
 
 <p>We need to remember that we took \( dx \) and \( dy \) to be arbitrary and thus we must have</p>
 <p>&nbsp;<br>
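
A one-line worked example of the multiplier recipe in the hunk above (my own illustration, not taken from the slides): maximize \( f(x,y)=xy \) subject to \( \phi(x,y)=x+y-1=0 \). Setting the partial derivatives of \( f+\lambda\phi \) to zero gives

$$
y+\lambda=0,\quad x+\lambda=0 \quad\Rightarrow\quad x=y=\frac{1}{2},\quad \lambda=-\frac{1}{2},
$$

so the constrained maximum is \( f=1/4 \).
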
@@ -987,7 +1001,7 @@ <h2 id="adding-the-multiplier">Adding the Multiplier </h2>
 </section>
 
 <section>
-<h2 id="setting-up-the-problem">Setting up the Problem </h2>
+<h2 id="setting-up-the-problem">Setting up the problem </h2>
 <p>In order to solve the above problem, we define the following Lagrangian function to be minimized </p>
 <p>&nbsp;<br>
 $$
@@ -996,6 +1010,10 @@ <h2 id="setting-up-the-problem">Setting up the Problem </h2>
 <p>&nbsp;<br>
 
 <p>where \( \lambda_i \) is a so-called Lagrange multiplier subject to the condition \( \lambda_i \geq 0 \).</p>
+</section>
+
+<section>
+<h2 id="setting-up-derivaties">Setting up derivaties </h2>
 
 <p>Taking the derivatives with respect to \( b \) and \( \boldsymbol{w} \) we obtain </p>
 <p>&nbsp;<br>
@@ -1018,9 +1036,13 @@ <h2 id="setting-up-the-problem">Setting up the Problem </h2>
 $$
 <p>&nbsp;<br>
 
-<p>subject to the constraints \( \lambda_i\geq 0 \) and \( \sum_i\lambda_iy_i=0 \).
-We must in addition satisfy the <a href="https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions" target="_blank">Karush-Kuhn-Tucker</a> (KKT) condition
-</p>
+<p>subject to the constraints \( \lambda_i\geq 0 \) and \( \sum_i\lambda_iy_i=0 \). </p>
+</section>
+
+<section>
+<h2 id="karush-kuhn-tucker-condition">Karush-Kuhn-Tucker condition </h2>
+
+<p>We must in addition satisfy the <a href="https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions" target="_blank">Karush-Kuhn-Tucker</a> (KKT) condition</p>
 <p>&nbsp;<br>
 $$
 \lambda_i\left[y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b) -1\right] \hspace{0.1cm}\forall i.
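
A clarifying remark after the KKT hunk above (standard complementary slackness, added here rather than quoted from the unshown lines): the condition reads

$$
\lambda_i\left[y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b) -1\right]=0 \hspace{0.1cm}\forall i,
$$

so for every observation either \( \lambda_i=0 \) or \( y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)=1 \); the points with \( \lambda_i>0 \) lie exactly on the margin and are the support vectors.
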
@@ -1094,7 +1116,10 @@ <h2 id="the-last-steps">The last steps </h2>
 b = \frac{1}{N_s}\sum_{j\in N_s}\left(y_j-\sum_{i=1}^n\lambda_iy_i\boldsymbol{x}_i^T\boldsymbol{x}_j\right).
 $$
 <p>&nbsp;<br>
+</section>
 
+<section>
+<h2 id="classifier-equations">Classifier equations </h2>
 <p>With our hyperplane coefficients we can use our classifier to assign any observation by simply using </p>
 <p>&nbsp;<br>
 $$
@@ -1132,10 +1157,17 @@ <h2 id="a-soft-classifier">A soft classifier </h2>
 $$
 <p>&nbsp;<br>
 
-<p>with the requirement \( \xi_i\geq 0 \). The total violation is now \( \sum_i\xi \).
-The value \( \xi_i \) in the constraint the last constraint corresponds to the amount by which the prediction
-\( y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)=1 \) is on the wrong side of its margin. Hence by bounding the sum \( \sum_i \xi_i \),
-we bound the total amount by which predictions fall on the wrong side of their margins.
+<p>with the requirement \( \xi_i\geq 0 \). The total violation is now \( \sum_i\xi \). </p>
+</section>
+
+<section>
+<h2 id="misclassification">Misclassification </h2>
+
+<p>The value \( \xi_i \) in the constraint the last constraint corresponds to
+the amount by which the prediction \( y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)=1 \) is on
+the wrong side of its margin. Hence by bounding the sum \( \sum_i
+\xi_i \), we bound the total amount by which predictions fall on the
+wrong side of their margins.
 </p>
 
 <p>Misclassifications occur when \( \xi_i > 1 \). Thus bounding the total sum by some value \( C \) bounds in turn the total number of
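
One way to make the role of the slack variables in the hunk above concrete (a standard identity, added for clarity): the smallest admissible slack for each observation is

$$
\xi_i = \max\left(0, 1-y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)\right),
$$

which is zero for points on the correct side of their margin, between zero and one for points inside the margin but still correctly classified, and larger than one for misclassified points.
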
@@ -1161,6 +1193,10 @@ <h2 id="soft-optmization-problem">Soft optmization problem </h2>
 <p>&nbsp;<br>
 
 <p>with the requirement \( \xi_i\geq 0 \).</p>
+</section>
+
+<section>
+<h2 id="derivatives-with-respect-to-b-and-boldsymbol-w">Derivatives with respect to \( b \) and \( \boldsymbol{w} \) </h2>
 
 <p>Taking the derivatives with respect to \( b \) and \( \boldsymbol{w} \) we obtain </p>
 <p>&nbsp;<br>
@@ -1182,6 +1218,10 @@ <h2 id="soft-optmization-problem">Soft optmization problem </h2>
 \lambda_i = C-\gamma_i \hspace{0.1cm}\forall i.
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="new-constraints">New constraints </h2>
 
 <p>Inserting these constraints into the equation for \( {\cal L} \) we obtain the same equation as before</p>
 <p>&nbsp;<br>
@@ -1306,7 +1346,9 @@ <h2 id="kernels-and-non-linearity">Kernels and non-linearity </h2>
 <section>
 <h2 id="the-equations">The equations </h2>
 
-<p>Suppose we define a polynomial transformation of degree two only (we continue to live in a plane with \( x_i \) and \( y_i \) as variables)</p>
+<p>Suppose we define a polynomial transformation of degree two only (we
+continue to live in a plane with \( x_i \) and \( y_i \) as variables)
+</p>
 <p>&nbsp;<br>
 $$
 z = \phi(x_i) =\left(x_i^2, y_i^2, \sqrt{2}x_iy_i\right).
@@ -1327,9 +1369,13 @@ <h2 id="the-equations">The equations </h2>
 $$
 <p>&nbsp;<br>
 
-<p>from which we also find \( b \).
-To compute \( \boldsymbol{z}_i^T\boldsymbol{z}_j \) we define the kernel \( K(\boldsymbol{x}_i,\boldsymbol{x}_j) \) as
-</p>
+<p>from which we also find \( b \).</p>
+</section>
+
+<section>
+<h2 id="defining-the-kernel">Defining the kernel </h2>
+
+<p>To compute \( \boldsymbol{z}_i^T\boldsymbol{z}_j \) we define the kernel \( K(\boldsymbol{x}_i,\boldsymbol{x}_j) \) as</p>
 <p>&nbsp;<br>
 $$
 K(\boldsymbol{x}_i,\boldsymbol{x}_j)=\boldsymbol{z}_i^T\boldsymbol{z}_j= \phi(\boldsymbol{x}_i)^T\phi(\boldsymbol{x}_j).
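
A quick numerical check of the kernel identity in the hunk above, using the degree-two map \( \phi(x)=(x_1^2,x_2^2,\sqrt{2}x_1x_2) \) from the slides; the two data points are arbitrary values chosen for illustration.

import numpy as np

def phi(x):
    # Degree-two polynomial map from the slides: (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

lhs = phi(xi) @ phi(xj)        # z_i^T z_j via the explicit feature map
rhs = (xi @ xj)**2             # (x_i^T x_j)^2, the kernel evaluated directly
print(lhs, rhs)                # both give 1.0 here
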
@@ -1342,6 +1388,10 @@ <h2 id="the-equations">The equations </h2>
 K(\boldsymbol{x}_i,\boldsymbol{x}_j)=[x_i^2, y_i^2, \sqrt{2}x_iy_i]^T\begin{bmatrix} x_j^2 \\ y_j^2 \\ \sqrt{2}x_jy_j \end{bmatrix}=x_i^2x_j^2+2x_ix_jy_iy_j+y_i^2y_j^2.
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="kernel-trick">Kernel trick </h2>
 
 <p>We note that this is nothing but the dot product of the two original
 vectors \( (\boldsymbol{x}_i^T\boldsymbol{x}_j)^2 \). Instead of thus computing the
@@ -1381,7 +1431,10 @@ <h2 id="the-problem-to-solve">The problem to solve </h2>
 \( \boldsymbol{y}=[y_1,y_2,\dots,y_n] \).
 If we add the slack constants this leads to the additional constraint \( 0\leq \lambda_i \leq C \).
 </p>
+</section>
 
+<section>
+<h2 id="convex-optimization">Convex optimization </h2>
 <p>We can rewrite this (see the solutions below) in terms of a convex optimization problem of the type</p>
 <p>&nbsp;<br>
 $$
@@ -1399,7 +1452,7 @@
 </section>
 
 <section>
-<h2 id="different-kernels-and-mercer-s-theorem">Different kernels and Mercer's theorem </h2>
+<h2 id="different-kernels">Different kernels </h2>
 
 <p>There are several popular kernels being used. These are</p>
 <ol>
@@ -1410,6 +1463,10 @@ <h2 id="different-kernels-and-mercer-s-theorem">Different kernels and Mercer's t
 </ol>
 <p>
 <p>and many other ones.</p>
+</section>
+
+<section>
+<h2 id="mercer-s-theorem">Mercer's theorem </h2>
 
 <p>An important theorem for us is <a href="https://en.wikipedia.org/wiki/Mercer%27s_theorem" target="_blank">Mercer's
 theorem</a>. The
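
For completeness, the kernels listed in the hunk above are available off the shelf; a minimal sketch using scikit-learn's SVC (the library and the toy data are my assumptions, they do not appear in this diff):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1.0).astype(int)   # a non-linear boundary

# Compare the linear, polynomial and radial-basis-function kernels.
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, clf.score(X, y))
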
@@ -1666,8 +1723,8 @@ <h2 id="mathematical-optimization-of-convex-functions">Mathematical optimization
 vector \( \boldsymbol{\lambda}=[\lambda_1, \lambda_2,\dots, \lambda_n] \) is the optimization variable we are dealing with.
 </p>
 
-<p>In our case we are particularly interested in a class of optimization problems called convex optmization problems.
-In our discussion on gradient descent methods we discussed at length the definition of a convex function.
+<p>In our case we are particularly interested in a class of optimization
+problems called convex optmization problems.
 </p>
 
 <p>Convex optimization problems play a central role in applied mathematics and we recommend strongly <a href="http://web.stanford.edu/~boyd/cvxbook/" target="_blank">Boyd and Vandenberghe's text on the topics</a>.</p>
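
Since the slides above phrase the soft-margin dual as a convex quadratic program, here is a sketch of how it could be handed to a QP solver; the choice of CVXOPT and the synthetic data are my assumptions, not something shown in this diff.

import numpy as np
from cvxopt import matrix, solvers

# Soft-margin SVM dual as a QP:
#   min_l 1/2 l^T P l - 1^T l   s.t.  0 <= l_i <= C,  y^T l = 0,
# with P_ij = y_i y_j x_i^T x_j (linear kernel), as outlined above.
rng = np.random.default_rng(0)
n = 40
X = np.vstack([rng.normal(-1.5, 1.0, size=(n // 2, 2)),
               rng.normal( 1.5, 1.0, size=(n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])
C = 1.0

K = X @ X.T
P = matrix(np.outer(y, y) * K)
q = matrix(-np.ones(n))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

# Support vectors have lambda_i > 0; points with 0 < lambda_i < C sit on
# the margin and can be used to recover the intercept, as in the slides.
w = ((lam * y)[:, None] * X).sum(axis=0)
sv = lam > 1e-6
on_margin = sv & (lam < C - 1e-6)
pick = on_margin if on_margin.any() else sv
b0 = np.mean(y[pick] - X[pick] @ w)
print("w =", w, "b =", b0)
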
@@ -1740,6 +1797,10 @@ <h2 id="a-simple-example">A simple example </h2>
 \end{align*}
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="rewriting-in-terms-of-vectors-and-matrices">Rewriting in terms of vectors and matrices </h2>
 
 <p>The minimization problem can be rewritten in terms of vectors and matrices as (with \( x \) and \( y \) being the unknowns)</p>
 <p>&nbsp;<br>
@@ -1754,6 +1815,10 @@ <h2 id="a-simple-example">A simple example </h2>
 \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ -1 & -3 \\ 2 & 5 \\ 3 & 4\end{bmatrix}\begin{bmatrix} x \\ y\end{bmatrix} \preceq \begin{bmatrix}0 \\ 0\\ -15 \\ 100 \\ 80\end{bmatrix}.
 $$
 <p>&nbsp;<br>
+</section>
+
+<section>
+<h2 id="rewriting-inequalities">Rewriting inequalities </h2>
 
 <p>We have collapsed all the inequalities into a single matrix \( \boldsymbol{G} \). We see also that our matrix </p>
 <p>&nbsp;<br>
@@ -1835,9 +1900,11 @@ <h2 id="back-to-the-more-realistic-cases">Back to the more realistic cases </h2>
 </section>
 
 <section>
-<h2 id="support-vector-machines-for-regression">Support vector machines for regression </h2>
-
-<p>Material will be added here.</p>
+<h2 id="plans-for-next-week">Plans for next week </h2>
+<ol>
+<p><li> Discussion of quantum support vector machines</li>
+<p><li> Introducing quantum neural networks</li>
+</ol>
 </section>
 
 