\documentclass[preview]{standalone}
\usepackage{header_problems}
\usepackage{header_template}
\begin{document}
\fontsize{12}{15}\selectfont
\Question{Getting Started}
\textbf{Read through this page carefully.} You may typeset your homework in latex or submit neatly handwritten/scanned solutions. Please start each question on a new page. Deliverables:
\begin{enumerate}
\item Submit a PDF of your writeup, \textbf{with an appendix for your code}, to assignment on Gradescope, ``<> Write-Up". If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.
\item If there is code, submit all code needed to reproduce your results, ``<<title>> Code".
\item If there is a test set, submit your test set evaluation results, ``<<title>> Test Set".
\end{enumerate}
After you've submitted your homework, watch out for the self-grade form.
\begin{Parts}
\Part Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
\vspace{15pt}
\framebox(465, 75){}
\Part Please copy the following statement and sign next to it. We just want to make it \textit{extra} clear so that no one inadvertently cheats.
\textit{I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write up.}
\vspace{15pt}
\framebox(465, 75){}
\end{Parts}
\pagebreak
\newcommand{\lag}{\mathcal{L}}
\newcommand{\blam}{\mathbb{\lambda}}
\newcommand{\bnu}{\mathbb{\nu}}
\newcommand{\balpha}{\mathbb{\alpha}}
\Question{SVM with custom margins}
In the lecture, we covered the soft margin SVM. The objective to be optimized over the training set
$\{(\vec x_1, y_1), (\vec x_2, y_2), \dots, (\vec x_n, y_n)\}$ is
\begin{align}
\min_{\w, b, \xi_i} & \frac{1}{2}||\w||^2 + C\sum_{i=1}^{n} \xi_i \\
s.t. \quad & y_i (\w\tran \x_i -b) \geq 1 - \xi_i \quad \forall i \\
& \xi_i \geq 0 \quad \forall i
\end{align}
In this problem, we are interested in a modified version of the soft margin SVM where we have
a custom margin for each of the $n$ data points. In the standard soft margin SVM, we pay a penalty
of $\xi_i$ for each of the data point. In practice, we might not want to treat each training point
equally, since with prior knowledge, we might know that some data points are more important than the
others. There is some connection to weighted least squares.
We formally define the following optimization problem:
\begin{align}
\min_{\w, b, \xi_i} & \frac{1}{2}||\w||^2 + C\sum_{i=1}^{n} \phi_i \xi_i \\
s.t. \quad & y_i (\w\tran \x_i -b) \geq 1 - \xi_i \quad \forall i \\
& \xi_i \geq 0 \quad \forall i
\end{align}
Note that the only difference is that we have a weighting factor $\phi_i > 0$ for each of the slack variables $\xi_i$
in the objective function. $\phi_i$ are some constants given by the prior knowledge, thus they can be treated as
known constants in the optimization problem. Intuitively, this formulation weights each of the violations ($\xi_i$)
differently according to the prior knowledge ($\phi_i$).
\begin{Parts}
\Part For the standard soft margin SVM, we have shown that the constrained optimization problem is equal to the following
unconstrained optimization problem, i.e. regularized empirical risk minimization problem with hinge loss:
\begin{align}
\min_{\w, b} \frac{1}{2} ||\w||^2 + C \sum_{i=1}^{n} \max(1-y_i(\w\tran \x_i - b), 0)
\end{align}
\textbf{What's the corresponding unconstrained optimization problem for the SVM with custom margins?}
\Part The dual of the standard soft margin SVM is:
\begin{align}
\max_{\balpha} & \balpha\tran \mathbf{1} - \frac{1}{2}\balpha\tran \mathbf{Q} \balpha \\
s.t. & \sum_{i=1}^{n} \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq C \quad i=1, \cdots , n
\end{align}
where $\mathbf{Q} = (\diag\y) \X \X^T (\diag \y)$
\textbf{What's the dual form of the SVM with custom margin? Show the derivation steps in detail. }
\Part \textbf{From the dual formulation above, how would you kernelize the SVM with custom margins?
What role does the $\phi_i$ play in the kernelized version? }
\end{Parts}
\Question{Nearest Neighbors, from A to Z}
For this problem, we will use data from the UN to have some fun with the nearest neighbors approach to learning. A lot of the code you will need has been provided for you.
The data we are using is called the ``World Values Survey.'' It consists of survey data collection over several years from almost all countries. The survey asked ``Which of these are most important for you and your family?'' There were $16$ possible responses, including needs like ``Freedom from Discrimination and Persecution'' and ``Better Transport and Roads.'' The data reported is the fraction of responses in each country that chose each option.
We would like to use these $16$ features of each country (the citizen's responses to the survey) to predict that country's HDI (Human Development Index). In reality, the HDI is a complex measure which takes into account lots of data about a country, including factors like life expectancy, education, per capita income, etc. Intuitively though, you might expect citizens of countries with different HDI to have different priorities. For that reason, predicting the HDI from survey data might be a reasonable endeavor.
Note that throughout the problem we will be using RMSE, which stands for Root Mean Squared Error.
\begin{Parts}
\Part (Bonus): \textbf{Fill out the ``Berkeley's S2018 Values Survey.''} The purpose of this is so that you have a sense of how the data was generated, a useful first step in any ML problem. Just for fun, at the end of this problem we will attempt to predict what the HDI of Berkeley would be if it were its own country.
\Part \label{pt:corrs} First, we should do some basic data exploration. \textbf{Compute the correlation of each feature with HDI. Which feature is the most positively correlated with HDI? Which feature is the most negatively correlated with HDI? Which feature is the least correlated with HDI (closest to $0$)?}
\Part \textbf{For each of these three features identified in \ref{pt:corrs} (most positively correlated, most negatively correlated, least correlated), plot ``HDI versus [Feature].''} You will create three plots in total. \textbf{What do you observe?}
\Part Let's visualize the data a bit more. \textbf{Plot the data in its first two PCA dimensions, colored by HDI.} The code to do this has been provided for you.
\Part Now, let's use our first ML technique. \textbf{Use the code provided to train and cross-validate ridge regression to predict a country's HDI from its citizens' world values survey responses.} \textbf{What is the best RMSE?}
\Part Let's try another ML technique. \textbf{Use the code provided to train and cross-validate LASSO regression to predict a country's HDI from its citizens' world values survey responses.} \textbf{What is the best RMSE?}
\Part \textbf{Examine the model returned by LASSO regression (that is, the $16$ feature weights). Does LASSO regression indeed give more $0$ weights?}
\Part In lecture, we covered $k$-Nearest Neighbors for classification problems. We decided that the class of a test point would be the plurality of the classes of the $k$ nearest training points. That algorithm makes sense when the outputs are discrete, so we can vote. Here, the outputs are continuous. \textbf{How would you adapt the $k$ Nearest Neighbors algorithm for a regression problem?}
\Part \label{pt:knncountry} \textbf{Which countries are the $7$ nearest neighbors of the USA (in order)?}
\Part \label{pt:knnplot} The most important meta-parameter of $k$ nearest neighbors is $k$ itself. \textbf{Plot the RMSE of kNN regression versus $k$, where $k$ is the number of neighbors. What is the best value of $k$? What is the RMSE?}
\Part \textbf{Explain your plot in (\ref{pt:knnplot}) in terms of bias and variance.} This is tricky, so take some time to think about it. Think about the spirit of bias and variance more than their precise definitions.
\Part We do not need to give every neighbor an equal weight. Maybe closer neighbors are more relevant. For the sake of this problem, let's weight each neighbor by the inverse of its distance to the test point. \textbf{Plot the RMSE of kNN regression with distance weighting versus $k$, where $k$ is the number of features. What is the best value of $k$? What is the RMSE?}
\Part One of the challenges of $k$ Nearest Neighbors is that it is very sensitive to the scale of the features. For example, if one feature takes on values $0$ or $0.1$ and another takes on values $0$ or $10$, then the nearest neighbors approach will almost certainly pick nearest neighbors according to the second feature. \textbf{Which countries are the $7$ nearest neighbors of the USA after scaling (in order)? Compare your result to \ref{pt:knncountry}.}
\Part \textbf{Add scaling to your $k$ nearest neighbors pipeline (continue to use distance weighting). Plot RMSE versus $k$. What is the best value for $k$? What is the RMSE?}
\Part (Bonus): \textbf{Rather than scaling each feature to have unit variance, explore ways of scaling the features non-uniformly. How much does this help, if at all?}
\Part You have been given a set of test features: countries where the responses to the world values survey are given but the HDI is not known. \textbf{Using the best model developed so far, predict the HDI values of the countries in the test set. Submit your predictions on Gradescope.}
\Part So far we have dealt with the regression problem. Let's take a brief look at classification. A naive classifier is a classifier which disregards the features and just classifies everything as belonging to a single class. \textbf{In any classification problem with $k$ classes, at least what accuracy are we guaranteed to get with the best naive classifier?} (Hint: there are $k$ possible naive classifiers. Use the pigeonhole principle).
\Part \label{pt:pcaclass} We will split countries into two groups: high HDI (more than $0.7$) and low HDI (less than $0.7$). \textbf{Plot the countries by their first two PCA dimensions again, but now color them by class.}
\Part Examine the graph generated in (\ref{pt:pcaclass}). \textbf{How well do you think a linear SVM would do in classification?}
\Part \label{svmintro} We will use an SVM classifier to predict whether a country's HDI is ``high'' or ``low'' based on the responses of their citizens to the World Values Survey. \textbf{Use the code provided to train and cross-validate an SVM classifier using a linear kernel. What is the accuracy of the classifier?}
\Part We are going to modify the classifier from (\ref{svmintro}). \textbf{Add a PCA step and Scaling step to the SVM pipeline. Your hyper-parameter search should now be over all possible dimensions for the PCA reduction. Does the accuracy improve?}
\Part Change the kernel in \ref{svmintro} from linear to ``radial basis function'' (rbf). For this part, do not use PCA or Scaling. \textbf{What is the accuracy?}
\Part Now we are going to use $k$ Nearest Neighbors for the same task. That is, we would like to predict whether a country's HDI is ``high'' or ``low'' based on the responses of their citizens to the World Values Survey. \textbf{Train and cross-validate a $k$ Nearest Neighbors classifier using distance weighting. What is its accuracy? Does scaling help?}
\Part (Bonus): Towards the end of the week, we will post the ``Berkeley's S2018 Values Survey.'' \textbf{If this course were an independent country, what do you predict its HDI would be?}
\Part (Bonus): \textbf{Describe how you would use kNN to revisit the sensor location problem from previous homework. How well do you think it will work?}
\Part (Bonus): \textbf{What did you learn from this problem? Do you have any useful feedback for the problem author?}
\end{Parts}
\newcommand{\DD}{\mathcal{D}}
\newcommand{\E}{\mathbb{E}}
\Question{Stability: A Unified Approach to Generalization for Classification and Regression}
In this problem, we will study how well machine learning algorithms can
generalize from the dataset they have been trained on (the training set)
to an unseen dataset (the test set). Assume all data is generated from
some distribution $\mathcal D$ and we sample a training set
$S\sim \mathcal D^n$ where $\mathcal D^n$ is the distribution of $n$
datapoints drawn i.i.d{.} from $\mathcal D$. In this problem
we consider estimators of the form
\begin{equation}\label{learning_algorithm}
\w(S) = \arg\min_{\vec w} L_S(\vec w) + \lambda \lVert \vec w\rVert^2
\end{equation}
where $L_S(\vec w)$ is the empirical loss function
\begin{align*}
L_S(\vec w) = \frac 1 n \sum_{i=1}^n \ell(\vec w, z_i)\qquad\text{where $z_i = (\vec x_i, y_i)$}
\end{align*}
evaluated on the training data set $S = (z_1, z_2, \dots, z_n)$. We are interested in the loss
\begin{equation}
L_{\DD} (\vec w) = \E_{z\sim \DD} [\ell(\vec w, z)]
\end{equation}
on a yet unseen test dataset drawn from $\DD$.
We call an estimator \textbf{$\epsilon_n$-stable} if
\begin{align*}
\E_{S\sim \DD^{n}, z'\sim \DD, i\sim U(n)} [\ell(\w(S^{(i)}(z')), z_i) - \ell(\w(S), z_i)] \leq \epsilon_n
\end{align*}
where $S = (z_1, z_2, \dots, z_n)$ and $S^{(i)}(z') = (z_1, z_2, \dots, z_{i-1}, z', z_{i+1}, \dots, z_n)$
and $U(n)$ is the uniform distribution on $\{1, \dots, n\}$.
We will show that if an estimator is \textbf{$\epsilon_n$-stable},
then the test error is close to the training error in expectation, namely
\begin{equation}\label{test_error_bound}
\E_{S\sim \DD^n}[L_\DD(\w(S))] \leq \E_{S\sim \DD^n}[L_S(\w(S))] + \epsilon_n
\end{equation}
In the first part of the problem, we will establish this result and in the second
part, we will apply it to show the generalization properties of ridge regression and
soft margin SVMs.
\begin{Parts}
\item We first link stability to generalization via the fundamental property
\begin{equation}
\label{eq:fundaprop}
\E_{S\sim \DD^n} [L_\DD(\w(S)) - L_S(\w(S))] =
\E_{S \sim \DD^{n}, i\sim U(n), z'\in \DD} [\ell(\w(S^{(i)}(z'), z_i) - \ell(\w(S), z_i)]
\end{equation}
where $S = (z_1, \dots, z_n)$ be an iid sequence of samples and $z'$ be another
iid sample and $U(n)$ be the uniform distribution over $\{1, 2, \dots, n\}$. \textbf{Show equation \eqref{eq:fundaprop} and conclude that \eqref{test_error_bound} holds if $A$ is $\epsilon$-stable}.
Conceptual hint: When computing the expectation over both training points and a test point, switching any one training point and the test point gives the same result.\\
Technical hint: Use that $\frac{1}{n}\sum_{i=1}^n f(z_i) = \E_{i\sim U(n)} f(z_i)$.
\item As a simple example, let's walk through the steps of the proof in the simple example where
we estimate the mean
\begin{equation}
\mu(S) = \frac 1 n \sum_{i=1}^n x_i
\end{equation}
of a dataset $S = (x_1, x_2, \dots, x_n)$ with scalars $x_1, \dots, x_n\in \R$.
The loss function in this case is
\begin{equation*}
\ell(\mu, x_i) = \frac 1 2 (\mu - x_i)^2
\end{equation*}
and the regularization parameter is $\lambda = 0$.
\textbf{First, show that}
\begin{equation*}
\ell(\mu(S^{(i)}(z')), x_i) - \ell(\mu(S), x_i) \leq \frac{C \cdot B\cdot \max_i |x_i|}{n}
\end{equation*}
where $|\mu(S)|, |\mu(S^{(i)}(z'))| \leq B$ and $C$ is some numerical constant.
\textbf{Conclude that the generalization error is bounded by}
\begin{equation*}
\E_{S\sim \DD^n} [L_\DD(\mu(S)) - L_S(\mu(S))] \leq
\frac{C \cdot B\cdot \max_i |x_i|}{n}
\end{equation*}
\item We now consider {\bf ridge regression}. The loss function is \eqref{learning_algorithm}
with $\ell(\vec w, z) = \frac 1 2 (\vec w^\top \vec x_i - y_i)^2$. \textbf{First, show that $\ell(\vec w, z)$ is $\beta$-smooth, i. e.}
\begin{align*}
\lVert \nabla \ell(\vec v, z) - \nabla \ell(\vec w, z)\lVert \leq
\beta \lVert \vec v - \vec w\rVert
\end{align*}
with $\beta = \max_i \norm{x_i}$.
\item \textbf{Derive a generalization bound \eqref{test_error_bound} for ridge regression.}
You may use \emph{without proof} that for a $\beta$-smooth loss function,
we have
\begin{equation}\label{smooth_stability}
\E[\ell(\w(S^{(i)}(z')), z_i) - \ell(\w(S), z_i)] \leq \frac{C\beta}{\lambda n}
\end{equation}
for some constant $C$.
\item Show that if $\ell$ is $\rho$-Lipschitz in the first argument, i.e.
\begin{equation}
\ell(\vec w(S^{(i)}(z')), z_i) - \ell(\vec w(S), z_i) \leq \rho \lVert \vec w(S^{(i)}(z')) - \vec w(S) \rVert
\end{equation}
then the learning algorithm \eqref{learning_algorithm} is $\epsilon$-stable with
\begin{align*}
\ell(\vec w(S^{(i)}(z')), z_i) - \ell(\vec w(S), z_i) \leq \frac{2\rho^2}{\lambda n}.
\end{align*}
Hint: First show
\begin{equation}\label{lipschitz_hint}
\lambda \lVert \vec w(S^{(i)}(z')) - \vec w(S)\rVert^2 \leq
\frac{\ell(\vec w(S^{(i)}), z_i) - \ell(\vec w(S), z_i)}{n} +
\frac{\ell(\vec w(S), z') - \ell(\vec w(S^{(i)}), z')}{n}.
\end{equation}
\item We now consider the {\bf soft margin SVMs}. The associated loss function is
the hinge loss $\ell(\vec w, z_i) = \max(0, 1-y_i \vec w^\top \vec x_i)$.
\textbf{First show that $\ell(\vec w, z_i)$ is $\rho$-Lipschitz with $\rho = \max_i\norm{x_i}$.}
\item \textbf{Derive a
generalization bound \eqref{test_error_bound} for soft margin SVMs.}
\end{Parts}
\Question{Your Own Question}
{\bf Write your own question, and provide a thorough solution.}
Writing your own problems is a very important way to really learn
the material. The famous ``Bloom's Taxonomy'' that lists the levels of
learning is: Remember, Understand, Apply, Analyze, Evaluate, and
Create. Using what you know to create is the top-level. We rarely ask
you any HW questions about the lowest level of straight-up
remembering, expecting you to be able to do that yourself. (e.g. make
yourself flashcards) But we don't want the same to be true about the
highest level.
As a practical matter, having some practice at trying to create
problems helps you study for exams much better than simply counting on
solving existing practice problems. This is because thinking about how
to create an interesting problem forces you to really look at the
material from the perspective of those who are going to create the
exams.
Besides, this is fun. If you want to make a boring problem, go
ahead. That is your prerogative. But it is more fun to really engage
with the material, discover something interesting, and then come up
with a problem that walks others down a journey that lets them share
your discovery. You don't have to achieve this every week. But unless
you try every week, it probably won't happen ever.
\end{document}