\documentclass[preview]{standalone}
\usepackage{header_problems}
\usepackage{header_template}
\begin{document}
\fontsize{12}{15}\selectfont
\Question{Getting Started}
\textbf{Read through this page carefully.} You may typeset your homework in \LaTeX{} or submit neatly handwritten and scanned solutions. Please start each question on a new page. Deliverables:
\begin{enumerate}
\item Submit a PDF of your writeup, \textbf{with an appendix for your code}, to the Gradescope assignment ``<<title>> Write-Up''. If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.
\item If there is code, submit all code needed to reproduce your results to ``<<title>> Code''.
\item If there is a test set, submit your test set evaluation results to ``<<title>> Test Set''.
\end{enumerate}
After you've submitted your homework, watch out for the self-grade form.
\begin{Parts}
\Part Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
\vspace{15pt}
\framebox(465, 75){}
\Part Please copy the following statement and sign next to it. We just want to make it \textit{extra} clear so that no one inadvertently cheats.
\textit{I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write up.}
\vspace{15pt}
\framebox(465, 75){}
\end{Parts}
\pagebreak
\Question{$\ell_1$-Regularized Linear Regression}
The $\ell_1$-norm is a popular regularizer used to enhance the robustness of regression models.
Regression with an $\ell_1$-penalty is referred to as Lasso regression;
it promotes sparsity in the resulting solution.
In this problem, we will explore the optimization objective of the Lasso.
Assume the training data points are denoted as the rows of an $n\times d$ matrix
$\X$ and their corresponding output value as an $n\times 1$ vector $\y$.
The parameter vector and its optimal value are represented by $d\times 1$
vectors $\w$ and $\w^*$, respectively.
For the sake of simplicity, assume columns of data have been standardized to have
mean $0$ and variance $1$, and are uncorrelated (i.e. $\X^\top\X=n\I$).
(\emph{We center the data mean to zero, so that the lasso penalty treats
all features similarly. We assume uncorrelated features in order to reason about the Lasso in the
upcoming parts.})
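As a quick numerical sanity check (an illustrative NumPy snippet only, not part of the problem), standardizing each column gives a Gram matrix whose diagonal is exactly $n$; the stronger assumption $\X^\top\X = n\I$ additionally requires the off-diagonal entries to vanish, which generic data will not satisfy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated columns

# Standardize each column to mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# The diagonal of the Gram matrix is now exactly n (up to float error);
# X^T X = n*I would further require the off-diagonals to be zero.
gram = X_std.T @ X_std
print(np.allclose(np.diag(gram), n))  # True
```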
For lasso regression, the optimal parameter vector is given by:
\[
\w^* = \argmin_{\w} \{J_{\lambda}(\w) = \frac{1}{2} \|\y- \X\w\|_2^2 +\lambda \|\w\|_1\},
\]
where $\lambda >0$.
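For concreteness, the objective can be evaluated directly in NumPy (a small illustrative helper, not required by the problem). Note that \texttt{sklearn}'s \texttt{Lasso} minimizes the rescaled objective $\frac{1}{2n}\|\y-\X\w\|_2^2 + \alpha\|\w\|_1$, so its $\alpha$ corresponds to $\lambda/n$ here:

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """J_lambda(w) = 0.5 * ||y - X w||_2^2 + lam * ||w||_1."""
    residual = y - X @ w
    return 0.5 * residual @ residual + lam * np.abs(w).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)

# Sanity check: at w = 0 the penalty vanishes and J reduces to 0.5*||y||^2.
print(np.isclose(lasso_objective(np.zeros(3), X, y, lam=1.0), 0.5 * y @ y))  # True
```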
\begin{Parts}
\Part {\bf Show that for data with uncorrelated features, one
can learn the parameter $w_i$ corresponding to each $i$-th feature
independently from the other features, one at a time, and get
a solution which is equivalent to having learned them all
jointly as we normally do.}
To show this, write $J_{\lambda}(\w)$ in the following form for appropriate functions $g$ and $f$:
\[
J_{\lambda}(\w) = g(\y) + \sum_{i=1}^d f(\X_i,\y,w_i,\lambda)
\]
where $\X_i$ is the $i$-th column of $\X$.
\Part Assume that $w_i^* >0$. {\bf What is the value of $w_i^*$ in this case?}
\Part Assume that $w_i^* <0$. {\bf What is the value of $w_i^*$ in this case?}
\Part From the previous two parts, {\bf what is the condition for $w_i^*$ to be zero?}
\Part Now consider the ridge regression problem where the regularization term is replaced by
$\lambda\|\w\|_2^2$ where the optimal parameter vector is now given by:
\[
\w^* = \argmin_{\w} \{J_{\lambda}(\w) = \frac{1}{2} \|{\y}- \X\w\|_2^2 +\lambda \|\w\|^2_2\},
\]
where $\lambda >0$.
{\bf What is the condition for $w_i^*=0$? How does it differ from the condition
you obtained in the previous part?} Can you see why the $\ell_1$ norm promotes sparsity?
\Part Assume that we have a sparse image vectorized in the vector $\w$
(so $\w$ is a sparse vector).
We have an $n \times d$ Gaussian matrix $\X$ and an $n \times 1$
noise vector $\z$ where $n>1$.
Our measurements take the form $\y = \X\w + \z$.
We want to recover the original image $\w$ from the measurements $\y$ and the matrix $\X$, knowing that this
image is sparse. The fact that $\w$ is sparse suggests using $\ell_1$ regularization.
Use the provided IPython notebook and apply $\ell_1$ regularization.
{\bf Change the hyperparameter $\lambda$ to extract the best looking image and report it.}
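The notebook drives this experiment; as a minimal stand-alone sketch of the hyperparameter sweep (illustrative sizes only, and using \texttt{sklearn}'s \texttt{Lasso}, whose $\alpha$ differs from our $\lambda$ by a factor of $n$), it might look like:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, s = 100, 200, 5                      # illustrative sizes, not the notebook's
w_true = np.zeros(d)
w_true[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)

X = rng.normal(size=(n, d)) / np.sqrt(n)   # Gaussian measurement matrix
y = X @ w_true + 0.01 * rng.normal(size=n)

# Sweep the regularization strength and keep the best recovery error.
alphas = np.logspace(-4, 0, 20)
errors = [np.linalg.norm(
              Lasso(alpha=a, fit_intercept=False, max_iter=10000)
              .fit(X, y).coef_ - w_true)
          for a in alphas]
best_alpha = alphas[int(np.argmin(errors))]
print(min(errors) < np.linalg.norm(w_true))  # better than the all-zeros estimate
```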
\end{Parts}
\Question{Variance of Sparse Linear Models Obtained by Thresholding}
In this question, we will analyze the variance of sparse linear models. In particular, we will analyze two procedures that perform feature selection in linear models, and show quantitatively that feature selection lowers the variance of linear models.
This should make sense to you at an intuitive level: enforcing sparsity is equivalent to deliberately constraining model complexity; think about where this puts you on the bias-variance trade-off.
However, note that there is a subtle difference between feature selection before training and after training.
If we use fewer features to begin with, our results so far imply that we will have lower variance because
of the smaller model complexity.
What we learn from working through this problem is that selecting features adaptively (that is, based on the training data)
does not hurt either, if done in certain ways.
In other words, although there is a difference between doing feature selection before or after using the data, post-training feature selection still leads to variance reduction under certain assumptions.
First, some setup. Data from a sparse linear model is generated using
\begin{align*}
y = \X w^* + z,
\end{align*}
where $y \in \mathbb{R}^n$ denotes a vector of responses, $\X \in \mathbb{R}^{n \times d}$ is our data matrix, $w^* \in \mathbb{R}^d$ is an unknown, $s$-sparse vector (with at most $s$ non-zero entries), and $z \sim N(0, \sigma^2 I_n)$ is an $n$-dimensional vector of i.i.d. Gaussian noise of variance $\sigma^2$.
Before we begin analyzing this model, recall that we have already analyzed the performance of the simple least-squares estimator $\widehat{w}_{{\sf LS}} = (\X^\top \X)^{-1} \X^\top y$. In particular, we showed in Problem 4 of HW 3 and Problem 4 of the Practice Midterm that
\begin{align}
\mathbb{E} \left[ \frac{1}{n} \| \X(\widehat{w}_{{\sf LS}} - w^*) \|_2^2 \right] = \sigma^2 \frac{d}{n}, \text{ and} \label{eq:pred} \\
\mathbb{E} \left[ \| \widehat{w}_{{\sf LS}} - w^* \|_2^2 \right] = \sigma^2 \mathsf{trace} \left[(\X^\top \X)^{-1} \right], \label{eq:est}
\end{align}
respectively. Equations \eqref{eq:pred} and \eqref{eq:est} represent the ``prediction" error (or variance) and the mean squared
error of the parameters of our model, respectively. For algebraic convenience, \emph{we will assume in this problem that the matrix $\X$ has orthonormal columns}, and so the bounds become
\begin{align}
\mathbb{E} \left[ \frac{1}{n} \| \X(\widehat{w}_{{\sf LS}} - w^*) \|_2^2 \right] = \sigma^2 \frac{d}{n}, \text{ and} \label{eq:pred1} \\
\mathbb{E} \left[ \| \widehat{w}_{{\sf LS}} - w^* \|_2^2 \right] = \sigma^2 d. \label{eq:est1}
\end{align}
In this problem, we will analyze two estimators that explicitly take into account the fact that $w^*$ is sparse, and consequently attain lower error than the vanilla least-squares estimate.
Let us define two operators. Given a vector $v \in \mathbb{R}^d$, the operation $\tau_k (v)$ zeroes out all but the top $k$ entries of $v$ measured in absolute value. The operator $T_{\lambda}(v)$, on the other hand, zeroes out all entries that are less than $\lambda$ in absolute value.
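In NumPy, the two operators can be sketched as follows (an illustration of the definitions, nothing more):

```python
import numpy as np

def tau_k(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def T_lambda(v, lam):
    """Zero out every entry with |v_i| < lam."""
    return np.where(np.abs(v) >= lam, v, 0.0)

v = np.array([3.0, -0.5, 1.5, -4.0, 0.2])
print(tau_k(v, 2))       # [ 3.   0.   0.  -4.   0. ]
print(T_lambda(v, 1.0))  # [ 3.   0.   1.5 -4.   0. ]
```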
Recall that the least squares estimate was given by $\widehat{w}_{{\sf LS}} = \X^{\dagger} y = \X^\top y$, where $\X^\dagger$ is the pseudo-inverse of $\X$ (it is equal to the transpose since $\X$ has orthonormal columns). We now define
\begin{align*}
\widehat{w}_{{\sf top}}(s) &= \tau_s(\widehat{w}_{{\sf LS}}) \\
\widehat{w}_T(\lambda) &= T_\lambda (\widehat{w}_{{\sf LS}}),
\end{align*}
which are the two sparsity-inducing estimators that we will consider.
The solutions to the first three parts of the problem can be filled out in the provided IPython notebook.
Parts (d)--(j) must have a separate, written solution. All logarithms are to the base $e$.
\begin{Parts}
\Part Let us first do some numerical exploration. {\bf In the provided IPython notebook, you will find code to generate and plot the behavior of the least squares algorithm.}
\Part Now implement the two estimators described above and {\bf plot their performance as a function of $n$, $d$ and $s$.}
\Part {\bf Now generate data from a non-sparse linear model, and numerically compute the estimators
for this data.
Explain the behavior you are seeing in these plots in terms of the bias-variance trade-off.}
\Part In the rest of the problem, we will theoretically analyze the variance of the top-k procedure, and try to explain the curves above. We will need to use a handy tool, which is a bound on the maximum of Gaussian random variables.
{\bf Show that given $d$ Gaussians $\{Z_i\}_{i=1}^d$ (not necessarily independent) with mean $0$ and variance $\sigma^2$, we have}
\begin{align*}
\Pr \left\{ \max_{i \in \{1, 2, \ldots, d\}} |Z_i| \geq 2 \sigma \sqrt{\log d} \right\} \leq \frac{1}{d}.
\end{align*}
Hint 1: You may use without proof the fact that for a Gaussian random variable $Z \sim N(0, \sigma^2)$ and scalar $t > 0$, we have $\Pr\{|Z| \geq t\} \leq e^{- \frac{t^2}{2\sigma^2}}$.
Hint 2: For the maximum to be large, one of the Gaussians must be large. Now use the union bound.
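Before proving the bound, you can check it empirically with a quick Monte Carlo experiment (illustrative only; it does not replace the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 100, 1.0, 2000

# Empirical frequency of the event  max_i |Z_i| >= 2*sigma*sqrt(log d)
# over many independent draws of d i.i.d. N(0, sigma^2) variables.
Z = rng.normal(scale=sigma, size=(trials, d))
threshold = 2 * sigma * np.sqrt(np.log(d))
freq = np.mean(np.abs(Z).max(axis=1) >= threshold)

# The claimed bound says this probability is at most 1/d.
print(freq, 1.0 / d)
```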
\Part {\bf Show that $\widehat{w}_{{\sf top}}(s)$ returns the top $s$ entries of the vector $w^* + z'$ in absolute value, where $z'$ is i.i.d. Gaussian with variance $\sigma^2$.}
\Part {\bf Argue that the (random) error vector $e = \widehat{w}_{{\sf top}}(s) - w^*$ is always (at most) $2s$-sparse.}
\Part Let us now condition on the event $\mathcal{E} = \{\max |z_i'| \leq 2 \sigma \sqrt{\log d} \}$. Conditioned on this event, {\bf show that we have $|e_i| \leq 4 \sigma \sqrt{\log d}$ for each index $i$.}
\Part {\bf Conclude that with probability at least $1 - 1/d$, we have}
\begin{align*}
\| \widehat{w}_{{\sf top}}(s) - w^* \|_2^2 \leq 32 \sigma^2 s\log d.
\end{align*}
Compare this with equation \eqref{eq:est1} (assuming for the moment that the above quantity is actually an expectation, which we can also show but haven't). When is parameter estimation with the top-$s$ procedure better than the least squares estimator according to these calculations? Pay particular attention to the regime in which $s$ is constant (we have 10 features in our problem that we consider important, although we continue to collect tons of additional ones).
\Part Use the above part to {\bf show that with probability at least $1 - 1/d$, we have}
\begin{align*}
\frac{1}{n} \| \X (\widehat{w}_{{\sf top}}(s) - w^*) \|_2^2 \leq 32 \sigma^2 \frac{s \log d}{n},
\end{align*}
and conclude that we have smaller variance as long as $s \leq \frac{d}{32 \log d}$.
In conclusion, roughly speaking, the sparse solution has a mean prediction error upper bound of $O(\sigma^2 \frac{s \log d}{n})$ with high probability. On the other hand, the least-squares solution has an expected mean prediction error of $\sigma^2 \frac{d}{n}$. Thus, when $s \log d < c \cdot d$ for some constant $c$, the sparse solution has smaller error than the least-squares solution.
\Part {\bf Now consider the case if we already knew the important $s$ features to begin with. What would be the variance
of the sparse OLS estimator using the $s$ important features? How does this variance compare to the variance bound
for the sparse $\widehat{w}_{{\sf top}}(s)$ derived above?}
\Part {\bf BONUS: Derive the variance of the threshold estimator} $\widehat{w}_{T}(\lambda)$ with $\lambda = 2 \sigma \sqrt{\log d}$.
\end{Parts}
\Question{Decision Trees and Random Forests}
In this problem, you will implement decision trees and random forests for classification on two datasets:
\begin{enumerate}
\item Titanic Dataset: predict Titanic survivors
\item Spam Dataset: predict if a message is spam
\end{enumerate}
In lectures, you were given a basic introduction to decision trees and how such trees are trained. You were also introduced to random forests. Feel free to research different decision tree techniques online.
\textbf{NOTE:} You should NOT use any software package for decision trees for Part~\ref{pt:treescratch}.
\begin{Parts}
\Part \label{pt:treescratch} \textbf{Implement the information gain and gini impurity splitting rules for a decision tree.}
See \texttt{decision\_tree\_starter.py} for the recommended starter code. The code sample is a simplified implementation, which combines decision tree and decision node functionalities and splits only on one feature at a time. \textbf{Include your code for information gain and gini impurity.}
\textbf{Note}: The sample implementation assumes that all features are continuous. You may convert all your features to be continuous or augment the implementation to handle discrete features.
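For reference, the standard impurity formulas can be sketched in a few lines of NumPy (a minimal illustration; your submitted implementation should integrate these into the starter code's structure and the full split search):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector: 1 - sum_c p_c^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy (base e) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(y, mask):
    """Reduction in entropy from splitting y by a boolean mask."""
    n = len(y)
    n_left = mask.sum()
    if n_left in (0, n):          # degenerate split: no gain
        return 0.0
    return entropy(y) - (n_left / n) * entropy(y[mask]) \
                      - ((n - n_left) / n) * entropy(y[~mask])

y = np.array([0, 0, 1, 1])
mask = np.array([True, True, False, False])   # a perfect split
print(gini(y))                    # 0.5
print(information_gain(y, mask))  # log(2) ~= 0.693
```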
\Part Before applying the decision tree to the Titanic dataset, described below, we will first preprocess the dataset. We provide the code to preprocess the data. Describe how we deal with the following problems:
\begin{itemize}
\item Some data miss class labels;
\item Some features are not numerical values;
\item Some data miss features.
\end{itemize}
\paragraph{Data Processing for Titanic}
Here is a brief overview of the fields in the Titanic dataset.
\begin{enumerate}
\item survived - 1 is survived; 0 is not. This is the class label.
\item pclass - Measure of socioeconomic status: 1 is upper, 2 is middle, 3 is lower.
\item sex - Male/Female
\item age - Fractional if less than 1.
\item sibsp - Number of siblings/spouses aboard the Titanic
\item parch - Number of parents/children aboard the Titanic
\item ticket - Ticket number
\item fare - Fare.
\item cabin - Cabin number.
\item embarked - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
\end{enumerate}
\Part \textbf{Apply your decision tree to the Titanic dataset.} \textbf{Train a shallow decision tree} (for example, a depth $3$ tree, although you may choose any depth that looks good) and \textbf{visualize your tree}. Include for each non-leaf node the feature name and the split rule, and include for leaf nodes the class your decision tree would assign.
We provide a code snippet to draw the tree using \texttt{pydot} and \texttt{graphviz}. If you cannot install these dependencies, you will need to draw the diagram by hand.
\Part From this point forward, you are allowed to use \texttt{sklearn.tree.*} and the classes we have imported for you below in the starter code snippets.
You are NOT allowed to use other functions from \texttt{sklearn}.
\textbf{Implement bagged trees as follows:} for each of the $n$ trees, sample \textit{with replacement} from the original training set until you have as many samples as the training set.
Fit a decision tree for each sampling.
\textbf{Include your bagged trees code.} Below is optional starter code.
\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import BaseEstimator, ClassifierMixin


class BaggedTrees(BaseEstimator, ClassifierMixin):

    def __init__(self, params=None, n=200):
        if params is None:
            params = {}
        self.params = params
        self.n = n
        self.decision_trees = [
            DecisionTreeClassifier(random_state=i, **self.params)
            for i in range(self.n)
        ]

    def fit(self, X, y):
        # TODO implement function
        pass

    def predict(self, X):
        # TODO implement function
        pass
\end{verbatim}
\Part \textbf{Apply bagged trees to the Titanic and spam datasets.} \textbf{Find and state the most common splits made at the root node of the trees.} For example:
\begin{enumerate}
\item (``viagra") $\geq$ 3 (20 trees)
\item (``thanks") $<$ 4 (15 trees)
\item (``nigeria") $\geq$ 1 (5 trees)
\end{enumerate}
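One way to tabulate root splits is sketched below (assuming \texttt{sklearn}-backed trees, as in the starter code; \texttt{tree\_.feature[0]} and \texttt{tree\_.threshold[0]} describe the root node). The toy data here is a stand-in for your fitted ensemble:

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)              # feature 2 drives the label

# Stand-in for a fitted ensemble: bootstrap-resampled sklearn trees.
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_depth=3, random_state=i)
                 .fit(X[idx], y[idx]))

# Count (feature, threshold) pairs at the root across the ensemble.
root_splits = Counter((int(t.tree_.feature[0]),
                       round(float(t.tree_.threshold[0]), 2))
                      for t in trees)
print(root_splits.most_common(3))
```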
\paragraph{Data format for Spam}
The preprocessed spam dataset given to you as part of the homework in \texttt{spam\_data.mat} consists of 11,029 email messages, from which 32 features have been extracted as follows:
\begin{itemize}
\item 25 features giving the frequency (count) of words in a given message which match the following words: pain, private, bank, money, drug, spam, prescription, creative, height, featured, differ, width, other, energy, business, message, volumes, revision, path, meter, memo, planning, pleased, record, out.
\item 7 features giving the frequency (count) of characters in the email that match the following characters: \texttt{;, \$, \#, !, (, [, \&}.
\end{itemize}
The dataset consists of a training set of size 5172 and a test set of size 5857.
\Part \textbf{Implement random forests as follows:} again, for each tree in the forest, sample \textit{with replacement} from the original training set until you have as many samples as the training set. Fit a decision tree for each sample, this time using a randomly sampled subset of the features (instead of the full set of features) to find the best split on the data. Let \texttt{m} denote the number of features to subsample. \textbf{Include your random forests code.} Below is optional starter code.
\begin{verbatim}
class RandomForest(BaggedTrees):

    def __init__(self, params=None, n=200, m=1):
        if params is None:
            params = {}
        # TODO implement function
        pass
\end{verbatim}
\Part \textbf{Apply random forests to the Titanic and spam datasets.} \textbf{Find and state the most common splits made at the root node of the trees.}
\Part Implement AdaBoost on random forests as follows: this time, we will collect one sample at a time, and we will change the weights on the data after each new tree is fit in order to generate more trees that tackle the more challenging data. Let $w \in \mathbb{R}^N$ denote the probability vector over the data (initially uniform), where $N$ denotes the number of data points. To start off, as before, sample \textit{with replacement} from the original training set according to $w$ until you have as many samples as the training set. Fit a decision tree for this sample, again using a randomly sampled subset of the features. Compute the weight for tree $j$ based on its accuracy:
\begin{align*}
a_j = \frac{1}{2} \log \frac{1-e_j}{e_j}
\end{align*}
where $e_j$ is the weighted error:
\begin{align*}
e_j = \frac{\sum_{i=1}^N{I_j(x_i) w_i}}{\sum_{i=1}^N{w_i}}
\end{align*}
and $I_j(x_i)$ is an indicator for datum $i$ being \textit{incorrect}.
Then update the weights as follows:
\begin{align*}
w_i^+ = \begin{cases} w_i \exp(a_j) & \quad \text{if } I_j(x_i) = 1 \\
w_i \exp(-a_j) & \quad \text{otherwise}
\end{cases}
\end{align*}
Repeat until you have $M$ trees.
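To make the update concrete, here is one round of the reweighting above on toy numbers (illustrative only):

```python
import numpy as np

N = 5
w = np.ones(N) / N                        # uniform initial weights
incorrect = np.array([1, 0, 0, 0, 1])     # I_j(x_i): tree j missed points 0 and 4

e_j = (incorrect * w).sum() / w.sum()     # weighted error = 0.4
a_j = 0.5 * np.log((1 - e_j) / e_j)       # tree weight, about 0.203

# Misclassified points are up-weighted; the rest are down-weighted.
w_new = np.where(incorrect == 1, w * np.exp(a_j), w * np.exp(-a_j))
print(round(float(e_j), 2))     # 0.4
print(w_new[0] > w_new[1])      # True: missed points now carry more weight
```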
Predict by first calculating the score $z(x, c)$ for a data sample $x$ and class label $c$:
\begin{align*}
z(x, c) = \sum_{j=1}^M a_j I_{j}(x, c),
\end{align*}
where $I_j(x,c)$ is now an indicator variable for whether tree $j$ predicts class label $c$ for the data point $x$.
Then, the class with the highest weight is the prediction (classification result):
\begin{align*}
\hat{y} = \arg \max_c z(x, c)
\end{align*}
\textbf{Include your boosted random forests code.} Below is optional starter code. How are the trees being weighted? \textbf{Describe qualitatively what this algorithm is doing. What does it mean when $a_j < 0$, and how does the algorithm handle such trees?}
\begin{verbatim}
class BoostedRandomForest(RandomForest):

    def fit(self, X, y):
        self.w = np.ones(X.shape[0]) / X.shape[0]  # Weights on data
        self.a = np.zeros(self.n)                  # Weights on decision trees
        # TODO implement function
        return self

    def predict(self, X):
        # TODO implement function
        pass
\end{verbatim}
\Part \textbf{Apply boosted random forests to the Titanic and spam datasets. For the spam dataset only: Describe what kind of data are the most challenging to classify and which are the easiest. Give a few examples. Describe your procedure for determining which data are easy or hard to classify.}
\Part \textbf{Summarize the performance evaluation of each of the above trees and forests: a single decision tree, bagged trees, random forests, and boosted random forests}. For each of the $2$ datasets, report your training and validation accuracies. You should be reporting $24$ numbers ($2$ datasets $\times$ $4$ classifiers $\times$ $3$ data splits). Describe qualitatively which types of trees and forests performed best. Detail any parameters that worked well for you. \textbf{In addition, for each of the $2$ datasets, train your best model and submit your predictions to Gradescope}. Your best Titanic classifier should exceed 73\% accuracy.
\Part You should submit
\begin{itemize}
\item a PDF write-up containing your \textit{answers, plots, and code} to Gradescope;
\item a .zip file of your \textit{code}.
\item a file, named \texttt{submission.txt}, of your Titanic predictions (one per line).
\item a file, named \texttt{submission.txt}, of your spam predictions (one per line).
\end{itemize}
\end{Parts}
\Question{Your Own Question}
{\bf Write your own question, and provide a thorough solution.}
Writing your own problems is a very important way to really learn
the material. The famous ``Bloom's Taxonomy'' that lists the levels of
learning is: Remember, Understand, Apply, Analyze, Evaluate, and
Create. Using what you know to create is the top-level. We rarely ask
you any HW questions about the lowest level of straight-up
remembering, expecting you to be able to do that yourself (e.g., by making
yourself flashcards). But we don't want the same to be true about the
highest level.
As a practical matter, having some practice at trying to create
problems helps you study for exams much better than simply counting on
solving existing practice problems. This is because thinking about how
to create an interesting problem forces you to really look at the
material from the perspective of those who are going to create the
exams.
Besides, this is fun. If you want to make a boring problem, go
ahead. That is your prerogative. But it is more fun to really engage
with the material, discover something interesting, and then come up
with a problem that walks others down a journey that lets them share
your discovery. You don't have to achieve this every week. But unless
you try every week, it probably won't happen ever.
\end{document}