\documentclass[preview]{standalone}
\usepackage{header_problems}
\usepackage{header_template}
\begin{document}
\fontsize{12}{15}\selectfont
\Question{Getting Started}
\textbf{Read through this page carefully.} You may typeset your homework in latex or submit neatly handwritten/scanned solutions. Please start each question on a new page. Deliverables:
\begin{enumerate}
\item Submit a PDF of your writeup, \textbf{with an appendix for your code}, to assignment on Gradescope, ``<> Write-Up". If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.
\item If there is code, submit all code needed to reproduce your results, ``<<title>> Code".
\item If there is a test set, submit your test set evaluation results, ``<<title>> Test Set".
\end{enumerate}
After you've submitted your homework, watch out for the self-grade form.
\begin{Parts}
\Part Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
\vspace{15pt}
\framebox(465, 75){}
\Part Please copy the following statement and sign next to it. We just want to make it \textit{extra} clear so that no one inadvertently cheats.
\textit{I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write up.}
\vspace{15pt}
\framebox(465, 75){}
\end{Parts}
\pagebreak
\Question{SGD on OLS}
\newcommand{\mycomment}[1]{\textcolor{blue}{#1}}
\newcommand{\updt}[1]{\w^{#1}}
\newcommand{\lipcon}{M_g}
\newcommand{\best}{\w^\star}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\tilm}{\tilde{m}}
In this problem, we carefully walk through the key ingredients of a
proof for SGD convergence for the specific example of a loss function
which we are very familiar with: Ordinary Least Squares with positive
definite matrix $\X^\top \X$. In particular we show that in this case,
even though gradient descent converges to the optimal solution of OLS,
with small enough constant stepsize, SGD is not guaranteed to do so!
We then show that in contrast, when applying SGD with decaying step,
as outlined in lecture, it does converge to the same optimum as
gradient descent.
This phenomenon that constant stepsize gets stuck and decaying stepsizes
are the way out is analogous to common observations in practical machine
learning problems, where decreasing the learning rate using
heuristic learning rate schedules helps to decrease the training error.
Recall that the ordinary least squares problem can be written as:
$$\min_{\w} f(\w) = \min_{\w} \frac{1}{2} || \X\w - \y||^2_2 = \min_{\w} \frac{1}{2} \sum_{i=1}^n (\x_i^\top \w - \y_i)^2 = \min_{\w} \frac{1}{2} \sum_{i=1}^n f_i(\w)$$
where $f_i(\w) := (\x_i^\top \w - \y_i)^2$. Here $\x_i^\top$ is the
$i$th row of matrix $\X$ and $f_i$ is the loss of the training
example $i$. We implement SGD as follows:
\begin{equation*}
\w^{t+1} = \w^{t} - \alpha_{t} \nabla f_{i_t}(\w^{t}).
\end{equation*}
where $i_t$ is uniformly sampled from all samples $\{ 1, 2, \cdots ,
n\}$ (and is independently drawn for each iteration $t$). Let's define
the short hand $G_{t} = \nabla f_{i_t}(\w^{t})$ to represent the
random gradient at step $t$. We are interested in how $\w^t$
approaches the optimal solution $\w^{*} := \arg \min_{\w\in \R^d}
f(\w)$. One way to characterize this is to monitor the squared
distance of the iterate to the optimum, i.e. $|| \w^t - \w^* || ^2_2$.
Throughout, we will assume that the following bound holds for the
squared norm of the stochastic gradient:
\begin{equation}
\label{eq:ass}
\ev ||G(\w)||_2^2 \leq M_g^2 ||\w-\w^*||_2^2 + B^2.
\end{equation}
where $M_g$ and $B$ are constants dependent on the model and loss function $f$
We will find concrete values of them later in this problem for specific examples. (Notice that in lecture we assumed $M_g$ to be zero, which is too restrictive even for the most basic least squares loss.)
\paragraph{Problem outline:} Parts (a)-(c) help to derive a recursive formula of the form $\ev
|| \w^{t+1} - \w^* || ^2_2 \leq \gamma \ev || \w^t - \w^* || ^2_2
+ \tilde{\gamma}$.
%% Both $\gamma $ and $\tilde{\gamma}$ depend on the
%% stepsize and loss function characteristics. For convergence we always
%% assume that we choose the stepsize such that $\gamma < 1$.
In part (d) we show that we achieve linear (also called geometric)
convergence of $\w^t$ to the optimum $\w^*$ for SGD with constant
stepsize when $f(\w^*) = 0$. That means we can write $\ev || \w^{t}
- \w^* || ^2_2 \leq \gamma^t || \w^{0} - \w^* || ^2_2$ for some
$\gamma <1 $. In part (e) we explore what happens with the iterates
of SGD with constant stepsize if instead $f(\w^*) > 0$. In part (f)
we visualize how the convergence bounds translate to actual training
error decrease for OLS using SGD and GD.
The bonus parts (g)-(h) are for those students who ``love math''
according to the midterm preliminary questions, where you are
asked to prove that SGD with decaying stepsize converges as $1/t$
for ``nice'' functions using induction.
\begin{Parts}
\Part \textbf{Show the following relation between the $t+1$-step error and the $t$-step error: $||\w^{t+1} - \w^* ||_2^2 = ||\w^{t} - \w^*||_2^2 - 2\alpha_t \langle G_t, \w^t - \w^* \rangle + \alpha_t^2 ||G_t||_2^2$}
\Part \emph{Here we prove the key relation which underlies the success of stochastic gradient methods. This is where we use that stochastic gradients are unbiased!} Since we have stochastic gradients, we want to make guarantees in terms of expectations. For notational convenience let us define the short hand $\Delta_t := \ev ||\w^t-\w^*||_2^2$, the \emph{average squared error}, where the expectation is taken over all the random indeces drawn by the stochastic gradient method up to time $t$. We want to show a clean relation between $\Delta_t$ and $\Delta_{t+1}$ when the stochastic gradient satisfies~\eqref{eq:ass}.
Remind yourself which random indeces $\w^t$ depends on and use the
unbiasedness of the stochastic gradient to {\bf show that}
\begin{equation}
\label{eq:crucial}
\ev [ \langle G(\updt{t}), \updt{t} - \best\rangle] = \ev [ \langle \nabla f(\updt{t}),\updt{t}-\best\rangle ].
\end{equation}
\textbf{Now use equality ~\eqref{eq:crucial} and (a) to prove that}
\begin{equation}
\label{eq:useful}
\Delta_{t+1} \leq (1+\alpha_t^2 M_g^2)\Delta_t - 2\alpha_t \ev \langle \nabla f(\w^t), \w^t - \w^* \rangle + \alpha_t^2 B^2.
\end{equation}
Hint 1: You may take the law of iterated expectation (also called tower property) as given, i.e.
\begin{equation*}
\ev [ \langle G(\updt{t}), \updt{t} - \best\rangle] = \ev_{i_1,\dots,i_{t-1}}[ \ev_{i_t}
[ \langle G(\updt{t}), \updt{t} - \best\rangle | i_1, \dots, i_{t-1}
]].
\end{equation*}
See Discussion for the derivation.
Hint 2: Unbiased stochastic gradient means that $\ev_{i_t} \nabla f_{i_t}(\w) = \nabla f(\w)$ for any $\w$ independent of the random index, where the expectation is with respect to the random index $i_t$ at time $t$. Refer to the Discussion for brief intro on conditional expectations.
%% $\ev \left [ \ev \left[ X|Y \right]\right ] = \ev \left[ X \right]$. \\
%% Hint 1: The tower property (also called law of iterated expectation) which you derived in discussion might be helpful, i.e.
%% Hint 2: The stochastic gradient $G(\w)$ is an unbiased estimate of the full gradient $\nabla f(\w)$.
\Part \emph{In this step we see that we need for our analysis that our loss function is ``nice'' (strongly convex to be precise), which holds for OLS when $\lambda_{min}(\X\tran \X ) > 0$. That is, all the guarantees for SGD we discuss in this problem only hold for such ``nice'' functions. In general, at least convexity is needed for any convergence proof.} \textbf{Using inequality~\eqref{eq:useful}, now show that $\Delta_{t+1} \leq (1+\alpha_t^2 M_g^2 - 2\alpha_t m )\Delta_t + \alpha_t^2 B^2$, where we assume that the minimum eigenvalue of matrix $\X \tran \X$ denoted by $m$ is positive, i.e. $m:=\lambda_{min}(\X\tran \X ) > 0$. } \\
Hint: Use the fact that $\nabla f(\w^*) = \vec{0}$, and hence
\begin{equation*}
\langle \nabla f(\w^t), \w - \w^*\rangle = \langle \nabla f(\w^t) - \nabla f(\w^*), \w - \w^*\rangle.
\end{equation*}
\Part We now examine how close we can get to the optimal solution $\w^*$ using SGD with {\bf constant stepsize}, i.e. $\alpha_t = \alpha$ for all $t$ in two different scenarios. First, for this question we assume that $\X \w^* = \y$, or in other words, that the minimum loss is $0$. \textbf{Show that inequality~\eqref{eq:ass} holds with $B=0$ and $M_g = \max_i||\x_i||_2^4$ in this case. Find the optimum learning rate and show that $\Delta_t \leq (1-\frac{m^2}{M_g^2})^t \Delta_0$.} This means that with $B=0$ we have linear (geometric) convergence!\\
Hint: You may want to use that for any matrix $\A$ and vector $\w$ it holds that $\|\A \w\|_2^2 \leq \lambda_{max}(\A^\top \A) \|\w\|_2^2$ and the same bound applies when taking the expectation on both sides.
\Part Instead of assuming that there exists a $\w^*$ that satisfies $\X \w^* = \y$ exactly, we now consider the case where $f(\w^*) = \frac{1}{2}\|\X\w^* - y\|^2 >0$. One can show that in this case $\ev ||G(\w)||_2^2 \leq M_g^2 ||\w-\w^*||_2^2 + B^2$ holds for some $M_g$ and $B$, where $M_g \neq 0$ and $B \neq 0$. Suppose that you are given the constants $M_g$ and $B$ and the stepsize is still constant, i.e. $\alpha_t = \alpha$ for all $t$.
We define the short hand notation $\gamma:=1+\alpha^2 M_g^2 - 2\alpha
m $ and assume $\alpha$ is large enough such that $\gamma <
1$. \textbf{Using part (c), prove that
$\Delta_{t} \leq \gamma^t \Delta_{0} + \frac{\alpha B^2}{\alpha M_g^2
- 2m}$.} We now want to interpret this bound. {\bf Does it guarantee
convergence $\w^t \to \w^\star$ when $t$ goes to infinity? Explain.}
Hint: You may find the following inequality helpful:
$\sum_{t=1}^n \gamma^t \leq \frac{1}{1-\gamma}$ for $\gamma <1 $.
\Part We now want use simulations to compare the first order methods for solving OLS for the cases in (d) and (e). In particular, we want to plot the estimation error $||\w^t - \w^*||_2$ against the gradient descent step in the following 3 scenarios:
\begin{itemize}
\item When there exists an exact solution $\X\w^* = \y$, plot the errors when you are using SGD and a constant learning rate.
\item When there does not exist an exact solution $\X\w^* = \y$, plot the errors when you are using SGD and a constant learning rate. Also plot the errors for GD and the same constant learning rate.
\item When there does not exist an exact solution $\X\w^* = \y$, plot the errors when you are using SGD and a decaying learning rate.
\end{itemize}
Using the attached starter code, {\bf implement the gradient updates} in the function \texttt{sgd()} , while all the plotting functions are already there. \textbf{Show the 3 plots you obtain using the starter code. Report the average squared error computed in the starter code. What's your conclusion?}
\Part (Bonus) {\bf Prove that there is some constant $S$ and $k_0$ such that if the learning rate is decaying, the error $\Delta_k$ satisfies $\Delta_k \leq \frac{S}{k+k_0}$}. In this part, prove for the case of $M_g=0$ and $B\neq 0$. \\
Hint: induction is your friend.
\Part (Bonus) {\bf Prove that the same conclusion holds for the case of $M_g \neq 0$ and $B \neq 0$, for some different constant $M_g$ and $B$. }
Hint: this general case can be reduced to the case above.
\end{Parts}
\Question{[BONUS] Genome-Wide Association Study}
\textbf{All the following text is present in the accompanying jupyter notebook, but the notebook has additional explanations and accompanying figures and code. We recommend you do not read this pdf, but go directly to the jupyter notebook. This pdf was included for quick reference.}
\underline{Overall goal}: This real world problem is one in computational biology which uses many of the techniques and concepts you have been introduced to, all together, in particular, linear regression, PCA, non-iid noise, diagonalizing multivariate Gaussian covariance matrices, and bias-variance trade-off. We will also tangentially introduce you to concepts of statistical testing. \textbf{This homework problem is effectively a demo in that we will ask you to execute code and answer questions about what you observe. You are not required to code anything at all.}
%\underline{Background}: The decreasing cost of DNA sequencing over the past few decades has enabled researchers to make huge strides in understanding the genetic basis of disease. Given a set of people for whom the genetics have been measured, as well as the status of some trait, such as blood pressure, or "has cardiac disease", one can now use data-driven methods to deduce which genetic effects are responsible for the disease.
\underline{Setup and problem statement}: Given a set of people for whom genetics (DNA) has been measured, and also a corresponding trait for each person, such as blood pressure, or "is-a-smoker", one can use data-driven methods to deduce which genetic effects are likely responsible for the trait. We have collected blood from $n$ individuals who either smoke ($y_i=1$) or do not smoke ($y_i=0$). Their blood samples have been sequenced at $m$ positions along the genome, yielding the feature matrix $\mat{X} \in\mathbb{R}^{n \times m}$, composed of the genetic variants (which take on values 0, 1, or 2). Specifically, $X_{i,j}$ is a numeric encoding of the DNA for the $i^{\text{th}}$ person at genetic feature $j$. \footnote{Technically, the entries of the matrix correspond to having zero, one or two mutant versions of the DNA, but we will treat them as real-valued in this problem.} We want to deduce which of the $m$ genetic features are associated with smoking, by considering one at a time. \textit{In the data we give you, it will turn out that there is no true signal in the data; however, we will see that without careful modelling, we would erroneously conclude that the data were full of signal.}
\underline{Overall modelling approach}: The basic idea will be to "test" one genetic feature at a time and assign a score (a p-value) indicating our level of belief that it is associated with the trait. We will start with a simple linear regression model, and then build increasingly more correct linear predictive models. The three models we will use are (1) linear regression, (2) linear regression with PCA features, (3) linear regression with all genetic variants as features. The first model is naive, because it assumes the individuals are iid. The fundamental modelling challenges with these kinds of analyses is that the individuals in the study are not typically not iid, owing to differences in race and family-relatedness (e.g., sisters, brothers, patients, grandparents in the data set), which violate the iid assumption. Left unmodelled, such structure creates misleading results, yielding signal where none exists. Luckily, as we shall see, one can visualize these modelling issues via quantile-quantile plots, which will soon be briefly introduced.
\underline{How to test each variant}:
Herein we provide a minimal exposition to allow you to do the homework. To estimate how implicated each genetic feature is we will use a score, in the form of a p-value. One can think of the p-value as a proxy for how informative the genetic feature is for prediction of the trait (e.g. "is smoker"). More precisely, to get the p-value for the $j^{\text{th}}$ genetic feature, we first include it in the model (for a given model class, such as linear regression) and compute the maximum likelihood, $LL_{j}$ (this is our alternative hypothesis). Then we repeat this process, but having removed the genetic feature of interest from the model, yielding $LL_{-j}$ (this is our null hypothesis). To be clear, the null hypothesis will have none of the $m$ genetic variants that are being tested. You can refer to the jupyter notebook for a brief explanation of hypothesis testing. The p-value is then a simple monotonic decreasing function of the difference in these two likelihoods, $ \verb|diff_ll|=LL_j - LL_{-j}$--one that we will give you. P-values lie in $[0,1]$ and the smaller the p-value, the larger the difference in the likelihoods, and the more evidence that the genetic marker is associated with the trait (assuming the model is correct).
\begin{figure}
\centering
\includegraphics[width=0.6\linewidth]{src/problems/pca/gwas/intro.png}
\caption{Example of quantile-quantile plot that shows large deviation from expectation.}
\label{fig:intro}
\end{figure}
%\underline{Three models:} The three models we will use are (1) linear regression ,(2) linear regression with PCA features, (3) linear regression with all genetic variants as features. The first model is naive, because it assumes the individuals are iid.
%The other two models try to fix this in different ways: each, some way to "condition on" (aka add in to the model) all the other genetic markers for each test, which are proxies for race and family relatedness. PCA features are able to correct for broad race non-iid'ness which tends to be low rank structure that PCA is good at capturing. However, family-structure sits in the tail-end of the eigen-spectrum, and for that PCA fails. Thus we need to capture the full eigenspectrum. One way to do this is to effectively adjust the iid noise in the gaussian by the pairwise genetic similarity of all the individuals. That is, we set $\Sigma$ in $p(\vec{y}=N(y|XW, I \sigma_e^2 + X^T X \sigma^2_g)$. Note that $\vec{y} \in\mathbb{R}^n$ contains the values for whether someone smokes or not (0 or 1). The feature matrix $\mat{X} \in\mathbb{R}^{n \times d}$ is composed of the genetic variants (which take on values 0, 1, or 2) where $X_{i,j}$ is the variant type of the ith person for genetic variant j.
% Let's do the naive solution
\begin{Parts}
\Part To diagnose if something is amiss in our genetic analyses, we will make use of the following: (1) we assume that if any genetic signal is present, it is restricted to a small fraction of the genetic features, (2) p-values corresponding to no signal in the data are distributed as $p \sim \text{Unif}[0,1]$. Combining these two assumptions, we will assume that p-values arising from a valid model should largely be drawn from $\text{Unif}[0,1]$, and also that large deviations suggest that our model is incorrect.
Quantile-quantile plots let us visualize if we have deviations or not. Quantile-quantile plots are a way to compare two probability distributions by comparing their quantile values (e.g. how does the smallest p-value in each distribution compare, and then the second smallest, etc.). In the quantile-quantile plot, you will see $m$ points, one for each genetic marker. The x-coordinate of each point corresponds to the theoretical quantile we would expect to see if the distribution was in fact a $Unif[0, 1]$ and the y-coordinate corresponds to the observed value of the quantile. An example is shown in Figure 1, where the line on the diagonal results from an analysis where the model is correct, and hence the theoretical and empirical p-value quantiles match, while the other line, which deviates from the diagonal, indicates that we have likely made a modelling error. If there are genetic signals in the data, these would simply emerge as a handful of outlier points from the diagonal (not shown).
Before we dive into developing our models, we need to be able to understand whether the p-values we get back from our hypothesis tests look like $m$ random draws from a $Unif[0, 1]$.
\textbf{Use the qqplot function to make a qq-plot for each of the 3 distributions provided below and explain your findings. What should we observe in our qq-plot if our empirical distribution looks more and more similar to a Unif[0, 1]?}. Note that we use two kinds of qq-plots: one in p-value space and one negative log p-value space. The former is for intuition, while the latter is for higher resolution. The green lines in the negative log p-value qq-plots indicate the error bars.
\Part We will use linear models in the genetic marker we are testing. In particular, when testing the $j^{\text{th}}$ genetic feature, we have that the trait, $\vec{y}$, is a linear function of the genetic variant, i.e. $ \vec{y} = \vec{x_j} w_1 + \vec{w_0} + \vec{\epsilon}$ where $\epsilon_i$ is random noise distributed as $N(0, \sigma^2)$, $w_1 \in \mathbb{R}^{1}$, $\vec{w_0}$ $\in \mathbb{R}^{n \times 1}$ is a constant vector, and $\vec{x_j}$ is the jth column of $\mat{X}$, representing data for the jth genetic variant. To simplify matters, we will add a column of ones to the right end of $\vec{x_j}$ and rewrite the regression as $\vec{y}=[\vec{x_j}, \vec{1}] \vec{w} + \vec{\epsilon}$ where $[\vec{x_j}, \vec{1}]$ is the jth column of $\mat{X}$ with a vector of ones appended to the end and $\vec{w} \in \mathbb{R}^{2 \times 1}$. The model without any genetic information, $\vec{y} = \vec{1}w + \vec{\epsilon}$ is referred to as the \emph{null model} in the parlance of statistical testing The \emph{alternative model}, which includes the information we are testing (one genetic marker) is $\vec{y} = [\vec{x_j}, \vec{1}] \vec{w} + \vec{\epsilon}$ where $\vec{x_j}$ is the jth column of X, i.e. the data using only the jth genetic variant as a feature. \textbf{Plot the quantile-quantile plot of p-values using linear regression as just describe, a so-called naive approach, by running the function} $\verb|naive_model|$. \textbf{From the plot, what do you conclude about the suitability of linear regression for this problem?}
% This function $\verb|qqplot(pvals)|$ which takes in a numpy array of p-values and outputs a quantile-quantile plot of expected p-values versus observed.
%\textbf{Derive expressions for the maximum likelihood estimator for $\vec{w}$ and $\sigma^2$. Implement a method to compute the log likelihood given $\mat{X}$ and $\vec{y}$.}
%
%
%Try the PCA-corrected model
\Part From the quantile-quantile plot in the previous part, it appears that the model is picking up on more association than theoretically expected. The reason for this is owing to the assumption of iid noise being correct. In particular, this data set contains individuals from different racial backgrounds, and also has clusters of individuals from extended families (e.g. grandparents, parents, siblings). This means that their genetics are not iid, and hence linear regression yields \emph{spurious results}--all the genetic features seem to be implicated in the trait. Thus we need to modify our noise assumptions by somehow accounting for the hidden structure in the data set. The main idea is that when testing one genetic feature, all the other genetic features, jointly, are a proxy to the racial and family background. If we could include them in our model, we could correct the problem. \textbf{Ideally we would like to use all the genetic features in the linear regression model, however this is not a good idea. Why not?} Hint: There are roughly 1300 individuals and 7500 genetic variants. A written, English answer is sufficient.
So instead of using all genetic features, we will try using PCA to reduce the number of genetic features. As we saw in class, PCA on a genetic similarity matrix can capture geography, which correlates to race, quite well. So instead of adding all the genetic features, we will instead use only three features\footnote{One needs to choose this number, but we have done so for you.}, $\mat{X_{proj}}$, which are the $\mat{X}$ projected onto the top 3 principal components of $\mat{X}$. Consequently, the updated null model is $\vec{y} = \mat{X_{proj}}\vec{w_{proj}} + \vec{\epsilon}$ where $\vec{w_{proj}} \in \mathbb{R}^{3 \times 1}$, while the alternative model is $\vec{y} = [\vec{x_j}, \mat{X_{proj}}, \vec{1}] \vec{w} + \vec{\epsilon}$ where $\vec{w} \in \mathbb{R}^{5 \times 1}$ for genetic variant j.
\textbf{Plot the quantile-quantile plot from obtaining p-values with this PCA linear regression approach by running the function} $\verb|pca_corrected_model|$.
\textbf{How does this plot compare to the first plot? What does this tell you about this model compared to the previous model?}
% Introduce the mixed model
\Part PCA got us part of the way there. However, PCA truncates the eigenspectrum; if the tail-end of that spectrum is important, as it is for family-relatedness, then it will not fully correct for our problem. So we want a method which (a) is well-behaved in terms of number of parameters that need to be estimated, and (b) includes all of the information we need. So rather than adding the projections as features, we use an modelling approach called linear mixed models which effectively adjust the iid noise in the gaussian by the pairwise genetic similarity of all the individuals. That is, we set $\Sigma$ in $\vec{y} \sim N(y|[\vec{x_j}, 1] \vec{w}, I \sigma^2 + \mat{X}\mat{X}^\top \sigma^2_k)$.
Specifically, $\vec{y} = [\vec{x_j}, 1] \vec{w} + \vec{z} + \vec{\epsilon}$ where $\vec{z} \sim N(0, \sigma_k^2\mat{K})$ where $\mat{K} = \mat{X}\mat{X}^\top$, $\sigma_k$, $\vec{w} \in \mathbb{R}^{m \times 1}$, and $\sigma$ are parameters we want to estimate. Notice that $\vec{y} \sim N([\vec{x_j}, 1] \vec{w}, \sigma^2I + \sigma_k^2K)$. Evaluation of the likelihood is thus on the order of $O(n^3)$ from the required matrix inverse and determinant of $\sigma^2I + \sigma_k^2K$. To test $m$ genetic variants, the time complexity becomes $O(mn^3)$, which is extremely slow for large datasets. \textbf{Given the eigen-decomposition $\mat{K} = \mat{U}\mat{D}\mat{U}^\top$, how can we make this faster if we have to test thousands of genetic feature? Would this be computationally more efficient if you only have one genetic feature to test?}
\textbf{Finally, make the quantile-quantile plot for this last model by running the function }$\verb|lmm|$ \textbf{. What can we conclude about this model relative to the other two models?}
HINT: Since the manipulations needed for $\sigma^2I + \sigma_k^2K$ is the bottleneck here, we would like a transformation which makes this covariance of the multi-variate gaussian be a diagonal matrix.
%Another approach to incorporating associations is a linear mixed model. In this setting, we explicitly allow for both fixed effects (effects from variables of interest assumed to not vary from study to study) and random effect (effects from variables whose values are a random sample from a larger population of values). For example, here the random effects are the confounding effects from the population structure, while the fixed effects are the direct effects from the genetic variants.
%
% \Part \textbf{Plot the quantile-quantile plot from obtaining p-values with the linear mixed model approach using the function} $\verb|lmm|$.
%
\end{Parts}
\Question{Gradient Descent Framework}
In HW1, you modeled the classification of digit numbers of the MNIST dataset as
a linear regression problem and solved it using its closed-form solution. In
this homework, you will model it better by using classification models such as
logistic regression and neural networks, and solve it using stochastic gradient
descent. The goal of this problem is to show the power of modern machine
learning frameworks such as TensorFlow and PyTorch, and how they can solve a
wide range of problems. TensorFlow is the recommended framework in this
homework and we also provide the starter kit in PyTorch.
\begin{Parts}
\Part The starter code contains an implementation of linear regression for the
MNIST dataset to classify 10 digits. Let $\x_i$ be the feature vector and
$\y_i$ be a one-hot vector encoding of its class, i.e., $y_{i,j}=1$ if and only
if $i$th images is in class $j$ and $y_{i,j}=0$ if not. In order to use linear
regression to solve a multi-class classification, we will use the
\emph{one-vs-all} strategy here. In particular, for each $j$ we will fit a
linear regression model $(\w_j, b_j)$ to minimize $\sum_i (\w_j \T \x_i + b_j -
y_{i,j})_2^2$. Then for any image $\x$, the prediction of its class will be
$\argmax_j (\w_j\T \x + b_j)$.
Read the implementation and run it with batch size equal to 50, 100, and 200.
\textbf{Attach the ``epoch vs validation accuracy'' plot. Report the running
time for all the batch sizes. Explain the potential benefits and drawbacks of
using small batch size, i.e., SGD vs GD.}
\Part \textbf{Implement the \texttt{train\_logistic} function to do the
multi-class logistic regression using softmax function. Attach the plot of the
``epoch vs validation accuracy'' curve.}. The loss function of the multi-class
logistic regression is
\begin{equation}
\ell = -\sum_{i=1}^n \log\left[\mathbf{softmax}\left(\W \x_i + \b\right)\T \y_i\right]
\end{equation}
where the softmax function $\mathbf{softmax}:\R^d \to \R^d$ is defined as
\begin{equation}
\mathrm{softmax}(\z)_j
=\frac{\exp(z_j)}{\sum_k \exp(z_k)}
=\frac{\exp(z_j-z')}{\sum_k \exp(z_k-z')}.
\end{equation}
Here $z'=\max_j z_j$. The expression on the right is a numerical stable
formula to compute the softmax. You may NOT use any functions in
\texttt{tf.nn.*}, \texttt{tf.losses.*}, and \texttt{torch.nn.*} for all the
parts of this problem.
\Part Copy your code from \texttt{train\_logistic} to \texttt{train\_nn} and
add an additional $\tanh$ nonlinear layer to it. You loss function should be
something like
\begin{equation}
\ell = -\sum_i \log\left[\mathbf{softmax}\left(\W^{(2)} \textbf{tanh}\left(\W^{(1)}\x_i\right) + \b\right)\T \y_i\right].
\end{equation}
\textbf{Attach the plot of the ``epoch vs validation accuracy'' curve}. You
have the freedom but \textbf{are NOT required to} to choose the hyper-parameters
to get the best performance.
\Part In our previous lecture, we learned how to solve the total least squares
by doing the SVD decomposition. Now we will try to model the problem from the
probabilistic perspective and solve it using stochastic gradient descent. Let
the probabilistic model be
\begin{equation}
y_i - z_y = \w\T \left( \x_i - \z_{\x} \right) \label{eq:tls}
\end{equation}
where $\x_i$ and $y_i$ is the observed data, $y_i-z_y$ is $y_{\mathrm{true}}$,
$\x_i-z_{\x}$ is $x_{\mathrm{true}}$, $z_y \sim \mathcal{N}(0, 1)$, and
$z_{\x,j} \sim \mathcal{N}(0, 1)$ for all $j$. \textbf{Prove that the
log-likelihood for the model in Equation \eqref{eq:tls} is}
\begin{equation}
\sum_{i=1}^n \log P_{\w}(y_i \mid \x_i) =
C-
\frac n 2\log\left(\lVert \w \rVert_2^2+1\right) -
\frac{1}{2{\left(\lVert \w \rVert_2^2+1\right)}} \sum_{i=1}^n (y_i-\w\T\x_i)^2,
\end{equation}
where $n$ is the number of samples, and $C$ is a constant that is not related
to $\w$, $y$, and $\x$. \emph{Note that the maximum likelihood estimation of
this probabilistic model is not the same as the SVD solution of TLS as we did
in lecture.}
\Part \textbf{Implement the stochastic gradient descent to find the MLE for the
model above.} In the starter code, we generate some data for you and you will
need to recover $\w$ using the observed data. \textbf{Report the error $\lVert
w^*-w_{\mathrm{true}}\rVert_2^2$} where $w^*$ is the one that you recover. Try
to play with hyper-parameters such as the batch size and the learning rate.
\textbf{Is the solution of SGD sensitive to its hyper-parameters in this
problem?}
\end{Parts}
\Question{Your Own Question}
{\bf Write your own question, and provide a thorough solution.}
Writing your own problems is a very important way to really learn
the material. The famous ``Bloom's Taxonomy'' that lists the levels of
learning is: Remember, Understand, Apply, Analyze, Evaluate, and
Create. Using what you know to create is the top-level. We rarely ask
you any HW questions about the lowest level of straight-up
remembering, expecting you to be able to do that yourself. (e.g. make
yourself flashcards) But we don't want the same to be true about the
highest level.
As a practical matter, having some practice at trying to create
problems helps you study for exams much better than simply counting on
solving existing practice problems. This is because thinking about how
to create an interesting problem forces you to really look at the
material from the perspective of those who are going to create the
exams.
Besides, this is fun. If you want to make a boring problem, go
ahead. That is your prerogative. But it is more fun to really engage
with the material, discover something interesting, and then come up
with a problem that walks others down a journey that lets them share
your discovery. You don't have to achieve this every week. But unless
you try every week, it probably won't happen ever.
\end{document}