\documentclass[preview]{standalone}
\usepackage{header}
\begin{document}
\fontsize{12}{15}\selectfont
\Question{Getting Started}
\textbf{Read through this page carefully.} You may typeset your homework in latex or submit neatly handwritten/scanned solutions. Please start each question on a new page. Delivera$
\begin{enumerate}
\item Submit a PDF of your writeup, \textbf{with an appendix for your code}, to assignment on Gradescope, ``<> Write-Up". If there are graphs, include those graphs in the $
\item If there is code, submit all code needed to reproduce your results, ``<<title>> Code".
\item If there is a test set, submit your test set evaluation results, ``<<title>> Test Set".
\end{enumerate}
After you've submitted your homework, watch out for the self-grade form.
\begin{Parts}
\Part Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?
\vspace{15pt}
\framebox(465, 75){}
\Part Please copy the following statement and sign next to it. We just want to make it \textit{extra} clear so that no one inadverdently cheats.
\textit{I certify that all solutions are entirely in my words and that I have not looked at another student's solutions. I have credited all external sources in this write up.}
\vspace{15pt}
\framebox(465, 75){}
\end{Parts}
\pagebreak
\Question{MLE of Multivariate Gaussian}
In lecture, we discussed uses of the multivariate Gaussian
distribution. We just assumed that we knew the parameters of the
distribution (the mean vector $\mu$ and covariance matrix
$\Sigma$). In practice, though, we will often want to estimate $\mu$
and $\Sigma$ from data. (This will come up even beyond regression-type
problems: for example, when we want to use Gaussian models for
classification problems.) This problem asks you to derive the Maximum
Likelihood Estimate for the mean and variance of a multivariate
Gaussian distribution.
\begin{Parts}
\Part Let $\mat{X}$ have a multivariate Gaussian distribution with mean
$\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$.
{\bf Write the log likelihood of drawing the $n$ i.i.d. samples
$\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$ from $X$ given $\Sigma$ and $\mu$}.
\Part \textbf{Find MLE of $\mu$ and $\Sigma$.} For taking derivatives
with respect to matrices, you may use any formula in ``The Matrix
Cookbook'' without proof. This is a reasonably involved problem part
with lots of steps to get to the answer. We recommend students first
do the one-dimensional case and then the two-dimensional case to warm
up.
Note: Conventions for gradient and derivative in ``The Matrix Cookbook'' may
vary from the conventions we saw in the discussion.
\Part Use the following code to sample from a two-dimensional Multivariate Gaussian and plot the samples:
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
mu = [15, 5]
sigma = [[20, 0], [0, 10]]
samples = np.random.multivariate_normal(mu, sigma, size=100)
plt.scatter(samples[:, 0], samples[:, 1])
plt.show()
\end{verbatim}
Try the following three values of $\Sigma$:
$\Sigma =
\begin{bmatrix}
20 & 0 \\
0 & 10
\end{bmatrix}
$; \quad
$\Sigma =
\begin{bmatrix}
20 & 14 \\
14 & 10
\end{bmatrix}
$; \quad
$\Sigma =
\begin{bmatrix}
20 & -14 \\
-14 & 10
\end{bmatrix}
$.
{\bf Calculate the mean and covariance matrix of these distributions from the samples
(that is, implement part (b))}. Report your results. Include your code in your write-up.
Note: you are allowed to use numpy.
\end{Parts}
\Question{Tikhonov Regularization and Weighted Least Squares}
In lecture, you have seen this worked out in one way. In homework 2 we
introduced Tikhonov regularization as a generalization of ridge
regression. In this problem, we look at Tikhonov regularization from a
probabilistic standpoint.
The main goal is to deepen your understanding of how priors and thus
the right regularization affect the MAP estimator. First, you will
work out how introducing a certain probabilistic prior before
maximizing the posterior is equivalent to adding the Tikhonov
regularization term: by adding the Tikhonov regularization term, we
effectively constrain our optimization space. Similarly, using a
probabilistic prior drives our optimization towards solutions that
have a high (prior) probability of occurring. In the second half of
the problem you will then do some simulations to see how different
priors influence the estimator explicitly, as well as how this
effect changes as the number of samples grows.
\begin{Parts}
\Part Let $\vec{x} \in \mathbb{R}^d$ be a $d$-dimensional vector and
$Y \in \mathbb{R}$ be a one-dimensional random variable. Assume a
linear model: $Y=\vec{x}^\top\vec{w}+Z$ where $Z\in\mathbb{R}$ is
a standard Gaussian random variable $Z \sim N(0,1)$ and
$\vec{w}\in\mathbb{R}^d$ is a $d$-dimensional Gaussian random vector
$\vec{w}\sim N(0,\matG{\Sigma})$. $\matG{\Sigma}$ is a known symmetric
positive definite covariance matrix. Note that $\vec{w}$ is independent
of the observation noise. {\bf What is the conditional distribution
of $Y$ given $\vec{x}$ and $\vec{w}$?}
\Part (Tikhonov regularization) Let us assume that we are given $n$
training data points\\
$\{(\vec{x}_1,Y_1),(\vec{x}_2,Y_2),\cdots, (\vec{x}_n,Y_n)\}$ which we
know are generated i.i.d. according to the model of $(\vec{x},Y)$ in
the previous part, i.e. we draw one $\vec{w}$ and use this to generate
all $Y_i$ given distinct but arbitrary $\{\vec{x}_i\}_{i=1}^n$ , but
the observation noise $Z_i$ varies across the different training
points. {\bf Derive the posterior distribution of $\vec{w}$ given the
training data. Based on your result, what is the MAP estimate of
$\vec{w}$? Comment on how Tikhonov regularization is a
generalization of ridge regression from a probabilistic
perspective.}
Note: $\vec{w}$ and $\vec{Y} = (Y_1, Y_2, \ldots, Y_n)$ are jointly Gaussian in
this problem given $\{\vec{x}_i\}_{i=1}^n$.
Hint: (You may or may not find this useful) If the probability density function
of a random variable is of the form
\begin{align*}
f(\vec{v})=C \cdot\exp\{-\frac{1}{2}\vec{v}^\top\mat{A}\vec{v}+\vec{b}^\top \vec{v}\},
\end{align*}
where $C$ is some constant to make $f(\vec{v})$ integrates to $1$, then the mean
of $\vec{v}$ is $\mat{A}^{-1}\vec{b}$. This can be used to help complete squares
if you choose to go that way.
\Part We have so far assumed that the observation noise has a standard
normal distribution. While this assumption is nice to work with, we
would like to be able to handle a more general noise model. In
particular, we would like to extend our result from the previous part
to the case where the observation noise variables $Z_i$ are no longer
independent across samples, i.e.
$\mat{Z}$ is no longer
$N(\vec{0}, \mathbb{I}_n)$ but instead distributed as
$N(\mu_z, \matG{\Sigma}_z)$ for some mean $\mu_z$ and some covariance
$\matG{\Sigma}_z$ (still independent of the parameter $\vec{w}$).
{\bf Derive the posterior distribution of $\vec{w}$ by appropriately
changing coordinates.} We make the reasonable assumption that the
$\matG{\Sigma}_z$ is invertible, since otherwise, there would be some
dimension in which there is no noise.\\
Hint: Write $\mat{Z}$ as a function of a standard normal Gaussian
vector $\vec{V}\sim N(\vec{0}, \mathbb{I}_n)$ and use the result in (b) for an
equivalent model of the form $\widetilde{\mat{Y}} = \widetilde{\mat{X}}\vec{w} + \vec{V}$.
\Part (Compare the effect of different priors) In this part, you will
generate plots that show how different priors on $\vec{w}$ affect our
prediction of the true $\vec{w}$ which generated the data points. Pay
attention to how the amount of data used and the choice of prior
relative to the true $\vec{w}$ we use are related to the final
prediction.
Do the following for
$\matG{\Sigma} = \matG{\Sigma}_1,\matG{\Sigma}_2,\matG{\Sigma}_3,\matG{\Sigma}_4,\matG{\Sigma}_5,\matG{\Sigma}_6$ respectively, where
\begin{align*}
&\matG{\Sigma}_1 = \begin{bmatrix}
1 & 0\\
0 & 1
\end{bmatrix}; \quad
\matG{\Sigma}_2 = \begin{bmatrix}
1 & 0.25\\
0.25 & 1
\end{bmatrix} ; \quad
\matG{\Sigma}_3 = \begin{bmatrix}
1 & 0.9\\
0.9 & 1
\end{bmatrix} ; \\
&\matG{\Sigma}_4 = \begin{bmatrix}
1 & -0.25\\
-0.25 & 1
\end{bmatrix} ; \quad
\matG{\Sigma}_5 = \begin{bmatrix}
1 & -0.9\\
-0.9 & 1
\end{bmatrix} ; \quad
\matG{\Sigma}_6 = \begin{bmatrix}
0.1 & 0 \\
0 & 0.1
\end{bmatrix}
\end{align*}
Under the priors above, the coordinates of the (random) vector
$\vec{w}$ are: (1) independent with large variance, (2) mildly positively
correlated, (3) strongly positively correlated, (4) mildly negatively
correlated, (5) strongly negatively correlated, and (6) independent with
small variances respectively.
Using the starter code, generate data points (in the range [5, 500])
$Y = x_1 + x_2 + Z$ with $x_1, x_2 \sim N(0, 5)$ and $Z \sim N(0, 1)$
as training data (here, the true $\vec{w}$ is thus
$\begin{bmatrix} 1 & 1 \end{bmatrix}^T$). Note that the randomness of
$x_i$ here is only for the generation of the plot but in our
probabilistic model for parameter estimation we consider them as fixed
and given. The starter code helps you generate an interactive plot
where you can adjust the covariance prior and the number of samples
used to calculate the posterior. {\bf Include 6 plots of the contours
of the posteriors on $\vec{w}$ for various settings of
$\matG{\Sigma}$ and number of data points. Write the covariance
prior and number of samples for each plot. What do you observe as
the number of data points increases? }
\Part (Influence of Priors) For our simulations, we will generate $n$
training data samples from $Y=x_1+x_2+Z$ where again
$x_1,x_2 \sim N(0,5)$ and $Z\sim N(0,1)$ (all of them independent of
each other) as before. Notice that the true parameters
$w_1 = 1, w_2 = 1$ are moderately large and positively correlated with
each other. We want to quantitatively understand how the effect of the
prior influences the mean square error as we get more training data.
This should corroborate the qualitative results you saw in the
previous part.
In this case, we could directly compute the ``test error'' for a given
estimator $\widehat{\vec{w}}$ of the parameter $\vec{w}$ (our
prediction for $Y$ given a new data point $\vec{x} = (x_1, x_2)^\top$
is then $\widehat{Y} = \widehat{w}_1 x_1 + \widehat{w}_2
x_2$). Specifically, considering $\widehat{\vec{w}}$ now fixed, the
expected error for a randomly drawn $Y$ given the true (but unknown)
parameter vector $\vec{w} = (1, 1)^\top$ is equal to
$\EE_{Z,\vec{x}}(Y-\widehat{Y})^2 = 5(\widehat{w}_1-1)^2 +
5(\widehat{w}_2 - 1)^2 + 1$. We call this the \emph{theoretical
average test error}. Note that here by our choice of definition, the
expectation for new test samples is over $\vec{x}$ as well, although
our estimator is not taking the randomness of $\vec{x}$ in the
training data into account.
In practice, the expectation with respect to the true conditional
distribution of $Y$ given $\vec{w}$ cannot be computed since the true
$\vec{w}$ is unknown. Instead, we are only given a finite amount of
samples from the model (which we call the \emph{test set}, which independent
of the training data, but identically distributed) so that
it is only possible to compute
\begin{equation*}
\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2
\end{equation*}
which we call the \emph{empirical average test error} (also known as MSE).
Again, note that here,
$\hat{Y}_i = \vec{x}_i^\top\vec{\widehat{w}}$ where
$\vec{x}_i\in\mathbb{R}^2$ and $\vec{\widehat{w}}$ in your model is
the solution to the least square problem with Tikhonov regularization
given the training data.
{\bf Generate a test set of $500$ data points $(\vec{x}_i, Y_i)$ from the
above model. Plot the empirical and theoretical mean square error
between $\hat Y_i$ and $Y_i$ over the test data with respect to the
size of training data $n$ (increase $n$ from $5$ to $200$ in
increments of $5$). }
Note: If we just plotted both the empirical and theoretical average
test errors with respect to the amount of training data for one
``round'' or training data, the results would still look jagged. In
order to give a quantitive statement about the test error with respect
to the training data $n$ with a ``smoother'' plot, what we really want
to know is the expectation of the theoretical average test error with
respect to $\vec{w}$ and the training samples $(\vec{x}_i,Y_i)$, i.e.
$\EE_{(\vec{x}_1,Y_1),\dots, (\vec{x}_n,Y_n)}\EE_{Z,\vec{x}}(Y-\widehat{Y})^2$ (note
that in this term, only $\widehat{Y}$ depends on the training data
$(\vec{x}_1,Y_1),\dots, (\vec{x}_n,Y_n)$ whereas $(\vec{x},Y)$ is an independent fresh
test sample). Consequently, as an approximation, it is worth
replicating the entire experiment a few times (say 100 times) to get
an empirical estimate of this quantity. (It is also insightful to look
at the spread.) {\bf Compare what happens for different priors as the amount
of training data increases. Try plotting the theoretical MSE
with logarithmic x and y-axes and explain the plot. What
constitutes a ``good'' prior and which of the given priors are
``good'' choices for our particular $\vec{w} = (1,1)^\top$? Describe
how the influence of different priors changes with the number of
data points.}
\end{Parts}
\Question{Kernel Ridge Regression: Theory}
In ridge regression, we are given a vector $\y \in \R^{n}$ and a matrix $\X \in \mathbb{R}^{n \times
\ell}$, where $n$ is the number of training points and $\ell$ is the dimension of the raw data points. In
most settings we don't want to work with just the raw feature space, so we augment the
data points with features and replace $\X$ with $\bm{\Phi} \in \mathbb{R}^{n \times d}$, where
$\pmb{\phi}_i\T = \pmb{\phi}(\x_i) \in \R^{d}$. Then we solve a well-defined optimization problem
that involves the matrix $\bm{\Phi}$ and $\y$ to find the parameters $\w \in \R^{d}$. Note the problem that
arises here. If we have polynomial features of degree at most $p$ in the raw $\ell$ dimensional space,
then there are $d = \binom{\ell+p}{p}$ terms that we need to optimize, which can be very, very large
(much larger than the number of training points $n$). Wouldn't it be useful, if instead of solving
an optimization problem over $d$ variables, we could solve an equivalent problem over $n$ variables
(where $n$ is potentially much smaller than $d$), and achieve a computational runtime independent of the number of
augmented features? As it turns out, the concept of kernels (in addition to a technique called the
kernel trick) will allow us to achieve this goal.
\begin{Parts}
\Part (Dual perspective of the kernel method)
In lecture, you saw a derivation of kernel ridge regression involving Gaussians and conditioning.
There is also a pure optimization perspective that uses
Lagrangian multipliers to find the dual of the ridge regression problem. First, we could rewrite the
original problem as
\begin{align*}
\operatorname*{minimize}_{\w, \vec r} & \qquad \frac 1 2 \left[
\left\lVert \vec r \right\rVert_2^2 +
\lambda\left\lVert \w \right\rVert_2^2
\right] \\
\text{subject to} & \qquad \vec r = \X\w-\y.
\end{align*}
\textbf{Show that the solution of this is equivalent to}
\begin{equation}
\min_{\w,\r}\max_{\bm{\alpha}} L(\w, \vec r, \bm{\alpha}) :=
\min_{\w,\r}\max_{\bm{\alpha}} \left[
\frac 1 2 \left\lVert \r \right\rVert_2^2
+ \frac \lambda 2 \left\lVert \w \right\rVert_2^2
+ \bm{\alpha}\T \left( \vec r - \X\w + \y\right)\right],
\end{equation}
where $L(\w, \vec r, \bm{\alpha})$ is the Lagrangian function.
\Part Using the minmax theorem\footnote{\url{https://www.wikiwand.com/en/Minimax_theorem}}, we can
swap the min and max (think about what does the order of min and max mean here and why it is important):
\begin{equation}
\min_{\w,\r}\max_{\bm{\alpha}} L(\w, \vec r, \bm{\alpha}) = \max_{\bm{\alpha}} \min_{\w,\r} L(\w, \vec r, \bm{\alpha}).
\end{equation}
\textbf{Argue that the right hand side is equal to}
\begin{equation}
\argmin_{\bm{\alpha}} \left[ \frac{1}{2} \bm{\alpha}\T (\mat K+\lambda \I) \bm{\alpha} - \lambda \bm{\alpha}\T \y \right]
\text{where $\mat K = \X\X\T \in \R^{n \times n}$}.
\end{equation}
You can do this by setting the appropriate partial derivative of the Lagrangian $L$ to zero. This
is often call \emph{the Lagrangian dual problem} of the original optimization problem.
\Part \textbf{Finally, prove
that the optimal $\w^*$ can be computed using}
\begin{equation}
\w^* = \X\T \left( \mat K + \lambda\I \right)^{-1}\y.
\end{equation}
\Part (Polynomial Regression from a kernelized view) In this part, we will show that polynomial
regression with a particular Tiknov regularization is the same as kernel ridge regression with a
polynomial kernel for second-order polynomials. Recall that a degree 2 polynomial kernel function on
$\mathbb R^d$ is defined as
\begin{equation}
K(\x_i,\x_j) = (1+\x_i\T\x_j)^2,
\end{equation}
for any $\x_i,\x_j\in\mathbb R^\ell$. Given a dataset $(\x_i,y_i)$ for $i=1,2,\dots, n$, \textbf{show
the solution to kernel ridge regression is the same as the regularized least square solution to polynomial
regression (with unweighted monomials as features) for $p=2$ given the right choice of Tikhonov regularization for the polynomial
regression.} That is, show for any new point ${\x}$ given in the prediction stage, both methods
give the same prediction $\hat{y}$ with the same training data. \textbf{What is the Tikhonov regularization
matrix here?}
Hint: You may or may not use the following matrix identity:
\begin{equation}
\A(a \I_d + \A\T \A)^{-1} = (a \I + \A \A\T)^{-1}\A,
\end{equation}
for any matrix $\A\in\mathbb R^{n\times d}$ and any positive real number $a$.
\Part In general, for any polynomial regression with $p$th order polynomial on $\mathbb R^\ell$ with an
appropriately specified Tikhonov regression, we can show the equivalence between it and kernel ridge regression
with a polynomial kernel of order $p$. {\bf Comment on the computational complexity of doing least
squares for polynomial regression with this Tikhonov regression directly and that of doing kernel
ridge regression in the training stage.} (That is, the complexity of finding $\pmb\alpha$ and finding
$\vec w$.) {\bf Compare with the computational complexity of actually doing prediction as well.}
\end{Parts}
\Question{Kernel Ridge Regression: Practice}
In the following problem, you will implement Polynomial Ridge Regression and its kernel variant
Kernel Ridge Regression, and compare them with each other. You will be dealing with a 2D
regression problem, i.e., $\vec x_i \in \mathbb{R}^2$. We give you three datasets,
\texttt{circle.npz} (small dataset), \texttt{heart.npz} (medium dataset), and
\texttt{asymmetric.npz} (large dataset). In this problem, we choose $y_i \in \{-1, +1\}$, so you
may view this question as a classification problem. Later on in the course we will learn about
logistic regression and SVMs, which can solve classification problems much better and can also
leverage kernels.
\begin{Parts}
\Part \textbf{Use \texttt{matplotlib.pyplot} to visualize all the datasets and attach the plots to
your report}. Label the points with different $y$ values with different colors and/or shapes. You
are only allow to use \texttt{numpy.*} and \texttt{numpy.linalg.*} in the following
questions.
\Part \textbf{Implement polynomial ridge regression} (non-kernelized version that you should
already have implemented in your previous homework) \textbf{to fit the datasets
\texttt{circle.npz}, \texttt{asymmetric.npy}, and \texttt{heart.npz}}. Use the first 80\the training dataset and the last 20\average training squared loss and the average validation squared for polynomial order $p \in \{1, \dots,
16\}$}. Use the regularization term $\lambda=0.001$ for all $p$. \textbf{Visualize your result
and attach the heatmap plots for the learned predictions over the entire 2D domain for $p \in \{2,
4, 6, 8, 10, 12\}$ in your report.} You can start with the code from homework 2, problem 5.
\Part \textbf{Implement kernel ridge regression to fit the datasets \texttt{circle.npz},
\texttt{asymmetric.npy}, and \texttt{heart.npz}}. Use the polynomial kernel $K(\vec x_i, \vec x_j)
= (1 + \vec x_i^\top \vec x_j)^p$. Use the first 80\20\average validation squared loss for polynomial order $p \in \{1,\dots, 16\}$}. Use the regularization
term $\lambda=0.001$ for all $p$. The sample code for generating heatmap plot is included in the
start kit. \textbf{For \texttt{circle.npz}, also report the average training squared loss and validation
squared loss for polynomial order $p \in \{1,\dots, 24\}$ when you use only the first 15\training dataset and the rest 85\\textbf{comment on when you want to use a high-order polynomial in linear/ridge regression.}
\Part (Diminishing influence of the prior with growing amount of data) With increasing of amount
of data, the prior (from the statistical view) and regularization (from the optimization view) will
be washed away and become less and less important. Sample the training data from the first 80\data from \texttt{asymmetric.npz} and use the data from the last 20\\textbf{Make a plot whose $x$ axis is the amount of the training data and $y$ axis is the
validation squared loss of the non-kernelized ridge regression algorithm. Optionally, repeat the same for
kernel ridge regression.} Include 6 curves for hyper-parameters $\lambda \in \{0.0001, 0.001,
0.01\}$ and $p = \{5, 6\}$. Your plot should demonstrate that with same $p$, the validation squared loss
will converge with enough data, regardless of the choice of $\lambda$ and/or the regularizer matrix.
You can use log plot on $x$ axis for clearity and you need to resample the data multiple times for
the given $p$, $\lambda$, and the amount of training data in order to get a smooth curve.
\Part A popular kernel function that is widely used in various kernelized learning algorithms is
called the radial basis function kernel (RBF kernel). It is defined as
\begin{equation}
K(\x, \x') = \exp \left(-\frac{\lVert \x-\x'\rVert_2^2}{2\sigma^2}\right).
\end{equation}
\textbf{Implement the RBF kernel function for kernel ridge regression to fit the dataset
\texttt{heart.npz}}. Use the regularization term $\lambda=0.001$. \textbf{Report the average
squared loss, visualize your result and attach the heatmap plots for the fitted functions over the 2D
domain for $\sigma \in \{10, 3, 1, 0.3, 0.1, 0.03\}$ in your report.} You may want to vectorize
your kernel functions to speed up your implementation. \textbf{Comment on the effect of $\sigma$.}
\Part For polynomial ridge regression, \textbf{which of your implementation is more efficient, the
kernelized one or the non-kernelized one?} For RBF kernel, \textbf{explain whether it is possible
to implement it in the non-kernelized ridge regression.} \textbf{Summarize when you
prefer the kernelized to the non-kernelized ridge regression.}
\Part Disable the \texttt{clip} option in the provided \texttt{heatmap} function and redraw the
heatmap plots for the functions learned by the polynomial kernel and RBF kernel. Experiment on the
provided datasets and \textbf{describe one potential problem of the polynomial kernel related to
what you see here.} Does the RBF kernel have such problem? \textbf{Compute, compare, comment, and
attach the heatmap plots of the polynomial kernel and the RBF kernel on \texttt{heart.npz}
dataset.}
\end{Parts}
\Question{Your Own Question}
{\bf Write your own question, and provide a thorough solution.}
Writing your own problems is a very important way to really learn
the material. The famous ``Bloom's Taxonomy'' that lists the levels of
learning is: Remember, Understand, Apply, Analyze, Evaluate, and
Create. Using what you know to create is the top-level. We rarely ask
you any HW questions about the lowest level of straight-up
remembering, expecting you to be able to do that yourself. (e.g. make
yourself flashcards) But we don't want the same to be true about the
highest level.
As a practical matter, having some practice at trying to create
problems helps you study for exams much better than simply counting on
solving existing practice problems. This is because thinking about how
to create an interesting problem forces you to really look at the
material from the perspective of those who are going to create the
exams.
Besides, this is fun. If you want to make a boring problem, go
ahead. That is your prerogative. But it is more fun to really engage
with the material, discover something interesting, and then come up
with a problem that walks others down a journey that lets them share
your discovery. You don't have to achieve this every week. But unless
you try every week, it probably won't happen ever.
\end{document}