Homework 6

Homework 6 (100 pts)

This problem set is due Wednesday (11/16/2016) at 11:59pm. Please turn in your work by uploading to Canvas. f you have questions, please post them on the course forum, rather than emailing the course staff. This will allow other students with the same question to see the response and any ensuing discussion. The goal of this problem set is to review clustering through the process of spike sorting action potentials data. You should submit both commented code and a write-up for this homework assignment.

Description of data set

The dataset for this problem is linked in the numpy file SpikeData05.npz, which contains numpy arrays SpikeWaveforms and ~~RawData~~. (The raw data was too large to upload. It is available on request.) You can load it by calling

import numpy as np
data = np.load('SpikeData05.npz')
SpikeWaveforms = data['SpikeWaveforms']
# RawData = data['RawData']

RawData: This is a matrix of raw data recorded from a tetrode (a four channel electrode) in the hippocampus. The sampling rate is 30 kHz, and the data are simultaneous voltage signals (in μV) from each channel. I considered having you do the spike extraction for this data - this involves two steps (1) filtering out LFP (600 Hz 2-pole highpass and 2-pole 6 kHz low pass and (2) finding instances where one of the channels crosses a threshold value. For the sake of time, I decided to do this for you. Each snippet is 40 points long, and the threshold crossing happens between the 10th and 11th data points. In this data, I chose a value of 60 uV for the threshold.
SpikeWaveforms : This is a matrix of action potential waveforms sampled simultaneously on all four channels of the tetrode. The size is $M\times 40 \times 4$ , where $M = 33787$ is the number of action potentials detected, and the recorded portion of the action potential waveform (“snippet”) is in a 40 sample window around the threshold crossing.

Questions

Plot the spike peaks in 4-D space (10 pts)

Find the peak amplitude (post-threshold) of each snippet on each channel. If you look at some waveforms, you’ll see that the peak typically happens within 5-10 data points of the threshold crossing (later peaks are noise!). Make a 6 subplot figure showing a point in each panel for each detected action potential. The point should be the peak amplitude on channel A vs the peak amplitude on channel B, where A and B are the combinatorial pairs within {1; 2; 3; 4} (i.e., 1 vs. 2, 1 vs. 3, etc.). Once you have found the snippet peaks as a list or numpy array (Peaks), the following code can be useful:
```
import seaborn as sns
PP = pd.DataFrame(np.array(Peaks))
g = sns.PairGrid(PP)
g = g.map_lower(plt.hexbin,gridsize=50, mincnt=1, cmap='seismic',bins='log')
for i, j in zip(*np.triu_indices_from(g.axes, 0)):
  g.axes[i, j].set_visible(False)
```
Clustering with K-Means (40 pts)

Note that SciPy has an implementation of K-means. For this assignment, however, you are to implement your own version of it. Use your implementation to determine the neuron responsible for each recorded spike. Use only the data from the first channel of the tetrode. You may only use basic NumPy commands, i.e. linear algebra and functions that you would find on a graphing calculator. Treat each snippet as a point in $\mathbf{x}_m \in \mathbb{R}^D, \, m\in \lbrace 1, 2, \ldots M \rbrace$ , where $D$ is 40, the length of each snippet, and $M$ is the total number of snippets. In this problem, we will assume there are $K = 3$ neurons contributing spikes to the recorded waveform. For computational simplicity, use only the first 2000 snippets ( $M = 2000$ ).

a. For each cluster (k = 1, 2, 3), create a separate “voltage versus time” plot containing the following:
- the cluster center $\boldsymbol \mu_k$ generated by K-Means as a red waveform trace (i.e., the prototypical action potential for the k-th neuron and
- all of the waveform snippets assigned to the k-th cluster
b. Plot the objective number $J$ versus iteration number (as in Figure 9.2 in PRML). How many iterations did it take for K-Means to converge?

c. Repeat the problem with K = 4 clusters.
Gaussian Mixtures (20 pts)

THIS QUESTION IS INCLUDED ONLY FOR REFERENCE - DO NOT TURN IN.

Consider the plot of snippet peaks you made for problem (1). The clouds of points you find do not seem circular. Thus, we’d like to find clusters that would fit to the shape of what we observe. This is a case where a Gaussian mixture model can work really well. In class, we derived the EM algorithm for the Gaussian mixture model by taking derivatives of the data likelihood $\Pr(\lbrace \mathbf{x} \rbrace \vert \theta)$ , where $\lbrace \mathbf{x} \rbrace$ represents the training data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M$ , and $\theta$ represents all the model parameters ${\boldsymbol \mu}_k, {\boldsymbol\Sigma}_k, \text{ and } \pi_k, (k = 1, \ldots, K)$ . An alternate approach is to apply the general EM algorithm (shown on p. 440-441 in PRML) to the Gaussian mixture model. Both approaches should yield the same update equations, which we will show in this problem.

The Gaussian mixture model is defined by the prior probability of each mixture component indexed by $z$ :
$\Pr(z = k) = \pi_k$
and the conditional distribution of the data point $#\mathbf{x}#$ given the mixture component
$\Pr(\mathbf{x} \vert z = k) = \mathcal{N} (\mathbf{x} \vert \boldsymbol \mu_k, \boldsymbol \Sigma_k)$
where $k = 1, \ldots, K$ . We will denote the n-th data point as $\mathbf{x}_n$ and its corresponding latent variable as $z_n$ , where $n = 1, \ldots, N$ .

a. In the E-step of the general EM algorithm, we evaluate $\Pr(z_n = k \vert \mathbf{x}_n)$ . For the Gaussian mixture model, show that
$\Pr(z = k \vert \mathbf{x}_n) = \frac{\mathcal{N}(\mathbf{x}_n \vert \boldsymbol \mu_k, \boldsymbol \Sigma_k) \pi_k}% {\sum_{j=1}^K \mathcal{N}(\mathbf{x}_n \vert \boldsymbol \mu_j, \boldsymbol \Sigma_j) \pi_j}$
b. In the M-step of the general EM algorithm, we maximize the expected log joint distribution (summed across all $N$ data points)
$\begin{split} \mathcal{Q}(\theta) &=\sum_{n=1}^N \text{E}\left[\log \Pr(\mathbf{x}_n, z_n \vert \theta)\right] \\ &=\sum_{n=1}^N \sum_{k=1}^K \Pr(z_n=k \vert \mathbf{x}_n) \log \Pr(\mathbf{x}_n, z_n \vert \theta) \end{split}$
with respect to the model parameters $\theta$ . Note that the expectation above is taken with respect to the distribution found in the E-step in part (a). We seek to find
$\theta^{\text{new}} = \underset{\theta}{\text{argmax}} \mathcal{Q}(\theta).$
For the Gaussian mixture model, let $\gamma_{nk} = \Pr(z_n = k \vert \mathbf{x}_n)$ for notational simplicity. By maximizing $\mathcal{Q}(\theta)$ with respect to each of the model parameters, show that
$\begin{split} \boldsymbol \mu_k^{\text{new}} &= \frac{1}{N_k} \sum_{n=1}^N \gamma_{n k} \mathbf{x}_n \\ \boldsymbol \Sigma_k^{\text{new}} &= \frac{1}{N_k} \sum_{n=1}^N \gamma_{n k} (\mathbf{x}_n - \boldsymbol \mu_k) (\mathbf{x}_n - \boldsymbol \mu_k)^T \\ \pi_k^{\text{new}} &= \frac{N_k}{N} \end{split}$
where $N_k = \sum_{n=1}^N \gamma_{n k}$ .

(Hint: The distribution found in the E-step, $\Pr(z_n = k \vert \mathbf{x}_n$ , should be treated as a fixed distribution (with no parameters) in the M-step. In other words, the $\gamma_{n k}$ should be treated as constants in $\mathcal{Q}(\theta)$ . The model parameters should only appear in the joint distribution $\Pr(\mathcal{x}_n, z_n \vert \theta)$ .)
Using Gaussian Mixtures (50 pts)

In python, the scikit-learn package has nice support for mixture models.
```
from sklearn import mixture
gmix = mixture.GMM(n_components=2, covariance_type='full')
gmix.fit(peaks)
```
You will use this package to define and train a model of the 4-channel waveform peaks. To make this problem more interesting, load data from the file SpikeData12.npz, and find the waveform peaks as in (1).

a. Use the first 5000 snippet-peaks (i.e., $\mathbf{x}_n \in \mathcal{R}^4$ are the peak of the spike snippet in each tetrode channel for $n = 1, \ldots 5000$ snippets), and learn the Gaussian mixture parameters for $K=10$ neurons. Plot the resulting cluster assignments in a six panel plot as in (2) with the clusters color-coded. A well documented example can be found in from the scikit-learn docs.

b. Find the cluster assignments for the next 5000 snippet-peaks. You will use these as test data to evaluate how many clusters there should be in the data. Calculate the model likelihood of the second set of 5000 snippet-peaks using the parameters you found in (a). Now, repeat the EM-learning in (a), but with $K = 8, 9, \ldots, 20$ . What is the likelihood of the test data for each model? Which model would you use if you wanted only well-clustered neurons for your analysis? For the most likely value of $K$ , plot the cluster assignments as in (a).