Homework 4

Homework 4 (100 pts)

This problem set is due Tuesday (10/19/2021) at 11:59pm. Please turn in your work by uploading to Canvas. If you have questions, please post them on the course forum, rather than emailing the course staff. This will allow other students with the same question to see the response and any ensuing discussion. The goal of this problem set is to review clustering through the process of spike sorting action potentials data. You should submit both commented code and a write-up for this homework assignment.

Description of the first data set

The dataset for this problem is linked in the numpy file SpikeData05.npz (SpikeData05.mat), which contains numpy arrays SpikeWaveforms and ~~RawData~~. (The raw data was too large to upload. It is available on request.) You can load it by calling

import numpy as np
data = np.load('SpikeData05.npz')
SpikeWaveforms = data['SpikeWaveforms']
# RawData = data['RawData']

RawData: This is a matrix of raw data recorded from a tetrode (a four channel electrode) in the hippocampus. The sampling rate is 30 kHz, and the data are simultaneous voltage signals (in μV) from each channel. I considered having you do the spike extraction for this data - this involves two steps (1) filtering out LFP (600 Hz 2-pole highpass and 2-pole 6 kHz low pass and (2) finding instances where one of the channels crosses a threshold value. For the sake of time, I decided to do this for you. Each snippet is 40 points long, and the threshold crossing happens between the 10th and 11th data points. In this data, I chose a value of 60 uV for the threshold.
SpikeWaveforms : This is a matrix of action potential waveforms sampled simultaneously on all four channels of the tetrode. The size is $M\times 40 \times 4$, where $M = 33787$ is the number of action potentials detected, and the recorded portion of the action potential waveform (“snippet”) is in a 40 sample window around the threshold crossing.

Questions

Plot the spike peaks in 4-D space (10 pts)

Find the peak amplitude (post-threshold) of each snippet on each channel. If you look at some waveforms, you’ll see that the peak typically happens within 5-10 data points of the threshold crossing (later peaks are noise!). Search for peaks that are between timepoint 5 and timepoint 25. If you are new to NumPy, the function np.argmax() is a good place to start! You can choose whether to find the index of the snippet of the peak across all 4 channels independently, or jointly (e.g., averaging the 4 channels, finding the peak of the averaged waveforms, and then extracting the 4-D value at the resulting index). Make a 6 subplot figure (like the one shown in the lecture) showing a point in each panel for each detected action potential. The point should be the peak amplitude on channel A vs the peak amplitude on channel B, where A and B are the combinatorial pairs within {1; 2; 3; 4} (i.e., 1 vs. 2, 1 vs. 3, etc.). Once you have found the snippet peaks as a list or numpy array (Peaks), the following code can be useful:
```
import pandas as pd
import seaborn as sns
PP = pd.DataFrame(np.array(Peaks))
g = sns.PairGrid(PP)
g = g.map_lower(plt.hexbin,gridsize=50, mincnt=1, cmap='seismic',bins='log')
for i, j in zip(*np.triu_indices_from(g.axes, 0)):
  g.axes[i, j].set_visible(False)
```
Clustering with K-Means (40 pts)

Note that SciPy has an implementation of K-means. For this assignment, however, you are to implement your own version of it. Use your implementation to determine the neuron responsible for each recorded spike. Use only the data from the first channel of the tetrode. You may only use basic NumPy commands, i.e. linear algebra and functions that you would find on a graphing calculator. Treat each snippet as a point in $\mathbf{x}_m \in \mathbb{R}^D, \, m\in \lbrace 1, 2, \ldots M \rbrace$, where $D$ is 40, the length of each snippet, and $M$ is the total number of snippets. In this problem, we will assume there are $K = 3$ neurons contributing spikes to the recorded waveform. For computational simplicity, use only the first 2000 snippets ($M = 2000$).

a. For each cluster (k = 1, 2, 3), create a separate “voltage versus time” plot containing the following:
- the cluster center $\boldsymbol \mu_k$ generated by K-Means as a red waveform trace (i.e., the prototypical action potential for the k-th neuron and
- all of the waveform snippets assigned to the k-th cluster
b. Plot the objective number $J$ versus iteration number (as in Figure 9.2 in PRML). How many iterations did it take for K-Means to converge?

c. Repeat the problem with K = 4 clusters.
Gaussian Mixtures (20 pts)

THIS QUESTION IS INCLUDED ONLY FOR REFERENCE - DO NOT TURN IN.

Consider the plot of snippet peaks you made for problem (1). The clouds of points you find do not seem circular. Thus, we’d like to find clusters that would fit to the shape of what we observe. This is a case where a Gaussian mixture model can work really well. In class, we derived the EM algorithm for the Gaussian mixture model by taking derivatives of the data likelihood $\Pr(\lbrace \mathbf{x} \rbrace \vert \theta)$, where $\lbrace \mathbf{x} \rbrace$ represents the training data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M$, and $\theta$ represents all the model parameters ${\boldsymbol \mu}_k, {\boldsymbol\Sigma}_k, \text{ and } \pi_k, (k = 1, \ldots, K)$. An alternate approach is to apply the general EM algorithm (shown on p. 440-441 in PRML) to the Gaussian mixture model. Both approaches should yield the same update equations, which we will show in this problem.

The Gaussian mixture model is defined by the prior probability of each mixture component indexed by $z$:
\[\Pr(z = k) = \pi_k\]
and the conditional distribution of the data point $#\mathbf{x}#$ given the mixture component
\[\Pr(\mathbf{x} \vert z = k) = \mathcal{N} (\mathbf{x} \vert \boldsymbol \mu_k, \boldsymbol \Sigma_k)\]
where $k = 1, \ldots, K$. We will denote the n-th data point as $\mathbf{x}_n$ and its corresponding latent variable as $z_n$, where $n = 1, \ldots, N$.

a. In the E-step of the general EM algorithm, we evaluate $\Pr(z_n = k \vert \mathbf{x}_n)$. For the Gaussian mixture model, show that
\[\Pr(z = k \vert \mathbf{x}_n) = \frac{\mathcal{N}(\mathbf{x}_n \vert \boldsymbol \mu_k, \boldsymbol \Sigma_k) \pi_k}% {\sum_{j=1}^K \mathcal{N}(\mathbf{x}_n \vert \boldsymbol \mu_j, \boldsymbol \Sigma_j) \pi_j}\]
b. In the M-step of the general EM algorithm, we maximize the expected log joint distribution (summed across all $N$ data points)
\[\begin{split} \mathcal{Q}(\theta) &=\sum_{n=1}^N \text{E}\left[\log \Pr(\mathbf{x}_n, z_n \vert \theta)\right] \\ &=\sum_{n=1}^N \sum_{k=1}^K \Pr(z_n=k \vert \mathbf{x}_n) \log \Pr(\mathbf{x}_n, z_n \vert \theta) \end{split}\]
with respect to the model parameters $\theta$. Note that the expectation above is taken with respect to the distribution found in the E-step in part (a). We seek to find
\[\theta^{\text{new}} = \underset{\theta}{\text{argmax}} \mathcal{Q}(\theta).\]
For the Gaussian mixture model, let $\gamma_{nk} = \Pr(z_n = k \vert \mathbf{x}_n)$ for notational simplicity. By maximizing $\mathcal{Q}(\theta)$ with respect to each of the model parameters, show that
\[\begin{split} \boldsymbol \mu_k^{\text{new}} &= \frac{1}{N_k} \sum_{n=1}^N \gamma_{n k} \mathbf{x}_n \\ \boldsymbol \Sigma_k^{\text{new}} &= \frac{1}{N_k} \sum_{n=1}^N \gamma_{n k} (\mathbf{x}_n - \boldsymbol \mu_k) (\mathbf{x}_n - \boldsymbol \mu_k)^T \\ \pi_k^{\text{new}} &= \frac{N_k}{N} \end{split}\]
where $N_k = \sum_{n=1}^N \gamma_{n k}$.

(Hint: The distribution found in the E-step, $\Pr(z_n = k \vert \mathbf{x}_n$, should be treated as a fixed distribution (with no parameters) in the M-step. In other words, the $\gamma_{n k}$ should be treated as constants in $\mathcal{Q}(\theta)$. The model parameters should only appear in the joint distribution $\Pr(\mathcal{x}_n, z_n \vert \theta)$.)
Using Gaussian Mixtures (50 pts)

In python, the scikit-learn package has nice support for mixture models.
```
from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(peaks)
```
You will use this package to define and train a model of the 4-channel waveform peaks. To make this problem more interesting, load data from the file SpikeData12.npz (SpikeData12.mat), and find the waveform peaks as in (1).

a. Use the first 5000 snippet-peaks (i.e., $\mathbf{x}_n \in \mathcal{R}^4$ are the peak of the spike snippet in each tetrode channel for $n = 1, \ldots 5000$ snippets), and learn the Gaussian mixture parameters for $K=10$ neurons. Plot the resulting cluster assignments in a six panel plot as in (2) with the clusters color-coded. A well documented example can be found in from the scikit-learn docs.

b. Find the cluster assignments for the next 5000 snippet-peaks. You will use these as test data to evaluate how many clusters there should be in the data. Calculate the model likelihood of the second set of 5000 snippet-peaks using the parameters you found in (a). The score_samples() function returns the log likelihood of one or more data points. The score() function returns the average log likelihood for many data points, which represents the model quality we care about. Now, repeat the EM-learning in (a), but with $K = 8, 9, \ldots, 20$. What is the likelihood of the test data for each model? Which model would you use if you wanted only well-clustered neurons for your analysis? For the most likely value of $K$, plot the cluster assignments as in (a).

c. Train models with a full covariance matrix as in (a) and a diagonal covariance matrix (use the covariance_type='diagonal' option). Compare the model likelihoods on test data. Which model does a better job?
Clustering using Pearson Correlation (50 pts)

This problem replicates a result from the paper “Sub-second dynamics of theta-gamma coupling in hippocampal CA1” by Zhang et al https://elifesciences.org/articles/44320. They calculated the spectrograms of their LFP data and extracted individual theta cycles. They then clustered these spectrograms to assess whether higher frequency gamma oscillations preferentially occur at certain phases of theta.

The data for this problem can be found here. You can load it as:
```
 import numpy as np
 import matplotlib.pyplot as plt

 data = np.load('hw4problem5.npy')
 plt.imshow(data[0,:,:])
```
The data you are given are a series of matrices of size $K$ x $N_{\theta}$, where K is 81 and $N_{\theta}$, the number of phase bins, is 20. The phase goes from $-\pi$ to $\pi$ (i.e., phase = np.linspace(-np.pi, np.pi, 20), and the frequency vector can be calculated as freqs = np.arange(20, 182, 2)[::-1]. Note that in the frequency vector and the data, the 0-th entry corresponds to the highest frequency (180 Hz). This is so that plotting with imshow() works nicely.

Your job is to replicate their analysis.

They use a distance measure derived from the Pearson correlation, specificially $d(x,y) = 1 - \rho(x,y)$. As shown in this paper, https://ieeexplore.ieee.org/abstract/document/6739991, clustering with Pearson correlation distance is equivalent to using K-means on normalized data, $\tilde{x} = \frac{x - \bar{x}}{\sigma_x}$, where $\bar{x}$ is the mean of the data point $x$ (across its dimensions) and $\sigma_x$ is its standard deviation.

To cluster, you can use your K-means implementation or the scikit-learn one. Assuming you’ve created a matrix normdata, which is normalized properly, here’s some example code that works.
```
 from sklearn.cluster import KMeans
 K = 4
 ndata = np.reshape(normdata, (normdata.shape[0], normdata.shape[1] * normdata.shape[2]))
 kmeans = KMeans(n_clusters=K, random_state=0, algorithm="full").fit(ndata)

 cluster0 = np.reshape(kmeans.cluster_centers_[0,:], (normdata.shape[1], normdata.shape[2]))
 plt.imshow(cluster0)
```
a. Cluster the theta cycle spectrograms into K=4 clusters. Plot the mean spectrogram of each cluster as in their Figure 1D.

b. Find 3 data points (i.e., theta cycles) which are close to the centroid of each cluster and plot them. How similar do invidivual cycles look to the averages?

c. Find 3 data points which are close to the boundaries of two or more of the different clusters. (You can do this by finding the distances to each cluster center, and sorting the data points by how close they are!) Plot these examples. Give a qualitative description.