Homework 6 (100 pts)

This problem set is due Wednesday (11/16/2016) at 11:59pm. Please turn in your work by uploading to Canvas. f you have questions, please post them on the course forum, rather than emailing the course staff. This will allow other students with the same question to see the response and any ensuing discussion. The goal of this problem set is to review clustering through the process of spike sorting action potentials data. You should submit both commented code and a write-up for this homework assignment.

Description of data set

The dataset for this problem is linked in the numpy file SpikeData05.npz, which contains numpy arrays SpikeWaveforms and RawData. (The raw data was too large to upload. It is available on request.) You can load it by calling

import numpy as np
data = np.load('SpikeData05.npz')
SpikeWaveforms = data['SpikeWaveforms']
# RawData = data['RawData']

Questions

  1. Plot the spike peaks in 4-D space (10 pts)

    Find the peak amplitude (post-threshold) of each snippet on each channel. If you look at some waveforms, you’ll see that the peak typically happens within 5-10 data points of the threshold crossing (later peaks are noise!). Make a 6 subplot figure showing a point in each panel for each detected action potential. The point should be the peak amplitude on channel A vs the peak amplitude on channel B, where A and B are the combinatorial pairs within {1; 2; 3; 4} (i.e., 1 vs. 2, 1 vs. 3, etc.). Once you have found the snippet peaks as a list or numpy array (Peaks), the following code can be useful:

    import seaborn as sns
    PP = pd.DataFrame(np.array(Peaks))
    g = sns.PairGrid(PP)
    g = g.map_lower(plt.hexbin,gridsize=50, mincnt=1, cmap='seismic',bins='log')
    for i, j in zip(*np.triu_indices_from(g.axes, 0)):
      g.axes[i, j].set_visible(False)
    
  2. Clustering with K-Means (40 pts)

    Note that SciPy has an implementation of K-means. For this assignment, however, you are to implement your own version of it. Use your implementation to determine the neuron responsible for each recorded spike. Use only the data from the first channel of the tetrode. You may only use basic NumPy commands, i.e. linear algebra and functions that you would find on a graphing calculator. Treat each snippet as a point in , where is 40, the length of each snippet, and is the total number of snippets. In this problem, we will assume there are neurons contributing spikes to the recorded waveform. For computational simplicity, use only the first 2000 snippets ().

    a. For each cluster (k = 1, 2, 3), create a separate “voltage versus time” plot containing the following:

    • the cluster center generated by K-Means as a red waveform trace (i.e., the prototypical action potential for the k-th neuron and

    • all of the waveform snippets assigned to the k-th cluster

    b. Plot the objective number versus iteration number (as in Figure 9.2 in PRML). How many iterations did it take for K-Means to converge?

    c. Repeat the problem with K = 4 clusters.

  3. Gaussian Mixtures (20 pts)

    THIS QUESTION IS INCLUDED ONLY FOR REFERENCE - DO NOT TURN IN.

    Consider the plot of snippet peaks you made for problem (1). The clouds of points you find do not seem circular. Thus, we’d like to find clusters that would fit to the shape of what we observe. This is a case where a Gaussian mixture model can work really well. In class, we derived the EM algorithm for the Gaussian mixture model by taking derivatives of the data likelihood , where represents the training data , and represents all the model parameters . An alternate approach is to apply the general EM algorithm (shown on p. 440-441 in PRML) to the Gaussian mixture model. Both approaches should yield the same update equations, which we will show in this problem.

    The Gaussian mixture model is defined by the prior probability of each mixture component indexed by :

    and the conditional distribution of the data point $#\mathbf{x}#$ given the mixture component

    where . We will denote the n-th data point as and its corresponding latent variable as , where .

    a. In the E-step of the general EM algorithm, we evaluate . For the Gaussian mixture model, show that

    b. In the M-step of the general EM algorithm, we maximize the expected log joint distribution (summed across all data points)

    with respect to the model parameters . Note that the expectation above is taken with respect to the distribution found in the E-step in part (a). We seek to find

    For the Gaussian mixture model, let for notational simplicity. By maximizing with respect to each of the model parameters, show that

    where .

    (Hint: The distribution found in the E-step, , should be treated as a fixed distribution (with no parameters) in the M-step. In other words, the should be treated as constants in . The model parameters should only appear in the joint distribution .)

  4. Using Gaussian Mixtures (50 pts)

    In python, the scikit-learn package has nice support for mixture models.

    from sklearn import mixture
    gmix = mixture.GMM(n_components=2, covariance_type='full')
    gmix.fit(peaks)
    

    You will use this package to define and train a model of the 4-channel waveform peaks. To make this problem more interesting, load data from the file SpikeData12.npz, and find the waveform peaks as in (1).

    a. Use the first 5000 snippet-peaks (i.e., are the peak of the spike snippet in each tetrode channel for snippets), and learn the Gaussian mixture parameters for neurons. Plot the resulting cluster assignments in a six panel plot as in (2) with the clusters color-coded. A well documented example can be found in from the scikit-learn docs.

    b. Find the cluster assignments for the next 5000 snippet-peaks. You will use these as test data to evaluate how many clusters there should be in the data. Calculate the model likelihood of the second set of 5000 snippet-peaks using the parameters you found in (a). Now, repeat the EM-learning in (a), but with . What is the likelihood of the test data for each model? Which model would you use if you wanted only well-clustered neurons for your analysis? For the most likely value of , plot the cluster assignments as in (a).