Fall 2011: STAT 598Z: Introduction to Computing for Statisticians

1. Download the a1a and a1a.t datasets from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a1a and modify the Python program we wrote in class to read in these two datasets. Make sure that the feature dimensions of the training and test files match (Hint: use the larger of the two dimensions).
2. Write a Python function which takes as input two numpy 1-d arrays and computes the Euclidean distance between them. Recall that
$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} = \sqrt{\langle x, x \rangle + \langle y, y \rangle - 2 \langle x, y \rangle}$
3. Given a point in a1a.t, compute its distance from each point in a1a using the above function.
4. Given a point in a1a.t, find the index of its nearest neighbor in a1a. Predict the label of the point as the label of its nearest neighbor. If the predicted label matches the true label of the point, then the point is correctly classified.
5. Repeat the above step for all points in a1a.t and check what fraction are correctly classified.
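
The steps above can be sketched roughly as follows. This is only an illustration, not the program written in class and not the nn.py solution. It assumes the LIBSVM text format (each line is a label followed by `index:value` pairs with 1-based indices). The function names (`parse_libsvm`, `pad_columns`, `euclidean_distance`, `nearest_neighbor_index`, `nn_accuracy`) are hypothetical, and the toy lines at the bottom are made up so that the sketch runs without downloading anything.

```python
import numpy as np


def parse_libsvm(lines):
    """Parse LIBSVM-style lines ("label idx:val idx:val ...") into (X, y)."""
    labels, rows, max_index = [], [], 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        labels.append(float(tokens[0]))
        entries = {}
        for token in tokens[1:]:
            idx, val = token.split(":")
            entries[int(idx)] = float(val)
            max_index = max(max_index, int(idx))
        rows.append(entries)
    X = np.zeros((len(rows), max_index))
    for i, entries in enumerate(rows):
        for idx, val in entries.items():
            X[i, idx - 1] = val  # indices in the file are 1-based
    return X, np.array(labels)


def pad_columns(X, dim):
    """Append zero columns so that X has exactly `dim` columns."""
    padded = np.zeros((X.shape[0], dim))
    padded[:, :X.shape[1]] = X
    return padded


def euclidean_distance(x, y):
    """Distance via sqrt(<x,x> + <y,y> - 2<x,y>)."""
    squared = np.dot(x, x) + np.dot(y, y) - 2.0 * np.dot(x, y)
    return np.sqrt(max(squared, 0.0))  # guard against tiny negative round-off


def nearest_neighbor_index(x, X_train):
    """Index of the training point closest to x."""
    distances = np.array([euclidean_distance(x, row) for row in X_train])
    return int(np.argmin(distances))


def nn_accuracy(X_train, y_train, X_test, y_test):
    """Fraction of test points whose nearest neighbor has the same label."""
    correct = 0
    for x, label in zip(X_test, y_test):
        if y_train[nearest_neighbor_index(x, X_train)] == label:
            correct += 1
    return correct / len(y_test)


# Made-up toy data: the training lines use 4 features, the test lines use 5.
train_lines = ["+1 1:1 3:1", "-1 2:1", "-1 2:1 4:1"]
test_lines = ["+1 1:1", "-1 2:1 5:1"]

X_train, y_train = parse_libsvm(train_lines)
X_test, y_test = parse_libsvm(test_lines)

# Use the larger of the two dimensions, as the hint in step 1 suggests.
dim = max(X_train.shape[1], X_test.shape[1])
X_train, X_test = pad_columns(X_train, dim), pad_columns(X_test, dim)

accuracy = nn_accuracy(X_train, y_train, X_test, y_test)
print("fraction correctly classified:", accuracy)
```

With the downloaded files, the same functions could be applied by passing open file handles (for example `open("a1a")` and `open("a1a.t")`) in place of the toy lists, since iterating over a file yields its lines. The Python loop over training points is slow but mirrors the steps as stated; computing all distances at once with matrix operations would be a natural optimization.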

Solution nn.py