See page 39 of the CIML book (Chapter 3). This notebook builds on those experiments.
Simple experiment: we'll generate 200 random n-dimensional points and compute the distance between every pair.
Question 1: How many unique pairs of points are there?
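One way to check an answer to Question 1 is to count the pairs directly. The sketch below assumes the closed form N(N-1)/2 for the number of unordered pairs among N points and compares it against a brute-force enumeration; `count_unique_pairs` is a hypothetical helper name, not part of the original notebook.

```python
from itertools import combinations
from math import comb


def count_unique_pairs(num_points):
    """Number of unordered pairs among num_points points: N * (N - 1) / 2."""
    return num_points * (num_points - 1) // 2


num_points = 200  # matches the 200 points generated in this notebook

# Brute-force enumeration of every unordered pair of indices (i, j) with i < j.
enumerated = sum(1 for _ in combinations(range(num_points), 2))

# All three counts should agree.
print(count_unique_pairs(num_points), comb(num_points, 2), enumerated)
```

This is also the length of the condensed distance vector that `pdist` returns for 200 rows, since it stores one distance per unordered pair.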
import numpy as np
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
%matplotlib inline
def pair_distances_randompoints(n, metric):
    """generates random n-dimensional vectors,
    plots histogram of distances between all the pairs of vectors"""
    data = np.random.rand(200, n)  # random points with co-ordinates in [0, 1] as row vectors
    pairwise = pdist(data, metric)  # pdist computes all-pairs distances
    plt.figure()
    plt.hist(pairwise, 50)
    plt.xlabel('Distance')
    plt.ylabel('Number of Pairs')
    plt.title('{0} distances in {1}-dim space. Mean={2:.2f}. Variance/Mean={3:.3f}'.format(
        metric,
        n,
        np.mean(pairwise),
        np.var(pairwise) / np.mean(pairwise)))
for n in [1, 2, 3, 4, 10, 10000]:
    pair_distances_randompoints(n, 'euclidean')
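The histograms can be complemented by a small numeric summary. The sketch below is an illustration rather than part of the original experiment: `distance_summary`, `num_points`, and `seed` are hypothetical names, and it reports std/mean (a dimensionless spread) instead of the variance/mean shown in the plot titles. It relies on the fact that for two independent points drawn uniformly from [0, 1]^n the expected squared Euclidean distance is n/6, so the mean distance should be close to sqrt(n/6) when n is large.

```python
import numpy as np
from scipy.spatial.distance import pdist


def distance_summary(n, num_points=200, metric='euclidean', seed=0):
    """Return (mean, std/mean) of all pairwise distances between
    num_points uniform random points in [0, 1]^n.
    Hypothetical helper; the seed is fixed only for reproducibility."""
    rng = np.random.default_rng(seed)
    data = rng.random((num_points, n))  # one point per row
    pairwise = pdist(data, metric)      # condensed vector of all-pairs distances
    mean = pairwise.mean()
    return mean, pairwise.std() / mean


# A smaller top dimension than in the plots keeps this quick to run.
for n in [1, 2, 3, 4, 10, 1000]:
    mean, spread = distance_summary(n)
    print('n={0:5d}  mean={1:.3f}  sqrt(n/6)={2:.3f}  std/mean={3:.3f}'.format(
        n, mean, np.sqrt(n / 6), spread))
```

If the distances concentrate as the histograms suggest, the std/mean column should shrink as n grows while the mean tracks sqrt(n/6).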