Sunday, December 4, 2016

Why High Dimensional Data Is Hard to Work With

High dimensional data is hard to work with. You see this statement all over the place in data analysis and machine learning. But why exactly is that the case? I'll try to explain two of my favorite reasons in this post.

Sampling

Random sampling breaks down in high dimensional spaces. The points are still drawn at random, but they stop looking random to anything that relies on distances, which is why the problem becomes apparent in algorithms like KNN. Take, for example, 100 points in an n dimensional "cube".

If I use the built-in np.random.random() function from numpy, I can easily obtain 100 random points from an n dimensional unit cube. Or can I? If I measure the distances (L2 norm, to be exact) between all pairs of these points, I get histograms as seen below.

As we see, the higher the dimension, the more nearly equal the distances between points become: the spread of the distances shrinks relative to their mean. In that sense the sampling loses its randomness, or, put differently, higher dimensions are harder to sample from.
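
The experiment above can be reproduced with a few lines of numpy. This is only a rough sketch, not the exact code behind the histograms: the function names pairwise_distances and distance_spread are made up for illustration, and it uses numpy's Generator API with a fixed seed (an assumption, chosen for reproducibility) where the text uses np.random.random(). Instead of plotting, it prints the mean and standard deviation of the pairwise distances so the shrinking relative spread can be read off directly.

```python
import numpy as np


def pairwise_distances(points):
    """Return the L2 distance for every distinct pair of rows in `points`."""
    # Broadcasting builds an (N, N, D) array of coordinate differences.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep only the upper triangle so each pair is counted once.
    i, j = np.triu_indices(len(points), k=1)
    return dists[i, j]


def distance_spread(n_dims, n_points=100, seed=0):
    """Sample points uniformly from the unit cube and summarise their distances.

    Returns (mean, standard deviation, standard deviation / mean).
    """
    rng = np.random.default_rng(seed)  # fixed seed is an assumption, for repeatability
    points = rng.random((n_points, n_dims))
    d = pairwise_distances(points)
    return d.mean(), d.std(), d.std() / d.mean()


if __name__ == "__main__":
    for n_dims in (2, 10, 100, 1000):
        mean, std, rel = distance_spread(n_dims)
        print(f"n={n_dims:5d}  mean={mean:7.3f}  std={std:6.3f}  std/mean={rel:6.3f}")
```

The ratio std/mean is the number to watch: it should fall steadily as n grows, which is the same effect as the histograms getting narrower relative to where they sit.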

Spiky Nature

Higher dimensions are spiky! This is bad for methods that depend on a smooth cost surface for the loss function. It is wonderfully explained in this post (http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres#HN2). I'd simply like to add a graph to it.
This goes to show that the spike-like nature of the surface of an n-sphere keeps increasing with the number of dimensions. Beyond 9 dimensions, the central sphere in that post's construction actually "sticks" out of the cube that encloses it.
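
The 9 dimension threshold can be checked with a little arithmetic. The sketch below assumes the construction from the linked post as I understand it: a box of side 4 centered at the origin, a unit sphere packed into each corner, and a central sphere that just touches all of them. Under that assumption the corner centers sit at (±1, ..., ±1), a distance of sqrt(n) from the origin, so the central sphere has radius sqrt(n) - 1. The function names are made up for illustration.

```python
import math

CORNER_RADIUS = 1.0  # radius of each sphere packed into a corner (assumed setup)
BOX_HALF_SIDE = 2.0  # the box spans [-2, 2] along every axis (assumed setup)


def central_sphere_radius(n_dims):
    """Radius of the origin-centered sphere that touches every corner sphere."""
    # Corner sphere centers are at (+-1, ..., +-1): distance sqrt(n) from the origin.
    return math.sqrt(n_dims) - CORNER_RADIUS


def pokes_out_of_box(n_dims):
    """True if the central sphere reaches past the faces of the enclosing box."""
    return central_sphere_radius(n_dims) > BOX_HALF_SIDE


if __name__ == "__main__":
    for n_dims in (2, 3, 4, 9, 10, 100):
        r = central_sphere_radius(n_dims)
        print(f"n={n_dims:3d}  radius={r:6.3f}  outside box: {pokes_out_of_box(n_dims)}")
```

At n = 4 the central sphere is already as large as the corner spheres, at n = 9 it touches the faces of the box, and from n = 10 onward it extends beyond them.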