Wednesday, December 28, 2016

Rain

As I walk out of my room, I am greeted by this winter rain. A soft spray of cold water caresses my jaw while I move along the balcony, as if to cajole me into joining it in it's emotion.

The wind whips up the beads of sweat born from seasons past into a frenzy so great they conceive ghosts of the land crashing against buildings as though to reclaim the earth which is now vertical against it's wishes. The forest in front of me is awash, blushing it's beauty with the rain washing away all that made it forget who it was.

A part of this torrent is caught in my neighbor's rooftop. The thousand droplet soldiers march to the steady drum of the drainpipe relieving itself on the tin sheet placed under it. The sky is painted blue with the deep color that only clouds with filled bellies can bring. A tinge of red peaks out behind the horizon as if to catch one last glimpse of this scene before going to bed.

Humans scurry about their shelters, filling up containers with what I have to spare. The smoke rises from their mounds and meets my gift halfway, mixing and mating with it until it falls back on land not gift and not smoke anymore. Humans. I've watched them since the beginning of time and as a people, they have watched me too. They have inspired in me emotion from time to time and so have I in them.

We are all prisoners, together in this walled reality of capability. What is, can. Thus the both of us trapped, beautiful things.

Thursday, December 8, 2016

Taking a Look at Competitive Programming

Introduction

Competitive programming is a lucrative business if you have the chops for it. It has however been subject to little analysis AFAIK. This has partially been because of the unavailability of consolidated data set to facilitate this. To resolve that I scraped together data from the Codechef website amounting to a little over 1 million programs submitted to the site. You can get the dataset on Kaggle (link to data).

The purpose of this work is to get started with the OpenAI special project which states
Build an agent to win online programming competitions. A program that can write other programs would be, for obvious reasons, very powerful.

To solve a problem, the first step is always to understand the problem and know it's quirks. This blog post is possibly the first of many exploring this dataset with the ultimate aim of being able to generate a solution program given the statement of the problem.

An Exploratory Data Analysis has been done by Swaraj Khadanga. You can see it at Simple EDA.

A brief description of the data

The dataset presents a table with columns as variables and rows as observations, making it tidy in the process. The columns we are interested in are:
  • QCode : The Question Code, denoting a unique question
  • Solution : The program written by the user
  • Status : Was it a correct attempt or not?
  • Language : The language that the code was written in.

Program Visualization

This visualization makes some assumptions, namely:
  • We consider only those programs which were written in C
  • We reduce each program to another string, keeping only keywords and special characters. The idea is to try and get an approximate representation of the Abstract Syntax Tree by obtaining an infix representation of the code.
  • With these program representations as representations of the programs, we calculate the Levenshtein distance between each pair of representations.
  • Using the above data we have a complete graph with each vertex representing a program and each edge being the Levenshtein distance between two programs.
  • Now, we place each vertex in a 2D plane at coordinates such that the difference between the distances between points on the plane and the Levenshtein distance between two points is minimal. This is similar to what has been done by Ben in his post on numerical optimization in the Last cities example.
We plot the points at the calculated coordinates and color them by the problem they were trying to solve. By doing this we arrive on a graph which is shown below.

All I can say at this point is that it's interesting. The X and Y represent nothing of significance. What is significant is the relative distances between the points. Most programs are of a similar structure with very few actually differing a lot from each other.
There will be more on this in later posts. The plot was created using this gist.

Sunday, December 4, 2016

Why High Dimensional data is hard to work with.

High Dimensional Data is hard to work with. This statement is seen all over the place in reference to data analysis and machine learning. But why exactly does that happen? I'll try to explain two of my favorite reasons here in this post.

Sampling

Random Sampling gives out in high dimensional spaces. It's not random any more. This problem becomes apparent in algorithms like KNN. Take, for example 100 points in an n dimensional "cube".

If I use the inbuilt np.random.random() function from numpy I can easily obtain 100 random points from any n dimensional space. Or can I? If I measure the distances (L2 norm to be exact) between all pairs of these points, I get histograms as seen below.

As we see, higher dimensions lead to increasingly the same distance between points. The sampling loses it's randomness, rather higher dimensions are a little harder to sample from.

Spiky Nature

Higher dimensions are spiky! This is bad for methods which depend on the smooth nature of the cost surface for a loss function. It is wonderfully explained in this post here (http://www.penzba.co.uk/cgi-bin/PvsNP.py?SpikeySpheres#HN2). I'd  simply like to add a graph to it.
This goes to show that the spike like nature of the surface of an n-sphere continuously increases as the number of dimensions increases. After 9 dimensions, the sphere actually "sticks" out of the cube that encloses it.