Recently, I’ve been researching with Edmar Candeia (LAPS) on the topics that surround Smart Grid and its inherent technical difficulties. As part of our work, we investigated whether using Random Projections helps solve the Big Data problem that comes with the Smart Grid. In fact, the problem is quite general, and the answer is yes: RP can help.
Another reason I’m particularly interested in researching this further is that I have a hunch it may help with large-scale computing for the web. I’ve already started a Ruby project (dieb/random-projections) with some working code. Read on.
The problem: insanely huge big data
In a typical smart grid setting, we expect millions of devices installed in people’s homes, each submitting data periodically to the power company. The problem is how insanely big this dataset can get, and how you could possibly run an algorithm over it to learn, for instance, the average power consumption. That sounds like a useful value for the company, right?
So, as one might think, this problem is pretty general:
How do you apply an algorithm or mathematical function to an insanely huge dataset, with the constraint that the computation doesn’t take a lifetime?
Suppose you implemented a way to compute the average consumption of a million homes with 10^-5 precision, but it takes 3 minutes to run. Do you really care about that absurd precision? If it isn’t a requirement of your system, wouldn’t it be more interesting to get 10^-2 precision from an algorithm that takes no more than a few seconds? This is where random projections come in handy:
Random projections (RP) is a technique that is particularly useful in algorithm design, especially in approximation algorithms. It lets you trade precision for speed, which basically means “run it faster, and it’s okay to have a few negligible errors”.
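To make the precision/speed trade-off concrete before we get to RP itself, here is a minimal Ruby sketch that estimates the mean from a random sample instead of scanning every reading. The dataset and names here are invented for illustration, not taken from the project:

```ruby
# Illustrative trade-off: estimate the mean from a random sample
# instead of touching all million readings.

def approximate_mean(readings, sample_size)
  sample = readings.sample(sample_size) # uniform random sample
  sample.sum.to_f / sample.size
end

# Fake consumption data standing in for a million smart meters
readings = Array.new(1_000_000) { rand(0.0..10.0) }

exact  = readings.sum / readings.size
approx = approximate_mean(readings, 10_000)
# approx lands close to exact with high probability,
# while looking at only 1% of the data
```

Scanning 10,000 readings instead of 1,000,000 costs a small, quantifiable error, which is exactly the kind of bargain RP formalizes.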
I won’t be spitting out all the math here, but anyone interested can refer to this particular paper. It provides all the formal foundation required to understand it.
In human-readable words, one may explain RP as follows:
First, randomly project the data into a lower-dimensional space; then run your algorithm in that space, taking advantage of the speedup that comes from working over fewer dimensions.
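The two steps above can be sketched in plain Ruby (no gems). This is a simplified illustration with made-up dimensions, using a Gaussian projection matrix; the point is that Euclidean distances survive the projection, so distance-based algorithms can run in the smaller space:

```ruby
# Sketch of random projection: map d-dimensional vectors down to
# k dimensions with a random Gaussian matrix, then measure distances
# in the projected space.

def gaussian
  # Box-Muller transform for a standard normal sample
  Math.sqrt(-2 * Math.log(1 - rand)) * Math.cos(2 * Math::PI * rand)
end

def random_matrix(k, d)
  # Entries N(0, 1/k), so squared norms are preserved in expectation
  Array.new(k) { Array.new(d) { gaussian / Math.sqrt(k) } }
end

def project(matrix, vector)
  matrix.map { |row| row.zip(vector).sum { |r, v| r * v } }
end

d, k = 1000, 100          # illustrative dimensions
r = random_matrix(k, d)
x = Array.new(d) { rand }
y = Array.new(d) { rand }

dist      = Math.sqrt(x.zip(y).sum { |a, b| (a - b)**2 })
proj_dist = Math.sqrt(project(r, x).zip(project(r, y)).sum { |a, b| (a - b)**2 })
# proj_dist approximates dist with high probability
# (Johnson-Lindenstrauss lemma)
```

Any algorithm that only cares about distances (clustering, nearest neighbors, and so on) can now run over 100 dimensions instead of 1000.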
The theory behind it is beautiful, but not the core content of this post. So, after getting all excited about this, I started experimenting with it.
Implementing random projections in Ruby
I decided to implement it in Ruby, since it is my current language of choice and I have a crazy plan to eventually cross RP with large-scale web computing. I started by creating a repository for the code at dieb/random-projections and have already implemented the first steps (generating a random matrix for projecting data). Next steps include writing the projection code and running a few experiments with real data.
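For the random matrix step, one well-known construction is Achlioptas’ sparse matrix, whose entries take only three values, making it cheap to generate and multiply. The sketch below is my own illustration of that construction; it is not the actual API of dieb/random-projections:

```ruby
# Achlioptas' sparse random projection matrix (k x d):
# each entry is +sqrt(3/k) with probability 1/6,
#               -sqrt(3/k) with probability 1/6,
#               0          with probability 2/3.
# The 1/sqrt(k) scaling keeps squared norms preserved in expectation.

def achlioptas_matrix(k, d)
  scale = Math.sqrt(3.0 / k)
  Array.new(k) do
    Array.new(d) do
      case rand(6)
      when 0 then  scale # probability 1/6
      when 1 then -scale # probability 1/6
      else 0.0           # probability 2/3
      end
    end
  end
end

m = achlioptas_matrix(50, 500)
# roughly two thirds of the entries are zero, so the projection
# multiplication can skip most of the work
```

The sparsity is the appeal here: about 2/3 of the multiplications vanish, and the remaining entries are just sign flips plus a single scale factor.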