Reducing data size to 1/10, just by reshaping!

Before reaching for big-data concepts and the burgeoning software around them, don’t forget to check whether your data frame or spreadsheet stores too many redundant records. Just by reshaping, I reduced my data size to 1/10!

Since I got my hands on global air pollution mapping, I have managed to gather a comprehensive set of station measurements, through collaborations and by digging into the open science community. My ambition was to make the mapping time-resolved, so I gathered hourly data covering a full year (8760 hours). I ended up with 6 datasets totalling 22 GB, and every one of them was shaped differently: some dumped into 365 spreadsheets with several air pollutants each, some stored in 13 spreadsheets sorted by space-time, some in wide tables, some in long ones, some with UTC time, some with local time. In short, all of them needed to be wrangled into a consistent structure for querying and analysis.

Though I focused on array data management during my Ph.D., I never thought point data would give me trouble in storage. I assumed it would be easy: I would just need a relational database or HDF5-based tools.

I also didn’t expect a meeting with my software-developer colleagues to be particularly useful, since I thought I knew a bit about HDF5 and relational databases; surely putting them to use couldn’t be hard.

But after a mere glance at the column names I provided, my colleague said: “There is a lot of replication in the data, isn’t there?”

I had the columns: time, longitude, latitude, sensor type, value, urban type, … The longitude, latitude, and so on are replicated over and over, once for every time step.

I had always regarded this as the format I am most familiar with for subsequent analysis; this was the first time I started to care how much space was being wasted.
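To make the replication concrete, here is a minimal pandas sketch; the station coordinates and values are illustrative, not from my actual datasets:

```python
import pandas as pd

# Two stations, three hourly time steps, in the long format I had.
df = pd.DataFrame({
    "time": pd.to_datetime(["2017-01-01 00:00"] * 2
                           + ["2017-01-01 01:00"] * 2
                           + ["2017-01-01 02:00"] * 2),
    "longitude":   [4.35, 5.10] * 3,    # repeated every hour
    "latitude":    [52.01, 52.37] * 3,  # repeated every hour
    "sensor_type": ["low-cost", "reference"] * 3,
    "value":       [12.1, 9.8, 13.0, 10.2, 11.7, 9.5],
})

# Over a full year, the static columns (longitude, latitude,
# sensor type, ...) are duplicated 8760 times per station.
print(df)
```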

I went back and built a relational database with two tables. One (the data table) holds

time, value, UID (a unique ID linking each record to its station)

The other (the attribute table) holds

UID, Longitude, Latitude, …
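In pandas terms, the split looks roughly like this; a sketch, assuming the long dataframe `df` from above (`attribute_table` and `data_table` mirror the two tables):

```python
# Columns that are static per station.
static_cols = ["longitude", "latitude", "sensor_type"]

# Attribute table: each station's attributes stored exactly once,
# with the row index turned into the UID column.
attribute_table = (df[static_cols]
                   .drop_duplicates()
                   .reset_index(drop=True)
                   .rename_axis("UID")
                   .reset_index())

# Data table: only time, value and the UID foreign key.
data_table = (df.merge(attribute_table, on=static_cols)
                [["time", "value", "UID"]])
```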

I first tested this on a small dataset with 70 stations; it reduced the file size from around 75 MB to 17 MB.

Now that I was more aware of storage, I grew uneasy about the leftover replication: the UID repeats across time steps, and the time repeats across UIDs.

I then rearranged the long table into a wide table, with one column per station. This brought the size down to almost a third of the data table’s. Compared with the original data, it shrank to 1/10 of the size.
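That last reshape is a single pivot; again a sketch, assuming the `data_table` from the previous step:

```python
# One row per hour, one column per station: each time stamp is now
# stored once per row, and each UID once as a column header,
# instead of once per record.
wide = data_table.pivot(index="time", columns="UID", values="value")
```

Note that `pivot` requires each (time, UID) pair to occur at most once; duplicate measurements would need `pivot_table` with an aggregation function instead.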

Before going for “sophisticated” big data solutions, think about “never repeating data”.
