On the way to big data concepts and their burgeoning software, don’t forget to check whether your data frame or spreadsheet stores too many redundant records. Just by reshaping, I reduced my data size to a tenth of the original!
Since I got my hands on global air pollution mapping, I have gathered a comprehensive set of station measurements through collaborations and by digging through the open science community. My ambition was to make it time-resolved, so I collected hourly data for a full year (8,760 hours). I ended up with 6 datasets: some dumped into 365 spreadsheets covering several air pollutants, with a total size of 22 GB; some stored in 13 spreadsheets sorted by space-time; some stored as wide tables, some as long; some using UTC time, some local time. In short, all of them needed to be wrangled into a consistent structure for querying and analysis.
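To give a taste of the wrangling involved, here is a minimal pandas sketch of two of those steps: melting a wide table into long format, and normalizing local timestamps to UTC. The file name, column names, and time zone are hypothetical placeholders, not the actual datasets.

```python
import pandas as pd

# Hypothetical wide spreadsheet: a "time" column plus one column per station.
wide = pd.read_csv("station_no2_wide.csv")

# Wide -> long: one row per (time, station) pair.
long = wide.melt(id_vars="time", var_name="station", value_name="value")

# Normalize a local-time column to UTC so all datasets share one clock.
# "Europe/Berlin" is an assumed local zone for illustration.
long["time"] = (pd.to_datetime(long["time"])
                  .dt.tz_localize("Europe/Berlin")
                  .dt.tz_convert("UTC"))
```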
Though I focused on array data management during my Ph.D., I never thought point data would give me trouble in storage. I assumed it would be easy: I would just need a relational database or HDF5-based tools.
I also never expected a meeting with my software developer colleagues to be useful, since I thought I knew a bit about HDF5 and relational databases, and simply putting them to use shouldn’t be hard.
But with a mere glance at the column names I provided, my colleague said, “There is a lot of replication in the data, isn’t there?”
My columns were: time, longitude, latitude, sensor type, value, urban type, … The longitude, latitude, and other station attributes are replicated over and over as time advances.
I had always treated this as the format I was most familiar with for subsequent analysis; this was the first time I started to care how much space was being wasted.
I went back and built a relational database with two tables (see the sketch below). One, the data table, holds:
time, value, UID (a unique ID linking to each set of coordinates)
The other, the attribute table, holds:
UID, longitude, latitude, …
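A minimal pandas sketch of this normalization, with hypothetical station values; the actual database construction would differ, but the split into the two tables is the same idea:

```python
import pandas as pd

# Hypothetical long table with station metadata repeated at every time step.
df = pd.DataFrame({
    "time":      ["2015-01-01 00:00", "2015-01-01 00:00",
                  "2015-01-01 01:00", "2015-01-01 01:00"],
    "longitude": [13.40, 8.68, 13.40, 8.68],
    "latitude":  [52.52, 50.11, 52.52, 50.11],
    "sensor":    ["reference", "low-cost", "reference", "low-cost"],
    "value":     [23.1, 17.4, 25.0, 18.2],
})

# Attribute table: one row per unique station, keyed by a surrogate UID.
attrs = (df[["longitude", "latitude", "sensor"]]
         .drop_duplicates()
         .reset_index(drop=True)
         .rename_axis("UID")
         .reset_index())

# Data table: keep only time, value, and the UID foreign key.
data = (df.merge(attrs, on=["longitude", "latitude", "sensor"])
          [["time", "value", "UID"]])
```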
I first tested this on a small dataset of 70 stations; it reduced the file size from around 75 MB to 17 MB.
Now that I was more storage-aware, the leftover replication made me uneasy: the UID repeats across time steps, and the time repeats across UIDs.
I then rearranged the long table into a wide table, with one column per station. This made it almost a third the size of the data table; compared with the original data, it shrank to 1/10 of the size.
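Continuing the two-table sketch above (names still hypothetical), this long-to-wide reshaping is a one-liner in pandas:

```python
# One row per time step, one column per station: time and UID each
# appear exactly once, removing the leftover repetition.
wide = data.pivot(index="time", columns="UID", values="value")
```

This works so well here because all stations share the same hourly time axis; with ragged time coverage, the wide table would instead fill up with missing values.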
Before going for “sophisticated” big data solutions, think about “never repeating data”.