Why data is hard
Imagine you are a small startup trying to carve out its place in the market.
You start building a product and realize you need data to fill those empty screens. You start searching through vendors and external services that sell it, and you find out that data, first of all, is f*cking expensive. But your company is growing, so it’s OK for the CFO to spend a few more bucks on it. After all, you will need it for potential new customers.
You’d think that with all the advancements in data storage, processing, and analytics, this would be a walk in the park by now (maybe with an infinite budget it could be). But let’s not fool ourselves here: data is as complex as it gets.
Noise is everywhere, and because of that, issues are hard to spot.
When you receive the data, you realize that the data corpus you bought is enormous. You can’t handle it manually.
You start writing a few scripts in that good ol’ language you learned on weekends, and by now you have a decent assessment of what you bought. Then you realize that after running the scripts, you have produced something new: more data!
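That first pass usually looks something like this; a minimal sketch, assuming the vendor dropped a CSV file, with vendor_dump.csv as a hypothetical name:

```python
# A first, naive profiling pass over the vendor file.
# "vendor_dump.csv" is a hypothetical file name, used only for illustration.
import pandas as pd

df = pd.read_csv("vendor_dump.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_rate": df.isna().mean(),   # share of missing values per column
    "unique_values": df.nunique(),   # cardinality per column
})

print(f"{len(df)} rows, {len(df.columns)} columns")
print(profile.sort_values("null_rate", ascending=False))

# The irony: the output of assessing your data is... more data.
profile.to_csv("vendor_dump_profile.csv")
```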
After a while of using it in your application, customers actually start to see your data as a reliable and unique source of information, until you run into a new little friend: data downtime, the data engineer’s boogeyman.
It’s around every corner now, and you may ask yourself, how do I work this?
So you start to realize that, by yourself, you cannot build something as reliable as what your customers want, and because of that, you start making new friends.
Data pipelines, ETL, warehouses, validation processes, staging environments, quality assurance, and the list goes on, and you may ask yourself: am I right? Am I wrong?
Anyway, after a couple of weeks, you realize that ensuring quality isn’t a one-time deal you can build and move on from; it’s a continuous process.
You need to keep asking yourself: is the data up to date? Complete? Are the null rates within an acceptable range? You start reading about data catalogs: more complexity, more people, and your mind is asking again: how much data could actually exist!?
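In practice, those questions turn into a check you run over and over. Here is a minimal sketch of what that recurring check might look like; the table columns and thresholds are hypothetical placeholders, not anyone’s standard:

```python
# A minimal sketch of a recurring data quality check.
# Column names and thresholds below are hypothetical, for illustration only.
from datetime import datetime, timedelta, timezone
import pandas as pd

MAX_STALENESS = timedelta(hours=24)   # "is the data up to date?"
MAX_NULL_RATE = 0.05                  # "are the null rates acceptable?"
REQUIRED_COLUMNS = {"customer_id", "amount", "updated_at"}

def check_quality(df: pd.DataFrame) -> list[str]:
    issues = []

    # Completeness: are the columns we rely on actually there?
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
        return issues

    # Freshness: has the source gone quiet?
    latest = pd.to_datetime(df["updated_at"], utc=True).max()
    if datetime.now(timezone.utc) - latest > MAX_STALENESS:
        issues.append(f"stale data: last update at {latest}")

    # Null rates per column.
    for column, rate in df.isna().mean().items():
        if rate > MAX_NULL_RATE:
            issues.append(f"{column}: null rate {rate:.1%} above threshold")

    return issues
```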
Even inside your company, the data keeps growing by the mere fact of being data, and now you realize that even the data about your data has data of its own.
This madness needs to stop, but it’s too late.
Now your organization is in the data race. Everyone, every morning, will experience some issue related to it: wrong dashboards, numbers that look OK when in fact they are not, lousy scraping, automatic reporting gone wrong, customers who want raw data instead of consuming it through the interface, emails sent on a daily basis to your third-party provider, and again, the list goes on and on.
Your software, which you once thought would be only that, software, starts to become a monster of data generation. From application logs with diagnostic information, to API responses showing how your third-party services interact, to event data from other services, each comes with its own set of problems and challenges you need to consider before they happen.
But always remember: “there are unknown unknowns.”
Just to give you an idea: last week, a slight change in a raw data file from an overseas vendor wrecked one of our internal dashboards. That is where I learned the hard way that quality assurance on data is a continuous (and costly) battle everyone needs to fight if data is really supposed to produce value in your organization.
Validity, accuracy, timeliness, completeness, correctness, schemas, etc.
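To make that anecdote concrete, here is a hedged sketch of the kind of schema guard that might have caught such a vendor change before it reached the dashboard; the file name, expected columns, and dtypes are all hypothetical:

```python
# A sketch of a schema guard for an incoming vendor file.
# "vendor_feed.csv" and the expected schema are hypothetical examples.
import pandas as pd

EXPECTED_SCHEMA = {
    "order_id": "int64",
    "country": "object",
    "amount": "float64",
}

def validate_schema(path: str) -> list[str]:
    df = pd.read_csv(path)
    problems = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            problems.append(
                f"{column}: expected {expected_dtype}, got {df[column].dtype}"
            )
    return problems

if __name__ == "__main__":
    issues = validate_schema("vendor_feed.csv")
    if issues:
        # Fail loudly before the broken file reaches the dashboards.
        raise SystemExit("schema check failed: " + "; ".join(issues))
```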
You wake up one day, after a couple of months of wrestling with what at first started as a simple and naïve script, and you realize you are not only solving a technical problem; you are building the foundations of a data-driven organization.
And that, my friend, is why data is hard.