I am pretty sure that somewhere on your data journey you came across courses, videos, articles, maybe use cases where someone takes some data, builds a classification/regression model, shows you great results, you learn how that model works and why it works that way and not another, and everything seems to be fine. You think you just learned a new thing (and you did), you are happy about it (yes, you are! I am not kidding around here, you’re doing great!) and you move on to the next piece of content.
But later on you start to ask additional questions (everyone’s “later on” has a different length), like: where did that data come from? If I have more data, will the model run as smoothly as it did during the demonstration? Does data in the real world exist in that format? Can I get similar data, and if I can, will it be as easy to process? What did the results of that model actually mean? Can I present the data in a prettier way? And so on and so on and so on.
When I started to learn about data analytics, data science and the world of data in general, I was always amazed by the results people would get after processing some piece of data, after running a machine learning model, after extracting keys from word buckets, etc. But every time I tried to do something on my own, a new obstacle would appear: there was too much data to analyze, or not enough; one model would run with one piece of data but not with another, etc., etc.
After running into all these difficulties and learning to deal with them the hard way, I would like to share the essential 5 Vs of data that you have to take care of before you start your data project/solution.
1st V – Volume
When we talk about “volume” in regard to data, we have to be aware of the amount of data the project has to handle: should we use several servers and distribute the load between them, or is our own computer with its own hard disk quite enough to solve the problem?
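That “one machine or several” question can start as a rough back-of-the-envelope check before any architecture decision. Here is a minimal sketch; all the numbers and the 3x rule of thumb are hypothetical assumptions, not figures from any real project.

```python
# Minimal sketch: a rough check of whether a dataset fits comfortably
# on one machine before deciding to distribute the load.
# All numbers below are hypothetical examples.

dataset_size_gb = 12    # total size of the data to process
available_ram_gb = 16   # RAM on the single machine we have
working_factor = 3      # assumed rule of thumb: processing may need ~3x the raw size

if dataset_size_gb * working_factor <= available_ram_gb:
    print("a single machine should be fine")
else:
    print("consider chunked processing or distributing the load")
```

With these example numbers, 12 GB of data with a 3x working overhead exceeds 16 GB of RAM, so the sketch suggests chunking or distributing.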
2nd V – Velocity
Velocity is the speed with which data travels through our model/project/solution: the speed with which it is ingested, processed and delivered to the end client. We have to know whether this is real-time data, near real-time data, or just historic data that is not going anywhere soon, so we can work through it slowly and efficiently 😉
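One practical way to make the velocity question concrete is to time the pipeline against the freshness budget the users agreed to. A minimal sketch, where `run_pipeline` and the 30-minute budget are hypothetical stand-ins:

```python
# Minimal sketch: time one pipeline run and compare it to the agreed
# freshness budget. The pipeline and the budget here are hypothetical.
import time

MAX_DELAY_SECONDS = 30 * 60  # e.g. "users want data at most 30 minutes old"

def run_pipeline():
    time.sleep(0.1)  # stand-in for ingest -> process -> deliver

start = time.monotonic()
run_pipeline()
elapsed = time.monotonic() - start
print(f"pipeline took {elapsed:.1f}s, budget is {MAX_DELAY_SECONDS}s")
```

If `elapsed` ever creeps toward the budget, that is the moment to revisit the velocity decisions made at the start of the project.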
3rd V – Variety
Data comes from various sources and in various types: structured, semi-structured and not structured at all (officially, unstructured XD), and boy, I’ve been burned by this a lot. My pipeline would expect one data type (because I tested it with a sample and it worked) and then throw an error, because the data contained an additional type or structure my solution did not yet support. These things have to be defined in the beginning: you have to know the levels of variety in the data you are working with.
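One cheap defense against that kind of surprise is to validate records against the schema the pipeline expects before processing, instead of failing mid-run. A minimal sketch; the schema and the sample records are invented for illustration:

```python
# Minimal sketch: check each record against the fields and types the
# pipeline expects, and collect problems instead of crashing mid-run.
# The schema and records below are hypothetical examples.

EXPECTED_SCHEMA = {"id": int, "name": str, "salary": float}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is OK."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

records = [
    {"id": 1, "name": "Ann", "salary": 50000.0},  # well-formed
    {"id": "2", "name": "Bob"},                   # wrong type + missing field
]

for r in records:
    print(validate_record(r) or "OK")
```

The point is not this particular check, but deciding up front what variety the pipeline must tolerate and making the unexpected cases visible.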
4th V – Veracity
Is the data I am working with worth trusting? Is it trustworthy? Is it still correct after all the manipulations and cleanings? Was the transformation pipeline correct? These are the questions we ask when we talk about the veracity of data. We can collect all the data we need, and that won’t be that difficult; but will it be accurate and consistent, and can we be sure it wasn’t falsely altered? That’s another challenge. We are all aware that in order to get insights from data we have to perform a bit of preprocessing, and we have to make sure that process does not skew the data.
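A simple way to guard against a preprocessing step skewing the data is to compare a few invariants before and after it: row counts, totals, and the absence of values you meant to remove. A minimal sketch, with a made-up cleaning step:

```python
# Minimal sketch: sanity checks comparing data before and after a
# cleaning step, to catch transformations that silently alter values.
# The data and the "cleaning" step here are hypothetical examples.

raw = [120, 95, None, 110, 87, None, 102]

# Hypothetical cleaning: drop missing values.
cleaned = [x for x in raw if x is not None]

# Veracity checks: did we remove only what we meant to remove?
assert len(cleaned) == len(raw) - raw.count(None), "unexpected row loss"
assert all(x is not None for x in cleaned), "nulls survived cleaning"
assert sum(cleaned) == sum(x for x in raw if x is not None), "values were altered"

print(f"{len(raw)} rows in, {len(cleaned)} rows out, totals preserved")
```

A few assertions like these after every transformation step cost almost nothing and catch the silent kind of data corruption early.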
5th V – Value
And the last V goes to value, because at the end of the day, the whole point of all this is to get value from data. That includes creating reports and dashboards, finding useful insights that can improve the business, and highlighting critical areas to make more informed decisions.
You may object that these are the 5 Vs of big data, and you will be right. Yes, they are the 5 Vs of big data, but not only. Any data project has to deal with these 5 Vs; a big data project will just have a more complicated time handling them, while a small data project will find all 5 Vs easier to manage.
For example, I was working on a data solution for the HR department, and in the beginning we had to address the 5 Vs of our data. Even though we didn’t have terabytes of data, we had a lot of small Excel files where the data had previously been stored and distributed (volume). There were 3 different sources to collect data from: Excel files, the corporate DB and the corporate CRM (variety). The data would be updated on a daily basis, and users wanted it as fresh as possible, with a maximum delay of 30 minutes: not even close to real-time, but we still had to make sure the pipeline executed fast enough (velocity). Data coming from the Excel files would always be altered by a human at some point in time, and there was always a dispute about which update takes precedence, so we had to deal with that too (veracity). And in order to get value from the data, we had to find a way to visualize it and give the end users the possibility to explore it and draw their own conclusions (value).
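The multi-source part of such a setup, several sources with different field names feeding one pipeline, can be sketched as a per-source mapping into a single canonical schema. Everything below is invented for illustration: the source functions, field names, and mappings are not from the real HR project.

```python
# Minimal sketch: pull records from several hypothetical sources and
# normalize them into one common shape before loading.
# All source functions and field names are made up for illustration.

def from_excel():
    # In a real project this would be e.g. pandas.read_excel(...)
    return [{"Employee Name": "Ann", "Dept.": "HR"}]

def from_corporate_db():
    return [{"name": "Bob", "department": "IT"}]

def from_crm():
    return [{"full_name": "Eve", "dept_code": "FIN"}]

def normalize(record: dict, mapping: dict) -> dict:
    """Rename source fields into the canonical schema {"name", "department"}."""
    return {canonical: record[source] for canonical, source in mapping.items()}

SOURCES = [
    (from_excel, {"name": "Employee Name", "department": "Dept."}),
    (from_corporate_db, {"name": "name", "department": "department"}),
    (from_crm, {"full_name" and "name": "full_name", "department": "dept_code"}),
]

# Oops, fix the CRM mapping to plain keys:
SOURCES[2] = (from_crm, {"name": "full_name", "department": "dept_code"})

unified = [
    normalize(record, mapping)
    for fetch, mapping in SOURCES
    for record in fetch()
]
print(unified)
```

Keeping the mappings as plain data (rather than per-source code paths) makes adding a fourth source a one-line change.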
We invested our time at the beginning to find a solution for every V of our data, and having done that, we were able to finish our project just in time, even with lovely documentation.
So even if you are just going to process the Titanic dataset, think of the 5 Vs. It will take you 2 minutes, but you will be ready for the unpredictable, even though you already know who’s gonna die there XD.