What is Big Data?
Big Data is a term that refers to a quantity of data that traditional software cannot capture, manage, and process in a reasonable time. The volume of Big Data grows constantly thanks to the falling prices of storage and processing devices, and this has driven large advances in the digitalization of businesses and of society in general.
In order to understand the term better, we can think of Big Data in terms of its “5 Vs”: Volume, Velocity, Variety, Veracity and Value.
Volume

The quality-control and monitoring processes that a traditional Data Warehouse would carry out are no longer sufficient. We need new quality metrics: stop working with absolute values and instead work with approximations and confidence intervals.
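The shift from absolute values to approximations can be sketched in Python. This is a minimal illustration, not a production metric: the column contents, the missing-value rate, and the sample size are all invented for the example, and the 95% interval uses a simple normal approximation.

```python
import math
import random

def null_rate_with_ci(sample, z=1.96):
    """Estimate the share of missing values in a sample, with a
    normal-approximation confidence interval (z=1.96 ~ 95%)."""
    n = len(sample)
    p = sum(1 for v in sample if v is None) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical column with roughly 10% missing values.
random.seed(42)
column = [None if random.random() < 0.1 else 1 for _ in range(100_000)]

# Instead of scanning the whole column, estimate on a 1,000-row sample.
rate, low, high = null_rate_with_ci(random.sample(column, 1_000))
print(f"null rate ~ {rate:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The point of the interval is that a monitoring rule can fire on "the null rate is above 5% with 95% confidence" rather than on an exact count that would require reading every record.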
Velocity

Velocity doesn’t just refer to the speed at which data is generated, but also to new platforms that need information in real time, or almost immediately. If the ways of filtering data are not appropriate (that’s to say, the processes of verifying its quality), then this data loses its value for the business.
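One way to keep quality checks from adding latency is to validate records in-flight as a generator, dropping bad records as they stream past. This is only a sketch; the record shape and the required field names (`id`, `timestamp`) are assumptions for the example.

```python
def validate_stream(records, required_fields=("id", "timestamp")):
    """Yield only the records that pass basic quality checks,
    so downstream consumers see clean data with minimal delay.

    `required_fields` is a hypothetical schema for this example."""
    for rec in records:
        if all(rec.get(f) is not None for f in required_fields):
            yield rec

raw = [
    {"id": 1, "timestamp": "2023-05-01T14:30:00"},
    {"id": None, "timestamp": "2023-05-01T14:30:05"},  # missing id: dropped
    {"timestamp": "2023-05-01T14:30:09"},              # no id field: dropped
]
clean = list(validate_stream(raw))
print(clean)  # only the first record survives
```

Because the filter is lazy, it composes with any streaming source without buffering the whole dataset first.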
Occasionally, instead of working with a complete dataset, one might resort to working with a sample and increase speed that way. Clearly, this gain comes at the cost of a less faithful picture of the data.
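Sampling a stream whose total size is unknown is a classic problem; reservoir sampling (Algorithm R) keeps a uniform sample of fixed size in bounded memory. A minimal sketch, with a fixed seed so the example is reproducible:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from a stream of unknown length
    (Algorithm R), holding at most k items in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))  # 100
```

Any quality metric computed on `sample` is then only an estimate of the true value, which is exactly why the confidence-interval approach above matters.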
Variety

Information comes from various sources and can have varying levels of structure, so it is impossible to apply the same quality metrics to every source. For example, we can have data from:
- SQL or NoSQL databases, our own or those of third parties.
- The company’s CRM data.
- Social media.
- Invoicing systems.
- Business transaction reports.
This large variety frequently translates into semantic differences (fields with the same name but different meanings in each department) or syntactic inconsistencies (such as timestamps that are useless because they carry no timezone information). The first issue can be considerably reduced if each source provides sufficient metadata. The second must wait for the next phase, Data Engineering, where we select the fields that are useful for predictions and discard those that aren’t, such as fields with random values or fields that merely depend on others.
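The timezone problem above has a standard remedy: attach the zone the timestamp was recorded in and normalize everything to UTC before comparing sources. A small sketch with Python's standard `datetime` module; the timestamp string and the +02:00 source offset are assumptions for the example.

```python
from datetime import datetime, timezone, timedelta

# A naive timestamp from one source carries no timezone, so it is
# ambiguous on its own.
naive = datetime.fromisoformat("2023-05-01 14:30:00")

# Attach the offset the source is known (or assumed) to use, then
# normalize to UTC so timestamps from different systems are comparable.
source_tz = timezone(timedelta(hours=2))  # assumed source offset
aware_utc = naive.replace(tzinfo=source_tz).astimezone(timezone.utc)
print(aware_utc.isoformat())  # 2023-05-01T12:30:00+00:00
```

Documenting the source offset is exactly the kind of metadata that, as noted above, prevents these syntactic inconsistencies from propagating downstream.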
Veracity

Veracity refers to possible biases in the information, noise, and abnormal data. Beyond a possible lack of precision, data can be inconsistent or unreliable (with regard to its origin, the way it was obtained and processed, and the security infrastructure behind it).
The cause of this problem is that data providers and data users normally belong to different organizations, with different objectives and operational processes. Often, data providers don’t even know what their clients use the data for. This disconnect between the sources of information and its end users is the main cause of data quality problems (from a veracity perspective).
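A first screen for the noise and abnormal values mentioned above is an interquartile-range rule. The sketch below uses a deliberately crude quartile computation (index-based, no interpolation) to keep the idea visible; the sample values are invented.

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], a common
    rule of thumb for spotting abnormal data points.

    Quartiles are taken by index position: a rough approximation,
    fine for an illustration."""
    xs = sorted(values)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))  # [95]
```

A rule like this flags candidates for review; whether a flagged value is noise or a genuine extreme still depends on knowing the source, which is precisely what the provider/user disconnect makes difficult.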
Value

Data “value” is much more tangible. Businesses use data for distinct goals, and whether or not they achieve those goals lets us “measure” the quality of their data and define a strategy to improve it.