Big Data: How much is too much?

It is possible to have too much of a good thing, and big data is no exception. A project can easily accumulate more data than it can use, for several reasons, so it is important to make sure that data is properly curated.

Avoiding data overload

If you have too much data, it can be difficult to locate the information you are looking for. If a company wants a specific insight about a sample of its customer base, taking in a lot of unrelated data increases the risk of gathering irrelevant information. You may then be unable to find the data you want, or the results may be skewed in a counterproductive way.

Too much data will also slow everything down, because a computer system has finite memory, storage and processing capacity. The bigger the data sets become, the longer they take to process.

With so many different projects under way, data engineers must make sure that the data collected is restricted to the specific sets they want to study, and that any extraneous data is excluded.
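As a rough illustration of restricting a data set to what a study needs, the sketch below keeps only the rows and fields relevant to one question. The records, the field names and the `curate` helper are all hypothetical, invented for this example rather than taken from any particular system.

```python
# Hypothetical raw records; every field name here is an assumption for illustration.
RAW_RECORDS = [
    {"customer_id": 1, "region": "north", "monthly_spend": 120.0, "browser": "firefox"},
    {"customer_id": 2, "region": "south", "monthly_spend": 80.0, "browser": "chrome"},
    {"customer_id": 3, "region": "north", "monthly_spend": 45.5, "browser": "safari"},
]


def curate(records, wanted_fields, keep_row):
    """Keep only rows that pass keep_row, reduced to the wanted fields."""
    return [
        {field: record[field] for field in wanted_fields}
        for record in records
        if keep_row(record)
    ]


# Question under study (assumed): how much do customers in the north region spend?
study_set = curate(
    RAW_RECORDS,
    wanted_fields=("customer_id", "monthly_spend"),
    keep_row=lambda record: record["region"] == "north",
)

for row in study_set:
    print(row)
```

Fields such as `browser` never reach the study set, so they cannot slow the analysis down or distract from the question being asked.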

To help solve this, companies are turning more and more to machine learning, which can bring issues of its own. "Overfitting", for example, can occur when a complex model matches its training data so closely, noise included, that it no longer generalizes to new data. This is not a case of too much data; rather, the model has too many parameters or input features relative to the data it learns from.
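Overfitting is easier to see with numbers. The sketch below is a toy example under stated assumptions: the data follows an invented relationship (y = 2x plus random noise), and a "memorizer" that simply repeats the nearest training point stands in for an overly complex model. It is compared with a plain least-squares line on both the training data and fresh data.

```python
import random


def make_data(n, rng):
    """Noisy samples of the assumed relationship y = 2x."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]


def fit_line(points):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(points):
    """Predict the y of the nearest training x, reproducing the noise as well."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(predict, points):
    """Mean squared error of a prediction function on a set of points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
train, test = make_data(30, rng), make_data(30, rng)
line, memorizer = fit_line(train), fit_memorizer(train)

for name, model in (("line", line), ("memorizer", memorizer)):
    print(f"{name}: train MSE={mse(model, train):.2f}, test MSE={mse(model, test):.2f}")
```

The memorizer scores a perfect zero error on the data it was trained on, yet its error on the fresh data is greater than zero. That gap between training and new-data performance is the usual warning sign of overfitting.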

Data cannot simply be collected and placed in a giant store in the hope that some insights will emerge. In the more sophisticated systems, effort has to go into controlling, managing and curating this data so that the information produced is both usable and accurate. Big data has been of major benefit to companies across multiple sectors, but it can become a huge challenge if the organization does not access and use the correct data.

