The information age has brought with it unprecedented amounts of data on nearly everything, but is it all worth keeping?
Over the last 20 years, data collection has kept pace with hard drive capacity. The more storage companies and individuals have had, the more information they have stored. The problem with this approach is that the bulk of stored data is generally not indexed or effectively managed. For some companies this can drain significant resources. While the technical reasons for this are different from company to company, the end result is excessive and unnecessary baggage.
In layman’s terms, data management is not unlike packing for a vacation. Practically speaking, a beach vacation doesn’t require a parka any more than a ski trip necessitates a bikini, and few people would pack both. In the data industry, however, hardware has made storage so cheap and portable that selecting clothes for the trip becomes beside the point: you would simply pack everything and have access not just to every item in your wardrobe, but to every item you had ever owned or worn.
While proper data management would allow items to be selected by category or tag, not all data is indexed. In fact, more than 80% of the data in the world is unstructured, yet companies continue to collect everything, despite lacking the resources to process it. The mentality tends to be that one should save everything, just in case anything is needed. If you’re picturing someone who hoards absolutely everything around the home, you’ve got a pretty good idea of how most data management operates.
This has given rise to two types of data management that are slowly but surely reshaping the industry.
The first is quality management, in which a company uses data quality software to make sure it is keeping track of relevant, useful, and accurate data. Anything that does not meet the quality criteria is not stored or indexed. In our vacation example, this would mean leaving behind anything that didn’t fit, was no longer owned, or wasn’t suitable for the trip. It streamlines data management by filtering out irrelevant information, saving time and money.
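The quality-gate idea is simple enough to sketch in a few lines of code. The example below is illustrative only, not taken from any particular data quality product, and the criteria (a non-empty name and a valid-looking email address) are hypothetical stand-ins for whatever rules a company would actually define:

```python
import re

# Hypothetical quality criterion: a record needs a non-empty name
# and a plausibly formatted email address to be worth keeping.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def passes_quality(record):
    """Return True only if the record meets every quality criterion."""
    has_name = bool(record.get("name", "").strip())
    has_email = bool(EMAIL_RE.match(record.get("email", "")))
    return has_name and has_email

incoming = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "", "email": "no-name@example.com"},   # fails: missing name
    {"name": "Bob", "email": "not-an-email"},       # fails: invalid email
]

# Only records that pass the gate are stored and indexed at all;
# everything else is dropped before it can become baggage.
stored = [r for r in incoming if passes_quality(r)]
```

The point is where the filter sits: bad records are rejected before storage, so they never consume indexing or management resources downstream.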
The second method involves what is known as large-scale data management. Hadoop is an open-source front-runner in this field. In simple terms, it is a groundbreaking technology that allows petabytes of data to be quickly and efficiently organized, whether that data is indexed or not. For companies with large amounts of mixed data, such as Facebook and Google, it is an excellent tool, but primarily because of the large quantities of unstructured data inherent in their business models. For smaller, more focused business functions, it is overkill.
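At the heart of Hadoop is the MapReduce pattern: a map step that turns raw records into key-value pairs, and a reduce step that aggregates those pairs by key. Hadoop runs this across a cluster (typically via its Java API or Hadoop Streaming); the single-machine Python sketch below only illustrates the pattern itself, with illustrative function names, using the classic word-count task:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one (key, value) pair per word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: group pairs by key and sum their values.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

docs = ["data quality matters", "data volume matters"]
all_pairs = [pair for d in docs for pair in map_phase(d)]
word_counts = reduce_phase(all_pairs)
# word_counts["data"] == 2, word_counts["matters"] == 2
```

Because both phases work on independent chunks, a framework like Hadoop can spread them over thousands of machines, which is why it copes with unindexed data at petabyte scale.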
That is because, despite the power of Hadoop, storing and indexing everything still requires data management at some point. If data is not relevant or useful to an organization, then manipulating it at all is a waste of resources, no matter how fast or effectively it is indexed. Companies are beginning to recognize this and focus more on data quality and accuracy, as opposed to quantity. This is eliminating a great deal of the data junk in the world, but there is still a long way to go.