There seems to be a lot of confusion about the role that Master Data Management (MDM) plays in big data. Much of this confusion stems from a lack of understanding of exactly what big data is and how it can be leveraged. The definition of big data has expanded over the years as more and more vendors and companies attempt to capitalize on the buzz. For purposes of this discussion, we will focus on the role of Hadoop in handling big data. Hadoop is a relatively new open source platform designed to let users handle ever-increasing volumes of data quickly and efficiently, where any piece of data within a dataset can be used in a query. Hadoop systems are highly scalable and, for large corporate projects, relatively cost-efficient.
How does it work?
Hadoop breaks large data sets into small “chunks” and spreads these pieces across the servers in a Hadoop cluster. Each chunk is of a fixed, configurable size and contains some subset of the original file’s records. The storage layer that manages these chunks is called the Hadoop Distributed File System (HDFS).
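The chunk-and-distribute idea can be illustrated with a minimal Python sketch. The block size, node names, and round-robin placement here are purely illustrative assumptions, not real HDFS defaults (HDFS uses much larger blocks and replicates each one across several nodes):

```python
# Minimal sketch of HDFS-style chunking: split a file's bytes into
# fixed-size blocks and assign each block to a node round-robin.
# Block size and node names are illustrative, not real HDFS behavior.

def chunk_and_place(data: bytes, block_size: int, nodes: list):
    """Split data into block_size chunks and map each chunk to a node."""
    placement = []
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        node = nodes[(i // block_size) % len(nodes)]  # round-robin placement
        placement.append((node, chunk))
    return placement

records = b"rec1\nrec2\nrec3\nrec4\nrec5\n"
for node, chunk in chunk_and_place(records, block_size=10, nodes=["node-a", "node-b"]):
    print(node, chunk)
```

Because each node holds only its own chunks, later processing can run on each server against local data rather than pulling the whole file across the network.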
The magic of Hadoop occurs when you run a MapReduce job. This process essentially runs your query against every chunk of data across all the nodes in the HDFS cluster, then performs a reduce step that combines those partial results into a single output data set. Because no schema is applied to the files before they are added to HDFS, loading is very quick. While it is possible to run any query against this data – you can find many examples on the Internet that show how to write MapReduce functions against large data sets and return results – you will have to set the bar a little higher if you wish for those results to be actionable. For example, while it may be nice to figure out which customers disliked an Italian restaurant on Yelp, it is far more useful to find out which of your known customers have disliked a restaurant on Yelp in the last three weeks.
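To make the map/shuffle/reduce flow concrete, here is a toy job simulated locally in plain Python, in the spirit of Hadoop Streaming (a mapper that emits key/value pairs and a reducer that aggregates per key). The review data and field layout are invented for illustration:

```python
# Toy MapReduce flow simulated locally: mapper emits (key, value) pairs,
# a "shuffle" groups them by key, and the reducer combines each group.
# The review records below are invented sample data.
from collections import defaultdict

reviews = [
    "cust42,restaurant-7,dislike",
    "cust17,restaurant-7,like",
    "cust42,restaurant-9,dislike",
]

def mapper(line):
    customer, venue, sentiment = line.split(",")
    if sentiment == "dislike":
        yield customer, 1          # emit one count per dislike

def reducer(key, values):
    return key, sum(values)        # combine all values for a single key

# Shuffle phase: group mapper output by key, as the framework would
# do across the cluster before invoking the reducers.
grouped = defaultdict(list)
for line in reviews:
    for key, value in mapper(line):
        grouped[key].append(value)

results = dict(reducer(k, v) for k, v in grouped.items())
print(results)  # {'cust42': 2}
```

On a real cluster the mapper runs in parallel on each node's local chunks and the framework handles the shuffle; the programming model, however, is exactly this pair of functions.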
Getting to actionable information
Many of the data sets used in big data processes come from either the Internet or sensor networks. With either source, the data arrives keyed to identifiers such as sensor IDs or social media IDs. Companies need to map these to their current customer or organization identifiers to leverage the value of any insights gained from the data. This can be done by matching results against your MDM data set using a solution such as Master Data Maestro, which allows you to deliver actionable information to the correct manager at the correct time. This approach produces more useful information for the organization and leads to better management decisions. It also ensures that results from the HDFS cluster have been vetted against real-world knowledge.
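The ID-mapping step might look like the following sketch. The cross-reference table, its key format, and all IDs are hypothetical; in practice this lookup would come from your MDM hub's cross-reference data:

```python
# Sketch of resolving external identifiers (social-media handles,
# sensor IDs) to mastered customer IDs before acting on big-data output.
# The xref table, ID formats, and values are all hypothetical.

xref = {                        # MDM cross-reference: external ID -> master ID
    "yelp:foodie_jane": "CUST-0042",
    "sensor:TH-9911":   "CUST-0042",
    "yelp:bob_eats":    "CUST-0107",
}

raw_hits = ["yelp:foodie_jane", "yelp:unknown_user", "sensor:TH-9911"]

# Keep only hits we can tie to a known customer; unmatched IDs are noise
# until stewardship links them to a master record.
actionable = {xref[h] for h in raw_hits if h in xref}
print(actionable)  # {'CUST-0042'}
```

Note that two different external IDs (a Yelp handle and a sensor) resolve to the same mastered customer; that consolidation is precisely what the MDM system contributes.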
Some data quality and profiling tools work on top of the big data set to deliver cleansed identifiers from the data. While this can provide initial data to load into your MDM system, it is not an effective way to create actionable results. Rather than attempting to remove duplicates from the HDFS system in this way, master data should be applied at the MapReduce level, where the entire mastered Customer or Product list can be used to govern data flowing in and out of the system. The freedom that a Hadoop implementation provides must be checked against the known lists of the business to ensure that query results are meaningful and actionable.
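Applying master data at the map stage could be sketched as a map-side filter: the mapper checks each raw record against the mastered customer list, so only records tied to known customers ever reach the reducers. The customer IDs and record layout below are invented; on a real cluster the mastered list would typically be shipped to each node (e.g., via Hadoop's distributed cache) rather than hard-coded:

```python
# Sketch of applying master data inside the map stage (a map-side filter):
# raw records are checked against the mastered customer list, so unmatched
# records are dropped before the shuffle. IDs and layout are invented.

MASTERED_CUSTOMERS = {"CUST-0042", "CUST-0107"}   # stand-in for a distributed-cache lookup

def mapper(line):
    customer_id, payload = line.split("\t", 1)
    if customer_id in MASTERED_CUSTOMERS:          # drop unmatched records early
        yield customer_id, payload

raw = [
    "CUST-0042\tdisliked restaurant-7",
    "CUST-9999\tdisliked restaurant-3",   # not in the master list: filtered out
]
print([kv for line in raw for kv in mapper(line)])
```

Filtering in the mapper keeps the vetting close to the data and cheap, since unmatched records never cross the network to a reducer.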