Thanks for your emails on Part 1. One of them asked me: is data profiling art or science?
Historically, data has been treated as a byproduct of applications and has not received the attention it deserves. As the late Rodney Dangerfield would say, "It gets no respect." In the pursuit of the next killer app, "cool" technologies and forever-reinvented three-letter acronyms, data has not gone to the All-Star game.
Until now.
Smart executives are realizing that decision rights and information flow are powerful drivers, critical to the successful execution of any strategy. Non-standard data is one of the major barriers to the free flow of information within and across organizations. Profiling current data is thus an essential first step in the discovery that leads to data standardization.
A key challenge of profiling is identifying which data to profile and how much of it. I have heard the war cries: "Let us profile everything", "We have tools that can do 10 levels of profiling", "Our technologies can profile millions of rows in seconds". Instead, I suggest taking some time to establish which data sources to profile and to categorize the data elements (attributes) targeted for profiling into primary and secondary sets. This will also shed some light on your data ownership and stewardship.
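To make the primary/secondary split concrete, here is a minimal sketch of a profiling scope written down as data. Everything in it is an assumption for illustration: the Attribute class, the attribute names, the steward names and the two helper functions are hypothetical, not part of any tool or matrix mentioned in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribute:
    """One data element targeted for profiling (hypothetical structure)."""
    name: str
    domain: str                    # e.g. "Customer" or "Product"
    tier: str                      # "primary" or "secondary"
    steward: Optional[str] = None  # who answers for this attribute, if known


def by_tier(attributes, tier):
    """Names of the attributes in the given tier, in their listed order."""
    return [a.name for a in attributes if a.tier == tier]


def without_steward(attributes):
    """Names of the attributes nobody has been named as steward for."""
    return [a.name for a in attributes if a.steward is None]


# Illustrative scope for a single source; all values are made up.
scope = [
    Attribute("customer_name", "Customer", "primary", steward="Sales Ops"),
    Attribute("postal_code", "Customer", "primary"),
    Attribute("fax_number", "Customer", "secondary"),
    Attribute("product_sku", "Product", "primary", steward="Merchandising"),
]

primary = by_tier(scope, "primary")
unowned = without_steward(scope)
```

Even a list this small shows the side benefit: the attributes that come back from without_steward are exactly where the ownership conversation needs to start.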
The data sources you are profiling contain one or more data domains of interest, e.g. Customer or Product. Each domain has an optimal number of primary and secondary attributes that provide insight into your data, depending on your industry. Contact me for this matrix, which includes helpful guidelines on the volume of data you must profile to get a realistic picture. This can save your project(s) a great deal of time and effort.
Parsing your current data sources will be a fun challenge. If you do not leverage parsing algorithms from major vendors, or if your disparate data sources prompt you to build your own parsing framework, you have a challenge on your hands and also an incredible opportunity to learn about the building blocks of your data segments.
Profiling should be approached in two phases: surface profiling for leading indicators, and deep profiling for additional insight, driving or confirming data relationships and identifying opportunities for enrichment. There are about 40-50 profiling dimensions, which can be found in any trade book. I can help you choose 5-7 dimensions depending on your industry and domain of interest, again saving you money!
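As a rough illustration of what a surface pass might look like, the sketch below computes a few leading indicators for a single column using only the Python standard library. The choice of indicators (missing rate, distinct count, most common value shapes) is my assumption for the example, not a recommended set of dimensions, and the surface_profile function and sample values are hypothetical.

```python
from collections import Counter


def shape(value):
    """Reduce a value to its pattern: digits become 9, letters become A."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in str(value).strip()
    )


def surface_profile(values):
    """Return a few leading indicators for one column of raw values."""
    total = len(values)
    # Treat None and blank strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    patterns = Counter(shape(v) for v in present)
    missing_pct = round(100 * (total - len(present)) / total, 1) if total else 0.0
    return {
        "rows": total,
        "missing_pct": missing_pct,
        "distinct": len(set(present)),
        "top_patterns": patterns.most_common(3),
    }


# Made-up postal codes, including a blank, a null and a letter O typed as a zero.
sample = ["30301", "30301-1234", "", None, "3O301", "30309"]
report = surface_profile(sample)
```

The pattern counts are the kind of leading indicator that earns a deeper look: a shape such as 9A999 in a postal code column points at a keying error worth chasing in the deep phase.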
Interpreting profiled results takes deep expertise. Most tools can tell you that 86% of your data has 3 areas of density across 58,000 rows. OK. So? How will you translate this for your business users so that they relate to the observation, understand the data situation AND, most importantly, are encouraged to fix or alleviate the problem? This is one opportunity you do not want to lose: aligning IT with the business on the current state of data. As a leader, the onus is on you to pinpoint the quality of the data, explain its business impact and pave the way for its enrichment.
Stay tuned for Part 3 of this topic on data standardization.
I hope this helps you answer the question: is data profiling art or science?
Sunday, July 19, 2009