I grew up in the Amazon. (How and why is a topic for another forum.) Today, I am an analyst. Although there are few similarities between these two worlds, some principles are common to both.
As children, my siblings and I often dug our toes into the warm clay on the banks of the river just to feel the ooze. Sometimes we added water to the clay and created wonderful mud slides for a thrilling ride into the river. For us, the clay was about fun.
For the women in the village, the clay was more about survival than fun. The clay was a key component for making bowls and water storage containers. I remember trying to imitate the women when they rolled long strands of clay and formed them into careful spirals. My attempts were generally unsuccessful. The problem was not the raw ingredient; my clay came from the same source. The problem was also not in my rolling and sculpting techniques. The problem was in my preparation of the clay.
I later learned that the women added ashes to their clay to give it strength. Years of experience passed along from generation to generation had taught the women just how much ash to add to the clay, which kind of wood was best for making the ashes, and how to manipulate the clay to get the perfect consistency for making pottery.
In analytics, data is the clay. The clay can be fun or utilitarian, but whatever the use, the preparation is vital. This brings us to Tom Khabaza’s third law of data mining:
Law #3: “Data Preparation Law” – Data preparation is more than half of every data mining [location analytics] process
Data preparation includes the tasks of acquiring, cleaning, and transforming the data into a format that you can use for your analysis. Think about the potter’s hands squeezing, pulling, and pushing on the clay to integrate just the right amount of ashes to give the clay the right consistency. That’s what we need to do with data before we can create business value and insight from it. The purpose of this task is to manipulate the data into the form required to answer the questions associated with the analysis objectives. Ask yourself: what do I need to do to make the clay useful for sculpting the clay pot? It is important to emphasize that in both tasks, sculpting clay and analyzing data, there is a great deal of creative, “right brain” thinking involved. So, although some “left brain,” ETL process-oriented tasks become part of data preparation, the heart and soul of analytics is the work of a data artist as well as a data scientist.
Gather the clay and add the ashes
When you change the data format, you change what Khabaza calls the “problem space.” We need to define the problem space in several dimensions, which we can frame as the “what,” “when,” and “where” questions. First, there is the “what” question: identify which data variables to include in the analysis and then acquire that data. Keep in mind that acquiring data can take time if it has to come from different departments within the organization or be collected through a survey.
Sometimes you have to think creatively to define a proxy variable when you don’t have the data that you would like. For example, if you were studying a travel-related business but didn’t have data on the number of tourists, you could use the number of hotels and rental car companies as proxy variables because tourists purchase these complementary services. For each variable that you choose to study, business knowledge should guide the creation of an initial hypothesis about why that variable is relevant to the analysis.
At other times you need to use derived (calculated) fields, data subsets (sampling), or aggregated data rather than the raw data. For example, you may need to summarize customer points to a grid in order to see the density pattern properly. You may need to calculate the number of households with income less than $35,000 rather than using the separate income ranges defined by the Census data. Again, business knowledge and data knowledge must guide the data preparation process in order for the analysis to produce value.
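To make those two examples concrete, here is a minimal Python sketch of a derived field and a grid summary. Everything in it is an assumption made for illustration: the income brackets, the household counts, the customer coordinates, and the helper names households_below and grid_counts are invented, not taken from real Census data or from any particular GIS product.

```python
from collections import Counter

# Hypothetical census-style record: household counts keyed by income bracket.
# Bracket bounds and counts are invented for illustration only.
block_group = {
    "id": "BG-001",
    "households_by_income": {
        (0, 14_999): 120,
        (15_000, 24_999): 95,
        (25_000, 34_999): 80,
        (35_000, 49_999): 140,
        (50_000, 74_999): 160,
    },
}


def households_below(record, threshold):
    """Derived field: add up the brackets that lie entirely below the threshold."""
    return sum(
        count
        for (_low, high), count in record["households_by_income"].items()
        if high < threshold
    )


def grid_counts(points, cell_size):
    """Aggregate (x, y) points to square grid cells and count the points per cell."""
    return Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )


low_income_households = households_below(block_group, 35_000)

# Invented customer locations in arbitrary map units.
customers = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.4), (1.7, 1.9), (0.4, 0.9)]
density = grid_counts(customers, cell_size=1.0)
```

The cell size is itself a preparation decision: a coarser or finer grid can show a very different density pattern from the same points.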
Second, there is the time variable of the problem space: the “when” question. Is the data collected at a single point in time or over a range of time and, if a range, how long should the period be? Segmenting the data into different time periods can lead to different conclusions. Common sense tells us that if we studied the sale of snow shovels in the summer versus the winter months, the patterns would be very different. Other time considerations are less obvious, so it is necessary to test various time scenarios in order to identify the pattern that is useful for the analytic model.
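The snow shovel example can be sketched in a few lines of Python. The sales figures are invented, and the season definitions are an assumption (Northern Hemisphere, three-month winter and summer); in practice you would test several alternative time windows, not just this one.

```python
from collections import defaultdict
from datetime import date

# Invented sales records: (sale date, units sold).
sales = [
    (date(2023, 1, 15), 40),
    (date(2023, 2, 3), 35),
    (date(2023, 7, 10), 2),
    (date(2023, 8, 22), 1),
    (date(2023, 12, 5), 50),
]

# Assumption: Northern Hemisphere seasons; adjust for your study area.
WINTER_MONTHS = {12, 1, 2}
SUMMER_MONTHS = {6, 7, 8}


def season_of(day):
    """Assign a date to a coarse season label."""
    if day.month in WINTER_MONTHS:
        return "winter"
    if day.month in SUMMER_MONTHS:
        return "summer"
    return "other"


def units_by_season(records):
    """Total the units sold within each season segment."""
    totals = defaultdict(int)
    for day, units in records:
        totals[season_of(day)] += units
    return dict(totals)


seasonal_units = units_by_season(sales)
```

Swapping season_of for a function that segments by month, quarter, or weekday is an easy way to compare time scenarios against the same records.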
Third, there is the geographic component: the “where” question. In Location Analytics, we need to define the geographic extent that is being studied. Segmenting the data for a single location, a trade area, a city, multiple locations within a metro area, or an entire country is necessary to answer different analysis questions. Defining the spatial extent is also part of selecting the variables that will be applicable within that extent. A data variable selected for one level of geography might not be suitable for a different geography. In addition, the level at which the data is aggregated to spatial boundaries can have a significant effect on the results of the analysis.
Remove the lumps
As you manipulate the clay with your hands, you sometimes find things that ruin the smooth consistency, and you need to remove these organic lumps. Once the variables have been selected, data cleansing is usually required. For example, you may need to remove incomplete records or records that have a low geocoding score. Likewise, it is not appropriate to use customer records that have been geocoded to a ZIP Code centroid for customer segmentation and psychographic analysis at the block group level. On the other hand, the identification of outliers and anomalies in the data can sometimes yield very useful patterns, so care must be taken to remove only what is truly detrimental to the end result of the analysis.
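A minimal sketch of that kind of cleansing pass follows. The field names, the match_level values, and the score threshold of 80 are all hypothetical; real geocoders report their scores and match types in their own ways. Rejected records are set aside with a reason rather than discarded, so that possible anomalies can still be reviewed.

```python
# Hypothetical field names and threshold, chosen only for illustration.
REQUIRED_FIELDS = ("customer_id", "x", "y", "geocode_score")
MIN_SCORE = 80

customers = [
    {"customer_id": 1, "x": -117.19, "y": 34.05, "geocode_score": 98,
     "match_level": "address"},
    {"customer_id": 2, "x": -117.20, "y": 34.06, "geocode_score": 62,
     "match_level": "address"},
    {"customer_id": 3, "x": None, "y": None, "geocode_score": 0,
     "match_level": None},
    {"customer_id": 4, "x": -117.25, "y": 34.10, "geocode_score": 100,
     "match_level": "zip_centroid"},
]


def is_complete(record):
    """A record is complete when every required field has a value."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def clean(records, min_score, fine_scale=True):
    """Split records into kept and rejected, noting why each one was rejected."""
    kept, rejected = [], []
    for record in records:
        if not is_complete(record):
            rejected.append((record, "incomplete"))
        elif record["geocode_score"] < min_score:
            rejected.append((record, "low geocoding score"))
        elif fine_scale and record["match_level"] == "zip_centroid":
            # Centroid matches are too coarse for block group level work.
            rejected.append((record, "too coarse for block group analysis"))
        else:
            kept.append(record)
    return kept, rejected


kept, rejected = clean(customers, MIN_SCORE)
```

Passing fine_scale=False keeps the ZIP Code centroid records, which may be perfectly usable for an analysis at a coarser geography.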
Test the consistency of the clay again and again
Before you can construct the analytics model for the data, you need to test various ways of manipulating the data to identify the format that will be most useful for answering your business questions. For example, sometimes you can use a simple correlation analysis in Excel to identify which variables might be helpful. If there is a high correlation between affluent customers and revenue, then you might formulate a hypothesis that the income pattern is not random and that stores in areas with a high density of households with income greater than $150,000 will have a better chance of success. However, keep in mind that you may need to modify this hypothesis after you profile your customers, because you might actually have several different customer profile groups, each served by a different product or service. Thus, data preparation is not a “one-time” activity at the start of the analysis; it is an iterative process that continues throughout the analysis as you gain insight about the data.
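The same screening step can be done outside of Excel. The sketch below computes a plain Pearson correlation between each candidate variable and revenue, then ranks the candidates by strength. The store figures and variable names are invented, and a strong correlation is only a starting point for a hypothesis, not evidence of cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented per-store figures: one value per store, in the same store order.
revenue = [210, 400, 560, 150, 720]
candidates = {
    "affluent_households": [120, 300, 450, 80, 610],
    "traffic_count": [5000, 4800, 5100, 4900, 5050],
}

# Rank the candidate variables by the strength of their correlation with revenue.
ranked = sorted(
    ((name, pearson(values, revenue)) for name, values in candidates.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Ranking by absolute value keeps strongly negative relationships near the top, since those can be as informative as positive ones.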
Take the example of a car wash business. The car wash may have several options ranging from a basic wash to a deluxe wash. Depending on the location, you could have a very high density of “basic wash” customers that leads to high revenue or you could have a low density of deluxe wash customers who wash their cars more frequently. Both scenarios can be profitable and you can also have a single location with both patterns present. In order to provide the best recommendations for operations, marketing, and future site selection, you need to understand all of the patterns that are influencing revenue.
Thus, in addition to defining correlations, you may need to complete a grouping analysis as part of the data preparation work. Once groups are defined, you would need to create separate map layers (or layers with subsets of data created from definition queries) so that you can study each group individually. In cases where you have millions of records, you may also need to summarize the data by geographic areas to allow your analysis to complete more quickly.
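Using the car wash example, the sketch below splits visit records into one subset per group, much like creating a layer per group with a definition query, and then summarizes each subset by area. Note the simplification: it groups on a known attribute (wash_type) as a stand-in for a true grouping or clustering analysis, and the records, area codes, and prices are invented.

```python
from collections import defaultdict

# Invented visit records; "area" stands in for any geographic unit.
visits = [
    {"customer_id": 1, "wash_type": "basic", "area": "A", "spend": 8.0},
    {"customer_id": 2, "wash_type": "basic", "area": "A", "spend": 8.0},
    {"customer_id": 3, "wash_type": "deluxe", "area": "A", "spend": 25.0},
    {"customer_id": 4, "wash_type": "basic", "area": "B", "spend": 8.0},
    {"customer_id": 5, "wash_type": "deluxe", "area": "B", "spend": 25.0},
    {"customer_id": 6, "wash_type": "deluxe", "area": "B", "spend": 25.0},
]


def split_by(records, key):
    """Create one subset of records per distinct value of the key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


def summarize_by_area(records, area_key, value_key):
    """Collapse records to one row per area: visit count and total revenue."""
    summary = defaultdict(lambda: {"visits": 0, "revenue": 0.0})
    for record in records:
        row = summary[record[area_key]]
        row["visits"] += 1
        row["revenue"] += record[value_key]
    return dict(summary)


# One "layer" per group, each summarized by area.
layers = split_by(visits, "wash_type")
summaries = {
    name: summarize_by_area(rows, "area", "spend")
    for name, rows in layers.items()
}
```

Summarizing before mapping is what keeps the analysis responsive when the raw table holds millions of records, at the cost of hiding variation within each area.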
As the third law points out, the data preparation task takes time and effort—more than half of every data mining process. So, those of you who are the recipients of the analysis results need to give your analysts time to do this very necessary work as part of the analytics process. Remember, if the clay is prepared correctly, you are well on your way to being able to craft your pottery.
To get started with some other data preparation ideas, check out Ricky Ho’s excellent blog post on data preparation techniques for data mining. For those using SAS, you might want to read Data Preparation for Analytics Using SAS by Gerhard Svolba, Ph.D.