Excel 2007 Data Mining and Analysis tools are an add-in that can be used with Excel. Once the add-in is installed, you will have 2 new tabs in Excel: Data Mining and Analyze. This document is divided into 2 sections; the first consists of the Data Mining options, the second consists of the Analyze options.
In order to use these tabs, your worksheet needs to be defined as a Table. To do this, follow these basic steps:
Use filtering to limit your spreadsheet to the rows and columns you’d like to analyze.
Selecting the right attributes from the source data and getting them into the right format for mining typically takes up a large percentage of the time in a typical data mining project. This section provides tools that address common data preparation needs for data mining:
· Explore Data - This option will create a simple histogram of a column of data. It will give you an idea of what your data looks like before you start to mine and analyze it. Below is a sample of gpa spreads.
· Clean Data – There are two options: Remove outliers and Re-label. Remove Outliers allows you to remove uncommon data from a column in your spreadsheet. This can help improve the accuracy of your analysis. Re-label changes discrete values for a column of data. For example, your source data may contain “M” and “F” as states for the Gender column and you prefer to use “Male” and “Female” for clarity when presenting the model results. Or if your data contains both ‘m’ and ‘M’, you can change them both to “Male”.
Sample Data – This option creates a reduced or balanced subset of data from your spreadsheet for building your mining and analysis structures.
· Partition Data - Split the source data into training and test sets, take a random sample of the input data or perform oversampling to adjust for skewed distributions.
Data Modeling Options
This section covers the actual model definition and processing. It provides wizards that help you easily build common types of mining models without worrying about the actual mining algorithms and associated parameters supported on the server.
Also included in this section are advanced options that allow the user to pick the exact mining algorithm and tweak additional parameters.
Classify Data – Build a model based on existing data from one of your Excel tables. This option will select patterns that predict values in a target column based on values in other columns.
Estimate – This wizard leads you through the steps to build an estimation model based on your Excel table. The model uses patters to predict the continuous values of one column based on other columns.
Cluster – This builds a clustering model based on your data. It looks for groups of rows that share similar characteristics and groups them together.
Associate – An Association Rules model finds associations between appear together in multiple rows.
Forecast – Data Mining Forecasting will find patterns in a series of Excel cells, and will use these patterns to predict the evolution of that series of cells.
This tool analyzes the patterns in data that have the strongest influence on a certain outcome. For example, if you have a worksheet of prospects, Analyze Key Influencers can analyze the factors that are key to determining which students will have the highest gpa. It may not be obvious which fields are the key influencers. For example, you might find that having a particular major might lead to a higher gpa. Here’s an example of a report that looks for key influences to membership in Sigma Xi:
To use the Key Influencers function, follow these basic steps:
Create a new worksheet with a Table of the data you’d like to analyze (see top of this document)
Click on the Analyze tab
Click on Analyze Key Influencers
You will see the below set-up window.
In the drop-down box, choose the column in your spreadsheet that you’d like to analyze. In the above example, you would choose the gpa column.
Excel, by default, assumes that you’d like to use every column in the spreadsheet to look for key influences. If you have many columns in your spreadsheet, this may not be practical. Click on the blue text ‘Choose columns to be used for analysis’ to limit these columns.
When you first enter this window, all of the spreadsheet columns will be checked. Uncheck any that you do NOT want to use for analysis.
Excel uses ‘Discrimanant Analysis’ for analyzing key influencers. Discriminant function analysis is used to determine which pieces of data discriminate between two or more naturally occurring groups. For example, an Admissions office may want to investigate which variables discriminate between (1) those who apply to Carleton, and (2) Those who do not. For that purpose the researcher could collect data on numerous variables. Most prospects will fall into one of the 2 categories; applied or did not apply. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of students' subsequent application choice.
The Detect Categories tool detects the rows in the data table that contain similar characteristics and groups them into categories. Each category identified is described by the distinguishing characteristics of the rows assigned to the category. Based on those characteristics, you may choose to give the category a more understandable name. Breaking the data into categories enables you to quickly identify the natural groupings in your data.
The tool also labels each row in the source sheet with the category that it was assigned to, in addition to producing a report like one below:
Fill from Example
This tool automatically fills in missing data in a specific column for all rows in the selected table, based on exemplary values provided by you for some of those rows. The tool employs SQL Server Data Mining to detect patterns in the sample you provide and apply those to the remaining missing values.
In addition to filling in the missing values, the Fill From Example tool also produces a patterns report similar to the Analyze Key Influencers tool that describes the rules that were applied to fill in the missing values
The Forecast tool predicts future values based on trends in existing time series data. For example, you can forecast what annual sales will be two years from now, based on current conditions and factors in your data. The tool appends the forecast values as new rows to the table selected for forecasting.
The tool also produces a chart that graphically plots the existing and forecast values as shown below:
This tool analyzes and highlights data rows that do not appear to match the general patterns found in the rest of the table. These exceptions could be either be due to data entry errors or they may be genuinely unusual values that require further analysis. Highlighting exceptions can be useful because these values can significantly skew averages and trends. If exceptions are caused by data entry errors, it is likely that you will want to fix these errors before doing additional analysis or trending.
Highlight Exceptions not only highlights exception rows, it also highlights the specific column value in each row that is likely to be the cause of the exception. You can change the column value in-place and the row is immediately re-evaluated to determine if it is still an exception.
In addition, the tool produces an interactive exceptions report that can be used to dynamically vary the exception threshold so you can choose to see more or less exceptions.
Here’s sample output from the Highlight Exceptions tool, with the source data and the exceptions report sheets displayed side-by-side:
You can also use the Excel sorting options to sort by ‘color’. This will pull your exceptions to the top of the spreadsheet for easy review.
This tool allows you to model two types of scenarios and report the effect on either a single row or the entire table of input data:
Goal Seek – Analyzes and reports the factors that need to change to achieve a specific goal. Goal Seek answers questions such as “What is the value in column X so that column Y has a value of Z”. For example, the below screen shot shows having a goal of a gpa between 3.20 and 4.00, and the effect of changing the first major. The sample output below shows the output produced by the tool for a single row in the sample data:
Goal seek uses what is called a Monty Carlo method of sampling. For more information on this technique, you can consult Wikipedia at Monte Carlo Method.
What-If – Analyzes and reports the impact of a proposed or hypothetical change. For example, you can see the effect of change in major if the student’s home state changes to Arizona.
With both Goal Seeking and What-if, a confidence is associated with your results. This confidence number ranges from 0 to 100, with 100 being the most confident. If you get a high confidence number, it means that a loarge number of rows in the table support the simulation.
This tool detects the patterns that predict a specific value of a column based on values in the other columns. The tool generates reports and charts of various scores. It may also generate an operational Prediction Calculator and a printer-ready sheet. The below is a sample of student data; looking at how home country might predict gpa.
Shopping Basket Analysis
The Shopping Basket Analysis tool facilitates cross-selling analysis. It can be applied on a table containing transactions. The tool identifies groups of items that tend to appear together as well as rules that can be used in recommendations. If the table contains a Value column associated with each item of each transaction, the tool also computes the "lift" for each group and rule (a measure of how much the value of the respective item or group increases in the context of the pattern identified by the tool).