Data Mining
for the Masses
240
Data Analysis: The process of examining data in a repeatable and structured way in order to
extract meaning, patterns or messages from a set of data. (Page 3)
Data Mart: A location where data are stored for easy access by a broad range of people in an
organization. Data in a data mart are generally archived data, enabling analysis in a setting that
does not impact live operations. (Page 20)
Data Mining: A computational process of analyzing data sets, usually large in nature, using both
statistical and logical methods, in order to uncover hidden, previously unknown, and interesting
patterns that can inform organizational decision making. (Page 3)
Data Preparation: The third in the six steps of CRISP-DM. At this stage, the data miner ensures
that the data to be mined are clean and ready for mining. This may include handling outliers or
other inconsistent data, dealing with missing values, reducing attributes or observations, setting
attribute roles for modeling, etc. (Page 8)
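As a rough illustration of one preparation task named above, dealing with missing values, the sketch below fills gaps in a numeric attribute with the mean or median of the known values. The function name fill_missing and the sample ages are hypothetical, and this is only one of several reasonable strategies, not the book's prescribed method.

```python
from statistics import mean, median


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the known values.

    Hypothetical helper for illustration only.
    """
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to impute from")
    fill = mean(known) if strategy == "mean" else median(known)
    return [fill if v is None else v for v in values]


# Made-up attribute with two missing observations.
ages = [34, None, 29, 41, None, 36]
print(fill_missing(ages))
```

Whether to impute, and with which statistic, depends on the data set; removing the affected observations is another option.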
Data Set: Any compilation of data that is suitable for analysis. (Page 18)
Data Type: In a data set, each attribute is assigned a data type based on the kind of data stored in
the attribute. There are many data types which can be generalized into one of three areas:
Character (Text) based; Numeric; and Date/Time. Within these categories, RapidMiner has
several data types. For example, in the Character area, RapidMiner has Polynominal, Binominal,
etc.; and in the Numeric area it has Real, Integer, etc. (Page 39)
Data Understanding: The second in the six steps of CRISP-DM. At this stage, the data miner
seeks out sources of data in the organization, and works to collect, compile, standardize, define
and document the data. The data miner develops a comprehension of where the data have come
from, how they were collected and what they mean. (Page 7)
Data Warehouse: A large-scale repository for archived data which are available for analysis. Data
in a data warehouse are often stored in multiple formats (e.g. by week, month, quarter and year),
facilitating large scale analyses at higher speeds. The data warehouse is populated by extracting
data from operational systems so that analyses do not interfere with live business operations.
(Page 18)
Database: A structured collection of facts organized so that the facts can be reliably
and repeatedly accessed. The most common type of database is a relational database, in which
facts (data) are arranged in tables of columns and rows. The data are then accessed using a query
language, usually SQL (Structured Query Language), in order to extract meaning from the tables.
(Page 16)
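A minimal sketch of the idea, using Python's built-in sqlite3 module: facts are stored in a table of columns and rows, then read back with an SQL query. The customers table and its contents are invented for illustration.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A table arranges facts in columns (attributes) and rows (observations).
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Lima"), ("Ben", "Oslo"), ("Cy", "Lima")],
)

# SQL extracts meaning from the table: here, customers per city.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)

conn.close()
```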
Decision Tree: A data mining methodology where leaves and nodes are generated to construct a
predictive tree, whereby a data miner can see the attributes which are most predictive of each
possible outcome in a target (label) attribute. (Pages 9, 159)
Denormalization: The process of removing relational organization from data, reintroducing
redundancy into the data, but simultaneously eliminating the need for joins in a relational database,
enabling faster querying. (Page 18)
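The trade-off can be sketched in a few lines of Python. Two related collections (hypothetical customers and orders) are flattened into one, so that each order row repeats the customer's details. The redundancy returns, but no lookup or join is needed when the flat rows are read later.

```python
# Normalized form: customer facts are stored once and referenced by key.
customers = {
    1: {"name": "Ana", "city": "Lima"},
    2: {"name": "Ben", "city": "Oslo"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 2, "total": 40.0},
    {"order_id": 102, "customer_id": 1, "total": 15.5},
]


def denormalize(orders, customers):
    """Copy each customer's attributes onto every one of their orders."""
    flat = []
    for order in orders:
        customer = customers[order["customer_id"]]
        flat.append(
            {
                **order,
                "customer_name": customer["name"],
                "customer_city": customer["city"],
            }
        )
    return flat


flat_orders = denormalize(orders, customers)
for row in flat_orders:
    print(row)
```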
Dependent Variable (Attribute): The attribute in a data set that is being acted upon by the other
attributes. It is the thing we want to predict, the target, or label, attribute in a predictive model.
(Page 108)
Deployment: The sixth and final of the six steps of CRISP-DM. At this stage, the data miner
takes the results of data mining activities and puts them into practice in the organization. The data
miner watches closely and collects data to determine if the deployment is successful and ethical.
Deployment can happen in stages, such as through pilot programs before a full-scale roll out.
(Page 10)
Descartes' Rule of Change: An ethical framework set forth by René Descartes which states that
if an action cannot be taken repeatedly, it cannot be ethically taken even once. (Page 235)
Design Perspective: The view in RapidMiner where a data miner adds operators to a data mining
stream, sets those operators’ parameters, and runs the model. (Page 41)
Discriminant Analysis: A predictive data mining model which attempts to compare the values of
all observations across all attributes and identify where natural breaks occur from one category to
another, and then predict which category each observation in the data set will fall into. (Page 108)
Ethics: A set of moral codes or guidelines that an individual develops to guide his or her decision
making in order to make fair and respectful decisions and engage in right actions. Ethical
standards are higher than legally required minimums. (Page 232)
Evaluation: The fifth of the six steps of CRISP-DM. At this stage, the data miner reviews the
results of the data mining model, interprets results and determines how useful they are. He or she
may also conduct an investigation into false positives or other potentially misleading results. (Page
10)
False Positive: A predicted value that ends up not being correct; most commonly, a prediction
that an outcome will occur when in fact it does not. (Page 221)
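A small sketch of how false positives might be counted once the true outcomes are known. It assumes the common, narrower reading of the term: the model predicted the positive class, but the actual value was something else. The function name and sample values are hypothetical.

```python
def count_false_positives(actual, predicted, positive=True):
    """Count predictions of the positive class that turned out to be wrong."""
    return sum(
        1
        for a, p in zip(actual, predicted)
        if p == positive and a != positive
    )


# Made-up outcomes for five observations.
actual = [True, False, False, True, False]
predicted = [True, True, False, False, True]

print(count_false_positives(actual, predicted))
```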
Field: See Attribute (Page 16).
Frequency Pattern: A recurrence of the same, or similar, observations numerous times in a
single data set. (Page 81)
Fuzzy Logic: A data mining concept often associated with neural networks where predictions are
made using a training data set, even though some uncertainty exists regarding the data and a
model’s predictions. (Page 181)
Gain Ratio: One of several algorithms used to construct decision tree models. (Page 168)
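For readers who want to see the calculation, the sketch below uses the usual textbook formulation of gain ratio: the information gain from splitting on an attribute, divided by the entropy of the attribute's own value distribution (the split information). The glossary does not give this formula, so treat the code as an assumption-laden illustration; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy, in bits, of the value distribution in items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(labels, attribute_values):
    """Information gain of a split divided by its split information."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy of the label within each partition.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    information_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    # An attribute with a single value cannot split the data.
    return information_gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
print(gain_ratio(labels, ["a", "a", "b", "b"]))  # attribute separates the labels
print(gain_ratio(labels, ["a", "b", "a", "b"]))  # attribute tells us nothing
```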
Gini Index: An algorithm created by Corrado Gini that can be used to generate decision tree
models. (Page 168)
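As an illustration, decision tree learners commonly use the Gini index in the form of Gini impurity: one minus the sum of the squared class proportions in a group of observations. The sketch below assumes that formulation, which the glossary itself does not spell out; the function name is hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 minus the sum of squared class proportions.

    0.0 means every observation has the same label; higher values
    mean the labels are more mixed.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


print(gini_impurity(["yes", "yes", "no", "no"]))  # evenly mixed group
print(gini_impurity(["yes", "yes", "yes"]))       # pure group
```

A tree builder would typically prefer the split whose resulting groups have the lowest weighted impurity.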
Heterogeneity: In statistical analysis, this is the amount of variety found in the values of an
attribute. (Page 119)
Inconsistent Data: These are values in an attribute in a data set that are out-of-the-ordinary
among the whole set of values in that attribute. They can be statistical outliers, or other values that