Glossary and Index
simply don’t make sense in the context of the ‘normal’ range of values for the attribute. They are
generally replaced or removed during the Data Preparation phase of CRISP-DM. (Page 50)
Independent Variable (Attribute): These are attributes that act on the dependent attribute (the
target, or label). They are used to help predict the label in a predictive model. (Page 133)
Jittering: The process of adding a small, random decimal to discrete values in a data set so that
when they are plotted in a scatter plot, they are slightly apart from one another, enabling the
analyst to better see clustering and density. (Pages 17, 70)
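A minimal sketch of jittering, assuming a uniform random offset; the function name `jitter`, the `spread` parameter and the sample ratings are illustrative and not taken from the text:

```python
import random

def jitter(values, spread=0.1, seed=None):
    """Return a copy of values with a small random decimal added to each one."""
    rng = random.Random(seed)  # a seed makes the jitter repeatable
    return [v + rng.uniform(-spread, spread) for v in values]

# Six discrete ratings that would overlap exactly in a scatter plot.
ratings = [1, 1, 1, 2, 2, 3]
jittered = jitter(ratings, spread=0.1, seed=42)
```

Each jittered value stays within `spread` of its original, so plotted points separate slightly without changing the overall pattern.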
Join: The process of connecting two or more tables in a relational database together so that their
attributes can be accessed in a single query, such as in a view. (Page 17)
Kant's Categorical Imperative: An ethical framework proposed by Immanuel Kant which states
that if everyone cannot ethically take some action, then no one can ethically take that action. (Page
234)
k-Means Clustering: A data mining methodology that uses the mean (average) values of the
attributes in a data set to group each observation into a cluster of other observations whose values
are most similar to the mean for that cluster. (Page 92)
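The grouping idea can be sketched for a single numeric attribute. This is a simplified illustration, not RapidMiner's implementation; the function name `k_means_1d`, the starting means and the fixed iteration count are assumptions made for the example:

```python
def k_means_1d(values, centroids, iterations=10):
    """Group values around the cluster means (centroids) they are closest to."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assign each observation to the cluster whose mean is nearest.
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each cluster's mean; keep the old mean if a cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

means, groups = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
# means  -> [2.0, 11.0]
# groups -> [[1, 2, 3], [10, 11, 12]]
```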
Label: In RapidMiner, this is the role that must be set in order to use an attribute as the
dependent, or target, attribute in a predictive model. (Page 108)
Laws: These are regulatory statutes which have associated consequences that are established and
enforced by a governmental agency. According to Lawrence Lessig, these are one of the four
methods for establishing boundaries to define and regulate social behavior. (Page 233)
Leaf: In a decision tree data mining model, this is the terminal end point of a branch, indicating
the predicted outcome for observations whose values follow that branch of the tree. (Page 164)
Linear Regression: A predictive data mining method which uses the algebraic formula for
calculating the slope of a line in order to predict where a given observation will likely fall along that
line. (Page 128)
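As an illustration of the slope-of-a-line idea, the sketch below fits a line to one independent attribute by ordinary least squares. The function name `fit_line` and the sample data are assumptions for the example; real models usually involve several attributes:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept  # where a new observation with x = 5 should fall
```

For these points the fitted line is y = 2x + 1, so the prediction for x = 5 is 11.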
Data Mining for the Masses
Logistic Regression: A predictive data mining method which uses the logistic (sigmoid) function to
predict one of a set of possible outcomes, along with a probability that the prediction will be the
actual outcome. (Page 142)
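In most formulations, logistic regression passes a weighted sum of the attribute values through the logistic (sigmoid) function to obtain a probability between 0 and 1. The sketch below shows only that scoring step; the weights and bias are hypothetical values, not the result of training a model:

```python
import math

def predict_probability(weights, bias, attributes):
    """Return the probability of the positive outcome for one observation."""
    z = bias + sum(w * a for w, a in zip(weights, attributes))
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function

# Hypothetical coefficients for a two-attribute model.
p = predict_probability(weights=[0.8, -0.5], bias=-1.0, attributes=[3.0, 1.0])
outcome = "yes" if p >= 0.5 else "no"
```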
Markets: A socio-economic construct in which people’s buying, selling, and exchanging behaviors
define the boundaries of acceptable or unacceptable behavior. Lawrence Lessig offers this as one
of four methods for defining the parameters of appropriate behavior. (Page 233)
Mean: See Average. (Pages 47, 77)
Median: With the Mean and Mode, this is one of three generally used Measures of Central
Tendency. It is an arithmetic way of defining what ‘normal’ looks like in a numeric attribute. It is
calculated by rank ordering the values in an attribute and finding the one in the middle. If there
are an even number of observations, the two in the middle are averaged to find the median. (Page
47)
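The calculation described above can be sketched as follows; the function name `median` is illustrative, and Python's standard library offers the same calculation as `statistics.median`:

```python
def median(values):
    """Rank order the values and return the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even number of observations: average the two in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([5, 1, 3])       # 3
even_case = median([4, 1, 3, 2])   # 2.5
```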
Meta Data: These are facts that describe the observational values in an attribute. Meta data may
include who collected the data, when, why, where, how, how often; and usually include some
descriptive statistics such as the range, average, standard deviation, etc. (Page 42)
Missing Data: These are instances in an observation where one or more attributes do not have
a value. It is not the same as zero, because zero is a value. Missing data are like Null values in a
database: they are either unknown or undefined. These are usually replaced or removed during the
Data Preparation phase of CRISP-DM. (Page 30)
Mode: With Mean and Median, this is one of three common Measures of Central Tendency. It is
the value in an attribute which is the most common. It can be numerical or text. If an attribute
contains two or more values that appear an equal number of times and more than any other values,
then all are listed as the mode, and the attribute is said to be Bimodal or Multimodal. (Pages 42,
47)
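A short sketch of finding the mode, including the bimodal or multimodal case. The function name `modes` is illustrative, and the sketch assumes a non-empty list of values:

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

single = modes(["red", "blue", "red"])   # ['red']
bimodal = modes([1, 1, 2, 2, 3])         # [1, 2]
```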
Model: A computer-based representation of real-life events or activities, constructed upon the
basis of data which represent those events. (Page 8)
Name (Attribute): This is the text descriptor of each attribute in a data set. In RapidMiner, the
first row of an imported data set should be designated as the attribute name, so that these are not
interpreted as the first observation in the data set. (Page 38)
Neural Network: A predictive data mining methodology which tries to mimic human brain
processes by comparing the values of all attributes in a data set to one another through the use of a
hidden layer of nodes. The frequencies with which the attribute values match, or are strongly
similar, create neurons which become stronger at higher frequencies of similarity. (Page 176)
n-Gram: In text mining, this is a combination of words or word stems that represent a phrase that
may have more meaning or significance than would the single word or stem. (Page 201)
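A minimal sketch of building n-grams from a list of tokens. The function name `n_grams` and the underscore used to join the words are assumptions made for the example:

```python
def n_grams(tokens, n=2):
    """Return each run of n consecutive tokens joined into one term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = n_grams(["data", "mining", "for", "masses"], n=2)
# ['data_mining', 'mining_for', 'for_masses']
```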
Node: A terminal or mid-point in decision trees and neural networks where an attribute branches
or forks away from other terminals or branches because the values represented at that point have
become significantly different from all other values for that attribute. (Page 164)
Normalization: In a relational database, this is the process of breaking data out into multiple
related tables in order to reduce redundancy and eliminate multivalued dependencies. (Page 18)
Null: The absence of a value in a database. The value is unrecorded, unknown, or undefined.
See Missing Data. (Page 30)
Observation: A row of data in a data set. It consists of the value assigned to each attribute for
one record in the data set. It is sometimes called a tuple in database language. (Page 16)
Online Analytical Processing (OLAP): A database concept where data are collected and
organized in a way that facilitates analysis, rather than practical, daily operational work. Evaluating
data in a data warehouse is an example of OLAP. The underlying structure that collects and holds
the data makes analysis faster, but would slow down transactional work. (Page 18)
Online Transaction Processing (OLTP): A database concept where data are collected and
organized in a way that facilitates fast and repeated transactions, rather than broader analytical
work. Scanning items being purchased at a cash register is an example of OLTP. The underlying
structure that collects and holds the data makes transactions faster, but would slow down analysis.
(Page 17)
Operational Data: Data which are generated as a result of day-to-day work (e.g. the entry of
work orders for an electrical service company). (Page 19)
Operator: In RapidMiner, an operator is any one of more than 100 tools that can be added to a
data mining stream in order to perform some function. Functions range from adding a data set, to
setting an attribute’s role, to applying a modeling algorithm. Operators are connected into a
stream by way of ports connected by splines. (Pages 34, 41)
Organizational Data: These are data which are collected by an organization, often in aggregate
or summary format, in order to answer a specific question or tell a story. They may be
constructed from Operational Data, or added to through other means such as surveys,
questionnaires or tests. (Page 19)
Organizational Understanding: The first step in the CRISP-DM process, usually referred to as
Business Understanding, where the data miner develops an understanding of an organization’s
goals, objectives, questions, and anticipated outcomes relative to data mining tasks. The data
miner must understand why the data mining task is being undertaken before proceeding to gather
and understand data. (Page 6)
Parameters: In RapidMiner, these are the settings that control values and thresholds that an
operator will use to perform its job. These may be the attribute name and role in a Set Role
operator, or the algorithm the data miner desires to use in a model operator. (Page 44)
Port: The input or output required for an operator to perform its function in RapidMiner. These
are connected to one another using splines. (Page 41)
Prediction: The target, or label, or dependent attribute that is generated by a predictive model,
usually for a scoring data set in a model. (Page 8)
Premise: See Antecedent. (Page 85)
Privacy: The concept describing a person’s right to be let alone; to have information about them
kept away from those who should not, or do not need to, see it. A data miner must always respect
and safeguard the privacy of individuals represented in the data he or she mines. (Page 20)
Professional Code of Conduct: A helpful guide or documented set of parameters by which an
individual in a given profession agrees to abide. These are usually written by a board or panel of
experts and adopted formally by a professional organization. (Page 234)
Query: A method of structuring a question, usually using code, that can be submitted to,
interpreted, and answered by a computer. (Page 17)
Record: See Observation. (Page 16)
Relational Database: A computerized repository, comprised of entities that relate to one another
through keys. The most basic and elemental entity in a relational database is the table, and tables
are made up of attributes. One or more of these attributes serves as a key that can be matched (or
related) to a corresponding attribute in another table, creating the relational effect which reduces
data redundancy and eliminates multivalued dependencies. (Page 16)
Repository: In RapidMiner, this is the place where imported data sets are stored so that they are
accessible for modeling. (Page 34)
Results Perspective: The view in RapidMiner that is seen when a model has been run. It is
usually comprised of two or more tabs which show meta data, data in a spreadsheet-like view, and
predictions and model outcomes (including graphical representations where applicable). (Page 41)
Role (Attribute): In a data mining model, each attribute must be assigned a role. The role is the
part the attribute plays in the model. It is usually equated to serving as an independent variable
(regular), or dependent variable (label). (Page 39)
Row: See Observation. (Page 16)
Sample: A subset of an entire data set, selected randomly or in a structured way. This usually
reduces a data set down, allowing models to be run faster, especially during development and
proof-of-concept work on a model. (Page 49)
Scoring Data: A data set with the same attributes as a training data set in a predictive model, with
the exception of the label. The training data set, with the label defined, is used to create a
predictive model, and that model is then applied to a scoring data set possessing the same
attributes in order to predict the label for each scoring observation. (Page 108)
Social Norms: These are the sets of behaviors and actions that are generally tolerated and found
to be acceptable in a society. According to Lawrence Lessig, these are one of four methods of
defining and regulating appropriate behavior. (Page 233)
Spline: In RapidMiner, these lines connect the ports between operators, creating the stream of a
data mining model. (Page 41)
Standard Deviation: One of the most common statistical measures of how dispersed the values
in an attribute are. This measure can help determine whether or not there are outliers (a common
type of inconsistent data) in a data set. (Page 77)
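One common way to use standard deviation when looking for outliers is to flag values that lie more than some number of standard deviations from the mean. The sketch below assumes a cutoff of two standard deviations; that threshold, the function name `flag_outliers` and the sample data are illustrative choices, not rules from the text:

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) > threshold * sd]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
suspects = flag_outliers(readings)  # [50]
```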
Standard Operating Procedures: These are organizational guidelines that are documented and
shared with employees which help to define the boundaries for appropriate and acceptable
behavior in the business setting. They are usually created and formally adopted by a group of
leaders in the organization, with input from key stakeholders in the organization. (Page 234)
Statistical Significance: In statistically-based data mining activities, this is the measure of
whether or not the model has yielded any results that are mathematically reliable enough to be
used. Any model lacking statistical significance should not be used in operational decision making.
(Page 133)
Stemming: In text mining, this is the process of reducing like-terms down into a single, common
token (e.g. country, countries, country’s, countryman, etc. → countr). (Page 201)
Stopwords: In text mining, these are small words that are necessary for grammatical correctness,
but which carry little meaning or power in the message of the text being mined. These are often
articles, prepositions or conjunctions, such as ‘a’, ‘the’, ‘and’, etc., and are usually removed in the
Process Document operator’s sub-process. (Page 199)
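Outside RapidMiner, stopword removal can be sketched in a few lines. The stopword list below is a small hypothetical sample, and splitting on whitespace stands in for a real tokenizer:

```python
# Hypothetical stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "to"}

def remove_stopwords(text):
    """Lowercase the text, split it into words, and drop the stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]

kept = remove_stopwords("The cat and the hat")  # ['cat', 'hat']
```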
Stream: This is the string of operators in a data mining model, connected through the operators’
ports via splines, that represents all actions that will be taken on a data set in order to mine it.
(Page 41)
Structured Query Language (SQL): The set of codes, reserved keywords and syntax defined by
the American National Standards Institute used to create, manage and use relational databases.
(Page 17)
Sub-process: In RapidMiner, this is a stream of operators set up to apply a series of actions to all
inputs connected to the parent operator. (Page 197)
Support Percent: In an association rule data mining model, this is the percent of observations in
which the antecedent and the consequent are found together. Since this is calculated as the
number of times the two are found together divided by the total number of times they could have
been found together, the Support Percent is the same for reciprocal rules. (Page 84)
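The calculation can be sketched as below, treating each observation as a set of items and taking the total number of observations as the number of times the two items could have been found together. The function name `support_percent` and the sample baskets are made up for the example:

```python
def support_percent(observations, antecedent, consequent):
    """Share of observations that contain both the antecedent and the consequent."""
    together = sum(
        1 for obs in observations if antecedent in obs and consequent in obs
    )
    return together / len(observations)

baskets = [
    {"bread", "milk"},
    {"bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
forward = support_percent(baskets, "bread", "milk")  # 0.5
reverse = support_percent(baskets, "milk", "bread")  # 0.5, same for the reciprocal rule
```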
Table: In data collection, a table is a grid of columns and rows, where in general, the columns are
individual attributes in the data set, and the rows are observations across those attributes. Tables
are the most elemental entity in relational databases. (Page 16)
Target Attribute: See Label; Dependent Variable. (Page 108)
Technology: Any tool or process invented by mankind to do or improve work. (Page 11)
Text Mining: The process of data mining unstructured text-based data such as essays, news
articles, speech transcripts, etc. to discover patterns of word or phrase usage to reveal deeper or
previously unrecognized meaning. (Page 190)
Token (Tokenize): In text mining, this is the process of turning words in the input document(s)
into attributes that can be mined. (Page 197)
Training Data: In a predictive model, this data set already has the label, or dependent variable,
defined, so that it can be used to create a model which can be applied to a scoring data set in order
to generate predictions for the latter. (Page 108)
Tuple: See Observation. (Page 16)
Variable: See Attribute. (Page 16)
View: A type of pseudo-table in a relational database which is actually a named, stored query.
This query runs against one or more tables, retrieving a defined number of attributes that can then
be referenced as if they were in a table in the database. Views can limit users’ ability to see
attributes to only those that are relevant and/or approved for those users to see. They can also
speed up the query process because although they may contain joins, the key columns for the joins
can be indexed and cached, making the view’s query run faster than it would if it were not stored
as a view. Views can be useful in data mining as data miners can be given read-only access to the
view, upon which they can build data mining models, without having to have broader
administrative rights on the database itself. (Page 27)
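A small sketch of a view as a named, stored query, using Python's built-in sqlite3 module and an in-memory database. The table names, column names and rows are invented for illustration; the view joins two tables and exposes only some of their attributes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ana', '555-0101'), (2, 'Ben', '555-0102');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

-- The view stores the join and leaves out the phone attribute.
CREATE VIEW customer_orders AS
    SELECT c.name, o.order_id, o.total
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id;
""")

# The view is queried as if it were a table.
rows = conn.execute(
    "SELECT name, total FROM customer_orders ORDER BY order_id"
).fetchall()
view_columns = [d[0] for d in conn.execute("SELECT * FROM customer_orders").description]
```

Here `rows` holds one (name, total) pair per order, and `view_columns` shows that the phone attribute is not visible through the view.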
ABOUT THE AUTHOR
Dr. Matthew North is Associate Professor of Computing and Information Studies at Washington
& Jefferson College in Washington, Pennsylvania, USA. He has taught data management and data
mining for more than a decade, and previously worked in industry as a data miner, most recently at
eBay.com. He continues to consult with various organizations on data mining projects as well.
Dr. North holds a Bachelor of Arts degree in Latin American History and Portuguese from
Brigham Young University; a Master of Science in Business Information Systems from Utah State
University; and a Doctorate in Technology Education from West Virginia University. He is the
author of the book Life Lessons & Leadership (Agami Press, 2011), and numerous papers and articles
on technology and pedagogy. His dissertation, on the topic of teaching models and learning styles
in introductory data mining courses, earned him a New Faculty Fellows award from the Center for
Advancement of Scholarship on Engineering Education (CASEE); and in 2010, he was awarded
the Ben Bauman Award for Excellence by the International Association for Computer
Information Systems (IACIS). He lives with his wife, Joanne, and their three daughters in
southwestern Pennsylvania.
To contact Dr. North regarding this text, consulting or training opportunities, or for speaking
engagements, please access this book’s companion web site at:
https://sites.google.com/site/dataminingforthemasses/