Data Mining for the Masses

Data Analysis:  The process of examining data in a repeatable and structured way in order to
extract meaning, patterns or messages from a set of data. (Page 3)

Data Mart:  A location where data are stored for easy access by a broad range of people in an
organization.  Data in a data mart are generally archived data, enabling analysis in a setting that
does not impact live operations. (Page 20)

Data Mining:  A computational process of analyzing data sets, usually large in nature, using both
statistical and logical methods, in order to uncover hidden, previously unknown, and interesting
patterns that can inform organizational decision making. (Page 3)

Data Preparation:  The third in the six steps of CRISP-DM.  At this stage, the data miner ensures
that the data to be mined are clean and ready for mining.  This may include handling outliers or
other inconsistent data, dealing with missing values, reducing attributes or observations, setting
attribute roles for modeling, etc. (Page 8)
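
As a rough illustration of two of the preparation tasks named above (missing values and outliers), the following minimal sketch fills missing values with the median and clamps out-of-range values.  The function name prepare, the sample ages and the 0 to 120 range are assumptions made for this example; the book itself performs these steps with RapidMiner operators, and other strategies (such as dropping the affected observations) may be equally valid.

```python
from statistics import median

def prepare(values, low, high):
    """Fill missing values (None) with the median, then clamp outliers to [low, high]."""
    present = [v for v in values if v is not None]
    fill = median(present)  # median of the observed values only
    filled = [fill if v is None else v for v in values]
    return [min(max(v, low), high) for v in filled]

# Hypothetical attribute: ages with one missing value and one implausible value.
ages = [34, None, 29, 410, 41]
cleaned = prepare(ages, low=0, high=120)
print(cleaned)
```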

Data Set:  Any compilation of data that is suitable for analysis. (Page 18)

Data Type:  In a data set, each attribute is assigned a data type based on the kind of data stored in
the attribute.  There are many data types which can be generalized into one of three areas:
Character (Text) based; Numeric; and Date/Time.  Within these categories, RapidMiner has
several data types.  For example, in the Character area, RapidMiner has Polynominal, Binominal,
etc.; and in the Numeric area it has Real, Integer, etc. (Page 39)

Data Understanding:  The second in the six steps of CRISP-DM.  At this stage, the data miner
seeks out sources of data in the organization, and works to collect, compile, standardize, define
and document the data.  The data miner develops a comprehension of where the data have come
from, how they were collected and what they mean. (Page 7)

Data Warehouse:  A large-scale repository for archived data which are available for analysis. Data
in a data warehouse are often stored in multiple formats (e.g. by week, month, quarter and year),
facilitating large scale analyses at higher speeds.  The data warehouse is populated by extracting
data from operational systems so that analyses do not interfere with live business operations.
(Page 18)

Database:  A structured collection of facts, organized such that the facts can be reliably
and repeatedly accessed.  The most common type of database is a relational database, in which
facts (data) are arranged in tables of columns and rows.  The data are then accessed using a query
language, usually SQL (Structured Query Language), in order to extract meaning from the tables.
(Page 16)
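
The idea of facts arranged in rows and columns and accessed through SQL can be sketched with Python's built-in sqlite3 module.  The customers table, its columns and its rows are invented for this example and do not come from the book.

```python
import sqlite3

# An in-memory relational database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Lima"), ("Ben", "Oslo"), ("Cy", "Lima")],
)

# A query that extracts meaning from the table: customers per city.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
conn.close()
print(rows)
```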

Decision Tree:  A data mining methodology where leaves and nodes are generated to construct a
predictive tree, whereby a data miner can see the attributes which are most predictive of each
possible outcome in a target (label) attribute. (Pages 9, 159)

Denormalization:  The process of removing relational organization from data, reintroducing
redundancy into the data, but simultaneously eliminating the need for joins in a relational database,
enabling faster querying. (Page 18)
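
The trade-off described above can be sketched as follows, again using Python's built-in sqlite3 module.  The customers, orders and orders_flat tables are hypothetical.  The flat table repeats the customer name on every order row (redundancy), but it can be queried without a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
    -- Denormalized copy: the customer name is repeated on every order row.
    CREATE TABLE orders_flat AS
        SELECT o.id AS order_id, c.name AS customer_name, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Normalized form: a join is needed to see names next to totals.
joined = conn.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()

# Denormalized form: the same answer from a single table, with no join.
flat = conn.execute(
    "SELECT customer_name, total FROM orders_flat ORDER BY order_id"
).fetchall()
conn.close()
print(joined == flat)
```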

Dependent Variable (Attribute):  The attribute in a data set that is being acted upon by the other
attributes.  It is the thing we want to predict, the target, or label, attribute in a predictive model.
(Page 108)

Deployment:  The sixth and final of the six steps of CRISP-DM.  At this stage, the data miner
takes the results of data mining activities and puts them into practice in the organization.  The data
miner watches closely and collects data to determine if the deployment is successful and ethical.
Deployment can happen in stages, such as through pilot programs before a full-scale rollout.
(Page 10)

Descartes' Rule of Change:  An ethical framework set forth by René Descartes which states that
if an action cannot be taken repeatedly, it cannot be ethically taken even once. (Page 235)

Design Perspective:  The view in RapidMiner where a data miner adds operators to a data mining
stream, sets those operators’ parameters, and runs the model. (Page 41)

Discriminant Analysis:  A predictive data mining model which attempts to compare the values of
all observations across all attributes and identify where natural breaks occur from one category to
another, and then predict which category each observation in the data set will fall into. (Page 108)

Ethics:  A set of moral codes or guidelines that an individual develops to guide his or her decision
making in order to make fair and respectful decisions and engage in right actions.  Ethical
standards are higher than legally required minimums. (Page 232)

Evaluation:  The fifth of the six steps of CRISP-DM.  At this stage, the data miner reviews the
results of the data mining model, interprets results and determines how useful they are.  He or she
may also conduct an investigation into false positives or other potentially misleading results.
(Page 10)

False Positive:  A predicted value that ends up not being correct; most often, an observation
predicted to fall into a category to which it does not actually belong. (Page 221)
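
In classification practice the term is usually applied more narrowly, to an observation predicted to be in the positive class whose actual class is something else.  A minimal sketch of counting such cases, assuming that narrower meaning and using invented label lists and a hypothetical helper named count_false_positives:

```python
def count_false_positives(actual, predicted, positive="yes"):
    """Count observations predicted as `positive` whose actual label differs."""
    return sum(
        1 for a, p in zip(actual, predicted) if p == positive and a != positive
    )

# Hypothetical actual and predicted labels for five observations.
actual = ["yes", "no", "no", "yes", "no"]
predicted = ["yes", "yes", "no", "no", "yes"]
fp = count_false_positives(actual, predicted)
print(fp)
```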

Field:  See Attribute (Page 16).

Frequency Pattern:  A recurrence of the same, or similar, observations numerous times in a
single data set. (Page 81)
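
One simple way to look for such recurrences is to count how often pairs of items appear together across observations.  The sketch below assumes a hypothetical list of shopping baskets; it is a plain co-occurrence count, not the specific operator the book uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical observations: each basket is a set of purchased items.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread"},
]

# Count every unordered pair of items that occurs within the same basket.
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
print(pair_counts.most_common(1))
```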

Fuzzy Logic:  A data mining concept often associated with neural networks where predictions are
made using a training data set, even though some uncertainty exists regarding the data and a
model’s predictions. (Page 181)

Gain Ratio:  One of several algorithms used to construct decision tree models. (Page 168)
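
The glossary does not give a formula.  As commonly defined, gain ratio is the information gain of a split divided by the split information (the entropy of the splitting attribute's own values).  The sketch below assumes that common definition; the function names and sample data are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(attribute_values, labels):
    """Information gain of splitting on the attribute, divided by split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the label within each branch of the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical attribute that separates the label perfectly.
ratio = gain_ratio(["x", "x", "y", "y"], ["yes", "yes", "no", "no"])
print(ratio)
```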

Gini Index:  An algorithm created by Corrado Gini that can be used to generate decision tree
models. (Page 168)
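
In decision tree learners, the measure is commonly computed as an impurity score: one minus the sum of the squared class proportions at a node.  The sketch below assumes that common formulation, which the glossary itself does not spell out; gini_impurity is a name chosen for this example.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0.0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

mixed = gini_impurity(["a", "a", "b", "b"])   # evenly mixed two-class node
pure = gini_impurity(["a", "a", "a", "a"])    # node containing a single class
print(mixed, pure)
```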

Heterogeneity:  In statistical analysis, this is the amount of variety found in the values of an
attribute. (Page 119)

Inconsistent Data:  These are values in an attribute in a data set that are out-of-the-ordinary
among the whole set of values in that attribute.  They can be statistical outliers, or other values that
