Chapter 2: Organizational Understanding and Data Understanding
Explain some of the ethical dilemmas associated with data mining and outline possible
solutions
PURPOSES, INTENTS AND LIMITATIONS OF DATA MINING
Data mining, as explained in Chapter 1 of this text, applies statistical and logical methods to large
data sets. These methods can be used to categorize the data, or they can be used to create predictive
models. Categorizations of large sets may include grouping people into similar types of
classifications, or identifying similar characteristics across a large number of observations.
Predictive models, however, transform these descriptions into expectations upon which we can
base decisions. For example, the owner of a book-selling Web site could project how frequently
she may need to restock her supply of a given title, or the owner of a ski resort may attempt to
predict the earliest possible opening date based on projected snow arrivals and accumulations.
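The book-restocking example above can be sketched as a very simple predictive model. This is a minimal illustration only, with hypothetical sales figures; a real model would account for seasonality, trends, and uncertainty.

```python
# A minimal sketch of a predictive model: projecting when a book title
# will need restocking from past weekly sales (hypothetical numbers).
weekly_sales = [12, 15, 14, 18, 17, 21]  # copies sold per week, most recent last

def average_weekly_sales(sales):
    """Estimate demand as the mean of observed weekly sales."""
    return sum(sales) / len(sales)

def weeks_until_restock(stock_on_hand, sales):
    """Project how many whole weeks the current stock will last."""
    return int(stock_on_hand // average_weekly_sales(sales))

print(weeks_until_restock(100, weekly_sales))  # with 100 copies on hand -> 6
```

The description (average sales) becomes an expectation (weeks of stock remaining) upon which a reorder decision can be based.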
It is important to recognize that data mining cannot provide answers to every question, nor can we
expect that predictive models will always yield results which will in fact turn out to be the reality.
Data mining is limited to the data that has been collected, and those limitations may be many.
We must remember that the data may not be completely representative of the group of individuals
to which we would like to apply our results. The data may have been collected incorrectly, or it
may be out-of-date. There is an expression which can adequately be applied to data mining,
among many other things: GIGO, or Garbage In, Garbage Out. The quality of our data mining results
will directly depend upon the quality of our data collection and organization. Even after doing our
very best to collect high quality data, we must still remember to base decisions not only on data
mining results, but also on available resources, acceptable amounts of risk, and plain old common
sense.
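One practical response to the GIGO problem is to screen records for obvious quality problems before mining begins. The sketch below is a hypothetical illustration (the field names and validity rules are assumptions, not from this text):

```python
# A small sketch of guarding against "garbage in": screening records
# for obvious quality problems before mining. Fields are hypothetical.
records = [
    {"age": 34, "zip": "84601"},
    {"age": -5, "zip": "84601"},   # impossible age -> garbage in
    {"age": 41, "zip": None},      # missing value -> incomplete record
]

def is_clean(record):
    """Keep only records with a plausible age and a present zip code."""
    return record["zip"] is not None and 0 <= record["age"] <= 120

clean = [r for r in records if is_clean(r)]
print(len(clean))  # only 1 record survives the screen
```

Simple checks like these do not guarantee high-quality data, but they catch the most obvious garbage before it can contaminate results.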
DATABASE, DATA WAREHOUSE, DATA MART, DATA SET…?
In order to understand data mining, it is important to understand the nature of databases, data
collection and data organization. This is fundamental to the discipline of Data Mining, and will
directly impact the quality and reliability of all data mining activities. In this section, we will
examine the differences between databases, data warehouses, and data sets. We will also
examine some of the variations in terminology used to describe data attributes.
Although we will be examining the differences between databases, data warehouses and data sets,
we will begin by discussing what they have in common. In Figure 2-1, we see some data organized
into rows (shown here as A, B, etc.) and columns (shown here as 1, 2, etc.). In varying data
environments, these may be referred to by differing names. In a database, rows would be referred
to as tuples or records, while the columns would be referred to as fields.
Figure 2-1: Data arranged in columns and rows.
In data warehouses and data sets, rows are sometimes referred to as observations, examples or
cases, and columns are sometimes called variables or attributes. For purposes of consistency in
this book, we will use the terminology of observations for rows and attributes for columns. It is
important to note that RapidMiner will use the term examples for rows of data, so keep this in
mind throughout the rest of the text.
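The terminology can be pictured in code. In this hypothetical sketch, each inner list is one observation (a row) and each named column is an attribute:

```python
# Illustrating the terminology used in this book: each row is an
# "observation" and each column an "attribute" (the data is hypothetical).
attributes = ["Name", "Age", "City"]
observations = [
    ["Alice", 34, "Provo"],
    ["Bob",   41, "Orem"],
]

# An observation is one complete row; an attribute is one column across rows.
ages = [row[attributes.index("Age")] for row in observations]
print(ages)  # [34, 41]
```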
A database is an organized grouping of information within a specific structure. Database
containers, such as the one pictured in Figure 2-2, are called tables in a database environment.
Most databases in use today are relational databases—they are designed using many tables which
relate to one another in a logical fashion. Relational databases generally contain dozens or even
hundreds of tables, depending upon the size of the organization.
Figure 2-2: A simple database with a relation between two tables.
Figure 2-2 depicts a relational database environment with two tables. The first table contains
information about pet owners; the second, information about pets. The tables are related by the
single column they have in common: Owner_ID. By relating tables to one another, we can reduce
redundancy of data and improve database performance. The process of breaking tables apart and
thereby reducing data redundancy is called normalization.
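The normalized design of Figure 2-2 can be sketched with Python's built-in sqlite3 module. The Owners and Pets tables and the Owner_ID column come from the figure; the other column names (Owner_Name, Pet_ID, Pet_Name) are assumptions for illustration:

```python
# A sketch of the normalized design in Figure 2-2: Owners and Pets
# relate through the shared Owner_ID column, so an owner's details
# are stored once rather than repeated for every pet.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Owners (Owner_ID INTEGER PRIMARY KEY, Owner_Name TEXT)")
conn.execute("CREATE TABLE Pets (Pet_ID INTEGER PRIMARY KEY, Pet_Name TEXT, "
             "Owner_ID INTEGER REFERENCES Owners(Owner_ID))")
conn.execute("INSERT INTO Owners VALUES (1, 'Alice')")
# Two pets for one owner: only the Owner_ID is repeated, not the owner's details.
conn.execute("INSERT INTO Pets VALUES (10, 'Rex', 1)")
conn.execute("INSERT INTO Pets VALUES (11, 'Whiskers', 1)")
print(conn.execute("SELECT COUNT(*) FROM Pets").fetchone()[0])  # 2
```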
Most relational databases which are designed to handle a high number of reads and writes (retrievals
and updates of information) are referred to as OLTP (online transaction processing) systems.
OLTP systems are very efficient for high volume activities such as cashiering, where many items
are being recorded via bar code scanners in a very short period of time. However, using OLTP
databases for analysis is generally not very efficient, because in order to retrieve data from multiple
tables at the same time, a query containing joins must be written. A query is simply a method of
retrieving data from database tables for viewing. Queries are usually written in a language called
SQL (Structured Query Language; pronounced 'sequel'). Because it is not very useful to query only
pet names or owner names, for example, we must join two or more tables together in order
to retrieve both pets and owners at the same time. Joining requires that the computer match the
Owner_ID column in the Owners table to the Owner_ID column in the Pets table. When tables
contain thousands or even millions of rows of data, this matching process can be very intensive
and time consuming on even the most robust computers.
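The matching process described above is expressed in SQL as a join. This sketch again uses sqlite3; the Owners and Pets tables and the Owner_ID column come from the text, while the other column names are assumptions:

```python
# A sketch of a join query: matching the Owner_ID column in the Owners
# table to the Owner_ID column in the Pets table so that owners and
# their pets are retrieved together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Owners (Owner_ID INTEGER, Owner_Name TEXT);
    CREATE TABLE Pets   (Pet_Name TEXT, Owner_ID INTEGER);
    INSERT INTO Owners VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Pets   VALUES ('Rex', 1), ('Whiskers', 2);
""")
rows = conn.execute("""
    SELECT Owners.Owner_Name, Pets.Pet_Name
    FROM Owners JOIN Pets ON Owners.Owner_ID = Pets.Owner_ID
    ORDER BY Owners.Owner_Name
""").fetchall()
print(rows)  # [('Alice', 'Rex'), ('Bob', 'Whiskers')]
```

With only a few rows the match is instant; with millions of rows in each table, this same ON clause is what makes the query computationally intensive.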
For much more on database design and management, check out geekgirls.com:
(http://www.geekgirls.com/menu_databases.htm).