Chapter 2: Organizational Understanding and Data Understanding
Explain some of the ethical dilemmas associated with data mining and outline possible
solutions
PURPOSES, INTENTS AND LIMITATIONS OF DATA MINING
Data mining, as explained in Chapter 1 of this text, applies statistical and logical methods to large
data sets. These methods can be used to categorize the data, or they can be used to create predictive
models. Categorizations of large sets may include grouping people into similar types of
classifications, or identifying similar characteristics across a large number of observations.
Predictive models, however, transform these descriptions into expectations upon which we can
base decisions. For example, the owner of a book-selling Web site could project how frequently
she may need to restock her supply of a given title, or the owner of a ski resort may attempt to
predict the earliest possible opening date based on projected snow arrivals and accumulations.
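The book-restocking example above can be sketched as a very simple predictive model. This is a minimal illustration only, with hypothetical sales figures; a real model would account for seasonality, trends, and uncertainty.

```python
# A minimal sketch of a predictive model: projecting when a book title
# will need restocking from past weekly sales (hypothetical numbers).
weekly_sales = [12, 15, 14, 18, 17, 21]  # copies sold per week, most recent last

def average_weekly_sales(sales):
    """Estimate demand as the mean of observed weekly sales."""
    return sum(sales) / len(sales)

def weeks_until_restock(stock_on_hand, sales):
    """Project how many whole weeks the current stock will last."""
    return int(stock_on_hand // average_weekly_sales(sales))

print(weeks_until_restock(100, weekly_sales))  # with 100 copies on hand -> 6
```

The description (average sales) becomes an expectation (weeks of stock remaining) upon which a reorder decision can be based.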
It is important to recognize that data mining cannot provide answers to every question, nor can we
expect that predictive models will always yield results which will in fact turn out to be the reality.
Data mining is limited to the data that has been collected, and those limitations may be many.
We must remember that the data may not be completely representative of the group of individuals
to which we would like to apply our results. The data may have been collected incorrectly, or it
may be out-of-date. There is an expression which can adequately be applied to data mining,
among many other things: GIGO, or Garbage In, Garbage Out. The quality of our data mining results
will directly depend upon the quality of our data collection and organization. Even after doing our
very best to collect high quality data, we must still remember to base decisions not only on data
mining results, but also on available resources, acceptable amounts of risk, and plain old common
sense.
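One practical response to the GIGO problem is to screen records for obvious quality problems before mining begins. The sketch below is a hypothetical illustration (the field names and validity rules are assumptions, not from this text):

```python
# A small sketch of guarding against "garbage in": screening records
# for obvious quality problems before mining. Fields are hypothetical.
records = [
    {"age": 34, "zip": "84601"},
    {"age": -5, "zip": "84601"},   # impossible age -> garbage in
    {"age": 41, "zip": None},      # missing value -> incomplete record
]

def is_clean(record):
    """Keep only records with a plausible age and a present zip code."""
    return record["zip"] is not None and 0 <= record["age"] <= 120

clean = [r for r in records if is_clean(r)]
print(len(clean))  # only 1 record survives the screen
```

Simple checks like these do not guarantee high-quality data, but they catch the most obvious garbage before it can contaminate results.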
DATABASE, DATA WAREHOUSE, DATA MART, DATA SET…?
In order to understand data mining, it is important to understand the nature of databases, data
collection and data organization. This is fundamental to the discipline of Data Mining, and will
directly impact the quality and reliability of all data mining activities. In this section, we will
examine the differences between databases, data warehouses, and data sets. We will also
examine some of the variations in terminology used to describe data attributes.
Although we will be examining the differences between databases, data warehouses and data sets,
we will begin by discussing what they have in common. In Figure 2-1, we see some data organized
into rows (shown here as A, B, etc.) and columns (shown here as 1, 2, etc.). In varying data
environments, these may be referred to by differing names. In a database, rows would be referred
to as tuples or records, while the columns would be referred to as fields.
Figure 2-1: Data arranged in columns and rows.
In data warehouses and data sets, rows are sometimes referred to as observations, examples or
cases, and columns are sometimes called variables or attributes. For purposes of consistency in
this book, we will use the terminology of observations for rows and attributes for columns. It is
important to note that RapidMiner will use the term examples for rows of data, so keep this in
mind throughout the rest of the text.
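The terminology can be pictured in code. In this hypothetical sketch, each inner list is one observation (a row) and each named column is an attribute:

```python
# Illustrating the terminology used in this book: each row is an
# "observation" and each column an "attribute" (the data is hypothetical).
attributes = ["Name", "Age", "City"]
observations = [
    ["Alice", 34, "Provo"],
    ["Bob",   41, "Orem"],
]

# An observation is one complete row; an attribute is one column across rows.
ages = [row[attributes.index("Age")] for row in observations]
print(ages)  # [34, 41]
```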
A database is an organized grouping of information within a specific structure. Database
containers, such as the one pictured in Figure 2-2, are called tables in a database environment.
Most databases in use today are relational databases—they are designed using many tables which
relate to one another in a logical fashion. Relational databases generally contain dozens or even
hundreds of tables, depending upon the size of the organization.
Figure 2-2: A simple database with a relation between two tables.
Figure 2-2 depicts a relational database environment with two tables. The first table contains
information about pet owners; the second, information about pets. The tables are related by the
single column they have in common: Owner_ID. By relating tables to one another, we can reduce
redundancy of data and improve database performance. The process of breaking tables apart and
thereby reducing data redundancy is called normalization.
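The normalized design of Figure 2-2 can be sketched with Python's built-in sqlite3 module. The Owners and Pets tables and the Owner_ID column come from the figure; the other column names (Owner_Name, Pet_ID, Pet_Name) are assumptions for illustration:

```python
# A sketch of the normalized design in Figure 2-2: Owners and Pets
# relate through the shared Owner_ID column, so an owner's details
# are stored once rather than repeated for every pet.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Owners (Owner_ID INTEGER PRIMARY KEY, Owner_Name TEXT)")
conn.execute("CREATE TABLE Pets (Pet_ID INTEGER PRIMARY KEY, Pet_Name TEXT, "
             "Owner_ID INTEGER REFERENCES Owners(Owner_ID))")
conn.execute("INSERT INTO Owners VALUES (1, 'Alice')")
# Two pets for one owner: only the Owner_ID is repeated, not the owner's details.
conn.execute("INSERT INTO Pets VALUES (10, 'Rex', 1)")
conn.execute("INSERT INTO Pets VALUES (11, 'Whiskers', 1)")
print(conn.execute("SELECT COUNT(*) FROM Pets").fetchone()[0])  # 2
```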
Most relational databases which are designed to handle a high number of reads and writes (retrievals
and updates of information) are referred to as OLTP (online transaction processing) systems.
OLTP systems are very efficient for high volume activities such as cashiering, where many items
are being recorded via bar code scanners in a very short period of time. However, using OLTP
databases for analysis is generally not very efficient, because in order to retrieve data from multiple
tables at the same time, a query containing joins must be written. A query is simply a method of
retrieving data from database tables for viewing. Queries are usually written in a language called
SQL (Structured Query Language; pronounced 'sequel'). Because it is not very useful to query only
pet names or owner names, for example, we must join two or more tables together in order
to retrieve both pets and owners at the same time. Joining requires that the computer match the
Owner_ID column in the Owners table to the Owner_ID column in the Pets table. When tables
contain thousands or even millions of rows of data, this matching process can be very intensive
and time consuming on even the most robust computers.
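The matching process described above is expressed in SQL as a join. This sketch again uses sqlite3; the Owners and Pets tables and the Owner_ID column come from the text, while the other column names are assumptions:

```python
# A sketch of a join query: matching the Owner_ID column in the Owners
# table to the Owner_ID column in the Pets table so that owners and
# their pets are retrieved together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Owners (Owner_ID INTEGER, Owner_Name TEXT);
    CREATE TABLE Pets   (Pet_Name TEXT, Owner_ID INTEGER);
    INSERT INTO Owners VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Pets   VALUES ('Rex', 1), ('Whiskers', 2);
""")
rows = conn.execute("""
    SELECT Owners.Owner_Name, Pets.Pet_Name
    FROM Owners JOIN Pets ON Owners.Owner_ID = Pets.Owner_ID
    ORDER BY Owners.Owner_Name
""").fetchall()
print(rows)  # [('Alice', 'Rex'), ('Bob', 'Whiskers')]
```

With only a few rows the match is instant; with millions of rows in each table, this same ON clause is what makes the query computationally intensive.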
For much more on database design and management, check out geekgirls.com:
(http://www.geekgirls.com/menu_databases.htm).