Data Mining for the Masses
18
In order to keep our transactional databases running quickly and smoothly, we may wish to create
a data warehouse. A data warehouse is a type of large database that has been denormalized and
archived. Denormalization is the process of intentionally combining some tables into a single
table, even though doing so may introduce duplicate data in some columns (in other words,
attributes).
Figure 2-3: A combination of the tables into a single data set.
Figure 2-3 depicts what our simple example data might look like if it were in a data warehouse.
When we design databases in this way, we reduce the number of joins necessary to query related
data, thereby speeding up the process of analyzing our data. Databases designed in this manner are
called OLAP (online analytical processing) systems.
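Denormalization of this kind can be sketched with a small example. The tables and column names below are hypothetical, not the actual tables shown in Figure 2-3; a single join produces one wide table in which a customer's attributes are repeated on every matching order, trading duplication for fewer joins at query time:

```python
import pandas as pd

# Hypothetical normalized tables (illustrative names only).
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Alice", "Bob"],
    "city": ["Denver", "Austin"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 2],
    "amount": [20.00, 35.50, 12.25],
})

# Denormalize: one wide table, at the cost of repeating customer
# attributes ("Alice", "Denver") on every one of that customer's orders.
warehouse = orders.merge(customers, on="customer_id", how="left")
print(warehouse)
```

Queries against the resulting table need no joins at all, which is what makes this layout attractive for analytical (OLAP) workloads.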
Transactional systems and analytical systems have conflicting purposes when it comes to database
speed and performance. For this reason, it is difficult to design a single system which will serve
both purposes. This is why data warehouses generally contain archived data. Archived data are
data that have been copied out of a transactional database. Denormalization typically takes place at
the time data are copied out of the transactional system. It is important to keep in mind that if a
copy of the data is made in the data warehouse, the data may become out-of-synch. This happens
when a copy is made in the data warehouse and then later, a change to the original record
(observation) is made in the source database. Data mining activities performed on out-of-synch
observations may be useless, or worse, misleading. An alternative archiving method would be to
move the data out of the transactional system. This ensures that data won’t get out-of-synch,
however, it also makes the data unavailable should a user of the transactional system need to view
or update it.
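To make the out-of-synch risk concrete, here is a minimal sketch (all table and column names are invented for illustration) that compares a warehouse copy against the source system and flags records whose values no longer match:

```python
import pandas as pd

# The source row for customer 2 changed after the warehouse copy was made.
source = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Denver", "Houston"],  # customer 2 moved after archiving
})
warehouse_copy = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Denver", "Austin"],   # stale archived value
})

# Join the two copies and keep rows where the values disagree.
merged = warehouse_copy.merge(
    source, on="customer_id", suffixes=("_warehouse", "_source")
)
out_of_sync = merged[merged["city_warehouse"] != merged["city_source"]]
print(out_of_sync["customer_id"].tolist())  # → [2]
```

A check like this, run before mining, helps ensure stale observations are refreshed or excluded rather than silently analyzed.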
A data set is a subset of a database or a data warehouse. It is usually denormalized so that only
one table is used. Creating a data set may involve several steps, including appending or
combining tables from the source database, or simplifying some data expressions. One example
of this may be changing a date/time format from ‘10-DEC-2002 12:21:56’ to ‘12/10/02’. If this
latter date format is adequate for the type of data mining being performed, it would make sense to
simplify the attribute containing dates and times when we create our data set. Data sets may be
made up of a representative sample of a larger set of data, or they may contain all observations
relevant to a specific group. We will discuss sampling methods and practices in Chapter 3.
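The date simplification described above is straightforward to carry out. The sketch below uses the exact example string from the text, parsing it and reformatting it into the shorter month/day/year form:

```python
from datetime import datetime

# Simplify the attribute as in the text's example:
# '10-DEC-2002 12:21:56' becomes '12/10/02'.
raw = "10-DEC-2002 12:21:56"
parsed = datetime.strptime(raw, "%d-%b-%Y %H:%M:%S")
simplified = parsed.strftime("%m/%d/%y")
print(simplified)  # → 12/10/02
```

Applying a transformation like this to every row of the date/time attribute yields the simplified column for the data set.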
TYPES OF DATA
Thus far in this text, you’ve read about some fundamental aspects of data which are critical to the
discipline of data mining. But we haven't spent much time discussing where those data are going
to come from. In essence, there are really two types of data that can be mined: operational and
organizational.
The most elemental type of data, operational data, comes from transactional systems which record
everyday activities. Simple encounters like buying gasoline, making an online purchase, or
checking in for a flight at the airport all result in the creation of operational data. The times,
prices and descriptions of the goods or services we have purchased are all recorded. This
information can be combined in a data warehouse or may be extracted directly into a data set from
the OLTP system.
Oftentimes, transactional data are too detailed to be of much use, or the detail may compromise
individuals’ privacy. In many instances, government, academic or not-for-profit organizations may
create data sets and then make them available to the public. For example, if we wanted to identify
regions of the United States which are historically at high risk for influenza, it would be difficult to
obtain permission and to collect doctor visit records nationwide and compile this information into
a meaningful data set. However, the U.S. Centers for Disease Control and Prevention (CDC) does
exactly that every year. Government agencies do not always make this information immediately
available to the general public, but it often can be requested. Other organizations create such
summary data as well. The grocery store mentioned at the beginning of this chapter wouldn’t
necessarily want to analyze records of individual cans of green beans sold, but they may want to
watch trends for daily, weekly or perhaps monthly totals. Organizational data sets can help to
protect people’s privacy, while still proving useful to data miners watching for trends in a given
population.
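The grocery store's summarization can be sketched in a few lines. The line-item records below are invented for illustration; aggregating them into daily totals discards the individual transactions while preserving the trend the analyst cares about:

```python
import pandas as pd

# Hypothetical line-item sales; the miner needs daily totals,
# not records of individual cans of green beans.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2002-12-09", "2002-12-09", "2002-12-10"]),
    "item": ["green beans", "green beans", "green beans"],
    "qty": [3, 2, 5],
})

# Summarize to one row per day, dropping transaction-level detail.
daily_totals = sales.groupby("date", as_index=False)["qty"].sum()
print(daily_totals)
```

The same pattern scales up to weekly or monthly totals simply by grouping on a coarser time period.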
Another type of data often overlooked within organizations is something called a data mart. A
data mart is an organizational data store, similar to a data warehouse, but often created with the
needs of specific business units, such as Marketing or Customer Service, in mind for reporting
and management purposes. Data marts are typically created by an organization to serve as a
one-stop shop where employees throughout the organization can find the data
they might be looking for. Data marts may contain wonderful data, prime for data mining
activities, but they must be known, current, and accurate to be useful. They should also be well-
managed in terms of privacy and security.
All of these types of organizational data carry with them some concern. Because they are
secondary, meaning they have been derived from other more detailed primary data sources, they
may lack adequate documentation, and the rigor with which they were created can be highly
variable. Such data sources may also not be intended for general distribution, and it is always wise
to ensure proper permission is obtained before engaging in data mining activities on any data set.
Remember, simply because a data set may have been acquired from the Internet does not mean it
is in the public domain; and simply because a data set may exist within your organization does not
mean it can be freely mined. Checking with relevant managers, authors and stakeholders is critical
before beginning data mining activities.
A NOTE ABOUT PRIVACY AND SECURITY
In 2003, JetBlue Airlines supplied more than one million passenger records to a U.S. government
contractor, Torch Concepts. Torch subsequently augmented the passenger data with
additional information such as family sizes and social security numbers—information purchased
from a data broker called Acxiom. The data were intended for a data mining project aimed at
developing profiles of potential terrorists. All of this was done without the notification or consent of
passengers. When news of the activities got out, however, dozens of privacy lawsuits were filed
against JetBlue, Torch and Acxiom, and several U.S. senators called for an investigation into the
incident.
This incident serves several valuable purposes for this book. First, we should be aware that as we
gather, organize and analyze data, there are real people behind the figures. These people have
certain rights to privacy and protection against crimes such as identity theft. We as data miners