Data Mining for the Masses
18
In order to keep our transactional databases running quickly and smoothly, we may wish to create
a data warehouse. A data warehouse is a type of large database that has been denormalized and
archived. Denormalization is the process of intentionally combining some tables into a single
table, even though doing so may introduce duplicate data in some columns (in other words,
attributes).
Figure 2-3: A combination of the tables into a single data set.
Figure 2-3 depicts what our simple example data might look like if it were in a data warehouse.
When we design databases in this way, we reduce the number of joins necessary to query related
data, thereby speeding up the process of analyzing our data. Databases designed in this manner are
called OLAP (online analytical processing) systems.
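Denormalization of this kind can be sketched with a small example. The tables and column names below are hypothetical, not the actual tables shown in Figure 2-3; a single join produces one wide table in which a customer's attributes are repeated on every matching order, trading duplication for fewer joins at query time:

```python
import pandas as pd

# Hypothetical normalized tables (illustrative names only).
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Alice", "Bob"],
    "city": ["Denver", "Austin"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 2],
    "amount": [20.00, 35.50, 12.25],
})

# Denormalize: one wide table, at the cost of repeating customer
# attributes ("Alice", "Denver") on every one of that customer's orders.
warehouse = orders.merge(customers, on="customer_id", how="left")
print(warehouse)
```

Queries against the resulting table need no joins at all, which is what makes this layout attractive for analytical (OLAP) workloads.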
Transactional systems and analytical systems have conflicting purposes when it comes to database
speed and performance. For this reason, it is difficult to design a single system which will serve
both purposes. This is why data warehouses generally contain archived data. Archived data are
data that have been copied out of a transactional database. Denormalization typically takes place at
the time data are copied out of the transactional system. It is important to keep in mind that if a
copy of the data is made in the data warehouse, the data may become out-of-synch. This happens
when a copy is made in the data warehouse and then later, a change to the original record
(observation) is made in the source database. Data mining activities performed on out-of-synch
observations may be useless, or worse, misleading. An alternative archiving method would be to
move the data out of the transactional system. This ensures that data won’t get out-of-synch,
however, it also makes the data unavailable should a user of the transactional system need to view
or update it.
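To make the out-of-synch risk concrete, here is a minimal sketch (all table and column names are invented for illustration) that compares a warehouse copy against the source system and flags records whose values no longer match:

```python
import pandas as pd

# The source row for customer 2 changed after the warehouse copy was made.
source = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Denver", "Houston"],  # customer 2 moved after archiving
})
warehouse_copy = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Denver", "Austin"],   # stale archived value
})

# Join the two copies and keep rows where the values disagree.
merged = warehouse_copy.merge(
    source, on="customer_id", suffixes=("_warehouse", "_source")
)
out_of_sync = merged[merged["city_warehouse"] != merged["city_source"]]
print(out_of_sync["customer_id"].tolist())  # → [2]
```

A check like this, run before mining, helps ensure stale observations are refreshed or excluded rather than silently analyzed.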
A data set is a subset of a database or a data warehouse. It is usually denormalized so that only
one table is used. Creating a data set may involve several steps, including appending or
combining tables from the source database, or simplifying some data expressions. One example
of this may be changing a date/time format from ‘10-DEC-2002 12:21:56’ to ‘12/10/02’. If this
latter date format is adequate for the type of data mining being performed, it would make sense to
simplify the attribute containing dates and times when we create our data set. Data sets may be
made up of a representative sample of a larger set of data, or they may contain all observations
relevant to a specific group. We will discuss sampling methods and practices in Chapter 3.
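The date simplification described above is straightforward to carry out. The sketch below uses the exact example string from the text, parsing it and reformatting it into the shorter month/day/year form:

```python
from datetime import datetime

# Simplify the attribute as in the text's example:
# '10-DEC-2002 12:21:56' becomes '12/10/02'.
raw = "10-DEC-2002 12:21:56"
parsed = datetime.strptime(raw, "%d-%b-%Y %H:%M:%S")
simplified = parsed.strftime("%m/%d/%y")
print(simplified)  # → 12/10/02
```

Applying a transformation like this to every row of the date/time attribute yields the simplified column for the data set.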
TYPES OF DATA
Thus far in this text, you’ve read about some fundamental aspects of data which are critical to the
discipline of data mining. But we haven't spent much time discussing where those data are going
to come from. In essence, there are really two types of data that can be mined: operational and
organizational.
The most elemental type of data, operational data, comes from transactional systems which record
everyday activities. Simple encounters like buying gasoline, making an online purchase, or
checking in for a flight at the airport all result in the creation of operational data. The times,
prices and descriptions of the goods or services we have purchased are all recorded. This
information can be combined in a data warehouse or may be extracted directly into a data set from
the OLTP system.
Oftentimes, transactional data are too detailed to be of much use, or the detail may compromise
individuals’ privacy. In many instances, government, academic or not-for-profit organizations may
create data sets and then make them available to the public. For example, if we wanted to identify
regions of the United States which are historically at high risk for influenza, it would be difficult to
obtain permission and to collect doctor visit records nationwide and compile this information into
a meaningful data set. However, the U.S. Centers for Disease Control and Prevention (CDC) does
exactly that every year. Government agencies do not always make this information immediately
available to the general public, but it often can be requested. Other organizations create such
summary data as well. The grocery store mentioned at the beginning of this chapter wouldn’t
necessarily want to analyze records of individual cans of green beans sold, but they may want to
watch trends for daily, weekly or perhaps monthly totals. Organizational data sets can help to
protect people’s privacy, while still proving useful to data miners watching for trends in a given
population.
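The grocery store's summarization can be sketched in a few lines. The line-item records below are invented for illustration; aggregating them into daily totals discards the individual transactions while preserving the trend the analyst cares about:

```python
import pandas as pd

# Hypothetical line-item sales; the miner needs daily totals,
# not records of individual cans of green beans.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2002-12-09", "2002-12-09", "2002-12-10"]),
    "item": ["green beans", "green beans", "green beans"],
    "qty": [3, 2, 5],
})

# Summarize to one row per day, dropping transaction-level detail.
daily_totals = sales.groupby("date", as_index=False)["qty"].sum()
print(daily_totals)
```

The same pattern scales up to weekly or monthly totals simply by grouping on a coarser time period.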
Another type of data often overlooked within organizations is something called a data mart. A
data mart is an organizational data store, similar to a data warehouse, but often created with the
needs of specific business units, such as Marketing or Customer Service, in mind for reporting
and management purposes. Data marts are typically created by an organization to serve as a
one-stop shop where employees throughout the organization can find the data
they might be looking for. Data marts may contain wonderful data, prime for data mining
activities, but they must be known, current, and accurate to be useful. They should also be well-
managed in terms of privacy and security.
All of these types of organizational data carry with them some concern. Because they are
secondary, meaning they have been derived from other more detailed primary data sources, they
may lack adequate documentation, and the rigor with which they were created can be highly
variable. Such data sources may also not be intended for general distribution, and it is always wise
to ensure proper permission is obtained before engaging in data mining activities on any data set.
Remember, simply because a data set may have been acquired from the Internet does not mean it
is in the public domain; and simply because a data set may exist within your organization does not
mean it can be freely mined. Checking with relevant managers, authors and stakeholders is critical
before beginning data mining activities.
A NOTE ABOUT PRIVACY AND SECURITY
In 2003, JetBlue Airlines supplied more than one million passenger records to a U.S. government
contractor, Torch Concepts. Torch subsequently augmented the passenger data with
additional information such as family sizes and social security numbers—information purchased
from a data broker called Acxiom. The data were intended for a data mining project aimed at
developing profiles of potential terrorists. All of this was done without the notification or consent of
passengers. When news of the activities got out, however, dozens of privacy lawsuits were filed
against JetBlue, Torch and Acxiom, and several U.S. senators called for an investigation into the
incident.
This incident serves several valuable purposes for this book. First, we should be aware that as we
gather, organize and analyze data, there are real people behind the figures. These people have
certain rights to privacy and protection against crimes such as identity theft. We as data miners