Framework for Data Preparation Techniques in Machine Learning


Tutorial Overview
This tutorial is divided into three parts; they are:

  1. Challenge of Data Preparation

  2. Framework for Data Preparation

  3. Data Preparation Techniques

Challenge of Data Preparation
Data preparation refers to transforming raw data into a form that is better suited to predictive modeling.
This may be required because the data itself contains mistakes or errors. It may also be because the chosen algorithms have expectations regarding the type and distribution of the data.
To make the task even more challenging, the data preparation required to get the best performance from a predictive model is often not obvious, and it may bend or violate the expectations of the model being used.
As such, it is common to treat the choice and configuration of data preparation applied to the raw data as yet another hyperparameter of the modeling pipeline to be tuned.
This framing of data preparation is very effective in practice, as it allows you to use automatic search techniques like grid search and random search to discover unintuitive data preparation steps that result in skillful predictive models.
This framing of data preparation can also feel overwhelming to beginners given the large number and variety of data preparation techniques.
The solution to this overwhelm is to think about data preparation techniques in a systematic way.
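To make the idea of data preparation as a hyperparameter concrete, here is a minimal sketch using scikit-learn. The synthetic dataset, the logistic regression model, the step name "prep", and the three candidate preparation options are all assumptions chosen for illustration; they are not taken from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic classification data (an assumption; substitute your own dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# The "prep" step is a placeholder that the grid search will swap out.
pipeline = Pipeline([
    ("prep", "passthrough"),
    ("model", LogisticRegression(max_iter=1000)),
])

# Candidate data preparation choices, treated like any other hyperparameter.
param_grid = {"prep": ["passthrough", StandardScaler(), MinMaxScaler()]}

# Evaluate each candidate with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print("Best preparation step:", search.best_params_["prep"])
print("Mean CV accuracy: %.3f" % search.best_score_)
```

Because the preparation step sits inside the pipeline, it is refit on the training folds only during cross-validation, so each candidate is scored without leaking information from the held-out fold.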
Framework for Data Preparation
Effective data preparation requires that the data preparation techniques available are organized and considered in a structured and systematic way.
This allows you to ensure that appropriate techniques are explored for your dataset and that potentially effective techniques are not skipped or ignored.
This can be achieved using a framework that organizes data preparation techniques by their effect on the raw dataset.
For example, structured machine learning data, such as data we might store in a CSV file for classification and regression, consists of rows, columns, and values. We might consider data preparation techniques that operate at each of these levels.

  • Data Preparation for Rows
  • Data Preparation for Columns
  • Data Preparation for Values

Data preparation for rows covers techniques that add or remove rows of data from the dataset. Similarly, data preparation for columns covers techniques that add or remove columns (features or variables) from the dataset. Data preparation for values covers techniques that change the values in the dataset, often for a given column.
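The three levels can be sketched with pandas as follows. The toy table, its column names, and the specific techniques chosen for each level (dropping duplicate rows, dropping single-valued columns, min-max scaling) are illustrative assumptions, not techniques prescribed by the framework at this point.

```python
import pandas as pd

# Hypothetical toy dataset: one duplicated row and one uninformative column.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "height_cm": [150.0, 160.0, 160.0, 170.0],
    "constant": [1, 1, 1, 1],
})

# Rows: remove rows from the dataset (here, exact duplicates).
rows_prepared = df.drop_duplicates()

# Columns: remove columns from the dataset (here, those with one unique value).
single_valued = [c for c in rows_prepared.columns
                 if rows_prepared[c].nunique() == 1]
cols_prepared = rows_prepared.drop(columns=single_valued)

# Values: change the values in a given column (here, rescale to [0, 1]).
values_prepared = cols_prepared.copy()
col = values_prepared["height_cm"]
values_prepared["height_cm"] = (col - col.min()) / (col.max() - col.min())

print(df.shape, rows_prepared.shape, cols_prepared.shape)
print(values_prepared)
```

Each step changes only one aspect of the table: first the number of rows, then the number of columns, and finally the contents of a column while the shape stays fixed.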
There is one more type of data preparation that does not neatly fit into this structure, and that is dimensionality reduction techniques. These techniques change the columns and the values at the same time, e.g. projecting the data into a lower-dimensional space.

  • Data Preparation for Columns + Values
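
As a rough illustration of a technique that changes columns and values together, the sketch below projects a dataset onto a lower-dimensional space with principal component analysis. PCA, the synthetic data, and the choice of three components are assumptions made for the example; the text above does not name a specific method.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic data with 10 input columns (an assumption for illustration).
X, _ = make_classification(n_samples=100, n_features=10, random_state=1)

# Project onto 3 new columns; none of them is one of the original columns.
pca = PCA(n_components=3)
X_projected = pca.fit_transform(X)

print(X.shape, X_projected.shape)  # (100, 10) (100, 3)
```

The number of rows is unchanged, while both the number of columns and every value in them differ from the input.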

This raises the question of techniques that might apply to rows and values at the same time. This might include data preparation that consolidates rows of data in some way.

  • Data Preparation for Rows + Values
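
One possible reading of consolidating rows is aggregation, where several rows are merged into one and new values are computed in the process. The sketch below assumes that reading; the sensor table and the use of the mean are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical readings: several rows per sensor.
readings = pd.DataFrame({
    "sensor": ["a", "a", "b", "b", "b"],
    "temperature": [20.0, 22.0, 30.0, 31.0, 32.0],
})

# Consolidate to one row per sensor; the stored value becomes the group mean.
consolidated = readings.groupby("sensor", as_index=False)["temperature"].mean()

print(consolidated)
```

Five rows become two, and the temperature values in the result are new values that did not appear in every input row, so both rows and values change at once.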

We can summarize this framework and some high-level groups of data preparation methods in the following image.

Machine Learning Data Preparation Framework
Now that we have a framework for thinking about data preparation techniques based on their effect on the data, let’s look at examples of techniques that fit into each group.
