Self-Tuning Histograms: Building Histograms Without Looking at Data
**Introduction:**
Self-tuning histograms are a new approach to building and maintaining histograms for large databases. The approach builds a histogram not by examining the data but by using feedback about the execution of queries on the database. Whereas building a traditional histogram involves scanning or sampling the data, sorting it, and partitioning it into buckets, the cost of building and maintaining a self-tuning histogram is independent of the data size; it therefore provides a remarkably inexpensive way to construct histograms for large data sets with little up-front cost.
The idea of this approach is to start with an initial histogram built from whatever information is available about the distribution of the data. Each time a query uses the histogram, the algorithm compares the estimated selectivity to the actual selectivity and refines the histogram based on the selectivity estimation error. This incremental refinement progressively reduces estimation error and leads to a histogram that is accurate for similar workloads.
A self-tuning histogram can be refined "on-line" or "off-line". In the on-line mode, the module executing a range selection updates the self-tuning histogram immediately. In the off-line mode, the execution module writes every selection range and its result to a local database, which is then used to refine the histogram in batch at a later time.
Self-tuning histograms are a suitable alternative when there is not enough time in the statistics-update window to build all the desired histograms in the traditional way. The main advantage of self-tuning histograms is that their accuracy depends on how often they are used: the more a self-tuning histogram is used, the more it is refined, and the more accurate it becomes.
**The Main Concept:**
A self-tuning histogram assumes that the data is uniformly distributed until feedback observations contradict this assumption. The algorithm thus consists of two main stages: an *initialization* stage and a *refinement* stage. The refinement stage is broken into two parts: 1. refining individual bucket frequencies, and 2. restructuring the histogram by moving the bucket boundaries.
The first stage ("initialization") creates the histogram once in its lifecycle. Initializing a self-tuning histogram requires only the desired number of histogram buckets B, the number of tuples in the relation T, and the minimum and maximum values of the attribute. The buckets of the initial histogram are evenly spaced between min and max. No feedback information is available at initialization time, so the algorithm makes the uniformity assumption and assigns each bucket a frequency of T/B tuples, where B is the number of buckets and T is the number of tuples in the relation.
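The initialization stage described above can be sketched as follows; this is a minimal illustration, and the bucket representation and function name are my own rather than the paper's:

```python
def init_histogram(b, t, lo, hi):
    """Create b evenly spaced buckets over [lo, hi], each assigned t/b tuples
    under the uniformity assumption (no feedback is available yet)."""
    width = (hi - lo) / b
    buckets = []
    for i in range(b):
        buckets.append({
            "low": lo + i * width,          # inclusive lower boundary
            "high": lo + (i + 1) * width,   # exclusive upper boundary
            "freq": t / b,                  # uniformity assumption: T/B tuples
        })
    return buckets

# Example: 4 buckets over attribute range [0, 100] for a 1000-tuple relation.
hist = init_histogram(b=4, t=1000, lo=0.0, hi=100.0)
```

Note that this is the only moment the algorithm needs catalog statistics (T, min, max); everything afterwards is driven by query feedback.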
The second stage is divided into two steps. The first, called "refining", updates the self-tuning histogram with feedback information from queries. The second, called "restructuring", moves the bucket boundaries according to their frequencies to obtain a better partitioning of the frequencies.
For each selection on the histogram attribute, the "refining" step computes the absolute estimation error, which is the difference between the estimated and the actual result size. Based on this error, it adjusts the frequencies of the buckets that were used in the estimate. The change in a bucket's frequency depends on how much it contributed to the error: buckets with higher frequencies are assumed to contribute more to the estimation error than buckets with low frequencies, so frequencies are adjusted in proportion to their current values.
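The refining step can be sketched as below: the estimation error is distributed over the buckets overlapping the query range, in proportion to each bucket's contribution to the estimate. This is a hedged sketch; the bucket dictionary layout and the damping factor name `alpha` are illustrative assumptions, not the paper's code:

```python
def refine(buckets, lo, hi, actual, alpha=0.5):
    """Given a range selection [lo, hi) with observed result size `actual`,
    distribute the estimation error over the overlapping buckets in
    proportion to their contribution to the estimate."""
    contrib = []
    est = 0.0
    for b in buckets:
        # Fraction of the bucket covered by the query range
        overlap = max(0.0, min(hi, b["high"]) - max(lo, b["low"]))
        frac = overlap / (b["high"] - b["low"])
        c = frac * b["freq"]            # this bucket's share of the estimate
        contrib.append(c)
        est += c
    if est == 0:
        return
    err = actual - est                  # absolute estimation error
    for b, c in zip(buckets, contrib):
        # Higher-frequency buckets absorb proportionally more of the error.
        b["freq"] = max(b["freq"] + alpha * err * (c / est), 0.0)

# Example: query [0, 5) over two buckets, observed result larger than estimate.
buckets = [{"low": 0.0, "high": 5.0, "freq": 50.0},
           {"low": 5.0, "high": 10.0, "freq": 50.0}]
refine(buckets, 0.0, 5.0, actual=80.0, alpha=1.0)
```

With `alpha=1.0` the full error is applied; a smaller damping factor makes refinement more conservative against noisy feedback.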
The "restructuring" step accounts for the fact that the frequencies within a bucket are approximated by their average: if there is large variation in frequency within a bucket, the average is a poor approximation of the individual frequencies. In this step, buckets that currently have high frequencies are split into several buckets, which avoids grouping high-frequency and low-frequency values in the same bucket. Restructuring consists of two functions: merging and splitting. To merge buckets with similar frequencies, two bucket frequencies are considered similar if the difference between them is less than *x* percent of the number of tuples in the relation T, where *x* is a predefined threshold. The merging step repeatedly finds the pair of adjacent runs of buckets such that the maximum difference in frequency between a bucket in the first run and a bucket in the second run is minimal over all pairs of adjacent runs; the two runs are merged into one if this difference is less than *x*\*T. The second function, splitting, identifies the buckets with the highest frequencies (that have not been chosen for merging and are not singleton buckets) and splits them in proportion to their frequencies. It is important to emphasize that because this approach never examines the data, it has no information about the distribution of values within a bucket, so it must split each bucket at evenly spaced intervals. Approaches that sample the data do have information about the within-bucket distribution and can split a bucket, e.g., approximately at its median.
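The two restructuring primitives can be sketched as below. This is an illustrative fragment under the same assumed bucket representation as above, not the paper's implementation; in particular, the run-selection logic that decides *which* runs to merge and which buckets to split is omitted:

```python
def merge_run(buckets, i, j):
    """Merge the adjacent run of buckets i..j (inclusive), assumed to have
    similar frequencies, into a single bucket covering their combined range."""
    merged = {"low": buckets[i]["low"],
              "high": buckets[j]["high"],
              "freq": sum(b["freq"] for b in buckets[i:j + 1])}
    return buckets[:i] + [merged] + buckets[j + 1:]

def split_bucket(bucket, parts):
    """Split one high-frequency bucket into `parts` pieces. Because the data
    is never examined, the split points are evenly spaced and the frequency
    is divided evenly among the pieces."""
    width = (bucket["high"] - bucket["low"]) / parts
    return [{"low": bucket["low"] + i * width,
             "high": bucket["low"] + (i + 1) * width,
             "freq": bucket["freq"] / parts}
            for i in range(parts)]

# Example: merge two similar buckets, then split a high-frequency one.
bs = [{"low": 0.0, "high": 4.0, "freq": 30.0},
      {"low": 4.0, "high": 8.0, "freq": 30.0},
      {"low": 8.0, "high": 12.0, "freq": 90.0}]
merged = merge_run(bs, 0, 1)
pieces = split_bucket(bs[2], 3)
```

The even split is the key limitation the review highlights: a sampling-based method could instead place the split point near the bucket's median.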
**Accuracy and Conclusions:**
The accuracy of a self-tuning histogram depends strongly on the degree of data skew (z). The self-tuning histogram was __always__ (over the whole range of skew z) found to be better than assuming uniformity, while traditional histograms were more accurate than the st-histogram for all values of z. This is expected, because traditional histograms are built from the true distribution determined by examining the data. The important point, however, is that for low skew (z < 1) the estimation errors of a refined st-histogram are very close to those of a traditional one.
This paper introduces a new concept and provides a remarkably inexpensive way to construct histograms for large data sets. The paper is written very clearly and explains the algorithm systematically in all its details. The only issue that remained unconvincing was how well self-tuning histograms fit the "real world" in terms of data skew: the approach was found quite good at low skew but extremely inaccurate at high skew, so I think the paper should have provided more detail about the skew characteristics of data in commercial large databases.