# The arboretum procedure

| Method | Scale | Location |
|---|---|---|
| MAXABS | Maximum absolute value | 0 |
| MEAN | 1 | Mean |
| MEDIAN | 1 | Median |
| MIDRANGE | Range/2 | Midrange |
| RANGE | Range | Minimum |
| SPACING(p) | Minimum spacing | Mid minimum-spacing |
| STD | Standard deviation | Mean |
| SUM | Sum | 0 |
| USTD | Standard deviation about the origin | 0 |

- For METHOD=ABW(c), METHOD=AHUBER(c), or METHOD=AWAVE(c), c is a positive numeric tuning constant (Iglewicz, 1983).
- For METHOD=AGK(p), p is a numeric constant that gives the proportion of pairs to be used with METHOD=COUNT in the ACECLUS procedure (refer to SAS/STAT Software: Changes and Enhancements for Release 6.12, p. 229).
- For METHOD=SPACING(p), p is a numeric constant that gives the proportion of data to be contained in the spacing.
- For METHOD=L(p), p is a numeric constant greater than or equal to 1 that specifies the power to which differences are to be raised in computing an L(p) or Minkowski metric.
- For METHOD=IN(SAS-data-set), the SAS data set can contain:
  1. a _TYPE_ variable, which identifies the observations that contain location and scale measures. For example, PROC STDIZE produces an OUTSTAT= data set that contains LOCATION and SCALE measures and some other statistics. _TYPE_='LOCATION' identifies the observation that contains location measures, and _TYPE_='SCALE' identifies the observation that contains scale measures. You can also use the data set created by the OUTSTAT= option from another PROC STDIZE statement as the IN= data set name. See the Output Data Sets section below for the contents of the OUTSTAT data set.
  2. the location and scale variables specified by the LOCATION and SCALE statements.
- PROC STDIZE reads in the location and scale variables in the IN= data set according to the following rules: PROC STDIZE first looks for the _TYPE_ variable in the IN= data set. If it is found, PROC STDIZE continues to search for all variables specified in the VAR statement. If the _TYPE_ variable is not found, PROC STDIZE searches for the location variables specified in the LOCATION statement and the scale variables specified in the SCALE statement.

For robust estimators, see Goodall (1983) and Iglewicz (1983). MAD has the highest breakdown point (50%) but is not very efficient. ABW, AHUBER, and AWAVE provide a good compromise between breakdown and efficiency. L(p) location estimates are increasingly robust as p drops from 2 (least squares, that is, the mean) to 1 (least absolute value, that is, the median), but the L(p) scale estimates are not robust.

Spacing is robust to both outliers and clustering (Jannsen et al., 1983) and is therefore a good choice for cluster analysis or nonparametric density estimation. The mid minimum spacing estimates the mode for small p. AGK is also robust to clustering and more efficient than SPACING, but it is not as robust to outliers and takes longer to compute. If you expect g clusters, the argument to SPACING or AGK should be 1/g or less. AGK is less biased than SPACING in small samples. It would generally be reasonable to use AGK for samples of size 100 or less and to use SPACING for samples of size 1000 or more, with the treatment of intermediate sample sizes depending on the available computer resources.

## Computation of the Statistics

Formulas for the statistics of METHOD=MEAN, MEDIAN, SUM, USTD, STD, RANGE, and IQR are given in Chapter 1, "SAS Elementary Statistics Procedure", in the SAS Procedures Guide. Note that the computations of the median and the upper and lower quartiles depend on the PCTLMTD= option. The rest of the statistics used in the above Table of Methods for Computing Location and Scale Measures, with the exception of METHOD=IN, are described as follows:

- EUCLEN: Euclidean length, $\sqrt{\sum_{i=1}^{n} x_i^2}$, where $x_i$ is the $i$th observation and $n$ is the total number of observations in the sample.
- L(p): Minkowski metric.
  It is documented as the LEAST=p option in the FASTCLUS procedure (see "The FASTCLUS Procedure" in the SAS/STAT User's Guide). Specifying METHOD=L(p) in the PROC STDIZE statement is almost the same as specifying the LEAST=p option with MAXCLUS=1 and using the default value of the MAXITER= option in the PROC FASTCLUS statement. The only difference is that the maximum number of iterations is a criterion for convergence on all variables simultaneously in PROC STDIZE, while it is a criterion for convergence on a single multivariate statistic in PROC FASTCLUS. The location and scale measures for L(p) are output to the OUTSEED= data set in PROC FASTCLUS.
- MIDRANGE: The midrange is defined as $(\text{maximum} + \text{minimum})/2$.
- ABW(c): Tukey's biweight. Refer to pp. 376-378 and p. 385, Chapter 11 of Goodall (1983) for the biweight 1-step M-estimate. Also refer to pp. 416-418, Chapter 12 of Iglewicz (1983) for the biweight A-estimate.
- AHUBER(c): Huber's. Refer to pp. 371-374, Chapter 11 of Goodall (1983) for the Huber 1-step M-estimate. Also refer to pp. 416-418, Chapter 12 of Iglewicz (1983) for the Huber A-estimate.
- AWAVE(c): Andrews' wave. Refer to p. 376, Chapter 11 of Goodall (1983) for the wave 1-step M-estimate. Also refer to pp. 416-418, Chapter 12 of Iglewicz (1983) for the wave A-estimate.
- AGK(p): The non-iterative univariate form of the estimator described by Art, Gnanadesikan, and Kettenring (1982). The AGK estimate is documented as the METHOD= option in the PROC ACECLUS statement of the ACECLUS procedure (see "The ACECLUS Procedure" in the SAS/STAT User's Guide). Specifying METHOD=AGK(p) in the PROC STDIZE statement is the same as specifying METHOD=COUNT and P=p in the PROC ACECLUS statement.
- SPACING(p): A spacing is the absolute difference between two data values. The minimum spacing for a proportion p is the minimum absolute difference between two data values that contain a proportion p of the data between them. The mid minimum spacing is the mean of these two data values.
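
The simpler measures described above can be illustrated outside SAS. The following Python sketch is not PROC STDIZE's implementation; every function name is hypothetical. It makes two assumptions that the text does not state: a spacing window "contains a proportion p of the data" when it spans ceil(p * n) consecutive sorted values, and the L(p) location is found by a plain ternary search over the convex cost, not by the iterative FASTCLUS algorithm.

```python
import math


def euclidean_length(xs):
    """EUCLEN scale: square root of the sum of squared observations."""
    return math.sqrt(sum(x * x for x in xs))


def midrange(xs):
    """MIDRANGE location: mean of the smallest and largest observation."""
    return (min(xs) + max(xs)) / 2


def minimum_spacing(xs, p):
    """Return (minimum spacing, mid minimum spacing) for proportion p.

    Assumption: a window covers a proportion p of the data when it spans
    k = ceil(p * n) consecutive order statistics, with k at least 2.
    """
    ordered = sorted(xs)
    n = len(ordered)
    k = max(2, math.ceil(p * n))
    if k > n:
        raise ValueError("need 0 < p <= 1 and at least two observations")
    # Slide a window of k sorted values and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    lo, hi = ordered[best], ordered[best + k - 1]
    return hi - lo, (lo + hi) / 2


def lp_location(xs, p, iterations=200):
    """Location m that minimizes sum(|x - m| ** p), for p >= 1.

    Assumption: ternary search on the convex cost, used here only to
    show the behaviour at p = 2 (mean) and p = 1 (median).
    """
    def cost(m):
        return sum(abs(x - m) ** p for x in xs)

    lo, hi = min(xs), max(xs)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) <= cost(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


data = [1.0, 2.0, 2.5, 3.0, 10.0]
print(euclidean_length([3.0, 4.0]))  # 5.0
print(midrange(data))                # 5.5
print(minimum_spacing(data, 0.5))    # (1.0, 2.5)
print(lp_location(data, 2))          # approximately 3.7, the mean
print(lp_location(data, 1))          # approximately 2.5, the median
```

With the outlier 10.0 in the sample, the midrange and the L(2) location are pulled upward, while the mid minimum spacing and the L(1) location stay near the bulk of the data, which matches the robustness remarks above.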

## Computing Quantiles

PROC STDIZE offers two methods for computing quantiles:

- the P² approach
- the order-statistics approach (as in PROC UNIVARIATE)

The P² approach used in PROC STDIZE modifies the P² algorithm for histograms proposed by Jain and Chlamtac (1985). The main difference is in the movement of markers: the modified algorithm allows a marker to move to the right (or left) by more than one position (to the largest possible integer), as long as this does not put two markers in the same position. This modification is necessary to prorate the FREQ variable.

Using the P² approach to estimate quantiles beyond the quartiles (below P25 or above P75) will not always produce accurate results, and a large sample size (10,000 or more) is required if the tail quantiles (≤ P10 and ≥ P90) are requested. Also, tail quantiles are not recommended for highly skewed or heavy-tailed distributions.
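
For comparison with the P² approach, the order-statistics approach can be sketched in a few lines of Python. PROC UNIVARIATE offers several percentile definitions; this sketch assumes the one based on the empirical distribution function with averaging, and whether PROC STDIZE applies exactly this rule is an assumption. The function name is hypothetical.

```python
import math


def percentile_edf_averaging(xs, pct):
    """Percentile from sorted data (empirical distribution, with averaging).

    Write n * pct / 100 = j + g, where j is the integer part and g the
    fractional part. If g == 0, return the mean of the j-th and (j+1)-th
    order statistics (1-based); otherwise return the (j+1)-th.
    """
    if not xs:
        raise ValueError("need at least one observation")
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    ordered = sorted(xs)
    n = len(ordered)
    position = n * pct / 100
    j = math.floor(position)
    g = position - j
    if g == 0:
        # Clamp at the ends so that pct = 0 and pct = 100 return min and max.
        lower = ordered[max(j, 1) - 1]
        upper = ordered[min(j + 1, n) - 1]
        return (lower + upper) / 2
    return ordered[j]  # 0-based index j is the (j+1)-th order statistic


sample = [1, 2, 3, 4, 5, 6, 7, 8]
print(percentile_edf_averaging(sample, 25))  # 2.5
print(percentile_edf_averaging(sample, 50))  # 4.5
print(percentile_edf_averaging(sample, 75))  # 6.5
```

Unlike P², this approach sorts and holds the whole sample in memory, which is the cost that a marker-based algorithm avoids.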
