The arboretum procedure




Method        Scale                                  Location

MAXABS        Maximum absolute value                 0
MEAN          1                                      Mean
MEDIAN        1                                      Median
MIDRANGE      Range/2                                Midrange
RANGE         Range                                  Minimum
SPACING(p)    Minimum spacing                        Mid minimum-spacing
STD           Standard deviation                     Mean
SUM           Sum                                    0
USTD          Standard deviation about the origin    0
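As a rough illustration of how each method pairs a location with a scale, the sketch below computes the pairs for the rows above that need no tuning argument. The function names, the divisor used for STD (n - 1) and USTD (n), and the `(value - location) / scale` form of the standardization are assumptions of this sketch, not statements about how PROC STDIZE is implemented.

```python
import math
import statistics


def location_scale(values, method):
    """Return (location, scale) for some METHOD= names from the table."""
    n = len(values)
    lo, hi = min(values), max(values)
    if method == "MAXABS":
        return 0.0, max(abs(v) for v in values)
    if method == "MEAN":
        return statistics.fmean(values), 1.0
    if method == "MEDIAN":
        return statistics.median(values), 1.0
    if method == "MIDRANGE":
        return (lo + hi) / 2, (hi - lo) / 2
    if method == "RANGE":
        return lo, hi - lo
    if method == "STD":
        # Assumption: sample standard deviation with divisor n - 1.
        return statistics.fmean(values), statistics.stdev(values)
    if method == "SUM":
        return 0.0, math.fsum(values)
    if method == "USTD":
        # Assumption: root mean square about zero with divisor n.
        return 0.0, math.sqrt(math.fsum(v * v for v in values) / n)
    raise ValueError(f"unsupported method: {method}")


def standardize(values, method):
    # Assumption: the usual (value - location) / scale form.
    location, scale = location_scale(values, method)
    return [(v - location) / scale for v in values]


sample = [2.0, 4.0, 4.0, 6.0, 9.0]
range_scaled = standardize(sample, "RANGE")  # maps the sample onto [0, 1]
```

With RANGE, the smallest value maps to 0 and the largest to 1; with MIDRANGE, the same data map onto [-1, 1].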

- For METHOD=ABW(c), METHOD=AHUBER(c), or METHOD=AWAVE(c), c is a positive numeric tuning constant (Iglewicz, 1983).

- For METHOD=AGK(p), p is a numeric constant that gives the proportion of pairs to be used with METHOD=COUNT in the ACECLUS procedure (refer to SAS/STAT Software: Changes and Enhancements for Release 6.12, p. 229).

- For METHOD=SPACING(p), p is a numeric constant that gives the proportion of data to be contained in the spacing.

- For METHOD=L(p), p is a numeric constant greater than or equal to 1 that specifies the power to which differences are to be raised in computing an L(p) or Minkowski metric.

- For METHOD=IN(SAS-data-set), the SAS data set can contain:

  1. a _TYPE_ variable that identifies the observations that contain location and scale measures. For example, PROC STDIZE produces an OUTSTAT= data set that contains LOCATION and SCALE measures and some other statistics. _TYPE_='LOCATION' identifies the observation that contains location measures, and _TYPE_='SCALE' identifies the observation that contains scale measures. You can also use the data set created by the OUTSTAT= option from another PROC STDIZE statement as the IN= data set name. See the Output Data Sets section below for the contents of the OUTSTAT= data set.

  2. the location and scale variables specified by the LOCATION and SCALE statements.


PROC STDIZE reads the location and scale variables from the IN= data set according to the following rules. PROC STDIZE first looks for the _TYPE_ variable in the IN= data set. If it is found, PROC STDIZE continues to search for all variables specified in the VAR statement. If the _TYPE_ variable is not found, PROC STDIZE searches for the location variables specified in the LOCATION statement and the scale variables specified in the SCALE statement.
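The lookup rules can be sketched in a few lines. This is only an illustration: the data set is modeled as a list of dictionaries, `read_location_scale` is a hypothetical helper, and the positional pairing of LOCATION and SCALE variables with VAR variables (read from the first observation) is an assumption of the sketch.

```python
def read_location_scale(rows, var_names, location_vars, scale_vars):
    """Return ({var: location}, {var: scale}) from an IN= style table.

    rows is a list of dicts standing in for observations (an assumption
    of this sketch, not a SAS structure).
    """
    has_type = any("_TYPE_" in row for row in rows)
    if has_type:
        # _TYPE_ found: look up each VAR variable by its own name.
        by_type = {row["_TYPE_"]: row for row in rows if "_TYPE_" in row}
        loc_row, scale_row = by_type["LOCATION"], by_type["SCALE"]
        location = {v: loc_row[v] for v in var_names}
        scale = {v: scale_row[v] for v in var_names}
    else:
        # No _TYPE_: read the LOCATION and SCALE statement variables.
        # Assumption: paired with VAR variables by position, first row only.
        first = rows[0]
        location = {v: first[name] for v, name in zip(var_names, location_vars)}
        scale = {v: first[name] for v, name in zip(var_names, scale_vars)}
    return location, scale


# An OUTSTAT=-like table with a _TYPE_ variable.
typed = [
    {"_TYPE_": "LOCATION", "x": 5.0, "y": 10.0},
    {"_TYPE_": "SCALE", "x": 2.0, "y": 4.0},
]
# A table without _TYPE_, using hypothetical location/scale variable names.
untyped = [{"loc_x": 5.0, "loc_y": 10.0, "sc_x": 2.0, "sc_y": 4.0}]

from_typed = read_location_scale(typed, ["x", "y"], [], [])
from_untyped = read_location_scale(
    untyped, ["x", "y"], ["loc_x", "loc_y"], ["sc_x", "sc_y"]
)
```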

For robust estimators, see Goodall (1983) and Iglewicz (1983). MAD has the highest breakdown point (50%) but is not very efficient. ABW, AHUBER, and AWAVE provide a good compromise between breakdown and efficiency. L(p) location estimates are increasingly robust as p drops from 2 (least squares, that is, the mean) to 1 (least absolute value, that is, the median), but the L(p) scale estimates are not robust.
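The effect of p on the L(p) location estimate can be checked numerically. The sketch below minimizes the sum of |x - m|^p over m with a plain ternary search; `lp_location` is a hypothetical helper and this search is only one simple way to find the minimum, not the iteration that PROC STDIZE or PROC FASTCLUS uses.

```python
def lp_location(values, p, tol=1e-9):
    """Minimize sum(|x - m| ** p) over m by ternary search (p >= 1)."""
    if p < 1:
        raise ValueError("p must be >= 1")

    def cost(m):
        return sum(abs(x - m) ** p for x in values)

    lo, hi = min(values), max(values)
    # The cost is convex for p >= 1, so the search interval can be
    # shrunk from whichever side has the larger cost.
    while hi - lo > tol:
        third = (hi - lo) / 3
        m1, m2 = lo + third, hi - third
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


data = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
at_p2 = lp_location(data, 2)         # close to the mean, 22
at_p15 = lp_location(data, 1.5)      # between the median and the mean
at_p1 = lp_location(data, 1)         # close to the median, 3
```

As p moves from 2 toward 1, the estimate moves away from the outlier-dominated mean and toward the median.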

Spacing is robust to both outliers and clustering (Jannsen, et al., 1983) and is therefore a good choice for cluster analysis or nonparametric density estimation. The mid minimum spacing estimates the mode for small p. AGK is also robust to clustering and more efficient than SPACING, but it is not as robust to outliers and takes longer to compute. If you expect g clusters, the argument to SPACING or AGK should be 1/g or less. AGK is less biased than SPACING in small samples. It would generally be reasonable to use AGK for samples of size 100 or less and to use SPACING for samples of size 1000 or more, with the treatment of intermediate sample sizes depending on the available computer resources.



Computation of the Statistics

Formulas for the statistics used by METHOD=MEAN, MEDIAN, SUM, USTD, STD, RANGE, and IQR are given in Chapter 1, "SAS Elementary Statistics Procedure", in the SAS Procedures Guide. Note that the computations of the median and the upper and lower quartiles depend on the PCTLMTD= option.

The rest of the statistics used in the above Table of Methods for Computing Location and Scale Measures, with the exception of METHOD=IN, are described as follows:

EUCLEN

Euclidean length: $\sqrt{\sum_{i=1}^{n} x_i^2}$, where $x_i$ is the $i$th observation and $n$ is the total number of observations in the sample.

L(p)

Minkowski metric. It is documented as the LEAST=p option in the FASTCLUS procedure (see "The FASTCLUS Procedure" in the SAS/STAT User's Guide). Specifying METHOD=L(p) in the PROC STDIZE statement is almost the same as specifying the LEAST=p option with MAXCLUS=1 and using the default value of the MAXITER= option in the PROC FASTCLUS statement. The only difference is that the maximum number of iterations is a criterion for convergence on all variables simultaneously in PROC STDIZE, whereas it is a criterion for convergence on a single multivariate statistic in PROC FASTCLUS. The location and scale measures for L(p) are output to the OUTSEED= data set in PROC FASTCLUS.

MIDRANGE

The midrange is defined as $(\text{maximum} + \text{minimum})/2$.

ABW(c)

Tukey's biweight. Refer to pp. 376-378 and p. 385, Chapter 11 of Goodall (1983) for the biweight 1-step M-estimate. Also refer to pp. 416-418, Chapter 12 of Iglewicz (1983) for the biweight A-estimate.

AHUBER(c)

Huber's. Refer to pp. 371-374, Chapter 11 of Goodall (1983) for the Huber 1-step M-estimate. Also refer to pp. 416-418, Chapter 12 of Iglewicz (1983) for the Huber A-estimate.

AWAVE(c)

Andrews' Wave. Refer to p. 376, Chapter 11 of Goodall (1983) for the Wave 1-step M-estimate. Also refer to pp. 416-418, Chapter 12 of Iglewicz (1983) for the Wave A-estimate.

AGK(p)

This is the non-iterative univariate form of the estimator described by Art, Gnanadesikan, and Kettenring (1982).

The AGK estimate is documented as the METHOD= option in the PROC ACECLUS statement of the ACECLUS procedure (see "The ACECLUS Procedure" in the SAS/STAT User's Guide). Specifying METHOD=AGK(p) in the PROC STDIZE statement is the same as specifying METHOD=COUNT and P=p in the PROC ACECLUS statement.

SPACING(p)

A spacing is the absolute difference between two data values. The minimum spacing for a proportion p is the minimum absolute difference between two data values that contain a proportion p of the data between them. The mid minimum spacing is the mean of these two data values.
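A small sketch of the definition follows. The helper name `minimum_spacing` is hypothetical, and reading "contain a proportion p of the data" as a window of max(2, ceil(p * n)) consecutive sorted values with both endpoints counted is an assumption; the procedure may count the enclosed observations differently.

```python
import math


def minimum_spacing(values, p):
    """Return (minimum spacing, mid minimum-spacing) for proportion p.

    Assumption: a spacing "contains a proportion p of the data" when it
    spans k = max(2, ceil(p * n)) consecutive sorted values, endpoints
    included.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    k = max(2, math.ceil(p * n))
    best = None
    # Slide a window of k sorted values and keep the narrowest one.
    for i in range(n - k + 1):
        low, high = xs[i], xs[i + k - 1]
        if best is None or high - low < best[0]:
            best = (high - low, (low + high) / 2)
    return best


data = [0.0, 1.0, 1.1, 1.2, 1.3, 5.0, 9.0, 20.0]
spacing, mid = minimum_spacing(data, 0.5)
```

Here half of the data sit between 1.0 and 1.3, so the minimum spacing is about 0.3 and the mid minimum spacing is about 1.15, which lands in the densest part of the sample rather than near the outlying values.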

Computing Quantiles

PROC STDIZE offers two methods for computing quantiles:

- the P2 approach

- the order-statistics approach (as in PROC UNIVARIATE)

The P2 approach used in PROC STDIZE modifies the P2 algorithm for histograms proposed by Jain and Chlamtac (1985). The main difference comes from the movement of markers. P2 allows a marker to move to the right (or left) by more than one position (to the largest possible integer) as long as it would not result in two markers being in the same position. This modification is necessary to prorate the FREQ variable.

Using the P2 approach to estimate quantiles beyond the quartiles (< P25 or > P75) will not always produce accurate results, and a large sample size (10,000 or more) is required if the tail quantiles (≤ P10 and ≥ P90) are requested. Also, tail quantiles are not recommended for highly skewed and/or heavy-tailed distributions.
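For orientation, here is a minimal sketch of the basic five-marker P2 algorithm for a single quantile, compared with the order-statistics answer. It moves each marker by at most one position per observation, so it is the unmodified scheme of Jain and Chlamtac (1985) and not the FREQ-prorating variant described above; the class name `P2Quantile` is hypothetical.

```python
import random


class P2Quantile:
    """Streaming p-quantile estimate using five markers (basic P2)."""

    def __init__(self, p):
        self.p = p
        self.count = 0
        self.q = []                   # marker heights
        self.n = [1, 2, 3, 4, 5]      # actual marker positions
        self.desired = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]
        self.step = [0, p / 2, p, (1 + p) / 2, 1]

    def add(self, x):
        q, n = self.q, self.n
        if self.count < 5:
            # The first five observations initialize the markers.
            q.append(x)
            self.count += 1
            if self.count == 5:
                q.sort()
            return
        self.count += 1
        # Find the cell that contains x, extending the extremes if needed.
        if x < q[0]:
            q[0] = x
            k = 0
        elif x >= q[4]:
            q[4] = x
            k = 3
        else:
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):
            n[i] += 1
        for i in range(5):
            self.desired[i] += self.step[i]
        # Move each interior marker by at most one position.
        for i in (1, 2, 3):
            d = self.desired[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                s = 1 if d > 0 else -1
                candidate = self._parabolic(i, s)
                if not q[i - 1] < candidate < q[i + 1]:
                    candidate = self._linear(i, s)
                q[i] = candidate
                n[i] += s

    def _parabolic(self, i, s):
        q, n = self.q, self.n
        return q[i] + s / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def _linear(self, i, s):
        q, n = self.q, self.n
        return q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i])

    def value(self):
        if self.count < 5:
            raise ValueError("need at least five observations")
        return self.q[2]              # the middle marker tracks the p-quantile


rng = random.Random(0)
stream = [rng.random() for _ in range(10000)]
estimator = P2Quantile(0.5)
for x in stream:
    estimator.add(x)
p2_median = estimator.value()
exact_median = sorted(stream)[len(stream) // 2]   # order-statistics answer
```

The P2 estimate needs only five stored values regardless of sample size, whereas the order-statistics approach has to keep and sort the data.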




