4. RESULTS
This section discusses the results of the experiments described in the previous
section.
4.1 Compression ratios
The first input for the experiment allows each file type to be considered a separate
treatment. The compression and password options are then considered blocks, of
which there are four total. The files are divided into four groups: Text, Executable,
Graphics and Other. These are encoded as treatments 1, 2, 3 and 4, respectively.
To balance the experiment, four files of each type are randomly selected and each
file is tested in every block. An ANOVA test is run to compare the means of the
four treatments at a significance level of α = .05. A box plot and basic descriptive
statistics of the data follow in Figure 4.1 and Table 4.1.
Fig. 4.1. Box Plot of the distributions of different file types.
Table 4.1.
Descriptive statistics for compression ratio data
Treatment   N Obs   Mean        Std Dev     Minimum     Maximum
1           16      0.3244542   0.0637309   0.2470714   0.4163709
2           16      0.3561775   0.0856055   0.2768610   0.4914785
3           16      0.8181292   0.3235247   0.2754676   1.0009069
4           16      0.3092772   0.3171904   0.0360954   0.8311475
Notice in Figure 4.1 that the box plots of the treatments overlap. This suggests
that the treatment means are not necessarily distinct. To determine whether there
is a significant difference between file types, the hypothesis
$H_0$: The treatment means are equal
is tested using Analysis of Variance. SAS provides the ANOVA table in Table 4.2.
The P-value of < 0.0001 is less than the stated significance level of α = .05.
Therefore, there is statistical evidence to reject $H_0$, and the conclusion is that
there exists a difference in compression ratios of different file types.
Table 4.2.
ANOVA table for comparing compression ratios of different file types
Source   DF   Type III SS   Mean Square   F-Value   P-value
trt      3    2.87792404    0.95930801    16.82     <.0001
blk      3    0.00000060    0.00000020    0.00      1.0000
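The treatment-plus-block decomposition behind an ANOVA table such as Table 4.2 can be sketched in a few lines. The following is a minimal illustration, assuming a balanced design and an additive model (treatment and block effects, no interaction). The function name `rcbd_anova` and the sample ratios are hypothetical and are not the study's data. SAS was used for the actual analysis, so this is not the author's procedure.

```python
from collections import defaultdict


def rcbd_anova(observations):
    """Additive two-way ANOVA (treatment + block) for balanced data.

    observations: iterable of (treatment, block, value) tuples.
    Returns sums of squares, degrees of freedom and the treatment F statistic.
    """
    obs = list(observations)
    n = len(obs)
    grand_mean = sum(y for _, _, y in obs) / n

    by_trt, by_blk = defaultdict(list), defaultdict(list)
    for trt, blk, y in obs:
        by_trt[trt].append(y)
        by_blk[blk].append(y)

    def between_ss(groups):
        # Sum over groups of (group size) * (group mean - grand mean)^2.
        return sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
                   for v in groups.values())

    ss_total = sum((y - grand_mean) ** 2 for _, _, y in obs)
    ss_trt = between_ss(by_trt)
    ss_blk = between_ss(by_blk)
    ss_err = ss_total - ss_trt - ss_blk  # valid for balanced designs

    df_trt = len(by_trt) - 1
    df_blk = len(by_blk) - 1
    df_err = n - 1 - df_trt - df_blk
    ms_trt = ss_trt / df_trt
    ms_err = ss_err / df_err
    return {
        "ss_trt": ss_trt, "ss_blk": ss_blk, "ss_err": ss_err,
        "df_trt": df_trt, "df_blk": df_blk, "df_err": df_err,
        "ms_trt": ms_trt, "ms_err": ms_err, "f_trt": ms_trt / ms_err,
    }


# Hypothetical compression ratios: 2 file types x 2 option blocks x 2 files.
sample = [
    ("text", "A", 0.31), ("text", "A", 0.35),
    ("text", "B", 0.30), ("text", "B", 0.36),
    ("graphics", "A", 0.80), ("graphics", "A", 0.95),
    ("graphics", "B", 0.82), ("graphics", "B", 0.93),
]
result = rcbd_anova(sample)
print(f"F for treatments = {result['f_trt']:.2f} "
      f"on ({result['df_trt']}, {result['df_err']}) degrees of freedom")
```

The resulting F statistic would then be compared against the critical value of the F distribution at α = .05, which is the step SAS reports as the P-value.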
To formally test the differences between means, Tukey’s comparison of treatment
means is applied. All possible pairs of treatments are tested, which makes
Tukey’s comparison the most appropriate procedure. Means with the same letter are
not considered significantly different. As illustrated in Table 4.3, treatments 1, 2
and 4 are not significantly different. These treatment types correspond to text, executable and
other data files respectively. Graphics are noted to have a mean significantly higher
than other file types.
Table 4.3.
Tukey’s comparison of treatment means
Tukey Grouping   Mean      N    trt
A                0.81813   16   3
B                0.35618   16   2
B                0.32445   16   4
B                0.30928   16   1
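The grouping in Table 4.3 can be approximately checked by hand with Tukey's honestly significant difference, HSD = q · sqrt(MSE / n). The sketch below makes two assumptions that do not come from the SAS output. First, the error mean square is recovered from Table 4.2 as MS_trt / F, so it carries the rounding of the reported F-value. Second, the studentized-range critical value `Q_CRIT` is an approximate table lookup for α = .05, four groups and roughly 57 to 60 error degrees of freedom. The file-type labels follow Table 4.4.

```python
import math
from itertools import combinations

# Treatment means as reported, labeled by file type as in Table 4.4.
means = {
    "Text": 0.32445,
    "Executable": 0.35618,
    "Graphics": 0.81813,
    "Other": 0.30928,
}
N_PER_GROUP = 16

# Error mean square recovered from Table 4.2 (MS_trt / F); approximate
# because the reported F-value is rounded to two decimals.
MS_ERROR = 0.95930801 / 16.82

# ASSUMPTION: approximate studentized-range critical value from a standard
# table (alpha = .05, 4 groups, about 57-60 error degrees of freedom).
Q_CRIT = 3.74

hsd = Q_CRIT * math.sqrt(MS_ERROR / N_PER_GROUP)

# A pair of means differs significantly when its gap exceeds the HSD.
significant = {
    (a, b): abs(means[a] - means[b]) > hsd
    for a, b in combinations(means, 2)
}

print(f"HSD = {hsd:.4f}")
for (a, b), flag in significant.items():
    verdict = "different" if flag else "not significantly different"
    print(f"{a} vs {b}: {verdict}")
```

Under these assumptions only the pairs involving Graphics exceed the threshold, which agrees with the A/B grouping reported by SAS.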
Finally, Table 4.4 provides 95% confidence intervals for the mean compression
ratio of each file type. Intervals constructed in this way contain the true population
mean 95% of the time. Investigators with a known compression ratio falling within
one of these intervals can assume that the files contained in the archive are of the
indicated file type.
Table 4.4.
95% Confidence Intervals for different file type compression ratios
File Type    Mean      95% Confidence Interval
Text         0.32445   0.29049   0.35841
Executable   0.35618   0.31056   0.40179
Graphic      0.81813   0.64574   0.99052
Other        0.30928   0.14026   0.47830
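The intervals in Table 4.4 appear to be consistent with per-type t-intervals, mean ± t · s / sqrt(N), computed from the means and standard deviations in Table 4.1. The text does not state the method, so this reading is an assumption. The sketch below hardcodes `T_CRIT`, the two-sided 95% critical value of Student's t with 15 degrees of freedom, from a standard t table.

```python
import math

# Per-type (mean, standard deviation) copied from Table 4.1; N = 16 each.
summary = {
    "Text": (0.3244542, 0.0637309),
    "Executable": (0.3561775, 0.0856055),
    "Graphic": (0.8181292, 0.3235247),
    "Other": (0.3092772, 0.3171904),
}
N = 16

# ASSUMPTION: two-sided 95% critical value of Student's t with
# N - 1 = 15 degrees of freedom, taken from a standard t table.
T_CRIT = 2.13145


def t_interval(mean, std_dev, n=N, t_crit=T_CRIT):
    """Return (lower, upper) bounds of mean +/- t * s / sqrt(n)."""
    half_width = t_crit * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


intervals = {name: t_interval(m, s) for name, (m, s) in summary.items()}
for name, (low, high) in intervals.items():
    print(f"{name:<11}{low:.5f}  {high:.5f}")
```

The Text, Executable and Other intervals overlap substantially, so a single observed ratio may fall inside more than one of them.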
4.2 File detection
Two experiments are run in this section. The first tests whether the appearance
of substrings in the known part of an archive correlates with the compressed length
of the archive. The second experiment tests whether the compression ratio of the
archive is correlated with the compression ratio of a file in question.
4.2.1 Appearance of substrings
The archives are constructed as described in Section 3.2. The goal is to identify
archives that contain FP.log through the appearance of substrings of a string S in a
known file. Archives containing FP.log are separated from the rest of the collection.
Appearances of substrings are counted for each archive. Linear regression is then
applied to determine the correlation between the number of appearances and the
compressed size of the archive.
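The regression step can be illustrated with a single-predictor least-squares fit that reports R-Square and Adj R-Sq. This is a simplified sketch and not the SAS model used for the analysis. The function name `linear_fit` and the sample counts and sizes are hypothetical, and the adjusted statistic uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) with p = 1 predictor.

```python
def linear_fit(x, y):
    """Ordinary least squares of y on a single predictor x.

    Returns (slope, intercept, r_squared, adj_r_squared).
    Requires at least three observations.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean

    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot

    p = 1  # number of predictors in this simplified model
    adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)
    return slope, intercept, r_squared, adj_r_squared


# Hypothetical data: substring appearances and compressed archive sizes (bytes).
appearances = [3, 7, 12, 18, 25, 31]
sizes = [910_000, 1_450_000, 870_000, 1_600_000, 1_200_000, 1_750_000]

slope, intercept, r2, adj_r2 = linear_fit(appearances, sizes)
print(f"R-Square = {r2:.4f}, Adj R-Sq = {adj_r2:.4f}")
```

The adjusted statistic penalizes the fit for the number of predictors, which is why it is always lower than R-Square when the fit is imperfect.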
Table 4.5.
SAS output of correlation between size and appearance of substrings
where the file is present
Root MSE         495032     R-Square   0.2520
Dependent Mean   1293068    Adj R-Sq   0.1273
Coeff Var        38.28347
Table 4.6.
SAS output of correlation between size and appearance of substrings
where the file is not present
Root MSE         317309      R-Square   0.1396
Dependent Mean   109798      Adj R-Sq   0.0614
Coeff Var        288.99243
Tables 4.5 and 4.6 show the SAS output for the correlation values. The model uses
multiple linear regression, so the Adj R-sq is the most appropriate statistic. Notice
that $R^2_{\text{present}} = 0.1273$ and $R^2_{\text{notpresent}} = 0.0614$. This implies that the correlation is