a compressed pointer references the compressed representation [24]. The references
can be either left- or right-pointing, and the scheme allows for recursion.
Storer and Szymanski’s scheme addresses a flaw in the original LZ77 algorithm.
LZ77 would occasionally generate a reference longer than the target string, resulting
in poor compression. To correct this, LZSS omits any reference that would be longer
than the string it replaces. The scheme also uses one-bit flags to indicate whether the following
string of data is the original source or a reference.
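The literal/reference flagging described above can be sketched in a toy form. This is a simplified illustration, not WinRAR's or Storer and Szymanski's actual implementation: the window size, minimum match length, and token format are all assumptions made for the example.

```python
# Toy LZSS sketch: flag 0 -> literal byte, flag 1 -> (offset, length)
# back-reference. Matches shorter than MIN_MATCH are emitted as literals,
# mirroring LZSS's fix for LZ77's counterproductively long references.

WINDOW = 4096   # sliding-window size (assumed for illustration)
MIN_MATCH = 3   # below this, a reference costs more than the literals

def lzss_compress(data: bytes):
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # naive search of the window for the longest match
        for j in range(max(0, i - WINDOW), i):
            length = 0
            while (i + length < len(data)
                   and j + length < i
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= MIN_MATCH:
            out.append((1, (best_off, best_len)))  # flag 1: reference
            i += best_len
        else:
            out.append((0, data[i]))               # flag 0: literal
            i += 1
    return out

def lzss_decompress(tokens) -> bytes:
    buf = bytearray()
    for flag, payload in tokens:
        if flag == 0:
            buf.append(payload)
        else:
            off, length = payload
            for _ in range(length):
                buf.append(buf[-off])
    return bytes(buf)

tokens = lzss_compress(b"abcabcabcabc")
assert lzss_decompress(tokens) == b"abcabcabcabc"
```

Repetitive input such as `b"abcabcabcabc"` collapses to a few literals plus references, while input with no repeats falls back to all literals.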
2.3.2 PPMII
PPMII was integrated into WinRAR as of version 2.9 to further reduce compression
ratios [21]. PPMII was developed by Dmitry Shkarin as an improvement
to the Prediction by Partial Matching model [25, 26]. Broadly, the n-th symbol of a
string is predicted based on the previous n − 1 symbols. The compression of a string
is driven by coding conditional probability distributions, based on the following
assumption [25]:
The larger the common (initial) part of contexts s, the larger (on the
average) the closeness of their conditional probability distributions.
This means that the more initial characters two contexts have in common, the
greater the probability of correctly predicting the n-th symbol. This is desirable, as a higher
probability requires fewer bits to encode. To efficiently store the contexts, an M-ary
tree is utilized. This is particularly efficient if a text consists of large numbers of short
strings.
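A minimal order-2 context model conveys the PPM idea. This is illustrative only: PPMII's escape estimation and information inheritance are not modeled, and a plain dictionary stands in for the M-ary context tree.

```python
from collections import defaultdict

K = 2  # context length: predict the n-th symbol from the previous K symbols

def build_model(text):
    """Count, for each K-symbol context, which symbols followed it."""
    contexts = defaultdict(lambda: defaultdict(int))
    for i in range(K, len(text)):
        contexts[text[i - K:i]][text[i]] += 1
    return contexts

def predict(contexts, context, symbol):
    """Conditional probability of `symbol` given `context` under the counts."""
    counts = contexts.get(context)
    if not counts:
        return 0.0
    return counts[symbol] / sum(counts.values())

model = build_model("abracadabra")
print(predict(model, "ab", "r"))  # 'ab' is always followed by 'r' -> 1.0
```

The sharper the distribution for a context, the fewer bits an entropy coder needs for the predicted symbol, which is exactly the assumption quoted above.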
2.3.3 Intel IA-32
Intel IA-32 is a compression scheme introduced in response to the observation that
database processing correlates with the hardware constraints of storage I/O [27]. It
provides lightweight compression and decompression using single instruction, multiple
data (SIMD) commands to optimize database queries. Data is compressed quickly by
reducing the dynamic range of the data. This is accomplished by applying a mask,
packed shift, and finally stitching the data together.
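The mask, shift, and stitch steps can be shown with a scalar sketch. This is illustrative only; real implementations apply these steps to many values at once with packed SIMD instructions, and the bit widths here are assumed for the example.

```python
# Scalar sketch of the mask -> shift -> stitch idea behind SIMD-style
# lightweight compression: values with a small dynamic range are packed
# into fewer bits than a full machine word.

def pack(values, bits):
    """Stitch each value, masked to `bits` bits, into one integer."""
    mask = (1 << bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * bits)  # mask, then shift into position
    return word

def unpack(word, bits, count):
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

values = [3, 7, 1, 5]        # each value fits in 3 bits
packed = pack(values, 3)     # 12 bits instead of four machine words
assert unpack(packed, 3, 4) == values
```

Because pack and unpack are a handful of bitwise operations, decompression is cheap enough to run inline during query processing, which is the point of the scheme.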
2.3.4 Delta encoding
This is the second new technique introduced to optimize compression performance
in the newest version. Delta encoding encompasses several techniques that store data
as the difference between successive samples [28]. This is an alternative to directly
storing the samples themselves. Generally, the first value in the encoded file is equal
to the first value in the original data. The subsequent values are equal to the difference
between the current and previous value in the input. That is, for an encoded value
y_n with original inputs x_n:

y_n = x_n − x_{n−1}    (2.1)
This approach is best suited when the values in the original file have only small
changes between adjacent symbols. It is therefore ideal for file representation of a
signal, but performs poorly with text files and executable code.
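A minimal encoder and decoder implied by Equation 2.1 can be sketched as follows; the plain-list representation is assumed for illustration.

```python
# Delta encoder/decoder per Equation 2.1: the first value is stored
# as-is, each later value as the difference from its predecessor.

def delta_encode(xs):
    return xs[:1] + [xs[i] - xs[i - 1] for i in range(1, len(xs))]

def delta_decode(ys):
    xs = ys[:1]
    for d in ys[1:]:
        xs.append(xs[-1] + d)  # undo Equation 2.1 by cumulative sum
    return xs

# A slowly varying "signal" produces small residuals, which compress well;
# the same scheme applied to text-like data would not.
signal = [100, 101, 103, 102, 102]
print(delta_encode(signal))  # [100, 1, 2, -1, 0]
assert delta_decode(delta_encode(signal)) == signal
```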
3. METHODS
This section provides detailed descriptions of the experiments conducted to determine
information leakage through the compression side-channel. These experiments include
an examination of compression ratios for file type and string detection as well as
a man-in-the-middle attack exploiting the independence between the compression
and encryption algorithms. The experiments were chosen based on their focus on
the compression side-channel and the practicality of implementing the attacks with
limited knowledge of an archive’s contents. Unless otherwise indicated, experiments
are performed using WinRAR v5.10.
3.1 Compression ratios
This experiment is based on work by Kelsey and Polmirova-Nickolova [11, 12]. It
is run to test the hypothesis:
Hypothesis 1 The compression of different file types under RAR and RAR5 archives
will produce distinct compression ratios.
If this hypothesis holds true, an attacker can make an educated guess as to the
contents of an encrypted archived file. This knowledge is useful for identifying file
type even when a user applies obfuscating measures such as renaming a file. The
information needed to calculate the compression ratio can be obtained by inspecting
the file header. Once this information is obtained, the compression ratio can be
calculated as shown in Equation 3.1.
The files used in this test are retrieved from the Canterbury Corpus and Maximum
Compression benchmark [29,30]. The files included in these collections are selected to
give compression results typical of commonly used files. In particular, the Canterbury
Corpus is the main benchmark to test compression algorithms. Details describing the
contents of the collections can be found in Appendix A.
The files are categorized into four types as outlined in the Maximum Compression
collection: text, executable, graphic, and other. Text file formats include plain text
files in English. Executables are Windows executable files such as .exe extensions.
Graphic files are various image file types. Other types include any files not included
under the other categories such as Microsoft Office documents, Adobe PDF or help
files. The corpora include only two graphic files, of .jpeg and .bmp types. To increase
sample size and provide a wider range of file formats under the graphic file type,
additional .png and .gif formats are included.
All files in the collection were compressed with and without encryption using both
RAR and RAR5 file types. For testing purposes, the password “P4ssw0rd” was used for
all encrypted archives. The compression ratio, c, for each archive with packed archive
size x and unpacked size y is calculated as follows:
c = x / y    (3.1)
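Equation 3.1 amounts to a one-line computation. In the sketch below, the packed size x and unpacked size y are assumed to have already been read from the archive's file header.

```python
# Compression ratio per Equation 3.1: packed size x over unpacked size y.
# A smaller ratio means the data compressed more.

def compression_ratio(packed_size: int, unpacked_size: int) -> float:
    return packed_size / unpacked_size

print(compression_ratio(350, 1000))  # 0.35
```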
The experiment is set up as a block design with the file types as treatments
and encryption and archive format as blocks. This allows the compression ratio of
file types to be compared while controlling the variation due to different methods.
Analysis of Variance (ANOVA) is then employed to test the existence of a statistical
difference between treatments. These tests are carried out at the α = 5% significance
level.
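A hand-rolled one-way ANOVA F statistic illustrates the comparison between treatments. This is a simplification: the block design described above also partitions out variation due to encryption and archive format, which a plain one-way ANOVA does not, and the compression ratios below are hypothetical values chosen for the example.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # variation between group means vs. variation within each group
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical compression ratios for three file types
ratios = {
    "text":       [0.30, 0.32, 0.31],
    "executable": [0.45, 0.47, 0.44],
    "graphic":    [0.95, 0.97, 0.96],
}
f_stat = anova_f(list(ratios.values()))
print(f_stat > 100)  # a large F suggests the file-type means differ
```

If the resulting F exceeds the critical value at α = 5%, the hypothesis of equal mean compression ratios across file types is rejected.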
3.2 File detection
The file detection experiments are inspired by the String Presence Detection attacks
outlined by Kelsey [12]. This experiment will test the following two hypotheses:
Hypothesis 2a Given an uncompressed plaintext string S and a known file from
an encrypted archive, an attacker can determine whether S appears frequently
within the archive.