Now we represent the data base as a set of records. To locate the record with a given key we use a procedure lookup, which takes as arguments a key and a data base and which returns the record that has that key, or false if there is no such record. Lookup is implemented in almost the same way as element-of-set?. For example, if the set of records is implemented as an unordered list, we could use
(define (lookup given-key set-of-records)
  (cond ((null? set-of-records) false)
        ((equal? given-key (key (car set-of-records)))
         (car set-of-records))
        (else (lookup given-key (cdr set-of-records)))))
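For instance, if each record is a two-element list of a key and a name, with the selector key taken to be car (an assumption made here just for illustration; the text leaves key abstract), we might exercise lookup as follows:

(define (key record) (car record))   ; illustrative choice of selector

(define records '((1 smith) (2 jones) (3 doe)))

(lookup 2 records)                   ; evaluates to (2 jones)
(lookup 4 records)                   ; evaluates to false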
Of course, there are better ways to represent large sets than as unordered lists. Information-retrieval
systems in which records have to be “randomly accessed” are typically implemented by a tree-based
method, such as the binary-tree representation discussed previously. In designing such a system the
methodology of data abstraction can be a great help. The designer can create an initial implementation
using a simple, straightforward representation such as unordered lists. This will be unsuitable for the
eventual system, but it can be useful in providing a “quick and dirty” data base with which to test the
rest of the system. Later on, the data representation can be modified to be more sophisticated. If the
data base is accessed in terms of abstract selectors and constructors, this change in representation will
not require any changes to the rest of the system.
Exercise 2.66.  Implement the lookup procedure for the case where the set of records is structured as a binary tree, ordered by the numerical values of the keys.
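One possible solution sketch, assuming the selectors entry, left-branch, and right-branch from the binary-tree representation discussed previously: since the tree is ordered by numerical keys, each comparison discards half of the remaining records.

(define (lookup given-key set-of-records)
  (if (null? set-of-records)
      false
      (let ((record (entry set-of-records)))
        (cond ((= given-key (key record)) record)
              ((< given-key (key record))
               (lookup given-key (left-branch set-of-records)))
              (else
               (lookup given-key (right-branch set-of-records)))))))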
2.3.4 Example: Huffman Encoding Trees
This section provides practice in the use of list structure and data abstraction to manipulate sets and
trees. The application is to methods for representing data as sequences of ones and zeros (bits). For
example, the ASCII standard code used to represent text in computers encodes each character as a
sequence of seven bits. Using seven bits allows us to distinguish 2⁷, or 128, possible different characters. In general, if we want to distinguish n different symbols, we will need to use log₂ n bits per symbol. If all our messages are made up of the eight symbols A, B, C, D, E, F, G, and H, we can choose a code with three bits per character, for example
A 000    B 001
C 010    D 011
E 100    F 101
G 110    H 111
With this code, the message

BACADAEAFABBAAAGAH

is encoded as the string of 54 bits
001000010000011000100000101000001001000000000110000111
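Since every symbol takes exactly three bits, the 18-symbol message requires 18 × 3 = 54 bits. A minimal sketch of such a fixed-length encoder, representing the bit string as a list of 0s and 1s and keeping the code in an association list (the names fixed-code and encode-fixed are illustrative, not part of the text):

(define fixed-code
  '((a (0 0 0)) (b (0 0 1)) (c (0 1 0)) (d (0 1 1))
    (e (1 0 0)) (f (1 0 1)) (g (1 1 0)) (h (1 1 1))))

(define (encode-fixed message code-table)
  (if (null? message)
      '()
      (append (cadr (assq (car message) code-table))  ; bits for this symbol
              (encode-fixed (cdr message) code-table))))

Evaluating (encode-fixed '(b a c a d a e a f a b b a a a g a h) fixed-code) produces the 54-bit list above.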
Codes such as ASCII and the A-through-H code above are known as fixed-length codes, because they
represent each symbol in the message with the same number of bits. It is sometimes advantageous to
use variable-length codes, in which different symbols may be represented by different numbers of bits.
For example, Morse code does not use the same number of dots and dashes for each letter of the
alphabet. In particular, E, the most frequent letter, is represented by a single dot. In general, if our
messages are such that some symbols appear very frequently and some very rarely, we can encode
data more efficiently (i.e., using fewer bits per message) if we assign shorter codes to the frequent
symbols. Consider the following alternative code for the letters A through H:
A 0       B 100
C 1010    D 1011
E 1100    F 1101
G 1110    H 1111
With this code, the same message as above is encoded as the string
100010100101101100011010100100000111001111
This string contains 42 bits, so it saves more than 20% in space (12 of the original 54 bits, about 22%) in comparison with the fixed-length code shown above.
One of the difficulties of using a variable-length code is knowing when you have reached the end of a
symbol in reading a sequence of zeros and ones. Morse code solves this problem by using a special
separator code (in this case, a pause) after the sequence of dots and dashes for each letter. Another
solution is to design the code in such a way that no complete code for any symbol is the beginning (or
prefix) of the code for another symbol. Such a code is called a
prefix code. In the example above, A is
encoded by 0 and B is encoded by 100, so no other symbol can have a code that begins with 0 or with
100.
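The prefix condition is easy to state as a program. Here is a sketch of a predicate (the names prefix? and prefix-code? are illustrative) that, given a list of codes, each represented as a list of bits, tests that no code is a prefix of another:

(define (prefix? p s)   ; is list p a prefix of list s?
  (cond ((null? p) true)
        ((null? s) false)
        ((equal? (car p) (car s)) (prefix? (cdr p) (cdr s)))
        (else false)))

(define (prefix-code? codes)
  (define (clashes? code others)
    (cond ((null? others) false)
          ((or (prefix? code (car others))
               (prefix? (car others) code)) true)
          (else (clashes? code (cdr others)))))
  (cond ((null? codes) true)
        ((clashes? (car codes) (cdr codes)) false)
        (else (prefix-code? (cdr codes)))))

Applied to the variable-length code above, (prefix-code? '((0) (1 0 0) (1 0 1 0) (1 0 1 1) (1 1 0 0) (1 1 0 1) (1 1 1 0) (1 1 1 1))) returns true.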
In general, we can attain significant savings if we use variable-length prefix codes that take advantage
of the relative frequencies of the symbols in the messages to be encoded. One particular scheme for
doing this is called the Huffman encoding method, after its discoverer, David Huffman. A Huffman
code can be represented as a binary tree whose leaves are the symbols that are encoded. At each
non-leaf node of the tree there is a set containing all the symbols in the leaves that lie below the node.
In addition, each symbol at a leaf is assigned a weight (which is its relative frequency), and each
non-leaf node contains a weight that is the sum of all the weights of the leaves lying below it. The
weights are not used in the encoding or the decoding process. We will see below how they are used to
help construct the tree.
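To make this concrete, here is a minimal sketch of one way such a tree might be represented with the list operations of this chapter (the constructor and selector names are illustrative assumptions, not fixed by the text so far): a leaf carries a symbol and its weight, and an interior node carries its two branches together with the merged symbol set and total weight.

(define (make-leaf symbol weight) (list 'leaf symbol weight))
(define (leaf? node) (eq? (car node) 'leaf))
(define (symbol-leaf leaf) (cadr leaf))
(define (weight-leaf leaf) (caddr leaf))

(define (make-code-tree left right)
  (list left
        right
        (append (symbols left) (symbols right))  ; symbols below this node
        (+ (weight left) (weight right))))       ; total weight below it
(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))
(define (symbols tree)
  (if (leaf? tree) (list (symbol-leaf tree)) (caddr tree)))
(define (weight tree)
  (if (leaf? tree) (weight-leaf tree) (cadddr tree)))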
Figure 2.18: A Huffman encoding tree.
Figure 2.18 shows the Huffman tree for the A-through-H code given above. The weights at the leaves
indicate that the tree was designed for messages in which A appears with relative frequency 8, B with
relative frequency 3, and the other letters each with relative frequency 1.
Given a Huffman tree, we can find the encoding of any symbol by starting at the root and moving
down until we reach the leaf that holds the symbol. Each time we move down a left branch we add a 0
to the code, and each time we move down a right branch we add a 1. (We decide which branch to
follow by testing to see which branch either is the leaf node for the symbol or contains the symbol in
its set.) For example, starting from the root of the tree in figure 2.18, we arrive at the leaf for D by
following a right branch, then a left branch, then a right branch, then a right branch; hence, the code
for D is 1011.
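Under the representation sketched above, this walk can be written directly (encode-symbol is an illustrative name): at each interior node, emit a 0 or a 1 according to which branch's symbol set contains the symbol.

(define (encode-symbol symbol tree)
  (cond ((leaf? tree) '())   ; reached the symbol's leaf; no more bits
        ((memq symbol (symbols (left-branch tree)))
         (cons 0 (encode-symbol symbol (left-branch tree))))
        ((memq symbol (symbols (right-branch tree)))
         (cons 1 (encode-symbol symbol (right-branch tree))))
        (else (error "symbol not in tree" symbol))))

For a variable tree holding the tree of figure 2.18, (encode-symbol 'd tree) would give (1 0 1 1).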
To decode a bit sequence using a Huffman tree, we begin at the root and use the successive zeros and
ones of the bit sequence to determine whether to move down the left or the right branch. Each time we
come to a leaf, we have generated a new symbol in the message, at which point we start over from the
root of the tree to find the next symbol. For example, suppose we are given the tree above and the
sequence 10001010. Starting at the root, we move down the right branch (since the first bit of the string is 1), then down the left branch (since the second bit is 0), then down the left branch (since the
third bit is also 0). This brings us to the leaf for B, so the first symbol of the decoded message is B.
Now we start again at the root, and we make a left move because the next bit in the string is 0. This
brings us to the leaf for A. Then we start again at the root with the rest of the string 1010, so we move
right, left, right, left and reach C. Thus, the entire message is BAC.
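This decoding process translates into a short procedure. The following is a sketch in terms of the selectors assumed above; decode consumes the list of bits, restarting from the root each time a leaf is reached:

(define (decode bits tree)
  (define (choose-branch bit branch)
    (cond ((= bit 0) (left-branch branch))
          ((= bit 1) (right-branch branch))
          (else (error "bad bit" bit))))
  (define (decode-1 bits current-branch)
    (if (null? bits)
        '()
        (let ((next (choose-branch (car bits) current-branch)))
          (if (leaf? next)
              (cons (symbol-leaf next)       ; emit a symbol, restart at root
                    (decode-1 (cdr bits) tree))
              (decode-1 (cdr bits) next)))))
  (decode-1 bits tree))

With a tree variable holding the tree of figure 2.18, (decode '(1 0 0 0 1 0 1 0) tree) would yield (b a c), as in the example above.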
Generating Huffman trees
Given an “alphabet” of symbols and their relative frequencies, how do we construct the “best” code?
(In other words, which tree will encode messages with the fewest bits?) Huffman gave an algorithm
for doing this and showed that the resulting code is indeed the best variable-length code for messages
where the relative frequency of the symbols matches the frequencies with which the code was constructed.