1.ABSTRACT
It is not always possible or practical to store very large data sets in a relational database. For example, in a rapidly changing large-scale environment with distributed control, such as the World-Wide Web, a strict relational approach is not feasible. Nevertheless, it is desirable for a user to be able to make SQL queries on the entire data set, because SQL is well known, supported, and understood. We introduce “Just-In-Time Databases”, which allow the user to query the entire data set as though it were in a relational database. The underlying engine brings data into the relational database only when it is required by user queries. We describe how “Just-In-Time Databases” are implemented, using the World-Wide Web as an example.
1.1Keywords
Semi-structured information, Web, relational databases, SQL, federated databases, Internet
2.INTRODUCTION
2.1Motivation
With the rise of federated databases comes the need to query very large, heterogeneous, distributed sets of read-only documents. Relational databases provide a powerful abstraction for manipulating data. The division of data into relations helps users comprehend the data and allows them to specify queries in Structured Query Language (SQL). This need to structure information is greatest when a data set is very large; however, traditional database management systems require that the entire data set be stored in a relational database, which is not always possible.
Although not usually classified as such, the World-Wide Web can be seen as a federated database. Specifically, it is a very large distributed collection of documents, composed of a number of different types of semi-structured information. Because of the Web’s size, dispersion, and flux, it would be impossible to store the entire Web in a relational database.
We introduce “Just-In-Time Databases”, which provide the user with the illusion that an entire massive data set is in a local relational database. In reality, data is retrieved or computed only as needed. This allows the benefits of the relational database model to be obtained even when it is impractical or impossible to actually store all the data in a relational database. We provide algorithms for implementing Just-In-Time Databases, using the Web as an example.
2.2Overview
A Just-In-Time Database specification consists of:
-
a relational database schema, which is first used to define tables in an underlying locally-stored traditional relational database management system (RDBMS)
-
a set of functions for instantiating complete table rows, given values for a subset of columns
When the user issues a query on a JIT database, the JIT interpreter does the following:
-
Determines what data must be filled into the local database in order to answer the query, computing and inserting the data if it is not already present.
-
Passes the original query to the underlying RDBMS, returning the answer to the user.
The effect of the JIT interpreter is to provide the user with the illusion that the entire data set is stored in the relational database. In actuality, the relational database serves as a cache on the data set, containing only the data needed to answer queries that have been made.
The next section contains definitions useful for describing the behavior of the system more precisely, followed by the algorithms used to determine what information must be inserted into the actual relational database. We then discuss a JIT database for the Web. We then conclude with a discussion of the significance of this work and its relation to other research.
3.DEFINITIONS
3.1Defining columns
A table in a relational database consists of a set of columns. We call a column c a defining column if an executable function exists that maps any value of c to all sets of legal values for the other columns. This function may not refer to the table itself (e.g., do “select” queries).
For example, if a table squares contains two columns, base and square, representing an integer and its square, respectively, both columns would be defining columns. (Squares and other tables used in this section are shown in Figure 1. Throughout this paper, table names appear in small caps, column names in bold face, and function names in italics.) If the value of base is known, the legal value for square can be easily computed. If the value of square is known, the legal associated values for base can be computed. If square is zero, there will be precisely one legal value for base: zero. If square is the square of any other integer, there will be two legal values for base, the positive and negative square roots. If square is not the square of an integer, there will be no legal values for base. In any case, knowing the value of either column determines the legal values for the other column.
On the other hand, a table election (Figure 1) whose two columns voter or ballot would have no defining columns if it described a secret ballot. Knowing the identity of a voter or the contents of a ballot would not allow determination of the other column value.
3.2Fetch
We define an imperative operator fetch that, when given the values of one or more defining columns, augments a table by adding all legal relations having the given values. The syntax of the basic form of fetch is:
fetch tableName(col0=value0, … colN = valueN)
where tableName is the name of a table, cole0 through colN are defining columns, and value0 through valueN are, respectively, legal values. For example, consider the following instance:
fetch squares(base=4)
which would add the relation {base=4, square=16} to table squares, unless it was already present. It is permissible that, as a side effect, additional relations may be added to squares or other tables.
|
Squares
|
|
|
Election
|
|
Primes
|
base
|
square
|
|
voter
|
ballot
|
|
prime
|
0
|
0
|
|
Alice
|
yes
|
|
2
|
-1
|
1
|
|
Bob
|
no
|
|
3
|
1
|
1
|
|
Carol
|
yes
|
|
5
|
-2
|
4
|
|
David
|
yes
|
|
7
|
Figure 1: Tables squares, election, and primes, used as examples
fetchStatement
|
=
|
“fetch” “(” fetchList “)”
|
|
|
( “from ”
|
|
|
(“where ” logicExpression)?
|
|
|
(“group by ” columnList)?
|
|
|
(“having ” logicExpression)? )?
|
|
|
|
fetchList
|
=
|
namedArgument (“, ” namedArgument)*
|
|
|
|
namedArgument
|
=
|
“=” fetchExpression
|
|
|
|
fetchExpression
|
=
|
|
|
expression “|” expression
expression “&” expression
expression
|
Figure 2: The grammar for fetch statements. Nonterminals tableList, logicExpression, columnList, orderList, and expression are defined elsewhere [15] and have the same meanings as the corresponding SQL nonterminals.
The full syntax for the fetch statement is shown in Figure 2. An optional “from” clause can be used to specify a set of values for a defining column. For example, if table Primes (Figure 1) has a column number containing the prime numbers less than 1000, the following statement would insert into Squares the primes between ten and fifty and their squares (unless they were already in the table):
fetch squares(base=p.number)
from primes p
where (p.number > 10 and p.number <50)
Within the fetch statement, a defined column can be assigned a single value, a conjunction, or a disjunction. For example, the following table would augment squares with the squares of the numbers from one to five:
fetch squares(base=1|2|3|4|5)
The following statement would add no entries to squares:
fetch squares(base=1&2)
since no value for square is associated with base values of both one and two. A conjunction that could augment Squares is:
fetch squares(base=-2&2)
since there is a value for square (4) that is associated with both base values.