Back to EveryPatent.com
United States Patent |
5,696,960
|
Bhargava
,   et al.
|
December 9, 1997
|
Computer program product for enabling a computer to generate uniqueness
information for optimizing an SQL query
Abstract
A system and method of determining uniqueness properties of an expression.
A root of the expression is first determined, where the root is one of a
base relation, a unary operation or a binary operation. Once the root is
determined, a first procedure of an augmented unique process is called to
determine uniqueness properties of a child of that root. The procedure
called is chosen based on the determined root. Where the root is a base
relation, a first procedure of a uniqueness process is applied to
determine the uniqueness properties of the base relation. Where the root
is a unary or binary operation, the called procedure is suspended, a
second procedure of the augmented unique process is called to determine
the uniqueness properties of the child of the operation, and this process
is repeated until a base relation is reached. Once a base relation is
reached, the first procedure of the uniqueness process is applied to
determine the uniqueness properties of the reached base relation. A next
procedure of a uniqueness process is applied to determine the uniqueness
properties of a parent operator of the based relation. The procedure
applied is chosen based on a type of operation represented by the parent.
The process then unwinds to determine the uniqueness properties for each
ancestor of the base relation(s).
Inventors:
|
Bhargava; Gautam (Cupertino, CA);
Goel; Piyush (Monte Sereno, CA);
Iyer; Balakrishna R. (San Jose, CA)
|
Assignee:
|
International Business Machines Corporation (Armonk, NY)
|
Appl. No.:
|
460561 |
Filed:
|
June 2, 1995 |
Current U.S. Class: |
707/2 |
Intern'l Class: |
G06F 017/30 |
Field of Search: |
395/600,602
364/DIG. 1
|
References Cited
U.S. Patent Documents
5423035 | Jun., 1995 | DePrez | 395/600.
|
Other References
"Outer Joins and Filters fro Instantiating Objects from Relational
Databases through Views", Lee et al., IEEE Transactions on Knowledge and
Data Engineering, vol. 6, No. 1 Jul., 1991.
Chen, A.L.P., "Outerjoin Optimization in Multidatabase Systems," 2nd
International Symposium On Databases In Parallel And Distributed Systems,
pp. 211-218, 1990.
Galindo-Legaria, C., and Rosenthal, A., "How to Extend A Conventional
Optimizer To Handle One- And Two-Sided Outerjoin," Proceedings IEEE Data
Engineering Conference, pp. 402-409, 1992.
Paulley, G.N. and Larson, P-A., "Exploiting Uniqueness In Query
Optimization," CASCON, pp. 804-822, vol. II, Oct. 1993.
Pirahesh, H., Hellerstein, J.M. and Hasan, W., "Extensible/Rule Based Query
Rewrite Optimization In Starburst," SIGMOD, pp. 39-48, San Diego, CA,
Jun., 1992.
Date, C.J., et al., "The Role of Functional Dependence In Query
Decomposition," Relational Database Writings 1989-1991, pp. 133-154.
|
Primary Examiner: Black; Thomas G.
Assistant Examiner: Alam; Hosain T.
Attorney, Agent or Firm: Sterne, Kessler, Goldstein & Fox, P.L.L.C., Johnson; Prentiss Wayne
Parent Case Text
This application is a division of application Ser. No. 08/366,560, filed
Dec. 30, 1994.
Claims
Having thus described our invention, what we claim as new and desire to
secure by Letters Patent is:
1. A program storage device readable by a machine, tangibly embodying a
program of instructions executable by the machine to perform method steps
for determining uniqueness properties of an expression for query
optimization, said method steps comprising the steps of:
a) determining a root of the expression, wherein said root is one of a base
relation, a unary operation or a binary operation;
b) calling a first one of a plurality of procedures of an augmented unique
process to determine uniqueness properties of a child of the root, wherein
the procedure called is chosen based on said determined root;
c) where said root is a base relation, applying a first procedure of a
uniqueness process to determine the uniqueness properties of said base
relation;
d) where said root is a unary or binary operation,
(i) suspending the called procedure,
(ii) calling a second one of a plurality of procedures of said augmented
unique process to determine the uniqueness properties of the child of said
unary or binary operation,
(iii) repeating said steps i and ii until a base relation is reached, and
(iv) applying a first procedure of a uniqueness process to determine the
uniqueness properties of said reached base relation; and
e) applying a second of a plurality of procedures of a uniqueness process
to determine the uniqueness properties of a parent operator of said base
relation, wherein said procedure applied is chosen based on a type of
operation represented by said parent; and
f) repeating said step (e) for each ancestor of a base relation.
2. The program storage device of claim 1, wherein if said root is a binary
operator having a left child and a right child, said second of a plurality
of procedures of said uniqueness process comprises the steps of:
concatenating each element in a unique set of a left child with each
element in a unique set of the right child to create a first set;
if the operator is a join or left outer join operator, and a predicate
P.sub.LR is an equi-join, and if the attributes that are common to both a
schema of the predicate and a schema of the right child are in a unique
set of the right child, then a second set is defined as a unique set of
the left child, otherwise, said second set is defined as an empty set;
if the operator is a join or right outer join operator, and the predicate
P.sub.LR is an equi-join, and if the attributes that are common to both
the schema of the predicate and the schema of the left child are in the
unique set of the left child, then a third set is defined as a unique set
of the left child, otherwise said third set is defined as the empty set;
and
defining uniqueness properties of the operator as minimum attributes of
said first, second and third sets.
3. The program storage device of claim 2, wherein if said root is a binary
intersect operator, said second of a plurality of procedures of said
uniqueness process further comprises the steps of:
defining a strongly bound set as a minimum of a renamed combination of a
strongly bound set of the left child and a strongly bound set of the right
child;
defining a weakly bound set and a join attribute set as the empty set;
if the intersect operator is an intersect distinct operator, defining a
unique set as the minimum of the renamed set of the combination of the
output attributes of the intersect distinct and the unique set of the left
child and the unique set of the right child; and
if the intersect operator is not an intersect distinct operator, defining
the unique set as the minimum of the renamed set of the combination of the
unique set of the left child and the unique set of the right child.
4. The program storage device of claim 2, wherein if said root is a join
operation, said second of a plurality of procedures of said uniqueness
process further comprises the steps of:
defining a strongly bound set of said attributes as the union of a strongly
bound set of the left child and the strongly bound set of the right child;
and
defining a weakly bound set of said attributes as the union of a weakly
bound set of the left child and the weakly bound set of the right child;
wherein if said root is a left-outer join operation, said second of a
plurality of procedures of said uniqueness process further comprises the
steps of:
defining a strongly bound set of said attributes as a strongly bound set of
the left child; and
defining a weakly bound set of said attributes as the union of a weakly
bound set of the left child and the weakly bound set of the right child
and the strongly bound set of the right child;
wherein if said root is a right-outer join operation, said second of a
plurality of procedures of said uniqueness process further comprises the
steps of:
defining a strongly bound set of said attributes as the strongly bound set
of the right child; and
defining a weakly bound set of said attributes as the union of a weakly
bound set of the left child and the weakly bound set of the right child
and the strongly bound set of the left child; and
wherein if said root is a full-outer join operation, said second of a
plurality of procedures of said uniqueness process further comprises the
steps of:
defining a strongly bound set of said attributes as an empty set; and
defining a weakly bound set of said attributes as the union of a weakly
bound set of the left child and the weakly bound set of the right child
and the strongly bound set of the left child and the strongly bound set of
the right child.
5. The program storage device of claim 4, wherein said method steps further
comprise the steps of:
for each attribute in said strongly bound set and said weakly bound set,
determining a closure of a predictor set under the predicate,
wherein if the predicate contains subclause A=B, one of the following steps
is performed
(1) if said root is a join operation or a full-outer join operation, the
attribute level of B is the same as the level of A,
(2) if said root is a left-outer join, the attribute level of B is one
lower than the attribute level of A, and
(3) if said root is a right-outer join, the attribute level of B is one
higher than the attribute level of A.
6. The program storage device of claim 2, wherein said second of a
plurality of procedures of said uniqueness process comprises the step of:
defining a join attribute set as the union of the attribute set of the left
child and the attribute set of the right child and the attributed
mentioned in the predicate.
7. The program storage device of claim 1, wherein said first procedure of
said uniqueness process comprises the steps of:
setting a unique set for said base relation equal to a key set for said
base relation; and
defining a strongly bound set, a weakly bound set and a join attribute set
as empty sets for the expression.
8. The program storage device of claim 1, wherein if said root is a
non-duplicate removing projection operation, said second of a plurality of
procedures of said uniqueness process comprises the steps of:
defining a unique set as a unique set of a child of the root;
defining a strongly bound set as the intersection of a strongly bound set
of the child with a set of all attributes of said root;
defining a weakly bound set as the intersection of a weakly bound set of
the child with a set of all attributes of said root; and
defining a join attribute set as a join attribute set of the child.
9. The program storage device of claim 1, wherein if said root is a
selector operation, said second of a plurality of procedures of said
uniqueness process comprises the steps of:
determining whether there exists a set of attributes, K, in a unique set of
a child of the expression such that all attributes in K are strongly bound
by a predicate to determine whether a one-tuple condition exists;
if said one-tuple condition exists, adding all attributes currently in a
node to a unique set of said node and to a strongly bound set of said
node;
if said one-tuple condition does not exist, adding attributes in K that are
not strongly bound to said unique set and to said strongly bound set of
said node;
defining a weakly bound set and a join attribute set as empty sets for said
node; and
defining each attribute that is strongly bound as being a level one
attribute.
10. The program storage device of claim 1, wherein if said root is a delta
projection operation, said second of a plurality of procedures of said
uniqueness process comprises the steps of:
(a) defining an attribute set as including all attributes for the
projection;
(b) determining whether said attribute set contains a joining attribute set
of a child of the projection;
(c) if said attribute set contains a joining attribute set of the child,
then for each attribute in said attribute set that is found in a weakly
bound set of its child as well as in said attribute set, determining
whether a level of said attribute is lower than the level of another
attribute in said attribute set;
(d) if said level of said attribute is lower than the level of another
attribute in said attribute set, removing said attribute from said
attribute set.
11. The program storage device of claim 10, wherein said second of a
plurality of procedures of said uniqueness process further comprises the
step of removing from said attribute set all of said attributes that are
strongly bound in the child.
12. The program storage device of claim 11, wherein said second of a
plurality of procedures of said uniqueness process further comprises the
steps of:
determining whether there are any attributes remaining in said attribute
set;
if no attributes remain in said attribute set, a unique set is defined as
all of said attributes originally in said defined attribute set;
if one or more of said attributes remain in said attribute set, said unique
set is defined as all of the attributes in the child combined with all of
said attributes remaining in said attribute set.
13. The program storage device of claim 12, wherein said second of a
plurality of procedures of said uniqueness process further comprises the
steps of:
defining a strongly bound set as the intersection of a strongly bound set
of the child with a set comprising all of said attributes originally in
said defined attribute set; and
defining a weakly bound set as the intersection of a weakly bound set of
the child with a set comprising all of said attributes originally in said
defined attribute set.
14. The program storage device of claim 1, further comprising the step of
using said uniqueness properties to optimize an SQL query.
15. A computer program product for use with a computer system, said
computer program product comprising:
a computer usable medium having computer readable program code means
embodied in said medium for causing the computer system to determine
uniqueness properties of an expression for query optimization, said
computer readable program code means comprising:
computer readable program code means for enabling the computer system to
determine a root of the expression, wherein said root is one of a base
relation, a unary operation or a binary operation;
computer readable program code means for enabling the computer system to
call a first one of a plurality of procedures of an augmented unique
process to determine uniqueness properties of a child of the root, wherein
the procedure called is chosen based on said determined root;
computer readable program code means for enabling the computer system to
apply a first procedure of a uniqueness process to determine the
uniqueness properties of said base relation where said root is a base
relation;
computer readable program code means usable by the computer system where
said root is a unary or binary operation, including,
computer readable program code means for enabling the computer system to
suspend the called procedure,
computer readable program code means for enabling the computer system to
call a second one of a plurality of procedures of said augmented unique
process to determine the uniqueness properties of the child of said unary
or binary operation, and
computer readable program code means for enabling the computer system to
apply a first procedure of a uniqueness process to determine the
uniqueness properties of a reached base relation; and
computer readable program code means for enabling the computer system to
apply a second of a plurality of procedures of a uniqueness process to
determine the uniqueness properties of a parent operator of said base
relation, wherein said procedure applied is chosen based on a type of
operation represented by said parent.
16. The computer program product of claim 15, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process comprises
computer readable program code means that is usable by the computer system
where said root is a binary operator having a left child and a right
child, said computer readable program code means that is usable by the
computer system where said root is a binary operator having a left child
and a right child, including:
computer readable program code means for enabling the computer system to
concatenate each element in a unique set of a left child with each element
in a unique set of the right child to create a first set;
computer readable program code means for enabling the computer system to
define a second set as a unique set of the left child if a first set of
conditions is satisfied, said first set of conditions including whether
the operator is a join or left outer join operator, a predicate P.sub.LR
is an equi-join, and whether the attributes that are common to both a
schema of the predicate and a schema of the right child are in a unique
set of the right child;
computer readable program code means for enabling the computer system to
define a second set as an empty set if said first set of conditions is not
satisfied;
computer readable program code means for enabling the computer system to
define a third set as a unique set of the left child if a second set of
conditions is satisfied, said second set of conditions including whether
the operator is a join or right outer join operator, the predicate
P.sub.LR is an equi-join, and the attributes that are common to both the
schema of the predicate and the schema of the left child are in the unique
set of the left child;
computer readable program code means for enabling the computer system to
define a third set as an empty set if said second set of conditions is not
satisfied; and
computer readable program code means for enabling the computer system to
define uniqueness properties of the operator as minimum attributes of said
first, second and third sets.
17. The computer program product of claim 16, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process comprises
computer readable program code means that is usable by the computer system
where said root is a binary intersect operator, said computer readable
program code means that is usable by the computer system where said root
is a binary intersect operator, including:
computer readable program code means for enabling the computer system to
define a strongly bound set as a minimum of a renamed combination of a
strong bind set of the left child and a strong bind set of the right
child;
computer readable program code means for enabling the computer system to
define, a weakly bound set and a join attribute set as the empty set;
computer readable program code means for enabling the computer system to
define, if the intersect operator is an intersect distinct operator, a
unique set as the minimum of the renamed set of the combination of the
output attributes of the intersect distinct and the unique set of the left
child and the unique set of the right child; and
computer readable program code means for enabling the computer system to
define, if the intersect operator is not an intersect distinct operator,
the unique set as the minimum of the renamed set of the combination of the
unique set of the left child and the unique set of the right child.
18. The computer program product of claim 16, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process comprises:
computer readable program code means for enabling the computer system to
define, if said root is a join operation, a strongly bound set of said
attributes as the union of a strongly bound set of the left child and the
strongly bound set of the right child, and a weakly bound set of said
attributes as the union of a weakly bound set of the left child and the
weakly bound set of the right child;
computer readable program code means for enabling the computer system to
define, if said root is a left-outer join operation, a strongly bound set
of said attributes as a strongly bound set of the left child, and a weakly
bound set of said attributes as the union of a weakly bound set of the
left child and the weakly bound set of the right child and the strongly
bound set of the right child;
computer readable program code means for enabling the computer system to
define, if said root is a right-outer join operation, a strongly bound set
of said attributes as the strongly bound set of the right child, and a
weakly bound set of said attributes as the union of a weakly bound set of
the left child and the weakly bound set of the right child and the
strongly bound set of the left child;
computer readable program code means for enabling the computer system to
define, if said root is a full-outer join operation, a strongly bound set
of said attributes as an empty set, and a weakly bound set of said
attributes as the union of a weakly bound set of the left child and the
weakly bound set of the right child and the strongly bound set of the left
child and the strongly bound set of the right child.
19. The computer program product of claim 18, further comprising:
computer readable program code means for enabling the computer system to
determine a closure of a predictor set under the predicate for each
attribute in said strongly bound set and said weakly bound set,
computer readable program code means for enabling the computer system to
perform, if the predicate contains subclause A=B, one of the following:
(1) if said root is a join operation or a full-outer join operation, the
attribute level of B is the same as the level of A,
(2) if said root is a left-outer join, the attribute level of B is one
lower than the attribute level of A, and
(3) if said root is a right-outer join, the attribute level of B is one
higher than the attribute level of A.
20. Computer program product of claim 16, wherein said computer readable
program code means for enabling the computer system to apply a second of a
plurality of procedures of a uniqueness process comprises:
computer readable program code means for enabling the computer system to
define a join attribute set as the union of the attribute set of the left
child and the attribute set of the right child and the attributed
mentioned in the predicate.
21. The computer program product of claim 15, wherein said computer
readable program code means for enabling the computer system to apply a
first procedure of a uniqueness process comprises:
computer readable program code means for enabling the computer system to
set a unique set for said base relation equal to a key set for said base
relation; and
computer readable program code means for enabling the computer system to
define a strongly bound set, a weakly bound set and a join attribute set
as empty sets for the expression.
22. The computer program product of claim 15, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process comprises
computer readable program code means that is usable by the computer system
where said root is a non-duplicate removing projection operation, said
computer readable program code means that is usable by the computer system
where said root is a non-duplicate removing projection operation,
including:
computer readable program code means for enabling the computer system to
define a unique set as a unique set of a child of the root;
computer readable program code means for enabling the computer system to
define a strongly bound set as the intersection of a strongly bound set of
the child with a set of all attributes of said root;
computer readable program code means for enabling the computer system to
define a weakly bound set as the intersection of a weakly bound set of the
child with a set of all attributes of said root; and
computer readable program code means for enabling the computer system to
define a join attribute set as a join attribute set of the child.
23. The computer program product of claim 15, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process comprises
computer readable program code means that is usable by the computer system
where said root is a selector operation, said computer readable program
code means that is usable by the computer system where said root is a
selector operation, including:
computer readable program code means for enabling the computer system to
determine whether there exists a set of attributes, K, in a unique set of
a child of the expression such that all attributes in K are strongly bound
by a predicate to determine whether a one-tuple condition exists;
computer readable program code means for enabling the computer system to
add all attributes currently in a node to a unique set of said node and to
a strongly bound set of said node if said one-tuple condition exists;
computer readable program code means for enabling the computer system to
add attributes in K that are not strongly bound to said unique set and to
said strongly bound set of said node if said one-tuple condition does not
exist;
computer readable program code means for enabling the computer system to
define a weakly bound set and a join attribute set as empty sets for said
node; and
computer readable program code means for enabling the computer system to
define each attribute that is strongly bound as being a level one
attribute.
24. The computer program product of claim 15, wherein said computer usable
medium having computer readable program code means embodied in said medium
enables the computer system to optimize an SQL query.
25. The computer program product of claim 15, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process comprises
computer readable program code means that is usable by the computer system
where said root is a delta projection operation, said computer readable
program code means that is usable by the computer system where said root
is a delta projection operation, including:
computer readable program code means for enabling the computer system to
define an attribute set as including all attributes for the projection;
computer readable program code means for enabling the computer system to
determine whether said attribute set contains a joining attribute set of a
child of the projection;
computer readable program code means for enabling the computer system to
determine whether a level of said attribute is lower than the level of
another attribute in said attribute set if the attribute is found in a
weakly bound set of its child as well as in said attribute set; and
computer readable program code means for enabling the computer system to
remove said attribute from said attribute set if said level of said
attribute is lower than the level of another attribute in said attribute
set.
26. The computer program product of claim 25, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process further
comprises computer readable program code means for enabling the computer
system to remove from said attribute set all of said attributes that are
strongly bound in the child.
27. The computer program product of claim 26, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process further
comprises:
computer readable program code means for enabling the computer system to
determine whether there are any attributes remaining in said attribute
set;
computer readable program code means for enabling the computer system to
define a unique set as all of said attributes originally in said defined
attribute set if no attributes remain in said attribute set; and
computer readable program code means for enabling the computer system to
define said unique set as all of the attributes in the child combined with
all of said attributes remaining in said attribute set if one or more of
said attributes remain in said attribute set.
28. The computer program product of claim 27, wherein said computer
readable program code means for enabling the computer system to apply a
second of a plurality of procedures of a uniqueness process further
comprises:
computer readable program code means for enabling the computer system to
define a strongly bound set as the intersection of a strongly bound set of
the child with a set comprising all of said attributes originally in said
defined attribute set; and
computer readable program code means for enabling the computer system to
define a weakly bound set as the intersection of a weakly bound set of the
child with a set comprising all of said attributes originally in said
defined attribute set.
Description
TECHNICAL FIELD
The present invention relates generally to query optimization, and more
specifically to a system and method for generating uniqueness properties
of intermediate query subexpressions.
BACKGROUND ART
Relational DataBase Management System (RDBMS) products using a Structured
Query Language (SQL) interface are well known in the art. The SQL
interface has evolved into a standard language for RDBMS products and has
been adopted as such by both the American Nationals Standard Organization
(ANSI) and the International Standards Organization (ISO). In RDBMS
products, all data is externally structured into tables. The SQL interface
allows users to formulate relational operations on the tables either
interactively, in batch files, or embedded in host languages such as C,
COBOL, etc. Operators are provided in SQL that allow the user to
manipulate the data, wherein each operator operates on either one or two
tables and produces a new table as a result. The power of SQL lies in its
ability to link information from multiple tables or views together to
perform complex sets of procedures with a single statement.
The current state-of-the-art in query optimization provides few solutions
for optimizing query expressions involving joins, outer joins and full
outer joins. Some have suggested that queries could be optimized if
certain uniqueness information is known about relations in a given
expression. In particular, C. J. Date, and Hugh Darwen, in "Relational
Database Writings 1989-1991," pp 133-153, indicate that uniqueness
information can be used to optimize database queries.
However, there is little guidance regarding how to determine this much
needed uniqueness information. Thus, if full advantage of certain
optimization techniques is to be obtained, a technique for determining
uniqueness information is needed.
DISCLOSURE OF INVENTION
The present invention is directed toward a system and method for
determining uniqueness properties of an expression so that queries
containing join and outer join operations can be optimized. According to
the invention, two processes are implemented to determine the uniqueness
properties of a given expression.
The first process, termed an augmented unique process, is a recursive
process applied to the roots of the expression in question. The augmented
unique process is used to call a second process, termed a uniqueness
process. The uniqueness process is used to determine uniqueness properties
of the elements of the expression in question. In operation, the augmented
unique process begins at the root of the expression in question. The
augmented unique process calls an appropriate procedure depending on the
child, or children, of the root. Subsequent procedures of the augmented
unique process are called in a recursive manner depending on the progeny
of the root.
When a base relation is reached, the augmented unique process calls an
appropriate procedure of the uniqueness process to determine the
uniqueness properties of that base relation. The augmented unique process
`unwinds` by calling an appropriate procedure of the uniqueness process
for each of the progenitors of the base relation. If the root of the
expression in question is a binary operator, the processes described above
are repeated for the other progeny of the binary operator.
A feature of the invention is that the uniqueness properties determined for
the relations in the expression can be used for numerous applications. For
example, in one application, the uniqueness properties are used to
determine whether an intersect all operation can be replaced by an
intersect distinct operation, thus simplifying the expression.
The uniqueness properties provide important information pertaining to the
attributes of the expression. For each relation, base or otherwise, the
uniqueness properties determined can include: the weakly bound set, the
strongly bound set, the predictor set and the key set.
Further features and advantages of the present invention, as well as the
structure and operation of various embodiments of the present invention,
are described in detail below with reference to the accompanying drawings.
BRIEF DESCRIPTION OF DRAWINGS
The present invention is described with reference to the accompanying
drawings. In the drawings, like reference numbers indicate identical or
functionally similar elements. Additionally, the left-most digit(s) of a
reference number identifies the drawing in which the reference number
first appears.
FIG. 1 is a diagram illustrating an exemplary computer hardware embodiment
in which the present invention can be implemented;
FIG. 2 is an operational flow diagram illustrating a method of interpreting
and executing SQL statements;
FIG. 3 is an operational flow diagram illustrating a method of interpreting
and executing SQL statements embedded in source code;
FIG. 4 is an operational flow diagram illustrating an application of an
augmented unique process according to one embodiment of the invention;
FIG. 5 is an operational flow diagram illustrating steps associated with
the augmented unique process according to one embodiment of the invention;
FIG. 6 is an operational flow diagram illustrating an application of
procedure two of the uniqueness process according to one embodiment of the
invention;
FIG. 7 is an operational flow diagram illustrating an application of
procedure three of the uniqueness process according to one embodiment of
the invention;
FIG. 8 is an operational flow diagram illustrating an application of
procedure four of the uniqueness process according to one embodiment of
the invention;
FIG. 9 is an operational flow diagram illustrating an application of
procedure five of the uniqueness process according to one embodiment of
the invention; and
FIG. 10 is an operational flow diagram illustrating an application of
procedure six of the uniqueness process according to one embodiment of the
invention.
BEST MODE FOR CARRYING OUT THE INVENTION
The present invention is directed toward a system and method for optimizing
SQL queries. According to the invention, uniqueness properties of each
subexpression of an SQL query are determined using a recursive process
called the augmented unique process. The augmented unique process
recursively determines the uniqueness properties of each relation using a
uniqueness process.
Further according to the invention, a concept of strong and weak bindings
is used to propagate uniqueness information up an expression tree for a
query. Then, for example, such information can be used to eliminate
unnecessary duplicate removal operations, convert Intersect All to
Intersect Distinct .andgate..sub.a to .andgate..sub.d which can then be
converted to joins and/or used for "what-if" .delta.-pushdown to do
subtree pruning. These transformations can provide orders of magnitude
performance improvement in query execution times.
1. Environment of the Invention
FIG. 1 illustrates an exemplary computer hardware environment that in which
the present invention can be implemented. In this exemplary environment, a
computer system 102 is comprised of one or more processors connected to
one or more electronic storage devices 104 and 106, such as disk drives,
on which one or more databases are stored.
Operators of the computer system 102 use a standard terminal interface 108,
such as IMS/DB/DC, CICS, TSO, or other interface, to perform various
search and retrieval functions, termed queries, against the databases.
These functions are typically performed using Relational DataBase
Management System (RDBMS) software that incorporates the Structured Query
Language (SQL) standard. In one embodiment of the present invention, the
RDBMS software comprises the DB2 product offered by IBM for the MVS or
OS/2 operating systems. As will become apparent to those skilled in the
relevant art, the present invention has application to any RDBMS that uses
SQL or any other data manipulation language.
The DB2 architecture includes three major components: (1) the IMS Resource
Lock Manager (IRLM) 110, the Systems Services module 112, and the Database
Services module 114. The IRLM 110 handles locking services for concurrency
control. These locking services are provided because DB2 treats data as a
shared resource, thereby allowing any number of users to access the same
data simultaneously. Therefore, concurrency control is required to isolate
users and to maintain data integrity.
The Systems Services module 112 controls the overall DB2 execution
environment, including managing log data sets 106, gathering statistics,
handling startup and shutdown, and providing management support.
At the center of the DB2 architecture is the Database Services module 114.
The Database Services module 114 contains several submodules, including
the Relational Database System (RDS) 116, the Data Manager 118, the Buffer
Manager 120 and other components 122 such as an SQL compiler/interpreter.
These modules support the functions of the SQL language, i.e., definition,
access control, retrieval, and update of user and system data.
FIG. 2 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements in an interactive
environment. Block 202 represents the input of SQL statements into the
computer from the user. Block 204 represents the step of compiling or
interpreting the SQL statements. An optimization function within block 204
may reorder the SQL query for optimum performance.
Block 206 represents the step of generating from the compiled SQL
statements a compiled set of runtime structures, termed an application
plan. Generally, the SQL statements received as input from the user
specify only the data that the user wants, but do not specify how to get
to that data. This step considers both the available access paths
(indexes, sequential reads, etc.) and system-held statistics on the data
to be accessed (the size of the table, the number of distinct values in a
particular column, etc.), to choose what it considers to be the most
efficient access path for the query. Block 208 represents the execution of
the application plan, and block 210 represents the output of the results
of the application plan to the user.
FIG. 3 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements embedded in source code
according to the present invention. Block 302 represents program source
code containing a host language (such as COBOL or C) and embedded SQL
statements. The program source code is then input to a pre-compile step
304. There are two outputs from the pre-compile step 304: a modified
source module 306 and a database request module (DBRM) 308. The modified
source module 306 contains host language calls to DB2, which the
pre-compile step 304 inserts in place of SQL statements. The DBRM 308
consists of the SQL statements from the program source code 302.
A compile and link-edit step 310 uses the modified source module 306 to
produce a load module 312, while an optimize and bind step 314 uses the
DBRM 308 to produce a compiled set of runtime structures for the
application plan 316. As indicated above in conjunction with FIG. 2, the
SQL statements from the program source code 302 specify only the data that
the user wants, but not how to get to it. The optimize and bind step 314
may reorder the SQL query so as to optimize the query. Thereafter, the
optimize and bind step 314 considers both the available access paths
(indexes, sequential reads, etc.) and system held statistics on the data
to be accessed (the size of the table, the number of distinct values in a
particular column, etc.), to choose what it considers to be the most
efficient access path for the query. The load module 312 and application
plan 316 are then executed together at step 318.
The present invention is described in terms of this example environment.
Description in these terms is provided for convenience only. It is not
intended that the invention be limited to application in this example
environment. In fact, after reading the following description, it will
become apparent to a person skilled in the relevant art how to implement
the invention in alternative environments.
2.0 Conventional Optimization Techniques
In a given query expression, information about the uniqueness (or,
"duplicate-free" status) of a sub(result) can be used to influence the
optimal plan selection for the query. For example, the paper "Exploiting
uniqueness in query optimization," by Paulley, G. N. and Larson, P-A.,
CASCON, pp. 804-822, Vol. II, October 1993, illustrates the case when an
expensive, duplicate elimination clause such as SQL's SELECT DISTINCT can
be converted to a much cheaper SELECT if it can be determined that the
result is already duplicate free.
As another example the paper "Extensible/rule based query rewrite
optimization in Starburst," by Pirahesh et al., SIGMOD, pp. 39-48, San
Diego, Calif., June 1992, illustrates how special rules are allowed to
fire in a rule-based optimizer when a "one-tuple condition" is determined.
Uniqueness propagation exploits available information about the uniqueness
of tuples in base relations (enforced, e.g., through integrity
constraints, key definitions, index uniqueness, etc.), along with
properties of algebraic expressions that preserve and/or generate unique
results, to influence the optimization of the query. For example, if it is
known that one of the operands of an intersect all operation is an
expression with unique tuples, then intersect all can be replaced by
intersect distinct. See, for example, Paulley, G. N. and Larson, P-A.,
"Exploiting uniqueness in query optimization," CASCON, pp. 804-822, Vol.
II, October 1993.
3.0 Optimization Using Uniqueness Propagation
Uniqueness propagation as described by Paulley and Larsen can be exploited
as follows for query optimization:
1. If one of the operands of the intersect all is known to have unique
tuples, a SELECT DISTINCT . . . can be applied to this sub-expression,
which may allow for the "pruning" away of some expensive outer joins from
this sub-expression. See, for example, Chen, A. L. P., "Outerjoin
optimization in multidatabase systems," 2nd International Symposium on
Databases in Parallel and Distributed Systems, 1990
2. The newly transformed intersect distinct operator could be transformed
into a join operator. See for example, Pirahesh et al., "Extensible/rule
based query rewrite optimization in Starburst," SIGMOD, pp. 39-48, San
Diego, Calif., June 1992.
3. The join operator could then be part of an extensive join, outer join,
full outer join reorderability evaluation algorithm.
This invention defines the concept of weak bindings in the context of
queries involving joins, outer joins, and intersection operations. This,
together with new uniqueness determination identities, is used to present
a technique unique for determining the unique set of a (sub)tree
corresponding to a relational (sub-)expression.
3.1 Comparison With Conventional Techniques
(Paulley, G. N. and Larson, P-A., "Exploiting uniqueness in query
optimization," CASCON, pp. 804-822, Vol. II, October 1993) provides an
algorithm to determine when the results of a join operation are duplicate
free. That work is aimed at converting SQL's SELECT DISTINCT to SELECT for
a class of queries that includes selections, projections, and joins,
together with intersection operations. That work does not consider queries
involving outer join, or full join operations, nor does it describe how
uniqueness results can propagate across join, outer join, full outer join,
and intersection operators.
4. Basic definitions
Before describing the invention in detail, some important definitions are
set forth.
Tuple
A tuple t is a mapping from a finite set of attributes, R.orgate.V, to a
set of atomic (possibly null) values where R is a non-empty set of real
attributes and V is a non-empty set of virtual attributes, R.andgate.V={
}, and the mapping t maps at least one virtual attribute v.epsilon.V to a
non-null value. For an attribute set X, t›X! represents the values
associated with attributes X under the mapping t, where X.OR
right.R.orgate.V and X.noteq.{ }.
The set of real attributes of a tuple t is the same as the schema of the
tuple in the traditional relational algebras. These attributes are
accessible to users and can be referenced externally, e.g., in user
queries, etc. On the other hand, virtual attributes are (mostly) used to
provide unique conceptual tuple-ids to tuples, and are not available for
user queries. In some cases, when real attributes are not sufficient to
identify tuples, virtual attributes have to be manipulated explicitly by
the RDBMS.
Relation
A relation r is a triple (R, V, E) where R is a non-empty set of real
attributes, V is a non-empty set of virtual attributes, and E, the
extension of relation r, is a set of tuples such that (.A-inverted.t.sub.1
.epsilon.E)(.A-inverted.t.sub.2 .epsilon.E)(t.sub.1 .noteq.t.sub.2 t.sub.1
›V!.noteq.t.sub.2 ›V!). R.orgate.V is called the schema of relation r. We
use sch(r) to denote the set of real attributes in the schema of r.
Virtual and real attributes of relations are used to provide a conceptual
framework for relations with duplicates.
Predicate
A predicate p over a set of real attributes sch(p), called the schema of p,
is a total mapping from tuples to the Boolean values {TRUE, FALSE}, where
sch(p) is the minimum set of attributes such that for all tuples t.sub.1
and t.sub.2 (t.sub.1 ›sch(p)!=t.sub.2 ›sch(p)!p(t.sub.1)=p(t.sub.2)). For
a tuple t with real schema .OR left.sch(p), p(t) is TRUE if and only if
(.A-inverted.A.epsilon.sch(p)) (substitution of t›A! for an A in p causes
it to evaluate to TRUE).
Null-intolerant
A predicate p is null-intolerant if p evaluates to FALSE for tuples
undefined on one or more attributes in sch(p). More formally, p is
null-intolerant if
(.A-inverted.t)(.E-backward..epsilon.sch(p))(t›A!=NULLp(t)=FALSE).
Throughout this document it is assumed that predicates are null-intolerant,
and predicates do not contain disjunction.
5. Algebraic operators and expressions
Following are definitions for algebraic operators and notational
conventions used in SQL queries. These definitions are helpful in
understanding later portions of this document.
5.1 Algebraic operators
In the following, let r=<R, V, E>, r.sub.1 =<R.sub.1, V.sub.1, E.sub.1 >
and r.sub.2 =<R.sub.2, V.sub.2, E.sub.2 > denote relations such that
R.sub.1 .andgate.R.sub.2 ={ } and V.sub.1 .andgate.V.sub.2 ={ }.
Projection
The projection, .pi..sup.a.sub.x (r) of relation r onto attributes X is the
relation <X, V, E'> where X.OR right.R and
E'=.{t.v.vertline.(.E-backward.t'.epsilon.E)(t=t'›X!v=t'›V!)}.
The .pi..sup.a operator is a projection operator that does not remove
"duplicates" in the real attributes part of the source expression. The
superscript a in .pi..sup.a denotes the fact that all the virtual
attributes of the source expression are included in the virtual schema of
the result expression. For ease of reading, the superscript a is omitted
from .pi..sup.a when there is no ambiguity and the operator is simply
written as .pi..
The projection.sup.c, .pi..sup.c.sub.X.sbsb.R.sub.X.sbsb.V (r), of relation
r on attributes X.sub.R X.sub.V is the relation <X.sub.R, X.sub.V, E'>,
where X.sub.R .OR right.R, X.sub.V .OR right.V and:
E'={t.v.vertline.(.E-backward.t'.epsilon.E)(t=t'›X.sub.R !v=t'›X.sub.V !)}.
In contrast to .pi., .pi..sup.c allows a choice in the selection of the
virtual attributes from the source expression. This operation is useful in
defining the outer join operator defined below.
Delta Projection
The delta-projection, .delta..sub.X.sbsb.R.sub.X.sbsb.V (r) of r on
attributes .sub.X.sbsb.R.sub.X.sbsb.V is the relation <X.sub.R X.sub.V,
V.sub.new, E'>, where X.sub.R .OR right.R, X.sub.V .OR right.V, and:
E'={t.vertline.(.E-backward.t'.epsilon.E)(t›X.sub.r X.sub.V !=t'›X.sub.R
X.sub.V !t›V.sub.new != a new, unique value)}.
The .delta. operator models the SELECT DISTINCT . . . construct of SQL
which allows elimination of "duplicates" from a relation. The .delta.
operator is called the distinct projection operator and produces a result
relation which has distinct values on the attributes X.sub.R X.sub.V and a
new virtual attribute.
Selection
The selection, .sigma..sub.p (r), of relation r on predicate p is the
relation <R, V, E'>, where sch(p).OR right.R, and:
E'={t.vertline.t.epsilon.Ep(t)}.
Cross Product and Difference
The cross product, r.sub.1 .times.r.sub.2, and difference, r.sub.1
-r.sub.2, of relations r.sub.1 and r.sub.2 is the relation <R.sub.1
R.sub.2, V.sub.1 V.sub.2, E.sub.1 .times.E.sub.2 > and <R, V, E.sub.1
-E.sub.2), respectively.
Join
The join, r.sup.1 r.sub.2, or relations r.sub.1 and r.sub.2 is the relation
<R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where:
E'={t.vertline.t.epsilon.(E.sub.1 .times.E.sub.2)p(t)}.
Equi-join predicate
The predicate p is called an equi-join predicate if it consists of
conjuncts of clauses of the form X.sub.i =Y.sub.j, where X.sub.i and
Y.sub.j are attributes of the relations referenced by the predicate.
Union Compatible
A relation r.sub.1 =<{A.sub.1, A.sub.2, . . . , A.sub.n }, V.sub.1, E.sub.1
> is said to be union compatible with relation r.sub.2 =<{B.sub.1,
B.sub.2, . . . , B.sub.n }, V.sub.2, E.sub.2 > i:
domain(A.sub.i)=domain (B.sub.i), for 1.ltoreq.i.ltoreq.n.
That is, the real attributes of r.sub.1 and r.sub.2 can be ordered such
that the domains of the first real attributes of r.sub.1 and r.sub.2 are
the same, the domains of the second real attributes of r.sub.1 and r.sub.2
are the same, and so on.
Union and Outer Union
The union, r.sub.1 .orgate.r.sub.2, of union compatible relations r.sub.1
and r.sub.2 is the relation (R, V, E.sub.1 .orgate.E.sub.2). The outer
union, r.sub.1 r.sub.2, is the relation (R.sub.1 .orgate.R.sub.2, V.sub.1
.orgate.V.sub.2, E'), where:
E'={t.vertline.(.E-backward.t'.epsilon.E.sub.1)(t›R.sub.1 V.sub.1
!=t'(.A-inverted.A.epsilon.(R.sub.2 -R.sub.1).orgate.(V.sub.2
-V.sub.1))(t›A!=NULL))
(.E-backward.t".epsilon.E.sub.2)(t›R.sub.2 V.sub.2
!=t"(.A-inverted.A.epsilon.(R.sub.1 -R.sub.2).orgate.(V.sub.1
-V.sub.2))(t›A!=NULL))}.
Note that rows in r.sub.1 r.sub.2 are padded with NULLs for those
attributes that are not present either in relation r.sub.1 or in relation
r.sub.2.
The following notational convention is useful for the following
definitions. If r=<R, V, E> is a relation,the term ext(r) is used to
represent E, the extension part of r. In order to maintain readability in
the definitions, in the following few definitions we adopt the shorthand
convention of simply writing E.sub.1 E.sub.2 instead of ext(r.sub.1
r.sub.2) or .pi..sub.X (E.sub.1 E.sub.2) instead of ext(.pi..sub.X
(r.sub.1 r.sub.2)), etc.
Left Outer Join
The left outer join, r.sub.1 r.sub.2, is the relation <R.sub.1 R.sub.2,
V.sub.1 V.sub.2, E'>, where
E'=(E.sub.1 E.sub.2)(E.sub.1 -.pi..sup.c.sub.R.sbsb.1.sub.V.sbsb.1 (E.sub.1
E.sub.2)).
Preserved Relation and Null Supplying Relation
Relation r.sub.1 in the above definition of left outer join, r.sub.1
r.sub.2, is called the preserved relation and relation r.sub.2 is called
the null supplying relation. The right outer join, r.sub.1 r.sub.2, can
similarly be defined in which r.sub.1 is the null supplying relation and
r.sub.2 is the preserved relation.
Full Outer Join
The full outer join, r.sub.1 r.sub.2, of relations r.sub.1 and r.sub.2 is
the relation <R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where
E'=(E.sub.1 E.sub.2)(E.sub.1 -.pi..sup.c.sub.R.sbsb.1.sub.V.sbsb.1 (E.sub.1
E.sub.2))(E.sub.2 -.pi..sup.c.sub.R.sbsb.2.sub.V.sub.2 (E.sub.1 E.sub.2)).
Equivalent
Values v.sub.1 and v.sub.2 are said to be equivalent, denoted v.sub.1
=v.sub.2 if either both v.sub.1,v.sub.2 are non-NULL and v.sub.1 =v.sub.2,
or if both are NULL.
Intersect Distinct
The intersect distinct, r.sub.1 .andgate.r.sub.2, of two union compatible
relations r.sub.1 and r.sub.2 is the relation <{C.sub.1,C.sub.2, . . .
,C.sub.n }, V.sub.new, E'>, where
domain(A.sub.i)=domain(B.sub.i)=domain(C.sub.i), 1.ltoreq.i.ltoreq.n, and
E'={t.vertline.(.E-backward.t.sub.1 .epsilon.E.sub.1)(.E-backward.'.sub.2
.epsilon.E.sub.2)(.A-inverted.i)(t›C.sub.i !.tbd.t.sub.1 ›A.sub.i
!t›C.sub.i !.tbd.t.sub.2 ›B.sub.i !t›V.sub.new !=a new, unique value)}
Intersect distinct retains the common tuples in relations r.sub.1 and
r.sub.2. If a tuple t.sub.1 .epsilon.E.sub.1 contains null value in
attribute A.sub.i and t.sub.2 .epsilon.E.sub.2 contains null value in
attribute B.sub.i and identical no-null values in the remaining attributes
then t.sub.1 and t.sub.2 are considered equivalent and only a single copy
is retained in the result. If either operand contains duplicates then only
one copy of the common tuple is retained in the result.
Intersect All
By contrast, the operand intersect all, .andgate..sub.a, retains "some" of
the duplicate copies of the common tuples. More precisely, in two union
compatible relations r.sub.1 and r.sub.2, if a common tuple t appears i
times in r.sub.1 and j times in r.sub.2, then t appears min{i,j} times in
the result relation r.sub.1 .andgate..sub.a r.sub.2.
5.2 Expressions and expression trees
The following provides a recursive definition of expressions.
1. If r=<R, V, E> is a relation, then r is an expression. Henceforth,
whenever there is no ambiguity, the shorthand notation e is used to
represent the expression e=<R, V, E>,
2. If e=<R, V, E> is an expression then .pi..sup.a.sub.X (e) is an
expression, where X.OR right.R.
3. If e=<R, V, E> is an expression then .delta..sub.X.sbsb.R.sub.X.sbsb.V
(e) is an expression, where X.sub.R .OR right.R and X.sub.V .OR right.V.
4. If e=<R, V, E> is an expression then .sigma..sub.p (e) is an expression,
where sch(p).OR right.R.
5. If e.sub.1 =<R.sub.1, V.sub.1, E.sub.1 > and e.sub.2 =<R.sub.2, V.sub.2,
E.sub.2 > are expressions then e.sub.1 .circle-w/dot.e.sub.2 is an
expression, where:
.circle-w/dot..epsilon.{,,,,.andgate..sub.d,.andgate..sub.a },
and p is a predicate such that sch(p).OR right.R.sub.1 R.sub.2, and sch
(p).andgate.R.sub.1 .noteq.{ }, and sch(p).andgate.R.sub.2 .noteq.{ }.
6. If e=<R, V, E> is an expression then so is (e), where (e)=<R, V, E>.
This parenthesization is required due to the fact that some of the binary
operations defined above are not fully associative. Parenthesization
determines the evaluation order so that expressions can be evaluated
un-ambiguously. However, whenever there is no ambiguity, parenthesis will
be dropped freely.
Expression Tree
An expression can also be represented by a corresponding expression tree in
which the inner nodes are the operators occurring in the expression and
the leaves are the base relations referenced in the expression.
Let denote one of the binary operators defined in the previous section,
then an expression tree T with left subtree T.sub.l, right subtree T.sub.r
and root is denoted by
(T.sub.l T.sub.r)
Henceforth, the two equivalent representations are used interchangeably.
6. Uniqueness Properties
As stated above, the invention is directed toward a system and method for
generating uniqueness information for relations in an expression. In one
application, this information is used to optimize a query, such as an SQL
query. To accomplish this objective, several innovations can be used as
described below. These innovations include:
The concept of weak binding in the context of outer joins;
Powerful, yet simple, new identities for key set determination;
A process, termed `uniqueness process` which provides a method for
determining the unique set of a subtree corresponding to a relational
sub-expression; and
A process, termed `augmented-unique process` for selecting roots of the
expression to which the uniqueness process is applied. Roots are selected
in the post order.
A unique set is the set of those attribute groups that uniquely identify
tuples in the expression. For base relations, this is the same as the key
set; for other expressions, the unique set may contain attributes which
are not part of the expression's schema. In this latter case, the
uniqueness test boils down to simply determining whether the schema of the
expression contains a member of the unique set.
This point may be best illustrated by the following example: Consider the
relation r=<R, V, E> having key K, and the expression .pi..sub.X (r). Then
the unique set for both r and .pi..sub.X (r) contains the key K. However,
tuples in .pi..sub.X (r) will be unique if X.OR left.K.
The uniqueness process, described in detail below, can be applied
recursively in order to determine the unique set of every subtree in a
given expression tree T. First, the unique sets of the base relations are
determined. After the unique sees of the base relation have been
determined, they are used to determine the unique sets of every subtree of
expression tree T.
To determine the unique sets, the concept of a keySet(r) is used. First,
KEY is used to denote a key of base relations r=<R, V, E>, where KEY.OR
right.R and KEY is minimal in the sense that no subset of the attributes
in KEY is a key for r. Then, keySet(r) is used to denote the set
containing all minimal keys of r. Therefore, keySet(r) is called the
unique set of relation r. Note, keySet(r) may be an empty set.
If U(e.sub.x) and U(e.sub.y) are two unique sets then
U(e.sub.x).multidot.U(e.sub.y) denotes the following set:
U(e.sub.x).multidot.U(e.sub.y)={K.sub.1 .orgate.K.sub.2 .vertline.K.sub.1
.epsilon.U(e.sub.x), K.sub.2 .epsilon.U(e.sub.y)}
Intuitively, U(e.sub.x).multidot.U(e.sub.y) is the set of all possible keys
that can be generated by combining the keys of e.sub.x and e.sub.y.
Strong Binding
A concept called strong binding is also important in the invention. A real
attribute A is said to be strongly bound if all tuples defined on schemas
containing A have exactly the same non-null value for attribute A.
Typically, an attribute A in the real schema of expression e gets strongly
bound when the select operation .sigma..sub.p.sbsb.x (e) assigns a literal
constant to attribute A.epsilon.sch(p.sub.x). For example
##STR1##
selects all tuples of expression r where A.sub.1 is equal to five. Because
A.sub.1 =5 for all tuples in the selected set of tuples, A.sub.1 is said
to be strongly bound.
Further, if R is a set of attributes, the set:
sBind(p.sub.x, R)={A.vertline.p.sub.x strongly binds attribute
AA.epsilon.(sch(p.sub.x).andgate.R)}.
Weak Bindings
Equally important to the invention is a concept called weak binding. A real
attribute A is said to be weakly bound if all tuples defined on schemas
containing A have either a null value or exactly the same non-null value
for attribute A.
Typically, a strongly bound attribute A in the real schema of expression
e.sub.I gets weakly bound in the result of an expression such as e.sub.0
.fwdarw.e.sub.1.
Predictor Sets
Yet another key concept of the invention is the identification of a
predictor set. If A is a (strongly or weakly) bound attribute in the real
schema of expression e, the predictor set of A, denoted pSet(A), is
defined as follows:
1. A.epsilon.pSet (A).
2. (Closure under predicate P:) If X.epsilon.pSet (A) and subclause X=Y is
part of an (inner/outer) join predicate P in e, then Y.epsilon.pSet (A).
In the uniqueness process, information about bound attributes is used to
get smaller keys from already available keys. More precisely, given that X
is a key for expression e, and A.epsilon.X is bound, the uniqueness
process attempts to determine if X-A is also a key. This is trivially true
if A is strongly bound and X-A.noteq.{ }.
If, however, A is weakly bound, the following result occurs: If X is a key
for e and A.epsilon.X is weakly bound to constant c, then X-A is a key for
e if:
.pi..sub.X-A (.sigma..sub.A=c (e)).andgate..pi..sub.X-A (e-.sigma..sub.A=c
(e))={ }, where X-A.noteq.{ }
In other words, if the same values for attributes X-A are not present in
tuples having the value c for column A and tuples having NULL for column
A, then A can be pruned from the key X.
The uniqueness process, described below, uses one manifestation of this
sufficient condition using predictor sets (pSets) to determine the unique
sets of the expression. The query expression is processed bottom-up, and
pSets are built as bound attributes and equijoin predicates that reference
these attributes are encountered.
As a later example shows, it is important to remember "levels" at which
attributes are added to pSets. From a structural point-of-view, attribute
B will be at a higher level than C in a query expression, if either B is
from the preserved side and C is from the null-supplying side of an outer
join on predicate B=C, or the query specifies application of predicate A=B
after that of predicate A=C. Semantically, if B and C are attributes in
pSet(A) and B has a higher level than C, then whenever values for C are
non-null, so are the values for B. With this in mind, the definition of
pSets is augmented to be a set of <attribute,level> pairs.
6.2 Identities for uniqueness propagation
Paulley, G. N. and Larson, P.-A., in their paper "Exploiting uniqueness in
query optimization," CASCON, pp. 804-822, Vol. II, October 1993, present
the following result for join queries: If relations r and s both have
primary keys, then key(r).multidot.key(s) is a key for rx. However, that
work does not consider key recognition across binary operators, .fwdarw.,
.rarw., and .revreaction..
The following identity applies to binary operations of , .fwdarw., .rarw.,
and .revreaction.. For .fwdarw., .rarw., and .revreaction. the identity is
novel and unique. For operations the identity is generally known, but is
presented here for the sake of completeness.
If e=L.sup.LR R is an expression having L and R as its left and right
operands, respectively, and .circle-w/dot. is a binary operator
.epsilon.{.sup.1, .fwdarw., .rarw., .revreaction.}, then:
keySet(e)=S.sub.1 .orgate.S.sub.2 .orgate.S.sub.3,
where
##EQU1##
For operator this is prior art.
To optimize the expression, the "minimal" key set is found from the set
S.sub.1 .orgate.S.sub.2 .orgate.S.sub.3, above. The uniqueness process,
described in detail below, implements this expression to optimize queries.
6.3 Uniqueness Propagation
A process for doing uniqueness propagation up the expression tree for a
given query expression is now described. However, before describing the
process in detail, an understanding of the following additional notation
is helpful.
sBindSet(T)
sBindSet(T)--denotes the set of strongly bound attributes in the
(sub)expression represented by (sub)tree T.
wBindSet(T)
wBindSet(T)--denotes the set of weakly bound attributes in the
(sub)expression represented by (sub)tree T.
joinAttr(T)
joinAttr(T)--denotes the set of attributes that are part of join predicates
in the (sub)expression represented by (sub)tree T. For example, for the
expression:
.sigma..sub.A.sbsb.1.sub.=5 (A).sup.A.sbsp.1.sup.B.sbsp.1
.sigma..sub.B.sbsb.1.sub.=5 (B)
the join attributes are A.sub.1 B.sub.1.
Shorthand Notation
If .circle-w/dot..sub.r is the root of the (sub)tree corresponding to
(sub)expression e.sub.x, the shorthand notation sch(.circle-w/dot..sub.r)
is used to represent sch (e.sub.x). For example, the root of the
expression:
e=.delta..sub.A.sbsb.1.sub.B.sbsb.1 (C.sup.C.sbsp.1.sup.A.sbsp.1
(.sigma..sub.A.sbsb.1.sub.=5 (A).sup.A.sbsp.1B.sbsp.1
.sigma..sub.B.sbsb.1.sub.=5 (B)))
is .delta..sub.A.sbsb.1.sub.B.sbsb.1 and
sch(.delta..sub.A.sbsb.1.sub.B.sbsb.1) is used to represent sch(e). For
the subexpression of e:
(C.sup.C.sbsp.1.sup.A.sbsp.1 (.sigma..sub.A.sbsb.1.sub.=5
(A).sup.A.sbsp.1.sup.B.sbsp.1 .sigma..sub.B.sbsb.1.sub.=5 (B)),
C.sub.1 A.sub.1 is the root.
Minimum Column Set
If X is a set of attribute sets, then min(X) denotes the set
{x.vertline.x.epsilon.X.A-inverted.y.epsilon.X(x.noteq.y.times.y)}. For
example, if X={ABC, ABD, AB, BC,}, where ABC, ABD, AB and BC are all keys
then:
min(X)={AB, BC}
is the minimum set of columns that imply that the others are also keys.
If <B, l>.epsilon.pSet(A) then the notation level(B) represents l, the
level of attribute B.
Recall that the attributes in the result of r.andgate.s are the renamed
versions of attributes of its union-compatible operands. If Y .OR
right.sch(r).orgate.sch(s), rename.sub..andgate. (Y) represents the
renamed version of the attributes in Y corresponding to the attribute
names in the result of .andgate..
The input to the uniqueness process is the root of an expression (sub)tree
T containing unary operators such as .delta., .pi., .delta. and binary
operators such as , .fwdarw., .rarw., .revreaction., .andgate..sub.d,
.andgate..sub.a. Any or all of these operators can be in the expression in
any sequence (provided, of course, that the expression is meaningful).
All the join predicates in the expression are equi-join predicates.
The uniqueness process is applied to the subject expression from the
bottom-up to determine the following properties, referred to as the
"uniqueness properties."
1. the unique set, unique(T), for the input (sub)expression, represented by
(sub)tree T. This is the unique set of columns that identify the output of
T.
2. the strongly bound set, sBindSet(T) for the input (sub)expression,
represented by (sub)tree T. These are the strongly bound attributes of T,
for which, by definition tuples can only have the same constant value.
3. the weakly bound set wBindSet(T) for the input (sub)expression,
represented by (sub)tree T. These are the weakly bound attributes of T,
for which, by definition tuples have the same constant value or null.
4. the predictor set is pSet(A) for each bound attribute A.
The uniqueness process is applied recursively beginning at the root of the
subject expression as described by the following augmented-unique process.
The augmented-unique process traverses the subject expression in the post
order. The augmented-unique process is set forth in detail as follows:
Input.
Root of an expression (sub)tree T containing unary operators (.delta.,
.pi., .sigma.) and binary operators {, .fwdarw., .rarw., .revreaction.,
.andgate..sub.d, .andgate..sub.a }. All the join predicates in the
expression are equi-join predicates.
Output.
1. unique(T), the unique set for the input (sub)expression, represented by
(sub)tree T.
2. sBindSet(T).
3. wBindSet(T).
4. For each bound attribute A, its predictor set pSet(A).
Procedure: Let .circle-w/dot..sub.r be the root of T.
1. If .circle-w/dot..sub.r is a leaf (i.e., a base relation s) then perform
step 1 in the uniqueness process.
2. If T=.sigma..sub.p (T') then
call augmented.sub.-- unique(T') to obtain unique(T'), sBindSet(T'),
wBindSet(T') and pSet.
Perform step 2 in unique to determine unique(T), sBindSet(T), wBindSet(T)
and pSet.
3. If T=.delta..sub.x (T') then
call augmented.sub.-- unique (T') to obtain unique(T'), sBindSet (T'),
wBindSet (T') and pSet.
Perform step 3 in uniqueness process to determine unique(T), sBindSet(T),
wBindSet(T) and pSet.
4. If T=.pi..sub.x (T') then
call augmented.sub.-- unique(T') to obtain unique(T'), sBindSet(T'),
wBindSet(T') and pSet.
Perform step 4 in uniqueness process to determine unique(T), sBindSet(T),
wBindSet(T) and pSet.
5. If T=T.sub.L .circle-w/dot.T.sub.R then
call augmented.sub.-- unique(T.sub.L) to obtain unique(T.sub.L),
sBindSet(T.sub.L), wBindSet(T.sub.L) and pSet.
call augmented.sub.-- unique(T.sub.R) to obtain unique(T.sub.R),
sBindSet(T.sub.L), wBindSet(T.sub.R) and pSet.
Perform step 5 in uniqueness process to determine unique(T), sBindSet(T),
wBindSet(T) and pSet.
6. If T=T.sub.L .andgate.T.sub.R, where .andgate..epsilon.{.andgate..sub.a,
.andgate..sub.d }, then
call augmented.sub.-- unique(T.sub.L) to obtain unique(T.sub.L),
sBindSet(T.sub.L), wBindSet(T.sub.L) and pSet.
call augmented.sub.-- unique(T.sub.L) to obtain unique(T.sub.R),
sBindSet(T.sub.R), wBindSet(T.sub.R) and pSet.
Perform step 6 in unique to determine unique(T), sBindSet(T), wBindSet(T)
and pSet.
FIG. 4 is a high-level flow diagram illustrating the augmented unique
process. In a step 404, the process begins at the root of the expression
and continues to find the innermost root contained in the expression. Once
the innermost root is found, in a step 408, the uniqueness properties of
that root are determined. In a step 412 the process `unwinds` to determine
the uniqueness properties of subsequent roots of the expression from the
inside out.
FIG. 5 is an operational flow diagram illustrating the augmented unique
process in greater detail. In a step 504, the root of the expression T is
determined. The root of the expression T is examined first as illustrated
by steps 511 through 516. The actual step followed depends on the type of
root. Once it is determined what the root of the expression is, an
appropriate call is made to determine the uniqueness properties of the
child, T', of the root of expression T as illustrated by steps 561 and 522
through 526. Where the root is a leaf as illustrated by step 511,
procedure 1 of the uniqueness process is performed to determine the
uniqueness properties of that leaf, as illustrated in a step 561. The
uniqueness process is described in detail below.
If, instead, the root of the expression is a unary operation, as
illustrated by steps 512, 513 and 514, the operation continues at one of
steps 522, 523, and 524 where the uniqueness properties of the child, T',
of that unary operation are determined. This is done in a recursive manner
by suspending the original procedure and re-entering the augmented unique
process at step 504, as illustrated by step 570. This process continues
recursively until the base relation of the sub-expression is reached
(i.e., where T' is a leaf) and in step 561, procedure 1 of the uniqueness
process is used to determine the uniqueness properties of that leaf.
The process then `unwinds` to determine the uniqueness properties of the
parent operator of that leaf (i.e., the operation returns to the suspended
call). Now, the operation continues at steps 562, 563, or 564 depending on
the type of operator in question.
If the root of the expression is a binary operation as illustrated by steps
515 and 516, the uniqueness properties of one side of the binary operation
are determined as illustrated by steps 525 and 526, respectively, and then
the uniqueness properties of the second side of the binary operation are
determined as illustrated by steps 535 and 536, respectively. As is the
case with the unary operations above, the properties of all descendants of
the binary operation must be determined first. Thus, the augmented-unique
process is applied recursively.
Once the uniqueness properties of both sides of the binary operation are
determined, the uniqueness process is applied to the root to determine its
uniqueness properties in steps 565 and 566.
To provide further clarity, this procedure is now described in terms of a
simple example. Consider the expression:
.pi..sub.C.sbsb.1 (.delta..sub.B.sbsb.1.sub.C.sbsb.1
(C.sup.C.sbsp.1.sup.A.sbsp.1 (A.sup.A.sbsp.1.sup.B.sbsp.1
(.sigma..sub.B.sbsb.1.sub.=4 (B)))).andgate..sub.a .pi..sub.D.sbsb.1
(D.sup.D.sbsp.1.sup.E.sbsp.1 E).
To optimize this expression, it is desirable to replace the intersect all
operator .andgate..sub.a with an intersect distinct operator
.andgate..sub.d if possible. To determine whether such optimization is
possible, the uniqueness process is followed to apply uniqueness process
to each root of the expression to determine the unique sets of the
expression.
As stated above, the starting point of the optimization process is at the
root of the expression. In the example expression, the root is the
intersect all operator. Therefore, the procedure applied of the
augmentation-unique process is procedure 6, illustrated by step 516 of
FIG. 5. In a step 526, augmented-unique(T.sub.L) is called to determine
the uniqueness properties of the left child. Note that either the left or
the right child could be evaluated first, but that this example chooses to
evaluate the left side first.
The left child of the intersect all root is .pi..sub.C.sbsb.1. Thus, the
call in step 526 is suspended and the operation continues at step 514,
where the uniqueness properties of the root .pi..sub.C.sbsb.1 are
determined. To determine the uniqueness properties of the root
.pi..sub.C.sbsb.1, the uniqueness properties of the child of the root
.pi..sub.C.sbsb.1 are determined in step 524. The child, T.sup.1, of
.pi..sub.C.sbsb.1 (T.sup.1) is:
.delta..sub.B.sbsb.1.sub.C.sbsb.1 (C.sup.C.sbsp.1.sup.A.sbsp.1
(A.sup.A.spsb.1.sup.B.sbsp.1 (.sigma..sub.B.sbsb.1.sub.=4 (B)))).
The root of this child is .delta..sub.B.sbsb.1.sub.C.sbsb.1. Thus, the call
in step 524 is suspended and the operation continues at step 513, where
the uniqueness properties of .delta..sub.B.sbsb.1.sub.C.sbsb.1 (T.sup.1)
are determined. To determine the uniqueness properties of
.delta..sub.B.sbsb.1.sub.C.sbsb.1 (T.sup.1), the uniqueness properties of
the children T.sup.1 must be determined as illustrated in step 523. The
child T.sup.1 of this root .delta..sub.B.sbsb.1.sub.C.sbsb.1 is:
C.sup.C.sbsp.1.sup.A.sbsp.1 (A.sup.A.sbsp.1.sup.B.sbsp.1
(.sigma..sub.B.sbsb.1.sub.=4 (B))
The root of the child C.sup.C.sbsp.1.sup.A.sbsp.1
(A.sup.A.sbsp.1.sup.B.sbsp.1 (.sigma..sub.B.sbsb.1.sub.=4 (B))) is
C.sub.1 A.sub.1.
Therefore, the call in step 523 is suspended and the operation continues at
step 515, where the uniqueness properties of binary operations are
determined. The uniqueness properties of the binary operation C.sub.1
A.sub.1 are first determined for one child and then the other child as
illustrated by steps 525 and 535.
For the left child of the root C.sub.1 A.sub.1 the operation continues at
step 511 where the uniqueness properties of the left child, C, are
determined. Because C is a leaf, procedure 1 (illustrated by column 1,
step 561) is followed. In step 561, procedure 1 of the uniqueness process
is implemented to determine the uniqueness properties of the relation C.
Procedure 1 of the uniqueness process is described in detail below.
Once this is done, the left child of the root C.sub.1 A.sub.1 is completely
processed. Hence, the operation continues back at step 535 where the
uniqueness properties of the right child of the root C.sub.1 A.sub.1 are
determined.
The right child of the root C.sub.1 A.sub.1 is
(A.sup.A.sbsp.1.sup.B.sbsp.1 (.sigma..sub.B.sbsb.1.sub.4 (B)))
The root of this child is A.sub.1 B.sub.1, for which the uniqueness
properties are determined by the augmented unique process as illustrated
by step 515.
For the left child of the root A.sub.1 B.sub.1, the call is suspended and
the operation continues at step 561 where the uniqueness properties of the
left child, A, are determined. In step 561, procedure 1 of the uniqueness
process is implemented to determine the uniqueness properties of the
relation A.
Once this is done, the operation can continue at step 535 where the
uniqueness properties of the right child of the root A.sub.1 B.sub.1 are
determined. The right child of the root A.sub.1 B.sub.1 is
(.sigma..sub.B.sbsb.1.sub.=4 (B))
The uniqueness properties of the root, (.sigma..sub.B.sbsb.1.sub.=4 (B)),
are determined as illustrated by step 512. The child of
.sigma..sub.B.sbsb.1.sub.=4 is B. Because B is a leaf, the uniqueness
properties of B are determined by 511 as described above for the leaves C
and A.
Once the uniqueness properties of B are determined, the process `unwinds`
to determine the uniqueness properties of the parents and all ancestors in
an outward fashion. The uniqueness properties of
(.sigma..sub.B.sbsb.1.sub.=4 (B)) are determined in step 562 by applying
procedure 2 of the uniqueness process. The uniqueness properties of
A.sub.1 B.sub.1 are determined in step 565 by applying procedure 5 of the
uniqueness process. The uniqueness properties of
.delta..sub.B.sbsb.1.sub.C.sbsb.1 are determined in step 563 by applying
procedure 3 of the uniqueness process. And finally, the uniqueness
properties of .pi..sub.C.sbsb.1 are determined in step 564 by applying
procedure 4 of the uniqueness process.
The procedure described above is repeated for the second half (the right
half in this case) of the expression. That is, the above procedure is
repeated for the subexpression .pi..sub.D.sbsb.1
(D.sup.D.sbsp.1.sup.E.sbsp.1 E). The root of this expression is
.pi..sub.D.sbsb.1, which leads us to procedure 4 where steps 514 and 524
dictate that the uniqueness properties of the child D.sub.1 E.sub.1 of
this root be determined. To determine the uniqueness properties of D.sub.1
E.sub.1, step 515 is used. According to steps 525 and 535, the uniqueness
properties of D and E are determined. The uniqueness properties of D and E
are determined by step 511 where procedure 1 of uniqueness process is
applied individually to both D and E in step 561. After step 561 is
completed for both D and E, the operation unwinds as described above. In
unwinding, uniqueness process is applied to D.sub.1 E.sub.1 in step 565
and then uniqueness process is applied to .pi..sub.D.sbsb.1 in step 564.
At this time, to complete the unwinding, the operation continues at column
6 where procedure 6 of uniqueness process is applied to the root of the
expression T.sub.L .andgate..sub.a T.sub.R.
Procedures of the Uniqueness Process
The uniqueness process according to one embodiment of the invention is now
described. The procedure of the uniqueness process is described as
follows, where:
.circle-w/dot..sub.r is defined as a variable, indicating a root of
expression T.
.circle-w/dot..sub.r is defined as a variable representing a child of a
root.
The uniqueness process has six procedures, one of which is followed
depending on the root in question:
Procedure 1
If .circle-w/dot..sub.r is a leaf (i.e., a base relation s) then
##EQU2##
Procedure 2
If .circle-w/dot..sub.r =.sigma..sub.P and .circle-w/dot..sub.c is its
child (a base relation s=<R, V, E>) then
______________________________________
If K .epsilon. unique(s) such that (K - sBind(P, K))
= { } then
/* . . . one-tuple condition */
unique(T) = {A .vertline. A .epsilon. R}
sBindSet(T) = {A .vertline. A .epsilon. R}
}
else
{
unique(T) = min({K - sBind(P,K) .vertline. K .epsilon.
unique(s)})
sBindSet(T) = {A .vertline. A .epsilon. sch(P) P
strongly binds A}
}
wBindSet(T) = { }
joinAttr(T) = { }
for each attribute A that is strongly bound by
P do
pSet(A) = {<A, 1>}
}
______________________________________
Procedure 3
If .circle-w/dot..sub.r =.delta..sub.X and .circle-w/dot..sub.c is its
child then
______________________________________
X' = X
if X' joinAttr(.sub.c) then
for each attribuite W .epsilon. (wBindSet(.sub.c) .andgate. X') do
if (A)(<A, 1> .epsilon. pSet(W)Alevel(A) >
level (W)) then
X' = X' - W
X' = X - sBindSet(.sub.c)
if X' = { } then /* . . . one-tuple condition */
unique(T) = {A .vertline. A .epsilon. X}
else
unique(T) = min(unique(.sub.c) .orgate. {X'})
sBindSet(T) = sBindSet(.sub.c) .andgate. X
wBindSet(T) = wBindSet(.sub.c) .andgate. X
}
______________________________________
Procedure 4
If .circle-w/dot..sub.r =.pi..sub.X and .circle-w/dot..sub.c is its child
then
##EQU3##
Procedure 5
If .sub.r is a binary operator .epsilon.{, .fwdarw., .rarw.,
.revreaction.}, and .circle-w/dot..sub.L and .circle-w/dot..sub.R are its
left and right children, respectively, then
(a) unique(T)=min(S.sub.1 .orgate.S.sub.2 .orgate.S.sub.3),
where:
______________________________________
S.sub.1 = unique(.sub.L)
.smallcircle. unique(.sub.R)
unique(.sub.L)
if .epsilon. {, .fwdarw.}, P.sub.LR is
equi-join and (sch(P.sub.LR)
S.sub.2 = .andgate. sch(.sub.R)) .epsilon. unique(.sub.R)
{ } otherwise
unique(.sub.R)
if .epsilon. {, .rarw.}, P.sub.LR is
equi-join and (sch(P.sub.LR)
S.sub.3 = .andgate. sch(.sub.L)) .epsilon. unique(.sub.L)
{ } otherwise
______________________________________
(b) if .circle-w/dot..sub.r = then
sBindSet(T)=sBindSet(.circle-w/dot..sub.L).orgate.sBindSet(.circle-w/dot..s
ub.R)
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R)
if .circle-w/dot..sub.r =.fwdarw.then
sBindSet(T)=sBindSet(.circle-w/dot..sub.L)
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R).orgate.
sBindSet(.circle-w/dot..sub.R)
if .circle-w/dot..sub.r =.rarw.then
sBindSet(T)=sBindSet(.circle-w/dot..sub.R)
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R).orgate.
sBindSet(.circle-w/dot..sub.L)
if .circle-w/dot..sub.r =.revreaction.then
sBindSet(T)={ }
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R).orgate.
sBindSet(.circle-w/dot..sub.L)
.orgate.sBindSet(.circle-w/dot..sub.R)
(c) for each attribute A in a sBindSet(T) and wBindSet(T) compute closure
of pSet(A) under predicate P.sub.LR as follows:
if P.sub.LR contains subclause A=B then
if .circle-w/dot..sub.r .epsilon.{, .revreaction.} then
add <B, level (A)> to pSet(A)
if .circle-w/dot..sub.r =.fwdarw. and A is from the preserved side of
.fwdarw. then
add <B, level(A)-1> to pSet(A)
if .circle-w/dot..sub.r =.fwdarw. and A is from the null-supplying side of
.fwdarw. then
add <B, level (A)+1> to pSet(A)
(d)
joinAttr(T)=joinAttr(.circle-w/dot..sub.L).orgate.joinAttr(.circle-w/dot..
sub.R).orgate. (all the attributes mentioned in P.sub.LR)
Procedure 6
If .circle-w/dot..sub.r .epsilon.{.andgate..sub.d, .andgate..sub.a } and
.circle-w/dot..sub.L and .circle-w/dot..sub.R are its left and right
children, respectively, then
______________________________________
sBindSet(T) = min(rename.sub..andgate. (sBindSet(.sub.L)
.orgate.
sBindSet(.sub.R)))
wBindSet(T) = { }
wBindSet(T) = { }
wBindSet(T) = { }
joinAttr(T) = { }
if , is .andgate..sub.d then
unique(T) = min(rename.sub..andgate. ({output attributes of
.andgate..sub.d }
.orgate. unique(.sub.L)
.orgate. unique(.sub.R)))
else
unique(T) = min(rename.sub..andgate. (unique(.sub.L) .orgate. unique(.sub.
)))
}
______________________________________
Application of the Uniqueness Process
As described above, in steps 561-566 of the augmented unique process, the
appropriate procedure of the uniqueness process is applied to the
appropriate root of the expression to determine the uniqueness properties
of that root. That is each procedure determines the
unique(T)
sBindSet(T)
wBindSet(T)
joinAttr(T)
for each associated root.
The application of the uniqueness process as applied in this manner is now
described in greater detail.
In step 561, the uniqueness process is applied to determine the uniqueness
properties of .circle-w/dot..sub.r, where .circle-w/dot..sub.r is a leaf,
or a base relation s. The procedure of uniqueness process applied in this
step is procedure 1. Procedure 1 of uniqueness process is now described.
In procedure 1, the unique set for the input subexpression is set equal to
the key set for that expression. The sBindSet, wBindSet, and joinAttr set
are all initialized to the empty set. For example, if the key set of the
expression B is B.sub.1,B.sub.2, the unique set is defined as
B.sub.1,B.sub.2.
In step 562, uniqueness process is applied to determine the uniqueness
properties of .circle-w/dot..sub.r, where .circle-w/dot..sub.r is a
selector operation .pi..sub.P on the predicate P. The procedure of
uniqueness process applied in this step is procedure 2. Note that because
the process is applied in a recursive manner (as described above with
reference to the augmented unique process), step 562 is not reached until
the uniqueness properties of all the children of .circle-w/dot..sub.r have
been determined.
FIG. 6 illustrates procedure 2 of uniqueness process.
In procedure 2, The unique(t) and sBindSet(t) for P are first determined.
To make such a determination, it is first determined in a step 604 whether
there exists a set of attributes, K, in the unique set of the child of T
such that all the attributes in K are strongly bound by predicate P. That
is, it is determined whether:
.E-backward.K.epsilon.unique(s) such that (K-sBind(P, K))={ }
If this is true, a one-tuple condition is present. This means there is
exactly one tuple in the relationship computed at this point, therefore,
every attribute is sufficient to identify the tuples uniquely (i.e., all
attributes are unique).
If a one-tuple condition exists, (i.e. if step 604 is true), in a step 608
all the attributes currently in the node are added to unique (T) for this
node:
unique(T)={A.vertline.A.epsilon.R}
The strong bound set is similarly computed as:
sBindSet(T)={A.vertline.A.epsilon.R}
If, on the other hand, step 604 proves false (i.e., if all the attributes
are not bound), the unique set is determined in step 612 by including the
attributes from K that were not bound. Thus, the unique set is determined
as follows:
unique(T)=min({K-sBind(P,K).vertline.K.epsilon.unique(s)})
The strongly bound set is defined to include all the attributes from the
predicate which are strongly bound at this point as follows:
sBindSet(T)={A.vertline.A.epsilon.sch(P)P strongly binds A}
In steps 620 and 622 the weakly bound set and joint attribute sets are
defined as the empty set.
wBindSet(T)={ }
joinAttr(T)={ }
Finally, in a step 626, for each attribute A that is strongly bound by P,
level information is added. Specifically, each attribute A is defined as
being level one as follows:
pSet(A)={<A, 1)}.
To summarize to this point, the uniqueness properties for the leaf node
were first determined using procedure one, in step 561. Then, once the
uniqueness properties for the leaf node are determined, the uniqueness
properties are determined for a selection node, having the leaf node as
its child.
In step 563, uniqueness process is applied to determine the uniqueness
properties of .circle-w/dot..sub.r, where .circle-w/dot..sub.r is a delta
projection operation .delta..sub.X and X is a the set of attributes for
the projection. The procedure of uniqueness process applied in this step
is procedure 3. Note that because the process is applied in a recursive
manner (as described above with reference to the augmented unique
process), step 563 is not reached until the uniqueness properties of all
the children of .circle-w/dot..sub.r have been determined.
According to step 563, procedure 3 of uniqueness process is utilized to
determine the uniqueness properties of .circle-w/dot..sub.r, where
.circle-w/dot..sub.r is a delta projection operation .delta..sub.X. FIG. 7
illustrates the application of procedure 3 to .circle-w/dot..sub.r, where
.circle-w/dot..sub.r is a delta projection operation .delta..sub.X. In
general, the purpose of procedure 3 is determine which attributes can be
discarded to obtain smaller sets. In a step 704 the operation begins by
assuming that every attribute is going to be retained. In this regard, in
step 704 the operation is initialized as:
X'=X
In a step 708, it is determined whether X' contains the joining attribute
set of the child. In other words, in step 708 it is determined whether X'
contains all the attributes that are mentioned in join/outer join
predicates below this point in the expression tree. This is determined as
follows:
if X'.OR left.joinAttr(.circle-w/dot..sub.c)
If step 708 is true, it is determined in a step 712 for an attribute W
whether that attribute W is found in the weakly bound set of the child as
well as in the attribute set X', as follows:
W.epsilon.(wBindSet(.circle-w/dot..sub.c).andgate.X')
If step 712 is true, it is determined in a step 716 whether the level of
the attribute W is lower than the level of an attribute that is in X'.
This is determined as follows:
(.E-backward.A)(<A, 1>.epsilon.A5X'level(A)>level(W))
If step 716 is true, in a step 720 the attribute W can be removed from the
set of attributes X' as follows:
X'=X'-W
Steps 712-720 are repeated for all attributes W that are in X' and
wBindSet(.circle-w/dot..sub.c) to determine which attributes W can be
removed from the set of attributes X' as illustrated by decision block 724
and flow line 762. Note that example 4.2 below illustrates the concept of
levels for bound attributes.
In a step 728, all the attributes that are strongly bound in the child are
also discarded as follows:
X'=X'-sBindSet(.circle-w/dot..sub.c)
In a decision step 732, it is determined whether X' is empty. If X'={ },
then a one-tuple condition exists and in a step 736 every attribute that
was in the original X is part of the unique set:
unique(T)={A.vertline.A.epsilon.X}
If, on the other hand, X' is not empty, in a step 740, the unique set is
defined as all the attributes in the child combined with all the
attributes of X':
unique(T)=min(unique(.circle-w/dot..sub.c).orgate.{X'})
In a step 744, the strongly bound set is defined as the intersection of
strongly bound set of the child with the set X. In other words, the
algorithm is only interested in the attributes common to both sets.
Similarly, the weakly bound set is defined as the intersection of the
weakly bound set of the child with the set X:
sBindSet(T)=sBindSet(.circle-w/dot..sub.c).andgate.X
wBindSet(T)=wBindSet(.circle-w/dot..sub.c).andgate.X
In a step 748, the join attribute set is defined as the join attribute set
of the child:
joinAttr(T)=joinAttr(.circle-w/dot..sub.c)
According to step 564, procedure 4 of uniqueness process is utilized to
determine the uniqueness properties of .circle-w/dot..sub.r, where
.circle-w/dot..sub.r is a non-duplicate removing projection operation
.pi..sub.X. FIG. 8 illustrates the operation of determining the uniqueness
properties of .circle-w/dot..sub.r, where .circle-w/dot..sub.r is a
projection operation .pi..sub.X. In a step 804, the unique set is defined
as the unique set of the child of .pi..sub.X :
unique (T)=unique(.circle-w/dot..sub.c)
In a step 808, the strongly bound set is defined as the intersection of
strongly bound set of the child with the set X. In other words, the
algorithm is only interested in the attributes common to both sets.
Similarly, in a step 812 the weakly bound set is defined as the
intersection of the weakly bound set of the child with the set X:
sBindSet(T)=sBindSet(.circle-w/dot..sub.c).andgate.X
wBindSet(T)=wBindSet(.circle-w/dot..sub.c).andgate.X
In a step 816, the join attribute set is defined as the join attribute set
of the child:
joinAttr(T)=joinAttr(.circle-w/dot..sub.c)
According to step 565, procedure 5 of uniqueness process is utilized to
determine the uniqueness properties of .circle-w/dot..sub.r, where
.circle-w/dot..sub.r is a binary operator. FIG. 9 illustrates the
operation of determining the uniqueness properties of .circle-w/dot..sub.r
according to one embodiment, where .circle-w/dot..sub.r is a binary
operator .epsilon.{, .fwdarw., .rarw., .revreaction.}, and
.circle-w/dot..sub.L and .circle-w/dot..sub.R are its left and right
children, respectively. Note that by recursively following the augmented
unique process, the properties of the left and right children have already
been determined.
In a step 904, the uniqueness properties of the operation are defined as
the minimum attributes of the union of three sets:
unique(T)=min(S.sub.1 .orgate.S.sub.2 .orgate.S.sub.3),
where min is defined in Section 4.3.
Obviously, to determine the uniqueness properties, the sets (S.sub.1,
S.sub.2, and S.sub.3 must be determined.
S1 is determined by concatenating each element in the unique set of the
left child with each element in the unique set of the right child:
S.sub.1 unique(.circle-w/dot..sub.L).multidot.unique(.circle-w/dot..sub.R)
For example, if the left set has A and B and the left set has C and D, the
concatenation is AC,AD,BC and BD.
If the operator is a join or left outer join operator, and the predicate
P.sub.LR is an equi-join (e.g., A1=B1), and if the attributes that are
common to both the schema of the predicate and the schema of the right
child are in the unique set of the right child, then S.sub.2 is the unique
set of the left child; otherwise, S.sub.2 is the empty set:
##EQU4##
Similarly, if the operator is a join or right outer join operator, and the
predicate P.sub.LR is an equi-join (e.g., A1=B1), and if the attributes
that are common to both the schema of the predicate and the schema of the
left child are in the unique set of the left child, then S.sub.3 is the
unique set of the left child; otherwise, S.sub.3 is the empty set:
##EQU5##
In steps 908A-908D and 910A-910D the strongly bound set and weakly bound
sets are determined. The actual step followed is chosen based on the
actual operator present.
If .circle-w/dot..sub.r =, in a step 908A, the strongly bound set is set
equal to the union of the strongly bound set of the left child and the
strongly bound set of the right child:
sBindSet(T)=sBindSet(.circle-w/dot..sub.L).andgate.sBindSet(.circle-w/dot..
sub.R)
If .circle-w/dot..sub.r =, in a step 910A, the weakly bound set is set
equal to the union of the weakly bound set of the left child and the
weakly bound set of the right child.
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R)
If .circle-w/dot..sub.r =.fwdarw., in a step 908B, the strongly bound set
is set equal to the strongly bound set of the left child:
sBindSet(T)=sBindSet(.circle-w/dot..sub.L)
If .circle-w/dot..sub.r =.fwdarw., in a step 910B, the weakly bound set is
set equal to the union of the weakly bound set of the left child and the
weakly bound set of the right child and the strongly bound set of the
right child.
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R).orgate.sBindSet(.circle-w/dot..sub.R)
If .circle-w/dot..sub.r =.rarw., in a step 908C, the strongly bound set is
set equal to the strongly bound set of the right child:
sBindSet(T)=sBindSet(.circle-w/dot..sub.R)
If .circle-w/dot..sub.r =.rarw., in a step 910C, the weakly bound set is
set equal to the union of the weakly bound set of the left child and the
weakly bound set of the right child and the strongly bound set of the left
child.
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R).orgate.sBindSet(.circle-w/dot..sub.L)
If .circle-w/dot..sub.r =.revreaction., in a step 908D the strongly bound
set is set to the empty set:
sBindSet(T)={ }
In a step 910D, the weakly bound set is defined as the union of the weakly
bound set of the left child, the weakly bound set of the right child, the
strongly bound set of the left child and the strongly bound set of the
right child:
wBindSet(T)=wBindSet(.circle-w/dot..sub.L).orgate.wBindSet(.circle-w/dot..s
ub.R).orgate.sBindSet(.circle-w/dot..sub.L).orgate.sBindSet(.circle-w/dot..
sub.R)
For each attribute A in sBindSet(t) and wBindSet(t) there is a predicate
P.sub.LR. If the predicate, P.sub.LR, contains subclause A=B, as
illustrated by a step 912, the operation continues at one of steps 914A,
914B or 914C. If not, the operation continues directly to step 918.
The pSet of the binary operator is determined in a step 914 by using the
level information. For each attribute A in a sBindSet(T) and wBindSet(T)
the closure of pSet(A) under predicate P.sub.LR is computed depending on
the binary operator and on the relationship between the children.
If the predicate, P.sub.LR, contains subclause A=B and if
.circle-w/dot..sub.r .epsilon.{,.revreaction.}, in a step 914A, the level
of B is the same as the level of A:
add <B, level(A)> to pSet(A)
If the predicate, P.sub.LR, contains subclause A=B and if
.circle-w/dot..sub.r =.fwdarw. and A is from the preserved side of
.fwdarw., in a step 914B the level of B is one less than the level of A:
add <B, level(A)-1> to pSet(A)
If the predicate, P.sub.LR, contains subclause A=B and if
.circle-w/dot..sub.r =.fwdarw. and A is from the null-supplying side of
.fwdarw. then in a step 914C the level of B is one higher than the level
of A:
add <B, level(A)+1> to pSet(A)
In a step 918, the join attributes are defined as the join attributes from
the left side plus the join attributes from the right side plus all the
attributes from the predicate, P.sub.LR.
joinAttr(T)=joinAttr(.circle-w/dot..sub.L).orgate.joinAttr(.circle-w/dot..s
ub.R).orgate.(all the attributes in P.sub.LR)
According to step 566, procedure 6 of uniqueness process is utilized to
determine the uniqueness properties of .circle-w/dot..sub.r, where
.circle-w/dot..sub.r is a binary intersect operator
.epsilon.{.andgate..sub.d, .andgate..sub.a }. FIG. 10 illustrates the
operation of determining the uniqueness properties of .circle-w/dot..sub.r
according to one embodiment, where .circle-w/dot..sub.r is a binary
operator .epsilon.{.andgate..sub.d, .andgate..sub.a }, and
.circle-w/dot..sub.L and .circle-w/dot..sub.R are its left and right
children, respectively. Note that by recursively following the augmented
unique process, the properties of the left and right children have already
been determined.
In a step 1004, the strongly bound set is defined as the minimum of the
renamed combination of strong bind set of the left child and the strong
bind set of the right children:
sBindSet(T)=min(rename.sub..andgate.
(sBindSet(.circle-w/dot..sub.L).orgate.sBindSet(.circle-w/dot..sub.R)))
In a step 1008, the weak bind set is the empty set:
wBindSet(T)={ }
In a step 1012, the join attribute set is the empty set:
joinAttr(T)={ }
In a step 1014 it is determined whether the intersect operation is an
intersect distinct operator (i.e., if .circle-w/dot..sub.r is
.andgate..sub.d), then in a step 1018 the unique set is defined as the
minimum of the renamed set of the combination of the output attributes of
the intersect distinct and the unique set of the left child and the unique
set of the right child:
unique(T)=min(rename.sub..andgate. ({output attributes of .andgate..sub.d
}.orgate.unique(.circle-w/dot..sub.L).orgate.unique(.circle-w/dot..sub.R))
)
If in step 1014 it is determined that the operator is not intersect
distinct, then in step 1022 the unique set is defined as the minimum of
the renamed set of the combination of the unique set of the left child and
the unique set of the right child:
unique(T)=min(rename.sub..andgate.
(unique(.circle-w/dot..sub.L).orgate.unique(.circle-w/dot..sub.R)))
In an implementation, the size of the generated sets can be reduced at each
step of the recursion by pruning the attributes by the relevant set of
attributes, which would be passed down the recursion chain. Thus,
projection operators such as .delta..sub.X and .pi..sub.X would pass down
X as the set of relevant attributes and all generated sets could be
restricted to these attributes.
Although uniqueness process does not require it, its efficacy can be
greatly improved by doing some pre-processing of the input expression tree
along the lines of some existing, well-known, techniques such as
.sigma.-pushdown which moves all the .pi. operations next to the base
relations (Galindo-Legaria, C., and Rosenthal, A., "How to extend a
conventional optimizer to handle one- and two-sided outerjoin,"
Proceedings IEEE Data Engineering Conference, pp. 402-409, 1992).
4.5 Examples on uniqueness propagation
Example of Procedure 3 of the Uniqueness Process
As stated above with respect to procedure 3, when attempting to discard an
attribute, the higher level is checked first. In this example, let A, B, C
be relations, having real attributes A.sub.1 A.sub.2, B.sub.1, C.sub.1,
respectively. Consider the expression
e=.delta..sub.A.sbsb.1.sub.B.sbsb.1 (C.sup.C.sbsp.1.sup.A.sbsp.1
(.delta..sub.A.sbsb.1.sub.=5 (A).sup.A.sbsp.1.sub.B.sbsp.1
.delta..sub.B.sbsb.1.sub.=5 (B))). Let
##STR2##
We know that A.sub.1 B.sub.1 is a key of the expression. The question is
whether we can discard B.sub.1 and say that A.sub.1 is also a key. As this
example demonstrates, B.sub.1 can be removed from the key A.sub.1 B.sub.1
of e and A.sub.1 is also a key. However, A.sub.1 cannot be removed from
A.sub.1 B.sub.1 because B.sub.1 is not a key. This illustrates the concept
of "levels" for bound attributes.
Example illustrating the importance of carrying join attributes in
procedure 3
Let A, B be relations, having real join attributes A.sub.1 A.sub.2, B.sub.1
B.sub.2 respectively. Here, X' is A.sub.1 B.sub.1. According to procedure
3, because the join attributes are not contained in X', this step cannot
be applied. Consider the expression
e=.delta..sub.A.sbsb.1.sub.B.sbsb.1 (.delta..sub.A.sbsb.1.sub.=5
(A).sup.A.sbsp.1.sup.=B.sbsp.1.sup.A.sbsp.2.sup.=B.sbsp.2
.delta..sub.B.sbsb.1.sub.=5 (B)). Let
##STR3##
As this example demonstrates, we cannot remove B.sub.1 from the key A.sub.1
B.sub.1 of e and claim that A.sub.1 is also a key, even though B.sub.1 is
at a lower "level than A.sub.1. This illustrates the requirement that all
join attributes be present in the projection list for .delta. before
printing considerations be applied.
Example illustrating weak binding and uniqueness propagation
This example illustrates the use of weak bindings and uniqueness
propagation for converting .andgate..sub.a to .andgate..sub.d and subtree
pruning. This application of uniqueness propagation utilizes the following
identities:
1. (Chen, A. L. P., "Outerjoin optimization in multidatabase systems," 2nd
International Symposium on Databases in Parallel and Distributed Systems,
1990) .delta..sub.X (r.fwdarw.s)=.delta..sub.X (r), where X.OR
right.sch(r); and
2. (Paulley, G. N. and Larson, P.-A., "Exploiting uniqueness in query
optimization," CASCON, pp. 804-822, Vol. II, October 1993) .delta..sub.X
(r).andgate..sub.a s)=.delta..sub.X (r).andgate..sub.d S, where X.OR
right.sch(r).
Consider the query expression involving relations A, B, C, D, E where
A.sub.1 .epsilon.sch(A), B.sub.1 .epsilon.sch(B), C.sub.1 .epsilon.sch(C),
and D.sub.1 .epsilon.sch(D):
.pi..sub.C.sbsb.1 (.delta..sub.B.sbsb.1.sub.C.sbsb.1
(C.sup.C.sbsp.1.sup.A.sbsp.1 (A.sup.A.sbsp.1.sup.B.sbsp.1
(.delta..sub.B.sbsb.1.sub.=4 (B)))).andgate..sub.a .pi..sub.D.sbsb.1
(D.sup.D.sbsp.1.sup.E.sbsp.1 E).
This expression is equivalent to:
##EQU6##
Because of weak binding and uniqueness propagation we can determine that
C.sub.1 is indeed a key for .pi..sub.C.sbsb.1
(.delta..sub.B.sbsb.1.sub.C.sbsb.1 (C.sup.C.sbsp.1.sup.A.sbsp.1
(A.sup.A.sbsp.1.sup.B.sbsp.1 (.delta..sub.B.sbsb.1.sub.=4 (B)))). That
allows us to realize that .pi..sub.C.sbsb.1, which converts
.andgate..sub.a to .andgate..sub.d, which in turn allows .pi..sub.D.sbsb.1
to be the same as .delta..sub.D.sbsb.1, which in turn prunes the
expression D.sup.D.sbsp.1.sup.E.sbsp.1 E down to just D.
5. Conclusion
While various embodiments of the present invention have been described
above, it should be understood that they have been presented by way of
example only, and not limitation. Thus, the breadth and scope of the
present invention should not be limited by any of the above-described
exemplary embodiments, but should be defined only in accordance with the
following claims and their equivalents.
While the invention has been particularly shown and described with
reference to (a) preferred embodiment(s) thereof, it will be understood by
those skilled in the art that (various changes) (the foregoing and other
changes) in form and details may be made therein without departing from
the spirit and scope of the invention.
Top