Back to EveryPatent.com
United States Patent |
5,713,015
|
Goel
,   et al.
|
January 27, 1998
|
Reordering of complex SQL queries involving GROUPBYs, joins, outer joins
and full outer joins
Abstract
A method, apparatus, and article of manufacture for query simplification by
applying generalized inference propagation and generalized transitive
closure in SQL queries having selection, projection, join, outer join, and
intersection operations. The disclosed transformations and enumeration
method unify and solve the problems of 1) unnesting join aggregate
queries, and 2) complete enumeration of queries containing outer joins,
when the outer join predicate references an aggregated value, or the
predicate references more than two base relations in a query subtree. The
system first eliminates redundant sub-expressions and modifies expensive
binary operations to inexpensive binary operations, then converts complex
predicates to simple predicates by application of a generalized selection
(GS) operator.
Inventors:
|
Goel; Piyush (Monte Sereno, CA);
Iyer; Balakrishna Raghavendra (San Jose, CA)
|
Assignee:
|
International Business Machines Corporation (Armonk, NY)
|
Appl. No.:
|
655300 |
Filed:
|
May 30, 1996 |
Current U.S. Class: |
707/4; 707/2 |
Intern'l Class: |
G06F 017/30 |
Field of Search: |
395/601,602,603,604
|
References Cited
U.S. Patent Documents
5367675 | Nov., 1994 | Cheng et al. | 395/602.
|
5412804 | May., 1995 | Krishna | 395/602.
|
5557791 | Sep., 1996 | Cheng et al. | 395/602.
|
Other References
Gautam Bhargava et al., "Efficient Processing of outer joins and aggregate
functions," 1996 12th Int'l Conf. on Data Engineering, IEEE, pp. 441-449.
Pintsang Chang, "Nonlinear Versus Linear Recursion: A Perspective from
Computing Transitive Closures of a Binary Relation by the Join Domain
Nested Loops Approach," 1990 Compsac, pp. 382-390.
|
Primary Examiner: Kulik; Paul V.
Attorney, Agent or Firm: Merchant, Gould, Smith, Edell, Welter & Schmidt
Claims
What is claimed is:
1. A method of simplifying an SQL query in a computer having a memory, the
SQL query being performed by the computer to retrieve data from a
relational database stored in a electronic storage device coupled to the
computer, the method comprising the steps of:
(a) iteratively eliminating, in the memory of the computer, redundant
sub-expressions from the SQL query and modifying expensive binary
operations in the SQL query to inexpensive binary operations, so that only
simple and complex predicates remain in the SQL query; and
(b) iteratively converting, in the memory of the computer, the complex
predicates remaining in the SQL query to simple predicates by application
of a generalized selection (GS) operator.
2. The method of claim 1 further comprising the step of iteratively
applying, in the memory of the computer, a simplification identity to a
SELECT DISTINCT operator in the SQL query.
3. The method of claim 1 further comprising the step of iteratively
applying, in the memory of the computer, a simplification identity for a
push-up of a GROUPBY operator in the SQL query.
4. The method of claim 1 wherein the GS operator, .sigma..sub.p.sup.*
›r.sub.2,r.sub.3 ! (r.sub.1), of relation r.sub.1 with respect to
relations r.sub.2 and r.sub.3 comprises the relation <R.sub.1, V.sub.1,
E>, wherein:
E'=.sigma..sub.p (r.sub.1){.pi..sub.R.sbsb.2.sup.c.sub.V.sbsb.2
(r.sub.1)-.pi..sub.R.sbsb.2.sup.c.sub.V.sbsb.2 (.sigma..sub.p
(r.sub.1))}{.pi..sub.R.sbsb.3.sup.c.sub.V.sbsb.3
(r.sub.1)-.pi..sub.R.sbsb.3.sup.c.sub.V.sbsb.3 (.sigma..sub.p (r.sub.1))}
and wherein r.sub.1 =<R.sub.1, V.sub.1,E.sub.1 >, r.sub.2
=<R.sub.2,V.sub.2,E.sub.2 >, and r.sub.3 =<R.sub.3,V.sub.3,E.sub.3 >
comprise three relations such that R.sub.2 .OR right.R.sub.1, R.sub.3 .OR
right.R.sub.1, R.sub.2 .andgate.R.sub.3 =.phi., V.sub.2 .OR right.V.sub.1,
V.sub.3 .OR right.V.sub.1, and V.sub.2 .andgate.V.sub.3 =.phi., and p
denotes a null-intolerant predicate in R.sub.1.
5. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU22##
wherein X.sub.i =<R.sub.i,V.sub.i,E.sub.i > are relational expressions for
1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
6. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU23##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
7. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU24##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
8. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU25##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
9. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU26##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
10. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU27##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
11. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU28##
wherein X.sub.i =<R.sub.i, V.sub.i, V.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
12. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU29##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
13. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU30##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
14. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU31##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
15. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU32##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
16. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU33##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
17. The method as set forth in claim 1 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU34##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
18. An apparatus for simplifying an SQL query, comprising:
(a) a computer having a memory and an electronic storage device coupled
thereto, the data storage device storing a relational database;
(b) means, performed by the computer, for accepting the SQL query into the
memory of the computer, the SQL query being executed by the computer to
retrieve data from a relational database stored in the computer;
(c) means, performed by the computer, for iteratively eliminating redundant
sub-expressions in the SQL query and modifying expensive binary operations
in the SQL query to inexpensive binary operations, so that only simple and
complex predicates remain in the SQL query; and
(d) means, performed by the computer, for iteratively converting the
complex predicates in the SQL query to simple predicates by application of
a generalized selection (GS) operator.
19. A program storage device readable by a computer, tangibly embodying a
program of instructions executable by the computer to perform method steps
for simplifying an SQL query in a computer having a memory, the SQL query
being performed by the computer to retrieve data from a relational
database stored in an electronic storage device coupled to the computer,
the method comprising the steps of:
(a) iteratively eliminating, in the memory of the computer, redundant
sub-expressions in the SQL query and modifying expensive binary operations
in the SQL query to inexpensive binary operations, so that only simple and
complex predicates remain in the SQL query; and
(b) iteratively converting, in the memory of the computer, the complex
predicates in the SQL query to simple predicates by application of a
generalized selection operator.
20. The apparatus of claim 18 further comprising means for iteratively
applying, in the memory of the computer, a simplification identity to a
SELECT DISTINCT operator in the SQL query.
21. The apparatus of claim 18 further comprising means for iteratively
applying, in the memory of the computer, a simplification identity for a
push-up of a GROUPBY operator in the SQL query.
22. The apparatus of claim 18 wherein the GS operator, .sigma..sub.p.sup.*
›r.sub.2,r.sub.3 !(r.sub.1), of relation r.sub.1 with respect to relations
r.sub.2 and r.sub.3 comprises the relation <R.sub.1, V.sub.1, E'>,
wherein:
E'=.sigma..sub.p (r.sub.1){.pi..sub.R.sbsb.2.sup.c.sub.v.sbsb.2
(r.sub.1)-.pi..sub.R.sbsb.2.sup.c.sub.v.sbsb.2 (.sigma..sub.p
(r.sub.1))}{.pi..sub.R.sbsb.3.sup.c.sub.v.sbsb.3
(r.sub.1)-.pi..sub.R.sbsb.3.sup.c.sub.v.sbsb.3 (.sigma..sub.p (r.sub.1))}
and wherein R.sub.1 =<R.sub.1, V.sub.1, E.sub.1 >, r.sub.2 =<R.sub.2,
V.sub.2, E.sub.2 >, and r.sub.3 =<R.sub.3, V.sub.3, E.sub.3 > comprise
three relations such that R.sub.2 .OR right.R.sub.1, R.sub.3 .OR
right.R.sub.1, R.sub.2 .andgate.R.sub.3 =.phi., V.sub.2 .OR right.V.sub.1,
V.sub.3 .OR right.V.sub.1, and V.sub.2 .andgate.V.sub.3 =.phi., and p
denotes a null-intolerant predicate in R.sub.1.
23. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU35##
wherein X.sub.1 =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
24. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU36##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
25. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU37##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
26. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU38##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
27. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU39##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
28. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU40##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
29. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU41##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
30. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU42##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
31. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU43##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
32. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU44##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
33. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU45##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
34. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU46##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
35. The apparatus as set forth in claim 18 above, wherein the means for
iteratively converting comprises means for applying a simplification
expression to the SQL query comprising:
##EQU47##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
36. The method of claim 19 further comprising the step of iteratively
applying, in the memory of the computer, a simplification identity to a
SELECT DISTINCT operator in the SQL query.
37. The method of claim 19 further comprising the step of iteratively
applying, in the memory of the computer, a simplification identity for a
push-up of a GROUPBY operator in the SQL query.
38. The method of claim 19 wherein the GS operator, .sigma..sub.p.sup.*
›r.sub.2, r.sub.3 !(r.sub.1), of relation r.sub.1 with respect to
relations r.sub.2 and r.sub.3 comprises the relation <R.sub.1, V.sub.1,
E'>, wherein:
E'=.sigma..sub.p (r.sub.1){.pi..sub.R.sbsb.2.sup.c.sub.V.sbsb.2
(r.sub.1)-.pi..sub.R.sbsb.2.sup.c.sub.V.sbsb.2 (.sigma..sub.p
(r.sub.1))}{.pi..sub.R.sbsb.3.sup.c.sub.V.sbsb.3
(r.sub.1)-.pi..sub.R.sbsb.3.sup.c.sub.V.sbsb.3 (.sigma..sub.p (r.sub.1))}
and wherein r.sub.1 =<R.sub.1, V.sub.1, E.sub.1 >, r.sub.2 =<R.sub.2,
V.sub.2, E.sub.2 >, and r.sub.3 =<R.sub.3, V.sub.3, E.sub.3 > comprise
three relations such that R.sub.2 .OR right.R.sub.1, R.sub.3 .OR
right.R.sub.1, R.sub.2 .andgate.R.sub.3 =.phi., V.sub.2 .OR right.V.sub.1,
V.sub.3 .OR right.V.sub.1, and V.sub.2 .andgate.V.sub.3 =.phi., and p
denotes a null-intolerant predicate in R.sub.1.
39. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU48##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.1 and X.sub.j.
40. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU49##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
41. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU50##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
42. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU51##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
43. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU52##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
44. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL, query comprising:
##EQU53##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
45. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU54##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
46. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU55##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
47. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU56##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
48. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU57##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
49. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU58##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
50. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU59##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
51. The method as set forth in claim 19 above, wherein the iteratively
converting step comprises the step of applying a simplification expression
to the SQL query comprising:
##EQU60##
wherein X.sub.i =<R.sub.i, V.sub.i, E.sub.i > are relational expressions
for 1.ltoreq.i.ltoreq.3, and p.sub.i,j denote the predicates between
expressions X.sub.i and X.sub.j.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates in general to database management systems performed
by computers, and in particular, to a method, apparatus, and article of
manufacture for the reordering of complex structured query language (SQL)
queries involving GROUPBYs, joins, outer joins and full outer joins.
2. Description of Related Art
The current state-of-the-art in query optimization has few results for
optimizing query expressions involving GROUPBYs, joins, outer joins, and
full outer joins specified with predicates that either reference more than
two relations or reference columns generated by aggregations. Several
researchers have performed work in this area, as reflected in the
following publications, all of which are incorporated by reference herein:
Bhargava, G., Goel, P. and Iyer, B., "Reordering of complex queries
involving joins and outer joins," IBM Technical Report TR03.567, July
1994, (hereinafter referred to as "›BHAR94!");
Bhargava, G., Goel, P. and Iyer, B., "Hypergraph based reorderings of outer
join queries with complex predicates," SIGMOD, 1995, pp. 304-315,
(hereinafter referred to as "›BHAR95!");
Ganski, R., and Wong, H. K. T., "Optimization of nested SQL queries
revisited," SIGMOD, 1987, (hereinafter referred to as "›GANS87!");
Galindo-Legaria, C., and Rosenthal, A., "How to extend a conventional
optimizer to handle one- and two-sided outerjoin", proceedings of Data
Engineering, pp. 402-409, 1992, (hereinafter referred to as "›GALI92a!");
Harinarayan, V. and Gupta, A., "Optimization using tuple subsumption",
ICDT, Prague, January 1995, (hereinafter referred to as "›HARI94!");
Levy, Alon Y., Mumick, I. S. and Sagiv, Y., "Query optimization by
predicate move-around", VLDB, pp. 96-107, 1994, (hereinafter referred to
as "›LEVY94!");
Muralikrishma, M., "Improved Unnesting Algorithms for Join Aggregate SQL
Queries", Proceedings of 18th VLDB Con., pp 91-102, Vancouver, British
Columbia, Canada, 1992, (hereinafter referred to as "›MURA92!");
Pirahesh, H., Hellerstein, J. M. and Hasan, W., "Extensible/rule based
query rewrite optimization in Starburst," SIGMOD, pp. 39-48, San Diego,
Calif., June 1992, (hereinafter referred to as "›PIRA92!");
Paulley, G. N. and Larson, P-A., "Exploiting uniqueness in query
optimization," CASCON, pp. 804-822, Vol. II, October 1993, (hereinafter
referred to as "›PAUL93!");
Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A. and
Price, T. G., "Access path selection in a relational database management
system," SIGMOD, pp. 23-34, 1979, (hereinafter referred to as "›SELI79!").
However, there are numerous problems in the above prior art techniques.
More specifically, ›BHAR94! and ›BHAR95! have proposed algorithms to
reorder queries containing outer join predicates that reference two or
more relations, but their work can only do partial reorderings and cannot
reorder join aggregate queries.
With regard to join aggregate queries, consider the following query:
View V.sub.1 :
SELECT R.sub.1.a AS a, R.sub.2.a AS b, c=count (R.sub.1)
FROM R.sub.1, R.sub.2
WHERE R.sub.1.b .theta..sub.1 R.sub.2.b
GROUPBY R.sub.1. c, R.sub.2.d
Query 1:
SELECT R.sub.3.a, R.sub.4.b, R.sub.2.b
FROM (SELECT * FROM V.sub.1 LEFTOUTERJOIN R.sub.3 ON R.sub.3.b
.theta..sub.2 V.sub.1.c), R.sub.4
WHERE R.sub.4.b=V.sub.1.b
where .theta..sub.1, .theta..sub.2
.epsilon.{=,.noteq.,.ltoreq.,.gtoreq.,<,>}. The outer and inner joins
specified in the above query cannot be reordered by the existing methods.
This follows from the fact that view V.sub.1 cannot be merged with the
rest of the query because it contains a column (column c) that is
generated by an aggregation operator. Hence, the join specified in view
V.sub.1 cannot be reordered with other outer and inner joins specified in
the query.
Join-aggregate queries are known to be much more complex. Consider the
following co-related join-aggregate query:
SELECT R.sub.1.a FROM R.sub.1 WHERE R.sub.1.b .theta..sub.1
(SELECT COUNT (*) FROM R.sub.2 WHERE R.sub.2.c=R.sub.1.c AND R.sub.2. d
.theta..sub.2
(SELECT COUNT (*) FROM R.sub.3 WHERE R.sub.2.e=R.sub.3.e AND
R.sub.1.f=R.sub.3.f)),
where .theta..sub.1, .theta..sub.2
.epsilon.{=,.noteq.,.ltoreq.,.gtoreq.,<,>}. Typical commercial relational
data base management systems (RDBMS) execute the above co-related query
using Tuple Iteration Semantics (TIS), wherein for every tuple in R.sub.1,
first, tuples in R.sub.2 are selected by applying predicate
R.sub.2.c=R.sub.1.c and then each selected tuple in relation R.sub.2 in
turn, along with the tuple in R.sub.1, is substituted in predicate
R.sub.2.e=R.sub.3.e and R.sub.1.f=R.sub.3.f in order to first select and
then count the selected tuples in relation R.sub.3. This process is
repeated for every tuple in R.sub.1 resulting in a very inefficient
processing strategy.
Prior work by ›GANS87! and ›MURA92!, incorporated by reference herein,
proposed a method to unnest the above query and to transform it into the
following two queries that employ outer joins and do not require TIS:
Query 2:
SELECT INTO TEMP
R.sub.1.key, R.sub.1.a, R.sub.2.key, R.sub.2.b
FROM (SELECT * FROM R.sub.1 LEFTOUTERJOIN R.sub.2 ON
R.sub.1.c=R.sub.2.c), R.sub.3 LEFTOUTERJOIN R.sub.2.e=R.sub.3.e AND
R.sub.1.f=R.sub.3.f
GROUPBY R.sub.1.key, R.sub.1.a, R.sub.2.key, R.sub.2.b
HAVING R.sub.2.d .theta..sub.2 count (R.sub.3.key)
Query 3:
SELECT R.sub.1.a
FROM TEMP
GROUPBY R.sub.1.key
HAVING R.sub.1.b .theta..sub.1 count(R.sub.2.key)
One difficult problem in query optimization has been the inability to
unnest join aggregate queries. While the transformation of such a query to
an outer join is known, existing methods cannot reorder the left outer
joins specified in Query 2 because of the complex predicate
R.sub.2.e=R.sub.3.e and R.sub.1.f=R.sub.3.f. In addition, the prior art
has not described how to perform complete enumeration of queries
containing outer joins, when the outer join predicate references an
aggregated value, or the predicate references more than two base relations
in a query subtree. The problem is important since hierarchical schemas
are often found in the schemas translated from hierarchical legacy data
bases used for RDBMS-based data warehousing applications and object
oriented data bases.
Therefore, for more efficient processing of complex SQL queries, there is a
need for reordering complex SQL queries that contain GROUPBYs, inner and
outer joins.
SUMMARY OF THE INVENTION
To overcome the limitations in the prior art described above, and to
overcome other limitations that will become apparent upon reading and
understanding the present specification, the present invention discloses a
method and apparatus for query simplification by applying generalized
inference propagation and generalized transitive closure in SQL queries
having selection, projection, join, outer join, and intersection
operations. The disclosed transformations and enumeration method unify and
solve the problems of (1) unnesting join aggregate queries, and (2)
complete enumeration of queries containing outer joins, when the outer
join predicate references an aggregated value, or the predicate references
more than two base relations in a query subtree. The system first
eliminates redundant sub-expressions and modifies expensive binary
operations to inexpensive binary operations, then converts complex
predicates to simple predicates by application of a generalized selection
(GS) operator.
BRIEF DESCRIPTION OF THE DRAWINGS
Referring now to the drawings in which like reference numbers represent
corresponding parts throughout:
FIG. 1 illustrates the computer hardware environment of the present
invention;
FIG. 2 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements in an interactive
environment according to the present invention;
FIG. 3 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements embedded in source code
according to the present invention;
FIG. 4 is a flowchart illustrating a method of optimizing SQL queries of
the present invention;
FIG. 5 is a flowchart illustrating a method of RDBMS software performing
the application of a generalized selection (GS) operator according to the
present invention; and
FIG. 6 is a flowchart illustrating a method of removing redundant
sub-expressions and simplifying complex expressions according to the
present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the following description of the preferred embodiment, reference is made
to the accompanying drawings which form a part hereof, and in which is
shown by way of illustration a specific embodiment in which the invention
may be practiced. It is to be understood that other embodiments may be
utilized and structural and functional changes may be made without
departing from the scope of the present invention.
DEFINITIONS
Following are definitions for a number of terms used in SQL queries. These
definitions are required for an understanding of later portions of the
present specification.
Tuple
A tuple t is a mapping from a finite set of attributes, R.orgate.V, to a
set of atomic (possibly null) values, where R is a non-empty set of real
attributes and V is a non-empty set of virtual attributes,
R.andgate.V=.phi., and the mapping t maps at least one virtual attribute
v.epsilon.V to a non-null value. For an attribute set X, t›X! represents
the values associated with attributes X under the mapping t, where X.OR
right.R.orgate.V and X.noteq..phi..
The motivation behind the distinction between real and virtual attributes
is to emphasize the difference between the real attributes available for
manipulation (to the user of the RDBMS) and the virtual attributes used
(by the RDBMS) for bookkeeping only. The set of real attributes of a tuple
is the same as the schema of the tuple in the traditional relational
algebra. These attributes are accessible to users and can be referenced
externally, e.g., in user queries, etc. On the other hand, virtual
attributes are (at times) used to provide unique conceptional tuple-ids to
tuples, and are not accessible to users and cannot be referenced
externally.
Relation
A relation r is a triple <R, V, E> where R is a non-empty set of real
attributes, V is a non-empty set of virtual attributes, and E, the
extension of relation r, is a set of tuples such that:
(.A-inverted.t.sub.1 .epsilon.E) (.A-inverted.t.sub.2 .epsilon.E) (t.sub.1
.noteq.t.sub.2 t.sub.1 ›V!.noteq.t.sub.2 ›V!)
In this definition, R.orgate.V is called the schema of relation r.
Predicate
A predicate p over a set of real attributes sch(p), called the schema of p,
is a total mapping from tuples to the Boolean values {TRUE, FALSE}, where
sch(p) is the minimum set of attributes such that for all tuples t.sub.1
and t.sub.2 :
(t.sub.1 ›sch(p)!=t.sub.2 ›sch(p)!p(t.sub.1)=p(t.sub.2)).
For a tuple t with real schema R.OR left.sch(p), p(t) is TRUE if and only
if (.A-inverted.A.epsilon.sch(p)) (i.e., substitution of t›A! for A in p
causes it to evaluate to TRUE).
Null-intolerant
A predicate p is null-intolerant if p evaluates to FALSE for tuples
undefined on one or more attributes in sch(p). More formally, p is
null-intolerant if:
(.A-inverted.t)(.E-backward.A.epsilon.sch(p))(t›A!=NULLp(t)=FALSE)
Throughout the remainder of the present specification, it is assumed that
all predicates are null-intolerant.
ALGEBRAIC OPERATORS
Following are definitions for algebraic operators used in SQL queries.
These definitions are required for an understanding of later portions of
the present specification.
Relations
Let r=<R, V, E>, r.sub.1 =<R.sub.1, V.sub.1, E.sub.1 > and r.sub.2
=<R.sub.2, V.sub.2, E.sub.2 > denote relations such that R.sub.1
.andgate.R.sub.2 =.phi. and V.sub.1 .andgate.V.sub.2 =.phi..
Projection
The projection, .pi..sup.a.sub.x (r) of relation r onto attributes X is the
relation <X,V,E'> where X.OR right.R and:
E'={t.v.vertline.(.E-backward.t'.epsilon.E)(t=t'›X!v=t'›V!)}
The .pi..sup.a operator is a projection operator that does not remove
"duplicates" in the real attributes part of the source expression. The
superscript a in .pi..sup.a denotes the fact that all the virtual
attributes of the source expression are included in the virtual schema of
the result expression. For ease of reading, the superscript a is omitted
from .pi..sup.a whenever there is no ambiguity, so it can be written
simply as .pi..
The projection, .pi..sup.c.sub.x.sbsb.r.sub.x.sbsb.v (r) of relation r on
attributes X.sub.R X.sub.V is the relation <X.sub.R,X.sub.v, E'>, where
X.sub.R .OR right.R, X.sub.v .OR right.V and:
E'={t.v.vertline.<.E-backward.t'.epsilon.E)(t=t'›X.sub.R !v=t'›X.sub.v !)}
In contrast to .pi., .pi..sup.c allows a choice in the selection of the
virtual attributes from the source expression. This operation is needed
for defining the "Modified Generalized Outer Join" operator defined
hereinafter.
Table 1 shows the extension of the relation, and also the evaluation of the
expressions .pi..sub.A.sbsb.1 (r) and .pi..sub.a.sbsb.1.sup.c.sub.v.sbsb.1
(r) .
Delta-Projection
The delta-projection, .delta..sub.x.sbsb.R.sub.x.sub.v (r), of relation r
on attributes X.sub.R X.sub.V is the relation <X.sub.R X.sub.V, V.sub.new,
E'>, where X.sub.R .OR right.R, X.sub.v .OR right.V, and:
E'={t.vertline.(.E-backward.t'.epsilon.E)(t›X.sub.R X.sub.V !=t'›X.sub.R
X.sub.V !t›V.sub.new !}
which is a new, unique value. The .delta. operator models the "SELECT
DISTINCT . . . " construct of SQL which allows elimination of "duplicates"
from a relation. The .delta. operator is called the distinct projection
operator and produces a result relation which has distinct values on the
attributes X.sub.R X.sub.V and a new virtual attribute.
Generalized Projection
A Generalized Projection (GP), .pi..sub.x,f(Y), models the "GROUPBY . . . "
construct of SQL, which accepts as its argument a relation r and produces
a new relation according to the subscripts X and f(Y).
The subscript X specifies the attributes referenced in the GROUPBY
statement. For instance, GP .pi..sub.x (r) represents SQL statement
"SELECT X FROM r GROUPBY X". Note that a GP can also be used to represent
a SQL "SELECT DISTINCT" statement. For instance, "SELECT DISTINCT X" can
be represented by GP .pi..sub.x (r), i.e., by a GP with no aggregate
function.
The subscript f(Y) specifies the aggregation (if present). For example, GP
.pi..sub.X,count(Y) (r) represents the SQL statement "SELECT X, COUNT(Y)
FROM r GROUPBY X". Some aggregate functions are duplicate-insensitive,
e.g., MAX, MIN, COUNT(DISTINCT), etc. If a GP either represents the SELECT
DISTINCT statement or contains a duplicate-insensitive function, then such
a GP is termed a duplicate-insensitive GP and is denoted by
.delta..sub.x,f(Y), where f.epsilon.{MAX, MIN, COUNT(DISTINCT),
AVG(DISTINCT), SUM(DISTINCT)}.
Modified Generalized Outer Join
A modified generalized outer join (MGOJ) is described in the co-pending and
commonly assigned patent application Ser. No. 08/326,461, entitled "METHOD
AND APPARATUS FOR REORDERING COMPLEX SQL QUERIES CONTAINING INNER AND
OUTER JOIN OPERATIONS, filed on Oct. 20, 1994, by G. Bhargava et al., and
incorporated herein by reference. The MGOJ is defined in terms of
.pi..sup.c operator. Let r.sub.1 =<R.sub.1, V.sub.1, E.sub.1 > and r.sub.2
=<R.sub.2, V.sub.2,E.sub.2 > be two relations such that R.sub.1
.andgate.R.sub.2 =.phi. and V.sub.1 .andgate.V.sub.2 =.phi.. Further, let
X.sub.i =<R.sub.x.sbsb.i, V.sub.x.sbsb.i,E.sub.x.sbsb.i >,
1.ltoreq.i.ltoreq.n, and Y.sub.j =<R.sub.Y.sbsb.j,
V.sub.Y.sbsb.j,E.sub.Y.sbsb.j >, 1.ltoreq.j.ltoreq.m, be relations such
that R.sub.X.sbsb.i .andgate.R.sub.X.sbsb.k =.phi.=V.sub.X.sbsb.i
.andgate.V.sub.X.sbsb.k, R.sub.Y.sbsb.j .andgate.R.sub.Y.sbsb.s
=.phi.=V.sub.Y.sbsb.j .andgate.V.sub.Y.sbsb.s, R.sub.X.sbsb.i .OR
right.R.sub.1 and R.sub.Y.sbsb.j .OR right.R.sub.2, where
i.noteq.k,j.noteq.s, 1.ltoreq.i,k.ltoreq.n and 1.ltoreq.j,s.ltoreq.m.
The modified generalized outer join (MGOJ), r.sub.1 MGOJ›p, R.sub.X.sbsb.1,
. . . , R.sub.X.sbsb.n, R.sub.Y.sub.1, . . . , R.sub.Y.sbsb.m ! r.sub.2,
of relations r.sub.1 and r.sub.2 while preserving attributes
R.sub.X.sbsb.i and R.sub.Y.sbsb.i is the relation <R.sub.1 R.sub.2,
V.sub.1 V.sub.2, E'>, where 1.ltoreq.i.ltoreq.n, 1.ltoreq.j.ltoreq.m and
E' given by:
##EQU1##
Selection
The selection, .sigma..sub.p (r), of relation r on predicate p is the
relation <R,V,E'>, where sch(p).OR right.R, and:
E'={t.vertline.(t.epsilon.E)p(t)}
Cross Product and Difference
The cross product, r.sub.1 .times.r.sub.2, and difference, r.sub.1
-r.sub.2, of relations r.sub.1 and r.sub.2 are the relations <R.sub.1
R.sub.2, V.sub.1 V.sub.2, E.sub.1 .times.E.sub.2 > and <R, V, E.sub.1
-E.sub.2 >, respectively.
Union Compatible
A relation r.sub.1 =<{A.sub.1,A.sub.2, . . . , A.sub.n }, V, E.sub.1 > is
said to be union compatible with relation r.sub.2 =<{B.sub.1, B.sub.2, . .
. , B.sub.n }, V, E.sub.2 > if there exists a permutation p of the numbers
1, 2, . . . , n such that domain (A.sub.i)=domain (B.sub.p.sbsb.i), for
1.ltoreq.i.ltoreq.n. That is, the attributes of r.sub.1 and r.sub.2 can be
ordered such that the domains of the first attributes of r.sub.1 and
r.sub.2 are the same, the domains of the second attributes of r.sub.1 and
r.sub.2 are the same, and so on.
Union and Outer Union
The union, r.sub.1 .orgate.r.sub.2, of relations r.sub.1 and r.sub.2 is the
relation <R, V, E.sub.1 .orgate.E.sub.2 >. The outer union, r.sub.1
r.sub.2, is the relation (R.sub.1 .orgate.R.sub.2, V.sub.1
.orgate.V.sub.2, E'), where:
E'={t.vertline.(.E-backward.t'.epsilon.E.sub.1)(t›R.sub.1 V.sub.1
!=t'(.A-inverted.A.epsilon.(R.sub.2 -R.sub.1).orgate.(V.sub.2
-V.sub.1))(t›A!=NULL)) (.E-backward.t".epsilon.E.sub.2)(t›R.sub.2 V.sub.2
!=t"(.A-inverted.A.epsilon.(R.sub.1 -R.sub.2).orgate.(V.sub.1
-V.sub.2))(t›A!=NULL)}
Note that rows in r.sub.1 r.sub.2 are padded with nulls for those
attributes that are not present either in relation r.sub.1 or in relation
r.sub.2.
Intersect Distinct
Let r.sub.1 =<{A.sub.1, A.sub.2, . . . , A.sub.n }, V, E.sub.1 > and
r.sub.2 =<{B.sub.1, B.sub.2, . . . , B.sub.n }, V, E.sub.2 > be two union
compatible relations. Then, the intersect distinct, r.sub.1
.andgate..sub.d r.sub.2, of r.sub.1 and r.sub.2 is the relation <{C.sub.1,
C.sub.2, . . . , C.sub.n }, V, E>, where each C.sub.i is a possibly
renamed version of the union compatible attribute pair (A.sub.i, B.sub.i),
1.ltoreq.i.ltoreq.n, and
E={t.vertline.(.E-backward.t.sub.1 .epsilon.E.sub.1)(.E-backward.t.sub.2
.epsilon.E.sub.2)(.A-inverted.i)(t›C.sub.i !.tbd.t.sub.1 ›A.sub.i
!t›C.sub.i !.tbd.t.sub.2 ›B.sub.i !) t›V! is a new unique value}
The intersect distinct operation retains the common tuples in relations
r.sub.1 and r.sub.2. If a tuple t.sub.1 .epsilon.E.sub.1 contains null
value in attribute A.sub.i and t.sub.2 .epsilon.E.sub.2 contains null
value in attribute B.sub.i, 1.ltoreq.i.ltoreq.n, and identical non-null
values in the remaining attributes, then t.sub.1 and t.sub.2 are
considered equivalent and only a single copy is retained in the result. In
case there are duplicate tuples in either of the operands, only one copy
of the common tuple is retained in the result. In contrast, the intersect
all operator, denoted as .andgate..sub.a, retains "some" of the duplicate
copies of the common tuples, subject to the "minimum rule". More
precisely, in two union compatible relations r.sub.1 and r.sub.2, if a
tuple t appears i times in r.sub.1 and j times in r.sub.2, then t appears
min {i,j} times in the result relation r.sub.1 .andgate..sub.a r.sub.2.
Additional Algebraic Operators
In the following definitions, it is assumed that if predicate p is
associated with join/outer/full outer join of relations r.sub.1 and
r.sub.2 then p=p.sub.1 p.sub.2 . . . p.sub.m, where p.sub.i is a
predicate such that sch(p.sub.i).andgate.R.sub.1 .noteq..phi. and
sch(p.sub.i).andgate.R.sub.2 .noteq..phi., 1.ltoreq.i.ltoreq.m.
Join
The join,
##EQU2##
of relations r.sub.1 and r.sub.2 is the relation <R.sub.1 R.sub.2, V.sub.1
V.sub.2, E'>, where:
E'={t.vertline.t.epsilon.(E.sub.1 .times.E.sub.2)p(t)}
Anti-Join
The anti-join,
##EQU3##
of relations r.sub.1 and r.sub.2 is the relation <R.sub.1, V.sub.1, E'>,
where:
##EQU4##
Left and Right Outer Joins
The left outer join,
##EQU5##
is the relation <R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where:
##EQU6##
Relation r.sub.1 in the above definition is called the preserved relation
and relation r.sub.2 is called the null supplying relation. The right
outer join,
##EQU7##
can similarly be defined in which r.sub.1 is the null supplying relation
and r.sub.2 is the preserved relation.
Full Outer join
The full outer join,
##EQU8##
of relations r.sub.1 and r.sub.2 is the relation <R.sub.1 R.sub.2, V.sub.1
V.sub.2, E'>, where:
##EQU9##
OVERVIEW
The following description of the present invention presents a novel
operator, Generalized Selection (GS), which is used for complete
reordering of queries containing complex predicates and predicates that
are specified on columns generated by aggregations. The GS operator
enables the generation of different schedules for queries similar to Query
2 and Query 3, given above in the background section. For example, Query 2
allows the generation of schedules in which relations R.sub.4 and R.sub.1
are joined first. Consequently, if this join is very selective then the
join may substantially reduce the cost of the query. Further, the GS
operator facilitates complete enumeration of those queries that contain
only conjunctive predicates with binary operations. It is assumed that
predicates specified with binary operations are conjunctive, that is, in
expression:
##EQU10##
where:
.circle-w/dot..epsilon.{, .rarw., .fwdarw., .revreaction.}
and predicate p is of the form:
p=p.sub.1 p.sub.2 . . . p.sub.n
In addition, association identities are provided that enable the "break-up"
of complex predicates. Methods that generate an optimal plan for a given
query are described with the aid of examples which explain the enumeration
process.
HARDWARE ENVIRONMENT
FIG. 1 illustrates an exemplary computer hardware environment that could be
used with the present invention. In the exemplary environment, a computer
system 102 is comprised of one or more processors connected to one or more
electronic storage devices 104 and 106, such as disk drives, that store
one or more relational databases.
Operators of the computer system 102 use a standard operator interface 108,
such as IMS/DB/DC, CICS, TSO, OS/2 or other similar interface, to transmit
electrical signals to and from the computer system 102 that represent
commands for performing various search and retrieval functions, termed
queries, against the databases. In the present invention, these queries
conform to the Structured Query Language (SQL) standard, and invoke
functions performed by Relational DataBase Management System (RDBMS)
software. In the preferred embodiment of the present invention, the RDBMS
software comprises the DB2 product offered by IBM for the MVS or OS/2
operating systems. Those skilled in the art will recognize, however, that
the present invention has application to any RDBMS software that uses SQL.
As illustrated in FIG. 1, the DB2 architecture for the MVS operating system
includes three major components: the IMS Resource Lock Manager (IRLM) 110,
the Systems Services module 112, and the Database Services module 114. The
IRLM 110 handles locking services, because DB2 treats data as a shared
resource, thereby allowing any number of users to access the same data
simultaneously, and thus concurrency control is required to isolate users
and to maintain data integrity. The Systems Services module 112 controls
the overall DB2 execution environment, including managing log data sets
106, gathering statistics, handling startup and shutdown, and providing
management support.
At the center of the DB2 architecture is the Database Services module 114.
The Database Services module 114 contains several submodules, including
the Relational Database System (RDS) 116, the Data Manager 118, the Buffer
Manager 120 and other components 122 such as an SQL compiler/interpreter.
These submodules support the functions of the SQL language, i.e.,
definition, access control, retrieval, and update of user and system data.
The present invention is generally implemented as one or more functions
within the RDBMS software. Generally, the RDBMS software is tangibly
embodied in a computer-readable medium, e.g. one or more of the data
storage devices 104 and 106. Moreover, the RDBMS software is comprised of
instructions which, when read and executed by the computer system 102,
causes the computer system 102 to perform the steps necessary to implement
and/or use the present invention. Under control of an operating system,
the RDBMS software may be loaded from the data storage devices 104 and 106
into a memory of the computer system 102 for use during actual operations.
INTERACTIVE SQL EXECUTION
FIG. 2 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements in an interactive
environment according to the present invention. Block 202 represents the
input of SQL statements into the computer system 102 from the user. Block
204 represents the step of compiling or interpreting the SQL statements.
An optimization function within block 204 may transform the SQL query in a
manner described in more detail later in this specification. Block 206
represents the step of generating a compiled set of runtime structures
called an application plan from the compiled SQL statements. Generally,
the SQL statements received as input from the user specify only the data
that the user wants, but not how to get to it. This step considers both
the available access paths (indexes, sequential reads, etc.) and system
held statistics on the data to be accessed (the size of the table, the
number of distinct values in a particular column, etc.), to choose what it
considers to be the most efficient access path for the query. Block 208
represents the execution of the application plan, and block 210 represents
the output of the results of the application plan to the user.
EMBEDDED/BATCH SQL EXECUTION
FIG. 3 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements embedded in source code
according to the present invention. Block 302 represents program source
code containing a host language (such as COBOL or C) and embedded SQL
statements. The program source code is then input to a pre-compile step
304. There are two outputs from the pre-compile step 304: a modified
source module 306 and a Database Request Module (DBRM) 308. The modified
source module 306 contains host language calls to DB2, which the
pre-compile step 304 inserts in place of SQL statements. The DBRM 308
consists of the SQL statements from the program source code 302. A compile
and link-edit step 310 uses the modified source module 306 to produce a
load module 312, while an optimize and bind step 314 uses the DBRM 308 to
produce a compiled set of runtime structures for the application plan 316.
As indicated above in conjunction with FIG. 2, the SQL statements from the
program source code 302 specify only the data that the user wants, but not
how to get to it. The optimize and bind step 314 may reorder the SQL query
in a manner described in more detail later in this specification.
Thereafter, the optimize and bind step 314 considers both the available
access paths (indexes, sequential reads, etc.) and system held statistics
on the data to be accessed (the size of the table, the number of distinct
values in a particular column, etc.), to choose what it considers to be
the most efficient access path for the query. The load module 312 and
application plan 316 are then executed together at step 318.
SQL QUERY OPTIMIZATION
FIG. 4 is a flowchart illustrating the method of optimizing SQL queries in
steps 204 of FIG. 2 and 314 of FIG. 3 according to the present invention.
Block 402 represents the RDBMS software accepting the SQL query. Block 404
represents the RDBMS software translating the query into one or more query
blocks, which are representations of the outer query and any subqueries or
inner queries within the outer query, wherein the query blocks are an
internal representation of the user's query maintained in the memory of
the computer system 102 by the RDBMS software. Block 406 represents the
RDBMS software generating join plans from the query blocks. Block 408
represents the RDBMS software applying the GS operator, as described in
more detail in conjunction with FIG. 5. Finally, block 410 represents the
RDBMS software executing the query blocks against the relational database
and the output of the result table to the operator.
GENERALIZED SELECTION (GS) OPERATOR
Consider the following query:
##EQU11##
If only joins, outer and full outer joins are employed, then this query can
only be executed in the manner in which it is specified. In order to
re-order this query and other similar queries, a new Generalized Selection
(CS) .sigma..sup.*. operator is defined that enables the generation of
other plans such as:
##EQU12##
In the following definition, let r.sub.1 =<R.sub.1, V.sub.1, E.sub.1 >,
r.sub.2 =<R.sub.2, V.sub.2,E.sub.2 > and r.sub.3 =<R.sub.3, V.sub.3,
E.sub.3 > be three relations such that R.sub.2 .OR right.R.sub.1, R.sub.3
.OR right.R.sub.1, R.sub.2 .andgate.R.sub.3 =.phi., V.sub.2 .OR
right.V.sub.1, V.sub.3 .OR right.V.sub.1, and V.sub.2 .andgate.V.sub.3
=.phi.; also, let p denote a null-intolerant predicate in R.sub.1.
The Generalized Selection (GS), .andgate..sub.p.sup.* ›r.sub.2, r.sub.3 !
(r.sub.1) , of relation r.sub.1 with respect to relations r.sub.2 and
r.sub.3 is the relation <R.sub.1, V.sub.1, E'>, where E' is given by:
E'=.sigma..sub.p (r.sub.1){.pi..sub.R.sbsb.2.sup.c.sub.V.sbsb.2
(r.sub.1)-.pi..sub.R.sbsb.2.sup.c.sub.V.sbsb.2 (.pi..sub.p
(r.sub.1))}{.pi..sub.R.sbsb.3.sup.c.sub.V.sbsb.3
(r.sub.1)-.pi..sub.R.sbsb.3.sup.c.sub.V.sbsb.3 (.sigma..sub.p (r.sub.1))}
It will be noted that, if either r.sub.2 or r.sub.3 is empty, then the
corresponding term in E' is also empty; if both the relations are empty
then both the terms corresponding to these relations are also empty, and
the resultant relation is obtained by applying predicate p to relation
r.sub.1.
SIMPLIFICATION IDENTITIES FOR GENERALIZED SELECTION
Let X.sub.i =<R.sub.i, V.sub.i,E.sub.i >, where 1.ltoreq.i.ltoreq.3, be
expressions, p.sub.i,j denote the predicate(s) between expressions X.sub.i
and X.sub.j, then:
##EQU13##
It will be noted that in the above-described identities, a complex
predicate is specified with the last operator. If a complex predicate is
specified with the next to the last operator, then:
##EQU14##
In general, if more than one complex predicate is present in a given
expression, then it can be shown that the given expression can be
transformed into a resultant expression in which all join predicates are
simple. That is, complex predicates are applied such that some of the
predicates are applied with the join operators and remaining predicates
are applied with the GS operators that appear at the top in the resultant
expression.
For example, consider expression:
##EQU15##
By associative properties of outer joins, expression Q.sub.1 is equivalent
to the following expression:
##EQU16##
By applying Simplification Identity (4), the above expression can be
transformed to the following expression:
##EQU17##
At this point, by applying Simplification Identity (10), the above
expression can be further transformed to the following expression:
##EQU18##
It will be noted that the last expression does not contain any complex
predicates.
METHOD FOR ENUMERATING PLANS
FIG. 5 is a flowchart illustrating the method used by the RDBMS software to
apply the GS operator in step 408 of FIG. 4 according to the present
invention. FIG. 6 is a flowchart illustrating a method used by the RDBMS
software to remove redundant sub-expressions and simplify complex
expressions according to the present invention. The method to enumerate
plans for queries consists of the following sub-methods:
(a) Simplification. In this sub-method, attributes are identified by the
RDBMS software at 502 and 602 that are preserved by the initial query
specification, but which are not needed to be preserved by analysis of the
query structure. The beneficial effect is to convert some full outer joins
to outer joins, and some outer joins to regular joins. This step reduces
the cost by pruning away redundant sub-expressions and by converting
expensive binary operations to in-expensive binary operations. The actual
mechanism of this step is detailed in ›BHAR95!, which is incorporated by
reference herein. Of course, those skilled in the art will recognize that
the simplification step covers more queries than disclosed herein.
(b) "Dam" (due to aggregation) destruction. At 504 and 604, aggregation or
GROUPBY operators which have been dealt with by commercial RDBMS query
optimizers as firewalls across which reorderings are not considered, and
the results provided in ›SEL179!, incorporated by reference herein, are
individually applied to each query block by the RDBMS software. Results
described in ›HARI94, LEVY94, PIRA92, PAUL93!, incorporated by reference
herein, may be used to systematically break-down these firewalls generated
by query sub-blocks. The primary technique involves the application of
simplification identities (1) through (14) stated for the SELECT DISTINCT
operator, and from an observation made for the "push-up" of GROUPBY
operator.
An additional technique is given below which is required to handle
predicates that reference columns generated by aggregations. This
technique addresses outerjoin predicates on the aggregated GROUPBY
attributes. The technique is explained by means of an example. Consider
the following view X.sub.2 :
CREATE VIEW X.sub.2 (X.sub.2,1, X.sub.2,2) AS (SELECT Y.sub.1,AVG(Y.sub.2)
FROM Y GROUPBY Y.sub.1)
Consider the following query Q.sub.2 that references view X.sub.2 :
##EQU19##
where:
sch(p.sub.1,2.sup.2).andgate.X.sub.2 =X.sub.2,2
and
X.sub.2,2 .epsilon slash.sch(p.sup.1.sbsp.1,2).andgate.X.sub.2
It will be noted, as shown in the right hand side, the aggregation has been
moved up in the transformed query, permitting a complete reordering of
expression:
##EQU20##
In general, it is feasible to move aggregations past all the binary
operations specified in a given query (including outerjoins), thus,
enabling the complete enumeration of different binary operations specified
in the query.
(c) Conversion of complex predicates to simple predicates. In FIGS. 5 and
6, at 506 and 606, a query containing complex predicates is converted by
the RDBMS software into a query containing only simple predicates, but
induced at the root are the instances of generalized selections. This step
is a pre-cursor to the complete reordering of relations. One such
conversion for query Q.sub.1 is given in simplification identity (14)
above. Query Q.sub.1 has two complex predicates and, therefore, four ways
to convert the query to transformed queries containing only simple
predicates. Other ways to break up the complex predicates are given in the
following:
##EQU21##
It will be noted that the above identities do not have any complex
outerjoins predicate. Thus, by combining the methods of the present
invention and of ›GALI92a!, incorporated by reference herein, all possible
plans for the binary operations specified in a query would be enumerated.
Using cost estimation, the one with the lowest cost becomes a candidate
for execution.
CONCLUSION
This concludes the description of the preferred embodiment of the
invention. The following describes some alternative embodiments for
accomplishing the present invention. For example, any type of computer,
such as a mainframe, minicomputer, or personal computer, could be used
with the present invention. In addition, any software program adhering
(either partially or entirely) to the SQL language could benefit from the
present invention.
In summary, the present invention discloses a method and apparatus for
query simplification using generalized inference propagation and
generalized transitive closure in SQL queries having selection,
projection, join, outer join, and intersection operations. The disclosed
transformations and enumeration method unify and solve the problems of (1)
unnesting join aggregate queries, and (2) complete enumeration of queries
containing outer joins, when the outer join predicate references an
aggregated value, or the predicate references more than two base relations
in a query subtree.
The foregoing description of the preferred embodiment of the invention has
been presented for the purposes of illustration and description. It is not
intended to be exhaustive or to limit the invention to the precise form
disclosed. Many modifications and variations are possible in light of the
above teaching. It is intended that the scope of the invention be limited
not by this detailed description, but rather by the claims appended
hereto.
TABLE 1
______________________________________
Difference between .pi. and .pi..sup.c.
______________________________________
##STR1##
______________________________________
Top