Evaluating Top-k Queries over Web-Accessible Databases
Nicolas Bruno
Luis Gravano
Amélie Marian
Columbia University
2/27/2002
“Top-k” Queries Natural in Many Scenarios
  Example: NYC Restaurant Recommendation Service.
  Goal: Find best restaurants for a user:
    Close to address: “2290 Broadway”
    Price around $25
    Good rating
  Query: Specification of Flexible Preferences
  Answer: Best k Objects for Distance Function
Attributes often Handled by External Sources
  MapQuest returns the distance between two addresses.
  NYTimes Review gives the price range of a restaurant.
  Zagat gives a food rating to the restaurant.
“Top-k” Query Processing Challenges
  Attributes handled by external sources (e.g., MapQuest distance).
  External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).
  Existing algorithms do not handle all types of interfaces.
Processing Top-k Queries over Web-Accessible Data Sources
  Data and query model
  Algorithms for sources with different interfaces
  Our new algorithm: Upper
  Experimental results
Data Model
  Top-k Query: assignment of weights and target values to attributes
    Target values: < $25, “2290 Broadway”, very good > (preferred price, close to address, preferred rating)
    Weights: <4, 1, 2> (price: most important attribute)
  Combined in scoring function
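As a rough illustration of how weights combine per-attribute scores, here is a minimal sketch. It assumes a normalized weighted sum, which is the form the example executions later in these slides use; the function name `final_score` and the attribute scores below are hypothetical and not taken from the talk.

```python
from typing import Sequence


def final_score(weights: Sequence[float], attribute_scores: Sequence[float]) -> float:
    """Normalized weighted sum of per-attribute scores, each assumed to lie in [0, 1]."""
    if len(weights) != len(attribute_scores):
        raise ValueError("one score per weight is required")
    return sum(w * s for w, s in zip(weights, attribute_scores)) / sum(weights)


# Weights <4, 1, 2> from the slide, in the order price, distance, rating.
# The three attribute scores are made-up values for a single restaurant.
weights = (4, 1, 2)
print(final_score(weights, (0.5, 1.0, 0.75)))
```

Dividing by the sum of the weights keeps the combined score in [0, 1] whenever each attribute score is in [0, 1].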
Sorted Access Source S
  Return objects sorted by scores for a given query.
  Example: Zagat
  GetNext_S interface
  S-Source
  Access Time: tS(S)
Random Access Source R
  Return the score of a given object for a given query.
  Example: MapQuest
  GetScore_R interface
  R-Source
  Access Time: tR(R)
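The two access interfaces can be mocked in a few lines. This is only a sketch under stated assumptions: the class names `SortedSource` and `RandomSource`, the `elapsed` counter, and the idea of precomputing scores for one fixed query are illustrative choices, not part of the talk; real sources would be remote Web services.

```python
from typing import Dict, List, Optional, Tuple


class SortedSource:
    """S-Source: hands out (object, score) pairs in descending score order."""

    def __init__(self, scores: Dict[str, float], access_time: float) -> None:
        # Scores are assumed to be already computed for one fixed query q.
        self._ranked: List[Tuple[str, float]] = sorted(
            scores.items(), key=lambda item: item[1], reverse=True
        )
        self._position = 0
        self.access_time = access_time  # plays the role of tS(S)
        self.elapsed = 0.0

    def get_next(self) -> Optional[Tuple[str, float]]:
        """GetNext_S: next best object, or None once the source is exhausted."""
        self.elapsed += self.access_time  # every call is charged, even the last one
        if self._position >= len(self._ranked):
            return None
        item = self._ranked[self._position]
        self._position += 1
        return item


class RandomSource:
    """R-Source: returns the score of one given object."""

    def __init__(self, scores: Dict[str, float], access_time: float) -> None:
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)
        self.elapsed = 0.0

    def get_score(self, obj: str) -> float:
        """GetScore_R: score of `obj` for the fixed query."""
        self.elapsed += self.access_time
        return self._scores[obj]


zagat = SortedSource({"o1": 0.9, "o2": 0.8, "o3": 0.45}, access_time=1.0)
mapquest = RandomSource({"o1": 0.1, "o2": 0.7, "o3": 0.6}, access_time=1.0)
first = zagat.get_next()
print(first, mapquest.get_score(first[0]))
```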
Query Model
  Attribute scores between 0 and 1.
  Sequential access to sources.
  Score ties broken arbitrarily.
  No wild guesses.
  One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)
Query Processing Goals
  Processing top-k queries over R-Sources.
  Returning exact answer to top-k query q.
  Minimizing query response time.
  Naïve solution too expensive (access all sources for all objects).
Example: NYC Restaurants
  S-Source:
    Zagat: restaurants sorted by food rating.
  R-Sources:
    MapQuest: distance between two input addresses.
      User address: “2290 Broadway”
    NYTimes Review: price range of the input restaurant.
      Target Value: $25
TA Algorithm for SR-Sources
  Perform sorted access sequentially to all SR-Sources.
  Completely probe every object found for all attributes using random access.
  Keep best k objects.
  Stop when scores of best k objects are no less than maximum possible score of unseen objects (threshold).
  Fagin, Lotem, and Naor (PODS 2001)
  Does NOT handle R-Sources
Our Adaptation of TA Algorithm for R-Sources: TA-Adapt
  Perform sorted access to S-Source S.
  Probe every R-Source R_i for newly found object.
  Keep best k objects.
  Stop when scores of best k objects are no less than maximum possible score of unseen objects (threshold).
An Example Execution of TA-Adapt
  tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
  Final Score = (3 · score_Zagat + 2 · score_MQ + 1 · score_NYT) / 6
  Initial Threshold = 1
  Step  Access              Value returned        Threshold
  1     GetNext_S(q)        o1, S(Zagat) = 0.9    0.95
  2     GetScore_R1(q, o1)  R1(MQ) = 0.1          0.95
  3     GetScore_R2(q, o1)  R2(NYT) = 0.5         0.95
  4     GetNext_S(q)        o2, S(Zagat) = 0.8    0.9
  5     GetScore_R1(q, o2)  R1(MQ) = 0.7          0.9
  6     GetScore_R2(q, o2)  R2(NYT) = 0.7         0.9
  7     GetNext_S(q)        o3, S(Zagat) = 0.45   0.725
  8     GetScore_R1(q, o3)  R1(MQ) = 0.6          0.725
  9     GetScore_R2(q, o3)  R2(NYT) = 0.3         0.725
  Object  S(Zagat)  R1(MQ)  R2(NYT)  Final Score
  o1      0.9       0.1     0.5      0.56
  o2      0.8       0.7     0.7      0.75
  o3      0.45      0.6     0.3      0.55
  Total Execution Time = 9
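The TA-Adapt steps can be sketched in code and replayed on the scores of this example. The sketch below is an illustration under assumptions, not the authors' implementation: the function name `ta_adapt`, the in-memory lists and dictionaries standing in for the sources, and the normalized weighted-sum scoring are all choices made here for brevity.

```python
from typing import Dict, List, Sequence, Tuple


def ta_adapt(
    sorted_list: Sequence[Tuple[str, float]],
    random_sources: Sequence[Dict[str, float]],
    weights: Sequence[float],
    t_sorted: float,
    t_random: Sequence[float],
    k: int,
) -> Tuple[List[Tuple[float, str]], float]:
    """Return (best k as (score, object) pairs, total access time)."""
    total_weight = sum(weights)
    best: List[Tuple[float, str]] = []
    elapsed = 0.0
    for obj, s_score in sorted_list:          # one sorted access per iteration
        elapsed += t_sorted
        total = weights[0] * s_score
        for i, source in enumerate(random_sources):
            total += weights[i + 1] * source[obj]   # probe every R-Source
            elapsed += t_random[i]
        best.append((total / total_weight, obj))
        best.sort(reverse=True)
        best = best[:k]                       # keep best k objects
        # Threshold: best score any unseen object could still reach.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_weight
        if len(best) == k and best[-1][0] >= threshold:
            break
    return best, elapsed


# Scores from the example execution (Zagat is the S-Source).
zagat = [("o1", 0.9), ("o2", 0.8), ("o3", 0.45)]
mapquest = {"o1": 0.1, "o2": 0.7, "o3": 0.6}
nytimes = {"o1": 0.5, "o2": 0.7, "o3": 0.3}
answer, total_time = ta_adapt(zagat, [mapquest, nytimes], (3, 2, 1), 1.0, (1.0, 1.0), k=1)
print(answer, total_time)
```

With unit access times the run performs three sorted accesses and six probes, which agrees with the total execution time of 9 on the slide.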
Improvements over TA-Adapt
  Add a shortcut test after each random-access probe (TA-Opt).
  Exploit techniques for processing selections with expensive predicates (TA-EP).
    Reorder accesses to R-Sources.
    Best weight/time ratio.
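One plausible reading of "best weight/time ratio" is to probe R-Sources in decreasing order of weight divided by access time. The sketch below assumes exactly that; the function name and the weights and access times are hypothetical and only illustrate the ordering.

```python
from typing import Dict, List, Tuple


def order_by_weight_time_ratio(sources: Dict[str, Tuple[float, float]]) -> List[str]:
    """sources maps a source name to (weight, access_time); best ratio comes first."""
    return sorted(
        sources,
        key=lambda name: sources[name][0] / sources[name][1],
        reverse=True,
    )


# Hypothetical weights and access times, chosen only to show the reordering.
r_sources = {"MapQuest": (2.0, 4.0), "NYTimes Review": (1.0, 1.0)}
print(order_by_weight_time_ratio(r_sources))  # ['NYTimes Review', 'MapQuest']
```

Here the cheaper source goes first even though its weight is lower, because it contributes more weight per unit of access time.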
The Upper Algorithm
  Selects a pair (object, source) to probe next.
  Based on the property:
    The object with the highest upper bound will be probed before top-k solution is reached.
      Object is one of top-k objects
      Object is not one of top-k objects
An Example Execution of Upper
  tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
  Final Score = (3 · score_Zagat + 2 · score_MQ + 1 · score_NYT) / 6
  Initial Threshold = 1
  Step  Access              Value returned        Upper bound after access  Threshold
  1     GetNext_S(q)        o1, S(Zagat) = 0.9    o1: 0.95                  0.95
  2     GetScore_R1(q, o1)  R1(MQ) = 0.1          o1: 0.65                  0.95
  3     GetNext_S(q)        o2, S(Zagat) = 0.8    o2: 0.9                   0.9
  4     GetScore_R1(q, o2)  R1(MQ) = 0.7          o2: 0.8                   0.9
  5     GetNext_S(q)        o3, S(Zagat) = 0.45   o3: 0.725                 0.725
  6     GetScore_R2(q, o2)  R2(NYT) = 0.7         o2: 0.75                  0.725
  Object  Upper Bound  S(Zagat)  R1(MQ)  R2(NYT)  Final Score
  o1      0.65         0.9       0.1     -        -
  o2      0.75         0.8       0.7     0.7      0.75
  o3      0.725        0.45      -       -        -
  Total Execution Time = 6
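The upper bounds in this example follow from treating every attribute that has not been probed yet as if it scored 1. A small sketch of that arithmetic, where the function name `upper_bound` and the use of `None` for a missing score are illustrative assumptions:

```python
from typing import Optional, Sequence


def upper_bound(weights: Sequence[float], known: Sequence[Optional[float]]) -> float:
    """Best final score still possible; None marks an attribute not probed yet."""
    total = sum(w * (1.0 if s is None else s) for w, s in zip(weights, known))
    return total / sum(weights)


w = (3, 2, 1)                                   # Zagat, MapQuest, NYTimes Review
print(round(upper_bound(w, (0.9, None, None)), 3))   # o1 after sorted access
print(round(upper_bound(w, (0.9, 0.1, None)), 3))    # o1 after the MapQuest probe
print(round(upper_bound(w, (0.8, 0.7, 0.7)), 3))     # o2 fully probed: final score
```

Once every attribute is known the upper bound coincides with the final score, which is why a fully probed object can be compared directly against the bounds of the others.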
The Upper Algorithm
  Choose object with highest upper bound.
  If some unseen object can have higher upper bound:
    Access S-Source S
  Else:
    Access best R-Source R_i for chosen object
  Keep best k objects
  If top-k objects have final values higher than maximum possible value of any other object, return top-k objects.
  Interleaves accesses on objects
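The loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a normalized weighted-sum score, picks the "best" R-Source as the unprobed one with the highest weight/time ratio, chooses among objects that still have unprobed attributes, and accepts ties in the stopping test. The name `upper` and the in-memory stand-ins for the sources are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def upper(
    sorted_list: Sequence[Tuple[str, float]],
    random_sources: Sequence[Dict[str, float]],
    weights: Sequence[float],
    t_sorted: float,
    t_random: Sequence[float],
    k: int,
) -> Tuple[List[Tuple[str, float]], float]:
    """Return (top-k as (object, score) pairs, total access time)."""
    n_random = len(random_sources)
    seen: Dict[str, Tuple[float, Dict[int, float]]] = {}
    position, last_sorted_score, elapsed = 0, 1.0, 0.0

    def bound(s_score: float, probed: Dict[int, float]) -> float:
        # Unprobed attributes are optimistically assumed to score 1.
        total = weights[0] * s_score
        for i in range(n_random):
            total += weights[i + 1] * probed.get(i, 1.0)
        return total / sum(weights)

    while True:
        exhausted = position >= len(sorted_list)
        unseen = float("-inf") if exhausted else bound(last_sorted_score, {})
        ranked = sorted(seen, key=lambda o: bound(*seen[o]), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        top_final = all(len(seen[o][1]) == n_random for o in top)
        rival = max([unseen] + [bound(*seen[o]) for o in rest])
        if top_final and (len(top) == k or exhausted):
            if not top or bound(*seen[top[-1]]) >= rival:
                return [(o, bound(*seen[o])) for o in top], elapsed
        pending = [o for o in ranked if len(seen[o][1]) < n_random]
        if not pending or unseen > bound(*seen[pending[0]]):
            obj, score = sorted_list[position]      # sorted access to S
            position += 1
            seen[obj] = (score, {})
            last_sorted_score = score
            elapsed += t_sorted
        else:
            obj = pending[0]                        # highest upper bound
            unprobed = [i for i in range(n_random) if i not in seen[obj][1]]
            best = max(unprobed, key=lambda i: weights[i + 1] / t_random[i])
            seen[obj][1][best] = random_sources[best][obj]
            elapsed += t_random[best]


zagat = [("o1", 0.9), ("o2", 0.8), ("o3", 0.45)]
mapquest = {"o1": 0.1, "o2": 0.7, "o3": 0.6}
nytimes = {"o1": 0.5, "o2": 0.7, "o3": 0.3}
print(upper(zagat, [mapquest, nytimes], (3, 2, 1), 1.0, (1.0, 1.0), k=1))
```

On the scores of the example execution in these slides, with unit access times and k = 1, this sketch makes three sorted accesses and three probes, for a total time of 6.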
Selecting the Best Source
  Upper relies on expected values to make its choices.
  Upper computes “best subset” of sources that is expected to:
    1. Compute the final score for k top objects.
    2. Discard other objects as fast as possible.
  Upper chooses best source in “best subset”.
    Best weight/time ratio.
Experimental Setting: Synthetic Data
  Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
  tR(R_i): integer between 1 and 10.
  tS(S) ∈ {0.1, 0.2, …, 1.0}.
  Query execution time: t_total
  Default: k = 50, 10000 objects, uniform data.
  Results: average t_total of 100 queries.
  Optimal assumes complete knowledge (unrealistic, but useful performance bound)
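A synthetic instance along these lines could be generated as sketched below. Only the uniform data set is covered; the function name, the seed handling, and the number of R-Sources are assumptions made for illustration, since the slides do not state how many attributes the experiments used.

```python
import random
from typing import List, Tuple


def make_synthetic_instance(
    n_objects: int, n_random_sources: int, seed: int = 0
) -> Tuple[List[List[float]], float, List[int]]:
    """Uniform attribute scores plus randomly drawn access times."""
    rng = random.Random(seed)
    # One row per object: score for the S-Source, then one per R-Source.
    scores = [
        [rng.random() for _ in range(1 + n_random_sources)]
        for _ in range(n_objects)
    ]
    t_random = [rng.randint(1, 10) for _ in range(n_random_sources)]  # tR(R_i)
    t_sorted = rng.choice([i / 10 for i in range(1, 11)])             # tS(S)
    return scores, t_sorted, t_random


# 10000 objects is the slide's default; three R-Sources is an arbitrary choice.
scores, t_sorted, t_random = make_synthetic_instance(10000, 3)
print(len(scores), t_sorted, t_random)
```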
Experiments: Varying Number of Objects Requested (k)
Experiments: Varying Number of Database Objects (N)
Experimental Setting: Real Web Data
  S-Source: Verizon Yellow Pages (sorted by distance)
  R-Sources:
    Subway Navigator: Subway time
    Altavista: Popularity
    MapQuest: Driving time
    NYTimes Review: Food and price ratings
    Zagat: Food, Service, Décor and Price ratings
Experiments: Real-Web Data, # of Random Accesses
Evaluation Conclusions
  TA-EP and TA-Opt much faster than TA-Adapt.
  Upper significantly better than all versions of TA.
  Upper close to optimal.
  Real data experiments: Upper faster than TA adaptations.
Conclusion
  Introduced first algorithm for top-k processing over R-Sources.
  Adapted TA to this scenario.
  Presented new algorithms: Upper and Pick (see paper).
  Evaluated our new algorithms with both real and synthetic data.
  Upper close to optimal.
Current and Future Work
  Relaxation of the Source Model
    Current source model limited
    Any number of R-Sources and SR-Sources
    Upper has good results even with only SR-Sources
  Parallelism
    Define a query model for parallel access to sources
    Adapt our algorithms to this model
  Approximate Queries
References
  Top-k Queries:
    Evaluating Top-k Selection Queries, S. Chaudhuri and L. Gravano. VLDB 1999.
  TA algorithm:
    Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001.
  Variations of TA:
    Query Processing Issues on Image (Multimedia) Databases, S. Nepal and V. Ramakrishna. ICDE 1999.
    Optimizing Multi-Feature Queries for Image Databases, U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000.
  Expensive Predicates:
    Predicate Migration: Optimizing Queries with Expensive Predicates, J. M. Hellerstein and M. Stonebraker. SIGMOD 1993.
Real-web Experiments
Real-web Experiments with Adaptive Time
Relaxing the Source Model (Upper, TA-EP)
Upcoming Journal Paper
  Variations of Upper
  Select best source
  Data Structures
  Complexity Analysis
  Relaxing Source Model
  Adaptation of our Algorithms
  New Algorithms
  Variations of Data and Query Model to handle real web data
Optimality
  TA instance optimal over:
    Algorithms that do not make wild guesses.
    Databases that satisfy the distinctness property.
  TA_Z instance optimal over:
    Algorithms that do not make wild guesses.
  No complexity analysis of our algorithms, but experimental evaluation instead