Evaluating Top-k Queries over Web-Accessible Databases
Nicolas Bruno
Luis Gravano
Amélie Marian
Columbia University
2/27/2002
“Top-k” Queries Natural in Many Scenarios
Example: NYC Restaurant Recommendation Service.
Goal: Find best restaurants for a user:
Close to address: “2290 Broadway”
Price around $25
Good rating
Query: Specification of Flexible Preferences
Answer: Best k Objects for Distance Function
Attributes Often Handled by External Sources
MapQuest returns the distance between two addresses.
NYTimes Review gives the price range of a restaurant.
Zagat gives a food rating to the restaurant.
“Top-k” Query Processing Challenges
Attributes handled by external sources (e.g., MapQuest distance).
External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).
Existing algorithms do not handle all types of interfaces.
Processing Top-k Queries over Web-Accessible Data Sources
Data and query model
Algorithms for sources with different interfaces
Our new algorithm: Upper
Experimental results
Data Model
Top-k Query: assignment of weights and target values to attributes
Target values: < $25, “2290 Broadway”, very good >
$25: preferred price
“2290 Broadway”: close to address
very good: preferred rating
Weights: <4, 1, 2>
Price: most important attribute
Combined in scoring function
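The slide does not spell out the scoring function. A minimal sketch, assuming the weighted-average form used in the example executions later in the talk, and using made-up attribute scores for one restaurant:

```python
from typing import Sequence


def combined_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-attribute scores, each assumed to lie in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("one weight per attribute score is required")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical attribute scores (price, distance, rating) for one restaurant,
# combined with the weights <4, 1, 2> from the slide.
example = combined_score([0.9, 0.5, 0.8], [4, 1, 2])
```

How a raw attribute value (a price, a distance) is turned into a score between 0 and 1 is not covered here; the sketch starts from scores that are already normalized.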
Sorted Access Source S
Return objects sorted by scores for a given query.
Example: Zagat
GetNext interface
S-Source
Access Time: tS(S)
Random Access Source R
Return the score of a given object for a given query.
Example: MapQuest
R-Source
Access Time: tR(R)
GetScore interface
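The two source types can be pictured as two small interfaces. The sketch below is an illustration only: the class names, the in-memory dictionaries, and the omission of the query argument q are assumptions, and the scores are the ones used in the example executions later in the talk.

```python
from typing import Dict, Optional, Tuple


class SortedSource:
    """S-Source sketch: hands out (object, score) pairs in decreasing score order."""

    def __init__(self, scores: Dict[str, float], access_time: float = 1.0) -> None:
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0
        self.access_time = access_time  # plays the role of tS(S)

    def get_next(self) -> Optional[Tuple[str, float]]:
        """GetNext: the best object not returned so far, or None when exhausted."""
        if self._pos >= len(self._ranked):
            return None
        item = self._ranked[self._pos]
        self._pos += 1
        return item


class RandomSource:
    """R-Source sketch: returns the score of one given object."""

    def __init__(self, scores: Dict[str, float], access_time: float = 1.0) -> None:
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)

    def get_score(self, obj: str) -> float:
        """GetScore: score of obj; raises KeyError for an unknown object."""
        return self._scores[obj]


zagat = SortedSource({"o1": 0.9, "o2": 0.8, "o3": 0.45})
mapquest = RandomSource({"o1": 0.1, "o2": 0.7, "o3": 0.6})
```

An R-Source can only be asked about an object that is already known, which is why at least one S-Source is needed to discover objects.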
Query Model
Attribute scores between 0 and 1.
Sequential access to sources.
Score ties broken arbitrarily.
No wild guesses.
One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)
Query Processing Goals
Processing top-k queries over R-Sources.
Returning exact answer to top-k query q.
Minimizing query response time.
Naïve solution too expensive (access all sources for all objects).
Example: NYC Restaurants
S-Source:
Zagat: restaurants sorted by food rating.
R-Sources:
MapQuest: distance between two input addresses.
User address: “2290 Broadway”
NYTimes Review: price range of the input restaurant.
Target Value: $25
TA Algorithm for SR-Sources
Perform sorted access sequentially to all SR-Sources.
Completely probe every object found for all attributes using random access.
Keep best k objects.
Stop when scores of best k objects are no less than maximum possible score of unseen objects (threshold).
Fagin, Lotem, and Naor (PODS 2001)
Does NOT handle R-Sources
Our Adaptation of TA Algorithm for R-Sources: TA-Adapt
Perform sorted access to S-Source S.
Probe every R-Source Ri for newly found object.
Keep best k objects.
Stop when scores of best k objects are no less than maximum possible score of unseen objects (threshold).
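The four steps can be sketched as a short loop. This is an illustration under assumptions, not the authors' implementation: sources are plain in-memory lists and dictionaries, the scoring function is assumed to be a weighted average, and the function and variable names are made up. The data is the three-object example used in the talk.

```python
from typing import Dict, List, Sequence, Tuple


def ta_adapt(sorted_stream: Sequence[Tuple[str, float]],
             r_sources: Sequence[Dict[str, float]],
             weights: Sequence[float],
             t_s: float, t_r: Sequence[float], k: int):
    """Return (top-k list of (score, object), total access time)."""
    total_w = sum(weights)
    top: List[Tuple[float, str]] = []
    elapsed = 0.0
    for obj, s_score in sorted_stream:            # sorted access to S
        elapsed += t_s
        total = weights[0] * s_score
        for i, source in enumerate(r_sources):    # probe every R-Source
            total += weights[i + 1] * source[obj]
            elapsed += t_r[i]
        top.append((total / total_w, obj))
        top.sort(reverse=True)
        del top[k:]                               # keep best k objects
        # Best score an unseen object could still reach: it cannot beat the
        # last score seen from S, and may score 1 on every R-Source.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(top) == k and top[-1][0] >= threshold:
            break
    return top, elapsed


stream = [("o1", 0.9), ("o2", 0.8), ("o3", 0.45)]   # S (Zagat)
r1 = {"o1": 0.1, "o2": 0.7, "o3": 0.6}              # R1 (MapQuest)
r2 = {"o1": 0.5, "o2": 0.7, "o3": 0.3}              # R2 (NYTimes)
top, elapsed = ta_adapt(stream, [r1, r2], [3, 2, 1], 1.0, [1.0, 1.0], k=1)
```

With unit access times the run performs three sorted accesses and six probes before the threshold drops to 0.725, below the score of o2.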
An Example Execution of TA-Adapt
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final Score = (3 × scoreZagat + 2 × scoreMQ + 1 × scoreNYT) / 6
Initial Threshold = 1

Step  Access             Returned   Threshold
1     GetNextS(q)        o1, 0.9    0.95
2     GetScoreR1(q,o1)   0.1        0.95
3     GetScoreR2(q,o1)   0.5        0.95
4     GetNextS(q)        o2, 0.8    0.9
5     GetScoreR1(q,o2)   0.7        0.9
6     GetScoreR2(q,o2)   0.7        0.9
7     GetNextS(q)        o3, 0.45   0.725
8     GetScoreR1(q,o3)   0.6        0.725
9     GetScoreR2(q,o3)   0.3        0.725

Object  S(Zagat)  R1(MQ)  R2(NYT)  Final Score
o1      0.9       0.1     0.5      0.56
o2      0.8       0.7     0.7      0.75
o3      0.45      0.6     0.3      0.55

Total Execution Time = 9
Improvements over TA-Adapt
Add a shortcut test after each random-access probe (TA-Opt).
Exploit techniques for processing selections with expensive predicates (TA-EP).
Reorder accesses to R-Sources.
Best weight/time ratio.
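One way to read the reordering rule is to probe first the R-Source that moves the score the most per unit of access time. A small sketch under that assumption; the access times below are invented for illustration:

```python
from typing import Dict, List


def probe_order(weights: Dict[str, float], times: Dict[str, float]) -> List[str]:
    """R-Source names sorted by decreasing weight/time ratio."""
    return sorted(weights, key=lambda name: weights[name] / times[name], reverse=True)


# Hypothetical access times: MapQuest has the larger weight but is slow,
# so its ratio (2 / 4 = 0.5) is lower than that of NYTimes (1 / 1 = 1.0).
order = probe_order({"MapQuest": 2, "NYTimes": 1},
                    {"MapQuest": 4.0, "NYTimes": 1.0})
```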
The Upper Algorithm
Selects a pair (object, source) to probe next.
Based on the property:
The object with the highest upper bound will be probed before top-k solution is reached.
Object is one of top-k objects
Object is not one of top-k objects
An Example Execution of Upper
tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1
Final Score = (3 × scoreZagat + 2 × scoreMQ + 1 × scoreNYT) / 6
Initial Threshold = 1

Step  Access             Returned   Upper Bound  Threshold
1     GetNextS(q)        o1, 0.9    o1: 0.95     0.95
2     GetScoreR1(q,o1)   0.1        o1: 0.65     0.95
3     GetNextS(q)        o2, 0.8    o2: 0.9      0.9
4     GetScoreR1(q,o2)   0.7        o2: 0.8      0.9
5     GetNextS(q)        o3, 0.45   o3: 0.725    0.725
6     GetScoreR2(q,o2)   0.7        o2: 0.75     0.725

Object  Upper Bound  S(Zagat)  R1(MQ)  R2(NYT)  Final Score
o1      0.65         0.9       0.1
o2      0.75         0.8       0.7     0.7      0.75
o3      0.725        0.45

Total Execution Time = 6
The Upper Algorithm
Choose object with highest upper bound.
If some unseen object can have higher upper bound:
Access S-Source S
Else:
Access best R-Source Ri for chosen object
Keep best k objects
If top-k objects have final values higher than maximum possible value of any other object, return top-k objects.
Interleaves accesses on objects
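The control loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: sources are in-memory structures, the scoring function is assumed to be a weighted average, an unknown attribute score is optimistically taken to be 1, and "best R-Source" is reduced to the best weight/time ratio among the sources not yet probed for the chosen object. All names are made up; the data is the three-object example used in the talk.

```python
from typing import Dict, List, Sequence, Tuple


def upper(sorted_stream: Sequence[Tuple[str, float]],
          r_sources: Sequence[Dict[str, float]],
          weights: Sequence[float],
          t_s: float, t_r: Sequence[float], k: int):
    """Return (top-k list of (object, score), total access time)."""
    total_w = sum(weights)
    n_r = len(r_sources)
    # Simplification: probe R-Sources in decreasing weight/time order.
    order = sorted(range(n_r), key=lambda i: weights[i + 1] / t_r[i], reverse=True)

    def bound(s_score: float, known: Dict[int, float]) -> float:
        # Unknown attribute scores are optimistically assumed to be 1.
        total = weights[0] * s_score
        for i in range(n_r):
            total += weights[i + 1] * known.get(i, 1.0)
        return total / total_w

    s_scores: Dict[str, float] = {}
    probed: Dict[str, Dict[int, float]] = {}
    answer: List[Tuple[str, float]] = []
    returned = set()
    pos, last_s, elapsed = 0, 1.0, 0.0

    def ub(obj: str) -> float:
        return bound(s_scores[obj], probed[obj])

    while len(answer) < k:
        exhausted = pos >= len(sorted_stream)
        # Upper bound of any unseen object; none remain once S is exhausted.
        threshold = -1.0 if exhausted else bound(last_s, {})
        candidates = [o for o in s_scores if o not in returned]
        best = max(candidates, key=ub, default=None)
        if best is None and exhausted:
            break
        if best is None or threshold > ub(best):
            obj, last_s = sorted_stream[pos]      # access S-Source
            pos += 1
            s_scores[obj], probed[obj] = last_s, {}
            elapsed += t_s
        elif len(probed[best]) == n_r:
            answer.append((best, ub(best)))       # fully probed: final score
            returned.add(best)
        else:
            i = next(j for j in order if j not in probed[best])
            probed[best][i] = r_sources[i][best]  # probe one R-Source
            elapsed += t_r[i]
    return answer, elapsed


stream = [("o1", 0.9), ("o2", 0.8), ("o3", 0.45)]   # S (Zagat)
r1 = {"o1": 0.1, "o2": 0.7, "o3": 0.6}              # R1 (MapQuest)
r2 = {"o1": 0.5, "o2": 0.7, "o3": 0.3}              # R2 (NYTimes)
top, elapsed = upper(stream, [r1, r2], [3, 2, 1], 1.0, [1.0, 1.0], k=1)
```

On this data the loop makes three sorted accesses and three probes. It never probes R2 for o1 or any R-Source for o3, because their upper bounds fall below the final score of o2.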
Selecting the Best Source
Upper relies on expected values to make its choices.
Upper computes “best subset” of sources that is expected to:
1. Compute the final score for k top objects.
2. Discard other objects as fast as possible.
Upper chooses best source in “best subset”.
Best weight/time ratio.
Experimental Setting: Synthetic Data
Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
tR(Ri): integer between 1 and 10.
tS(S) ∈ {0.1, 0.2, …, 1.0}.
Query execution time: ttotal
Default: k=50, 10000 objects, uniform data.
Results: average ttotal of 100 queries.
Optimal assumes complete knowledge
(unrealistic, but useful performance bound)
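A setting of this kind could be generated along the following lines. The sketch covers only the uniform data set; the number of R-Sources, the seed, and the function name are assumptions, since the slide does not state them.

```python
import random
from typing import List, Tuple


def make_synthetic(num_objects: int = 10000, num_r_sources: int = 3,
                   seed: int = 0) -> Tuple[List[List[float]], float, List[int]]:
    """Uniform attribute scores plus random access times for one S-Source
    (column 0) and num_r_sources R-Sources (remaining columns)."""
    rng = random.Random(seed)
    scores = [[rng.random() for _ in range(num_r_sources + 1)]
              for _ in range(num_objects)]
    t_r = [rng.randint(1, 10) for _ in range(num_r_sources)]  # integer in 1..10
    t_s = rng.randint(1, 10) / 10                             # 0.1, 0.2, ..., 1.0
    return scores, t_s, t_r


scores, t_s, t_r = make_synthetic(num_objects=100)
```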
Experiments: Varying Number of Objects Requested k
Experiments: Varying Number of Database Objects N
Experimental Setting: Real Web Data
S-Source: Verizon Yellow Pages (sorted by distance)
R-Sources:
Subway Navigator: subway time
Altavista: popularity
MapQuest: driving time
NYTimes Review: food and price ratings
Zagat: food, service, décor, and price ratings
Experiments: Real-Web Data
# of Random Accesses
Evaluation Conclusions
TA-EP and TA-Opt much faster than TA-Adapt.
Upper significantly better than all versions of TA.
Upper close to optimal.
Real data experiments: Upper faster than TA adaptations.
Conclusion
Introduced first algorithm for top-k processing over R-Sources.
Adapted TA to this scenario.
Presented new algorithms: Upper and Pick (see paper).
Evaluated our new algorithms with both real and synthetic data.
Upper close to optimal
Current and Future Work
Relaxation of the Source Model
Current source model limited
Any number of R-Sources and SR-Sources
Upper has good results even with only SR-Sources
Parallelism
Define a query model for parallel access to sources
Adapt our algorithms to this model
Approximate Queries
References
Top-k Queries:
Evaluating Top-k Selection Queries. S. Chaudhuri and L. Gravano. VLDB 1999.
TA algorithm:
Optimal Aggregation Algorithms for Middleware. R. Fagin, A. Lotem, and M. Naor. PODS 2001.
Variations of TA:
Query Processing Issues on Image (Multimedia) Databases. S. Nepal and V. Ramakrishna. ICDE 1999.
Optimizing Multi-Feature Queries for Image Databases. U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000.
Expensive Predicates:
Predicate Migration: Optimizing Queries with Expensive Predicates. J.M. Hellerstein and M. Stonebraker. SIGMOD 1993.
Real-web Experiments
Real-web Experiments with Adaptive Time
Relaxing the Source Model
Upper
TA-EP
Upcoming Journal Paper
Variations of Upper
Select best source
Data Structures
Complexity Analysis
Relaxing Source Model
Adaptation of our Algorithms
New Algorithms
Variations of Data and Query Model to handle real web data
Optimality
TA instance optimal over:
Algorithms that do not make wild guesses.
Databases that satisfy the distinctness property.
TAZ instance optimal over:
Algorithms that do not make wild guesses.
No complexity analysis of our algorithms, but experimental evaluation instead