Evaluating Top-k Queries over Web-Accessible Databases

Nicolas Bruno

Luis Gravano

Amélie Marian

Columbia University

2/27/2002

“Top-k” Queries Natural in Many Scenarios

Example: NYC Restaurant Recommendation Service.

Goal: Find best restaurants for a user:

Close to address: “2290 Broadway”

Price around $25

Good rating

Query: Specification of Flexible Preferences

Answer: Best k Objects for Distance Function

Attributes often Handled by External Sources

MapQuest returns the distance between two addresses.

NYTimes Review gives the price range of a restaurant.

Zagat gives a food rating to the restaurant.

“Top-k” Query Processing Challenges

Attributes handled by external sources (e.g., MapQuest distance).

External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).

Existing algorithms do not handle all types of interfaces.

Processing Top-k Queries over Web-Accessible Data Sources

Data and query model

Algorithms for sources with different interfaces

Our new algorithm: Upper

Experimental results

Data Model

Top-k Query: assignment of weights and target values to attributes

Target values: < $25, “2290 Broadway”, very good > (preferred price, close to address, preferred rating)

Weights: <4, 1, 2> (price: most important attribute)

Combined in scoring function
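
A minimal sketch of such a scoring function, assuming the weighted-average form used in the example executions later in the talk (weighted sum of attribute scores divided by the sum of the weights). The function name and the sample attribute scores are hypothetical; only the weights <4, 1, 2> come from the slide.

```python
from typing import Sequence


def combined_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-attribute scores, each assumed to lie in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per attribute score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


# Hypothetical attribute scores for one restaurant, in the order
# (price, distance, rating), combined with the weights <4, 1, 2>.
example = combined_score([0.5, 1.0, 0.75], [4, 1, 2])
```

Dividing by the sum of the weights keeps the combined score in [0, 1] whenever every attribute score is in [0, 1].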

Sorted Access Source S

Return objects sorted by scores for a given query.

Example: Zagat

GetNextS interface

S-Source

Access Time: tS(S)

Random Access Source R

Return the score of a given object for a given query.

Example: MapQuest

R-Source

Access Time: tR(R)

GetScoreR interface
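
The two access interfaces can be sketched as small in-memory classes. This is an illustration only: the class and method names are hypothetical, the query argument q is dropped by assuming the scores were already computed for one fixed query, and real sources would be remote Web services rather than dictionaries.

```python
from typing import Dict, Optional, Tuple


class SortedSource:
    """S-Source: returns (object, score) pairs in decreasing score order."""

    def __init__(self, scores: Dict[str, float], access_time: float = 1.0) -> None:
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._cursor = 0
        self.access_time = access_time  # plays the role of tS(S)

    def get_next(self) -> Optional[Tuple[str, float]]:
        """Counterpart of GetNextS: next best object, or None when exhausted."""
        if self._cursor >= len(self._ranked):
            return None
        item = self._ranked[self._cursor]
        self._cursor += 1
        return item


class RandomSource:
    """R-Source: returns the score of one given object."""

    def __init__(self, scores: Dict[str, float], access_time: float = 1.0) -> None:
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)

    def get_score(self, obj: str) -> float:
        """Counterpart of GetScoreR: score of the requested object."""
        return self._scores[obj]


# Illustrative scores only.
zagat = SortedSource({"o2": 0.8, "o1": 0.9, "o3": 0.45})
mapquest = RandomSource({"o1": 0.1, "o2": 0.7, "o3": 0.6})
```

The key difference is that an S-Source decides which object comes next, whereas an R-Source can only be asked about an object that is already known.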

Query Model

Attribute scores between 0 and 1.

Sequential access to sources.

Score Ties broken arbitrarily.

No wild guesses.

One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)

Query Processing Goals

Processing top-k queries over R-Sources.

Returning exact answer to top-k query q.

Minimizing query response time.

Naïve solution too expensive (access all sources for all objects).

Example: NYC Restaurants

S-Source:

Zagat: restaurants sorted by food rating.

R-Sources:

MapQuest: distance between two input addresses.

User address: “2290 Broadway”

NYTimes Review: price range of the input restaurant.

Target Value: $25

TA Algorithm for SR-Sources

Perform sorted access sequentially to all SR-Sources.

Completely probe every object found for all attributes using random access.

Keep best k objects.

Stop when scores of best k objects are no less than maximum possible score of unseen objects (threshold).

Fagin, Lotem, and Naor (PODS 2001)

Does NOT handle R-Sources

Our Adaptation of TA Algorithm for R-Sources: TA-Adapt

Perform sorted access to S-Source S.

Probe every R-Source Ri for newly found object.

Keep best k objects.

Stop when scores of best k objects are no less than maximum possible score of unseen objects (threshold).

An Example Execution of TA-Adapt

tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1

Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6

Initial Threshold = 1

Step  Access            Score returned  Threshold
1     GetNextS(q)       o1: 0.9         0.95
2     GetScoreR1(q,o1)  0.1             0.95
3     GetScoreR2(q,o1)  0.5             0.95
4     GetNextS(q)       o2: 0.8         0.9
5     GetScoreR1(q,o2)  0.7             0.9
6     GetScoreR2(q,o2)  0.7             0.9
7     GetNextS(q)       o3: 0.45        0.725
8     GetScoreR1(q,o3)  0.6             0.725
9     GetScoreR2(q,o3)  0.3             0.725

Object  S(Zagat)  R1(MQ)  R2(NYT)  Final Score
o1      0.9       0.1     0.5      0.56
o2      0.8       0.7     0.7      0.75
o3      0.45      0.6     0.3      0.55

Total Execution Time = 9
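
The execution above can be replayed with a short simulation. This is a sketch under stated assumptions rather than the authors' implementation: sources are plain dictionaries, the function name `ta_adapt` and its parameters are hypothetical, and the data are the attribute scores from the example. With the scores listed for o3, the formula gives 0.475 rather than 0.55; the result and the total time are the same either way.

```python
from typing import Dict, List, Sequence, Tuple


def ta_adapt(
    s_source: Dict[str, float],
    r_sources: Sequence[Dict[str, float]],
    weights: Sequence[float],  # weights[0] is for S, weights[1:] for the R-Sources
    k: int,
    t_s: float = 1.0,
    t_r: Sequence[float] = (),
) -> Tuple[List[Tuple[str, float]], float]:
    """Return the best k (object, score) pairs and the total access time."""
    access_times = list(t_r) or [1.0] * len(r_sources)
    total_weight = sum(weights)
    # Sorted access: the S-Source hands out objects by decreasing score.
    ranked = sorted(s_source.items(), key=lambda kv: kv[1], reverse=True)
    best: List[Tuple[float, str]] = []
    elapsed = 0.0
    for obj, s_score in ranked:
        elapsed += t_s
        weighted = weights[0] * s_score
        # Probe every R-Source for the newly found object.
        for w, source, t in zip(weights[1:], r_sources, access_times):
            weighted += w * source[obj]
            elapsed += t
        best.append((weighted / total_weight, obj))
        best.sort(reverse=True)
        del best[k:]  # keep only the best k objects
        # Unseen objects score at most s_score in S and at most 1 elsewhere.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_weight
        if len(best) == k and best[-1][0] >= threshold:
            break
    return [(obj, score) for score, obj in best], elapsed


zagat = {"o1": 0.9, "o2": 0.8, "o3": 0.45}
mapquest = {"o1": 0.1, "o2": 0.7, "o3": 0.6}
nytimes = {"o1": 0.5, "o2": 0.7, "o3": 0.3}
top_k, total_time = ta_adapt(zagat, [mapquest, nytimes], [3, 2, 1], k=1)
```

On this data the loop stops after the third object, once the best score seen (o2, 0.75) is no less than the threshold 0.725, for a total of nine unit-cost accesses.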

Improvements over TA-Adapt

Add a shortcut test after each random-access probe (TA-Opt).

Exploit techniques for processing selections with expensive predicates (TA-EP).

Reorder accesses to R-Sources.

Best weight/time ratio.
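
A small sketch of the reordering idea, assuming that "best weight/time ratio" means probing first the R-Source with the largest weight per unit of access time. The `RSource` type, the function name and the access times below are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class RSource(NamedTuple):
    name: str
    weight: float       # weight of the attribute in the scoring function
    access_time: float  # time of one random access, tR(Ri)


def probe_order(sources: Sequence[RSource]) -> List[RSource]:
    """Order R-Sources by decreasing weight/time ratio."""
    return sorted(sources, key=lambda s: s.weight / s.access_time, reverse=True)


# Hypothetical access times: MapQuest carries more weight but is much slower.
order = probe_order([
    RSource("MapQuest", weight=2, access_time=4),
    RSource("NYTimes Review", weight=1, access_time=1),
])
names = [s.name for s in order]
```

Here NYTimes Review (ratio 1.0) is probed before MapQuest (ratio 0.5), because it lowers an object's maximum possible score more per unit of time spent.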

The Upper Algorithm

Selects a pair (object,source) to probe next.

Based on the property:

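The object with the highest upper bound will be probed before the top-k solution is reached:

Object is one of the top-k objects: it must be probed to compute its final score.

Object is not one of the top-k objects: it must be probed to lower its upper bound and discard it.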

An Example Execution of Upper

tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1

Final Score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6

Initial Threshold = 1

Step  Access            Score returned  Upper Bound of accessed object  Threshold
1     GetNextS(q)       o1: 0.9         0.95                            0.95
2     GetScoreR1(q,o1)  0.1             0.65                            0.95
3     GetNextS(q)       o2: 0.8         0.9                             0.9
4     GetScoreR1(q,o2)  0.7             0.8                             0.9
5     GetNextS(q)       o3: 0.45        0.725                           0.725
6     GetScoreR2(q,o2)  0.7             0.75                            0.725

Object  Upper Bound  S(Zagat)  R1(MQ)  R2(NYT)  Final Score
o1      0.65         0.9       0.1
o2      0.75         0.8       0.7     0.7      0.75
o3      0.725        0.45

Total Execution Time = 6
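
The upper bounds in this example follow from the scoring function by assuming the best possible score, 1, for every attribute that has not been probed yet. A minimal sketch (the function name is hypothetical, and `None` marks an unprobed attribute):

```python
from typing import Optional, Sequence


def upper_bound(known: Sequence[Optional[float]], weights: Sequence[float]) -> float:
    """Best final score still possible: unprobed attributes (None) count as 1."""
    total_weight = sum(weights)
    return sum(
        w * (1.0 if s is None else s) for w, s in zip(weights, known)
    ) / total_weight


w = [3, 2, 1]
o1_after_sorted_access = upper_bound([0.9, None, None], w)  # about 0.95
o1_after_r1_probe = upper_bound([0.9, 0.1, None], w)        # about 0.65
o2_after_r1_probe = upper_bound([0.8, 0.7, None], w)        # about 0.8
o2_fully_probed = upper_bound([0.8, 0.7, 0.7], w)           # about 0.75
```

Once every attribute of an object is known, its upper bound coincides with its final score.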

The Upper Algorithm

Choose object with highest upper bound.

If some unseen object can have higher upper bound:

Access S-Source S

Else:

Access best R-Source Ri for chosen object

Keep best k objects

If top-k objects have final values higher than maximum possible value of any other object, return top-k objects.

Interleaves accesses on objects
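
The steps above can be sketched as follows. This is an illustrative simplification, not the authors' code: sources are dictionaries, all names are hypothetical, sorted access is chosen only when an unseen object could have a strictly higher upper bound, and the "best source" is simply the unprobed R-Source with the best weight/time ratio (the "best subset" computation is not modeled).

```python
from typing import Dict, List, Optional, Sequence, Tuple


def upper(
    s_source: Dict[str, float],
    r_sources: Sequence[Dict[str, float]],
    weights: Sequence[float],  # weights[0] is for S, weights[1:] for the R-Sources
    k: int,
    t_s: float = 1.0,
    t_r: Sequence[float] = (),
) -> Tuple[List[Tuple[str, float]], float]:
    """Return the top-k (object, score) pairs and the total access time."""
    access_times = list(t_r) or [1.0] * len(r_sources)
    total_weight = sum(weights)
    ranked = sorted(s_source.items(), key=lambda kv: kv[1], reverse=True)
    cursor, last_s = 0, 1.0
    known: Dict[str, List[Optional[float]]] = {}  # object -> [S, R1, ...]
    answer: List[Tuple[str, float]] = []
    elapsed = 0.0

    def bound(scores: Sequence[Optional[float]]) -> float:
        # Unprobed attributes (None) are assumed to have the maximum score 1.
        return sum(
            w * (1.0 if s is None else s) for w, s in zip(weights, scores)
        ) / total_weight

    while len(answer) < k:
        exhausted = cursor >= len(ranked)
        unseen = -1.0 if exhausted else bound([last_s] + [None] * len(r_sources))
        top = max(known, key=lambda o: bound(known[o]), default=None)
        if top is None or unseen > bound(known[top]):
            if exhausted:
                break  # fewer than k objects exist
            obj, last_s = ranked[cursor]  # sorted access to S
            cursor += 1
            known[obj] = [last_s] + [None] * len(r_sources)
            elapsed += t_s
        elif None not in known[top]:
            # Fully probed and no other object can beat it: part of the answer.
            answer.append((top, bound(known.pop(top))))
        else:
            unprobed = [j for j, v in enumerate(known[top][1:]) if v is None]
            i = max(unprobed, key=lambda j: weights[j + 1] / access_times[j])
            known[top][i + 1] = r_sources[i][top]  # random access to R-Source i
            elapsed += access_times[i]
    return answer, elapsed


zagat = {"o1": 0.9, "o2": 0.8, "o3": 0.45}
mapquest = {"o1": 0.1, "o2": 0.7, "o3": 0.6}
nytimes = {"o1": 0.5, "o2": 0.7, "o3": 0.3}
top_k, total_time = upper(zagat, [mapquest, nytimes], [3, 2, 1], k=1)
```

On the example data this returns o2 with score 0.75 after six unit-cost accesses. It never probes R2 for o1 or any R-Source for o3, which is where the saving over TA-Adapt comes from.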

Selecting the Best Source

Upper relies on expected values to make its choices.

Upper computes “best subset” of sources that is expected to:

1. Compute the final score for k top objects.

2. Discard other objects as fast as possible.

Upper chooses best source in “best subset”.

Best weight/time ratio.

Experimental Setting: Synthetic Data

Attribute scores randomly generated (three data sets: uniform, Gaussian and correlated).

tR(Ri): integer between 1 and 10.

tS(S) ∈ {0.1, 0.2, …, 1.0}.

Query execution time: ttotal

Default: k=50, 10000 objects, uniform data.

Results: average ttotal of 100 queries.

Optimal assumes complete knowledge

(unrealistic, but useful performance bound)
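
A rough sketch of how the uniform data set and the access times might be generated with these defaults. The number of R-Sources is not stated on the slide, so `n_r_sources` is a hypothetical parameter, as are the function name and the seed; the Gaussian and correlated data sets are not modeled.

```python
import random
from typing import List, Tuple


def make_uniform_dataset(
    n_objects: int = 10000,
    n_r_sources: int = 5,  # assumption: not given on the slide
    seed: int = 0,
) -> Tuple[List[List[float]], float, List[int]]:
    """Return (scores, t_s, t_r) for one synthetic configuration."""
    rng = random.Random(seed)
    # One score per source (the S-Source first), drawn uniformly from [0, 1).
    scores = [
        [rng.random() for _ in range(n_r_sources + 1)] for _ in range(n_objects)
    ]
    # tR(Ri): integer between 1 and 10, inclusive.
    t_r = [rng.randint(1, 10) for _ in range(n_r_sources)]
    # tS(S): one of 0.1, 0.2, ..., 1.0.
    t_s = rng.choice([i / 10 for i in range(1, 11)])
    return scores, t_s, t_r


scores, t_s, t_r = make_uniform_dataset()
```

Fixing the seed makes a configuration reproducible, which helps when averaging the total time over many queries.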

Experiments: Varying Number of Objects Requested k

Experiments: Varying Number of Database Objects N

Experimental Setting: Real Web Data

S-Source: Verizon Yellow Pages (sorted by distance)

R-Sources:

Subway Navigator: Subway time

Altavista: Popularity

MapQuest: Driving time

NYTimes Review: Food and price ratings

Zagat: Food, Service, Décor and Price ratings

Experiments: Real-Web Data

# of Random Accesses

Evaluation Conclusions

TA-EP and TA-Opt much faster than TA-Adapt.

Upper significantly better than all versions of TA.

Upper close to optimal.

Real data experiments: Upper faster than TA adaptations.

Conclusion

Introduced first algorithm for top-k processing over R-Sources.

Adapted TA to this scenario.

Presented new algorithms: Upper and Pick (see paper).

Evaluated our new algorithms with both real and synthetic data.

Upper close to optimal

Current and Future Work

Relaxation of the Source Model

Current source model limited

Any number of R-Sources and SR-Sources

Upper has good results even with only SR-Sources

Parallelism

Define a query model for parallel access to sources

Adapt our algorithms to this model

Approximate Queries

References

Top-k Queries:

Evaluating Top-k Selection Queries, S. Chaudhuri and L. Gravano. VLDB 1999

TA algorithm:

Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001

Variations of TA:

Query Processing Issues on Image (Multimedia) Databases, S. Nepal and V. Ramakrishna. ICDE 1999

Optimizing Multi-Feature Queries for Image Databases, U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000

Expensive Predicates:

Predicate Migration: Optimizing Queries with Expensive Predicates, J.M. Hellerstein and M. Stonebraker. SIGMOD 1993

Real-web Experiments

Real-web Experiments with Adaptive Time

Relaxing the Source Model

Upper

TA-EP

Upcoming Journal Paper

Variations of Upper

Select best source

Data Structures

Complexity Analysis

Relaxing Source Model

Adaptation of our Algorithms

New Algorithms

Variations of Data and Query Model to handle real web data

Optimality

TA instance optimal over:

Algorithms that do not make wild guesses.

Databases that satisfy the distinctness property.

TAZ instance optimal over:

Algorithms that do not make wild guesses.

No complexity analysis of our algorithms, but experimental evaluation instead