MovieMiner

A collaborative filtering system for predicting Netflix users’ movie ratings

[ECS289G Data Mining]

Team Spelunker: Justin Becker, Philip Fisher-Ogden

The Problem

•Given a set of <movie, user, rating, date> entries, predict the rating values for unknown <movie, user, ?, date> entries.

•Example:

–X-Men, Philip, 5, 05-02-2007

–Spiderman 3, Philip, 4, 05-10-2007

–X-Men, Justin, 4, 04-05-2006

–Spiderman 3, Justin, ?, 02-28-2008

•What rating do you predict Justin would give Spiderman 3?

Our Approach - Motivation

•Motivating Factors

–Review current approaches taken by the Netflix Prize top leaders

–Leverage and extend existing libraries, to minimize the ramp-up time required to implement a working system

–Utilize the UC Davis elvis cluster to alleviate any scale problems

What - Our Approach

•Collaborative Filtering (CF)

–Weighted average of predictions from the following recommenders:

•Slope One recommender

•Item-based recommender

•User-based recommender
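The weighted average can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code: the function name `blend`, the dictionary keys, and the choice to renormalize the weights when a recommender returns no estimate are all assumptions. The weights are the ones listed under Final Results, and the sample predictions are the values from the worked examples in this deck.

```python
def blend(predictions, weights):
    """Weighted average of per-recommender predictions.

    Recommenders that returned None are skipped and the remaining
    weights are renormalized (an assumption, not taken from the deck).
    """
    num = 0.0
    den = 0.0
    for name, value in predictions.items():
        if value is None:
            continue
        num += weights[name] * value
        den += weights[name]
    return num / den if den > 0 else None


# Weights from the Final Results slide; predictions from the worked examples.
weights = {"slope_one": 0.70, "user": 0.25, "item": 0.05}
predictions = {"slope_one": 5.0, "user": 3.690983, "item": 4.719575}
blended = blend(predictions, weights)
```

With these inputs the Slope One estimate dominates the blend, which reflects the 70% weight it carries in the final configuration.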

What - Our Approach

•Leveraging three CF recommenders

–Similarities:

•Each uses prior preference information to predict values for unrated entries

–Differences:

•How is the similarity between two entries computed?

•How are the neighbors selected?

•How are the interpolation weights determined?

Image source: http://taste.sourceforge.net/

Why - Our Approach

•Why Collaborative Filtering?

–“Those who agreed in the past tend to agree again in the future”

–Requires no external data sources

–Uses k-Nearest-Neighbor approaches to predict the class (rating) of an unknown entry

–A full-featured CF Java library already exists: Taste

–CF is one of the two main approaches used by the Netflix Prize top leaders (the other being SVD).

How – Slope One Recommender

•Introduced by Daniel Lemire and Anna Maclachlan

•Simple and accurate predictor

•Predicts from the average rating difference between two items

•A weighted average over item pairs produces better results

•Weight = number of users who rated both items
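A minimal Python sketch of the weighted Slope One idea follows. It is an illustration only, not the Taste implementation: the function name `weighted_slope_one`, the nested-dictionary layout, and the two users `u1` and `u2` are hypothetical. The toy ratings are chosen so that the average difference between the two movies is 1, mirroring the arithmetic of the example that follows.

```python
from collections import defaultdict


def weighted_slope_one(ratings, user, target):
    """Predict `user`'s rating for `target` with weighted Slope One.

    ratings: dict mapping user -> dict mapping item -> rating.
    """
    diff_sum = defaultdict(float)  # sum of (target - item) over co-raters
    count = defaultdict(int)       # number of users who rated both items
    for prefs in ratings.values():
        if target not in prefs:
            continue
        for item, rating in prefs.items():
            if item != target:
                diff_sum[item] += prefs[target] - rating
                count[item] += 1

    num = 0.0
    den = 0
    for item, rating in ratings[user].items():
        if count.get(item, 0) == 0:
            continue
        avg_diff = diff_sum[item] / count[item]
        # Each item's estimate is weighted by its number of co-raters.
        num += (rating + avg_diff) * count[item]
        den += count[item]
    return num / den if den else None


# Hypothetical toy data: both co-raters score Spiderman 3 one point higher.
toy = {
    "u1": {"X-Men": 3, "Spiderman 3": 4},
    "u2": {"X-Men": 4, "Spiderman 3": 5},
    "Justin": {"X-Men": 4},
}
prediction = weighted_slope_one(toy, "Justin", "Spiderman 3")
```

With a single item in Justin's history the weighting has no effect; it matters once several rated items contribute estimates backed by different numbers of co-raters.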

Ex: Slope One Recommender

Average difference between X-Men and Spiderman 3 is 1.

Justin's predicted rating for Spiderman 3 is then 4 + 1 = 5

[Ratings table: users Justin, Philip, Dan, Ian, Michael (rows) by movies X-Men, Spiderman 3, Batman Begins, Nacho Libre (columns)]

How – User-based Recommender

•Predicts a user u’s rating for an item i:

–Find the k nearest neighbors to the user u

•Similarity measure = Pearson correlation

•Missing preferences are inferred by using the user’s average rating

–Interpolate between those neighbors’ ratings for item i

•Interpolation weights = Pearson correlation

•Neighbors are ignored if they did not rate i
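The two steps above can be written out as formulas. The notation is an assumption introduced here, not taken from the slides: r_{u,j} is user u's rating of item j (replaced by the user's average when missing), \bar{r}_u is that average, and N_k(u) is the set of u's k nearest neighbors. The unnormalized-by-absolute-value denominator matches the worked example in this deck, where both neighbor similarities are positive.

```latex
\[
\mathrm{sim}(u,v) =
\frac{\sum_{j} (r_{u,j}-\bar{r}_u)(r_{v,j}-\bar{r}_v)}
     {\sqrt{\sum_{j} (r_{u,j}-\bar{r}_u)^2}\;\sqrt{\sum_{j} (r_{v,j}-\bar{r}_v)^2}}
\]

\[
\hat{r}_{u,i} =
\frac{\sum_{v \in N_k(u),\ v \text{ rated } i} \mathrm{sim}(u,v)\, r_{v,i}}
     {\sum_{v \in N_k(u),\ v \text{ rated } i} \mathrm{sim}(u,v)}
\]
```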

Ex: User-based Recommender

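        | X-Men | Spiderman 3 | Batman Begins | Nacho Libre | avg
Justin  | 4     | ?           | 5             | 4           | 4.33333
Philip  | 5     | 3           | 4             | 2           | 3.5
Dan     | 4     | 4           | 5             | 5           | 4.5
Ian     | 3     | 4           | 3             | 3           | 3.25
Michael | 2     | 3           | 1             | 5           | 2.75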

centered data (user average)

        | X-Men   | Spiderman 3 | Batman Begins | Nacho Libre | EuclNorm
Justin  | -0.3333 |             | 0.6667        | -0.3333     | 0.816497
Philip  | 1.5     | -0.5        | 0.5           | -1.5        | 2.236068
Dan     | -0.5    | -0.5        | 0.5           | 0.5         | 1
Ian     | -0.25   | 0.75        | -0.25         | -0.25       | 0.866025
Michael | -0.75   | 0.25        | -1.75         | 2.25        | 2.95804

Ex: User-based Recommender

•Similarities are calculated using the Pearson correlation coefficient (on centered data):

User-user similarities

Justin-Philip  | 0.182574186
Justin-Dan     | 0.40824829
Justin-Ian     | -3.14018E-16 (≈ 0)
Justin-Michael | -0.690065559

•Interpolation between nearest neighbors produces the prediction:

Prediction using 2-nearest neighbors (Philip, Dan): 3.690983006

round(prediction) = 4
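The user-based example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the Taste code: the function names are hypothetical, and the raw ratings in `RATINGS` are reconstructed by adding each user's average back to the centered values shown in the example.

```python
import math

# Raw ratings reconstructed from the deck's averages and centered values
# (an assumption; Justin has no rating for Spiderman 3).
RATINGS = {
    "Justin": {"X-Men": 4, "Batman Begins": 5, "Nacho Libre": 4},
    "Philip": {"X-Men": 5, "Spiderman 3": 3, "Batman Begins": 4, "Nacho Libre": 2},
    "Dan": {"X-Men": 4, "Spiderman 3": 4, "Batman Begins": 5, "Nacho Libre": 5},
    "Ian": {"X-Men": 3, "Spiderman 3": 4, "Batman Begins": 3, "Nacho Libre": 3},
    "Michael": {"X-Men": 2, "Spiderman 3": 3, "Batman Begins": 1, "Nacho Libre": 5},
}
ITEMS = ["X-Men", "Spiderman 3", "Batman Begins", "Nacho Libre"]


def centered(user):
    """Mean-center a user's ratings; a missing rating becomes the mean (0)."""
    prefs = RATINGS[user]
    mean = sum(prefs.values()) / len(prefs)
    return [prefs.get(item, mean) - mean for item in ITEMS]


def pearson(u, v):
    """Pearson correlation computed on the centered rating vectors."""
    a, b = centered(u), centered(v)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_user_based(user, item, k=2):
    """Similarity-weighted average over the k nearest neighbors who rated item."""
    others = [o for o in RATINGS if o != user]
    nearest = sorted(((pearson(user, o), o) for o in others), reverse=True)[:k]
    rated = [(s, o) for s, o in nearest if item in RATINGS[o]]
    total = sum(s for s, _ in rated)
    return sum(s * RATINGS[o][item] for s, o in rated) / total if total else None


prediction = predict_user_based("Justin", "Spiderman 3", k=2)
```

Selecting neighbors first and then dropping those who did not rate the item follows the order described on the User-based Recommender slide.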

How – Item-based Recommender

•Predicts a user u’s rating for an item i:

–Find the k most similar items to i

•Similarity measure = Pearson correlation

–Keep only similar items also rated by u

–Interpolate between the remaining items’ ratings

•Interpolation weights = Pearson correlation

–Note: Item-item similarities allow for more efficient computations, as cnt(items) << cnt(users) and, thus, the similarity matrix can be pre-computed and leveraged as needed.

Ex: Item-based Recommender

        | X-Men | Spiderman 3 | Batman Begins | Nacho Libre
Justin  | 4     | ?           | 5             | 4
Philip  | 5     | 3           | 4             | 2
Dan     | 4     | 4           | 5             | 5
Ian     | 3     | 4           | 3             | 3
Michael | 2     | 3           | 1             | 5
avg     | 3.6   | 3.5         | 3.6           | 3.8

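centered data (item average)

          | X-Men   | Spiderman 3 | Batman Begins | Nacho Libre
Justin    | 0.4     |             | 1.4           | 0.2
Philip    | 1.4     | -0.5        | 0.4           | -1.8
Dan       | 0.4     | 0.5         | 1.4           | 1.2
Ian       | -0.6    | 0.5         | -0.6          | -0.8
Michael   | -1.6    | -0.5        | -2.6          | 1.2
Eucl Norm | 2.24499 | 1           | 3.03973       | 2.6

(Norms are taken over the four users who rated Spiderman 3.)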

Ex: Item-based Recommender

•Similarities are calculated using the Pearson correlation coefficient (on centered data):

Item-item similarities

Spiderman 3 - X-Men (S-Xm)         | 0
Spiderman 3 - Batman Begins (S-BB) | 0.493463771
Spiderman 3 - Nacho Libre (S-NL)   | 0.192307692

•Interpolation between nearest neighbors produces the prediction:

Prediction from 2-nearest neighbors (BB, NL): 4.719574665

round(prediction) = 5
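The item-based example can be checked the same way. The sketch below is illustrative rather than the Taste implementation: the function names are hypothetical, the raw ratings are reconstructed from the item averages and centered values in the example, and similarities are computed over users who rated both items, which is the reading that reproduces the numbers on the slide.

```python
import math

# Raw ratings reconstructed from the deck's item averages and centered values
# (an assumption; Justin has no rating for Spiderman 3).
RATINGS = {
    "Justin": {"X-Men": 4, "Batman Begins": 5, "Nacho Libre": 4},
    "Philip": {"X-Men": 5, "Spiderman 3": 3, "Batman Begins": 4, "Nacho Libre": 2},
    "Dan": {"X-Men": 4, "Spiderman 3": 4, "Batman Begins": 5, "Nacho Libre": 5},
    "Ian": {"X-Men": 3, "Spiderman 3": 4, "Batman Begins": 3, "Nacho Libre": 3},
    "Michael": {"X-Men": 2, "Spiderman 3": 3, "Batman Begins": 1, "Nacho Libre": 5},
}


def item_mean(item):
    """Average rating of an item over every user who rated it."""
    values = [prefs[item] for prefs in RATINGS.values() if item in prefs]
    return sum(values) / len(values)


def item_similarity(i, j):
    """Pearson-style similarity on item-centered ratings of common raters."""
    users = [u for u, prefs in RATINGS.items() if i in prefs and j in prefs]
    mean_i, mean_j = item_mean(i), item_mean(j)
    a = [RATINGS[u][i] - mean_i for u in users]
    b = [RATINGS[u][j] - mean_j for u in users]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_item_based(user, item, k=2):
    """Similarity-weighted average of the user's ratings for the k nearest items."""
    others = {j for prefs in RATINGS.values() for j in prefs if j != item}
    nearest = sorted(((item_similarity(item, j), j) for j in others), reverse=True)[:k]
    rated = [(s, j) for s, j in nearest if j in RATINGS[user]]
    total = sum(s for s, _ in rated)
    return sum(s * RATINGS[user][j] for s, j in rated) / total if total else None


prediction = predict_item_based("Justin", "Spiderman 3", k=2)
```

Because the item-item similarities do not depend on the target user, `item_similarity` results could be cached once and reused across users, which is the efficiency point made on the Item-based Recommender slide.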

Initial Results

•Bottom line: correct=91,934, loss=319,710

•Parameters used: 40% user, 60% item, 20 nearest neighbors

•~97% scored with composite recommender (user, item)

•~3% scored with random recommender

•RMSE 1.4445

Final Results

•Bottom line: correct=106,253, loss=236,523

•Parameters used: 25% user, 5% item, 70% slope one, 20 nearest neighbors

•~97% scored with composite recommender (user, item, slope one)

•~3% scored with weighted average

•RMSE: 1.0871
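For reference, RMSE (root mean squared error) can be computed as in the short Python sketch below. The function name `rmse` and the sample values are hypothetical and are not drawn from the team's evaluation run.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: one exact prediction and one that is off by 2 stars.
example = rmse([4.0, 5.0], [4.0, 3.0])
```

Since errors are squared before averaging, a few large misses raise RMSE more than many small ones, so the drop from 1.4445 to 1.0871 indicates fewer badly wrong predictions as well as a lower average error.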

Questions?

Conclusion