“You might also like: …” Privacy risks of collaborative filtering


Yuval Madar, June 2012

Based on a paper by J. A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten & V. Shmatikov

Recommender systems suggest items and other users to a user’s liking, deducing them from previous purchases.

Numerous examples

◦Amazon, iTunes, CNN, Last.fm, Pandora, Netflix, YouTube, Hunch, Hulu, LibraryThing, IMDb and many others.

User to item – “You might also like…”

Item to item – Similar items list

User to user – Another customer with common interests

Content Based filtering

◦Based on a priori similarity between items; recommendations are derived from a user’s own history.

◦Doesn’t pose a privacy threat.

Collaborative filtering

◦Based on correlations between other users’ purchases.

◦Our attacks will target this type of system.

Hybrid

◦A system employing both filtering techniques

The data the recommendation system uses is modeled as a matrix, where the rows correspond to users and the columns correspond to items.

Some auxiliary information on the target user is available (a subset of the target user’s transaction history).

An attack is successful if it allows the attacker to learn transactions that are not part of the attacker’s auxiliary information.
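The model above can be sketched in a few lines of Python. The item names, user names, and the helper `attack_succeeds` are hypothetical and chosen only for illustration; the sketch assumes a binary matrix where 1 marks a transaction.

```python
# Hypothetical toy data: rows are users, columns are items (1 = transaction).
ITEMS = ["book_a", "book_b", "album_c", "film_d"]
MATRIX = {
    "alice": [1, 1, 0, 1],
    "bob":   [0, 1, 1, 0],
}

def transactions(user):
    """Return the set of items the user has a transaction with."""
    return {item for item, bought in zip(ITEMS, MATRIX[user]) if bought}

def attack_succeeds(user, auxiliary, inferred):
    """True if the attacker correctly learned at least one transaction
    that was not already part of the auxiliary information."""
    new_and_true = (set(inferred) - set(auxiliary)) & transactions(user)
    return bool(new_and_true)

# The attacker knows alice bought book_a and infers film_d: a success.
print(attack_succeeds("alice", {"book_a"}, {"film_d"}))   # True
# Re-deriving an auxiliary item teaches the attacker nothing new.
print(attack_succeeds("alice", {"book_a"}, {"book_a"}))   # False
```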

Users’ public ratings and comments on products

Shared transactions (via Facebook or other media)

Discussions on third-party sites

Favorite books in a Facebook profile

Offline interactions (with friends, neighbors, coworkers, etc.)

Other sources…

Input:

◦a set of target items T and a set of auxiliary items A

Observe the related-items lists of the items in A until an item in T appears or moves up.

If a target item appears in enough related-items lists at the same time, the attacker may infer that it was bought by the target user.

Note 1 – Scoring may be far more complex, since different items in A are correlated (books that belong to a single series, bundle discounts, etc.).

Note 2 – It is preferable that A consist of obscure and uncommon items, to increase the effect of the target user’s choices on their related-items lists.
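A minimal sketch of this inference, assuming the attacker has taken two snapshots of each auxiliary item's related-items list. The scoring here is the simplest possible one (a plain count against a threshold); as Note 1 says, real scoring may be far more complex. The function names and the `threshold` parameter are assumptions made for illustration.

```python
def moved_up(before, after, item):
    """True if `item` newly appears in the list or sits at a higher rank."""
    if item not in after:
        return False
    if item not in before:
        return True
    return after.index(item) < before.index(item)

def infer_purchases(targets, lists_before, lists_after, threshold):
    """Return target items that appeared or moved up in at least
    `threshold` of the auxiliary items' related-items lists."""
    inferred = []
    for t in targets:
        score = sum(
            moved_up(lists_before[a], lists_after[a], t) for a in lists_after
        )
        if score >= threshold:
            inferred.append(t)
    return inferred

# Hypothetical snapshots keyed by auxiliary item.
before = {"a1": ["x", "y"], "a2": ["y", "z"], "a3": ["x"]}
after = {"a1": ["t", "x", "y"], "a2": ["y", "t", "z"], "a3": ["x"]}
print(infer_purchases(["t", "z"], before, after, threshold=2))  # ['t']
```

Item `t` newly appears in two of the three lists and is inferred; item `z` only moves down in one list and is rejected.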

On some sites, the covariance matrix, which describes the correlation between items on the site, is exposed to users. (Hunch is one such website.)

Similarly, the attacker watches for increases in the correlation between the auxiliary items and the target items.

Note 1 – Asynchronous updates to different matrix cells.

Note 2 – Inference probability improves if the auxiliary items are user-unique (no other user bought all the auxiliary items). This is more likely if some of them are unpopular, or if there are enough of them.
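The toy example below illustrates why a single purchase can be visible in a published covariance matrix. It assumes binary item columns and population covariance; how a real site computes its matrix is not specified here, and all names are hypothetical. The first half simulates the site, the second half is what the attacker does with two snapshots.

```python
def covariance(col_x, col_y):
    """Population covariance of two equally long 0/1 item columns."""
    n = len(col_x)
    mean_x = sum(col_x) / n
    mean_y = sum(col_y) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(col_x, col_y)) / n

def covariances_with_target(columns, auxiliary, target):
    """Site side: covariance of each auxiliary item with the target item."""
    return {a: covariance(columns[a], columns[target]) for a in auxiliary}

def rising_items(cov_before, cov_after):
    """Attacker side: auxiliary items whose covariance with the target grew."""
    return sorted(a for a in cov_after if cov_after[a] > cov_before[a])

# Four users; user 0 is the target user and owns both auxiliary items.
columns = {"aux1": [1, 0, 0, 1], "aux2": [1, 1, 0, 0], "target": [0, 0, 1, 0]}
snapshot_1 = covariances_with_target(columns, ["aux1", "aux2"], "target")

columns["target"] = [1, 0, 1, 0]  # user 0 acquires the target item
snapshot_2 = covariances_with_target(columns, ["aux1", "aux2"], "target")

print(rising_items(snapshot_1, snapshot_2))  # ['aux1', 'aux2']
```

Both covariances rise from -0.125 to 0.0, so the attacker sees every auxiliary item move toward the target item at once.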

System model

◦For each user, the system finds the k users most similar to that user and ranks the items purchased by them by total number of sales.

Active Attack

Create k dummy users, each buying all known auxiliary items.

With high probability, the k dummy users and the target user will be clustered together (given an auxiliary items list of size logarithmic in the total number of users; in practice, 8 items were found to be enough for most sites).

In that case, the recommendations to the dummy users will consist of transactions of the target user previously unknown to the attacker.

Note – The attack is more feasible in a system where user interactions with items do not involve spending money.
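The active attack can be simulated on a toy kNN recommender. This is a sketch under stated assumptions: Jaccard similarity stands in for whatever similarity measure a real system uses, items are ranked by how many neighbours bought them rather than by site-wide sales, and all user and item names are hypothetical.

```python
from collections import Counter

def similarity(items_a, items_b):
    """Jaccard similarity between two item sets (an assumed metric)."""
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0

def recommend(purchases, user, k):
    """Recommend items bought by the k users most similar to `user`,
    ranked by how many of those neighbours bought them."""
    others = [u for u in purchases if u != user]
    neighbours = sorted(
        others,
        key=lambda u: similarity(purchases[user], purchases[u]),
        reverse=True,
    )[:k]
    counts = Counter(
        item for u in neighbours for item in purchases[u] - purchases[user]
    )
    return [item for item, _ in counts.most_common()]

def active_attack(purchases, auxiliary, k):
    """Add k dummy users that buy every auxiliary item, then read the
    recommendations shown to one of them."""
    world = dict(purchases)
    for i in range(k):
        world[f"dummy_{i}"] = set(auxiliary)
    return recommend(world, "dummy_0", k)

purchases = {
    "target": {"a1", "a2", "a3", "secret"},
    "u1": {"a1", "x"},
    "u2": {"y", "z"},
}
print(active_attack(purchases, {"a1", "a2", "a3"}, k=2))  # ['secret']
```

The neighbours of `dummy_0` are the other dummy and the target user. The dummy contributes nothing new, so the only recommendation is the target user's undisclosed item.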

Attack 3 - kNN recommender systems inference

The main parameters for evaluating an inference attack are:

◦Yield – How many inferences are produced.

◦Accuracy – How likely each inference is to be correct.

Yield-accuracy tradeoff – algorithms with stricter accuracy requirements reject less probable inferences, which lowers yield.
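The tradeoff can be made concrete with a small evaluation helper. It assumes each inference carries a numeric score and that ground truth is available for evaluation; the scores and item names below are invented for illustration and are not results from the paper.

```python
def evaluate(scored_inferences, truth, threshold):
    """Return (yield, accuracy) for inferences whose score meets the threshold.
    Accuracy is None when no inference is accepted."""
    accepted = [item for item, score in scored_inferences if score >= threshold]
    if not accepted:
        return 0, None
    correct = sum(item in truth for item in accepted)
    return len(accepted), correct / len(accepted)

# Hypothetical scored inferences and the target user's true transactions.
scored = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.3)]
truth = {"a", "b", "d"}

print(evaluate(scored, truth, threshold=0.5))  # (2, 1.0)
print(evaluate(scored, truth, threshold=0.0))  # (4, 0.75)
```

Raising the threshold halves the yield in this example but removes the one wrong inference.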

The paper further discusses specific attacks performed against:

◦Hunch

◦LibraryThing

◦Last.fm

◦Amazon

It measures the accuracy and yield of these attacks, arriving in some instances at impressive tradeoff figures (such as 70% accuracy for 100% yield on Hunch).

Not discussed in the paper.

Achieved in other papers for static recommendation databases.

Remains an open problem for dynamic systems (which all real-world examples are).

Limited-length related-items lists – The first elements of such lists have low sensitivity to single purchases.

Factoring item popularity into update frequency – Less popular items are more sensitive to single purchases. Batching their purchases together will decrease the information leak.

Limit data access rate – Prevents large-scale privacy attacks, though it lowers utility and may be circumvented using a botnet.

User opt-out – A privacy-conscious user may decide to opt out of recommender systems entirely (at a clear cost of utility).
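As one illustration of the batching idea, the sketch below publishes an item's purchase count only once a full batch has accumulated, so a single purchase of an unpopular item does not immediately change what outsiders see. The class name `BatchedCounter` and the `batch_size` parameter are hypothetical; this is a simplified sketch, not a mechanism taken from the paper.

```python
class BatchedCounter:
    """Publish an item's purchase count only in batches of `batch_size`."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.published = 0  # value visible to other users
        self.pending = 0    # purchases not yet reflected publicly

    def record_purchase(self):
        """Record one purchase and return the publicly visible count."""
        self.pending += 1
        if self.pending >= self.batch_size:
            self.published += self.pending
            self.pending = 0
        return self.published

counter = BatchedCounter(batch_size=3)
print([counter.record_purchase() for _ in range(4)])  # [0, 0, 3, 3]
```

A larger batch size for less popular items hides individual purchases for longer, at the cost of staler recommendations.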

A passive attack on recommender systems using auxiliary information on a certain user’s purchases, allowing the attacker to infer undisclosed private transactions.

Increased user awareness is required

Several methods were suggested to decrease the information leaked by these systems

Questions?