
Threshold Setting and Performance Monitoring for Novel Text Mining

Wenyin Tang and Flora S. Tsai

School of Electrical and Electronic Engineering, Nanyang Technological University

E-mail: wenyintang@ntu.edu.sg, fst1@columbia.edu

May 2, 2009

Outline

•Introduction

–Novel Text Mining (NTM) System

–Performance Evaluation of NTM

•Adaptive Threshold Setting for NTM

–Motivations

–Our Method: Gaussian-based Adaptive Threshold Setting (GATS)

–Experimental Result

•Conclusion

Overview of Novel Text Mining System

Categorise each incoming document or sentence into its relevant topic bin.

Detect novel yet relevant documents or sentences in each topic.

Prepare a clean data matrix which can be easily processed by a computer.

Interact with users: input documents, output novel info, preference setting and feedback.

Vector space

Given a set of relevant documents in a specific topic, e.g. “football games”, NTM retrieves the novel documents by:

–Step 1: rank documents in the topic “football games” in chronological order.

–Step 2: assign a novelty score to each document by comparing the document with its history documents.

–Step 3: predict the document as “novel” if its novelty score is greater than the predefined novelty threshold.

Novel Text Mining Algorithm

D1: I am “novel” because I am the first document.

D2: I am “novel” because I am dissimilar to D1.

D3: I am “novel” because I am dissimilar to my nearest neighbor D2.

D1, D2, D3, D4 …

D4: Unfortunately, I am “non-novel” because I am very similar to my nearest neighbor D3.
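The three steps and the nearest-neighbor comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the bag-of-words vectors, the cosine similarity, the definition of the novelty score as one minus the similarity to the nearest history document, and the names `cosine` and `detect_novel` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_novel(docs, threshold):
    """docs: iterable of (timestamp, text) pairs for one topic.

    Returns (text, novelty_score, is_novel) tuples in chronological order.
    Assumption: novelty score = 1 - similarity to the nearest history document.
    """
    history = []   # vectors of all earlier documents in the topic
    results = []
    # Step 1: rank documents chronologically.
    for _, text in sorted(docs, key=lambda d: d[0]):
        vec = Counter(text.lower().split())
        # Step 2: compare with history; the first document has no history,
        # so its nearest-neighbor similarity defaults to 0 (score 1.0).
        nearest_sim = max((cosine(vec, h) for h in history), default=0.0)
        score = 1.0 - nearest_sim
        # Step 3: predict "novel" if the score exceeds the threshold.
        results.append((text, score, score > threshold))
        history.append(vec)
    return results
```

Calling `detect_novel(docs, threshold=0.5)` on a list of timestamped texts marks the first document as novel and flags a later near-duplicate as non-novel.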

NTM Performance Evaluation

•Given a set of documents D1, D2, to D10, relevant to some topic, for example,

D1, D2, D3, D4, D5, D6, D7, D8, D9, D10

[Figure: each document marked novel or non-novel in three rows: System (S), Assessor (A), Matched (M). # Novel: S = 8, A = 5, M = 4.]

•Precision (P) reflects how likely the system-retrieved docs are truly novel. P=M/S=4/8=0.5, i.e. 50% of the system-retrieved docs are truly novel.

•Recall (R) reflects how likely the truly novel docs are to be retrieved by the system. R=M/A=4/5=0.8, i.e. 80% of the truly novel docs are retrieved by the system.

•Fβ score: a function of P and R, weighted by β.
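The precision and recall arithmetic above can be reproduced with set operations. The per-document labels in the sketch below are hypothetical (the original figure is not recoverable); they are chosen only so that the counts match S = 8, A = 5 and M = 4. The function name `precision_recall` is likewise an assumption.

```python
# Hypothetical labels chosen to match the counts on the slide.
system = {"D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"}   # S: 8 docs predicted novel
assessor = {"D1", "D3", "D5", "D7", "D10"}                  # A: 5 docs truly novel


def precision_recall(system_novel: set, assessor_novel: set):
    """Return (P, R) where P = M/S and R = M/A, with M the matched docs."""
    matched = system_novel & assessor_novel
    precision = len(matched) / len(system_novel) if system_novel else 0.0
    recall = len(matched) / len(assessor_novel) if assessor_novel else 0.0
    return precision, recall


p, r = precision_recall(system, assessor)
print(f"P = {p}, R = {r}")
```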

Threshold Setting vs. Users’ Requirements


I want to read the most novel information in a short time. (1)


I do not want to miss any novel information. (2)


I am not sure until I can see the documents.

The NTM system should define the novelty threshold adaptively, based on the users’ requirements.

Different users may have different performance requirements.

1. High-precision NTM systems are desired;

2. High-recall NTM systems are desired.

Why Adaptive Threshold Setting

Motivations:

1. As the NTM system is a real-time system, there is little or no training information in the initial stages of NTM. Therefore, the threshold cannot be predefined with confidence.

2. As the NTM system is an accumulating system, more training information will become available for threshold setting, based on the user’s feedback given over time.

3. Different users may have different definitions of “novelty”:

–One user: a document with 50% novel info

–Another user: a document with 90% novel info

Gaussian-based Adaptive Threshold Setting (GATS)

Basic idea:

•GATS is a score distribution-based threshold setting method. It models the score distributions of both novel and non-novel documents (based on the user feedback);

•This parametric model provides the global information of data, from which we can construct an optimization criterion of desired performance to search the best threshold.

Novelty Score Distributions

Empirical probability distribution and its Gaussian probability distribution approximation for TREC 2004 Novelty Track data, topic N54

[Figure labels: Gaussian probability distribution approximation; Novel; Non-novel]

Optimization Criterion

Satisfy 2 conditions:

1. The criterion is a function of the threshold:

J = f(θ)

2. The criterion is directly related to system performance:

J = Fβ(θ)
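A rough sketch of how such a criterion could be searched is given below. It is not the authors' algorithm, only an illustration under stated assumptions: each class of novelty scores (from user feedback) is fitted with a Gaussian, the expected precision and recall at a threshold θ are derived from the two fitted distributions, and θ is chosen by grid search on [0, 1]. The slides do not reproduce the Fβ formula, so the weighted harmonic mean F = 1 / (β/P + (1−β)/R) with β in (0, 1) is assumed here; the names `f_weighted` and `gats_threshold` are hypothetical.

```python
from statistics import NormalDist


def f_weighted(p: float, r: float, beta: float) -> float:
    """Assumed criterion: weighted harmonic mean of P and R.

    A larger beta puts more weight on precision, a smaller beta on recall.
    """
    if p <= 0.0 or r <= 0.0:
        return 0.0
    return 1.0 / (beta / p + (1.0 - beta) / r)


def gats_threshold(novel_scores, nonnovel_scores, beta, steps=1000):
    """Grid-search the threshold maximizing J(theta) = F_beta(theta).

    Each score list needs at least two distinct values so that the
    fitted Gaussian has a non-zero standard deviation.
    """
    nov = NormalDist.from_samples(novel_scores)
    non = NormalDist.from_samples(nonnovel_scores)
    n_nov, n_non = len(novel_scores), len(nonnovel_scores)

    best_theta, best_f = 0.0, -1.0
    for i in range(steps + 1):
        theta = i / steps
        # Expected number of documents scoring above theta in each class.
        tp = n_nov * (1.0 - nov.cdf(theta))   # novel docs predicted novel
        fp = n_non * (1.0 - non.cdf(theta))   # non-novel docs predicted novel
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / n_nov
        f = f_weighted(p, r, beta)
        if f > best_f:
            best_theta, best_f = theta, f
    return best_theta, best_f
```

Under this assumed form, raising β should push the selected threshold upward (fewer but more reliable retrievals), while lowering it should favor recall. Re-fitting the two Gaussians whenever new feedback arrives would make the threshold adaptive.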

Optimization Criterion

[Figure labels: Novel; Non-novel]

Flow Chart of NTM with GATS

Experimental Data

Sentence-level data: TREC 2004 Novelty Track data

The news providers of the document set are Xinhua English (XIE), New York Times (NYT), and Associated Press Worldstream (APW). The NIST assessors created 50 topics for this data. Each topic consists of around 25 documents. These documents were ordered chronologically and then segmented into sentences. Each sentence was given an identifier, and the sentences were concatenated to form the target sentence set. In this data, the overall percentage of novel sentences is around 41.4%. The statistics of the data are summarized in Table 1.

            #Novel          #Non-novel      Sum
Relevant    3454 (41.4%)    4889 (58.6%)    8343

Table 1 Statistics of TREC 2004 Novelty Track data

Experimental Data

Document-level data: APWSJ

APWSJ consists of news articles from Associated Press (AP) and Wall Street Journal (WSJ), which cover the same period from 1988 to 1990 [Zhang et al., 2002]. There are 50 TREC topics from Q101 to Q150 in this data; 5 topics (Q131, Q142, Q145, Q147, Q150) that lack non-novel documents are excluded from the experiments. The statistics of this data are summarized in Table 2.

Table 2 Statistics of APWSJ data

            #Novel            #Non-novel     Sum
Relevant    10,839 (91.1%)    1057 (8.9%)    11,896

Methods & Parameters

•Baseline:

–Fixed threshold setting: θ from 0.05 to 0.95 in equal steps of 0.05.

•Our method, GATS:

–Complete feedback: β from 0.1 to 0.9 in equal steps of 0.1.

–Partial feedback: β from 0.1 to 0.9 in equal steps of 0.1; percentages of feedback: 10%, 20%, 50% and 80%.

Experimental Result

Sentence-Level NTM on TREC 2004 Data

[Figure axes: Recall, Precision]

Experimental Result

Document-Level NTM on APWSJ Data

[Figure axes: Redundancy-Recall, Redundancy-Precision]

Comparison: GATS vs. Fixed Threshold

•For precision-recall tradeoff

–Fixed threshold θ cannot reflect the tradeoff of the precision and recall directly.

–GATS parameter β reflects the weights of precision and recall directly.

•Under various performance requirements, GATS is able to approximate the best fixed threshold.

Table 3 Comparison of Fβ on TREC 2004 Novelty Track data

Experimental Result

PR curves of GATS (tuned for Fβ) with different percentages of the user’s feedback.

[Figure axes: Recall, Precision]

Sentence-Level NTM on TREC 2004 Data

Experimental Result

R-PR curves of GATS with different percentages of the user’s feedback.

[Figure axes: Redundancy-Recall, Redundancy-Precision]

Document-Level NTM on APWSJ Data

Conclusion

•A Gaussian-based Adaptive Threshold Setting (GATS) algorithm was proposed for the NTM system.

•GATS is a generic method, which can be tuned according to different performance requirements varying from high-precision to high-recall.

•Tests on both document-level and sentence-level datasets showed promising performance of GATS for a real-time NTM system.

Q & A