Evaluating Answer Validation in multi-stream Question Answering
Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo
UNED NLP & IR group
nlp.uned.es
The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008)
Tokyo, 16 December 2008
Content
1. Context and motivation
   - Question Answering at CLEF
   - Answer Validation Exercise at CLEF
2. Evaluating the validation of answers
3. Evaluating the selection of answers
   - Correct selection
   - Correct rejection
4. Analysis and discussion
5. Conclusion
Evolution of the CLEF-QA Track
- 2003: 3 target languages; collection: News 1994; 200 factoid questions; supporting information: document
- 2004: 7 target languages; + temporal restrictions, + definitions; pilot: temporal restriction
- 2005: 8 target languages; + News 1995; pilot: lists
- 2006: 9 target languages; − type of question, + lists; supporting information: snippet; exercises: AVE, Real Time, WiQA
- 2007: 10 target languages; + Wikipedia Nov. 2006; + linked questions; exercises: AVE, QAST
- 2008: 11 target languages; + closed lists; exercises: AVE, QAST
- 2009: EU official languages; collection: JRC-Acquis; question types: factoid, definition, motive, purpose, procedure; supporting information: paragraph; exercises: WSDQA, GikiCLEF, QAST
Evolution of Results, 2003–2006 (Spanish)
Performance evolution:
- Overall: best result below 60%
- Definitions: best result above 80%, and NOT with an IR approach
Pipeline Upper Bounds
Use Answer Validation to break the pipeline:
Question → Question analysis → Passage retrieval → Answer extraction → Answer ranking → Answer
Module accuracies multiply along the pipeline: 1.0 × 0.8 × 0.8 = 0.64, so even good components impose a low upper bound on the whole system.
When there is not enough evidence, Answer Validation loops back for more candidates instead of committing to an answer.
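A minimal sketch of this compounding effect, using the slide's example accuracies (1.0 for question analysis, 0.8 each for passage retrieval and answer extraction):

# Per-module accuracies from the slide's example pipeline.
module_accuracies = [1.0, 0.8, 0.8]

upper_bound = 1.0
for accuracy in module_accuracies:
    upper_bound *= accuracy  # errors compound multiplicatively

print(round(upper_bound, 2))  # 0.64: the best a strict pipeline can do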
Results in CLEF-QA 2006 (Spanish)
- Perfect combination of systems: 81%
- Best single system: 52.5%
- Different systems were the best for ORGANIZATION, PERSON, and TIME questions
Collaborative architectures
Different systems answer different types of questions better: specialisation invites collaboration.
A question is sent to several streams (QA sys 1, QA sys 2, …, QA sys n); their candidate answers feed an Answer Validation & Selection module, which returns the final answer.
This Answer Validation & Selection module is the target of the proposed evaluation framework.
Collaborative architectures
How do we select the right answer? Common shallow strategies (sketched below):
- Redundancy
- Voting
- Confidence score
- Performance history
Why not a deeper analysis?
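A minimal sketch of two of these shallow strategies, voting and confidence score, over a hypothetical candidate list (the tuples and scores below are invented for illustration):

from collections import defaultdict

# Hypothetical candidates: (answer, confidence, source system).
candidates = [
    ("Rome", 0.9, "sys1"),
    ("Rome", 0.6, "sys2"),
    ("Paris", 0.8, "sys3"),
]

def select_by_voting(candidates):
    # Redundancy/voting: the answer returned by most streams wins.
    votes = defaultdict(int)
    for answer, _confidence, _system in candidates:
        votes[answer] += 1
    return max(votes, key=votes.get)

def select_by_confidence(candidates):
    # Confidence score: the single highest-scoring answer wins.
    return max(candidates, key=lambda c: c[1])[0]

print(select_by_voting(candidates))      # "Rome" (two streams agree)
print(select_by_confidence(candidates))  # "Rome" (0.9)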
Answer Validation Exercise (AVE)
Objective: validate the correctness of answers given by real QA systems, namely the participants at CLEF-QA.
Answer Validation Exercise (AVE)
From the Question Answering systems, AVE receives the question, a candidate answer, and a supporting text.
- AVE 2006: automatic hypothesis generation combines question and answer into a hypothesis, and a Textual Entailment system checks it against the supporting text.
- AVE 2007–2008: Answer Validation over question, answer, and supporting text directly.
The decision is either "answer is correct" or "answer is not correct, or there is not enough evidence".
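A minimal sketch of automatic hypothesis generation for one question pattern, using the Zanussi example from the collection shown later; real AVE systems used richer, per-question-type patterns, so this is only an illustration:

import re

def generate_hypothesis(question, answer):
    # Turn a "What is X?" question plus a candidate answer into a
    # declarative hypothesis that a textual entailment system can
    # check against the supporting text.
    match = re.match(r"what is (.+)\?", question.strip(), re.IGNORECASE)
    if match:
        answer = re.sub(r"^(is|was)\s+", "", answer.strip())
        return f"{match.group(1)} is {answer}."
    # Fallback for unhandled patterns: plain concatenation.
    return f"{question.strip()} {answer.strip()}"

print(generate_hypothesis("What is Zanussi?",
                          "was an Italian producer of home appliances"))
# -> "Zanussi is an Italian producer of home appliances."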
Techniques in AVE 2007
Number of systems using each technique (from the AVE 2007 overview):
- Generates hypotheses: 6
- WordNet: 3
- Chunking: 3
- n-grams, longest common subsequences: 5
- Phrase transformations: 2
- NER: 5
- Numeric expressions: 6
- Temporal expressions: 4
- Coreference resolution: 2
- Dependency analysis: 3
- Syntactic similarity: 4
- Functions (subject, object, etc.): 3
- Syntactic transformations: 1
- Word-sense disambiguation: 2
- Semantic parsing: 4
- Semantic role labeling: 2
- First-order logic representation: 3
- Theorem prover: 3
- Semantic similarity: 2
Evaluation linked to the main QA task
The QA Track supplies the questions; its systems' answers and supporting texts become the input of the Answer Validation Exercise.
AVE systems return validations (YES, NO).
The human judgements from the QA Track (R, W, X, U) are mapped to (YES, NO), so the same assessments yield both QA Track results and AVE Track results.
Human assessments are reused: no extra judging effort is needed.
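A sketch of that mapping, assuming the usual reading of the CLEF labels (R = right, W = wrong, X = inexact, U = unsupported) and that only fully right answers count as YES:

# Gold validation labels derived from CLEF-QA human judgements.
JUDGEMENT_TO_VALIDATION = {
    "R": "YES",  # right
    "W": "NO",   # wrong
    "X": "NO",   # inexact
    "U": "NO",   # unsupported
}

def gold_label(judgement):
    return JUDGEMENT_TO_VALIDATION[judgement]

assert gold_label("R") == "YES"
assert gold_label("U") == "NO"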
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
Evaluation proposed
The participant systems in CLEF-QA act as the QA streams: given a question, their candidate answers are fed to the Answer Validation & Selection step, which produces the final answer.
What is evaluated is this Answer Validation & Selection step.
Collections
<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) Out of Live (1985) What Is This?</t_str>
  </a>
</q>
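A short sketch of reading such a collection with Python's standard library; the filename is hypothetical, and the element and attribute names follow the example above:

import xml.etree.ElementTree as ET

root = ET.parse("ave2008_en_test.xml").getroot()
for q in root.iter("q"):
    question = q.findtext("q_str").strip()
    for a in q.iter("a"):
        answer = a.findtext("a_str").strip()
        doc = a.find("t_str").get("doc")
        print(f'{a.get("id")}: "{question}" -> "{answer}" (doc: {doc})')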
Evaluating the Validation
Validation: decide whether each candidate answer is correct or not (YES | NO).
The collections are not balanced, so the approach is to detect whether there is enough evidence to accept an answer.
Measures: precision, recall, and F over correct answers.
Baseline system: accept all answers.
Evaluating the Validation

                   Correct answer    Incorrect answer
Answer accepted    nCA               nWA
Answer rejected    nWR               nCR
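In terms of these counts, precision, recall and F over correct answers work out to:

\[
\mathrm{precision} = \frac{n_{CA}}{n_{CA} + n_{WA}}, \qquad
\mathrm{recall} = \frac{n_{CA}}{n_{CA} + n_{WR}}, \qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]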
Evaluating the Selection
- Quantify the potential gain of Answer Validation in Question Answering
- Compare AV systems with QA systems
- Develop measures more comparable to QA accuracy
Evaluating the Selection
Given a question with several candidate answers, there are two options:
- Selection: select an answer → try to answer the question
  - Correct selection: the selected answer was correct
  - Incorrect selection: the selected answer was incorrect
- Rejection: reject all candidate answers → leave the question unanswered
  - Correct rejection: all candidate answers were incorrect
  - Incorrect rejection: not all candidate answers were incorrect
Evaluating the Selection
For n questions: n = nCA + nWA + nWS + nWR + nCR

                                           Question with     Question without
                                           correct answer    correct answer
Answered correctly (one answer selected)   nCA               -
Answered incorrectly                       nWA               nWS
Unanswered (all answers rejected)          nWR               nCR

Not comparable to qa_accuracy.
Evaluating the Selection
This measure rewards rejection (the collections are not balanced).
Interpretation for QA: all questions correctly rejected by AV will be answered correctly.
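A formula consistent with this interpretation, using the counts defined above (the measure name qa_rej_accuracy is an assumption taken from the accompanying paper's convention):

\[
\text{qa\_rej\_accuracy} = \frac{n_{CA} + n_{CR}}{n}
\]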
Evaluating the Selection
Interpretation for QA: questions correctly rejected by AV will be answered correctly in qa_accuracy proportion.
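Again using the counts defined above, and assuming the name estimated_qa_performance from the accompanying paper, this interpretation corresponds to:

\[
\text{estimated\_qa\_performance} = \frac{n_{CA} + n_{CR} \cdot \text{qa\_accuracy}}{n},
\qquad \text{qa\_accuracy} = \frac{n_{CA}}{n}
\]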
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
Analysis and discussion (AVE 2007 English)
[Results table: validation and selection measures per system]
- qa_accuracy is correlated with recall (R)
- the "estimated" measure adjusts for this
Multi-stream QA performance (AVE 2007 English)
[Chart: multi-stream QA performance]
Analysis and discussion (AVE 2007 Spanish)
[Results table: validation and selection measures per system]
- Comparing AV and QA systems
Conclusion
- An evaluation framework for Answer Validation & Selection systems:
  - measures that reward not only correct selection but also correct rejection
  - promotes the improvement of QA systems
  - allows comparison between AV and QA systems
- Shows under which conditions multi-stream QA performs better
- Shows the room for improvement available just from multi-stream QA
- Shows the potential gain that AV systems can provide to QA
Thanks!
Acknowledgement: EU project TrebleCLEF (ICT-1-4-1 215231)