Evaluating Answer Validation in multi-stream Question Answering
Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo
UNED NLP & IR group
nlp.uned.es
The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008)
Tokyo, 16 December 2008
Content
1. Context and motivation
   - Question Answering at CLEF
   - Answer Validation Exercise at CLEF
2. Evaluating the validation of answers
3. Evaluating the selection of answers
   - Correct selection
   - Correct rejection
4. Analysis and discussion
5. Conclusion
Evolution of the CLEF-QA Track
- 2003: 3 target languages; collection: News 1994; 200 factoid questions; supporting information: document
- 2004: 7 target languages; + temporal restrictions, + definitions; pilot: temporal restriction
- 2005: 8 target languages; + News 1995; pilot: lists
- 2006: 9 target languages; − type of question, + lists; supporting information: snippet; exercises: AVE, Real Time, WiQA
- 2007: 10 target languages; + Wikipedia Nov. 2006; + linked questions; exercises: AVE, QAST
- 2008: 11 target languages; + closed lists; exercises: AVE, QAST
- 2009: EU official languages; collection: JRC-Acquis; question types: factoid, definition, motive, purpose, procedure; supporting information: paragraph; exercises: WSDQA, GikiCLEF, QAST
Evolution of Results, 2003–2006 (Spanish)
Performance evolution:
- Overall: best result below 60%
- Definitions: best result above 80%, and NOT with an IR approach
Pipeline Upper Bounds
Use Answer Validation to break the pipeline:
Question → Question analysis → Passage retrieval → Answer extraction → Answer ranking → Answer
Module accuracies multiply along the pipeline: 1.0 × 0.8 × 0.8 = 0.64, so even good components impose a low upper bound on the whole system.
When there is not enough evidence, Answer Validation loops back for more candidates instead of committing to an answer.
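A minimal sketch of this compounding effect, using the slide's example accuracies (1.0 for question analysis, 0.8 each for passage retrieval and answer extraction):

# Per-module accuracies from the slide's example pipeline.
module_accuracies = [1.0, 0.8, 0.8]

upper_bound = 1.0
for accuracy in module_accuracies:
    upper_bound *= accuracy  # errors compound multiplicatively

print(round(upper_bound, 2))  # 0.64: the best a strict pipeline can do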
Results in CLEF-QA 2006 (Spanish)
- Perfect combination of systems: 81%
- Best single system: 52.5%
- Different systems were the best for ORGANIZATION, PERSON, and TIME questions
Collaborative architectures
Different systems answer different types of questions better: specialisation invites collaboration.
A question is sent to several streams (QA sys 1, QA sys 2, …, QA sys n); their candidate answers feed an Answer Validation & Selection module, which returns the final answer.
This Answer Validation & Selection module is the target of the proposed evaluation framework.
Collaborative architectures
How do we select the right answer? Common shallow strategies (sketched below):
- Redundancy
- Voting
- Confidence score
- Performance history
Why not a deeper analysis?
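A minimal sketch of two of these shallow strategies, voting and confidence score, over a hypothetical candidate list (the tuples and scores below are invented for illustration):

from collections import defaultdict

# Hypothetical candidates: (answer, confidence, source system).
candidates = [
    ("Rome", 0.9, "sys1"),
    ("Rome", 0.6, "sys2"),
    ("Paris", 0.8, "sys3"),
]

def select_by_voting(candidates):
    # Redundancy/voting: the answer returned by most streams wins.
    votes = defaultdict(int)
    for answer, _confidence, _system in candidates:
        votes[answer] += 1
    return max(votes, key=votes.get)

def select_by_confidence(candidates):
    # Confidence score: the single highest-scoring answer wins.
    return max(candidates, key=lambda c: c[1])[0]

print(select_by_voting(candidates))      # "Rome" (two streams agree)
print(select_by_confidence(candidates))  # "Rome" (0.9)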
Answer Validation Exercise (AVE)
Objective: validate the correctness of answers given by real QA systems, namely the participants at CLEF-QA.
Answer Validation Exercise (AVE)
From the Question Answering systems, AVE receives the question, a candidate answer, and a supporting text.
- AVE 2006: automatic hypothesis generation combines question and answer into a hypothesis, and a Textual Entailment system checks it against the supporting text.
- AVE 2007–2008: Answer Validation over question, answer, and supporting text directly.
The decision is either "answer is correct" or "answer is not correct, or there is not enough evidence".
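A minimal sketch of automatic hypothesis generation for one question pattern, using the Zanussi example from the collection shown later; real AVE systems used richer, per-question-type patterns, so this is only an illustration:

import re

def generate_hypothesis(question, answer):
    # Turn a "What is X?" question plus a candidate answer into a
    # declarative hypothesis that a textual entailment system can
    # check against the supporting text.
    match = re.match(r"what is (.+)\?", question.strip(), re.IGNORECASE)
    if match:
        answer = re.sub(r"^(is|was)\s+", "", answer.strip())
        return f"{match.group(1)} is {answer}."
    # Fallback for unhandled patterns: plain concatenation.
    return f"{question.strip()} {answer.strip()}"

print(generate_hypothesis("What is Zanussi?",
                          "was an Italian producer of home appliances"))
# -> "Zanussi is an Italian producer of home appliances."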
Techniques in AVE 2007
Number of systems using each technique (from the AVE 2007 overview):
- Generates hypotheses: 6
- WordNet: 3
- Chunking: 3
- n-grams, longest common subsequences: 5
- Phrase transformations: 2
- NER: 5
- Numeric expressions: 6
- Temporal expressions: 4
- Coreference resolution: 2
- Dependency analysis: 3
- Syntactic similarity: 4
- Functions (subject, object, etc.): 3
- Syntactic transformations: 1
- Word-sense disambiguation: 2
- Semantic parsing: 4
- Semantic role labeling: 2
- First-order logic representation: 3
- Theorem prover: 3
- Semantic similarity: 2
Evaluation linked to the main QA task
The QA Track supplies the questions; its systems' answers and supporting texts become the input of the Answer Validation Exercise.
AVE systems return validations (YES, NO).
The human judgements from the QA Track (R, W, X, U) are mapped to (YES, NO), so the same assessments yield both QA Track results and AVE Track results.
Human assessments are reused: no extra judging effort is needed.
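A sketch of that mapping, assuming the usual reading of the CLEF labels (R = right, W = wrong, X = inexact, U = unsupported) and that only fully right answers count as YES:

# Gold validation labels derived from CLEF-QA human judgements.
JUDGEMENT_TO_VALIDATION = {
    "R": "YES",  # right
    "W": "NO",   # wrong
    "X": "NO",   # inexact
    "U": "NO",   # unsupported
}

def gold_label(judgement):
    return JUDGEMENT_TO_VALIDATION[judgement]

assert gold_label("R") == "YES"
assert gold_label("U") == "NO"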
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
Evaluation proposed
The participant systems in CLEF-QA act as the QA streams: given a question, their candidate answers are fed to the Answer Validation & Selection step, which produces the final answer.
What is evaluated is this Answer Validation & Selection step.
Collections
<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) Out of Live (1985) What Is This?</t_str>
  </a>
</q>
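A short sketch of reading such a collection with Python's standard library; the filename is hypothetical, and the element and attribute names follow the example above:

import xml.etree.ElementTree as ET

root = ET.parse("ave2008_en_test.xml").getroot()
for q in root.iter("q"):
    question = q.findtext("q_str").strip()
    for a in q.iter("a"):
        answer = a.findtext("a_str").strip()
        doc = a.find("t_str").get("doc")
        print(f'{a.get("id")}: "{question}" -> "{answer}" (doc: {doc})')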
Evaluating the Validation
Validation: decide whether each candidate answer is correct or not (YES | NO).
The collections are not balanced, so the approach is to detect whether there is enough evidence to accept an answer.
Measures: precision, recall, and F over correct answers.
Baseline system: accept all answers.
Evaluating the Validation

                   Correct answer    Incorrect answer
Answer accepted    nCA               nWA
Answer rejected    nWR               nCR
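In terms of these counts, precision, recall and F over correct answers work out to:

\[
\mathrm{precision} = \frac{n_{CA}}{n_{CA} + n_{WA}}, \qquad
\mathrm{recall} = \frac{n_{CA}}{n_{CA} + n_{WR}}, \qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]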
Evaluating the Selection
- Quantify the potential gain of Answer Validation in Question Answering
- Compare AV systems with QA systems
- Develop measures more comparable to QA accuracy
Evaluating the Selection
Given a question with several candidate answers, there are two options:
- Selection: select an answer → try to answer the question
  - Correct selection: the selected answer was correct
  - Incorrect selection: the selected answer was incorrect
- Rejection: reject all candidate answers → leave the question unanswered
  - Correct rejection: all candidate answers were incorrect
  - Incorrect rejection: not all candidate answers were incorrect
Evaluating the Selection
For n questions: n = nCA + nWA + nWS + nWR + nCR

                                           Question with     Question without
                                           correct answer    correct answer
Answered correctly (one answer selected)   nCA               -
Answered incorrectly                       nWA               nWS
Unanswered (all answers rejected)          nWR               nCR

Not comparable to qa_accuracy.
Evaluating the Selection
This measure rewards rejection (the collections are not balanced).
Interpretation for QA: all questions correctly rejected by AV will be answered correctly.
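A formula consistent with this interpretation, using the counts defined above (the measure name qa_rej_accuracy is an assumption taken from the accompanying paper's convention):

\[
\text{qa\_rej\_accuracy} = \frac{n_{CA} + n_{CR}}{n}
\]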
Evaluating the Selection
Interpretation for QA: questions correctly rejected by AV will be answered correctly in qa_accuracy proportion.
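Again using the counts defined above, and assuming the name estimated_qa_performance from the accompanying paper, this interpretation corresponds to:

\[
\text{estimated\_qa\_performance} = \frac{n_{CA} + n_{CR} \cdot \text{qa\_accuracy}}{n},
\qquad \text{qa\_accuracy} = \frac{n_{CA}}{n}
\]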
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
Analysis and discussion (AVE 2007 English)
[Results table: validation and selection measures per system]
- qa_accuracy is correlated with recall (R)
- the "estimated" measure adjusts for this
Multi-stream QA performance (AVE 2007 English)
[Chart: multi-stream QA performance]
Analysis and discussion (AVE 2007 Spanish)
[Results table: validation and selection measures per system]
- Comparing AV and QA systems
Conclusion
- An evaluation framework for Answer Validation & Selection systems:
  - measures that reward not only correct selection but also correct rejection
  - promotes the improvement of QA systems
  - allows comparison between AV and QA systems
- Shows under which conditions multi-stream QA performs better
- Shows the room for improvement available just from multi-stream QA
- Shows the potential gain that AV systems can provide to QA
Thanks!
Acknowledgement: EU project TrebleCLEF (ICT-1-4-1 215231)