Information Retrieval

Introduction to Information Retrieval

Aj. Khuanlux Mitsophonsiri

CS.426 INFORMATION RETRIEVAL

What is Information Retrieval?

Google?

Yahoo?

MSN Search?

What goes on behind a search engine?

How do they work?

Many Questions

What makes a system like Google search tick?

How does it gather information?

What tricks does it use?

Extending beyond the Web

How can those approaches be made better?

Natural language understanding?

User interactions?

Many Questions

What can we do to make things work quickly?

Faster computers?

Caching?

Compression?

How do we decide it works well?

For all queries?

For special types of queries?

On every collection of information?

Motivation

IR: representation, storage, organization of, and access to information items

Focus is on the user information need

User information need:

Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.

Emphasis is on the retrieval of information (not data)

Motivation

Data retrieval

which documents contain a set of keywords?

Well defined semantics

a single erroneous object implies failure!

Information retrieval

information about a subject or topic

semantics is frequently loose

small errors are tolerated

IR system:

interpret contents of information items

generate a ranking which reflects relevance

notion of relevance is most important

Motivation

IR at the center of the stage

IR in the last 20 years:

classification and categorization

systems and languages

user interfaces and visualization

Advent of the Web changed this perception once and for all

universal repository of knowledge

free (low cost) universal access

no central editorial board

many problems though: IR seen as key to finding the solutions!

Information Retrieval

The indexing and retrieval of textual documents.

Concerned firstly with retrieving relevant documents to a query.

Concerned secondly with retrieving from large sets of documents efficiently.

Typical IR Task

Given:

A corpus of textual natural-language documents.

A user query in the form of a textual string.

Find:

A ranked set of documents that are relevant to the query.

Relevance

Relevance is a subjective judgment and may include:

Being on the proper subject.

Being timely (recent information).

Being authoritative (from a trusted source).

Satisfying the goals of the user and his/her intended use of the information (information need).

Relevance

Much of IR depends upon the idea that

Similar vocabulary -> relevant to same queries

Usually look for documents matching query words

“Similar” can be measured in many ways

String matching/comparison

Same vocabulary used

Probability that documents arise from same model

Same meaning of text
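Two of the similarity notions listed above can be sketched in a few lines. This is only an illustration, not a method the slides prescribe: `string_similarity` and `vocabulary_overlap` are hypothetical names, the first relies on Python's standard `difflib.SequenceMatcher`, and the second uses the Jaccard coefficient as one possible way to measure "same vocabulary used".

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (string matching/comparison)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def vocabulary_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two word sets (same vocabulary used)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # assumption: two empty texts count as not similar
    return len(words_a & words_b) / len(words_a | words_b)


# Two of the four distinct words are shared, so the overlap is 0.5.
print(vocabulary_overlap("college tennis teams", "tennis teams usa"))
```

Neither measure captures "same meaning of text"; both only compare surface forms.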

Keyword Search

Simplest notion of relevance is that the query string appears verbatim in the document.

Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
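The two notions of relevance above can be contrasted with a minimal sketch. The function names and the sample document are made up for illustration, and tokenization is assumed to be plain whitespace splitting.

```python
from collections import Counter


def verbatim_match(query: str, document: str) -> bool:
    """Strict notion: the whole query string appears verbatim in the document."""
    return query.lower() in document.lower()


def bag_of_words_score(query: str, document: str) -> int:
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[word] for word in set(query.lower().split()))


doc = "the college team plays tennis and the tennis team wins"

print(verbatim_match("college tennis", doc))      # False: the words are not adjacent
print(bag_of_words_score("college tennis", doc))  # 3: "college" once, "tennis" twice
```

The document fails the verbatim test yet still gets a positive bag-of-words score, which is the practical difference between the two notions.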

Problems with Keywords

May not retrieve relevant documents that include synonymous terms.

“restaurant” vs. “café”

May retrieve irrelevant documents that include ambiguous terms.

“bat” (baseball vs. mammal)

“Apple” (company vs. fruit)

“bit” (unit of data vs. act of eating)

Intelligent IR

Taking into account the meaning of the words used.

Taking into account the order of words in the query.

Adapting to the user based on direct or indirect feedback.

Taking into account the authority of the source.

IR Basic Concepts

The User Task

Retrieval

information or data

purposeful

Browsing

glancing around

F1; cars, Le Mans, France, tourism

[Figure: the user reaches the database through retrieval or browsing.]

IR Basic Concepts

Logical view of documents

Document representation viewed as a continuum: logical view of documents might shift

[Figure: logical view of documents as a continuum from full text to index terms. Labels: docs, structure, accents and spacing, stopwords, noun groups, stemming, manual indexing.]

IR System Architecture

[Figure: IR system architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Labeled flows: user need, user feedback, query, logical view, inverted file, retrieved docs, ranked docs.]

IR System Components

Text Operations: form index words (tokens).

Stopword removal

Stemming

Indexing: constructs an inverted index of word to document pointers.

Searching: retrieves documents that contain a given query token from the inverted index.

Ranking: scores all retrieved documents according to a relevance metric.
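The four components above can be tied together in a small sketch. Everything here is an assumption made for illustration: the stopword list is a tiny sample, `stem` is a crude suffix stripper standing in for a real stemmer, and the relevance metric is simply summed term frequency.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list


def stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text: str) -> list[str]:
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]


def build_index(docs: dict[int, str]) -> dict[str, Counter]:
    """Inverted index: token -> Counter mapping doc id to term frequency."""
    index: dict[str, Counter] = defaultdict(Counter)
    for doc_id, text in docs.items():
        for token in text_operations(text):
            index[token][doc_id] += 1
    return index


def search(index: dict[str, Counter], query: str) -> set[int]:
    """Ids of documents containing at least one query token."""
    hits: set[int] = set()
    for token in text_operations(query):
        hits |= set(index.get(token, ()))
    return hits


def rank(index: dict[str, Counter], query: str) -> list[tuple[int, int]]:
    """Score each retrieved doc by summed term frequency, highest first."""
    tokens = text_operations(query)
    scores = {
        doc_id: sum(index[t][doc_id] for t in tokens if t in index)
        for doc_id in search(index, query)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    1: "college tennis teams in the USA",
    2: "tennis rackets and balls",
    3: "stemming of index terms",
}
index = build_index(docs)
print(rank(index, "tennis teams"))  # [(1, 2), (2, 1)]
```

The crude stemmer turns "tennis" into "tenni", which is harmless here because queries and documents pass through the same text operations before they are compared.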

IR System Components

User Interface: manages interaction with the user:

Query input and document output.

Relevance feedback.

Visualization of results.

Query Operations: transform the query to improve retrieval:

Query expansion using a thesaurus.

Query transformation using relevance feedback.
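Query expansion with a thesaurus might look like the sketch below. The `THESAURUS` dictionary is a hypothetical hand-made example (only the "restaurant"/"café" pair comes from the slides), and `expand_query` is an illustrative name, not part of any library.

```python
# Hypothetical hand-made thesaurus; a real system would load a curated resource.
THESAURUS = {
    "restaurant": {"café", "diner"},
    "car": {"automobile"},
}


def expand_query(query: str, thesaurus: dict[str, set[str]] = THESAURUS) -> list[str]:
    """Return the query terms followed by any synonyms the thesaurus lists."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in sorted(thesaurus.get(term, ())):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(expand_query("cheap restaurant"))  # ['cheap', 'restaurant', 'café', 'diner']
```

Searching with the expanded term list lets a document that only mentions "café" match a query for "restaurant", which addresses the synonym problem of plain keyword search.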

Web Search

Application of IR to HTML documents on the World Wide Web.

Differences:

Must assemble document corpus by spidering the web.

Can exploit the structural layout information in HTML (XML).

Documents change uncontrollably.

Can exploit the link structure of the web.
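As a rough illustration of how a spider could use HTML structure and links, the sketch below pulls the title and the outgoing links from one page using Python's standard `html.parser` and `urllib.parse` modules. The `LinkExtractor` class, the sample page, and the example.com URLs are invented for the example; no page is actually fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the title text and absolute outgoing links of one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


page = (
    "<html><head><title>Tennis</title></head><body>"
    '<a href="/ncaa">NCAA</a> <a href="http://example.org/teams">Teams</a>'
    "</body></html>"
)
parser = LinkExtractor("http://example.com/index.html")
parser.feed(page)
print(parser.title)  # Tennis
print(parser.links)  # ['http://example.com/ncaa', 'http://example.org/teams']
```

A spider would repeat this step for each newly discovered link to assemble the corpus, and the collected links form the graph that link-based ranking works on.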

Web Search System

[Figure: web search system. A spider collects pages from the Web into the document corpus; the IR system takes a query string and returns ranked documents (1. Page1, 2. Page2, 3. Page3).]

History of IR

1960-70’s:

Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents.

Development of the basic Boolean and vector-space models of retrieval.

Prof. Salton and his students at Cornell University are the leading researchers in the area.

History of IR

1980’s:

Large document database systems, many run by companies:

Lexis-Nexis

Dialog

MEDLINE

History of IR

1990’s:

Searching FTP-accessible documents on the Internet

Archie

WAIS

Searching the World Wide Web

Yahoo

AltaVista

History of IR

2000’s and beyond:

Link analysis for Web Search

Google

Multimedia IR

Image

Video

Audio and music

Document Summarization