Breaking down the walls

Moving libraries from collectors to portals

Carl Lagoze
Cornell University
lagoze@cs.cornell.edu

The Library should selectively adopt the portal model for targeted program areas. By creating links from the Library’s Web site, this approach would make available the ever-increasing body of research materials distributed across the Internet. The Library would be responsible for carefully selecting and arranging for access to licensed commercial resources for its users, but it would not house local copies of materials or assume responsibility for long-term preservation.

LC21: Digital Strategy for the Library of Congress, page 5

Some of the most fundamental aspects of library operations entail the existence of a border, across which objects of information are transferred and maintained. Such a parameter, demarcating a single, distributed digital library (the "control zone"), needs to be created and managed by the academic library community at the earliest opportunity.

Ross Atkinson, Library Quarterly, 1996

Towards a Virtual Control Zone

Why distributed collections?

•Scale of the Web

•Prevalence of new publishing models and agents

•Increasing complexity of licensing and access management

•Dynamic nature of content

Towards Hybrid Portals

•Traditional portal (e.g., Yahoo!)

–linkage without responsibility

•Hybrid Portal

–assertion of (some semblance of) a curatorial role over linked objects

New models have cultural/organizational ramifications…

•Performance and ranking metrics – "bigger is better"

•Levels of confidence

•Trust

…that can be assisted by new technical foundations

•Digital object architectures

–that enable aggregating and customizing content for local access and management

•Metadata frameworks

–that model changes of objects and their management over time

•OAI Harvesting Protocol

–for exchange of structured information

•Preservation models

–that enable non-cooperative and cooperative offsite monitoring

Digital Object Architectures: aggregating & localizing distributed content

Acknowledgements:

– Naomi Dushay

– Sandy Payette

– Thornton Staples (U. Va.)

– Ross Wayland (U. Va.)

From Mediators to Value-Added Surrogates

•Wiederhold – mediators between raw data and end-user applications for integration and transformation

•Paepcke – mediators as foundation for digital library interoperability

•Payette and Lagoze – mediators (V-A surrogates) to aggregate and create a localized service layer for distributed resources

http://www.dlib.org/dlib/june00/payette/payetteimages/image002.gif

FEDORA Digital Object Model

Establishing a Virtual Control Zone

V-A Surrogate Applications

•Access management

–Shared responsibility among trusted partners

•Enhanced and customized functionality

–Examples: reference linking, format translation, special needs

•Preservation

–Monitoring "significant" events and acting on them

Context

Broker

DigitalObject A

Structural Characteristics

Realaudio video

Powerpoint presentation

SMIL synchronization metadata

Tool

DigitalObject A:

• View Slides

• View Video

• View synchronized presentation using applet

Tool

Context

Broker

DigitalObject A:

• Get Transcript of Audio

• Search for keyword

• Get Slides translated to French

Digital Object

Structural

Metadata

Context

Broker

Tool 8

Tool 93

Context

Broker

User requests DigitalObject from ContextBroker

DigitalObject presented to user with contextually bound behaviors

Tool 5

Tool 27

Find/Retrieve tools

Get DigitalObject

Structural Metadata
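
The diagram labels above describe a ContextBroker that finds tools suited to a DigitalObject's structural metadata and presents the object with contextually bound behaviors. A minimal Python sketch of that idea follows. The class names, fields, component labels, and the subset-matching rule are all illustrative assumptions and are not taken from the FEDORA interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical stand-in for a digital object and its structural metadata."""
    name: str
    component_types: set  # e.g. {"video", "slides", "smil"}


@dataclass
class Tool:
    """Hypothetical tool: offers behaviors when the object has the required parts."""
    name: str
    requires: set
    behaviors: list


@dataclass
class ContextBroker:
    """Hypothetical broker that binds tool behaviors to an object on request."""
    tools: list = field(default_factory=list)

    def present(self, obj):
        """Return the behaviors of every tool whose requirements the object meets."""
        bound = []
        for tool in self.tools:
            if tool.requires <= obj.component_types:  # subset test
                bound.extend(tool.behaviors)
        return bound


lecture = DigitalObject("DigitalObject A", {"video", "slides", "smil"})
broker = ContextBroker([
    Tool("viewer", {"slides"}, ["View Slides"]),
    Tool("player", {"video"}, ["View Video"]),
    Tool("sync", {"video", "slides", "smil"}, ["View synchronized presentation"]),
    Tool("ocr", {"page_images"}, ["Get page text"]),  # not applicable to this object
])
behaviors = broker.present(lecture)
print(behaviors)
```

A second broker configured with different tools (transcription, keyword search, translation) would present the same object with a different set of behaviors, which is the point of binding behaviors by context rather than storing them in the object.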

Where we are now…

•Ongoing FEDORA reference prototype

–http://www.cs.cornell.edu/cdlrg/FEDORA.html

–Policy enforcement research

–Content mediation

•Proposed joint deployment with University of Virginia

–Open source scalable implementation of FEDORA architecture

–Testing and deployment with a number of research library partners.

Event-Aware Metadata Frameworks: describing changes over time

•Acknowledgements:

– Dan Brickley (ILRT, Bristol)

– Martin Doerr (FORTH, Crete)

– Jane Hunter (DSTC, Brisbane)

Distributed Content: The Metadata Challenge

•From fixed, contained physical artifacts to fluid, distributed digital objects

•Need for basis of trust and authenticity in network environment

•Decentralization and specialization of resource description and need for mapping formalisms

Multi-entity nature of object description

http://www.pipeline.com/~rabaron/images/Curlers.jpg

Photographer

Camera type

Software

Computer artist

Attribute/Value approaches to metadata…

Hamlet has a creator Shakespeare

subject

implied verb

metadata noun

literal

Playwright

metadata adjective

The playwright of Hamlet was Shakespeare

“Shakespeare”

“Hamlet”

dc:creator.playwright

dc:title

…run into problems for richer descriptions…

Hamlet has a creator Stratford

birthplace

The playwright of Hamlet was Shakespeare, who was born in Stratford

“Stratford”

“Shakespeare”

dc:creator.playwright

dc:creator.birthplace

…because of their failure to model entity distinctions

“Stratford”

creator

name

“Shakespeare”

birthplace

title

“Hamlet”
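
The contrast drawn on the preceding slides can be sketched in a few lines of Python. The flat record reuses the slide's `dc:creator.playwright` and `dc:creator.birthplace` labels; the node identifiers (`work1`, `person1`) and the helper function are hypothetical and only illustrate the idea of giving the creator its own entity.

```python
# Flat attribute/value record: every value hangs directly off the work.
flat = {
    "dc:title": "Hamlet",
    "dc:creator.playwright": "Shakespeare",
    "dc:creator.birthplace": "Stratford",
}

# Entity-aware description: the creator is its own node with its own properties.
# The node identifiers ("work1", "person1") are hypothetical.
graph = {
    "work1": {"title": "Hamlet", "creator": "person1"},
    "person1": {"name": "Shakespeare", "birthplace": "Stratford"},
}


def birthplace_of_creator(graph, work_id):
    """Follow work -> creator -> birthplace through the graph."""
    creator_id = graph[work_id]["creator"]
    return graph[creator_id]["birthplace"]


# In the flat record the birthplace reads as a property of Hamlet itself.
print([key for key in flat if key.endswith("birthplace")])
# In the graph it is reached through the creator entity.
print(birthplace_of_creator(graph, "work1"))
```

With a second creator, the flat record has no way to say whose birthplace is meant, while the graph simply gains another person node.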

ABC/Harmony Event-aware metadata model

•Recognizing inherent lifecycle aspects of description (esp. of digital content)

•Modeling incorporates time (events and situations) as first-class objects

–Supplies clear attachment points for agents, roles, occurrent properties

•Resource description as a “story-telling” activity

Resource-centric Metadata

Title

Anna Karenina

Author

Leo Tolstoy

Illustrator

Orest Vereisky

Translator

Margaret Wettlin

Date Created

1877

Date Translated

1978

Description

Adultery & Depression

Birthplace

Moscow

Birthdate

1828

“translator”

“Margaret Wettlin”

“Orest Vereisky”

“illustrator”

“Anna Karenina”

“Tragic adultery and the search for meaningful love”

“English”

“author”

“creation”

“1877”

“1978”

“translation”

“Russian”

“Leo Tolstoy”

"Moscow"

“1828”
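
As a rough illustration of treating events as first-class objects, the sketch below records the creation and translation from the Anna Karenina example as separate events, each with its own agent, role, and date. The `Event` class and its field names are assumptions made for this sketch and are not the ABC/Harmony vocabulary. Pairing "Russian" with the creation and "English" with the translation is an inference from the graph labels.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event record: who did what, in which role, and when."""
    kind: str
    year: int
    agent: str
    role: str
    output_language: str


events = [
    Event("translation", 1978, "Margaret Wettlin", "translator", "English"),
    Event("creation", 1877, "Leo Tolstoy", "author", "Russian"),
]


def agents_by_role(events):
    """Map each role to the agent who played it."""
    return {event.role: event.agent for event in events}


def story(events):
    """Tell the resource's history as a chronological list of sentences."""
    ordered = sorted(events, key=lambda event: event.year)
    return [
        f"{e.year}: {e.kind} by {e.agent} ({e.role}), in {e.output_language}"
        for e in ordered
    ]


for line in story(events):
    print(line)
```

Because dates and agents attach to events rather than to the resource, "Date Created" and "Date Translated" no longer need to be separate resource-level fields.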

Queries over descriptive graphs

List details of events where Lagoze is a participating agent

SELECT ?title, ?type, ?time, ?place, ?name

FROM

http://ilrt.org/discovery/harmony/oai.rdf

WHERE

(web::type ?event abc::Event)

(abc::context ?event ?context)

…..

AND ?name ~ lagoze

USING web FOR http://www.w3.org/1999/02/22-rdf-syntax-ns#

Rudolf Squish – http://swordfish.rdfweb.org/rdfquery
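
To show the kind of matching such a query performs, here is a small Python sketch of triple-pattern matching over an in-memory graph. It is not the Squish engine. The triples, the `abc:agentName` and `abc:place` properties, and the (subject, predicate, object) ordering are assumptions made for illustration; only `abc:Event` and `abc:context` appear in the query above.

```python
def match(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple, or return None."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must bind consistently across all patterns.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for bindings in results:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_results.append(extended)
        results = next_results
    return results


# Hypothetical graph in (subject, predicate, object) order.
triples = [
    ("event1", "rdf:type", "abc:Event"),
    ("event1", "abc:context", "context1"),
    ("context1", "abc:place", "Santa Fe"),
    ("event1", "abc:agentName", "lagoze"),
    ("event2", "rdf:type", "abc:Event"),
    ("event2", "abc:agentName", "another agent"),
]

patterns = [
    ("?event", "rdf:type", "abc:Event"),
    ("?event", "abc:agentName", "?name"),
]
# The final filter plays the part of the "AND ?name ~ lagoze" clause.
hits = [b for b in query(patterns, triples) if "lagoze" in b["?name"]]
print(hits)
```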

Where we are now

•Stabilization of model

•Collaboration with museum/CIDOC community for joint modeling principles

•Plans

–RDF api for model elements

–UI for metadata creation

–Query engine testing

Open Archives Initiative: facilitating exchange of structured information

•Acknowledgements:

– Herbert Van de Sompel

– OAI Steering and Technical Committees

Open Archives Initiative

•Testing the hypotheses

–exposing metadata in various forms will facilitate creation of value-added services

–key to deployable DL infrastructure is low entry cost

–individual communities can/will customize common infrastructure

Where we’ve come from

•Late 1999 Santa Fe UPS meeting – increase impact of eprint initiatives through federation

•Santa Fe Convention – metadata harvesting among eprint archives

•Increasing interest outside the eprint community

–Research libraries

–Museums

–Publishers

Progress over the past year

•OAI workshops at US and EC DL conferences

•Organizational stability

–Executive committee and steering committee

•September 2000 technical meeting

–Reframe and rethink technical solutions for broader domain

•Extensive testing and refinement of technical infrastructure

Technical Infrastructure – key technical features

•Deploy-now technology – 80/20 rule

•Two-party model – providers and consumers

•Simple HTTP encoding

•XML schema for some degree of protocol conformance

•Extensibility

–Multiple item-level metadata

–Collection level metadata

OAI protocol requests

Supporting protocol requests:

• Identify

• ListMetadataFormats

• ListSets

Harvesting protocol requests:

• ListRecords

• ListIdentifiers

• GetRecord

repository

harvester

service provider

data provider
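
Since the protocol uses a simple HTTP encoding, a harvester's requests can be sketched as plain URLs. The Python example below builds such URLs without contacting any server. The base URL is hypothetical, and the `verb`, `metadataPrefix`, and `identifier` argument names follow the published protocol as best understood here, so check them against the specification before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical repository address, used only for illustration.
BASE_URL = "http://repository.example.org/oai"

SUPPORTING_REQUESTS = ("Identify", "ListMetadataFormats", "ListSets")
HARVESTING_REQUESTS = ("ListRecords", "ListIdentifiers", "GetRecord")


def oai_request(verb, **arguments):
    """Build the URL for one protocol request (no network access)."""
    if verb not in SUPPORTING_REQUESTS + HARVESTING_REQUESTS:
        raise ValueError(f"unknown request: {verb}")
    query = {"verb": verb, **arguments}
    return BASE_URL + "?" + urlencode(query)


# A harvester would typically learn about the repository first...
print(oai_request("Identify"))
# ...and then ask for records in a metadata format the repository supports.
print(oai_request("ListRecords", metadataPrefix="oai_dc"))
```

The service provider's harvester issues these requests and the data provider's repository answers with XML, which is what keeps the cost of entry low on the repository side.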

Where we are now

•“Stable” 1.0 protocol specification

•Hopefully, self-documenting infrastructure

–http://www.openarchives.org

•27 registered data providers

•Increasing number of tools available

•Research initiatives

–NSF-funded NSDL

–EC-funded Cyclades

–Andrew W. Mellon service proposals

–EC-funded community building

Where do we go from here

•Controlling the stampede

•Maintaining the organizational model – lean and mean while encouraging community-specific exploitation

•Encouraging testing, especially through deployment and especially service development

•Encouraging metadata diversification – this isn’t just about Dublin Core!

–Preservation

–Document access

–Authentication

OAI & Metadata Research

•Dictionary of metadata terms (Tom Baker)

•Mandating usage rules has only limited effectiveness

•Compiling usage of those terms is vital to machine understanding and interoperability

–Provide context heuristics for search engine and indexer processing

•Large-scale deployment of OAI and web crawling enables (partial) automation of usage compilation (e.g., data mining of term usage)
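
One very simple form of usage compilation is to tally, for each metadata term, the shapes of the values that harvested records actually contain. The Python sketch below does this over a few made-up records. The records, the term names, and the "value shape" heuristic are assumptions for illustration and do not describe an existing tool.

```python
import re
from collections import Counter, defaultdict

# Hypothetical harvested records; real input would come from many repositories.
records = [
    {"creator": "Tolstoy, Leo", "date": "1877"},
    {"creator": "Leo Tolstoy", "date": "1978-05"},
    {"creator": "Shakespeare", "date": "c. 1600"},
]


def value_shape(value):
    """Reduce a value to a rough pattern: digits become 9, letters become a."""
    shape = re.sub(r"[0-9]", "9", value)
    shape = re.sub(r"[A-Za-z]", "a", shape)
    return re.sub(r"(.)\1+", r"\1", shape)  # collapse repeated characters


def compile_usage(records):
    """Count how often each value shape occurs under each term."""
    usage = defaultdict(Counter)
    for record in records:
        for term, value in record.items():
            usage[term][value_shape(value)] += 1
    return usage


usage = compile_usage(records)
for term, shapes in usage.items():
    print(term, dict(shapes))
```

Tallies like these could feed the context heuristics mentioned above, for example by telling an indexer that a term's values are usually years or usually inverted personal names.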

Preservation Models: monitoring threats to distributed content

•Acknowledgements:

– Bill Arms

– Peter Botticelli (CUL)

– Anne Kenney (CUL)

Preservation & Remote Control

•Organizational Issues

–“assured preservation” may not be possible without direct custodial control.

–what are the levels of acceptability and for which types of resources?

•Technical Issues

–what are the technologies for remote control at the various levels of assurance deemed acceptable by the library?

–what is the probability of a reasonable level of preservation in the context of such technologies?
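
One low-assurance technique for remote monitoring is for the library to fingerprint remote content periodically and record an event whenever it changes or disappears. The Python sketch below simulates this with an in-memory "site" instead of real HTTP fetches. The function names, event labels, and URL are hypothetical, and this is only an illustration of non-cooperative monitoring, not a description of the experiments mentioned in this talk.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest identifying this exact content."""
    return hashlib.sha256(content).hexdigest()


def check(url, fetch, known):
    """Compare current content with the stored fingerprint and name the event."""
    try:
        content = fetch(url)
    except LookupError:
        return "missing"
    digest = fingerprint(content)
    previous = known.get(url)
    known[url] = digest  # remember what was seen for the next visit
    if previous is None:
        return "first-seen"
    return "unchanged" if previous == digest else "changed"


# Simulated remote site: a dict lookup stands in for an HTTP fetch.
url = "http://example.org/report"
site = {url: b"version 1"}
known = {}
events = [check(url, site.__getitem__, known)]
events.append(check(url, site.__getitem__, known))
site[url] = b"version 2"
events.append(check(url, site.__getitem__, known))
del site[url]
events.append(check(url, site.__getitem__, known))
print(events)
```

Events produced this way could be recorded with event-based metadata, and cooperative sites could expose the same information through metadata harvesting instead of being polled.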

Cost vs. Functionality

Leveraging Current Work

•Event-based metadata

•Metadata harvesting

•Longevity and threats to digital resources

Level 0 Experiment

Level 1 Experiment

One of Six Core Integration Demonstration Projects for the NSDL

How Big might the NSDL be?

The NSDL aims to be comprehensive: all branches of science, all levels of education, very broadly defined.

Five year targets:

1,000,000 different users

10,000,000 digital objects

100,000 independent sites

Requires: low-cost, scalable technology; automated collection building and maintenance

Levels of Interoperability: Metadata Harvesting

Agreements on simple protocol and metadata standard(s)

Example:

Metadata harvesting protocol of the Open Archives Initiative (MHP)

• Moderate-quality services

• Low cost of entry to participating sites

Moderately large numbers of loosely collaborating sites

Promising but still an emerging approach

Levels of Interoperability: Gathering

Robots gather collections automatically with no participation from individual sites

Examples:

Web search services (e.g., Google)

CiteSeer (a.k.a. ResearchIndex)

• Restricted but useful services

• Zero cost of entry to gathered sites

Very large numbers of independent sites

Only suitable for open access collections
