Clustering the Reliable File Transfer Service

June 6, 2007

TeraGrid '07

Jim Basney and Patrick Duda
NCSA, University of Illinois

This material is based upon work supported by the National Science Foundation under Grant No. 0426972.

Goal

•Provide a highly available Reliable File Transfer (RFT) Service

–Tolerate server failures

•Hardware/software faults and resource exhaustion

–Continue to handle incoming requests

–Continue to make forward progress on file transfers in the queue

Globus Toolkit Reliable File Transfer Service

[Diagram labels: RFT, Client, GridFTP]

RFT and GridFTP Clustering

[Diagram labels: GridFTP control, RFT, GridFTP data, RFT]

Clustering Approach

[Diagram labels: RFT, Load Balancer, HA DBMS]

RFT State Management

[Diagram labels: Web Service Container, RFT, Delegation Service, Client, DBMS]

RFT DB Tables

•Request
–Termination Time
–Started Flag
–Max Attempts
–Delegated EPR
–Container ID
–Start Time

•Transfer
–Request ID
–Source URL
–Destination URL
–Status
–Attempts
–Retry Time

•Restart
–Transfer ID
–Restart Marker
–Last Update Time

Added Fields
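
As a rough illustration of how these three tables relate, the sketch below builds them in an in-memory SQLite database. It assumes the fields group as Request (Termination Time through Start Time), Transfer (Request ID through Retry Time), and Restart (Transfer ID through Last Update Time). The snake_case column names, the column types, the primary keys, and the `transfers_for_container` helper are all assumptions made for illustration and are not taken from the RFT schema.

```python
import sqlite3

# Hypothetical schema: field names follow the slide, everything else is assumed.
SCHEMA = """
CREATE TABLE request (
    request_id       INTEGER PRIMARY KEY,   -- assumed key, not listed on the slide
    termination_time TEXT,
    started_flag     INTEGER NOT NULL DEFAULT 0,
    max_attempts     INTEGER,
    delegated_epr    TEXT,
    container_id     TEXT,                  -- which RFT instance owns the request
    start_time       TEXT
);
CREATE TABLE transfer (
    transfer_id     INTEGER PRIMARY KEY,    -- assumed key, not listed on the slide
    request_id      INTEGER NOT NULL REFERENCES request(request_id),
    source_url      TEXT NOT NULL,
    destination_url TEXT NOT NULL,
    status          TEXT NOT NULL,
    attempts        INTEGER NOT NULL DEFAULT 0,
    retry_time      TEXT
);
CREATE TABLE restart (
    transfer_id      INTEGER PRIMARY KEY REFERENCES transfer(transfer_id),
    restart_marker   TEXT,
    last_update_time TEXT
);
"""


def create_schema(conn):
    """Create the three illustrative tables on an open connection."""
    conn.executescript(SCHEMA)


def transfers_for_container(conn, container_id):
    """List transfers owned by one container, with any saved restart marker."""
    rows = conn.execute(
        "SELECT t.transfer_id, t.source_url, t.destination_url, r.restart_marker"
        " FROM transfer t"
        " JOIN request q ON q.request_id = t.request_id"
        " LEFT JOIN restart r ON r.transfer_id = t.transfer_id"
        " WHERE q.container_id = ?"
        " ORDER BY t.transfer_id",
        (container_id,),
    )
    return rows.fetchall()
```

With the container ID stored on the request row, any instance that shares the database can find the transfers it owns and the restart markers needed to resume them.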

New Tables

•Delegation Service
–Resource ID
–Caller DN
–Local Name
–Termination Time
–Listener
–Certificate
–Container ID

•Persistent Subscription
–Consumer
–Producer
–Policy
–Precondition
–Selector
–Topic
–Security Descriptor
–…

RFT Fail-Over

•Based on time-outs

•Periodically query database for pending requests with no recent activity

–Stalled requests could be caused by RFT service crash, hardware failure, RFT service overload, etc.

–If found, obtain DB write lock, query again, claim stalled requests, and release lock

•Configuration values:

–Query interval (default: 30 seconds)

–Recent interval (default: 60 seconds)
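
The time-out procedure above can be sketched as follows. This is a minimal illustration, not the RFT implementation: it assumes a single hypothetical `request` table with `status`, `container_id`, and `last_update` columns, and it uses SQLite's `BEGIN IMMEDIATE` as a stand-in for the database write lock (the evaluation used MySQL). The function names and the `'Pending'` status value are also assumptions; only the two default intervals come from the slide.

```python
import sqlite3
import time

QUERY_INTERVAL = 30   # seconds between scans (default from the slide)
RECENT_INTERVAL = 60  # seconds of inactivity before a request counts as stalled

# Hypothetical query: pending requests whose last update is older than the cutoff.
STALLED_SQL = (
    "SELECT id FROM request"
    " WHERE status = 'Pending' AND last_update < ?"
    " ORDER BY id"
)


def open_db(path=":memory:"):
    # isolation_level=None leaves transaction control to explicit BEGIN/COMMIT.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS request ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " container_id TEXT,"
        " last_update REAL NOT NULL)"
    )
    return conn


def claim_stalled(conn, container_id, now=None, recent=RECENT_INTERVAL):
    """Claim stalled requests for container_id and return their ids."""
    now = time.time() if now is None else now
    cutoff = now - recent
    # First query runs without the lock, so idle scans stay cheap.
    if not conn.execute(STALLED_SQL, (cutoff,)).fetchall():
        return []
    conn.execute("BEGIN IMMEDIATE")  # take the write lock
    try:
        # Query again under the lock: another instance may have claimed some already.
        ids = [row[0] for row in conn.execute(STALLED_SQL, (cutoff,))]
        conn.executemany(
            "UPDATE request SET container_id = ?, last_update = ? WHERE id = ?",
            [(container_id, now, request_id) for request_id in ids],
        )
        conn.execute("COMMIT")  # release the lock
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return ids


def run(conn, container_id, interval=QUERY_INTERVAL):
    """Scan forever, resuming whatever this instance manages to claim."""
    while True:
        for request_id in claim_stalled(conn, container_id):
            print(f"{container_id} resuming request {request_id}")
        time.sleep(interval)
```

Refreshing `last_update` as part of the claim means the request no longer looks stalled to the other instances, so two containers should not pick up the same work within one recent interval.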

Evaluation Environment

•Dedicated 12-node Linux cluster

–Red Hat Enterprise Linux AS Release 3

–Switched Gigabit Ethernet

–2 GB RAM

–Dual 2 GHz Intel Xeon CPUs, 512 KB cache

•Globus Toolkit 4.0.3

•MySQL Standard 5.0.27

Evaluation

•Correctness / Effectiveness

–Submitted multiple RFT requests of different sizes to 12 RFT instances

–Verified fail-over and notification functionality

•Performance

–Evaluate overhead of shared DBMS

–Stress test: transfer many small files

[Figure annotations: web services container stopped; fail-over; 60 second fail-over interval]

[Figure data labels: 10%, 14%, 22%, 43%, 57%, 82%, 95%]

Related Work

•HAND: Highly Available Dynamic Deployment Infrastructure for GT4

–Migrate services between containers to maintain availability during planned outages

–Does not address management of persistent service state or fail-over for unplanned outages

•myGrid

–DBMS persistence of WS-ResourceProperties in Apache WSRF

–Points to a general-purpose approach for DBMS-based persistence of stateful WSRF services

Conclusion

•Clustering RFT provides load-balancing and fail-over with acceptable performance for small clusters

•Clustering is a promising approach for application to other grid services

Future Work

•Correctly handle replay of FTP deletes

•Implement credentialRefreshListener

•Evaluate use of different DBMS solutions

•Investigate GT4 DBMS persistence in general

•Investigate use of WS-Naming

Thanks!

•Questions? Comments?

•This material is based upon work supported by the National Science Foundation under Grant No. 0426972.

•Performance experiments were conducted on computers at the Technology Research, Education, and Commercialization Center (TRECC), a program of the University of Illinois at Urbana-Champaign, funded by the Office of Naval Research and administered by the National Center for Supercomputing Applications. We thank Tom Roney for his assistance with the TRECC cluster.

•We also thank Ravi Madduri from the Globus project for answering our questions about RFT.