Optimization of Maintenance Operations

Systems Analysis Laboratory

Helsinki University of Technology

Flight Time Allocation Using Reinforcement Learning

Ville Mattila and Kai Virtanen

Systems Analysis Laboratory, Helsinki University of Technology

www.sal.tkk.fi Ville.A.Mattila@tkk.fi

Abstract

Fighter aircraft are maintained periodically on the basis of accumulated usage hours. In a fleet of aircraft, the timing of the maintenance therefore depends on the allocation of flight time. The timing is also subject to a number of uncertainties such as failures of the aircraft. A fleet with limited maintenance resources is faced with a design problem in assigning the aircraft to flight missions so that the overall amount of maintenance needs will not exceed the maintenance capacity. We consider the assignment of aircraft to flight missions as a Markov Decision Problem over a finite time horizon. The average availability of aircraft is taken as the optimization criterion. We describe the fleet operations with a simulation model. An efficient assignment policy is solved using a Reinforcement Learning technique called Q-learning that presents actions to the simulation and observes the resulting system behavior. We compare the performance of the Q-learning algorithm to a set of heuristic assignment rules using problems involving varying numbers of aircraft and types of periodic maintenance. Moreover, we consider the possibilities of practical implementation of the produced solutions.

1. The Flight Time Allocation Problem

Problem setting

•Start of day: a fraction of the aircraft at the air base is assigned to flight missions

•End of day: mission-capable aircraft return to base

•Periodic maintenance after a fixed number of flight hours

•Limited maintenance capacity

→Which assignment preserves aircraft availability?

The flight time allocation problem

•The timing of periodic maintenance depends on the assignment of aircraft to flight missions, i.e., the allocation of flight time

→Problem: How to allocate flight time so that aircraft availability is preserved?

Availability as performance indicator

•Availability:

The proportion of mission-capable aircraft to the total size of the fleet

•One of the primary performance indicators of operational capability in actual maintenance-related decision making

•We consider average availability over a finite time horizon

–Need to study operational capability given a certain initial state

–Operational environment remains the same for a limited amount of time

Difficulty of flight time allocation

•Uncertainties

–Maintenance duration

–Accumulated flight hours during missions

–Unplanned maintenance through failure repairs

–Unplanned maintenance through battle damage repairs

•Dimension of the problem

–Potentially a large number of aircraft

–Different types of periodic maintenance

–Multiple maintenance facilities at different levels

2. Problem formulation

Formulation as a Markov Decision Problem

•State of a single aircraft: days in use since last periodic maintenance (0, …, m-1), or periodic maintenance

•State of the fleet: the number of aircraft in each state

•Transitions: assigned to flight missions; maintenance completed

•Action: the number of aircraft assigned to flight missions from each state

•Performance criterion: the number of aircraft in maintenance / total fleet size

System state

•Denote the size of the fleet with N

•A single aircraft

–State i [0, m-1]: ‘the number of stages in use since last periodic maintenance’

–State m: ‘aircraft in maintenance’

–m equals the maintenance interval of the aircraft

•The aircraft fleet

–State $s = (s_0, s_1, \dots, s_m)$, where $s_i$ denotes the number of aircraft in state $i$
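The size of the fleet state space matters when Q-factors are stored in a look-up table. The following is a minimal sketch, assuming each of the N aircraft occupies exactly one of the m+1 states; the function names are illustrative and not part of the original model.

```python
from itertools import product
from math import comb


def count_fleet_states(N: int, m: int) -> int:
    """Count vectors (s_0, ..., s_m) of non-negative integers summing to N.

    This is the standard stars-and-bars count C(N + m, m).
    """
    return comb(N + m, m)


def enumerate_fleet_states(N: int, m: int) -> list:
    """List every fleet state explicitly (only practical for small N and m)."""
    return [s for s in product(range(N + 1), repeat=m + 1) if sum(s) == N]


if __name__ == "__main__":
    # Small instance with N = 4 aircraft and maintenance interval m = 2.
    print(count_fleet_states(4, 2))
    print(enumerate_fleet_states(4, 2)[:3])
```

For N = 4 and m = 2 this gives 15 fleet states; the count grows quickly with N and m.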

Actions

•The number of aircraft to be assigned to flight missions is denoted by $d$

•Action $a = (a_0, a_1, \dots, a_{m-1})$, where $a_i$ is the number of aircraft assigned from state $i$

•The set of admissible actions in state s of the aircraft fleet:

–At most d aircraft are assigned

–If the number of available aircraft is less than d, all are assigned
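The admissible action set can be enumerated directly for small instances. The sketch below assumes that exactly min(d, available) aircraft are assigned, which is one reading of the two rules above; the function name `admissible_actions` is hypothetical.

```python
from itertools import product


def admissible_actions(s, d):
    """Enumerate actions a = (a_0, ..., a_{m-1}) admissible in fleet state s.

    s is (s_0, ..., s_m); the last entry counts aircraft in maintenance.
    Assumption: exactly min(d, number of available aircraft) are assigned.
    """
    m = len(s) - 1
    available = sum(s[:m])          # aircraft not in maintenance
    k = min(d, available)           # number of aircraft that will fly
    # a_i cannot exceed the aircraft present in state i, nor the total k.
    ranges = [range(min(s_i, k) + 1) for s_i in s[:m]]
    return [a for a in product(*ranges) if sum(a) == k]


if __name__ == "__main__":
    print(admissible_actions((1, 2, 1), d=1))
```

In state (1, 2, 1) with d = 1, the choice is between flying an aircraft from state 0 or from state 1.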

Simulation of the aircraft fleet

•Current state s, action a, resulting state s’

•Maintenance capacity M, expected duration D

•State transitions

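–Completed maintenance (with $s'$ initialized to $s$): for $k = 1$ to $\min(M, s_m)$

draw $z \sim U(0,1)$

if $z < 1/D$: $s_0' \leftarrow s_0' + 1$, $s_m' \leftarrow s_m' - 1$

–Usage of aircraft: for $k = 1$ to $m-1$

$s_k' \leftarrow s_k' + a_{k-1} - a_k$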
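The transition rules can be sketched as a single simulation step. This is an illustrative version, not the authors' simulation model: it assumes that aircraft assigned from state 0 leave that state and that aircraft assigned from state m-1 enter maintenance, which the loop over k = 1, …, m-1 leaves implicit. The name `simulate_step` and the `rng` parameter are hypothetical.

```python
import random


def simulate_step(s, a, M, D, rng=random):
    """Simulate one stage: return the next fleet state s'.

    s: (s_0, ..., s_m), a: (a_0, ..., a_{m-1}),
    M: maintenance capacity, D: expected maintenance duration (stages).
    """
    m = len(s) - 1
    s_next = list(s)

    # Completed maintenance: each aircraft being serviced (at most M of them)
    # finishes with probability 1/D and returns to state 0.
    for _ in range(min(M, s[m])):
        if rng.random() < 1.0 / D:
            s_next[0] += 1
            s_next[m] -= 1

    # Usage of aircraft: an aircraft assigned from state k moves to k + 1.
    # Assumed boundary cases: k = 0 only loses aircraft, and aircraft
    # assigned from state m - 1 enter maintenance (state m).
    for k in range(m):
        s_next[k] -= a[k]
        s_next[k + 1] += a[k]

    return tuple(s_next)


if __name__ == "__main__":
    print(simulate_step((1, 2, 1), (0, 1), M=1, D=2, rng=random.Random(0)))
```

The fleet size is conserved by construction, since every update moves one aircraft from one state to another.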

Optimization criterion

•Immediate reward r(s,a,s’) is the aircraft availability in s’

•Optimization criterion: average availability over finite number of stages
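Written out, the reward and the criterion could take the following form. This is a sketch under two assumptions: every aircraft not in state m counts as mission-capable, and the horizon consists of L stages indexed by t; the symbol J and the stage index are notation introduced here for illustration.

```latex
r(s, a, s') = \frac{N - s'_m}{N} = 1 - \frac{s'_m}{N},
\qquad
J = \mathbb{E}\left[ \frac{1}{L} \sum_{t=0}^{L-1} r\bigl(s(t), a(t), s(t+1)\bigr) \right]
```

For example, with N = 4 and s' = (1, 2, 1), one aircraft is in maintenance and the reward is 3/4.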

3. The reinforcement learning approach

Learning optimal flight time allocation policy

•The learning algorithm: presents an action to the simulation and observes its benefit on the basis of the simulated response

•Simulation of the system: simulates the state transition and reward following the execution of the action

•Loop: the learning algorithm sends an action, the simulation returns the system state and reward; repeat

•Learned policy: actions that produce the greatest immediate and expected future rewards

Q-learning

•The value of executing action a in state s and following the optimal policy from then on is stored in the Q-factor Q(s,a) of the state-action pair

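•The factors Q(s,a) are updated as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r(s,a,s') + \gamma \max_{a' \in A(s')} Q(s',a') - Q(s,a) \right]$$

where $\alpha \in (0,1)$ is the step size and $\gamma \in (0,1)$ the discount factor

•The learned policy:

$$\pi(s) = \arg\max_{a \in A(s)} Q(s,a)$$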
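A minimal sketch of a standard tabular Q-learning update and greedy policy extraction follows. It assumes Q-factors are stored in a dictionary keyed by (state, action) with unvisited pairs defaulting to zero; the function names and the zero initialization are assumptions for illustration, not details taken from the original.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, next_actions, alpha, gamma):
    """Apply one tabular Q-learning update to the pair (s, a).

    next_actions is the admissible action set A(s') of the resulting state.
    """
    # Best Q-factor attainable from s'; 0.0 if s' has no admissible actions.
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


def greedy_action(Q, s, actions):
    """Return the action with the highest Q-factor in state s."""
    return max(actions, key=lambda a: Q[(s, a)])


if __name__ == "__main__":
    Q = defaultdict(float)  # assumed initialization: all Q-factors start at 0
    q_update(Q, (1, 2, 1), (0, 1), r=0.75, s_next=(1, 1, 2),
             next_actions=[(0, 1), (1, 0)], alpha=0.5, gamma=0.98)
    print(greedy_action(Q, (1, 2, 1), [(1, 0), (0, 1)]))
```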

Details of the learning algorithm

•Action selection

with probability e, select a for which Q(s,a) is highest, i.e., a greedy action

with probability 1-e, select any other a randomly from A(s)

•Step size

the step size $\alpha$ is set as a function of V(s,a), where V(s,a) denotes the number of times the pair (s,a) has been visited

•Discounting

–Q-learning is actually a technique for discounted total reward

–Can however optimize average reward, if $\gamma$ is sufficiently high
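The action selection rule and a visit-based step size can be sketched as follows. Note that e is the probability of the greedy choice here. The decay rule 1/(1 + V(s,a)) is an illustrative assumption, since the exact step-size formula is not given in the text; the function names are hypothetical as well.

```python
import random


def select_action(Q, s, actions, e, rng=random):
    """Pick the greedy action with probability e, otherwise another action.

    Q is a dict keyed by (state, action); missing entries count as 0.
    """
    greedy = max(actions, key=lambda a: Q.get((s, a), 0.0))
    others = [a for a in actions if a != greedy]
    # With a single admissible action there is nothing to explore.
    if not others or rng.random() < e:
        return greedy
    return rng.choice(others)


def step_size(visits):
    """Assumed decay rule: alpha = 1 / (1 + V(s,a)), with V(s,a) >= 1."""
    return 1.0 / (1.0 + visits)


if __name__ == "__main__":
    Q = {((1, 2, 1), (0, 1)): 0.5}
    print(select_action(Q, (1, 2, 1), [(1, 0), (0, 1)], e=0.9))
    print(step_size(3))
```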

Heuristic policies

•Can provide efficient solutions to many complex problems

•Can serve as a reference for the policy produced by Q-learning

•Two simple policies are considered

–‘advance’:

flight time is allocated to aircraft with least time to maintenance

–‘postpone’:

flight time is allocated to aircraft with most time to maintenance
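Both heuristics can be sketched as one function that fills the required assignments in a fixed order of states. The sketch assumes that a higher state index means less time to maintenance and that min(d, available) aircraft are assigned; the function name `heuristic_action` is hypothetical.

```python
def heuristic_action(s, d, rule):
    """Build an action for fleet state s under the 'advance' or 'postpone' rule.

    'advance'  : fly aircraft closest to maintenance first (highest state index).
    'postpone' : fly aircraft furthest from maintenance first (lowest index).
    """
    if rule not in ("advance", "postpone"):
        raise ValueError("rule must be 'advance' or 'postpone'")
    m = len(s) - 1
    remaining = min(d, sum(s[:m]))  # aircraft still to be assigned
    order = range(m - 1, -1, -1) if rule == "advance" else range(m)
    a = [0] * m
    for i in order:
        take = min(s[i], remaining)
        a[i] = take
        remaining -= take
    return tuple(a)


if __name__ == "__main__":
    print(heuristic_action((1, 2, 1), 1, "advance"))
    print(heuristic_action((1, 2, 1), 1, "postpone"))
```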

4. Results

Example problem

•Problem instance

–N = 4 the number of aircraft

–m = 2 maintenance interval

–d = 1 number of aircraft to flight missions

–M = 1 maintenance capacity

–D = 2 expected duration of maintenance

–L = 50 number of stages

–Initial state s(0)=[1 2 1]

•Learning parameters

–e = 0.9 probability of choosing a greedy action

–$\gamma$ = 0.98 the discount factor

Convergence of average reward

•A convergent solution is obtained after 1000 state transitions

–20 replications of the 50-day time period

•Average availability over the time period outperforms simple heuristic policies

Availability under the different policies

•The learned policy

–Maintains higher availability than heuristic policies in the beginning

–Matches the availability of the heuristics during later stages

Characterizing the learned policy

•Since m was taken very small, the learned solution can be characterized with the ‘advance’ and the ‘postpone’ heuristic policies as follows:

–if number of aircraft in maintenance is equal to or more than capacity:

→ ‘postpone’

–if maintenance facility is idle:

if s2 >1 → ‘advance’

else → ‘postpone’

5. Conclusions

Contributions

•Insight into a difficult problem actually faced by fleet commanders

•Flight time allocation as a means for timing maintenance

–Has not been considered as a dynamic problem to the best of our knowledge

–Has not been considered with RL techniques

The reinforcement learning approach

•Results of the reinforcement learning approach for the studied problem instances are promising

–A convergent policy is found

–The obtained policy outperforms simple heuristic policies

–Learning time is manageable for fleet sizes of up to 16 aircraft

Extensions to the model

•A number of extensions to the presented model are likely required in order to describe more realistic scenarios

•Of particular interest are the effects of

–Additional uncertainties such as battle damage

–Operational environment that evolves through time

–Violations of the Markovian property of states

Analysis of obtained policies

•The purpose of studying the flight time allocation problem is to obtain new insight for the use of human decision makers

•Until now, Q-learning has been implemented as a look-up table version

–Q-factors are stored explicitly → representation of the learned policy requires large storage space

–Post-learning analysis to build intuition of efficient policies

•A future challenge is to represent policies in a compact form that allows both

–Efficient learning

–Intuitive representation to human decision makers