Simple stochastic models

Haplotype Blocks

An Overview

A. Polanski

Department of Statistics

Rice University

Key Papers

1. N. Patil et al. (2001), Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21, Science, vol. 294, pp. 1719-1723

2. N. Wang et al. (2002), Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination and Mutation, Am. J. Hum. Genet., vol. 71, pp. 1227-1234

3. K. Zhang et al. (2002), A Dynamic Programming Algorithm for Haplotype Block Partitioning, PNAS, vol. 99, pp. 7335-7339

Supplementary Papers

1. R. Hudson, N. Kaplan (1985), Statistical Properties of the Number of Recombination Events in the History of a Sample of DNA Sequences, Genetics, vol. 111, pp. 147-164

2. R. Hudson (2002), Generating Samples under a Wright-Fisher Neutral Model of Genetic Variation, Bioinformatics, vol. 18, pp. 337-338

3. D. Reich et al. (2001), Linkage Disequilibrium in the Human Genome, Nature, vol. 411, pp. 199-204

What are Haplotype Blocks?

Haplotype block = a sequence of contiguous markers on DNA, homogeneous according to some criterion

Markers = Single Nucleotide Polymorphisms (SNPs)

Data (Patil et al. 2001)

Chromosome 21

The two copies of chromosome 21 were physically separated using a rodent-human somatic cell hybrid technique

Sample of 20 copies of chromosome 21 (32397439 bases)

Found: 35989 SNPs

Fig. 2 from (Patil et al. 2001)

0100000000000000000010000000000000010000111000000000100000001001000000001001000000000000000000001000000001101000010101010

0000000010000000000010000000000100100001000000000000001011001001001010001001000000000010010001011000000001101010010101010

0000000001000100010110001010000000010100011000000000010100000000000100000100110000011101001000000110000110001000100011010

0000000000000100010010001010000000010100011000000000010100000000000100000100110000011101001000000110000110001000100011010

0000000010000000000010000100000100100000000000000000001001001001001010001001000000000010010001011000000001100100000000000

0010000000100001000010010000000000010000011000000000010100000000100100110100010000000010000001001000001001110100000000000

0000000010000000000010000100110100100000000000000000001001001001001010001001000000000010010001011000000001100100000000000

1000100000000000000001000001000101000000000000000001000000001001000001001001000000100000000100001000000001101010010101010

0000000000001000000010000000000000010000011000000000000000001001000000001001000000000000000000001000000001101000010101010

0000000010000000000010000100000100100000000000000000001001001101001010001001000000000010010001011000000001100100000000000

1000100000000000000001000001000101000000000000000001000000001001000000001001000000000000000000001000010001101010010101010

0000100000000000100001000000000101000000000000000000000000001001010000001001000000000000000000001000000001101000010101010

1000100000000000000001000001000101000000000000000001000000001001000001001001000000100000000100001000000001101010010101010

0000000100100000000010010000000000011000011010000000010100000010100100100100010010000010100001001000001001110100000000000

1000100000000010000001000001000101000000000000000001000000001001000001001001000000100000000100001000000001101010010101010

0000000000100000000010010000000000010000011000000000010100000000100100100100010000000010000001001000001001110101000000001

0000000000100000000010010000000000010000011010000000010100000010100100100100010010000010000001001001001001110100000000000

0001001000010000001000100000001010000000011001111110000000110000000000000010011101010000001010100100000000001000001011110

0000100000000000100001000000000101000000000000000000000000001001010000001001000000000000000000001000000001101000010101010

0001010000000000001000000000000010000010011101000010000000100000000000000010010001010000001000100100100000001000001011010

……

SNP number i, i = 1, 2, …, 35989

Problems

How do we determine boundaries between blocks?

1. Average value of the standardized coefficient of linkage disequilibrium is greater than some threshold (Wang et al. 2002, Reich et al. 2001)

2. Infer sites in the sample of DNA sequences where recombination events happened in the past history (Wang et al. 2002, Hudson 2002)

3. Chromosome coverage – minimum number of SNPs to account for the majority of haplotypes (Patil et al. 2001, Zhang et al. 2002)
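Criterion 1 can be illustrated with a minimal sketch. It assumes that the "standardized coefficient" is Lewontin's D', that haplotypes are given as equal-length strings of '0'/'1' (as in the data shown above), and that the average is taken over all SNP pairs in a candidate region. The function names are hypothetical, and the cited papers may average or threshold differently.

```python
from itertools import combinations

def d_prime(col_a, col_b):
    """Lewontin's D' for two biallelic SNPs given as lists of 0/1 alleles."""
    n = len(col_a)
    p_a = sum(col_a) / n                      # frequency of allele 1 at SNP a
    p_b = sum(col_b) / n                      # frequency of allele 1 at SNP b
    p_ab = sum(1 for a, b in zip(col_a, col_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient
    if d == 0:
        return 0.0
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max

def mean_abs_d_prime(haps):
    """Average |D'| over all SNP pairs (assumes at least two SNPs)."""
    m = len(haps[0])
    cols = [[int(h[i]) for h in haps] for i in range(m)]
    pairs = list(combinations(range(m), 2))
    return sum(abs(d_prime(cols[i], cols[j])) for i, j in pairs) / len(pairs)
```

A region would then be called a block if `mean_abs_d_prime` exceeds a chosen threshold; the threshold value is a free parameter not specified here.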

What evolutionary forces are responsible for haplotype block formation?

•Mutation

•Genetic drift

•Recombination

•Recombination hot spots

Methods

Method 1 (Wang et al. 2002)

Infer sites in the sample of DNA sequences where recombination events happened in the past history

Three gamete condition

Consider a pair of SNPs, SNP1 and SNP2. If there was no recombination between SNP1 and SNP2, they must satisfy the three gamete condition: at most three of the four possible allele combinations (gametes) appear in the sample

[Diagram: SNP1 and SNP2, with allele labels A, G and C, T]

Four gamete test (Hudson and Kaplan, 1985)

If we see all four gametes at SNP1 and SNP2, then (assuming no recurrent mutation) there must have been a recombination event between these sites in their past history

4GT

Array of pairwise 4GT test results

Hudson and Kaplan, 1985

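$$
D = [d_{ij}], \qquad
d_{ij} =
\begin{cases}
0, & \text{if fewer than 4 gametes are observed at SNPs } i \text{ and } j \\
1, & \text{if all 4 gametes are observed at SNPs } i \text{ and } j
\end{cases}
$$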
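The array D can be computed directly from the haplotype data. The following is a minimal sketch, assuming haplotypes are equal-length strings of '0'/'1' with no missing data; the function names are hypothetical.

```python
def gametes(haps, i, j):
    """Set of allele pairs (gametes) observed at SNPs i and j."""
    return {(h[i], h[j]) for h in haps}

def four_gamete_matrix(haps):
    """Symmetric 0/1 matrix D with d[i][j] = 1 iff all four gametes occur."""
    m = len(haps[0])
    d = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if len(gametes(haps, i, j)) == 4:
                d[i][j] = d[j][i] = 1
    return d
```

For example, `four_gamete_matrix(["00", "01", "10", "11"])` marks the single SNP pair as incompatible, while dropping any one of the four haplotypes leaves the pair compatible.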

What is the minimal number of recombinations that could explain the observed data?

Statistic FR (Hudson and Kaplan, 1985)
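A lower bound of this kind can be sketched as follows. Each SNP pair (i, j) with all four gametes implies at least one recombination somewhere between i and j; the largest set of such intervals that do not overlap (they may share an endpoint) gives the minimal count. This sketch follows the interval argument of Hudson and Kaplan (1985); that it matches the statistic FR exactly as used in the slides is an assumption, and the function name is hypothetical. The input is the 0/1 array D of pairwise four gamete test results.

```python
def min_recombinations(d):
    """Lower bound on the number of recombination events implied by D."""
    m = len(d)
    # Every incompatible pair defines an interval that must contain an event.
    intervals = [(i, j) for i in range(m) for j in range(i + 1, m)
                 if d[i][j] == 1]
    # Greedy interval scheduling: scan by right endpoint and keep each
    # interval that starts at or after the end of the last one kept.
    intervals.sort(key=lambda ij: ij[1])
    count = 0
    last_right = 0
    for left, right in intervals:
        if left >= last_right:
            count += 1
            last_right = right
    return count
```

Two incompatible pairs that overlap, such as (0, 2) and (1, 3), can be explained by a single event between SNPs 1 and 2, so they contribute only one to the count.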

Fig. 1 from Wang et al., 2002 (SNPs partitioned into Block 1, Block 2, Block 3)
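One way to turn the four gamete test into a block partition is sketched below. It assumes that a block is extended one SNP at a time and closed as soon as the new SNP shows all four gametes with any SNP already in the block; this is an illustrative reading of Method 1, not necessarily the exact procedure of Wang et al. Haplotypes are equal-length strings of '0'/'1', and the function name is hypothetical.

```python
def four_gamete_blocks(haps):
    """Partition SNPs into blocks (start, end), inclusive, 0-based."""
    m = len(haps[0])
    blocks = []
    start = 0
    for j in range(1, m):
        # Does SNP j show all four gametes with some SNP in the current block?
        violates = any(len({(h[i], h[j]) for h in haps}) == 4
                       for i in range(start, j))
        if violates:
            blocks.append((start, j - 1))   # close the current block
            start = j                       # SNP j opens the next block
    blocks.append((start, m - 1))
    return blocks
```

The lengths of the resulting blocks give the block length distribution that can be compared between simulated and observed data.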

Wang et al., 2002 - Study

•R. Hudson’s program for simulating genealogies with mutation, drift and recombination under various demographic scenarios

•Study of the dependence of average block length on different factors

•Comparison of simulation results to data from Patil et al., 2001

Dependence of average lengths of blocks on recombination frequency

… on sample size

... on mutation intensity

Comparison to data from Patil et al. 2001

•Compute the distribution of haplotype block lengths in the data from Patil et al. 2001

•Try to tune the mutation intensity and the recombination parameter R to obtain a similar distribution in the simulations

… Failed

Try a mixture of two different recombination frequencies: better fit

Method 2 (Patil et al. 2001)

Chromosome coverage – minimum number of SNPs to account for the majority of haplotypes

Fig. 2 from (Patil et al. 2001)

Problem formulation

Define block boundaries to minimize the number of SNPs that distinguish at least α percent of the haplotypes in each block

Common haplotypes

Those represented more than once in the block

Condition

Common haplotypes must constitute at least α = 80 percent of all haplotypes in the block

Blocks that do not satisfy this are not allowed
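The condition can be checked with a short sketch. It assumes haplotypes are equal-length strings of '0'/'1' with no missing data, that a block is given by inclusive 0-based SNP indices, and that the threshold is expressed as a fraction (0.8 for 80 percent). The function and parameter names are hypothetical.

```python
from collections import Counter

def common_haplotype_fraction(haps, start, end):
    """Fraction of sampled chromosomes carrying a common block haplotype."""
    counts = Counter(h[start:end + 1] for h in haps)
    # Common haplotypes are those seen more than once within the block.
    common = sum(c for c in counts.values() if c > 1)
    return common / len(haps)

def block_allowed(haps, start, end, threshold=0.8):
    """True if common haplotypes reach the required coverage."""
    return common_haplotype_fraction(haps, start, end) >= threshold
```

With five chromosomes `["00", "00", "01", "01", "11"]`, four carry a common haplotype, so the two-SNP block just meets the 80 percent condition.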

Fragment of Fig. 2 from Patil et al., 2001

Notation

•B – block, defined as a set of consecutive SNP numbers, e.g., B = {45, 46, …, 50} or B = {i, i+1, …, j}

•L(B) – length of the block (number of SNPs)

•f(B) – minimum number of SNPs required to distinguish the common haplotypes
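For small blocks, f(B) can be found by exhaustive search, as in the sketch below. It assumes haplotypes are equal-length strings of '0'/'1' with no missing data and that the block is given by inclusive 0-based SNP indices. The function name is hypothetical, and the search is exponential in L(B), so it is only meant to make the definition concrete.

```python
from collections import Counter
from itertools import combinations

def min_distinguishing_snps(haps, start, end):
    """Smallest number of SNPs in the block that tells all common
    haplotypes (those seen more than once) apart."""
    counts = Counter(h[start:end + 1] for h in haps)
    common = [seg for seg, c in counts.items() if c > 1]
    length = end - start + 1
    for k in range(length + 1):
        for cols in combinations(range(length), k):
            # Restrict each common haplotype to the chosen SNPs.
            projected = {tuple(seg[c] for c in cols) for seg in common}
            if len(projected) == len(common):
                return k
    return length
```

For the block `["000", "000", "011", "011", "110", "110"]` no single SNP separates the three common haplotypes, but the first two SNPs together do, so f(B) = 2 while L(B) = 3.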

Greedy Solution