Analytical Tools for High Throughput Sequencing Data

Marshall University School of Medicine

Department of Biochemistry and Microbiology

BMS 617

Lecture 13: Multiple, Logistic and Proportional Hazards Regression

Marshall University Genomics Core Facility

Multiple Regression

•In linear regression, we had one independent variable and one dependent (outcome) variable

–In lab experiments, this is fairly common

–The investigator manipulates the value of one variable and keeps everything else the same

•In some lab experiments, and in most observational studies, there is more than one independent variable

–Multiple Regression is used for these scenarios

–"Multiple Regression" really refers to a collection of different techniques

Aims of Multiple Regression

•Quantifying the effect of one variable of interest while adjusting for the effects of other variables

–Very common in observational studies

–The other variables change outside of the control of the investigator

–These other variables are often called covariates

•Creating an equation which is useful for predicting the value of the outcome variable given the values of the various independent variables

–For example, predict the probability of cancer recurrence after surgery alone given characteristics of the tumor (grade, stage, etc.) and of the patient (age, height, weight, etc.)

•Might be used to decide whether or not to use chemotherapy in addition to surgery

•Developing a scientific understanding of the impact of several variables on the outcome

Types of Multiple Regression

•We will look at the following types of multiple regression (there are many others):

–Multiple Linear Regression

•The dependent variable is a linear function of the independent variables

–Logistic Regression

•The outcome variable is binary (dichotomous, or categorical with two possible outcomes)

•The log odds of the outcome is modeled as a linear function of the independent variables

–Proportional Hazards Regression

•Proportional Hazards Regression is used when the outcome is the elapsed time to a non-recurring event

•It is effectively used to compute the effect of independent variables on a survival curve

Multiple Linear Regression

•Multiple Linear Regression finds the linear equation which best predicts an outcome variable, Y, from multiple independent variables X1, X2, …, Xk

•Example (from Motulsky): Lead Exposure and Kidney Function

–Staessen et al. (1992) investigated the relationship between lead concentration in the blood and kidney function

•Kidney function measured by creatinine clearance

–Observational study of 965 men

–Naive approach would be to measure lead concentration and creatinine clearance and analyze just the two variables

–However, kidney function is known to decrease with age, and lead accumulates in the blood over time

•Age is a confounding variable

•Must account for this

Multiple Regression Model

•The model Staessen et al. used was

•Yi = β0 + β1Xi,1 + β2Xi,2 + β3Xi,3 + β4Xi,4 + β5Xi,5 + εi

•where the variables are

Variable    Description
Yi          Creatinine clearance of subject i
Xi,1        log(serum lead) of subject i
Xi,2        Age of subject i
Xi,3        Body mass of subject i
Xi,4        log(GGT) of subject i (liver function)
Xi,5        1 if subject i had previously taken diuretics, 0 otherwise
εi          Random scatter

Multiple Regression Parameters

•The β terms in the equation for the model are the parameters of the model

–Do not vary from data point to data point

–Are values associated with the population

–Will be estimated from the data

•Note that one of the variables (Xi,5) is categorical, and we use a “dummy variable” in its place

What multiple regression does

•Multiple linear regression finds values for the parameters that make the model predict the actual data as well as possible

•Estimates for β0, …, β5 are usually denoted b0, …, b5

•Software performing the regression will report the best estimate for each parameter, a confidence interval and p-value for each estimate, and an R² value for the model

•The null hypothesis for each p-value is that the corresponding variable provides no information to the model, i.e. that its parameter is zero
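
As a rough illustration of what "finding the best parameter values" involves, the sketch below fits a two-variable linear model by least squares in plain Python. The data are synthetic and the function names are made up for this example; it returns only the point estimates b0, b1, b2, not the confidence intervals or p-values that statistical software would also report.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple_linear_regression(rows, y):
    """Return least-squares estimates [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic data (an assumption for illustration, not the Staessen data):
# the "true" parameters are 100, -9.0 and -0.5, plus random scatter.
rng = random.Random(0)
rows = [(rng.uniform(0, 2), rng.uniform(20, 80)) for _ in range(200)]
y = [100.0 - 9.0 * x1 - 0.5 * x2 + rng.gauss(0, 1.0) for x1, x2 in rows]

b0, b1, b2 = fit_multiple_linear_regression(rows, y)
print(b0, b1, b2)
```

The estimates should land close to the values used to generate the data, which is the sense in which the fitted b values estimate the population β values.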

Interpreting the Coefficients

•The coefficients can be interpreted in a similar way to the slope estimate in simple linear regression

–Represent the change in the dependent variable for a one-unit increase in the corresponding independent variable, keeping all the other independent variables fixed

•In the example, b1 (estimate for log(lead concentration)) was -9.5 ml/min, with a 95% CI of [-18.1, -0.9].

•This means for every one-unit increase in log(lead concentration), creatinine clearance decreased by 9.5 ml/min on average, if all other variables were kept fixed.

Statistical Significance of the Coefficients

•A one-unit increase in log(lead concentration) means a 10-fold increase in lead concentration (the logarithm is base 10)

•So the average decrease in creatinine clearance corresponding to a 10-fold increase in lead concentration was 9.5 ml/min, and the 95% confidence interval for the decrease was 0.9 ml/min to 18.1 ml/min.

–Since the 95% CI does not contain 0, the p-value for this coefficient must be less than 0.05

•This is the p-value for the null hypothesis that the coefficient is zero

•Alternatively think of this as a comparison of models:

–Compare the full model (including this variable) to the model not including this variable
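
The arithmetic behind this interpretation can be checked with a few lines of Python. This is only a sketch: it uses the coefficient and confidence interval quoted in the example, assumes the logarithm is base 10 (as the 10-fold statement implies), and the function name is invented for illustration.

```python
import math

B1 = -9.5                      # ml/min per unit of log10(lead), from the example
CI_LOW, CI_HIGH = -18.1, -0.9  # 95% confidence interval for b1

def clearance_change(fold_change, coefficient=B1):
    """Average change in creatinine clearance for a given fold change in lead,
    with all other variables held fixed (assumes a base-10 logarithm)."""
    return coefficient * math.log10(fold_change)

tenfold = clearance_change(10)   # one unit of log10(lead)
doubling = clearance_change(2)   # about 0.3 units of log10(lead)

# If the 95% CI excludes zero, the p-value for the coefficient is below 0.05.
ci_excludes_zero = not (CI_LOW <= 0 <= CI_HIGH)

print(tenfold, doubling, ci_excludes_zero)
```

The same function shows that smaller changes in lead scale with the logarithm: a doubling corresponds to a bit under a third of the 10-fold effect.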

Interpreting coefficients for “dummy variables”

•One of the variables in the model was really a binary variable

–Has the subject previously taken diuretics?

–Coded as 0 for no and 1 for yes

•Estimate for the coefficient for this variable was -8.8 ml/min

–An increase of one unit in this variable results in a decrease in creatinine clearance of 8.8 ml/min, on average

–Since the only values are 0 and 1, this means that participants who had previously taken diuretics had an average creatinine clearance 8.8 ml/min lower than those who had not, if all other variables are held equal

Interpreting the R² value for the model

•Multiple linear regression reports an R² value

–For our example, R² is 0.27

•This means that 27% of the variation in creatinine clearance is accounted for by the model

•The remaining 73% is due to random scatter, or is associated with variables not included in the model

•Unlike simple linear regression, we cannot plot a graph of the model

•One approach to visualizing the model is to plot the predicted outcome variable from the model against the actual measured value
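
To make the "fraction of variation accounted for" concrete, here is a minimal sketch of how R² can be computed from measured and predicted values. The five data points are hypothetical and are not taken from the study; the same (actual, predicted) pairs are what one would put on the predicted-versus-measured plot.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_resid / ss_total

# Hypothetical measured and model-predicted creatinine clearances (ml/min).
actual = [95.0, 110.0, 88.0, 120.0, 102.0]
predicted = [98.0, 105.0, 92.0, 112.0, 103.0]

r2 = r_squared(actual, predicted)
plot_points = list(zip(actual, predicted))  # x = measured, y = predicted
print(r2)
```

Points lying close to the diagonal of such a plot correspond to a small residual sum of squares and therefore an R² near 1.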

Multiple Linear Regression Plot

Variable Selection

•The authors of the article collected much more data

•Stated that other variables did not improve the fit of the model

•Adding additional parameters will almost always increase the R² value

–Should use the sum-of-squares F test explained earlier to test if there really is an improvement in the model

–Beware of overfitting (explained later)
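
As a reminder of how the sum-of-squares F test compares nested models, the sketch below computes the F ratio from the residual sums of squares and degrees of freedom of a reduced and a full model. The numbers are hypothetical and the function name is made up; converting F to a p-value needs the F distribution (for example `scipy.stats.f.sf`), which is left out to keep the example dependency-free.

```python
def extra_sum_of_squares_f(ss_reduced, df_reduced, ss_full, df_full):
    """F ratio for nested models: the relative drop in residual sum of squares
    per extra parameter, divided by the full model's residual variance."""
    numerator = (ss_reduced - ss_full) / (df_reduced - df_full)
    denominator = ss_full / df_full
    return numerator / denominator

# Hypothetical values: adding one variable lowers the residual SS from 1200 to 1100
# and uses up one residual degree of freedom (97 -> 96).
f_stat = extra_sum_of_squares_f(ss_reduced=1200.0, df_reduced=97,
                                ss_full=1100.0, df_full=96)
print(f_stat)
```

A large F ratio says the drop in the sum of squares is more than would be expected from adding an uninformative parameter.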

Logistic Regression

•Logistic Regression is used when the outcome variable is binary

–i.e. categorical with two possible outcomes

•The general idea is to build a multiple linear model with the outcome variable being the log of the odds

–i.e. we build a model predicting the log of the odds of one of the two outcomes from the independent variables

–each parameter is the change in the log odds when its variable increases by one unit (equivalently, the log of the odds ratio)
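
Since the model works on the log odds scale, it helps to be able to move between probability, odds and log odds. The following sketch shows the standard conversions; the function names and the example probability of 0.2 are chosen purely for illustration.

```python
import math

def odds_from_probability(p):
    """Odds = p / (1 - p)."""
    return p / (1.0 - p)

def probability_from_log_odds(log_odds):
    """Inverse of the log odds (logit) transformation."""
    return 1.0 / (1.0 + math.exp(-log_odds))

p = 0.2                               # illustrative probability of the outcome
odds = odds_from_probability(p)       # 1 to 4
log_odds = math.log(odds)             # natural log, the scale the model works on
p_back = probability_from_log_odds(log_odds)

print(odds, log_odds, p_back)
```

A log odds of 0 corresponds to odds of 1 and a probability of 0.5; negative log odds mean the outcome is less likely than not.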

Logistic Regression Example

•We performed chart reviews on 99 post-menopausal women

•Ran a logistic regression for an outcome of diabetes with age at menopause, smoking status, and BMI as independent variables

Logistic Regression Results

Interpreting Logistic Regression Results

•The "Model Summary" box describes how well the model fits the data.

–-2 Log likelihood is computed from the likelihood of our observed data given the model. Since the likelihood must be between 0 and 1, its log is never positive, so -2 Log likelihood is never negative, and a smaller value means a better fit. (Our data do not fit the model well.)

•R² cannot be calculated in the same way for logistic regression. The remaining two values give two alternate approaches, and the interpretation for these is similar to a regular R². Again, our data do not fit the model well.

•The "Classification Table" describes the accuracy of using the model as a predictor.

•Use the independent variables to compute the predicted odds, and predict whichever class is more likely

•Note that adding more variables will almost always improve the accuracy on the data used to fit the model; the accuracy should really be tested on an independent data set
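
A classification table of this kind can be built by hand from predicted probabilities. The sketch below is an illustration with hypothetical outcomes and predictions, not the chart-review data; it predicts the more likely class using a probability cutoff of 0.5 (that is, odds of 1) and counts how often the prediction matches the observed outcome.

```python
def classification_table(actual, predicted_probability, cutoff=0.5):
    """Count subjects by (actual class, predicted class)."""
    table = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, prob in zip(actual, predicted_probability):
        predicted = 1 if prob >= cutoff else 0
        table[(a, predicted)] += 1
    return table

def accuracy(table):
    """Fraction of subjects whose predicted class equals the actual class."""
    correct = table[(0, 0)] + table[(1, 1)]
    return correct / sum(table.values())

# Hypothetical data: 1 = diabetic, 0 = not diabetic.
actual = [0, 0, 0, 0, 1, 1, 1, 0]
probs = [0.1, 0.3, 0.6, 0.2, 0.7, 0.4, 0.9, 0.45]

table = classification_table(actual, probs)
print(table, accuracy(table))
```

Running the same counting on subjects that were not used to fit the model gives a fairer picture of predictive accuracy.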

Interpreting the Logistic Regression Parameters

•The "Variables in the Equation" box gives the parameter estimates, 95% CIs, and p-values

•The parameter for Smoking is 1.204. This means that a one-unit increase in the smoking variable results in an increase in the log odds of 1.204.

•Logs here are natural logs, so the odds increase e^1.204 = 3.335 fold; this factor is the odds ratio

•This is a dummy variable, so a smoker has about 3.3 times the odds of being diabetic that a non-smoker has

•The parameter for BMI is 0.072; e^0.072 = 1.075, so an increase of one unit in BMI results in a 1.075-fold increase in the odds of being diabetic.

•The p-values and 95% CIs show that the parameter for smoking is significant at a significance level of 0.05.

•BMI has a p-value of 0.055, so it narrowly misses significance at that level.
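
The conversion from a logistic regression parameter to an odds ratio is a single exponentiation, sketched below with the two estimates quoted above. The five-unit BMI comparison is an added illustration, and results may differ from the quoted figures in the third decimal place, presumably because the reported coefficients are rounded.

```python
import math

B_SMOKING = 1.204   # change in log odds for smoker vs non-smoker
B_BMI = 0.072       # change in log odds per unit of BMI

or_smoking = math.exp(B_SMOKING)   # odds ratio, smoker vs non-smoker
or_bmi = math.exp(B_BMI)           # odds ratio per one unit of BMI
or_bmi_5 = math.exp(5 * B_BMI)     # odds ratio for a 5-unit BMI difference

print(or_smoking, or_bmi, or_bmi_5)
```

Because the model is additive on the log scale, effects multiply on the odds scale: a 5-unit difference in BMI multiplies the odds by the one-unit odds ratio raised to the fifth power.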

Mathematical Model for Logistic Regression

•The mathematical setup for logistic regression is:

•log(Oddsi) = β0 + Xi,1 β1 + … + Xi,k βk

•where the variables are

•Oddsi: Odds of the outcome for subject i

•Xi,j: Value of variable j for subject i

•For our model, the estimates give (S is 1 for a smoker and 0 otherwise; B is BMI)

•log(Odds) = -3.307 + 1.208 S + 0.071 B

•Odds = e^(-3.307 + 1.208 S + 0.071 B)

•Odds = e^(-3.307) × e^(1.208 S) × e^(0.071 B) = 0.037 × 3.347^S × 1.073^B
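
The fitted equation can be turned into predicted odds and a predicted probability for any given subject. The sketch below uses the three estimates quoted above; the subject (BMI of 30, smoker versus non-smoker) is hypothetical, and the function names are invented for this example.

```python
import math

B0, B_SMOKING, B_BMI = -3.307, 1.208, 0.071   # estimates quoted above

def predicted_odds(smoker, bmi):
    """Predicted odds of diabetes; smoker is 1 or 0, bmi is the body mass index."""
    return math.exp(B0 + B_SMOKING * smoker + B_BMI * bmi)

def predicted_probability(smoker, bmi):
    """Convert the predicted odds to a probability: p = odds / (1 + odds)."""
    odds = predicted_odds(smoker, bmi)
    return odds / (1.0 + odds)

# Hypothetical subjects with the same BMI, differing only in smoking status.
smoker_odds = predicted_odds(1, 30.0)
nonsmoker_odds = predicted_odds(0, 30.0)

print(smoker_odds, nonsmoker_odds, predicted_probability(1, 30.0))
```

The ratio of the two predicted odds is e^1.208 whatever BMI is chosen, which is why the smoking parameter can be read as a single odds ratio.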

Proportional Hazards Regression

•Proportional Hazards Regression is used when the outcome is elapsed time to a non-recurring event

–i.e. the same basic scenario as for survival analysis

–We previously compared two groups for different survival rates using the Mantel-Cox test

–Computed hazard ratio between the two groups

Proportional Hazards extends Mantel-Cox test

•In Proportional Hazards regression, we estimate the effect of multiple factors on the hazard ratio

•Can be used to correct the hazard ratio for confounding variables

•Short et al. (2012) compared survival curves for two different treatments of COPD

–Computed a “crude” hazard ratio using Mantel-Cox, and then a hazard ratio corrected for covariates (confounding variables)
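
The slides do not write out the model, but the usual (Cox) form of proportional hazards regression may help connect it to the other two techniques. In the sketch below, h_0(t) is an unspecified baseline hazard and the remaining symbols follow the notation used for the linear and logistic models; treat it as the standard textbook form rather than the specific model fitted in the cited study.

```latex
h_i(t) = h_0(t)\,\exp\left(\beta_1 X_{i,1} + \dots + \beta_k X_{i,k}\right)
\qquad\Longrightarrow\qquad
\frac{h(t \mid X_j = x + 1)}{h(t \mid X_j = x)} = e^{\beta_j}
```

So, with the other variables held fixed, each parameter is the log of a hazard ratio, in the same way that each logistic regression parameter is the log of an odds ratio.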

Summary

•Multiple Linear Regression fits a dependent variable as a linear model of multiple independent variables

–Provides parameter estimates for each independent variable, along with confidence intervals and p-values

–The null hypothesis for the p-value is that the variable doesn't contribute to the model

–Used for finding the effect of a variable while correcting for confounding variables

•Logistic regression is used when the dependent variable is binary

–Models the log odds as a linear function of the independent variables

–Parameters are the increase in log odds per unit increase in the independent variable