Introduction to Meta-Analysis
Michael Borenstein, Larry V. Hedges, Julian P. T. Higgins, Hannah R. Rothstein
 Outlines the role of meta-analysis in the research process
 Shows how to compute effect sizes and treatment effects
 Explains the fixed-effect and random-effects models for synthesizing data
 Demonstrates how to assess and interpret variation in effect size across studies
 Clarifies concepts using text and figures, followed by formulas and examples
 Explains how to avoid common mistakes in meta-analysis
 Discusses controversies in meta-analysis
 Features a web site with additional material and exercises
A superb combination of lucid prose and informative graphics, written by four of the world’s leading experts on all aspects of meta-analysis. Borenstein, Hedges, Higgins, and Rothstein provide a refreshing departure from cookbook approaches with their clear explanations of the what and why of meta-analysis. The book is ideal as a course textbook or for self-study. My students, who used pre-publication versions of some of the chapters, raved about the clarity of the explanations and examples. David Rindskopf, Distinguished Professor of Educational Psychology, City University of New York, Graduate School and University Center, & Editor of the Journal of Educational and Behavioral Statistics.
The approach taken by Introduction to Meta-Analysis is intended to be primarily conceptual, and it is amazingly successful at achieving that goal. The reader can comfortably skip the formulas and still understand their application and underlying motivation. For the more statistically sophisticated reader, the relevant formulas and worked examples provide a superb practical guide to performing a meta-analysis. The book provides an eclectic mix of examples from education, social science, biomedical studies, and even ecology. For anyone considering leading a course in meta-analysis, or pursuing self-directed study, Introduction to Meta-Analysis would be a clear first choice. Jesse A. Berlin, ScD
Introduction to Meta-Analysis is an excellent resource for novices and experts alike. The book provides a clear and comprehensive presentation of all basic and most advanced approaches to meta-analysis. This book will be referenced for decades. Michael A. McDaniel, Professor of Human Resources and Organizational Behavior, Virginia Commonwealth University


Introduction to Meta-Analysis. Michael Borenstein, L. V. Hedges, J. P. T. Higgins and H. R. Rothstein. © 2009 John Wiley & Sons, Ltd. ISBN: 9780470057247

Introduction to Meta-Analysis

Michael Borenstein, Biostat, Inc, New Jersey, USA.
Larry V. Hedges, Northwestern University, Evanston, USA.
Julian P.T. Higgins, MRC, Cambridge, UK.
Hannah R. Rothstein, Baruch College, New York, USA.

A John Wiley and Sons, Ltd., Publication

This edition first published 2009
© 2009 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered.
It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloguing-in-Publication Data
Introduction to meta-analysis / Michael Borenstein . . . [et al.].
p. ; cm.
Includes bibliographical references and index.
ISBN 9780470057247 (cloth)
1. Meta-analysis. I. Borenstein, Michael.
[DNLM: 1. Meta-Analysis as Topic. WA 950 I614 2009].
R853.M48I58 2009
610.72-dc22
2008043732

A catalogue record for this book is available from the British Library.

ISBN: 9780470057247

Set in 10.5/13pt Times by Integra Software Services Pvt. Ltd, Pondicherry, India
Printed in the UK by TJ International, Padstow, Cornwall

Contents

List of Tables
List of Figures
Acknowledgements
Preface
Web site

PART 1: INTRODUCTION

1 HOW A META-ANALYSIS WORKS
  Introduction
  Individual studies
  The summary effect
  Heterogeneity of effect sizes
  Summary points

2 WHY PERFORM A META-ANALYSIS
  Introduction
  The streptokinase meta-analysis
  Statistical significance
  Clinical importance of the effect
  Consistency of effects
  Summary points

PART 2: EFFECT SIZE AND PRECISION

3 OVERVIEW
  Treatment effects and effect sizes
  Parameters and estimates
  Outline of effect size computations

4 EFFECT SIZES BASED ON MEANS
  Introduction
  Raw (unstandardized) mean difference D
  Standardized mean difference, d and g
  Response ratios
  Summary points

5 EFFECT SIZES BASED ON BINARY DATA (2 × 2 TABLES)
  Introduction
  Risk ratio
  Odds ratio
  Risk difference
  Choosing an effect size index
  Summary points

6 EFFECT SIZES BASED ON CORRELATIONS
  Introduction
  Computing r
  Other approaches
  Summary points

7 CONVERTING AMONG EFFECT SIZES
  Introduction
  Converting from the log odds ratio to d
  Converting from d to the log odds ratio
  Converting from r to d
  Converting from d to r
  Summary points

8 FACTORS THAT AFFECT PRECISION
  Introduction
  Factors that affect precision
  Sample size
  Study design
  Summary points

9 CONCLUDING REMARKS

PART 3: FIXED-EFFECT VERSUS RANDOM-EFFECTS MODELS

10 OVERVIEW
  Introduction
  Nomenclature

11 FIXED-EFFECT MODEL
  Introduction
  The true effect size
  Impact of sampling error
  Performing a fixed-effect meta-analysis
  Summary points

12 RANDOM-EFFECTS MODEL
  Introduction
  The true effect sizes
  Impact of sampling error
  Performing a random-effects meta-analysis
  Summary points

13 FIXED-EFFECT VERSUS RANDOM-EFFECTS MODELS
  Introduction
  Definition of a summary effect
  Estimating the summary effect
  Extreme effect size in a large study or a small study
  Confidence interval
  The null hypothesis
  Which model should we use?
  Model should not be based on the test for heterogeneity
  Concluding remarks
  Summary points

14 WORKED EXAMPLES (PART 1)
  Introduction
  Worked example for continuous data (Part 1)
  Worked example for binary data (Part 1)
  Worked example for correlational data (Part 1)
  Summary points

PART 4: HETEROGENEITY

15 OVERVIEW
  Introduction
  Nomenclature
  Worked examples

16 IDENTIFYING AND QUANTIFYING HETEROGENEITY
  Introduction
  Isolating the variation in true effects
  Computing Q
  Estimating τ²
  The I² statistic
  Comparing the measures of heterogeneity
  Confidence intervals for τ²
  Confidence intervals (or uncertainty intervals) for I²
  Summary points

17 PREDICTION INTERVALS
  Introduction
  Prediction intervals in primary studies
  Prediction intervals in meta-analysis
  Confidence intervals and prediction intervals
  Comparing the confidence interval with the prediction interval
  Summary points

18 WORKED EXAMPLES (PART 2)
  Introduction
  Worked example for continuous data (Part 2)
  Worked example for binary data (Part 2)
  Worked example for correlational data (Part 2)
  Summary points

19 SUBGROUP ANALYSES
  Introduction
  Fixed-effect model within subgroups
  Computational models
  Random effects with separate estimates of τ²
  Random effects with pooled estimate of τ²
  The proportion of variance explained
  Mixed-effects model
  Obtaining an overall effect in the presence of subgroups
  Summary points

20 META-REGRESSION
  Introduction
  Fixed-effect model
  Fixed or random effects for unexplained heterogeneity
  Random-effects model
  Summary points

21 NOTES ON SUBGROUP ANALYSES AND META-REGRESSION
  Introduction
  Computational model
  Multiple comparisons
  Software
  Analyses of subgroups and regression analyses are observational
  Statistical power for subgroup analyses and meta-regression
  Summary points

PART 5: COMPLEX DATA STRUCTURES

22 OVERVIEW

23 INDEPENDENT SUBGROUPS WITHIN A STUDY
  Introduction
  Combining across subgroups
  Comparing subgroups
  Summary points

24 MULTIPLE OUTCOMES OR TIME-POINTS WITHIN A STUDY
  Introduction
  Combining across outcomes or time-points
  Comparing outcomes or time-points within a study
  Summary points

25 MULTIPLE COMPARISONS WITHIN A STUDY
  Introduction
  Combining across multiple comparisons within a study
  Differences between treatments
  Summary points

26 NOTES ON COMPLEX DATA STRUCTURES
  Introduction
  Summary effect
  Differences in effect

PART 6: OTHER ISSUES

27 OVERVIEW

28 VOTE COUNTING – A NEW NAME FOR AN OLD PROBLEM
  Introduction
  Why vote counting is wrong
  Vote counting is a pervasive problem
  Summary points

29 POWER ANALYSIS FOR META-ANALYSIS
  Introduction
  A conceptual approach
  In context
  When to use power analysis
  Planning for precision rather than for power
  Power analysis in primary studies
  Power analysis for meta-analysis
  Power analysis for a test of homogeneity
  Summary points

30 PUBLICATION BIAS
  Introduction
  The problem of missing studies
  Methods for addressing bias
  Illustrative example
  The model
  Getting a sense of the data
  Is there evidence of any bias?
  Is the entire effect an artifact of bias?
  How much of an impact might the bias have?
  Summary of the findings for the illustrative example
  Some important caveats
  Small-study effects
  Concluding remarks
  Summary points

PART 7: ISSUES RELATED TO EFFECT SIZE

31 OVERVIEW

32 EFFECT SIZES RATHER THAN p-VALUES
  Introduction
  Relationship between p-values and effect sizes
  The distinction is important
  The p-value is often misinterpreted
  Narrative reviews vs. meta-analyses
  Summary points

33 SIMPSON’S PARADOX
  Introduction
  Circumcision and risk of HIV infection
  An example of the paradox
  Summary points

34 GENERALITY OF THE BASIC INVERSE-VARIANCE METHOD
  Introduction
  Other effect sizes
  Other methods for estimating effect sizes
  Individual participant data meta-analyses
  Bayesian approaches
  Summary points

PART 8: FURTHER METHODS

35 OVERVIEW

36 META-ANALYSIS METHODS BASED ON DIRECTION AND p-VALUES
  Introduction
  Vote counting
  The sign test
  Combining p-values
  Summary points

37 FURTHER METHODS FOR DICHOTOMOUS DATA
  Introduction
  Mantel-Haenszel method
  One-step (Peto) formula for odds ratio
  Summary points

38 PSYCHOMETRIC META-ANALYSIS
  Introduction
  The attenuating effects of artifacts
  Meta-analysis methods
  Example of psychometric meta-analysis
  Comparison of artifact correction with meta-regression
  Sources of information about artifact values
  How heterogeneity is assessed
  Reporting in psychometric meta-analysis
  Concluding remarks
  Summary points

PART 9: META-ANALYSIS IN CONTEXT

39 OVERVIEW

40 WHEN DOES IT MAKE SENSE TO PERFORM A META-ANALYSIS?
  Introduction
  Are the studies similar enough to combine?
  Can I combine studies with different designs?
  How many studies are enough to carry out a meta-analysis?
  Summary points

41 REPORTING THE RESULTS OF A META-ANALYSIS
  Introduction
  The computational model
  Forest plots
  Sensitivity analysis
  Summary points

42 CUMULATIVE META-ANALYSIS
  Introduction
  Why perform a cumulative meta-analysis?
  Summary points

43 CRITICISMS OF META-ANALYSIS
  Introduction
  One number cannot summarize a research field
  The file drawer problem invalidates meta-analysis
  Mixing apples and oranges
  Garbage in, garbage out
  Important studies are ignored
  Meta-analysis can disagree with randomized trials
  Meta-analyses are performed poorly
  Is a narrative review better?
  Concluding remarks
  Summary points

PART 10: RESOURCES AND SOFTWARE

44 SOFTWARE
  Introduction
  The software
  Three examples of meta-analysis software
  Comprehensive Meta-Analysis (CMA) 2.0
  RevMan 5.0
  Stata macros with Stata 10.0
  Summary points

45 BOOKS, WEB SITES AND PROFESSIONAL ORGANIZATIONS
  Books on systematic review methods
  Books on meta-analysis
  Web sites

REFERENCES

INDEX

List of Tables

Table 3.1 Roadmap of formulas in subsequent chapters
Table 5.1 Nomenclature for 2 × 2 table of outcome by treatment
Table 5.2 Fictional data for a 2 × 2 table
Table 8.1 Impact of sample size on variance
Table 8.2 Impact of study design on variance
Table 14.1 Dataset 1 – Part A (basic data)
Table 14.2 Dataset 1 – Part B (fixed-effect computations)
Table 14.3 Dataset 1 – Part C (random-effects computations)
Table 14.4 Dataset 2 – Part A (basic data)
Table 14.5 Dataset 2 – Part B (fixed-effect computations)
Table 14.6 Dataset 2 – Part C (random-effects computations)
Table 14.7 Dataset 3 – Part A (basic data)
Table 14.8 Dataset 3 – Part B (fixed-effect computations)
Table 14.9 Dataset 3 – Part C (random-effects computations)
Table 16.1 Factors affecting measures of dispersion
Table 18.1 Dataset 1 – Part D (intermediate computations)
Table 18.2 Dataset 1 – Part E (variance computations)
Table 18.3 Dataset 2 – Part D (intermediate computations)
Table 18.4 Dataset 2 – Part E (variance computations)
Table 18.5 Dataset 3 – Part D (intermediate computations)
Table 18.6 Dataset 3 – Part E (variance computations)
Table 19.1 Fixed-effect model – computations
Table 19.2 Fixed-effect model – summary statistics
Table 19.3 Fixed-effect model – ANOVA table
Table 19.4 Fixed-effect model – subgroups as studies
Table 19.5 Random-effects model (separate estimates of τ²) – computations
Table 19.6 Random-effects model (separate estimates of τ²) – summary statistics
Table 19.7 Random-effects model (separate estimates of τ²) – ANOVA table
Table 19.8 Random-effects model (separate estimates of τ²) – subgroups as studies
Table 19.9 Statistics for computing a pooled estimate of τ²
Table 19.10 Random-effects model (pooled estimate of τ²) – computations
Table 19.11 Random-effects model (pooled estimate of τ²) – summary statistics
Table 19.12 Random-effects model (pooled estimate of τ²) – ANOVA table
Table 19.13 Random-effects model (pooled estimate of τ²) – subgroups as studies
Table 20.1 The BCG dataset
Table 20.2 Fixed-effect model – regression results for BCG
Table 20.3 Fixed-effect model – ANOVA table for BCG regression
Table 20.4 Random-effects model – regression results for BCG
Table 20.5 Random-effects model – test of the model
Table 20.6 Random-effects model – comparison of model (latitude) versus the null model
Table 23.1 Independent subgroups – five fictional studies
Table 23.2 Independent subgroups – summary effect
Table 23.3 Independent subgroups – synthetic effect for study 1
Table 23.4 Independent subgroups – summary effect across studies
Table 24.1 Multiple outcomes – five fictional studies
Table 24.2 Creating a synthetic variable as the mean of two outcomes
Table 24.3 Multiple outcomes – summary effect
Table 24.4 Multiple outcomes – impact of correlation on variance of summary effect
Table 24.5 Creating a synthetic variable as the difference between two outcomes
Table 24.6 Multiple outcomes – difference between outcomes
Table 24.7 Multiple outcomes – impact of correlation on the variance of difference
Table 33.1 HIV as function of circumcision (by subgroup)
Table 33.2 HIV as function of circumcision – by study
Table 33.3 HIV as a function of circumcision – full population
Table 33.4 HIV as a function of circumcision – by risk group
Table 33.5 HIV as a function of circumcision/risk group – full population
Table 34.1 Simple example of a genetic association study
Table 36.1 Streptokinase data – calculations for meta-analyses of p-values
Table 37.1 Nomenclature for 2 × 2 table of events by treatment
Table 37.2 Mantel-Haenszel – odds ratio
Table 37.3 Mantel-Haenszel – variance of summary effect
Table 37.4 One-step – odds ratio and variance
Table 38.1 Fictional data for psychometric meta-analysis
Table 38.2 Observed (attenuated) correlations
Table 38.3 Unattenuated correlations

List of Figures

Figure 1.1 High-dose versus standard-dose of statins (adapted from Cannon et al., 2006)
Figure 2.1 Impact of streptokinase on mortality (adapted from Lau et al., 1992)
Figure 4.1 Response ratios are analyzed in log units
Figure 5.1 Risk ratios are analyzed in log units
Figure 5.2 Odds ratios are analyzed in log units
Figure 6.1 Correlations are analyzed in Fisher’s z units
Figure 7.1 Converting among effect sizes
Figure 8.1 Impact of sample size on variance
Figure 8.2 Impact of study design on variance
Figure 10.1 Symbols for true and observed effects
Figure 11.1 Fixed-effect model – true effects
Figure 11.2 Fixed-effect model – true effects and sampling error
Figure 11.3 Fixed-effect model – distribution of sampling error
Figure 12.1 Random-effects model – distribution of true effects
Figure 12.2 Random-effects model – true effects
Figure 12.3 Random-effects model – true and observed effect in one study
Figure 12.4 Random-effects model – between-study and within-study variance
Figure 13.1 Fixed-effect model – forest plot showing relative weights
Figure 13.2 Random-effects model – forest plot showing relative weights
Figure 13.3 Very large studies under fixed-effect model
Figure 13.4 Very large studies under random-effects model
Figure 14.1 Forest plot of Dataset 1 – fixed-effect weights
Figure 14.2 Forest plot of Dataset 1 – random-effects weights
Figure 14.3 Forest plot of Dataset 2 – fixed-effect weights
Figure 14.4 Forest plot of Dataset 2 – random-effects weights
Figure 14.5 Forest plot of Dataset 3 – fixed-effect weights
Figure 14.6 Forest plot of Dataset 3 – random-effects weights
Figure 16.1 Dispersion across studies relative to error within studies
Figure 16.2 Q in relation to df as measure of dispersion
Figure 16.3 Flowchart showing how T² and I² are derived from Q and df
Figure 16.4 Impact of Q and number of studies on the p-value
Figure 16.5 Impact of excess dispersion and absolute dispersion on T²
Figure 16.6 Impact of excess and absolute dispersion on T
Figure 16.7 Impact of excess dispersion on I²
Figure 16.8 Factors affecting T² but not I²
Figure 16.9 Factors affecting I² but not T²
Figure 17.1 Prediction interval based on population parameters μ and τ²
Figure 17.2 Prediction interval based on sample estimates M* and T²
Figure 17.3 Simultaneous display of confidence interval and prediction interval
Figure 17.4 Impact of number of studies on confidence interval and prediction interval
Figure 18.1 Forest plot of Dataset 1 – random-effects weights with prediction interval
Figure 18.2 Forest plot of Dataset 2 – random-effects weights with prediction interval
Figure 18.3 Forest plot of Dataset 3 – random-effects weights with prediction interval
Figure 19.1 Fixed-effect model – studies and subgroup effects
Figure 19.2 Fixed-effect – subgroup effects
Figure 19.3 Fixed-effect model – treating subgroups as studies
Figure 19.4 Flowchart for selecting a computational model
Figure 19.5 Random-effects model (separate estimates of τ²) – studies and subgroup effects
Figure 19.6 Random-effects model (separate estimates of τ²) – subgroup effects
Figure 19.7 Random-effects model (separate estimates of τ²) – treating subgroups as studies
Figure 19.8 Random-effects model (pooled estimate of τ²) – studies and subgroup effects
Figure 19.9 Random-effects model (pooled estimate of τ²) – subgroup effects
Figure 19.10 Random-effects model (pooled estimate of τ²) – treating subgroups as studies
Figure 19.11 A primary study showing subjects within groups
Figure 19.12 Random-effects model – variance within and between subgroups
Figure 19.13 Proportion of variance explained by subgroup membership
Figure 20.1 Fixed-effect model – forest plot for the BCG data
Figure 20.2 Fixed-effect model – regression of log risk ratio on latitude
Figure 20.3 Fixed-effect model – population effects as function of covariate
Figure 20.4 Random-effects model – population effects as a function of covariate
Figure 20.5 Random-effects model – forest plot for the BCG data
Figure 20.6 Random-effects model – regression of log risk ratio on latitude
Figure 20.7 Between-studies variance (T²) with no covariate
Figure 20.8 Between-studies variance (T²) with covariate
Figure 20.9 Proportion of variance explained by latitude
Figure 23.1 Creating a synthetic variable from independent subgroups
Figure 28.1 The p-value for each study is > 0.20 but the p-value for the summary effect is < 0.02
Figure 29.1 Power for a primary study as a function of n and δ
Figure 29.2 Power for a meta-analysis as a function of number studies and δ
Figure 29.3 Power for a meta-analysis as a function of number studies and heterogeneity
Figure 30.1 Passive smoking and lung cancer – forest plot
Figure 30.2 Passive smoking and lung cancer – funnel plot
Figure 30.3 Passive smoking and lung cancer – funnel plot with imputed studies
Figure 30.4 Passive smoking and lung cancer – cumulative forest plot
Figure 32.1 Estimating the effect size versus testing the null hypothesis
Figure 32.2 The p-value is a poor surrogate for effect size
Figure 32.3 Studies where p-values differ but effect size is the same
Figure 32.4 Studies where p-values are the same but effect sizes differ
Figure 32.5 Studies where the more significant p-value corresponds to weaker effect size
Figure 33.1 HIV as function of circumcision – by study
Figure 33.2 HIV as function of circumcision – in three sets of studies
Figure 36.1 Effect size in four fictional studies
Figure 41.1 Forest plot using lines to represent the effect size
Figure 41.2 Forest plot using boxes to represent the effect size and relative weight
Figure 42.1 Impact of streptokinase on mortality – forest plot
Figure 42.2 Impact of streptokinase on mortality – cumulative forest plot
Figure 43.1 Forest plot of five fictional studies and a new trial (consistent effects)
Figure 43.2 Forest plot of five fictional studies and a new trial (heterogeneous effects)
Figure 44.1 CMA – data entry screen for 2 × 2 tables
Figure 44.2 CMA – analysis screen
Figure 44.3 CMA – high resolution forest plot
Figure 44.4 RevMan – data entry screen for 2 × 2 tables
Figure 44.5 RevMan – analysis screen
Figure 44.6 Stata macros – data entry screen for 2 × 2 tables
Figure 44.7 Stata macros – analysis screen
Figure 44.8 Stata macros – high resolution forest plot

Acknowledgements

This book was funded by the following grants from the National Institutes of Health: Combining data types in meta-analysis (AG021360), Publication bias in meta-analysis (AG20052), Software for meta-regression (AG024771), from the National Institute on Aging, under the direction of Dr. Sidney Stahl; and Forest plots for meta-analysis (DA019280), from the National Institute on Drug Abuse, under the direction of Dr. Thomas Hilton. These grants allowed us to convene a series of workshops on meta-analysis, and parts of this volume reflect ideas developed as part of these workshops.

We would like to acknowledge and thank Doug Altman, Betsy Becker, Jesse Berlin, Michael Brannick, Harris Cooper, Kay Dickersin, Sue Duval, Roger Harbord, Despina Contopoulos-Ioannidis, John Ioannidis, Spyros Konstantopoulos, Mark Lipsey, Mike McDaniel, Ingram Olkin, Fred Oswald, Terri Pigott, Simcha Pollack, David Rindskopf, Stephen Senn, Will Shadish, Jonathan Sterne, Alex Sutton, Thomas Trikalinos, Jeff Valentine, Jack Vevea, Vish Viswesvaran, and David Wilson.

Steven Tarlow helped to edit this book and to ensure the accuracy of all formulas and examples.

As always, the people at Wiley made this endeavor a pleasure. We want to acknowledge and thank our editor Kathryn Sharples, and also Graham Woodward, Susan Barclay, Beth Dufour, Heather Kay, and Sunita Jayachandran.

Vivian Vargas and Shirley Rudolph at Biostat, and Patricia Ferguson at Northwestern University provided invaluable administrative assistance.

Preface

In his best-selling book Baby and Child Care, Dr. Benjamin Spock wrote ‘I think it is preferable to accustom a baby to sleeping on his stomach from the beginning if he is willing’. This statement was included in most editions of the book, and in most of the 50 million copies sold from the 1950s into the 1990s.
The advice was not unusual, in that many pediatricians made similar recommendations at the time.

During this same period, from the 1950s into the 1990s, more than 100,000 babies died of sudden infant death syndrome (SIDS), also called crib death in the United States and cot death in the United Kingdom, where a seemingly healthy baby goes to sleep and never wakes up.

In the early 1990s, researchers became aware that the risk of SIDS decreased by at least 50% when babies were put to sleep on their backs rather than face down. Governments in various countries launched educational initiatives such as the Back to sleep campaigns in the UK and the US, which led to an immediate and dramatic drop in the number of SIDS deaths.

While the loss of more than 100,000 children would be unspeakably sad in any event, the real tragedy lies in the fact that many of these deaths could have been prevented. Gilbert et al. (2005) write ‘Advice to put infants to sleep on the front for nearly half a century was contrary to evidence available from 1970 that this was likely to be harmful. Systematic review of preventable risk factors for SIDS from 1970 would have led to earlier recognition of the risks of sleeping on the front and might have prevented over 10,000 infant deaths in the UK and at least 50,000 in Europe, the USA and Australasia.’

AN ETHICAL IMPERATIVE

This example is one of several cited by Sir Iain Chalmers in a talk entitled The scandalous failure of scientists to cumulate scientifically (Chalmers, 2006). The theme of this talk was that we live in a world where the utility of almost any intervention will be tested repeatedly, and that rather than looking at any study in isolation, we need to look at the body of evidence. While not all systematic reviews carry the urgency of SIDS, the logic of looking at the body of evidence, rather than trying to understand studies in isolation, is always compelling.
Meta-analysis refers to the statistical synthesis of results from a series of studies. While the statistical procedures used in a meta-analysis can be applied to any set of data, the synthesis will be meaningful only if the studies have been collected systematically. This could be in the context of a systematic review, the process of systematically locating, appraising, and then synthesizing data from a large number of sources. Or, it could be in the context of synthesizing data from a select group of studies, such as those conducted by a pharmaceutical company to assess the efficacy of a new drug.

If a treatment effect (or effect size) is consistent across the series of studies, these procedures enable us to report that the effect is robust across the kinds of populations sampled, and also to estimate the magnitude of the effect more precisely than we could with any of the studies alone. If the treatment effect varies across the series of studies, these procedures enable us to report on the range of effects, and may enable us to identify factors associated with the magnitude of the effect size.

FROM NARRATIVE REVIEWS TO SYSTEMATIC REVIEWS

Prior to the 1990s, the task of combining data from multiple studies had been primarily the purview of the narrative review. An expert in a given field would read the studies that addressed a question, summarize the findings, and then arrive at a conclusion – for example, that the treatment in question was, or was not, effective. However, this approach suffers from some important limitations.

One limitation is the subjectivity inherent in this approach, coupled with the lack of transparency. For example, different reviewers might use different criteria for deciding which studies to include in the review. Once a set of studies has been selected, one reviewer might give more credence to larger studies, while another gives more credence to ‘quality’ studies and yet another assigns a comparable weight to all studies.
One reviewer may require a substantial body of evidence before concluding that a treatment is effective, while another uses a lower threshold. In fact, there are examples in the literature where two narrative reviews come to opposite conclusions, with one reporting that a treatment is effective while the other reports that it is not. As a rule, the narrative reviewer will not articulate (and may not even be fully aware of) the decision-making process used to synthesize the data and arrive at a conclusion.

A second limitation of narrative reviews is that they become less useful as more information becomes available. The thought process required for a synthesis requires the reviewer to capture the finding reported in each study, to assign an appropriate weight to that finding, and then to synthesize these findings across all studies in the synthesis. While a reviewer may be able to synthesize data from a few studies in their head, the process becomes difficult and eventually untenable as the number of studies increases. This is true even when the treatment effect (or effect size) is consistent from study to study. Often, however, the treatment effect will vary as a function of study-level covariates, such as the patient population, the dose of medication, the outcome variable, and other factors. In these cases, a proper synthesis requires that the researcher be able to understand how the treatment effect varies as a function of these variables, and the narrative review is poorly equipped to address these kinds of issues.

THE SYSTEMATIC REVIEW AND META-ANALYSIS

For these reasons, beginning in the mid-1980s and taking root in the 1990s, researchers in many fields have been moving away from the narrative review, and adopting systematic reviews and meta-analysis. For systematic reviews, a clear set of rules is used to search for studies, and then to determine which studies will be included in or excluded from the analysis.
Since there is an element of subjectivity in setting these criteria, as well as in the conclusions drawn from the meta-analysis, we cannot say that the systematic review is entirely objective. However, because all of the decisions are specified clearly, the mechanisms are transparent.

A key element in most systematic reviews is the statistical synthesis of the data, or the meta-analysis. Unlike the narrative review, where reviewers implicitly assign some level of importance to each study, in meta-analysis the weights assigned to each study are based on mathematical criteria that are specified in advance. While the reviewers and readers may still differ on the substantive meaning of the results (as they might for a primary study), the statistical analysis provides a transparent, objective, and replicable framework for this discussion.

The formulas used in meta-analysis are extensions of formulas used in primary studies, and are used to address similar kinds of questions to those addressed in primary studies. In primary studies we would typically report a mean and standard deviation for the subjects. If appropriate, we might also use analysis of variance or multiple regression to determine if (and how) subject scores were related to various factors. Similarly, in a meta-analysis, we might report a mean and standard deviation for the treatment effect. And, if appropriate, we would also use procedures analogous to analysis of variance or multiple regression to assess the relationship between the effect and study-level covariates.

Meta-analyses are conducted for a variety of reasons, not only to synthesize evidence on the effects of interventions or to support evidence-based policy or practice. The purpose of the meta-analysis, or more generally the purpose of any research synthesis, has implications for when it should be performed, what model should be used to analyze the data, what sensitivity analyses should be undertaken, and how the results should be interpreted.
Losing sight of the fact that meta-analysis is a tool with multiple applications causes confusion and leads to pointless discussions about what is the right way to perform a research synthesis, when there is no single right way. It all depends on the purpose of the synthesis, and the data that are available. Much of this book will expand on this idea.

META-ANALYSIS IS USED IN MANY FIELDS OF RESEARCH

In medicine, systematic reviews and meta-analysis form the core of a movement to ensure that medical treatments are based on the best available empirical data. For example, The Cochrane Collaboration has published the results of over 3700 meta-analyses (as of January 2009) which synthesize data on treatments in all areas of health care including headaches, cancer, allergies, cardiovascular disease, pain prevention, and depression. The reviews look at interventions relevant to neonatal care, childbirth, infant and childhood diseases, as well as diseases common in adolescents, adults, and the elderly. The kinds of interventions assessed include surgery, drugs, acupuncture, and social interventions. BMJ publishes a series of journals on Evidence Based Medicine, built on the results from systematic reviews. Systematic reviews and meta-analyses are also used to examine the performance of diagnostic tests, and of epidemiological associations between exposure and disease prevalence, among other topics.

Pharmaceutical companies usually conduct a series of studies to assess the efficacy of a drug. They use meta-analysis to synthesize the data from these studies, yielding a more powerful test (and more precise estimate) of the drug’s effect. Additionally, the meta-analysis provides a framework for evaluating the series of studies as a whole, rather than looking at each in isolation. These analyses play a role in internal research, in submissions to governmental agencies, and in marketing.
Meta-analyses are also used to synthesize data on adverse events, since these events are typically rare and we need to accumulate information over a series of studies to properly assess the risk of these events.

In the field of education, meta-analysis has been applied to topics as diverse as the comparison of distance education with traditional classroom learning, assessment of the impact of schooling on developing economies, and the relationship between teacher credentials and student achievement. Results of these and similar meta-analyses have influenced practice and policy in various locations around the world.

In psychology, meta-analysis has been applied to basic science as well as in support of evidence-based practice. It has been used to assess personality change over the life span, to assess the influence of media violence on aggressive behavior, and to examine gender differences in mathematics ability, leadership, and nonverbal communication. Meta-analyses of psychological interventions have been used to compare and select treatments for psychological problems, including obsessive-compulsive disorder, impulsivity disorder, bulimia nervosa, depression, phobias, and panic disorder.

In the field of criminology, government agencies have funded meta-analyses to examine the relative effectiveness of various programs in reducing criminal behavior. These include initiatives to prevent delinquency, to reduce recidivism, to assess the effectiveness of different strategies for police patrols, and to evaluate the use of special courts to deal with drug-related crimes.

In business, meta-analyses of the predictive validity of tests that are used as part of the hiring process have led to changes in the types of tests that are used to select employees in many organizations. Meta-analytic results have also been used to guide practices for the reduction of absenteeism, turnover, and counterproductive behavior, and to assess the effectiveness of programs used to train employees.
In the field of ecology, meta-analyses are being used to identify the environmental impact of wind farms, biotic resistance to exotic plant invasion, the effects of changes in the marine food chain, plant reactions to global climate change, the effectiveness of conservation management interventions, and to guide conservation efforts.

META-ANALYSIS AS PART OF THE RESEARCH PROCESS

Systematic reviews and meta-analyses are used to synthesize the available evidence for a given question to inform policy, as in the examples cited above from medicine, social science, business, ecology, and other fields. While this is probably the most common use of the methodology, meta-analysis can also play an important role in other parts of the research process.

Systematic reviews and meta-analyses can play a role in designing new research. As a first step, they can help determine whether the planned study is necessary. It may be possible to find the required information by synthesizing data from prior studies, and in this case, the research should not be performed. Iain Chalmers (2007) made this point in an article entitled ‘The lethal consequences of failing to make use of all relevant evidence about the effects of medical treatments: the need for systematic reviews’.

In the event that the new study is needed, the meta-analysis may be useful in helping to design that study. For example, the meta-analysis may show that in the prior studies one outcome index had proven to be more sensitive than others, or that a specific mode of administration had proven to be more effective than others, and should be used in the planned study as well.

For these reasons, various government agencies, including institutes of health in various countries, have been encouraging (or requiring) researchers to conduct a meta-analysis of existing research prior to undertaking new funded studies.

The systematic review can also play a role in the publication of any new primary study.
In the introductory section of the publication, a systematic review can help to place the new study in context by describing what we knew before, and what we hoped to learn from the new study. In the discussion section of the publication, a systematic review allows us to address not only the information provided by the new study, but the body of evidence as enhanced by the new study. Iain Chalmers and Michael Clarke (1998) see this approach as a way to avoid studies being reported without context, which they refer to as ‘Islands in Search of Continents’. Systematic reviews would provide this context in a more rigorous and transparent manner than the narrative reviews that are typically used for this purpose.

THE INTENDED AUDIENCE FOR THIS BOOK

Since meta-analysis is a relatively new field, many people, including those who actually use meta-analysis in their work, have not had the opportunity to learn about it systematically. We hope that this volume will provide a framework that allows them to understand the logic of meta-analysis, as well as how to apply and interpret meta-analytic procedures properly.

This book is aimed at researchers, clinicians, and statisticians. Our approach is primarily conceptual. The reader will be able to skip the formulas and still understand, for example, the differences between fixed-effect and random-effects analysis, and the mechanisms used to assess the dispersion in effects from study to study. However, for those with a statistical orientation, we include all the relevant formulas, along with worked examples. Additionally, the spreadsheets and data files can be downloaded from the web at www.MetaAnalysis.com.

This book can be used as the basis for a course in meta-analysis. Supplementary materials and exercises are posted on the book’s web site.

This volume is intended for readers from various substantive fields, including medicine, epidemiology, social science, business, ecology, and others.
While we have included examples from many of these disciplines, the more important message is that meta-analytic methods that may have developed in any one of these fields have application to all of them.

Since our goal in using these examples is to explain the meta-analysis itself rather than to address the substantive issues, we provide only the information needed for this purpose. For example, we may present an analysis showing that a treatment reduces pain, while ignoring other analyses that show the same treatment increases the risk of adverse events. Therefore, any reader interested in the substantive issues addressed in an example should not rely on this book for that purpose.

AN OUTLINE OF THIS BOOK’S CONTENTS

Part 1 is an introduction to meta-analysis. We present a completed meta-analysis to serve as an example, and highlight the elements of this analysis – the effect size for each study, the summary effect, the dispersion of effects across studies, and so on. Our intent is to show where each element fits into the analysis, and thus provide the reader with a context as they move on to the subsequent parts of the book where each of the elements is explored in detail.

Part 2 introduces the effect sizes, such as the standardized mean difference or the risk ratio, that are computed for each study, and that serve as the unit of currency in the meta-analysis. We also discuss factors that determine the variance of an effect size and show how to compute the variance for each study, since this affects the weight assigned to that study in the meta-analysis.

Part 3 discusses the two computational models used in the vast majority of meta-analyses, the fixed-effect model and the random-effects model. We discuss the conceptual and practical differences between the two, and show how to compute a summary effect using either one.

Part 4 focuses on the issue of dispersion in effect sizes, the fact that the effect size varies from one study to the next.
We discuss methods to quantify the heterogeneity, to test it, to incorporate it in the weighting scheme, and to understand it in a substantive as well as a statistical context. Then, we discuss methods to explain the heterogeneity. These include subgroup analyses to compare the effect in different subgroups of studies (analogous to analysis of variance in primary studies), and meta-regression (analogous to multiple regression).

Part 5 shows how to work with complex data structures. These include studies that report an effect size for two or more independent subgroups, for two or more outcomes or timepoints, and for two or more comparison groups (such as two treatments being compared with the same control).

Part 6 is used to address three separate issues. One chapter discusses the procedure called vote counting, common in narrative reviews, and explains the problems with this approach. One chapter discusses statistical power for a meta-analysis. We show how meta-analysis often (but not always) yields a more powerful test of the null than do any of the included studies. Another chapter addresses the question of publication bias. We explain what this is, and discuss methods that have been developed to assess its potential impact.

Part 7 focuses on the issue of why we work with effect sizes in a meta-analysis. In one chapter we explain why we work with effect sizes rather than p-values. In another we explain why we compute an effect size for each study, rather than summing data over all studies and then computing an effect size for the summed data. The final chapter in this part shows how the use of inverse-variance weights can be extended to other applications including Bayesian meta-analysis and analyses based on individual participant data.

Part 8 includes chapters on methods that are sometimes used in meta-analysis but that fall outside the central narrative of this volume.
These include meta-analyses based on p-values, alternate approaches (such as the Mantel-Haenszel method) for assigning study weights, and options sometimes used in psychometric meta-analyses.

Part 9 is dedicated to a series of general issues related to meta-analysis. We address the question of when it makes sense to perform a meta-analysis. This Part is also the location for a series of chapters on separate issues such as reporting the results of a meta-analysis, and the proper use of cumulative meta-analysis. Finally, we discuss some of the criticisms of meta-analysis and try to put them in context.

Part 10 is a discussion of resources for meta-analysis and systematic reviews. This includes an overview of several computer programs for meta-analysis. It also includes a discussion of organizations that promote the use of systematic reviews and meta-analyses in specific fields, and a list of useful web sites.

WHAT THIS BOOK DOES NOT COVER

Other elements of a systematic review

This book deals only with meta-analysis, the statistical formulas and methods used to synthesize data from a set of studies. A meta-analysis can be applied to any data, but if the goal of the analysis is to provide a synthesis of a body of data from various sources, then it is usually imperative that the data be compiled as part of a systematic review.

A systematic review incorporates many components, such as specification of the question to be addressed, determination of methods to be used for searching the literature and for including or excluding studies, specification of mechanisms to appraise the validity of the included studies, specification of methods to be used for performing the statistical analysis, and a mechanism for disseminating the results.
If the entire review is performed properly, so that the search strategy matches the research question, and yields a reasonably complete and unbiased collection of the relevant studies, then (providing that the included studies are themselves valid) the meta-analysis will also be addressing the intended question. On the other hand, if the search strategy is flawed in concept or execution, or if the studies are providing biased results, then problems exist in the review that the meta-analysis cannot correct.

In Part 10 we include an annotated listing of suggested readings for the other components in the systematic review, but these components are not otherwise addressed in this volume.

Other meta-analytic methods

In this volume we focus primarily on meta-analyses of effect sizes. That is, analyses where each study yields an estimate of some statistic (a standardized mean difference, a risk ratio, a prevalence, and so on) and our goal is to assess the dispersion in these effects and (if appropriate) compute a summary effect. The vast majority of meta-analyses performed use this approach. We deal only briefly (see Part 8) with other approaches, such as meta-analyses that combine p-values rather than effect sizes. We do not address meta-analysis of diagnostic tests.

Further Reading

Chalmers, I. (2007). The lethal consequences of failing to make use of all relevant evidence about the effects of medical treatments: the need for systematic reviews. In P. Rothwell (ed.), Treating Individuals. London: Lancet: 37–58.
Chalmers, I., Hedges, L.V. & Cooper, H. (2002). A brief history of research synthesis. Evaluation in the Health Professions. 25(1): 12–37.
Clarke, M., Hopewell, S. & Chalmers, I. (2007). Reports of clinical trials should begin and end with up-to-date systematic reviews of other relevant evidence: a status report. Journal of the Royal Society of Medicine. 100: 187–190.
Hunt, M. (1999). How Science Takes Stock: The Story of Meta-analysis. New York: Russell Sage Foundation.
Sutton, A.J. & Higgins, J.P.T. (2008). Recent developments in meta-analysis. Statistics in Medicine 27: 625–650.

Web Site

The web site for this book is www.MetaAnalysis.com. There, you will find easy access to
All of the datasets used in this book
All computations from this book as Excel spreadsheets
Additional formulas for computing effect sizes
Any corrections to this book
Links to other meta-analysis sites
A free trial of Comprehensive Meta Analysis

For those planning to use this book as a text, there are also worked examples and exercises. Please send any questions or comments to [email protected]

PART 1 Introduction

Introduction to Meta-Analysis. Michael Borenstein, L. V. Hedges, J. P. T. Higgins and H. R. Rothstein © 2009 John Wiley & Sons, Ltd. ISBN: 9780470057247

CHAPTER 1 How a Meta-Analysis Works

Introduction
Individual studies
The summary effect
Heterogeneity of effect sizes

INTRODUCTION

Figure 1.1 illustrates a meta-analysis that shows the impact of high dose versus standard dose of statins in preventing death and myocardial infarction (MI). This analysis is adapted from one reported by Cannon et al. and published in the Journal of the American College of Cardiology (2006). Our goal in presenting this here is to introduce the various elements in a meta-analysis (the effect size for each study, the weight assigned to each effect size, the estimate of the summary effect, and so on) and show where each fits into the larger scheme. In the chapters that follow, each of these elements will be explored in detail.

INDIVIDUAL STUDIES

The first four rows on this plot represent the four studies. For each, the study name is shown at left, followed by the effect size, the relative weight assigned to the study for computing the summary effect, and the p-value. The effect size and weight are also shown schematically.
Effect size

The effect size, a value which reflects the magnitude of the treatment effect or (more generally) the strength of a relationship between two variables, is the unit of currency in a meta-analysis. We compute the effect size for each study, and then work with the effect sizes to assess the consistency of the effect across studies and to compute a summary effect.

Figure 1.1 High-dose versus standard-dose of statins (adapted from Cannon et al., 2006).

The effect size could represent the impact of an intervention, such as the impact of medical treatment on risk of infection, the impact of a teaching method on test scores, or the impact of a new protocol on the number of salmon successfully returning upstream. The effect size is not limited to the impact of interventions, but could represent any relationship between two variables, such as the difference in test scores for males versus females, the difference in cancer rates for persons exposed or not exposed to secondhand smoke, or the difference in cardiac events for persons with two distinct personality types. In fact, what we generally call an effect size could refer simply to the estimate of a single value, such as the prevalence of Lyme disease.

In this example the effect size is the risk ratio. A risk ratio of 1.0 would mean that the risk of death or MI was the same in both groups, while a risk ratio less than 1.0 would mean that the risk was lower in the high-dose group, and a risk ratio greater than 1.0 would mean that the risk was lower in the standard-dose group.

The effect size for each study is represented by a square, with the location of the square representing both the direction and magnitude of the effect. Here, the effect size for each study falls to the left of center (indicating a benefit for the high-dose group).
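To make the reading of the risk ratio concrete, here is a minimal sketch of how a risk ratio could be computed from event counts and interpreted by direction. The counts and the function names are hypothetical illustrations; they are not taken from Figure 1.1 or from Cannon et al.

```python
def risk_ratio(events_high, n_high, events_standard, n_standard):
    """Risk in the high-dose group divided by risk in the standard-dose group."""
    risk_high = events_high / n_high
    risk_standard = events_standard / n_standard
    return risk_high / risk_standard


def favored_group(rr):
    """Name the group with the lower risk, following the reading given in the text."""
    if rr < 1.0:
        return "high-dose"
    if rr > 1.0:
        return "standard-dose"
    return "neither"


# Hypothetical counts (an assumption for illustration only):
# 80 events among 1000 high-dose patients, 100 among 1000 standard-dose patients.
rr = risk_ratio(80, 1000, 100, 1000)
print(f"risk ratio = {rr:.2f}, lower risk in: {favored_group(rr)}")
```

With these made-up counts the ratio falls below 1.0, so the square for such a study would sit to the left of center on the plot.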
The effect is strongest (most distant from the center) in the TNT study and weakest in the Ideal study.

Note. For measures of effect size based on ratios (as in this example) a ratio of 1.0 represents no difference between groups. For measures of effect based on differences (such as mean difference), a difference of 0.0 represents no difference between groups.

Precision

In the schematic, the effect size for each study is bounded by a confidence interval, reflecting the precision with which the effect size has been estimated in that study. The confidence interval for the last study (Ideal) is noticeably narrower than that for the first study (Proveit), reflecting the fact that the Ideal study has greater precision. The meaning of precision and the factors that affect precision are discussed in Chapter 8.

Study weights

The solid squares that are used to depict each of the studies vary in size, with the size of each square reflecting the weight that is assigned to the corresponding study when we compute the summary effect. The TNT and Ideal studies are assigned relatively high weights, while somewhat less weight is assigned to the A to Z study and still less to the Proveit study.

As one would expect, there is a relationship between a study’s precision and that study’s weight in the analysis. Studies with relatively good precision (TNT and Ideal) are assigned more weight while studies with relatively poor precision (Proveit) are assigned less weight. Since precision is driven primarily by sample size, we can think of the studies as being weighted by sample size.

However, while precision is one of the elements used to assign weights, there are often other elements as well. In Part 3 we discuss different assumptions that one can make about the distribution of effect sizes across studies, and how these affect the weight assigned to each study.

p-values

For each study we show the p-value for a test of the null.
There is a necessary correspondence between the p-value and the confidence interval, such that the p-value will fall under 0.05 if and only if the 95% confidence interval does not include the null value. Therefore, by scanning the confidence intervals we can easily identify the statistically significant studies. The role of p-values in the analysis, as well as the relationship between p-values and effect size, is discussed in Chapter 32.

In this example, for three of the four studies the confidence interval crosses the null, and the p-value is greater than 0.05. In one (the TNT study) the confidence interval does not cross the null, and the p-value falls under 0.05.

THE SUMMARY EFFECT

One goal of the synthesis is usually to compute a summary effect. Typically we report the effect size itself, as well as a measure of precision and a p-value.

Effect size

On the plot the summary effect is shown on the bottom line. In this example the summary risk ratio is 0.85, indicating that the risk of death (or MI) was 15% lower for patients assigned to the high dose than for patients assigned to standard dose.

The summary effect is nothing more than the weighted mean of the individual effects. However, the mechanism used to assign the weights (and therefore the meaning of the summary effect) depends on our assumptions about the distribution of effect sizes from which the studies were sampled. Under the fixed-effect model, we assume that all studies in the analysis share the same true effect size, and the summary effect is our estimate of this common effect size. Under the random-effects model, we assume that the true effect size varies from study to study, and the summary effect is our estimate of the mean of the distribution of effect sizes. This is discussed in Part 3.

Precision

The summary effect is represented by a diamond. The location of the diamond represents the effect size while its width reflects the precision of the estimate.
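The idea of the summary effect as a weighted mean can be sketched in a few lines. The sketch below assumes a fixed-effect model with inverse-variance weights applied to log risk ratios, which is one common scheme; the four studies, their risk ratios, and their standard errors are hypothetical and are not the values behind Figure 1.1.

```python
import math

# Hypothetical studies: (name, risk ratio, standard error of the log risk ratio).
studies = [
    ("Study A", 0.80, 0.20),
    ("Study B", 0.90, 0.10),
    ("Study C", 0.85, 0.05),
    ("Study D", 0.88, 0.08),
]


def fixed_effect_summary(studies):
    """Inverse-variance weighted mean of log risk ratios, returned on the ratio scale."""
    weights = [1.0 / se ** 2 for _, _, se in studies]      # more precise => more weight
    log_effects = [math.log(rr) for _, rr, _ in studies]
    total_weight = sum(weights)
    mean_log = sum(w * y for w, y in zip(weights, log_effects)) / total_weight
    se_summary = math.sqrt(1.0 / total_weight)
    relative_weights = [100.0 * w / total_weight for w in weights]
    summary = math.exp(mean_log)
    lower = math.exp(mean_log - 1.96 * se_summary)
    upper = math.exp(mean_log + 1.96 * se_summary)
    return summary, lower, upper, relative_weights


summary, lower, upper, relative_weights = fixed_effect_summary(studies)
for (name, _, _), weight in zip(studies, relative_weights):
    print(f"{name}: relative weight {weight:.1f}%")
print(f"summary risk ratio {summary:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

In this sketch the study with the smallest standard error receives the largest relative weight, which mirrors the larger squares on a forest plot. Under a random-effects model the weights would be computed differently.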
In this example the diamond is centered at 0.85, and extends from 0.79 to 0.92, meaning that the actual impact of the high dose (as compared to the standard) likely falls somewhere in that range.

The precision addresses the accuracy of the summary effect as an estimate of the true effect. However, as discussed in Part 3 the exact meaning of the precision depends on the statistical model.

p-value

The p-value for the summary effect is 0.00003. This p-value reflects both the magnitude of the summary effect size and also the volume of information on which the estimate is based. Note that the p-value for the summary effect is substantially more compelling than that of any single study. Indeed, only one of the four studies had a p-value under 0.05. The relationship between p-values and effect sizes is discussed in Chapter 32.

HETEROGENEITY OF EFFECT SIZES

In this example the treatment effect is consistent across all studies (by a criterion explained in Chapter 16), but such is not always the case. A key theme in this volume is the importance of assessing the dispersion of effect sizes from study to study, and then taking this into account when interpreting the data. If the effect size is consistent, then we will usually focus on the summary effect, and note that this effect is robust across the domain of studies included in the analysis. If the effect size varies modestly, then we might still report the summary effect but note that the true effect in any given study could be somewhat lower or higher than this value. If the effect varies substantially from one study to the next, our attention will shift from the summary effect to the dispersion itself.

Because the dispersion in observed effects is partly spurious (it includes both real difference in effects and also random error), before trying to interpret the variation in effects we need to determine what part (if any) of the observed variation is real.
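As a rough check on how the reported summary statistics for the statin example fit together (risk ratio 0.85, interval 0.79 to 0.92, p-value 0.00003), the following sketch backs out a standard error from the confidence limits and derives a two-sided p-value. It assumes the interval is a 95% interval that is symmetric on the log scale and that the test is a normal-theory (Z) test; the function name is a hypothetical illustration, and the rounded inputs mean the result is only approximate.

```python
import math


def p_from_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate Z and two-sided p-value from a ratio estimate and its 95% CI."""
    # The interval spans 2 * z_crit standard errors on the log scale.
    se_log = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(estimate) / se_log
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


z, p = p_from_ratio_ci(0.85, 0.79, 0.92)
print(f"Z = {z:.2f}, two-sided p = {p:.5f}")
```

The p-value obtained this way is on the order of 0.00003, in line with the value shown for the summary effect.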
In Part 4 we show how to partition the observed variance into the part due to error and the part that represents variation in true effect sizes, and then how to use this information in various ways.

In this example our goal was to estimate the summary effect in a single population. In some cases, however, we will want to compare the effect size for one subgroup of studies versus another (say, for studies that used an elderly population versus those that used a relatively young population). In other cases we may want to assess the impact of putative moderators (or covariates) on the effect size (say, comparing the effect size in studies that used doses of 10, 20, 40, 80, 160 mg.). These kinds of analyses are also discussed in Part 4.

SUMMARY POINTS

To perform a meta-analysis we compute an effect size and variance for each study, and then compute a weighted mean of these effect sizes.
To compute the weighted mean we generally assign more weight to the more precise studies, but the rules for assigning weights depend on our assumptions about the distribution of true effects.

CHAPTER 2 Why Perform a Meta-Analysis

Introduction
The streptokinase meta-analysis
Statistical significance
Clinical importance of the effect
Consistency of effects

INTRODUCTION

Why perform a meta-analysis? What are the advantages of using statistical methods to synthesize data rather than taking the results that had been reported for each study and then having these collated and synthesized by an expert?

In this chapter we start at the point where we have already selected the studies to be included in the review, and are planning the synthesis itself. We do not address the differences between systematic reviews and narrative reviews in the process of locating and selecting studies. These differences can be critically important, but (as always) our focus is on the data analysis rather than the full process of the review.
The goal of a synthesis is to understand the results of any study in the context of all the other studies. First, we need to know whether or not the effect size is consistent across the body of data. If it is consistent, then we want to estimate the effect size as accurately as possible and to report that it is robust across the kinds of studies included in the synthesis. On the other hand, if it varies substantially from study to study, we want to quantify the extent of the variance and consider the implications.

Meta-analysis is able to address these issues whereas the narrative review is not. We start with an example to show how meta-analysis and narrative review would approach the same question, and then use this example to highlight the key differences between the two.

THE STREPTOKINASE META-ANALYSIS

During the time period beginning in 1959 and ending in 1988 (a span of nearly 30 years) there were a total of 33 randomized trials performed to assess the ability of streptokinase to prevent death following a heart attack. Streptokinase, a so-called clot buster which is administered intravenously, was hypothesized to dissolve the clot causing the heart attack, and thus increase the likelihood of survival. The trials all followed similar protocols, with patients assigned at random to either treatment or placebo. The outcome, whether or not the patient died, was the same in all the studies.

The trials varied substantially in size. The median sample size was slightly over 100 but there was one trial with a sample size in the range of 20 patients, and two large scale trials which enrolled some 12,000 and 17,000 patients, respectively. Of the 33 studies, six were statistically significant while the other 27 were not, leading to the perception that the studies yielded conflicting results.

In 1992 Lau et al.
published a meta-analysis that synthesized the results from the 33 studies. The presentation that follows is based on the Lau paper (though we use a risk ratio where Lau used an odds ratio). The forest plot (Figure 2.1) provides context for the analysis. An effect size to the left of center indicates that treated patients were more likely to survive, while an effect size to the right of center indicates that control patients were more likely to survive.

Figure 2.1 Impact of streptokinase on mortality (adapted from Lau et al., 1992).

The plot serves to highlight the following points.

The effect sizes are reasonably consistent from study to study. Most fall in the range of 0.50 to 0.90, which suggests that it would be appropriate to compute a summary effect size.
The summary effect is a risk ratio of 0.79 with a 95% confidence interval of 0.72 to 0.87 (that is, a 21% decrease in risk of death, with 95% confidence interval of 13% to 28%). The p-value for the summary effect is 0.0000008.
The confidence interval that bounds each effect size indicates the precision in that study. If the interval excludes 1.0, the p-value is less than 0.05 and the study is statistically significant. Six of the studies were statistically significant while 27 were not.

In sum, the treatment reduces the risk of death by some 21%. And, this effect was reasonably consistent across all studies in the analysis.

Over the course of this volume we explain the statistical procedures that led to these conclusions. Our goal in the present chapter is simply to explain that meta-analysis does offer these mechanisms, whereas the narrative review does not. The key differences are as follows.

STATISTICAL SIGNIFICANCE

One of the first questions asked of a study is the statistical significance of the results. The narrative review has no mechanism for synthesizing the p-values from the different studies, and must deal with them as discrete pieces of data.
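The step from the summary risk ratio of 0.79 (interval 0.72 to 0.87) to a 21% decrease in risk (interval 13% to 28%) is simple arithmetic, sketched here. The function name is a hypothetical illustration. Note that the limits swap roles: the upper limit of the risk ratio gives the smaller reduction.

```python
def percent_risk_reduction(rr):
    """Convert a risk ratio into a percentage reduction in risk."""
    return (1.0 - rr) * 100.0


# Values reported in the text for the streptokinase summary effect.
reduction = percent_risk_reduction(0.79)
reduction_low = percent_risk_reduction(0.87)   # upper RR limit -> smallest reduction
reduction_high = percent_risk_reduction(0.72)  # lower RR limit -> largest reduction

print(f"{reduction:.0f}% decrease (95% CI {reduction_low:.0f}% to {reduction_high:.0f}%)")
```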
In this example six of the studies were statistically significant while the other 27 were not, which led some to conclude that there was evidence against an effect, or that the results were inconsistent (see vote counting in Chapter 28). By contrast, the meta-analysis allows us to combine the effects and evaluate the statistical significance of the summary effect. The p-value for the summary effect is p = 0.0000008.

While one might assume that 27 studies failed to reach statistical significance because they reported small effects, it is clear from the forest plot that this is not the case. In fact, the treatment effect in many of these studies was actually larger than the treatment effect in the six studies that were statistically significant. Rather, the reason that 82% of the studies were not statistically significant is that these studies had small sample sizes and low statistical power. In fact, as discussed in Chapter 29, most had power of less than 20%. By contrast, power for the meta-analysis exceeded 99.9% (see Chapter 29).

As in this example, if the goal of a synthesis is to test the null hypothesis, then meta-analysis provides a mathematically rigorous mechanism for this purpose. However, meta-analysis also allows us to move beyond the question of statistical significance, and address questions that are more interesting and also more relevant.

CLINICAL IMPORTANCE OF THE EFFECT

Since the point of departure for a narrative review is usually the p-values reported by the various studies, the review will often focus on the question of whether or not the body of evidence allows us to reject the null hypothesis. There is no good mechanism for discussing the magnitude of the effect. By contrast, the meta-analytic approaches discussed in this volume allow us to compute an estimate of the effect size for each study, and these effect sizes fall at the core of the analysis. This is important because the effect size is what we care about.
If a clinician or patient needs to make a decision about whether or not to employ a treatment, they want to know if the treatment reduces the risk of death by 5% or 10% or 20%, and this is the information carried by the effect size. Similarly, if we are thinking of implementing an intervention to increase the test scores of students, or to reduce the number of incarcerations among at-risk juveniles, or to increase the survival time for patients with pancreatic cancer, the question we ask is about the magnitude of the effect. The p-value can tell us only that the effect is not zero, and to report simply that the effect is not zero is to miss the point.

CONSISTENCY OF EFFECTS

When we are working with a collection of studies, it is critically important to ask whether or not the effect size is consistent across studies. The implications are quite different for a drug that consistently reduces the risk of death by 20%, as compared with a drug that reduces the risk of death by 20% on average, but that increases the risk by 20% in some populations while reducing it by 60% in others.

The narrative review has no good mechanism for assessing the consistency of effects. The narrative review starts with p-values, and because the p-value is driven by the size of a study as well as the effect in that study, the fact that one study reported a p-value of 0.001 and another reported a p-value of 0.50 does not mean that the effect was larger in the former. The p-value of 0.001 could reflect a large effect size but it could also reflect a moderate or small effect in a large study (see the GISSI-1 study in Figure 2.1, for example). The p-value of 0.50 could reflect a small (or nil) effect size but could also reflect a large effect in a small study (see the Fletcher study, for example).

This point is often missed in narrative reviews. Often, researchers interpret a nonsignificant result to mean that there is no effect.
If some studies are statistically significant while others are not, the reviewers see the results as conflicting. This problem runs through many fields of research. To borrow a phrase from Cary Grant’s character in Arsenic and Old Lace, we might say that it practically gallops.

Schmidt (1996) outlines the impact of this practice on research and policy. Suppose an idea is proposed that will improve test scores for African-American children. A number of studies are performed to test the intervention. The effect size is positive and consistent across studies but power is around 50%, and only around 50% of the studies yield statistically significant results. Researchers report that the evidence is ‘conflicting’ and launch a series of studies to determine why the intervention had a positive effect in some studies but not others (Is it the teacher’s attitude? Is it the students’ socioeconomic status?), entirely missing the point that the effect was actually consistent from one study to the next. No pattern can be found (since none exists). Eventually, researchers decide that the issue cannot be understood. A promising idea is lost, and a perception builds that research is not to be trusted. A similar point is made by Meehl (1978, 1990).

Rossi (1997) gives an example from the field of memory research that shows what can happen to a field of research when reviewers work with discrete p-values. The issue of whether or not researchers could demonstrate the spontaneous recovery of previously extinguished associations had a bearing on a number of important learning theories, and some 40 studies on the topic were published between 1948 and 1969. Evidence of the effect (that is, statistically significant findings) was obtained in only about half the studies, which led most texts and reviews to conclude that the effect was ephemeral and ‘the issue was not so much resolved as it was abandoned’ (p. 179).
Later, Rossi returned to these studies and found that the average effect size (d) was 0.39. If we assume that this is the population effect size, the mean power for these studies would have been slightly under 50%. On this basis we would expect about half the studies to yield a significant effect, which is exactly what happened.

Even worse, when the significant study was performed in one type of sample and the nonsignificant study was performed in another type of sample, researchers would sometimes interpret this difference as meaning that the effect existed in one population but not the other. Abelson (1997) notes that if a treatment effect yields a p-value of 0.07 for wombats and 0.05 for dingbats we are likely to see a discussion explaining why the treatment is effective only in the latter group, completely missing the point that the treatment effect may have been virtually identical in the two. The treatment effect may have even been larger for the wombats if the sample size was smaller.

By contrast, meta-analysis completely changes the landscape. First, we work with effect sizes (not p-values) to determine whether or not the effect size is consistent across studies. Additionally, we apply methods based on statistical theory to allow that some (or all) of the observed dispersion is due to random sampling variation rather than differences in the true effect sizes. Then, we apply formulas to partition the variance into random error versus real variance, to quantify the true differences among studies, and to consider the implications of this variance. In the Schmidt and the Rossi examples, a meta-analysis might have found that the effect size was consistent across studies, and that all of the observed variation in effects could be attributed to random sampling error.

SUMMARY POINTS

Since the narrative review is based on discrete reports from a series of studies, it provides no real mechanism for synthesizing the data.
To borrow a phrase from Abelson, it involves doing arithmetic with words. And, when the words are based on p-values the words are the wrong words.

By contrast, in a meta-analysis we introduce two fundamental changes. First, we work directly with the effect size from each study rather than the p-value. Second, we include all of the effects in a single statistical synthesis. This is critically important for the goal of computing (and testing) a summary effect.

Meta-analysis also allows us to assess the dispersion of effects, and distinguish between real dispersion and spurious dispersion.

PART 2
Effect Size and Precision

CHAPTER 3
Overview

Treatment effects and effect sizes
Parameters and estimates
Outline of effect size computations

TREATMENT EFFECTS AND EFFECT SIZES

The terms treatment effects and effect sizes are used in different ways by different people. Meta-analyses in medicine often refer to the effect size as a treatment effect, and this term is sometimes assumed to refer to odds ratios, risk ratios, or risk differences, which are common in meta-analyses that deal with medical interventions. Similarly, meta-analyses in the social sciences often refer to the effect size simply as an effect size and this term is sometimes assumed to refer to standardized mean differences or to correlations, which are common in social science meta-analyses.

In fact, though, both the terms effect size and treatment effect can refer to any of these indices, and the distinction between these terms lies not in the index itself but rather in the nature of the study. The term effect size is appropriate when the index is used to quantify the relationship between two variables or a difference between two groups. By contrast, the term treatment effect is appropriate only for an index used to quantify the impact of a deliberate intervention.
Thus, the difference between males and females could be called an effect size only, while the difference between treated and control groups could be called either an effect size or a treatment effect.

While most meta-analyses focus on relationships between variables, some have the goal of estimating a mean or risk or rate in a single population. For example, a meta-analysis might be used to combine several estimates for the prevalence of Lyme disease in Wabash or the mean SAT score for students in Utah. In these cases the index is clearly not a treatment effect, and is also not an effect size, since effect implies a relationship. Rather, the parameter being estimated could be called simply a single group summary.

Note, however, that the classification of an index as an effect size and/or a treatment effect (or simply a single group summary) has no bearing on the computations. In the meta-analysis itself we have simply a series of values and their variances, and the same mathematical formulas apply. In this volume we generally use the term effect size, but we use it in a generic sense, to include also treatment effects, single group summaries, or even a generic statistic.

How to choose an effect size

Three major considerations should drive the choice of an effect size index. The first is that the effect sizes from the different studies should be comparable to one another in the sense that they measure (at least approximately) the same thing. That is, the effect size should not depend on aspects of study design that may vary from study to study (such as sample size or whether covariates are used). The second is that estimates of the effect size should be computable from the information that is likely to be reported in published research reports.
That is, it should not require the re-analysis of the raw data (unless these are known to be available). The third is that the effect size should have good technical properties. For example, its sampling distribution should be known so that variances and confidence intervals can be computed.

Additionally, the effect size should be substantively interpretable. This means that researchers in the substantive area of the work represented in the synthesis should find the effect size meaningful. If the effect size is not inherently meaningful, it is usually possible to transform the effect size to another metric for presentation. For example, the analyses may be performed using the log risk ratio but then transformed to a risk ratio (or even to illustrative risks) for presentation.

In practice, the kind of data used in the primary studies will usually lead to a pool of two or three effect sizes that meet the criteria outlined above, which makes the process of selecting an effect size relatively straightforward. If the summary data reported by the primary study are based on means and standard deviations in two groups, the appropriate effect size will usually be either the raw difference in means, the standardized difference in means, or the response ratio. If the summary data are based on a binary outcome such as events and non-events in two groups the appropriate effect size will usually be the risk ratio, the odds ratio, or the risk difference. If the primary study reports a correlation between two variables, then the correlation coefficient itself may serve as the effect size.

PARAMETERS AND ESTIMATES

Throughout this volume we make the distinction between an underlying effect size parameter (denoted by the Greek letter $\theta$) and the sample estimate of that parameter (denoted by Y).

If a study had an infinitely large sample size then it would yield an effect size Y that was identical to the population parameter $\theta$.
In fact, though, sample sizes are finite and so the effect size estimate Y always differs from $\theta$ by some amount. The value of Y will vary from sample to sample, and the distribution of these values is the sampling distribution of Y. Statistical theory allows us to estimate the sampling distribution of effect size estimates, and hence their standard errors.

OUTLINE OF EFFECT SIZE COMPUTATIONS

Table 3.1 provides an outline of the computational formulas that follow. These are some of the more common effect sizes and study designs. A more extensive array of formulas is offered in Borenstein et al. (2009).

Table 3.1 Roadmap of formulas in subsequent chapters.

Effect sizes based on means (Chapter 4)
    Raw (unstandardized) mean difference (D)
        Based on studies with independent groups
        Based on studies with matched groups or pre-post designs
    Standardized mean difference (d or g)
        Based on studies with independent groups
        Based on studies with matched groups or pre-post designs
    Response ratios (R)
        Based on studies with independent groups
Effect sizes based on binary data (Chapter 5)
    Risk ratio (RR)
        Based on studies with independent groups
    Odds ratio (OR)
        Based on studies with independent groups
    Risk difference (RD)
        Based on studies with independent groups
Effect sizes based on correlational data (Chapter 6)
    Correlation (r)
        Based on studies with one group

CHAPTER 4
Effect Sizes Based on Means

Introduction
Raw (unstandardized) mean difference D
Standardized mean difference, d and g
Response ratios

INTRODUCTION

When the studies report means and standard deviations, the preferred effect size is usually the raw mean difference, the standardized mean difference, or the response ratio. These effect sizes are discussed in this chapter.
RAW (UNSTANDARDIZED) MEAN DIFFERENCE D

When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta-analysis can be performed directly on the raw difference in means (henceforth, we will use the more common term, raw mean difference). The primary advantage of the raw mean difference is that it is intuitively meaningful, either inherently (for example, blood pressure, which is measured on a known scale) or because of widespread use (for example, a national achievement test for students, where all relevant parties are familiar with the scale).

Consider a study that reports means for two groups (Treated and Control) and suppose we wish to compare the means of these two groups. Let $\mu_1$ and $\mu_2$ be the true (population) means of the two groups. The population mean difference is defined as

$$\Delta = \mu_1 - \mu_2. \tag{4.1}$$

In the two sections that follow we show how to compute an estimate D of this parameter and its variance from studies that used two independent groups and from studies that used paired groups or matched designs.

Computing D from studies that use independent groups

We can estimate the mean difference $\Delta$ from a study that used two independent groups as follows. Let $\bar{X}_1$ and $\bar{X}_2$ be the sample means of the two independent groups. The sample estimate of $\Delta$ is just the difference in sample means, namely

$$D = \bar{X}_1 - \bar{X}_2. \tag{4.2}$$

Note that uppercase D is used for the raw mean difference, whereas lowercase d will be used for the standardized mean difference (below).

Let $S_1$ and $S_2$ be the sample standard deviations of the two groups, and $n_1$ and $n_2$ be the sample sizes in the two groups.
If we assume that the two population standard deviations are the same (as is assumed to be the case in most parametric data analysis techniques), so that $\sigma_1 = \sigma_2 = \sigma$, then the variance of D is

$$V_D = \frac{n_1 + n_2}{n_1 n_2} S_{pooled}^2, \tag{4.3}$$

where

$$S_{pooled} = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}}. \tag{4.4}$$

If we don’t assume that the two population standard deviations are the same, then the variance of D is

$$V_D = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}. \tag{4.5}$$

In either case, the standard error of D is then the square root of $V_D$,

$$SE_D = \sqrt{V_D}. \tag{4.6}$$

For example, suppose that a study has sample means $\bar{X}_1 = 103.00$, $\bar{X}_2 = 100.00$, sample standard deviations $S_1 = 5.5$, $S_2 = 4.5$, and sample sizes $n_1 = n_2 = 50$. The raw mean difference D is

$$D = 103.00 - 100.00 = 3.00.$$

If we assume that $\sigma_1 = \sigma_2$ then the pooled standard deviation within groups is

$$S_{pooled} = \sqrt{\frac{(50 - 1) \times 5.5^2 + (50 - 1) \times 4.5^2}{50 + 50 - 2}} = 5.0249.$$

The variance and standard error of D are given by

$$V_D = \frac{50 + 50}{50 \times 50} \times 5.0249^2 = 1.0100,$$

and

$$SE_D = \sqrt{1.0100} = 1.0050.$$

If we do not assume that $\sigma_1 = \sigma_2$ then the variance and standard error of D are given by

$$V_D = \frac{5.5^2}{50} + \frac{4.5^2}{50} = 1.0100$$

and

$$SE_D = \sqrt{1.0100} = 1.0050.$$

In this example formulas (4.3) and (4.5) yield the same result, but this will be true only if the sample size and/or the estimate of the variances is the same in the two groups.

Computing D from studies that use matched groups or pre-post scores

The previous formulas are appropriate for studies that use two independent groups. Another study design is the use of matched groups, where pairs of participants are matched in some way (for example, siblings, or patients at the same stage of disease), with the two members of each pair then being assigned to different groups.
The unit of analysis is the pair, and the advantage of this design is that each pair serves as its own control, reducing the error term and increasing the statistical power. The magnitude of the impact depends on the correlation between (for example) siblings, with a higher correlation yielding a lower variance (and increased precision).

The sample estimate of $\Delta$ is just the sample mean difference, D. If we have the difference score for each pair, which gives us the mean difference $\bar{X}_{diff}$ and the standard deviation of these differences ($S_{diff}$), then

$$D = \bar{X}_{diff}, \tag{4.7}$$

$$V_D = \frac{S_{diff}^2}{n}, \tag{4.8}$$

where n is the number of pairs, and

$$SE_D = \sqrt{V_D}. \tag{4.9}$$

For example, if the mean difference is 5.00 with standard deviation of the difference of 10.00 and n of 50 pairs, then

$$D = 5.0000,$$

$$V_D = \frac{10.00^2}{50} = 2.0000, \tag{4.10}$$

and

$$SE_D = \sqrt{2.00} = 1.4142. \tag{4.11}$$

Alternatively, if we have the mean and standard deviation for each set of scores (for example, siblings A and B), the difference is

$$D = \bar{X}_1 - \bar{X}_2. \tag{4.12}$$

The variance is again given by

$$V_D = \frac{S_{diff}^2}{n}, \tag{4.13}$$

where n is the number of pairs, and the standard error is given by

$$SE_D = \sqrt{V_D}. \tag{4.14}$$

However, in this case we need to compute the standard deviation of the difference scores from the standard deviation of each sibling’s scores. This is given by

$$S_{diff} = \sqrt{S_1^2 + S_2^2 - 2 \times r \times S_1 \times S_2}, \tag{4.15}$$

where r is the correlation between ‘siblings’ in matched pairs. If $S_1 = S_2$, then (4.15) simplifies to

$$S_{diff} = \sqrt{2 \times S_{pooled}^2 (1 - r)}. \tag{4.16}$$

In either case, as r moves toward 1.0 the standard error of the paired difference will decrease, and when r = 0 the standard error of the difference is the same as it would be for a study with two independent groups, each of size n.
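The formulas for the raw mean difference, (4.2) to (4.6) for independent groups and (4.12) to (4.15) for matched groups, can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the book: the function names and the `pooled` flag are assumptions introduced here, and the input values are illustrative summary statistics.

```python
import math


def raw_mean_diff_independent(mean1, mean2, sd1, sd2, n1, n2, pooled=True):
    """Raw mean difference D, its variance and standard error (two independent groups).

    Hypothetical helper; `pooled=True` assumes equal population SDs, formulas (4.3)-(4.4),
    `pooled=False` uses formula (4.5).
    """
    d_raw = mean1 - mean2  # (4.2)
    if pooled:
        # (4.4): pooled within-groups variance
        s_pooled_sq = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
        variance = (n1 + n2) / (n1 * n2) * s_pooled_sq  # (4.3)
    else:
        variance = sd1**2 / n1 + sd2**2 / n2  # (4.5)
    return d_raw, variance, math.sqrt(variance)  # (4.6)


def raw_mean_diff_matched(mean1, mean2, sd1, sd2, r, n_pairs):
    """Raw mean difference D, its variance and standard error (matched pairs).

    Hypothetical helper; r is the correlation between the paired scores.
    """
    d_raw = mean1 - mean2  # (4.12)
    # (4.15): SD of the difference scores from the two SDs and the correlation
    s_diff = math.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)
    variance = s_diff**2 / n_pairs  # (4.13)
    return d_raw, variance, math.sqrt(variance)  # (4.14)


# Independent groups: means 103 and 100, SDs 5.5 and 4.5, 50 per group
print(raw_mean_diff_independent(103.0, 100.0, 5.5, 4.5, 50, 50))

# Matched pairs: means 105 and 100, SDs 10 and 10, r = 0.50, 50 pairs
print(raw_mean_diff_matched(105.0, 100.0, 10.0, 10.0, 0.50, 50))
```

Setting `r=0.0` in the matched helper gives the same variance as the independent-groups formula with n per group, which is a quick way to check the relationship described above.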
For example, suppose the means for siblings A and B are 105.00 and 100.00, with standard deviations 10 and 10, the correlation between the two sets of scores is 0.50, and the number of pairs is 50. Then

$$D = 105.00 - 100.00 = 5.0000,$$

$$V_D = \frac{10.00^2}{50} = 2.0000,$$

and

$$SE_D = \sqrt{2.00} = 1.4142.$$

In the calculation of $V_D$, the $S_{diff}$ is computed using

$$S_{diff} = \sqrt{10^2 + 10^2 - 2 \times 0.50 \times 10 \times 10} = 10.0000$$

or

$$S_{diff} = \sqrt{2 \times 10^2 (1 - 0.50)} = 10.0000.$$

The formulas for matched designs apply to pre-post designs as well. The pre and post means correspond to the means in the matched groups, n is the number of subjects, and r is the correlation between pre-scores and post-scores.

Calculation of effect size estimates from information that is reported

When a researcher has access to a full set of summary data such as the mean, standard deviation, and sample size for each group, the computation of the effect size and its variance is relatively straightforward. In practice, however, the researcher will often be working with only partial data. For example, a paper may publish only the p-value, means and sample sizes from a test of significance, leaving it to the meta-analyst to back-compute the effect size and variance. For information on computing effect sizes from partial information, see Borenstein et al. (2009).

Including different study designs in the same analysis

Sometimes a systematic review will include studies that used independent groups and also studies that used matched groups. From a statistical perspective the effect size (D) has the same meaning regardless of the study design. Therefore, we can compute the effect size and variance from each study using the appropriate formula, and then include all studies in the same analysis.
While there is no technical barrier to using different study designs in the same analysis, there may be a concern that studies which used different designs might differ in substantive ways as well (see Chapter 40).

For all study designs (whether using independent or paired groups) the direction of the effect ($\bar{X}_1 - \bar{X}_2$ or $\bar{X}_2 - \bar{X}_1$) is arbitrary, except that the researcher must decide on a convention and then apply this consistently. For example, if a positive difference will indicate that the treated group did better than the control group, then this convention must apply for studies that used independent designs and for studies that used pre-post designs. In some cases it might be necessary to reverse the computed sign of the effect size to ensure that the convention is followed.

STANDARDIZED MEAN DIFFERENCE, d AND g

As noted, the raw mean difference is a useful index when the measure is meaningful, either inherently or because of widespread use. By contrast, when the measure is less well known (for example, a proprietary scale with limited distribution), the use of a raw mean difference has less to recommend it. In any event, the raw mean difference is an option only if all the studies in the meta-analysis use the same scale. If different studies use different instruments (such as different psychological or educational tests) to assess the outcome, then the scale of measurement will differ from study to study and it would not be meaningful to combine raw mean differences.

In such cases we can divide the mean difference in each study by that study’s standard deviation to create an index (the standardized mean difference) that would be comparable across studies. This is the same approach suggested by Cohen (1969, 1987) in connection with describing the magnitude of effects in statistical power analysis.
The standardized mean difference can be considered as being comparable across studies based on either of two arguments (Hedges and Olkin, 1985). If the outcome measures in all studies are linear transformations of each other, the standardized mean difference can be seen as the mean difference that would have been obtained if all data were transformed to a scale where the standard deviation within-groups was equal to 1.0.

The other argument for comparability of standardized mean differences is the fact that the standardized mean difference is a measure of overlap between distributions. In this telling, the standardized mean difference reflects the difference between the distributions in the two groups (and how each represents a distinct cluster of scores) even if they do not measure exactly the same outcome (see Cohen, 1987, Grissom and Kim, 2005).

Consider a study that uses two independent groups, and suppose we wish to compare the means of these two groups. Let $\mu_1$ and $\sigma_1$ be the true (population) mean and standard deviation of the first group and let $\mu_2$ and $\sigma_2$ be the true (population) mean and standard deviation of the other group. If the two population standard deviations are the same (as is assumed in most parametric data analysis techniques), so that $\sigma_1 = \sigma_2 = \sigma$, then the standardized mean difference parameter or population standardized mean difference is defined as

$$\delta = \frac{\mu_1 - \mu_2}{\sigma}. \tag{4.17}$$

In the sections that follow, we show how to estimate $\delta$ from studies that used independent groups, and from studies that used pre-post or matched group designs. It is also possible to estimate $\delta$ from studies that used other designs (including clustered designs) but these are not addressed here (see resources at the end of this Part). We make the common assumption that $\sigma_1^2 = \sigma_2^2$, which allows us to pool the estimates of the standard deviation, and do not address the case where these are assumed to differ from each other.
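The linear-transformation argument above can be illustrated with a small numerical sketch. The two studies below are hypothetical, and the function name is an assumption made for this illustration: study B is imagined to record the same outcome as study A on a rescaled instrument, y = 10x + 20.

```python
def standardized_mean_diff(mean1, mean2, sd_within):
    """Mean difference expressed in within-groups standard deviation units."""
    return (mean1 - mean2) / sd_within


# Hypothetical study A, measured on the original scale x
mean1_a, mean2_a, sd_a = 103.0, 100.0, 5.0

# Hypothetical study B, same outcome on the rescaled instrument y = 10*x + 20.
# A linear transformation shifts and scales the means, and scales the SD.
mean1_b, mean2_b, sd_b = 10 * mean1_a + 20, 10 * mean2_a + 20, 10 * sd_a

raw_a = mean1_a - mean2_a  # raw difference on scale x
raw_b = mean1_b - mean2_b  # raw difference on scale y, ten times larger

smd_a = standardized_mean_diff(mean1_a, mean2_a, sd_a)
smd_b = standardized_mean_diff(mean1_b, mean2_b, sd_b)

print(raw_a, raw_b)  # the raw differences are not comparable
print(smd_a, smd_b)  # the standardized differences agree
```

The raw differences differ by the scale factor, while the standardized differences coincide, which is why the standardized index can be combined across studies that used different instruments.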
Computing d and g from studies that use independent groups

We can estimate the standardized mean difference ($\delta$) from studies that used two independent groups as

$$d = \frac{\bar{X}_1 - \bar{X}_2}{S_{within}}. \tag{4.18}$$

In the numerator, $\bar{X}_1$ and $\bar{X}_2$ are the sample means in the two groups. In the denominator $S_{within}$ is the within-groups standard deviation, pooled across groups,

$$S_{within} = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}}, \tag{4.19}$$

where $n_1$ and $n_2$ are the sample sizes in the two groups, and $S_1$ and $S_2$ are the standard deviations in the two groups. The reason that we pool the two sample estimates of the standard deviation is that even if we assume that the underlying population standard deviations are the same (that is $\sigma_1 = \sigma_2 = \sigma$), it is unlikely that the sample estimates $S_1$ and $S_2$ will be identical. By pooling the two estimates of the standard deviation, we obtain a more accurate estimate of their common value.

The sample estimate of the standardized mean difference is often called Cohen’s d in research synthesis. Some confusion about the terminology has resulted from the fact that the index $\delta$, originally proposed by Cohen as a population parameter for describing the size of effects for statistical power analysis, is also sometimes called d. In this volume we use the symbol $\delta$ to denote the effect size parameter and d for the sample estimate of that parameter.

The variance of d is given (to a very good approximation) by

$$V_d = \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)}. \tag{4.20}$$

In this equation the first term on the right of the equals sign reflects uncertainty in the estimate of the mean difference (the numerator in (4.18)), and the second reflects uncertainty in the estimate of $S_{within}$ (the denominator in (4.18)).

The standard error of d is the square root of $V_d$,

$$SE_d = \sqrt{V_d}. \tag{4.21}$$

It turns out that d has a slight bias, tending to overestimate the absolute value of $\delta$ in small samples.
This bias can be removed by a simple correction that yields an unbiased estimate of $\delta$, with the unbiased estimate sometimes called Hedges’ g (Hedges, 1981). To convert from d to Hedges’ g we use a correction factor, which is called J. Hedges (1981) gives the exact formula for J, but in common practice researchers use an approximation,

$$J = 1 - \frac{3}{4\,df - 1}. \tag{4.22}$$

In this expression, df is the degrees of freedom used to estimate $S_{within}$, which for two independent groups is $n_1 + n_2 - 2$. This approximation always has error of less than 0.007 and less than 0.035 percent when $df \ge 10$ (Hedges, 1981). Then,

$$g = J \times d, \tag{4.23}$$

$$V_g = J^2 \times V_d, \tag{4.24}$$

and

$$SE_g = \sqrt{V_g}. \tag{4.25}$$

For example, suppose a study has sample means $\bar{X}_1 = 103$, $\bar{X}_2 = 100$, sample standard deviations $S_1 = 5.5$, $S_2 = 4.5$, and sample sizes $n_1 = n_2 = 50$. We would estimate the pooled-within-groups standard deviation as

$$S_{within} = \sqrt{\frac{(50 - 1) \times 5.5^2 + (50 - 1) \times 4.5^2}{50 + 50 - 2}} = 5.0249.$$

Then,

$$d = \frac{103 - 100}{5.0249} = 0.5970,$$

$$V_d = \frac{50 + 50}{50 \times 50} + \frac{0.5970^2}{2(50 + 50)} = 0.0418,$$

and

$$SE_d = \sqrt{0.0418} = 0.2044.$$

The correction factor (J), Hedges’ g, its variance and standard error are given by

$$J = 1 - \frac{3}{4 \times 98 - 1} = 0.9923,$$

$$g = 0.9923 \times 0.5970 = 0.5924,$$

$$V_g = 0.9923^2 \times 0.0418 = 0.0411,$$

and

$$SE_g = \sqrt{0.0411} = 0.2028.$$

The correction factor (J) is always less than 1.0, and so g will always be less than d in absolute value, and the variance of g will always be less than the variance of d. However, J will be very close to 1.0 unless df is very small (say, less than 10) and so (as in this example) the difference is usually trivial (Hedges, 1981).

Some slightly different expressions for the variance of d (and g) have been given by different authors and even the same authors at different times. For example, the denominator of the second term of the variance of d is given here as $2(n_1 + n_2)$.
This expression is obtained by one method (assuming the n’s become large with $\delta$ fixed). An alternate derivation (assuming n’s become large with $\sqrt{n}\,\delta$ fixed) leads to a denominator in the second term that is slightly different, namely $2(n_1 + n_2 - 2)$. Unless $n_1$ and $n_2$ are very small, these expressions will be almost identical.

Similarly, the expression given here for the variance of g is $J^2$ times the variance of d, but many authors ignore the $J^2$ term because it is so close to unity in most cases. Again, while it is preferable to include this correction factor, the inclusion of this factor is likely to make little practical difference.

Computing d and g from studies that use pre-post scores or matched groups

We can estimate the standardized mean difference ($\delta$) from studies that used matched groups or pre-post scores in one group. The formula for the sample estimate of d is

$$d = \frac{\bar{Y}_{diff}}{S_{within}} = \frac{\bar{Y}_1 - \bar{Y}_2}{S_{within}}. \tag{4.26}$$

This is the same formula as for independent groups (4.18). However, when we are working with independent groups the natural unit of deviation is the standard deviation within groups and so this value is typically reported (or easily imputed). By contrast, when we are working with matched groups, the natural unit of deviation is the standard deviation of the difference scores, and so this is the value that is likely to be reported. To compute d from the standard deviation of the differences we need to impute the standard deviation within groups, which would then serve as the denominator in (4.26).

Concretely, when working with a matched study, the standard deviation within groups can be imputed from the standard deviation of the difference, using

$$S_{within} = \frac{S_{diff}}{\sqrt{2(1 - r)}}, \tag{4.27}$$

where r is the correlation between pairs of observations (e.g., the pretest-posttest correlation). Then we can apply (4.26) to compute d.
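The independent-groups formulas (4.18) to (4.25), together with the imputation step (4.27) for matched designs, can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names and the dictionary layout are assumptions introduced here, J uses the approximation (4.22) rather than Hedges' exact formula, and the matched-design inputs at the end are illustrative values.

```python
import math


def hedges_j(df):
    """Approximate small-sample correction factor J, formula (4.22)."""
    return 1 - 3 / (4 * df - 1)


def smd_independent(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d and Hedges' g with variances, for two independent groups."""
    # (4.19): within-groups standard deviation, pooled across groups
    s_within = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / s_within  # (4.18)
    v_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))  # (4.20)
    j = hedges_j(n1 + n2 - 2)  # df = n1 + n2 - 2 for independent groups
    g = j * d  # (4.23)
    v_g = j**2 * v_d  # (4.24)
    return {
        "s_within": s_within,
        "d": d, "v_d": v_d, "se_d": math.sqrt(v_d),  # (4.21)
        "j": j,
        "g": g, "v_g": v_g, "se_g": math.sqrt(v_g),  # (4.25)
    }


def s_within_from_s_diff(s_diff, r):
    """Impute the within-groups SD from the SD of difference scores, formula (4.27)."""
    return s_diff / math.sqrt(2 * (1 - r))


# Independent groups: means 103 and 100, SDs 5.5 and 4.5, 50 per group
result = smd_independent(103.0, 100.0, 5.5, 4.5, 50, 50)
for name, value in result.items():
    print(f"{name}: {value:.4f}")

# Matched design: illustrative S_diff = 5.5 and r = 0.7, then d via (4.26)
s_within_matched = s_within_from_s_diff(5.5, 0.7)
d_matched = (103.0 - 100.0) / s_within_matched
print(f"s_within: {s_within_matched:.4f}, d: {d_matched:.4f}")
```

Because the imputed standard deviation depends on r, rerunning `s_within_from_s_diff` over a range of plausible correlations is one simple way to see how sensitive d is to that assumption.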
The variance of d is given by

$$V_d = \left(\frac{1}{n} + \frac{d^2}{2n}\right) 2(1 - r), \tag{4.28}$$

where n is the number of pairs. The standard error of d is just the square root of $V_d$,

$$SE_d = \sqrt{V_d}. \tag{4.29}$$

Since the correlation between pre- and post-scores is required to impute the standard deviation within groups from the standard deviation of the difference, we must assume that this correlation is known or can be estimated with high precision. Otherwise we may estimate the correlation from related studies, and possibly perform a sensitivity analysis using a range of plausible correlations.

To compute Hedges' g and associated statistics we would use formulas (4.22) through (4.25). The degrees of freedom for computing J is $n - 1$, where n is the number of pairs.

For example, suppose that a study has pretest and posttest sample means $\bar{X}_1 = 103$, $\bar{X}_2 = 100$, sample standard deviation of the difference $S_{diff} = 5.5$, sample size $n = 50$, and a correlation between pretest and posttest of $r = 0.7$. The standard deviation within groups is imputed from the standard deviation of the difference by

$$S_{within} = \frac{5.5}{\sqrt{2(1 - 0.7)}} = 7.1005.$$

Then d, its variance and standard error are computed as

$$d = \frac{103 - 100}{7.1005} = 0.4225,$$

$$V_d = \left(\frac{1}{50} + \frac{0.4225^2}{2 \times 50}\right)\bigl(2(1 - 0.7)\bigr) = 0.0131,$$

and

$$SE_d = \sqrt{0.0131} = 0.1143.$$

The correction factor J, Hedges' g, its variance and standard error are given by

$$J = 1 - \frac{3}{4 \times 49 - 1} = 0.9846,$$

$$g = 0.9846 \times 0.4225 = 0.4160,$$

$$V_g = 0.9846^2 \times 0.0131 = 0.0127,$$

and

$$SE_g = \sqrt{0.0127} = 0.1126.$$

Including different study designs in the same analysis

As we noted earlier, a single systematic review can include studies that used independent groups and also studies that used matched groups. From a statistical perspective the effect size (d or g) has the same meaning regardless of the study design. Therefore, we can compute the effect size and variance from each study using the appropriate formula, and then include all studies in the same analysis.
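To make "the appropriate formula" for each design concrete, here is a minimal Python sketch that computes d, its variance, Hedges' g and its variance for an independent-groups study and for a matched (pre-post) study, returning both in the same shape so they could be pooled together. The names `EffectSize`, `smd_independent` and `smd_matched` are hypothetical; the formulas are the ones used in this chapter's worked examples, with the approximate J of (4.22).

```python
import math
from typing import NamedTuple


class EffectSize(NamedTuple):
    d: float    # standardized mean difference
    v_d: float  # variance of d
    g: float    # Hedges' g (bias-corrected d)
    v_g: float  # variance of g


def correction_factor(df: float) -> float:
    """Approximate correction factor J, equation (4.22)."""
    return 1.0 - 3.0 / (4.0 * df - 1.0)


def smd_independent(m1, m2, s1, s2, n1, n2) -> EffectSize:
    """d and g for two independent groups (df = n1 + n2 - 2)."""
    s_within = math.sqrt(
        ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    )
    d = (m1 - m2) / s_within
    v_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    j = correction_factor(n1 + n2 - 2)
    return EffectSize(d, v_d, j * d, j**2 * v_d)


def smd_matched(m1, m2, s_diff, n, r) -> EffectSize:
    """d and g for matched groups or pre-post scores (df = n - 1)."""
    s_within = s_diff / math.sqrt(2 * (1 - r))         # equation (4.27)
    d = (m1 - m2) / s_within                           # equation (4.26)
    v_d = (1 / n + d**2 / (2 * n)) * 2 * (1 - r)       # equation (4.28)
    j = correction_factor(n - 1)
    return EffectSize(d, v_d, j * d, j**2 * v_d)


# Inputs taken from the two worked examples in this chapter.
studies = {
    "independent groups": smd_independent(103, 100, 5.5, 4.5, 50, 50),
    "matched groups": smd_matched(103, 100, 5.5, 50, 0.7),
}
for label, es in studies.items():
    print(f"{label}: d={es.d:.4f} Vd={es.v_d:.4f} "
          f"g={es.g:.4f} Vg={es.v_g:.4f}")
```

Returning the same tuple from both functions is a design choice made for this sketch: once each study is reduced to an effect size and a variance, the later synthesis steps need not know which design produced it.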
While there are no technical barriers to using studies with different designs in the same analysis, there may be a concern that these studies could differ in substantive ways as well (see Chapter 40).

For all study designs the direction of the effect ($\bar{X}_1 - \bar{X}_2$ or $\bar{X}_2 - \bar{X}_1$) is arbitrary, except that the researcher must decide on a convention and then apply this consistently. For example, if a positive difference indicates that the treated group did better than the control group, then this convention must apply for studies that used independent designs and for studies that used pre-post designs. It must also apply for all outcome measures. In some cases (for example, if some studies defined outcome as the number of correct answers while others defined outcome as the number of mistakes) it will be necessary to reverse the computed sign of the effect size to ensure that the convention is applied consistently.

RESPONSE RATIOS

In research domains where the outcome is measured on a physical scale (such as length, area, or mass) and is unlikely to be zero, the ratio of the means in the two groups might serve as the effect size index. In experimental ecology this effect size index is called the response ratio (Hedges, Gurevitch, & Curtis, 1999). It is important to recognize that the response ratio is only meaningful when the outcome is measured on a true ratio scale. The response ratio is not meaningful for studies (such as most social science studies) that measure outcomes such as test scores, attitude measures, or judgments, since these have no natural scale units and no natural zero points.

[Figure 4.1 Response ratios are analyzed in log units. For Studies A, B, and C each response ratio is converted to a log response ratio; the summary log response ratio is then converted back to a summary response ratio.]
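The log-scale workflow for response ratios, set out in equations (4.30) through (4.36) of this section, can be sketched in Python as follows. The function names are hypothetical, and the use of 1.96 to form 95% limits on the log scale is an assumption of this sketch (the text speaks only of confidence limits in general). The inputs are those of this section's worked example.

```python
import math


def log_response_ratio(m1, m2, s_pooled, n1, n2):
    """Return (lnR, V_lnR) following equations (4.31) and (4.32)."""
    if m1 <= 0 or m2 <= 0:
        # A ratio-scale outcome is required; the log is undefined otherwise.
        raise ValueError("both means must be positive")
    ln_r = math.log(m1) - math.log(m2)
    v_ln_r = s_pooled**2 * (1 / (n1 * m1**2) + 1 / (n2 * m2**2))
    return ln_r, v_ln_r


def back_transform(ln_r, v_ln_r, z=1.96):
    """Convert lnR and its limits back to the ratio metric, (4.34)-(4.36).

    z = 1.96 (a 95% normal-theory interval) is an assumption of this sketch.
    """
    se = math.sqrt(v_ln_r)  # equation (4.33)
    return (
        math.exp(ln_r),
        math.exp(ln_r - z * se),
        math.exp(ln_r + z * se),
    )


# Inputs from the worked example: means 61.515 and 51.015,
# pooled within-group SD 19.475, n1 = n2 = 10.
ln_r, v_ln_r = log_response_ratio(61.515, 51.015, 19.475, 10, 10)
ratio, lower, upper = back_transform(ln_r, v_ln_r)
print(f"lnR = {ln_r:.4f}, V_lnR = {v_ln_r:.4f}")
print(f"R = {ratio:.4f}, limits = ({lower:.4f}, {upper:.4f})")
```

Note that no variance is computed for R itself; only the point estimate and its limits are exponentiated, which mirrors the order of operations described in the text.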
For response ratios, computations are carried out on a log scale (see the discussion under risk ratios, below, for an explanation). We compute the log response ratio and the standard error of the log response ratio, and use these numbers to perform all steps in the meta-analysis. Only then do we convert the results back into the original metric. This is shown schematically in Figure 4.1.

The response ratio is computed as

$$R = \frac{\bar{X}_1}{\bar{X}_2}, \tag{4.30}$$

where $\bar{X}_1$ is the mean of group 1 and $\bar{X}_2$ is the mean of group 2. The log response ratio is computed as

$$\ln R = \ln(R) = \ln\left(\frac{\bar{X}_1}{\bar{X}_2}\right) = \ln(\bar{X}_1) - \ln(\bar{X}_2). \tag{4.31}$$

The variance of the log response ratio is approximately

$$V_{\ln R} = S_{pooled}^2 \left(\frac{1}{n_1 \bar{X}_1^2} + \frac{1}{n_2 \bar{X}_2^2}\right), \tag{4.32}$$

where $S_{pooled}$ is the pooled standard deviation. The approximate standard error is

$$SE_{\ln R} = \sqrt{V_{\ln R}}. \tag{4.33}$$

Note that we do not compute a variance for the response ratio in its original metric. Rather, we use the log response ratio and its variance in the analysis to yield a summary effect, confidence limits, and so on, in log units. We then convert each of these values back to response ratios using

$$R = \exp(\ln R), \tag{4.34}$$

$$LL_R = \exp(LL_{\ln R}), \tag{4.35}$$

and

$$UL_R = \exp(UL_{\ln R}), \tag{4.36}$$

where LL and UL represent the lower and upper limits, respectively.

For example, suppose that a study has two independent groups with means $\bar{X}_1 = 61.515$, $\bar{X}_2 = 51.015$, pooled within-group standard deviation 19.475, and sample sizes $n_1 = n_2 = 10$. Then R, its variance and standard error are computed as

$$R = \frac{61.515}{51.015} = 1.2058,$$

$$\ln R = \ln(1.2058) = 0.1871,$$

$$V_{\ln R} = 19.475^2 \left(\frac{1}{10 \times (61.515)^2} + \frac{1}{10 \times (51.015)^2}\right) = 0.0246,$$

and

$$SE_{\ln R} = \sqrt{0.0246} = 0.1568.$$

SUMMARY POINTS

The raw mean difference (D) may be used as the effect size when the outcome scale is either inherently meaningful or well known due to widespread use. This effect size can only be used when all studies in the analysis used precisely the same scale.
The standardized mean difference (d or g) transforms all effect sizes to a common metric, and thus enables us to include different outcome measures in the same synthesis. This effect size is often used in primary research as well as meta-analysis