IHPI Seminar: Precision health, big data and evidence-based medicine – contradictions or companions?

– Hi, good afternoon everyone. Thanks for coming. I’m excited about having
today’s speaker here today. John Ioannidis. This is his second of three talks, so he’s doing the Michigan
Triple Crown here. So yesterday, he talked to us a little bit about the world of bias. Today he’s gonna talk
about precision health, and then tomorrow, as part of the Department of
Internal Medicine Grand Rounds, he’s gonna be wrapping up that series as well talking about reproducible and useful clinical research. So if you have the ability, try to catch all three of the talks. They’re sure to be really remarkable. I just wanted to start really with just a brief introduction. You could go on really all
afternoon talking about John. He’s the C.F. Rehnborg Chair in Disease Prevention
at Stanford University, where he’s also Professor of Medicine and Professor of Biomedical Data Science and a Professor of Statistics, and he also co-directs METRICS, which is their institute on meta-research, the study of how we
should be doing studies, and John has had an interesting
kind of career and path. He was born in New York City
but then grew up in Athens, and he’s benefited from
actually experiences on both sides of the Atlantic. But currently he’s at Stanford University, and it goes really without saying that he’s probably one
of the most original and influential physician-scientists
of his generation. I could go on and on with
this specific regard, but I’m gonna share a particular story which I’ve found really compelling. If you do know John, you know him through his magnum opus, right? His paper “Why Most Published
Research Findings Are False”, a paper published in
2005 in PLOS Medicine, a paper that’s been cited nearly
7,000 times at this point. What’s really fascinating about that story is when you ask John well how did you come up with that idea, how did you write that paper? It’s a single-authored paper. What I’ve learned over the last couple of days is he did it,
as most of us would do, on a vacation on a Greek island. He wrote that paper on Sikinos, which is a tiny island off the
southeastern part of Greece. It’s very close to other islands that some of you might
know, Santorini, etc. As opposed to those kinds of glamorous vacation
destinations, this is an island that has about 250 people in it. I just picture John Ioannidis in 2005 over a period of 48 hours typing out what has become
probably one of the most seminal works in our scientific
and meta-research fields. It’s a real honor to have you here, John, and we look forward to hearing your talk. (audience applauding) – Thank you. Thank you for the very kind invitation and the wonderful introduction. I will try to share some
thoughts on precision health, big data, evidence-based medicine, these are all terms that
have very strong friends and very strong skeptics also, probably among the audience and beyond. What is evidence-based medicine? We have to go back to the original coining of the term by David Eddy, who in 1990 said it means “consciously anchoring a policy, not to current practices or the beliefs of experts,
but to experimental evidence.” And this means that the pertinent evidence must be identified,
described, and analyzed. So you have experiment and
some effort at synthesizing and putting together
information and analyzing it. David Sackett probably has written the classic definition of
evidence-based medicine. He says “It’s the conscientious, explicit, “and judicious use of
current best evidence “in making decisions about the
care of individual patients.” And I underline the word individual. And that means integrating
individual clinical expertise, again, individual is very prominent there, with the best available
external clinical evidence from systematic research. That definition immediately
has two major components. One is an individualized approach, both at the patient level
and at the clinician level, and therefore also at the level of their interaction, their encounter. And then a second component, which is science, information, evidence. The best possible evidence,
the best possible science, put together in the most unbiased manner. Seeds of wisdom and of
debate in these definitions. Experiment, which practically means randomized controlled trial. Evidence-based medicine
became almost synonymous with the advent of the need to get better, larger, and more
relevant clinical trials, randomized controlled trials. Systematic approach, systematic research: that translated into systematic reviews and meta-analyses being a tool for promoting integrated evidence. Individual patients, precisely so. So here are the roots that lead exactly to precision medicine and individualized medicine. And individual clinical expertise, again, is extremely well aligned with what we are talking about nowadays as individualized and precision health. Over the years, many hierarchies of evidence have been proposed. And the most popular
ones have meta-analysis and systematic reviews at the top. Then you have randomized
controlled trials following that. And very close to the level of no value or even negative
value, you have experts, and even below that, you have tweets of experts or powerful people tweeting all the time. Then, you have multiple types
of evidence that have evolved. Clearly, the traditional
evidence-based medicine has been dealing with meta-analysis and randomized trials, so
clinical types of information, but there’s also observational evidence, there’s mechanistic evidence,
there’s other evidence, and if you look across 180
million scholarly documents that are floating around
with 20 million authors, there’s lots of sand in
that desert of information that a clinician is trying to go through on a daily basis, multiple
times, back and forth. And a scientist is trying to make sense of and survive and get to some
oasis of real discovery. We have also learned that
evidence is less than optimal. Usually, these pyramids are destroyed, like this poor destroyed
pyramid in Abu Rawash. We lack the type of evidence
that we want to have in place in order to have actions that we feel certain are going
to do more good than harm. Also, pyramids can be bulldozed
by property developers, like this in Peru, and we
have learned over the years that evidence can be severely affected by conflicts of interest of
stakeholders who support, sponsor, and develop,
and disseminate evidence. Most of that is financial
conflicts of interest, but increasingly, we recognize that there are other conflicts of interest that could also be
important in some settings. Some of them stemming from
just genuine human curiosity. But still, you can get a
lot of strong allegiance, bias even by people who
otherwise have good intentions. There have also been other takes on that standard picture of evidence-based medicine. One of them that
flourished in the mid-1990s was thinking that N-of-1 trials should be at the top instead of having the composite picture from all the studies, all the trials that have been done, put together, in meta-analysis,
in a systematic review, we should just look at what happens at the single-patient level. So N-of-1 trials, Gordon Guyatt and others proposed pyramids that
had them at the top, and many people favored that,
but very soon we realized that N-of-1 trials were not really doing what they were supposed to do. So N-of-1 trials, which
are currently proposed as the new wave of understanding and promoting precision
health and precision medicine, actually are a design that
was introduced in the sixties. We got its basic methods
correct in the 1980s, flourished in the early 1990s, and abandoned about 20 years ago. Now, they’re being resurrected. Why were they abandoned to start with? Because they have these caveats. They are not good if the disease does not have a steady, natural history. They’re not good if
there’s carryover effect. They’re not good if there
are priming effects, and if the effects depend on the sequence of previous choices that have happened or have been utilized in the same patient. They are not good if the
disease has a fatal outcome in a relatively short course because then you have no
time to test multiple options and choose which one’s the best. And they’re not good if there’s poor or unpredictable compliance,
adherence, tolerability, meaning if there is real life. Therefore, N-of-1 trials met all of these challenges and didn’t really move very far.
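To make the design concrete, here is a minimal, purely illustrative sketch of what the analysis of a single N-of-1 crossover could look like; the patient, the outcome, and all the numbers are invented, and this is not an analysis from the talk.

```python
# Hypothetical N-of-1 trial: one patient alternates between drug A and placebo B
# across paired periods (order would normally be randomized and blinded).
# All outcome values are made up purely for illustration.
import numpy as np
from scipy import stats

pain_on_A = np.array([3.1, 2.8, 3.4, 2.9, 3.0, 3.2])  # pain score in A periods
pain_on_B = np.array([4.0, 3.9, 4.4, 3.6, 4.1, 3.8])  # pain score in B periods

# Within-patient paired comparison: does A lower pain relative to B?
differences = pain_on_A - pain_on_B
t_stat, p_value = stats.ttest_rel(pain_on_A, pain_on_B)

print(f"mean within-patient difference: {differences.mean():.2f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
# The caveats listed above (carryover, priming, unstable natural history,
# fatal short-course disease, erratic adherence) are exactly what this
# naive paired comparison ignores.
```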
So here comes precision medicine or precision health, as I see that you prefer
this term at Michigan. We use exactly the same term at Stanford. I’m not sure which one is best. So what is that? I don’t know, I’m supposed to be an expert in precision health, but
I have no clue what it is, so I went to Wikipedia to find out, and here’s the definition
that Wikipedia gives, “It’s a medical model that proposes “the customization of healthcare
with medical decisions, “practices, or products being tailored “to the individual patient.” So very prominently, again,
the individual patient. We’re back to the definitions of evidence-based medicine of the nineties. Now the individual, if you simplify in some sort of semi-mathematical terms, is an n of one, the opposite extreme from the population. By definition, precision medicine
is thus aiming to have the most tiny and the most
negligible impacts possible at the population level. I mean, this is the starting point. We’re trying to have an
impact on individuals rather than populations, and therefore we want to get
the most negligible impact. Big data. Again, I couldn’t find a good definition. I found probably about a hundred different definitions on what they are, and this is my attempt to define big data. Big data, it is data that carries the least possible
information content per unit. The more insignificant the content of the information per unit,
the bigger the big data. So why is that? Because if we really had information with a lot of content that is meaningful and useful, we wouldn’t need big data. Why should we waste our hard drives and our time and our resources
and our computational time if we could just measure one thing and that would be the answer
to all of our problems? The fact that we need to measure all these information units to try and build together
something that is useful means that we are struggling with minimal information per unit, but hopefully with so much of it that, if you multiply the information per unit by the amount, it can still be useful. It’s the exact opposite of Bradford Hill’s request about what we should believe. Bradford Hill, one of the fathers of clinical epidemiology, said “I’m willing to believe if something can be checked on the back of an envelope.” If it’s two plus two, or maybe an odds ratio from a two-by-two table that can fit on the back of an envelope, that’s okay. If it’s more than that, forget it. It’s too complex for an epidemiologist, let alone for an everyday clinical practitioner, let alone for a poor patient who tries to get some benefit.
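Just for scale, here is the kind of back-of-the-envelope calculation he had in mind; the two-by-two table counts are invented for illustration.

```python
# Back-of-the-envelope odds ratio from a hypothetical two-by-two table.
#                 events   non-events
# exposed             20           80
# unexposed           10           90
a, b = 20, 80   # exposed group (made-up counts)
c, d = 10, 90   # unexposed group (made-up counts)

odds_ratio = (a / b) / (c / d)   # (20/80) / (10/90) = 2.25
print(f"odds ratio = {odds_ratio:.2f}")
```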
How can we have these two opposites come together? I would argue that we can
start from the premise that precision medicine
or precision health is the study of the most insignificant, and then use one quote from
one of my favorite poets, Odysseus Elytis, that “You’ll
come to learn a great deal “if you study the insignificant in depth.” So precision health is a way to study the
insignificant in depth. How deep can we go? Before we decide how far into the abyss we can go, let’s try to see what we have in place and whether we really have the fathometer to reach the abyssal end of all this data. In 2018, evidence-based medicine has lots of data. Not necessarily reliable, but we have lots of data both at the individual level and at the population level. Sometimes we have both
ends of the spectrum being heavily armed in information. We have big and small, deep and shallow, broad and narrow types of databases. We also have the
patient-clinician interaction that is still there, but
is probably suffering much of the time because of limited time and because most physicians have to deal with a computer and with data rather than find time to even talk with the patient. And we believe and we have evidence, and pretty good evidence, actually, that shared decision-making
is a good thing. So if that information can communicate and can be shared meaningfully
between physicians and patients, that would be really nice. What kind of information are we going to share with patients? Unfortunately, despite the fact that we have tons of information, very little of that is clearly useful. This is an analysis that
we did a couple years ago. We took 1400 topics in
medicine that had been assessed in the Cochrane Database
of Systematic Reviews. And of those, less than half, 43% had GRADE summary
of findings assessments. GRADE is the Grading of Recommendations Assessment, Development and Evaluation tool, which tries to assess the quality of the evidence when you have data from one or multiple trials. What was happening in the other 57%? Well, mostly there was no evidence, and this is why there was no GRADE assessment. In the cases where we did have evidence, looking at the first primary outcome, only 13.5% of the time we
had high quality of evidence. Even when looking at all
outcomes that had been assessed, not just the primary ones, only 19.1% had at least one outcome with high quality of evidence. If you limit your focus to the reviews that had high quality
of evidence available, and significant results,
nominally significant just with the typical 0.05 threshold and a favorable interpretation
of the intervention, meaning someone concluded
so this is a treatment that is good to use, only 25 out of the almost 1400 topics had this type of situation. So less than 2% of medical topics had high quality evidence,
significant results and someone said yes, go ahead and do it. 98% of the time, we had modest or very large uncertainty about how exactly to
deal with populations, let alone patients. One confounding problem is that we don’t have many discriminatory tools that can tell us which among this 98% of topics where we have uncertainty, we can lean more towards saying okay, maybe we have some evidence to act or maybe we don’t have enough evidence and maybe we need more or maybe we have some evidence not to do anything. The problem is that our typical tools that have been used for discrimination since the times of R.A.
Fisher have become obsolete. Almost all scientific papers claim that they have found statistically and/or conceptually significant results. Obviously, all of my grant proposals claim that what I’m planning to do will be highly significant
one way or another, although I mostly submit
mediocre ideas for funding. And if you look across the entire PubMed worth of abstracts and full text papers, 96% of them report statistically
significant results. This is an analysis that
we published in JAMA. About two years ago we looked at close to 15 million abstracts
in PubMed from 1990 until 2015, and close to one million
full text articles from PubMed Central. 96% of that literature
claims significant results. Practically, whenever they had p-values of some sort listed,
practically all of them claim to be novel. It looks as if discovery
has become so commonplace that it’s a boring nuisance at this point. So how can we tell what to use out of that huge mass of information? To make things worse, almost any result can be obtained unless we pre-specify what kind of analysis we’re going to do, and the availability of big data is making that challenge even bigger. This is the Janus phenomenon, after the Roman god who can see in both directions. And these are data from the National Household Survey. What I’m plotting for you here, for example in this panel, is whether alpha-tocopherol, or vitamin E, is associated with the risk of death. And there’s a cloud of one million different results. This is the hazard ratio, and this is the minus log10 p-value obtained for the very same question being addressed in the very same database. How do you get one million different options? Practically, death can be affected by many other variables, so for each variable that you can adjust or not adjust for in the regression, you have two options. If you adjust or do not adjust for 19 such covariates, two to the nineteenth power already gives you over half a million model specifications, and one more covariate takes you to a million. And 70% of the results suggest that vitamin E decreases the risk of death; 30% of the results suggest that vitamin E increases the risk of death. So Janus is looking in both directions, and depending on what you enjoy the most, you can report that vitamin E is great or that vitamin E is horrible.
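To make this "vibration of effects" concrete, here is a minimal sketch on simulated data; it is not the actual survey analysis, it uses logistic regression as a stand-in for the survival model, and it loops over every subset of just eight candidate adjustment covariates (with 19 or 20, the same loop would enumerate the roughly one million specifications mentioned above).

```python
# Illustrative "vibration of effects": one exposure, many optional adjustments.
# All data are simulated; this is not the vitamin E analysis itself.
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_covariates = 2000, 8            # 8 covariates -> 2**8 = 256 models
X = rng.normal(size=(n, n_covariates))
exposure = rng.normal(size=n)        # e.g. a standardized vitamin E level
# Outcome depends on covariates but, by construction, not on the exposure.
logit = -1.0 + 0.4 * X[:, :4].sum(axis=1)
death = rng.binomial(1, 1 / (1 + np.exp(-logit)))

estimates = []
for k in range(n_covariates + 1):
    for subset in combinations(range(n_covariates), k):
        design = np.column_stack([exposure] + [X[:, j] for j in subset])
        design = sm.add_constant(design)          # constant first, exposure second
        fit = sm.Logit(death, design).fit(disp=0)
        estimates.append(np.exp(fit.params[1]))   # odds ratio for the exposure

estimates = np.array(estimates)
print(f"{len(estimates)} model specifications")
print(f"odds ratio range: {estimates.min():.2f} to {estimates.max():.2f}")
print(f"fraction suggesting 'protective' (<1): {(estimates < 1).mean():.2f}")
```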
When results seem to be more credible, unfortunately we still have the problem that most of them are not necessarily patient-relevant. And one might argue that, in an individualized precision setting, we need to ask patients one at a time what exactly is relevant to them. So that would be the ideal, and then we would have a different textbook of medicine for each
patient one at a time. However, there are some outcomes that are important no matter what. So I think that death, for example, is a very important outcome regardless. Some people may still say “I want to die. “Please, get me to die.” Possible, but I think that still, even for that patient, if we can make him change his mind, and let him live longer and live a good life, that
would not be really that bad. So are there some outcomes that would be ubiquitously important so that regardless of the particular patient, they would be interesting to know what we can do with
different interventions at the population or individual level. There’s initiatives like
COMET that are trying to put together such a list of outcomes that seem to be essential to study for different diseases. And in the case of preterm infants, where for good or bad, the patients cannot even express their wishes on what is the important outcomes for them. Things like chronic lung disease are clearly very important. You know, you can measure many, many other things, but chronic lung disease is a major problem for preterm infants. So you want to know what
different interventions would do in that regard. When we checked more
than a thousand trials on preterm infants, less
than a third of them actually reported on chronic lung disease. Two-thirds plus did not mention something that was so ubiquitously
important to know about. Another premise about precision is that, since we’re talking
about one patient at a time, hopefully the effects associated with that intervention should be really large. If these individualized interventions have tiny effects, who cares? We’ve had so many tiny
effects floating around. The big promise is that
now we’ll put together biology, lots of data in
formatics, complex analysis black boxes, and will
get you a huge benefit that if you have that profile, you will really do much, much better. What does our previous
experience tell us when we look at evidence across medicine in terms of how often we see large effects? This is an empirical
evaluation we published about five years ago in JAMA where we took every single meta-analysis
that we could track in the Cochrane Database. There were 85,000 meta-analyses that we could track,
and we asked how often do we see treatment effects that are large and that are also replicable,
meaning they are seen in more than two studies, and hopefully they have good statistical support and hopefully they also have no clear bias that
is visible and that would invalidate our trust in them. What we found is that large effects are very common, but they’re very common in very small, early trials and very few
of them survive scrutiny. So if you ask for a mortality effect, where you have a five-fold
reduction of mortality risk, that has been seen in two trials with a p-value of less than 0.001, and with no florid evidence
of something being wrong in the evidence, across these 85,000 meta-analyses there was only one topic, extracorporeal membrane oxygenation, that clearly achieved that kind of gold-standard status of huge benefit. Is that all? Are there no other interventions that are as effective
as wearing a parachute when you jump from an airplane? Yes, there are more in medicine. Obviously, if you have someone with diabetic ketoacidosis
and you don’t give insulin, it’s like letting them fall from the airplane without a parachute, or well, depending on how serious it is, maybe jumping from the 10th floor, but yeah, a small chance for survival. There are such, but
Paul Glasziou has created a list of such huge effects that have been thoroughly validated, large benefits clearly so. And the list includes
about 20 interventions, maybe 25 at the most. What is happening is that we do see lots of large effects, like odds ratios of five plus, sometimes even for mortality, but they’re seen in very small studies, on average with 12 participants’ worth of data. And whenever we perform yet another study, the effect goes away. It either completely evaporates or it becomes much, much smaller, so then it’s questionable whether it’s really worth it or not. Based on a lot of experience, the data that we compiled here included about a quarter of a million
randomized trials. We know that very large effects are not uncommon when we’re dealing with small numbers, which is the typical recipe for most of the studies currently being done in precision medicine circles, but most of those really need to be validated before we can be certain that they’re not flukes and would not disappear.
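As a hedged illustration of why tiny trials so often show huge effects, here is a toy simulation, not data from the talk: many small two-arm trials with only a dozen patients per arm, a genuinely modest true effect, and a count of how often the estimated odds ratio nonetheless lands beyond a five-fold effect in either direction.

```python
# Toy simulation: small trials + modest true effect -> frequent extreme odds ratios.
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_per_arm = 10_000, 12
p_control, p_treated = 0.40, 0.30          # true odds ratio ~ 0.64

events_c = rng.binomial(n_per_arm, p_control, size=n_trials)
events_t = rng.binomial(n_per_arm, p_treated, size=n_trials)

# 0.5 continuity correction so zero cells do not blow up the estimate.
a, b = events_t + 0.5, n_per_arm - events_t + 0.5
c, d = events_c + 0.5, n_per_arm - events_c + 0.5
odds_ratios = (a / b) / (c / d)

extreme = np.mean((odds_ratios > 5) | (odds_ratios < 1 / 5))
print(f"true odds ratio: {(0.30 / 0.70) / (0.40 / 0.60):.2f}")
print(f"share of tiny trials with an estimated OR beyond 5-fold: {extreme:.1%}")
```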
Quality could also confound the picture. Quality problems are prevalent in clinical research, much like in other disciplines. Some of them may be unavoidable. Sometimes masking, for
example, is not an option. Randomization, very often, is
not as proper as it should be. Allocation concealment sometimes cannot be guaranteed, and when it can be guaranteed, it is not always adhered to. We have lots of evidence,
and this is summarizing the results of the BRANDO project, where I was involved, where we performed a meta-meta-meta-analysis including several thousand clinical trials and several thousand meta-analyses. Practically, we concluded that if you have problems with randomization, if you have problems with
allocation concealment, if you have problems with blinding, you’re likely to get, on
average, inflated effects. However, the average
distortion is relatively small compared to the heterogeneity in the distortion that you can get. Most of these distortion effects are much larger compared to the precision effects that
we’re chasing currently. So, unless we can fix
these quality problems, we don’t really know whether these effects that will emerge will be
true or would be spurious. And, even worse, because
of the large heterogeneity in the amount of distortion
that these biases introduce, we will not be able to just correct by saying that well, we
failed three aspects, we didn’t have randomization, so let’s correct by
dialing down the effect size by 10% and we will be fine. It will be 10% on average, but it could be 80% in some cases. It could even be in the opposite direction in other cases, and this
is impossible to really know. Another confounding factor, much clinical research nowadays is done outside the US and
Europe, meaning outside the countries that have a long tradition of research, in particular of clinical research. A lot of research is done in China, a lot of research is
done in Eastern Europe, very often the price for running these studies there is very low. You can run a clinical
trial with one-tenth or one-fiftieth sometimes of the cost that you would need to do
in Michigan or at Stanford, where the cost would be prohibitive. Are these results unbiased? Here’s another empirical
evaluation where we looked, again, across all the Cochrane Database in situations where mortality outcomes had been assessed in
European and American trials, and also in trials done in countries that don’t necessarily have such strong tradition of clinical research. Here’s one example. If you look at calcium antagonists in aneurysmal subarachnoid hemorrhage, European and American studies show practically no significant benefit. It’s very small, if
there’s anything there. And clearly nowhere close to even a nominal level of significance. If you look at a small study in China, there’s an 85% reduction in the odds, suggesting that this is a tremendously effective intervention. Is it that this study is completely flawed because it was done in China? I think not. If you look carefully, its quality scoring may look pretty good, but
probably what’s happening is that there’s many, many more studies that have been done in China or in Eastern Europe,
or in other countries. And then, they are trying to get through the bottleneck
of getting published. And if you get a study that has completely negative results in the US, you know that it’s not so easy to get it published in the New
England Journal of Medicine. Well, probably you can get it published in some specialty journal. If you get a negative result in China, you will not be able to publish just anything, number one. Number two, you will not
get a financial bonus. There’s bonuses given by
the Chinese government institutions that could be up to $100,000 for a paper in nature,
and few hundred bucks for something that you can
publish in a more modest venue. And three, you run the
risk of being flogged with the whip because you
found a negative result. So you have to take into account what is the research environment in different communities,
and how that might shape the dissemination of
results from small studies that may seem to have extravagant results. On average, studies from
less developed countries had an inflation of the odd ratio for mortality of about 15%, which is larger than the
average treatment effect of the most effective treatments that we have to curtail
the burden of death. Here’s another example
from the TOPCAT study where investigators, and this from the New England Journal
of Medicine, realized that spironolactone behaved very differently at American and European sites versus sites in Russia and Georgia. And trying to go into more depth, they realized that the experience of the patients in Russia and Georgia was entirely different. Apparently, probably, these people had not even been treated with spironolactone or with whatever assignment they were supposed to receive. Further confounding comes from sponsor bias. There are some types of clinical trials that almost always favor the sponsor. A couple of years ago, we looked at 57 non-inferiority trials
with head-to-head comparisons. We took the largest ones and we found that 55 of them showed results that were favorable for the sponsor. The success rate was 96.5%. I would argue if an experimental design has a success rate of 96.5%, is that anywhere close to equipoise? Why do we need these studies? We can get rid of them. We can say well, this
study will be successful, the drug will look very nice, and move to the next step. What is wrong here? Is it that industry-sponsored trials, and this is becoming
more and more prevalent in the case of precision
interventions, are biased? Is it that the industry has evil people who are cooking up the data and disseminating false results? No, if you look at the quality of these trials based on our traditional checklist,
they look very nice. The problem is that the design is such that it’s trying to optimize the chances of getting a nice-looking result. The choice of comparators, the choice of the non-inferiority margin, the choice of the setting, and the choice of the exclusion criteria are such that you’re almost doomed to get a nice-looking result. This is very difficult to decipher. There’s a whole science behind the science of how to always get nice-looking results.
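For readers less familiar with the design, the generic non-inferiority decision rule (stated here as a general illustration, not as the exact criterion used in those 57 trials) shows how much hinges on the sponsor-chosen margin:

```latex
% Generic non-inferiority criterion (illustration only)
\hat{\Delta} = \hat{\theta}_{\text{new}} - \hat{\theta}_{\text{comparator}},
\qquad
\text{declare non-inferiority if } \hat{\Delta}_{\text{lower CI bound}} > -\delta .
```

The wider the pre-specified margin $\delta$, and the more favorable the comparator dose, setting, and exclusion criteria, the harder it becomes for the new drug to "fail."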
A further problem: can we trust the data? This is Study 329, and it’s not a submarine, although it sounds like one. It’s a randomized trial that resurfaced 15 years after its submersion. When it had first appeared, it was the pivotal trial suggesting that in major depression in adolescents, paroxetine and imipramine, two antidepressants, were very effective and very safe. Fifteen years later, independent investigators got hold of the individual-level raw data and they reanalyzed
the trial from scratch, and they concluded that both
paroxetine and imipramine are not effective and are not safe. Entirely the opposite conclusion. Can we trust the data? We want to trust the data. Unless we can trust what we read, we are really into deep trouble, because then how can we tell which cases are like 329 and which cases we can trust? Moving forward, here’s a larger evaluation of reanalysis of raw data
in individual trials. We tried to unearth every single case where a paper had been
published reanalyzing an original trial for the
very same clinical question, but it had been done in a separate publication than the original. We found 37 such cases in that paper that we published in JAMA in 2014. 35% of the time, the conclusions of the reanalysis were entirely opposite or very different compared
to the conclusions of the original analysis. The original had claimed
that the drug is effective. The reanalysis, that it is not effective. Or vice versa. The original had claimed that this is the subgroup, the characteristics of patients who need to be treated. The reanalysis said nope,
these are the characteristics of the patients who would
benefit from treatment. What is going on here? Is it that we have rogue reanalysts who are trying to make
a name for themselves and are trying to put the
original investigators into disgrace just by manipulating lots of fancy reevaluation
of the analytical space? Actually, these papers are
almost always published by the same original investigators who published the first paper, but it is happening in
an environment where, if you spend extra time to
reanalyze your own data, you cannot publish a second paper. You cannot say that I spent extra time and had a second look,
and I get the same result, and I get a second paper. Conversely, if you can
get a different result or if you can say that you
get a different result, then you can get a second paper. It sounds very confusing, but this is the incentive
structure that facilitates this flavor of irreproducibility in that setting. Here’s a very different environment, where raw data are available routinely. Two journals, PLOS Medicine and The BMJ, have routinely made
availability of raw data a sine qua non, a
prerequisite to publication, and they encourage that
all data need to be shared for any trial that they have published in the last several years. So we communicated with the authors of all these clinical trials, and we asked them to
send us all the raw data from all the trials
that they had published in these two journals. We got 46% of them. Is that half full? Is it half empty? It’s close to half, though. And then we spent time to reanalyze all these clinical trials from scratch to see what results we got. That was not bad. It’s a pretty complex graph here, but all these points are
very close to the diagonal. If they were exactly on the diagonal, it would mean that we got
exactly the same results. There were a few errors
here and there that we got. I mean, we got in touch
with investigators. They said yeah, thank
you for picking this up. But none of these were such that the conclusion of the
trial would be invalidated. So in a different environment,
in a different culture, where everything is supposed to be shared, where you have journals that have very strong enforced policies, where you have the most
transparent investigators who want to do that and
go through that policy, and you have even the subset of those who are saying here, go, take
my data and reanalyze them, almost everything looks fine. So is it that the culture is what matters? Or is it that we just
deal with different types of selection biases
and selection processes? Another promise is that, since we have all these problems with the
traditional study designs that depend on availability
of randomization and huge expenses, maybe
we can replace them with routinely collected data. So there are tons of analyses based on routinely collected data that try to estimate treatment effects, and our methods are becoming more and more sophisticated in trying to incorporate patient characteristics that define populations that can be almost matched to a randomized equivalent.
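Here is a minimal sketch of the kind of propensity-score analysis being described, on simulated data rather than any of the actual routinely-collected-data studies: model the probability of treatment from patient characteristics, then reweight by the inverse of that probability to compare outcomes.

```python
# Illustrative propensity-score analysis on simulated routinely collected data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 20_000
covariates = rng.normal(size=(n, 5))             # e.g. age, comorbidities, ...
# Confounded assignment: sicker patients (covariate 0) are treated more often.
p_treat = 1 / (1 + np.exp(-(0.8 * covariates[:, 0] - 0.5)))
treated = rng.binomial(1, p_treat)
# Outcome (death) depends on the same covariate; the true treatment effect is null.
p_death = 1 / (1 + np.exp(-(-2.0 + 1.0 * covariates[:, 0])))
death = rng.binomial(1, p_death)

# Step 1: propensity score = P(treated | covariates).
ps = LogisticRegression(max_iter=1000).fit(covariates, treated).predict_proba(covariates)[:, 1]

# Step 2: inverse-probability weighting, then compare weighted death rates.
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))
rate_t = np.average(death[treated == 1], weights=w[treated == 1])
rate_c = np.average(death[treated == 0], weights=w[treated == 0])

naive_or = ((death[treated == 1].mean() / (1 - death[treated == 1].mean()))
            / (death[treated == 0].mean() / (1 - death[treated == 0].mean())))
ipw_or = (rate_t / (1 - rate_t)) / (rate_c / (1 - rate_c))
print(f"naive odds ratio: {naive_or:.2f}   IPW odds ratio: {ipw_or:.2f}")
# Whether the weighted estimate really matches a randomized trial depends on
# having measured (and modeled) all the confounders -- the crux of the debate.
```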
So these are results of routinely collected data studies using propensity matching or propensity score adjustments
versus randomized trials that were done afterwards. So when the routinely collected data study was released and published, there was no randomized controlled trial against which one could compare notes, but subsequently there were randomized trials published. And these are mortality outcomes. The average difference is a
31% difference in the odds. It’s huge. It’s like three times the
average effect for mortality that we see even for very
effective interventions. So if things really are that bad, clearly this is not a panacea for replacing the paradigm of randomized trials. Another solution is to
build prediction tools, prediction models, prediction algorithms into mapping the
individuals or the subgroups or the substrata that
would do better or worse or have different magnitudes
of treatment effect. I’m very fond of that. I’m doing quite a lot of
research in that space. I still hope that
something major will arise, but I think most of my papers
are probably not worth much. This is looking at the field of cardiovascular disease risk prediction, and this is the number of articles that present new predictive models for cardiovascular disease. There are about 400 different models that have been presented in the literature for cardiovascular disease
and as you well know, when ACC/AHA decided to
release new guidelines, they looked at the available predictive models and said none of them was good enough, so they developed yet another one, which everybody feels is one of the worst models. It’s not calibrated, it’s completely misaligned with risk levels, and so forth. So how can we really find predictive tools that are validated but also useful, and for which there is some consensus that people would like to use them? The alternative approach is: why should we have everybody
using the same model? Maybe each hospital, each
center can use their own model. So we have a new wave of lots of studies that take electronic health records and they develop a local model that actually can be
reevaluated and updated and upgraded and changed in real time as we get more information and more data. So this is an empirical
evaluation that we did with Ben Goldstein and
his colleagues at Duke. We looked at EHR-based predictive models and this is the sample
size that they used. As you see, some of them
go up to a million plus of individuals, which is very good. The problems of small
studies that we were facing with randomized trials is clearly gone. We are facing the exact opposite: overpowered situations. The number of events, again, very nice. This is 10,000, sometimes even more than that, and for the number of variables, sometimes we can include a thousand variables, even 10,000 variables, in the model. Try to visualize and communicate to a physician or to a patient a model with 10,000 variables that we have included. That would take probably four years to explain. So you just say it’s really good, trust me, it’s gonna be fine. How well do these models do? They don’t really do that well. I have to say that I was disappointed. I was expecting to see some AUCs or some reclassification rates that would be much better than what we see. On average, the AUCs were like 0.7. Like the oldest models that we had in the 1940s, when we could only measure CBC and creatinine. Having lots of variables, I’m not saying it’s a bad thing, but most of these models don’t really add much in terms of predictive ability.
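As a hedged sketch of what such an EHR-style model and its AUC look like, here is a simulation (not the Duke evaluation itself) with many mostly uninformative variables:

```python
# Illustrative EHR-style prediction model: many variables, modest AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n, n_features = 20_000, 200               # big sample, lots of variables
X = rng.normal(size=(n, n_features))
# Only a handful of variables carry signal; the rest are noise.
signal = X[:, :5] @ np.array([0.5, 0.4, 0.3, 0.2, 0.2])
y = rng.binomial(1, 1 / (1 + np.exp(-(signal - 2.0))))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"events: {y.mean():.1%} of {n} patients, AUC = {auc:.2f}")
# Piling on hundreds of uninformative variables does not push the AUC much
# beyond what the few truly predictive ones provide.
```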
I don’t need to remind you that subgroup analyses have a tradition of leading
us down the wrong path. There has been a lot of discussion for many years, even preceding precision
medicine about using subgroup and effect modification
and stratified medicine. And the classical example
probably is from that paper in The Lancet where these
are the months of birth. So this is the zodiac cycle, the horoscope, and this is the absolute risk reduction from carotid endarterectomy in people with symptomatic stenosis. Clearly, huge subgroup differences. Heterogeneity p less than 0.0001, based on the zodiac sign, but obviously this is 0% likely to be true. If you look across the literature
of such subgroup claims, which are like the first step to inform some sort of stratification
towards personalization, most of the claims that have been made in the literature cannot be reproduced. We published a couple of
papers where we looked at all the sex based subgroup differences across all Cochrane meta-analyses
that we could identify in randomized controlled trials and we found very little that was reproducible across multiple trials. There were lots of sex differences if you looked at single trials. They were just very common. But something that was seen
again and again, very uncommon. How about if we combine lots of databases? Lots of randomized trials. We perform meta-analyses of raw data of individual-level data, and then try to identify closely validated
subgroup differences. This is another empirical analysis where we looked at all the meta-analyses of individual-level
data that have been done to date and these are the results for subgrouping variables
that seem to discriminate between different groups that have different behavior to treatment. These are the p-values
for individual-level subgrouping variables and for group level where the entire group has the same value. For example, the type of
dose that is being assigned. P-values of 0.05 or close to that, and 0.01 or close to that, are really not very nice looking. If you translate them to Bayes factors, they translate to Bayes factors of something like 2.5 to 5.
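One common calibration that yields numbers of this size, and I am assuming, not asserting, that something like it underlies the figures quoted here, is the Sellke-Berger upper bound on the Bayes factor against the null:

```latex
% Upper bound on the Bayes factor against the null, given an observed p-value
\mathrm{BF}_{10} \;\le\; \frac{1}{-e\,p\,\ln p}
\quad (p < 1/e),
\qquad
p = 0.05 \;\Rightarrow\; \mathrm{BF}_{10} \le \frac{1}{-e \cdot 0.05 \cdot \ln 0.05} \approx 2.5 .
```

So even a nominally significant p-value of 0.05 corresponds, at best, to quite weak evidence against the null.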
From a Bayesian perspective, this is worth mentioning, but is it likely to be true? No, not really. Probably a small minority
of those may be true. If you look at the magnitude
of the effect modification, so how much difference
in the treatment effect do you get in people who have
different types of covariates? The average magnitude is less than 0.2, and 0.2, traditionally, on a standardized difference scale, is the threshold for a small effect. So the effect modification, on average, is less than small.
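For reference, the standardized difference invoked here is the usual standardized mean difference, with 0.2 conventionally read as a small effect:

```latex
% Standardized mean difference (Cohen's d) and the conventional benchmarks
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},
\qquad
d \approx 0.2 \text{ (small)}, \; 0.5 \text{ (medium)}, \; 0.8 \text{ (large)}.
```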
Eventually, we want to get something that’s useful. This is what evidence-based
medicine was supposed to do. This is what precision medicine or precision health was supposed to do. But getting useful clinical
research is not easy. In a paper in PLOS Medicine two years ago, I tried to come up with eight criteria or features of what we really want. And it’s the same regardless of whether you believe in evidence-based medicine or in precision medicine. First, we need to have a problem. We need to have a problem to fix. If there’s no problem to fix, just creating problems, creating diseases that don’t exist, just
moving the definition so that everybody becomes sick and needs treatment,
that’s not a good idea. Second, we need context placement. We need to know what we
already have available in terms of information. Maybe we have none, or
maybe we have 522 trials, as we do in antidepressants. And then why do we need another one? We need information gain. Is that new study small, big, randomized, nonrandomized, going to tell us something? If not, or if it’s going
to tell us something only if it gets a
particular type of result, this is not a good way to decide to do it. Pragmatism, does it reflect real life? Or if it deviates, does it matter? Patient centeredness, does it reflect top patient priorities? Have we asked patients
what do they really care about with the background
disease that they have? Value for money, is the
research worth the money? There’s formal ways to assess
that with proper tools, but it’s very rarely done. Feasibility: can it be done? About 35% of randomized trials in surgery are abandoned because of futility. They assume that they can get 50 patients on board, but
after six months or one year, they have only enrolled four or five. And finally, transparency,
are the methods, the data, the analysis
verifiable and unbiased? And this is where reproducible research and open science come into play. Where do you find these studies? If you look across the literature, most papers that you see circulating in respectable journals don’t meet more than a couple of these criteria, or even none. If you decide just to read the New
England Journal of Medicine, Lancet, JAMA, like the top of the top, again, most papers will meet very few of these criteria and even though they will be better, on
average, in fulfilling some of these requirements,
most of the good studies would not necessarily be there. Do we need to speed up or
do we need to slow down? That’s another question. Facing these challenges,
one option is to say, well we have all these problems, thank you for mentioning them again, you’re such a bad guy to mention them. But the way to move forward is to just get as many
interventions out there. Get them licensed and we’ll sort it out. We’ll let the dust settle. We’ll do some studies after the fact and then the real winners will emerge. This is actually an idea that is not new. It has been going on
for over 10 years now. It’s very prevalent in
cancer, prevalent in HIV. I think HIV is a nice success story. I was at NIH when we ran the
pivotal clinical trial ACTG320 that showed that you can have a treatment that can completely change
the course of a disease, a huge success. And some other diseases have this kind of accelerated approach. What have we learned? First, we have learned that the studies that are supposed to
be done after the fact, they are not done. Once you have something licensed, then it will go mostly into the mode of being evaluated for yet new indications beyond the ones that it
has been already approved. Nobody will go back, or
very few people will go back and try to make sure that what we approved it for is really valid. These trials of using that
new treatment as background, meaning that they take it
for granted that it works, but then they build
additional new interventions that are also eager to get
licensed are happening very fast. Within one year, we are shifting to yet the new target being licensed. If you compare the time of the trials where the intervention is tested for the initially approved indication versus other indications,
there’s hardly any difference. People take it for granted that since it got licensed for something, it’s good for everything,
which is exactly the opposite compared to the acceleration happening with the precision mentality in mind, that we’re proposing accelerated approval because it can really
affect that particular type of individual with these
specific characteristics. The other mode that is
becoming more common is to approve based on no randomized trial data at all. This is an analysis that we did with all the European
Medicines Agency approvals over the last 10 years
and we’re doing the same with all the FDA approvals. Roughly 10% of approvals
of new medications and biologics have absolutely
no randomized data. If you look at the effect sizes, some of them look pretty big. This is the absolute risk difference and these are odds ratios, and on average, you see some fairly large odds ratios. Odds ratios of something like 12, really nice, but the effect sizes could be all over the place. Sometimes they’re close to zero. There are still drugs that get approved with nonrandomized data and with effect sizes that suggest absolutely no benefit, but there’s some biologic rationale that people defend them with: well, it works on that mutation, and we’ve seen this to be important, so therefore it must be effective. Systematic reviews and meta-analyses are not going to help necessarily
much in that situation. We have lots of them. We have exceeded 100,000
published meta-analyses nowadays. It’s more of an epidemic that is evolving but I think that if you have this type of starting block of information, you’re unlikely to be useful. We start seeing some weird phenomenon. We see, for example,
China becoming a champion in meta-analyses because again, you can get some money out of each paper that you publish there. This is genetics. Genetic meta-analyses. Nothing was being published from China but in 2014 the represented about 17% of the global production in
English language journals, serious journals, and now they’re about 85% of
the global production. There’s lots of contractors that publish meta-analyses and tons of them. There’s a hundred
companies that you can pay and get a meta-anlysis, and
then if you’re the industry you can decide to publish it or not depending on what the results look like. And it will be very nice looking, you pay a little bit, but
that’s not a lot of money compared to other areas of
research and development. As a result, we get tons of meta-analyses. These are network meta-analyses which is the most difficult design to do, but in some cases, we have
up to 28 meta-analyses on the same topic like biologics for rheumatoid arthritis. None of them is exactly the same. They cover different treatments and none of them gives the same coverage. None of them covers more than 50% of the studies that have been published. And none of them agrees with each other. These are the results. Again, if you want to pay,
you can get a meta-analysis to give you the result
that would fit your agenda. This is one practical example, since I surely got you depressed with what I have been describing. Which is the best antidepressant to use? These are four network meta-analyses done by very good friends of mine. They’re the best
meta-analyses in the world. I know that because they’re my friends, so I want to boost them
before I destroy their work. Paroxetine, according to this meta-analysis, is the best antidepressant. According to that one, it is one of the worst. Sertraline is the second best here, and the worst or one of the worst there. The choice of exactly how the meta-analysis is going to be done can make a difference. In a way, meta-analyses can be the last step of a marketing tool, where you can get the conclusions that you want to see. Among 185 meta-analyses of
antidepressants for depression, when an author, who was an employee of the manufacturer, was involved, there was
a 22 times higher chance of not having any negative statements. Actually, there was only one case where you had industry involvement and some negative statement. And I’m not even sure that those people are still employed by that company. Only 3% of meta-analyses are both decent and clinically useful. I think this is the best design, and when it works, it can help a lot. So evidence-based medicine is not dead. When you identify these
3,000 meta-analyses papers, they can really be helpful
and they can tell you what to do for population averages and perhaps also for a little bit of individual effects, but
they’re not that common. Last possibility may be getting from the individual effects, much
larger population effects for the same pathway. The classic example here is PCSK9 inhibitors where seemingly, we have a
very nice success story. We started from a genetics discovery. We identified a gene that seemed to be very important in
familial hypercholesterolemia. And it also had variants that would affect the risk at a population level. Then, developed a drug that was based on that target and we see that wow, we can really bring LDL
down across the population. Was that really a success story? If you look at the price
of that intervention, it is so high that clearly its cost-effectiveness is not something that is desirable. And obviously, we still don’t know whether the benefits in lowering LDL and in some clinical outcomes would also translate into mortality benefits; apparently, that does not seem to be so prominent. I would close by saying that we may need to reverse the paradigm. Maybe move from poor primary data, retrospective reviews, and this type of fragmented information
to prospective meta-analyses as the key type of primary research. Think about what the problems are that we want to solve and design a clinical agenda where
everybody working on that field will join forces worldwide, the data will be
incorporated prospectively. There will be individual data, but they will contribute towards the same overall analysis. We can design what type of
comparisons we want to test. We would design the next study based on what we have in the
composite evidence until now. I think it’s unlikely we will be able to fuel precision medicine or health in individuals until we can obtain large-scale coordinated evidence on large populations to start with. To conclude, evidence based medicine has been with us for a long time. It has acquired tremendous power, but most of the medical
evidence is either problematic, spurious, or false, or has no utility for medical and shared decision making. The main utility of systematic reviews and meta-analyses has been to reveal problems with the biomedical evidence. Precision medicine or precision health aims to build on one of the main pillars of EBM, dealing with
individuals, which is nice, but by definition is likely
to have minimal impact on life expectancy and other
major population outcomes. Still, a synergy between
large-scale evidence and precision approaches would be useful to tell us what we can learn from both. Expectations of replacing experimental, randomized evidence
with nonrandomized data need to be tempered. I’m not saying that it cannot happen, but most of the time, we
will need randomized trials. Conflicts of interpretation
need to be minimized for primary data, for trials, for meta-analyses, and for guidelines. And prospective building
of research agenda may become the gold standard, hopefully, for primary clinical research, both at the individual and the population level. Special thanks to some of my colleagues who contributed to some of the work that I shared with you today, and special thanks to all of you for tolerating me at 5:00 p.m. Thank you. (audience applauding) – Any questions? I can start with one. So John, you’ve been
thinking about these issues for several years now, and painting a little bit
of a depressing picture. Do you think that, in some ways, this movement towards precision health and big data will actually solve some of these issues in just
kind of a general sense? Or do you think it’s going to be fuel that’s just going to add to the fire and it’s gonna make these problems even many folds worse as we think about things
over the next decade? – I think it’s up to us. It can go either way. I think that if we realize the challenges and build on circumventing them, and generating more
openness, more transparency, reduced conflicts, larger databases with more accurate measurements, we have a chance of addressing some of the longstanding problems that we have not solved with
clinical research until now. Conversely, there’s a great opportunity of having more data, and
therefore more errors rather than more discoveries
and more useful observations. So I don’t want someone
to come out of this as oh, we’re doomed and nothing can be done. We have lots of opportunities. We have lots of tools. We have lots of possibilities. The question is how exactly are we going to use them to get something useful? These eight criteria could be some sort of guidance of how to use these new tools. – Essentially I’d like to build on the previous question a little bit. And this actually comes more
from your talk yesterday, but it relates to this topic as well. I work in implementation
science, implementation research. At the end of your talk yesterday, you outlined a number of
steps that could be taken. Things could be implemented. In a sense, your criteria are also things that could be implemented. I’d like your thoughts on
what it will take to do that, if you think of implementation
as behavior change, and you think about the scientific and medical research complex in this country, internationally,
as the substrate of people whose behavior needs to change. What will it take to actually
get that change to happen? – I think to achieve change, you need to have the major
stakeholders aligned. And the major stakeholders are scientists, funders, journals, professional societies, institutions, and also the general public and patients, who could be quite influential. You don’t necessarily need to have all of them aligned to
do exactly the same thing but they need to be sensitized and they need to recognize
that these are issues that need to be tackled and it’s important to move in that direction. Once you have some of
these stakeholders moving, the others will follow. If you have just one,
it’s not gonna happen. So no single journal alone
can change the world. No single scientist alone
can change the world. But if you have multiple stakeholders, you will see things happening. And every successful transformation, like registration for clinical trials, happened when you have multiple
stakeholders being aligned. So you have both journals saying that I’m not gonna publish
your paper unless it is, if it’s a trial, unless
it’s preregistered, and you’d have funders who would say that you need to do that, and
you had ClinicalTrials.gov saying here is how to do it, and you have to do it. And you have regulations trying to incentivize and ask for it. So we need, I think, a critical mass for these changes to happen. Data sharing: it was very rare, and this is why we only came up with 37 reanalyses that had been published by 2014. Right now, there’s more
than 10,000 clinical trials worth of data that someone
could access as raw data. It’s still a small portion
compared to roughly 1 million, but I think we’ve seen some action. Statistical tools analysis, again we see some evolution over time. I’m not necessarily pessimistic. I think we do see changes, we
do see some paradigm shifts. At the same time, our challenges
are becoming incrementally, and sometimes geometrically,
more pressing. Because we have more data, more analyses, more people eager to create
these sort of results. – Can you comment a little bit about how heterogeneity can influence the results and how it can be incorporated into meta-analyses or trial and error? One of the thoughts that I found, one of the stories I found mind-boggling is for about 20 years, there was a drug that worked in a very
specific kind of lung cancer, and it might have been,
I forgot the exact name, it was acting on the kinase domain of the EGF receptor, and it worked really well in about 3% or 5% of cancer patients, depending on ethnicity, etc. It failed every single clinical trial, and therefore, for a long time, in Europe it was allowed to be prescribed
in certain types of cancer, but in America, it was
only allowed to be taken by patients who had positive
reaction to it before. I believe that has now changed. My worry is that your example
of your antidepressants or whatever, that there
may be tons of such drugs that work really well
in 3 or 5% of patients, but that are not, that are
failing clinical trials and the only reason this
particular one was used is because we understood the
mechanisms and it made sense. As long as we don’t
understand the mechanism there may be such heterogeneity all over, and we may just be throwing out all of those drugs because
they only work in a subset. – I take your point, and
I think it’s likely that there are drugs for which heterogeneity is masking effectiveness in specific subpopulations, and we don’t know the
mechanistic substratum that would be the answer to that. I don’t know how common that is, though. So, for example, cancer
is the one discipline that has probably made most progress today in terms of applications of these sort of personalized treatments that are based on this type of biological heterogeneity, where you have a mutation
well matched to the treatment. But you can see that in
the super umbrella trial, NCI-MATCH, only about 2.5%
of patients with cancer can be matched to such a mutation
that would be recognizable and you can have molecularly
targeted treatment. Even in a pretty successful paradigm, it’s still a small minority. If you look at, across all cancer trials at the moment in oncology, there’s about 150 trials that have
personalized designs in process. Not finished, but registered
in ClinicalTrials.gov. So basket trials and umbrella trials and personalized designs that might fit that concept. As opposed to about 50,000 trials that just have the
typical bread and butter average treatment effect approach. We need to do more of those. So we need to go after this
type of matching biology, matching mechanism to
understand heterogeneity. But I don’t want to also
reach the other conclusion that in every case, we will
be able to find something. Antidepressants, for
example, have been out there for more than 50 years. We recently published a meta-analysis in The Lancet with 522 trials
and 120,000 people. If you look at the literature, there’s hardly any biological marker that you can reliably use
to individualize treatment. It’s like a holy grail. We want to individualize and
there are some possibilities, but nothing really that is as rigorous as a mutation linked to a specific biologic monoclonal antibody in cancer. So I think the challenges are different. Maybe in some cases, we are just
looking down the wrong path. Maybe our whole thinking about what type of treatments we want
to go after is wrong. In other cases, maybe
we’re on more solid ground and we need to be open to challenge that. – One more question. Anyone? Right down here. – John, have you found
any sense, in trials, of people who have taken advantage, especially in cancer trials, of the evidence on differential expression, splice isoforms, and key steps in key pathways? There is this notion that our transcript analysis is sometimes called gene analysis, which it’s not. And if you recognize at the base that the evolution of development rests on the exon-intron structure of genes, with splicing that goes on to produce more products from each gene, and then you just lump them all together and assume they’re all equivalent, which they generally are not. So I believe that (mumbles) in trying to confirm a biomarker result, which is generally fruitless, or to reliably treat, even with a mechanism in hand, there’s a lot of variation in the response, much of it aligned to this splicing variation that is neglected. – I think it’s a clear possibility. It’s a mechanism that I think has not been explored
to its full potential. I would argue it still needs to be matched to problems that remain unsolved. Coming back to the very first criterion, the feature of what we really want to get: in terms of tumor biomarkers, there are about 2,000 new tumor biomarkers proposed every year. And we actually published a paper a number of years ago where we found that 99% of that literature (even higher than the average 96%) claimed significant results. Among the few that don’t claim significant results, I remember that that 1% included a paper that had 125 p-values for that biomarker scattered throughout the text and tables, none of them close to even 0.05, but the conclusion of
the abstract was that this is a very important tumor biomarker as we have shown in our previous study. We have lots of leads and some of them are more exciting than others, but we don’t have a very rational way to prioritize leads that
would be more fruitful. And we just drown in a sea of tens of thousands of biomarkers, with only a couple of dozen already adopted and used in the clinic, and a very big gray zone that is unexplored and very fragmented. So I think that this is one such example where you may have a clear winner, but it’s just lost within
that space of noise. – That’s wonderful. So, maybe one more round of applause. (audience applauding)

