We reviewed four news stories about the Swedish mammography study published last week in the journal Cancer. Three of the four stories gave a pretty clear indication that there were methodological concerns about the Swedish research (of the four reviewed, only HealthDay offered no such hint):
• 4th paragraph of AP story: “The new study has major limitations and cannot account for possibly big differences in the groups of women it compares.”
• 1st paragraph of LA Times blog story: “Critics charged that the study was poorly designed and potentially vastly misleading.”
• 2nd sentence of NY Times story: “Results were greeted with skepticism by some experts who say they may have overestimated the benefit.”
But none of the stories did a very complete job of explaining those potential limitations. Because of the confusion that must be occurring in the minds of women — especially those in their 40s — this is a time in which journalism must rise to the need and do a better job of evaluating evidence and helping readers make sense of what appear to be conflicting findings.
I was in Chapel Hill, North Carolina, when the study was published and had the chance to discuss it with Dr. Russell Harris, a former U.S. Preventive Services Task Force member and a recognized thought leader on issues of prevention, especially screening tests. Harris is Professor and Director of the Health Care and Prevention Concentration of the University of North Carolina (UNC) School of Public Health.
He sent me an email with the following analysis of the study. Read this carefully. I’m confident you’ll learn a great deal from his analysis:
First, the authors say that their primary purpose is showing that there is a reduction in breast cancer mortality due to screening for women ages 40-49. It is worthwhile to point out that the US Preventive Services Task Force (USPSTF) agrees that there is a reduction in mortality in this group. Recall that the systematic review by Nelson et al found a relative risk reduction of 16% for this group, with a number needed to screen (NNS) for 10 years (with NNS you always have to give a length of time for the intervention) of about 2,000. The Swedish study found a relative risk reduction of 26%, with a NNS of 1252. The key issue is not whether there is a benefit, but rather how large the benefit is. In addition, this study says nothing about the harms of screening, while the USPSTF spent much time and energy trying to get a handle on the magnitude of the harms. The decision about screening (whether a policy decision or an individual patient decision) should hinge on the balance between the magnitude of benefits and the magnitude of the harms. So, we can discuss the difference in magnitude of benefit between this study and the USPSTF, but this study won’t help us at all with the issue of the magnitude of the harms (including the experience of women with false positive results and the effects of overdiagnosis).
Now, to look at the Swedish study, it is also worthwhile to note that these investigators have an obvious point of view from the start. Their previous papers (especially Tabar and Duffy) have all come to the same conclusion – that screening is a good idea. The Swedish data, and the Norwegian data (in the Sept. 23, 2010 New England Journal of Medicine), are rich datasets and should be explored. However, in both cases it might make more sense for the investigators exploring these data to be clearly disinterested investigators who are not out to prove something. This becomes even more important as we get into the analysis required from such a dataset.
The research design of the Swedish study is a non-randomized trial, with primarily ecologic data (i.e., the data come from large national databases of breast cancer diagnoses and deaths). When contrasting breast cancer mortality in large groups, there are two critical issues. The first (#1) is making sure the groups are comparable in all factors that might determine the outcome (breast cancer mortality). You want the groups, as much as possible, to differ only in that one group receives screening while the other does not. The other critical issue (#2) is that we count outcomes (breast cancer deaths) equally in both groups. If we count deaths unequally, we bias the results either in favor of or against screening.
In terms of the comparability of the groups, there are several problems:
• The areas that screened starting at 40 are better off than counties that started at 50, meaning that there are likely many other factors (such as treatment) that also are different in these groups. In terms of treatment, for example, the study period (1986-2005) witnessed an impressive improvement in breast cancer treatment. If these improvements in treatment occurred more in the study than the control areas, this could easily explain all of the difference between USPSTF and this study’s estimates of the relative risk reduction for women 40-49.
• Some areas switched approaches during the study period – sometimes starting at 40 and sometimes at 50 (and one even at 45). This required statistical “adjustments”. The best way to do these adjustments is not clear, and some might do them one way while others might do them another. The Norwegian study tried to take treatment into account – imperfectly, I am sure, but they probably got it at least partly right. The problem of how best to do these adjustments is a special problem for investigators who begin with a point of view. It would be very easy to adjust in a way favorable to their point of view and then justify themselves later.
In terms of which breast cancer deaths to count, this also is problematic:
• They (correctly) focused on deaths in the study group of women diagnosed during their 40s, whether they died then or years later. The issue with screening, after all, is whether diagnosis at an earlier time (in this case, during their 40s) ends up with better health outcomes than diagnosis later, after the woman or her physician finds a breast lump. The problem comes in deciding which breast cancer deaths in the control group should be counted. To be comparable, you really want to count all of the deaths of women who would have been diagnosed had they been screened. But some of these control group women will, in the absence of screening, now be diagnosed in their 50s rather than 40s. And not all women diagnosed with breast cancer in their 50s in the control group would have been diagnosed by screening in their 40s. So which ones should we count and which ones not? In an RCT, one can determine this by waiting for a time after screening stops in both groups and then, when breast cancer cases in the control group “catch up” with cases in the screen group, you have comparability. But in the Swedish type of study design, there is no way to do that. What this means, then, is that further statistical “adjustments” must be done. The Swedish investigators added or subtracted (they don’t tell us how they did this) person-years from the denominator of the study group to adjust. But, as noted above, it is very easy to make those adjustments in a way that favors your point of view.
The end result is that it is really difficult to have confidence in the analyses of this group, and the results of this study. There are just too many ways that the adjustments could bias the final result in the desired direction.
So I think this study points out the many problems of this study design, and the potential for bias in the analysis. Again, I would call for an independent group without a preformed point of view to analyze such data. (Something similar happened with prostate cancer screening in the Tyrol study.)
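For readers who want to see how the relative risk reduction and number needed to screen (NNS) figures in Harris's analysis relate to each other, here is a rough back-of-the-envelope sketch. It is not taken from either study. It assumes the textbook relationships NNS = 1 / absolute risk reduction and absolute risk reduction = baseline risk × relative risk reduction, and it treats both NNS figures as covering the same 10-year period and the same definition of "screened," which may not hold exactly. The function names are made up for illustration.

```python
# Back-of-the-envelope sketch (assumptions, not either study's method):
#   NNS = 1 / ARR          (ARR = absolute risk reduction)
#   ARR = baseline * RRR   (RRR = relative risk reduction)
# Both NNS figures are treated as 10-year numbers on the same basis.

def absolute_risk_reduction(nns):
    """Deaths averted per woman screened, implied by an NNS."""
    return 1.0 / nns


def implied_baseline_risk(nns, rrr):
    """Breast cancer death risk without screening implied by NNS and RRR."""
    return absolute_risk_reduction(nns) / rrr


def nns_from(baseline_risk, rrr):
    """NNS implied by a baseline risk and a relative risk reduction."""
    return 1.0 / (baseline_risk * rrr)


# Figures quoted in the analysis above.
estimates = {
    "USPSTF review (Nelson et al)": {"rrr": 0.16, "nns": 2000},
    "Swedish study": {"rrr": 0.26, "nns": 1252},
}

for label, e in estimates.items():
    arr = absolute_risk_reduction(e["nns"])
    baseline = implied_baseline_risk(e["nns"], e["rrr"])
    print(f"{label}:")
    print(f"  deaths averted per 1,000 women screened: {arr * 1000:.2f}")
    print(f"  implied baseline deaths per 1,000 women: {baseline * 1000:.2f}")
```

Under these assumptions, both sets of figures imply a similar baseline risk of roughly 3 breast cancer deaths per 1,000 women, so the gap between the two NNS values comes almost entirely from the different relative risk reductions. That is consistent with Harris's point that the disagreement is about the size of the benefit, and the calculation says nothing about harms.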
UNC posted a video last December in which Harris talks about the U.S. Preventive Services Task Force recommendations from November 2009.
*This blog post was originally published at Gary Schwitzer's HealthNewsReview Blog*