The Journal of Thoracic and Cardiovascular Surgery
Volume 137, Issue 6, Pages 1572-1573, June 2009

Reference values: No need for confusion

Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, Ohio

CTSNet classification: 2, 4

 

To the Editor:

I would like to comment on the discussion among Lim and Dusmet [1], Marra and colleagues [2], and Rice and Blackstone [3]. Several points in that exchange have caused confusion, and I hope to clarify some of them.

Sensitivity and specificity are measures of a test's inherent diagnostic performance. Sensitivity is the proportion of patients who test positive among patients with the disease; specificity is the proportion of patients who test negative among patients without the disease. Another common measure of diagnostic performance is the receiver operating characteristic (ROC) curve [4]. An ROC curve shows a test's sensitivity and specificity across different criteria for defining positive and negative test results. For highly accurate tests, one can choose a point on the ROC curve where specificity is high; the price, however, is low sensitivity. Similarly, one can choose very high sensitivity, but at the price of low specificity. Lim and Dusmet's [1] comment that “sensitivity truly starts at 50%” is incorrect; a test with low sensitivity (ie, <0.5) can have diagnostic value if the specificity is high.
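
The trade-off along an ROC curve can be sketched in a few lines of Python. The scores and cutoffs below are hypothetical, invented only for illustration; they are not data from any of the cited studies, and the function name is my own.

```python
def sens_spec(scores_diseased, scores_healthy, threshold):
    """Return (sensitivity, specificity) when a score >= threshold is called positive."""
    true_pos = sum(s >= threshold for s in scores_diseased)
    true_neg = sum(s < threshold for s in scores_healthy)
    return true_pos / len(scores_diseased), true_neg / len(scores_healthy)


# Hypothetical test scores, invented for illustration only.
diseased = [3, 5, 6, 7, 8, 9]
healthy = [1, 2, 2, 3, 4, 6]

# Each cutoff corresponds to one point on the ROC curve.
for cutoff in (2, 4, 6, 8):
    sens, spec = sens_spec(diseased, healthy, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these invented scores, the strictest cutoff (8) gives a sensitivity of 0.33 with a specificity of 1.00, and the most lenient cutoff (2) gives a sensitivity of 1.00 with a specificity of 0.17.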

Sensitivity and specificity are the basic measures of a test's ability, but they do not describe how well the test will perform for a particular patient population. In managing patients, physicians focus on what the test results tell them about their patient. They want to know the probability that their patient has the disease after a positive test result (positive predictive value [PPV]) and the probability that their patient does not have the disease after a negative test result (negative predictive value [NPV]). Predictive values depend not only on the sensitivity and specificity of the test but also on the probability of disease in similar patients (ie, the prevalence of disease). In fact, when predictive values are reported in the literature, a subscript indicating the prevalence rate is often used. For example, remediastinoscopy may have an NPV of 0.85 in a sample with a prevalence rate of 0.32, which we write as NPV_0.32 = 0.85. In a different population with a different prevalence rate, the NPV will change, for example, NPV_0.05 = 0.98 or NPV_0.50 = 0.72. Much of the controversy in these authors' correspondence is due to confusion between sensitivity and PPV, and between specificity and NPV. Sensitivity and specificity describe the test's inherent diagnostic abilities irrespective of the prevalence rate. PPV and NPV, on the other hand, tell us the likelihood of disease after the test is performed in a particular patient population with a particular prevalence rate. In determining the role of remediastinoscopy in restaging lung cancer, it seems that PPV and NPV are the important metrics and should be the focus of the discussion.
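
The dependence of predictive values on prevalence follows from Bayes' theorem and can be sketched as below. The sensitivity (0.62) and specificity (1.0) are assumed values chosen for illustration only; they are not quoted in this letter, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at the given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed values for illustration only; not taken from the cited studies.
SENSITIVITY = 0.62
SPECIFICITY = 1.0

for prevalence in (0.05, 0.32, 0.50):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

Under these assumed inputs the NPV comes out at about 0.98, 0.85, and 0.72 for prevalence rates of 0.05, 0.32, and 0.50, in line with the values quoted above, while sensitivity and specificity themselves do not change.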

Lim and Dusmet [1] and Marra and colleagues [2] point out correctly that specificity is important for ruling in disease and sensitivity is important for ruling out disease. These relationships are due to the roles of these metrics in estimating PPVs and NPVs. A high specificity causes the PPV to increase, and a high sensitivity causes the NPV to increase, assuming, of course, that the prevalence of disease is held constant. As illustrated above, predictive values are highly influenced by the prevalence of disease. Similarly, the measure of “accuracy” that Marra and colleagues report also depends on the prevalence of disease in the sample, and thus could be reported more appropriately as overall accuracy_0.32 = 0.88.
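
Overall accuracy is a prevalence-weighted average of sensitivity and specificity, which is why it shifts with the sample. A minimal sketch follows, using an assumed sensitivity of 0.62 and specificity of 1.0; these values are illustrative, not taken from the cited studies, and the function name is hypothetical.

```python
def overall_accuracy(sensitivity, specificity, prevalence):
    """Expected proportion of correct results in a sample with the given prevalence."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


# Assumed values for illustration only; not taken from the cited studies.
SENSITIVITY = 0.62
SPECIFICITY = 1.0

for prevalence in (0.05, 0.32, 0.50):
    acc = overall_accuracy(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.2f}: overall accuracy {acc:.2f}")
```

With these assumptions, overall accuracy is about 0.98 at a prevalence of 0.05, 0.88 at 0.32, and 0.81 at 0.50, even though the test itself is unchanged.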

There are several other issues in these correspondences that need clarification. First, neither Marra and colleagues [2] nor Lim and Dusmet [1] report a confidence interval (CI) for specificity. A reasonable 95% CI for specificity based on these data is 0.96 to 1.0 [5]. CIs for both sensitivity and specificity should be routinely reported. Contrary to Marra and colleagues' description of the meaning of a CI, it is not “the likelihood that another sample will provide the same result.” Rather, a CI describes a range of plausible values for the metric of interest, here specificity. Statistically speaking, we expect that 95% of CIs will contain the real, but unknown, true value of the metric (ie, specificity); 5% of CIs will not contain the true value. Statisticians use the data from a single sample to estimate the unknown value of the metric; 95% of the time the CI they construct contains the true value, although we do not know which value in the interval it is or which CIs contain the true value and which do not.
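
When no false-positive results are observed, a lower confidence bound for specificity can be obtained with a zero-numerator argument such as the “rule of three.” The sketch below assumes that situation and uses a hypothetical count of 75 patients without disease, chosen only for illustration; the actual sample size is not stated in this letter, and the function names are my own.

```python
def specificity_lower_bound(n_without_disease, confidence=0.95):
    """Exact one-sided lower bound for specificity when zero false positives are seen.

    Solves specificity ** n = 1 - confidence for specificity.
    """
    return (1 - confidence) ** (1 / n_without_disease)


def rule_of_three_lower_bound(n_without_disease):
    """Approximate 95% lower bound for specificity: 1 - 3/n."""
    return 1 - 3 / n_without_disease


N_WITHOUT_DISEASE = 75  # hypothetical sample size, for illustration only

exact = specificity_lower_bound(N_WITHOUT_DISEASE)
approx = rule_of_three_lower_bound(N_WITHOUT_DISEASE)
print(f"exact: {exact:.3f} to 1.0; rule of three: {approx:.3f} to 1.0")
```

For this assumed sample size, both calculations give a lower bound of about 0.96, and the bound moves closer to 1.0 as the number of patients without disease grows.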

Second, it is important to consider the effects of patient and disease characteristics in estimating sensitivity and specificity. For example, lesion size is a critical determinant of sensitivity, as are patients' comorbidities. Some of the differences between estimates of sensitivity and specificity reported in the literature for remediastinoscopy could be due to these patient differences.

Third, when a diagnostic test does not yield a result, that is, the result is “uninterpretable” [6], it is critical that the frequency of this occurrence be reported. Marra and colleagues [2] reported a 2% frequency for remediastinoscopy. They also included this frequency in the denominator of their estimate of overall accuracy; this gives the reader an honest estimate of the test's performance.
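
The effect of leaving uninterpretable results out of the denominator can be seen with simple arithmetic. The counts below are hypothetical, invented for illustration only (a 2% uninterpretable rate in 100 patients); they are not the counts from the cited study.

```python
def accuracy(correct, incorrect, uninterpretable, count_uninterpretable=True):
    """Proportion of correct results, with or without uninterpretable tests counted."""
    denominator = correct + incorrect
    if count_uninterpretable:
        denominator += uninterpretable
    return correct / denominator


# Hypothetical counts, invented for illustration only.
CORRECT, INCORRECT, UNINTERPRETABLE = 88, 10, 2

counted = accuracy(CORRECT, INCORRECT, UNINTERPRETABLE)
excluded = accuracy(CORRECT, INCORRECT, UNINTERPRETABLE, count_uninterpretable=False)
print(f"uninterpretable counted:  {counted:.3f}")
print(f"uninterpretable excluded: {excluded:.3f}")
```

In this invented example, dropping the uninterpretable results raises the apparent accuracy from 0.880 to 0.898; the gap widens as uninterpretable results become more frequent.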

Last, I think Drs Rice and Blackstone's [3] statement that screening tests usually have good specificity, whereas a test used to work up patients needs good sensitivity, is too narrow and does not describe many scenarios. In screening for breast cancer, for example, physicians look for tests with good sensitivity even if the false-positive rate is a bit high. Computer-aided detection systems are often used to improve sensitivity, usually at a cost of even higher recall rates. Without reasonable sensitivity, many screening programs cannot be cost-effective. Further workup of these patients demands higher specificity to prevent unnecessary invasive testing. The consequences of test errors and the prevalence of disease must be weighed in each application to find the best test for it.

References

  1. Lim E, Dusmet M. Remediastinoscopy: a statistical reinterpretation. J Thorac Cardiovasc Surg. 2009;137:254–255; author reply 255–256.
  2. Marra A, Hillejan L, Fechner S, Stamatis G. Remediastinoscopy in restaging of lung cancer after induction therapy. J Thorac Cardiovasc Surg. 2008;135:843–849.
  3. Rice TW, Blackstone EH. Referent values and equipoise: editors' notes. J Thorac Cardiovasc Surg. 2009;137:256–257.
  4. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: Wiley and Sons; 2002.
  5. Hanley JA, Lippman-Hand A. If nothing goes wrong, is everything all right? Interpreting zero numerators. JAMA. 1983;249:1743–1745.
  6. Begg CB, Greenes RA, Iglewicz B. The influence of uninterpretability on the assessment of diagnostic tests. J Chronic Dis. 1986;39:575–584.

PII: S0022-5223(09)00356-0

doi:10.1016/j.jtcvs.2009.02.031

Refers to article:

  • Remediastinoscopy in restaging of lung cancer after induction therapy

    Alessandro Marra, Ludger Hillejan, Sylvia Fechner, Georgios Stamatis
    The Journal of Thoracic and Cardiovascular Surgery April 2008 (Vol. 135, Issue 4, Pages 843-849)

  • Remediastinoscopy: A statistical reinterpretation

    Eric Lim, Michael Dusmet
    The Journal of Thoracic and Cardiovascular Surgery January 2009 (Vol. 137, Issue 1, Pages 254-255)

  • Referent values and equipoise: Editors' notes

    Thomas W. Rice, Eugene H. Blackstone
    The Journal of Thoracic and Cardiovascular Surgery January 2009 (Vol. 137, Issue 1, Pages 256-257)
