European Urology

Volume 62, issue 4, pages e69-e82, October 2012

Prostate Cancer

Stages of Prediction Model Comparison

Michael W. Kattan a,* and Thomas A. Gerds b

Published online 6 May 2012, pages 597 - 599


Refers to article:

Comparison of Three Different Tools for Prediction of Seminal Vesicle Invasion at Radical Prostatectomy

Giovanni Lughezzani, Kevin C. Zorn, Lars Budäus, Maxine Sun, David I. Lee, Arieh L. Shalhav, Gregory P. Zagaja, Sergey A. Shikanov, Ofer N. Gofrit, Alan E. Thong, David M. Albala, Leon Sun, Angel Cronin, Andrew J. Vickers and Pierre I. Karakiewicz

Accepted 3 April 2012

October 2012 (Vol. 62, Issue 4, pages 590 - 596)

1. The three fundamental assessment levels

Risk prediction models can be assessed and compared on three fundamental levels: (1) discrimination, (2) calibration, and (3) utility. Unfortunately, it is rather difficult to identify commonly agreed and mathematically rigorous definitions of these terms. The unifying aim of all three levels is to quantify the quality of a risk prediction model. Discrimination tells us how well the model can distinguish between two patients that have different outcomes. Calibration measures the distance between predicted risk under the model, which can be incorrectly specified, and the expected event frequencies in the population. Utility measures the costs and benefits of clinical actions indicated by predicted risk.

To further understand the three terms as they are commonly used, it is instructive to consider the differences among them. First, unlike calibration and utility, discrimination does not depend on event prevalence. Second, unlike utility, discrimination and calibration weigh cases (patients with an event) and controls (event-free patients) equally. The first difference implies that a pure discrimination measure, like the area under the curve (AUC) or the concordance index, is blind to changes in prevalence.

To illustrate the implications, consider two populations with different prevalences of seminal vesicle invasion (SVI). Suppose that the associations between the risk of SVI and a set of predictor variables are identical or at least comparable in the two populations. This would be reflected by similar odds ratios, which could be obtained, for example, by multiple logistic regression. The expected AUC of the model would be the same in the two populations. If the model were trained on a sample from population 1, it would show similar discrimination results in independent validation sets from population 1 and population 2; however, because the event prevalence in population 2 is different (eg, higher) than in population 1, the absolute risk predictions of the model trained in a sample from population 1 will not be calibrated for population 2. For example, suppose the true risks of SVI for two men from population 1, ages 50 and 51, are 2.1% and 2.2%, respectively, and for similar men in population 2, the risks are 4.3% and 4.9%, respectively. The model trained in a sample from population 1 will likely predict the risks of the two men in population 2 at around 2.1% and 2.2%. This means that the model can discriminate in population 2 (the man at higher risk is given the higher predicted risk), but the model is not well calibrated in population 2, as 2.1% is quite unlike this man's true risk of 4.3%.
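The contrast between discrimination and calibration under a prevalence shift can be sketched with a small toy calculation. The data below are invented for illustration and are not taken from any of the cited studies. Population 2 is built by repeating every case from population 1 three times, which is one simple way to raise prevalence while leaving the distribution of predicted risks within cases and within controls unchanged.

```python
def auc(risks, outcomes):
    """Chance that a random case has a higher predicted risk than a random control (ties count 1/2)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def calibration_gap(risks, outcomes):
    """Mean predicted risk minus observed event frequency (0 means calibrated on average)."""
    return sum(risks) / len(risks) - sum(outcomes) / len(outcomes)


# Population 1 (toy data): 10 men predicted at 10% with 1 event,
# 10 men predicted at 30% with 3 events.
risks_1 = [0.1] * 10 + [0.3] * 10
outcomes_1 = [1] + [0] * 9 + [1] * 3 + [0] * 7

# Population 2 (toy data): same model, but every case appears three times,
# so the prevalence rises from 20% to about 43%.
risks_2, outcomes_2 = [], []
for r, y in zip(risks_1, outcomes_1):
    copies = 3 if y == 1 else 1
    risks_2 += [r] * copies
    outcomes_2 += [y] * copies

auc_1, auc_2 = auc(risks_1, outcomes_1), auc(risks_2, outcomes_2)
gap_1, gap_2 = calibration_gap(risks_1, outcomes_1), calibration_gap(risks_2, outcomes_2)

print(f"AUC: population 1 = {auc_1:.3f}, population 2 = {auc_2:.3f}")
print(f"Mean predicted minus observed: population 1 = {gap_1:+.3f}, population 2 = {gap_2:+.3f}")
```

In this construction the AUC is identical in both populations, while the model underestimates the event frequency in population 2 by roughly 21 percentage points.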

In light of these remarks, we reach a different interpretation of the results from Lughezzani et al. [1]. The Gallina nomogram (AUC: 80.5%) outperformed Partin tables (AUC: 79.2%) in terms of discrimination. That means that the Gallina nomogram uses the information in the predictors slightly better than the Partin tables. However, the Gallina nomogram was derived from a multiple logistic regression model in a population with a considerably higher prevalence [2]. Traditional measures of calibration like the calibration plot and the Brier score agree that the Gallina nomogram is not calibrated in the data that Lughezzani et al. [1] used for validation. In light of the discrimination result, it would be of great interest to see if a calibrated version of the Gallina nomogram could outperform the Partin tables in terms of calibration. The Gallina nomogram could be calibrated by adjusting the intercept of the multiple logistic regression model behind it to the level of the prevalence in the population used for validation. To keep this calibration step independent of the actual validation sample, one could get an estimate of the SVI prevalence from a national registry or from the report of a different study of the same population.
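A minimal sketch of such an intercept adjustment is given below. It assumes that only the baseline risk differs between the development and validation populations, so that shifting every prediction by the difference in prevalence log-odds is a reasonable correction; this is a common approximation, not necessarily the procedure the authors have in mind. The function name and the prevalence values (15% and 5%) are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def shift_intercept(predicted_risk, prevalence_dev, prevalence_target):
    """Move a predicted risk from the development prevalence to the target prevalence.

    Only the intercept changes; odds ratios between patients are preserved.
    """
    offset = logit(prevalence_target) - logit(prevalence_dev)
    return expit(logit(predicted_risk) + offset)


# Hypothetical values: nomogram developed at 15% prevalence,
# validation population with 5% prevalence taken from an external source.
original = [0.05, 0.20, 0.60]
recalibrated = [shift_intercept(p, 0.15, 0.05) for p in original]

for p, q in zip(original, recalibrated):
    print(f"original risk {p:.2f} -> recalibrated risk {q:.3f}")
```

Because every prediction is shifted by the same amount on the log-odds scale, the ranking of patients, and therefore the AUC, is unchanged by this step.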

It is notable that differences in event prevalence between two study populations can be due to real genetic or lifestyle differences but can also reflect different definitions of the diagnosis of the event (rater disagreement). In addition, the Brier score measures both discrimination and calibration. To see this, consider a null model, such as a logistic regression model without predictor variables, that assigns the event frequency in the training sample to every patient in the validation sample. Such a model is perfectly calibrated if the training and validation samples are representative of the same population. However, such a model has no discriminative ability (its AUC is 50%), and including important, discriminative predictors in the logistic regression model reduces the Brier score.
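The point about the null model can be checked on invented toy data. In the sketch below, both models are calibrated on average (each predicts a mean risk of 20% where 20% of patients have the event), but only the second one separates patients into a lower-risk and a higher-risk group.

```python
def brier(risks, outcomes):
    """Mean squared difference between predicted risk and binary outcome."""
    return sum((r - y) ** 2 for r, y in zip(risks, outcomes)) / len(outcomes)


# Toy validation sample: 20 patients, 4 events (prevalence 20%).
outcomes = [1] + [0] * 9 + [1] * 3 + [0] * 7

# Null model: the overall event frequency for everyone.
null_risks = [0.2] * 20

# Model with a predictor: 10% risk in the first group (1 of 10 events),
# 30% risk in the second group (3 of 10 events).
model_risks = [0.1] * 10 + [0.3] * 10

brier_null = brier(null_risks, outcomes)
brier_model = brier(model_risks, outcomes)

print(f"Brier score, null model:           {brier_null:.3f}")
print(f"Brier score, model with predictor: {brier_model:.3f}")
```

The null model scores 0.16, which equals prevalence times one minus prevalence, and the discriminating model scores 0.15, so the Brier score rewards the added discrimination even though both models are calibrated.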

2. Choosing the prediction scale

Lughezzani et al. [1] also compare the European Society of Urologic Oncology (ESUO) criteria, a model that predicts either no risk or 100% risk. Details on how the ESUO criteria were developed are not given in the article. For the point that we want to make, suppose that the ESUO criteria correspond to dichotomizing the risk predicted by the Gallina nomogram. Certainly, information is lost by dichotomizing a continuous risk (see Senn [3] and Royston et al. [4]). It is not surprising that the binary model performs less well than the continuous model, and in most situations, this observation is independent of how the continuous risk is dichotomized.

Lughezzani et al. [1] are mistaken when they blame the Brier score for the disappointing result. Before we give an explanation, note that the Brier score is a strictly proper scoring rule and a fundamental component of mathematical theory (see Savage [5] and Hand [6]). We have explained that the Gallina nomogram quite likely was developed in a population with a higher prevalence of SVI. Hence if we continue to assume that the ESUO criteria are obtained by comparing the Gallina nomogram prediction to a single threshold, then it is not surprising that the ESUO criteria appear to be miscalibrated in the validation study by Lughezzani et al. [1]. This is reflected in the high Brier score. Lughezzani et al. try to convince us that the Brier score should not be used to assess binary models, but a very simple binary procedure will improve the Brier score compared to the ESUO criteria: Predict no SVI for everyone. It is quite likely that an even better binary model (in terms of the Brier score) can be constructed by dichotomizing a recalibrated version of the Gallina nomogram or by dichotomizing the Partin tables.
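For predictions that are all 0 or 1, the Brier score is simply the proportion of misclassified patients, which is why "predict no SVI for everyone" scores exactly the prevalence. The sketch below uses invented counts (100 patients, 5 with SVI, a hypothetical binary rule that flags 20 patients) and does not reproduce the data of Lughezzani et al. [1].

```python
def brier(predictions, outcomes):
    """Mean squared difference between prediction and binary outcome."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)


# Hypothetical validation sample: 5 patients with SVI, 95 without.
outcomes = [1] * 5 + [0] * 95

# Hypothetical binary rule: flags 4 of the 5 cases and 16 of the 95 controls.
binary_rule = [1, 1, 1, 1, 0] + [1] * 16 + [0] * 79

# Trivial binary procedure: predict no SVI for everyone.
all_negative = [0] * 100

brier_rule = brier(binary_rule, outcomes)
brier_all_negative = brier(all_negative, outcomes)

print(f"Brier score, binary rule:     {brier_rule:.2f}")
print(f"Brier score, predict no SVI:  {brier_all_negative:.2f}")
```

With these counts the rule makes 17 errors (16 false positives and 1 false negative) and scores 0.17, while the trivial procedure makes 5 errors and scores 0.05. When the event is rare, a binary rule that flags too many patients is easily beaten on the Brier score.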

3. Decision making, threshold models, and utility

A patient who has to make a decision, perhaps for or against surgery, will naturally ask the question, “How likely is it that I will benefit?” This question is answered by a well-calibrated model, and the higher the discrimination ability, the better. A well-calibrated model will give the patient a risk estimate that has the following intuitive interpretation: Among 100 patients that are like me, x will have the event and hence will benefit from the surgery. The risk predicted by the model for this patient is x/100. The patient will incorporate this information into a decision-making process that typically will involve other personal considerations. It is important to recognize that there are many different ways to make a decision, and few people will be able to write down how they do it. We can think of the next patient's decision-making process as a black box. If the risk predicted by a statistical model is put into the box, then it depends on what else is inside the box and how this information influences decision making.

In the decision curve analysis presented in Lughezzani et al. [1], it is assumed that the interiors of the black boxes of all patients can be described by a very simple single-threshold model: If the predicted risk is above a personal threshold, then decide for surgery, otherwise decide no surgery. This constitutes a crude simplification of a complex process and needs justification. It is not clear if the patient has made a decision for a threshold before or after seeing the predicted risk. In the validation data used by Lughezzani et al. [1], all patients underwent surgery. Consequently, the presented decision curve analysis is not based on real data, and this may limit its interpretation at the population level. Each threshold value in the decision curve analysis corresponds to a hypothetical population in which all patients use this threshold; however, the distribution of the personal thresholds in the study population is unknown. This is an important limitation of decision curve analysis.
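For readers unfamiliar with the method, the quantity plotted in a decision curve is the net benefit at a given threshold: the proportion of true positives minus the proportion of false positives, the latter weighted by the odds of the threshold. The sketch below computes it under the single-threshold assumption discussed above, on invented toy data; the function name is an illustrative choice.

```python
def net_benefit(risks, outcomes, threshold):
    """Net benefit when every patient with predicted risk above the threshold is treated."""
    n = len(outcomes)
    treated = [y for r, y in zip(risks, outcomes) if r > threshold]
    true_pos = sum(treated)
    false_pos = len(treated) - true_pos
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


# Toy data: 10 patients predicted at 10% (1 event), 10 predicted at 30% (3 events).
risks = [0.1] * 10 + [0.3] * 10
outcomes = [1] + [0] * 9 + [1] * 3 + [0] * 7

threshold = 0.2
nb_model = net_benefit(risks, outcomes, threshold)

# "Treat all" reference strategy: give everyone a predicted risk of 1.
nb_treat_all = net_benefit([1.0] * len(outcomes), outcomes, threshold)

print(f"Net benefit at threshold {threshold:.0%}: model = {nb_model:.4f}, treat all = {nb_treat_all:.4f}")
```

Each point on a decision curve repeats this calculation with one threshold applied to the whole sample, which is the hypothetical population described in the text.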

Suppose, as is likely, that the decision curve of a recalibrated version of the Gallina nomogram crosses that of the Partin tables. If we are determined to choose between the tools, we would like to summarize the decision curve analyses according to the distribution of the personal thresholds in the population. To make this point clear, suppose there are only two types of patients: One-third of the population has a personal threshold of 1.0%, and two-thirds have a personal threshold of 4.5%. If the Gallina nomogram had higher net benefit at 1.0% and lower net benefit at 4.5%, compared with the Partin tables, then a decision maker would tend to prefer the Partin tables, which perform better for the larger group of patients. However, because it is unknown how many patients have which threshold, the decision curves alone cannot settle the choice.
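If the distribution of personal thresholds were known, the two crossing curves could be summarized by a threshold-weighted average of the net benefit. The sketch below does this for the two-type population described above, reading the thresholds of 1.0 and 4.5 as percentages. All net benefit values are invented for illustration; they are not results from any study.

```python
# Share of the population using each personal threshold (from the example above).
threshold_share = {0.010: 1 / 3, 0.045: 2 / 3}

# Hypothetical net benefit of each tool at each threshold (made-up numbers).
net_benefit = {
    "Gallina (recalibrated)": {0.010: 0.050, 0.045: 0.020},
    "Partin": {0.010: 0.048, 0.045: 0.024},
}

# Population-level summary: average net benefit over the threshold distribution.
expected_net_benefit = {
    tool: sum(share * by_threshold[t] for t, share in threshold_share.items())
    for tool, by_threshold in net_benefit.items()
}

preferred = max(expected_net_benefit, key=expected_net_benefit.get)

for tool, value in expected_net_benefit.items():
    print(f"{tool}: expected net benefit = {value:.4f}")
print(f"Preferred tool under this threshold distribution: {preferred}")
```

With these invented numbers the Partin tables come out ahead, but the ranking depends on both the shares and the size of the differences in net benefit, neither of which can be read off the decision curves.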

4. The cost–benefit ratio

Recall that all considerations of the decision curve analysis are made under the assumption that all patients use a simple threshold model. A still simple but arguably more realistic decision process is as follows: If the predicted risk is below a certain threshold, say 3%, then the patient will decide against surgery, and the exact value of the predicted risk does not matter. If the predicted risk is above this threshold but below a second threshold, say 80%, then the patient will let the decision be dependent on the absolute value of the predicted risk and on other personal considerations (eg, the costs or side effects of the surgery). If the predicted risk is >80%, then the patient will decide for surgery regardless of the costs and independent of what his personal a priori decision threshold was.

For those who do not want to work with a simple threshold model for personal decisions and still are interested in assessing the utility of the risk prediction model, we should mention that it is possible to introduce into the Brier score a potential imbalance between the benefits of the surgery performed for a diseased man and the costs of treating a healthy man [7]. Note first that the Brier score is the population average of the patients’ individual residuals, which are defined as the squared difference between actual binary outcome (0 = SVI is not present, 1 = SVI is present) and the risk of SVI predicted by the model. In its usual form, the Brier score weighs all residuals equally, relative to the sample size. The Brier score can be modified to put relatively more weight (eg, double weight) on residuals from men who are diseased (and therefore need surgery) than on residuals from those who are disease-free. It should be noted that the risk prediction model that is optimal with respect to such a cost–benefit-modified Brier score cannot be calibrated at the same time. This shows that utility and calibration are fundamentally different.
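The last claim can be made concrete with a small numerical sketch. The weighting scheme below (a fixed weight on the residuals of diseased men, normalized by the sum of weights) is one plausible form of a cost–benefit-modified Brier score and is an assumption here; the exact definition in [7] is not reproduced. The data are a toy sample with 20% prevalence.

```python
def weighted_brier(risks, outcomes, case_weight=2.0):
    """Brier score with extra weight on residuals of diseased patients (outcome = 1)."""
    weights = [case_weight if y == 1 else 1.0 for y in outcomes]
    total = sum(w * (r - y) ** 2 for w, r, y in zip(weights, risks, outcomes))
    return total / sum(weights)


# Toy sample: 20 men, 4 with SVI (prevalence 20%).
outcomes = [1] * 4 + [0] * 16
prevalence = sum(outcomes) / len(outcomes)

# Search for the constant prediction that minimizes each score.
grid = [i / 1000 for i in range(1001)]
best_unweighted = min(grid, key=lambda c: weighted_brier([c] * 20, outcomes, case_weight=1.0))
best_weighted = min(grid, key=lambda c: weighted_brier([c] * 20, outcomes, case_weight=2.0))

print(f"Prevalence:                          {prevalence:.3f}")
print(f"Best constant, usual Brier score:    {best_unweighted:.3f}")
print(f"Best constant, double weight cases:  {best_weighted:.3f}")
```

Under the usual Brier score the best constant prediction is the prevalence (0.2), which is calibrated. With double weight on cases the minimum moves to about one-third, so the prediction that is optimal for the weighted score systematically overstates the risk.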

5. Conclusions

Risk prediction models can be assessed on one or all of the three fundamental levels: discrimination, calibration, and utility. The AUC summarizes discrimination only; the Brier score reflects both calibration and discrimination. Naturally, these traditional measures will pick the same winning model only if all models are calibrated. A recalibrated version of the Gallina nomogram will likely have similar performance to the Partin tables, hence it is likely that the decision curve analyses will yield crossing curves and thus cannot pick a clear winner.

Conflicts of interest

The authors have nothing to disclose.

References

  • [1] G. Lughezzani, K.C. Zorn, L. Budäus, et al. Comparison of three different tools for prediction of seminal vesicle invasion at radical prostatectomy. Eur Urol. 2012;62:590-596.
  • [2] K.C. Zorn, U. Capitanio, C. Jeldres, et al. Multi-institutional external validation of seminal vesicle invasion nomograms: head-to-head comparison of Gallina nomogram versus 2007 Partin tables. Int J Radiat Oncol Biol Phys. 2009;73:1461-1467.
  • [3] S.J. Senn. Dichotomania: an obsessive compulsive disorder that is badly affecting the quality of analysis of pharmaceutical trials. Presented at: 55th Session of the International Statistical Institute; April 5–12, 2005; Sydney, Australia.
  • [4] P. Royston, D.G. Altman, W. Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25:127-141.
  • [5] L.J. Savage. Elicitation of personal probabilities and expectations. J Am Stat Assoc. 1971;66:783-801.
  • [6] D.J. Hand. Construction and assessment of classification rules. Chichester, UK: John Wiley; 1997.
  • [7] T.A. Gerds, T. Cai, M. Schumacher. The performance of risk prediction models. Biometrical J. 2008;50:457-479.

Footnotes

a Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA

b Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark

* Corresponding author. Cleveland Clinic, 9500 Euclid Avenue, JJN3-01, Cleveland, OH 44195, USA. Tel. +1 216 444 0584; Fax: +1 216 539 4731.
