Comparison of Paired ROC Curves through a Two-Stage Test

The area under the receiver operating characteristic (ROC) curve (AUC) is a popularly used index when comparing two ROC curves. Statistical tests based on it for analyzing the difference have been well developed. However, this index is less informative when two ROC curves cross and have similar AUCs. In order to detect differences between ROC curves in such situations, a two-stage nonparametric test that uses a shifted area under the ROC curve (sAUC), along with AUCs, is proposed for paired designs. The new procedure is shown, numerically, to be effective in terms of power under a wide range of scenarios; additionally, it outperforms two conventional ROC-type tests, especially when two ROC curves cross each other and have similar AUCs. Larger sAUC implies larger partial AUC at the range of low false-positive rates in this case. Because high specificity is important in many classification tasks, such as medical diagnosis, this is an appealing characteristic. The test also implicitly analyzes the equality of two commonly used binormal ROC curves at every operating point. We also apply the proposed method to synthesized data and two real examples to illustrate its usefulness in practice.


INTRODUCTION
Evaluating the diagnostic performance of markers or classifiers becomes increasingly important when new diagnostic tools are developed or new markers are proposed to enhance existing diagnostic methods. The receiver operating characteristic (ROC) curve is a popular statistical tool for this purpose. For a given diagnostic test, the ROC curve expresses the true positive rate (TPR) as a function of its corresponding false-positive rate (FPR). It demonstrates the tradeoff between TPR (sensitivity) and FPR (1 - specificity).
Since ROC curves involve all possible threshold values for a decision, calibration of these values is embedded in ROC analysis. Beam and Wieand (1991) approached this problem by comparing the TPRs for fixed FPRs. However, it is more common in practice to use methods that consider the ROC curve in its entirety. Under the assumption that the test scores (possibly after a monotonic transformation) of both diseased and non-diseased subjects follow normal distributions, the ROC curve is characterized by two parameters: a slope and an intercept. A comparison of two ROC curves can be accomplished by developing a test of the equality of two slopes and two intercepts. This type of ROC curve is called a binormal ROC curve and has been studied by Metz et al. (1984, 1998), Cai and Moskowitz (2004), and Pepe (2004). A likelihood ratio test for the comparison of two ROC curves was developed by Metz et al. (1984) for paired data (i.e., in cases where each patient received both tests).
An alternative approach, which is frequently used in practice, is to consider the area under the curve (AUC) as a measure of accuracy. The AUC summarizes the overall validity of a diagnostic test and represents the probability that the test assigns a higher score to a diseased individual than to a healthy individual (Bamber, 1975). For paired experiments, Hanley and McNeil (1983) developed a test that accounts for the correlations induced by the paired design. DeLong et al. (1988) developed a fully nonparametric approach with nonparametric estimates of all covariances, which leads to a test statistic with an asymptotically standard normal distribution. More recent studies on the AUC include three permutation tests by Bandos et al. (2005), Bandos et al. (2006), and Braun and Alonzo (2008). Several software programs have been released that compare the differences between AUCs, including the pROC package in R and S+ by Robin et al. (2011).
However, statistical tests based only on the AUC are not appropriate when the two ROC curves cross each other and their AUCs are similar. Venkatraman and Begg (1996) suggested a distribution-free procedure to test the equality of two ROC curves at every operating point in the paired design for continuous markers. Their test is more powerful than the AUC test of DeLong et al. (1988) when two curves have similar AUCs. However, as pointed out by Bandos et al. (2005), this test is not as powerful as the conventional AUC test when one ROC curve is uniformly superior to another (i.e., one ROC curve is always located in a higher position than another), and it is computationally burdensome when the sample size is large. Moreover, although the null hypothesis of this general test is the equality of the entire ROC curves, a rejection does not indicate over which part of the FPR range the two curves differ.
The partial area under the ROC curve (pAUC) is a more practical index than AUC in situations when some specific range of FPR is of interest. For example, the area over a limited region of very low FPRs is of great interest to those investigating cancer screenings. However, inference on the pAUC is more difficult than on the AUC. McClish (1989) and Thompson and Zucchini (1989) proposed a parametric estimator of pAUC and its variance using a binormal model. Unfortunately, Walsh (1997) demonstrated that inferences derived from the binormal assumption are sensitive to model misspecification and to the location of decision thresholds. Wieand et al. (1989) proposed a nonparametric method for pAUC and its asymptotic variance, but their method has not been effectively applied because of its mathematical complexity (He and Escobar, 2008). Zhang et al. (2002) suggested a much simpler nonparametric method than that of Wieand et al. (1989) for deriving the variance of pAUC. However, their method was developed for ordinal data. Dodd and Pepe (2003) proposed estimating the variance of pAUC by bootstrap resampling for continuous data, and Robin et al. (2011) built such a test of pAUC into their package "pROC", but it suffers from the same computational intensity as the permutation test.
By definition, the pAUC completely ignores the performance of a marker outside the specified interval. In some cases, it is useful to retain some information from the entire FPR range, with emphases on clinically interesting ranges. In this sense, we propose using a shifted area under the ROC curve (sAUC), which focuses on the differences of two groups that are bigger than a given level. sAUC is a special case of a modified area under the ROC curve defined by Yu et al. (2013). The sAUC has an interesting property: it associates with the pAUC within a range of low FPRs. This property is important when differentiating two ROC curves, especially when two diagnostic tests have similar AUCs while their ROC curves are quite different.
Considering this, we propose a two-stage test to determine the difference between two ROC curves under different situations. First, a conventional, nonparametric AUC test (as is found in DeLong et al., 1988) will be performed, and then a sAUC test will be conducted if the AUC test does not significantly reject the null hypothesis. The sAUC test is also fully nonparametric and easily implemented. If the conventional AUC test does not reject the null hypothesis and the sAUC test does, then it can be concluded that the diagnostic test/marker with a higher sAUC also has a larger partial AUC in the lower FPR range. The global nominal test size can be allocated differently in two stages based on the asymptotic joint distribution of the two test statistics.
The rest of this article is organized as follows: we define sAUC, demonstrate its properties, and establish a two-stage test in Section 2. Simulation studies are given in Section 3.1 and examples with real datasets are reported in Section 3.2. The discussions are provided in Section 4.

METHOD
Throughout the remainder of this article, we refer to the diagnostic tests, biomarkers, and classifiers as "markers" for simplicity. Suppose that Y and X are random scores for diseased and non-diseased subjects, respectively. Without loss of generality, a sample will be classified as a positive one if its score is greater than a threshold c. Then, TPR and FPR are defined as TPR(c) = Pr(Y > c) and FPR(c) = Pr(X > c), respectively, and AUC is defined as $\int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(t))\,dt$. The partial AUC over a specific false-positive rate range (a, b) is defined as $\int_a^b \mathrm{TPR}(\mathrm{FPR}^{-1}(t))\,dt$.
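The two integrals above can be checked numerically. The sketch below is only an illustration under an assumed binormal model (the marker parameters are hypothetical and not taken from the article): it evaluates the ROC curve $\Phi(a + b\,\Phi^{-1}(t))$, integrates it with a midpoint rule, and compares the full area with the standard binormal closed form $\Phi(a/\sqrt{1+b^2})$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf is Phi, inv_cdf is Phi^{-1}


def binormal_roc(t, a, b):
    """TPR at false-positive rate t for a binormal ROC curve (intercept a, slope b)."""
    return STD_NORMAL.cdf(a + b * STD_NORMAL.inv_cdf(t))


def partial_auc(a, b, lo=0.0, hi=1.0, steps=20000):
    """Midpoint-rule approximation of the area under the ROC curve over (lo, hi).

    Midpoints are used so that inv_cdf is never evaluated at exactly 0 or 1.
    """
    width = (hi - lo) / steps
    return sum(binormal_roc(lo + (k + 0.5) * width, a, b) for k in range(steps)) * width


# Hypothetical marker: diseased scores N(1, 1), non-diseased scores N(0, 1).
mu_y, sigma_y = 1.0, 1.0
a, b = mu_y / sigma_y, 1.0 / sigma_y

auc_numeric = partial_auc(a, b)                       # full AUC: range (0, 1)
auc_closed = STD_NORMAL.cdf(a / (1 + b * b) ** 0.5)   # binormal closed form
pauc_low = partial_auc(a, b, 0.0, 0.2)                # pAUC over FPR in (0, 0.2)
```

The pAUC over (0, 0.2) can never exceed 0.2, the width of the FPR range, which is one reason raw pAUC values are harder to interpret than AUC values.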

AUC and Shifted AUC
Following the above notation, AUC can be written as P(Y > X) (Bamber, 1975). Since AUC cannot differentiate two crossed ROC curves, we define an auxiliary index, the shifted area under an ROC curve (sAUC), as $\mathrm{sAUC} = P(Y > X + \delta)$, where $\delta = z_{1-\alpha/2}\,\sigma_X$, $z_\alpha$ is the αth quantile of a standard normal distribution, and $\sigma_X$ is the standard deviation of the non-diseased group. If we treat the location-shifted scores for non-diseased subjects, X + δ, as scores from another non-diseased group, we can define a new "ROC curve" by Y and X + δ. We name it a shifted ROC curve. Suppose two markers, $M_1$ and $M_2$, follow binormal distributions. Their ROC curves are shown in Fig. 1(a) and their shifted ROC curves in Fig. 1(b). The two AUCs are set to be equal, but $M_2$ has a larger area under its shifted ROC curve, that is, a larger sAUC. This demonstrates that sAUC can be related to pAUC with FPR in the low range.
As seen in Fig. 1, for both ROC curves and shifted ROC curves, $M_2$ has the higher curve to the left of the cross point. Furthermore, the cross point of the shifted ROC curves lies to the right of the cross point of the original ROC curves. Thus, shifted ROC curves emphasize differences at relatively low FPRs more than ROC curves do. More precisely, in Fig. 1, by simple arithmetic, the FPR at the cross point of the ROC curves and the FPR at the cross point of the shifted ROC curves are $\Phi(A)$ and $\Phi(A + z_{1-\alpha/2})$, respectively (see Lemma 2.3 in Yu et al. (2013)). This implies that the cross point of the shifted ROC curves is shifted to the right from the cross point of the ROC curves and that the shift size is determined by the parameter δ (that is, by $z_{1-\alpha/2}\,\sigma_X$). Suppose that two ROC curves (with equal AUCs) cross at FPR = 0.5. In this extreme case, A = 0 and the shifted ROC curves cross at FPR = $\Phi(z_{1-\alpha/2})$. Then, the area of the right segment of the shifted ROC curve is less than $1 - \Phi(z_{1-\alpha/2}) = \alpha/2$. Therefore, if α is small, the sAUC will be close to the area of the left segment of the shifted ROC curve, as the right segment area becomes negligible. However, we should not specify α to be arbitrarily small. To ensure that the sAUC is close to the left segment area, we expect the sAUC to be at least as large as α, that is, $\Phi\left(\frac{\mu - z_{1-\alpha/2}}{\sqrt{1+\sigma^2}}\right) > \alpha$. (Here we assume that the scores are standardized such that the non-diseased scores have mean zero and variance 1, and µ and σ² are the mean and variance of the diseased scores.) Simple calculations show that α should satisfy $\mu > z_{1-\alpha/2} - z_{1-\alpha}\sqrt{1+\sigma^2}$ (inequality 2.2). Therefore, we should select α to be as small as possible while satisfying inequality 2.2. α = 0.05 is a good choice when µ > 0 and σ² > 0.5, and the corresponding shift is $\delta = 1.96\,\sigma_X$. The condition µ > 0 is a reasonable assumption, since a practical AUC value should be greater than 0.5. Note that if σ² < 0.5, a smaller α should be selected.
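The requirement that the sAUC exceed α can be checked directly for given parameter values. The following is a minimal sketch, assuming standardized non-diseased scores N(0, 1), diseased scores N(µ, σ²), and the shift δ = z_{1-α/2}; the function names are hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def binormal_sauc(mu, sigma2, alpha):
    """sAUC = P(Y > X + delta) for X ~ N(0, 1), Y ~ N(mu, sigma2), delta = z_{1-alpha/2}."""
    delta = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf((mu - delta) / (1 + sigma2) ** 0.5)


def alpha_is_admissible(mu, sigma2, alpha):
    """True when the sAUC exceeds alpha, the requirement stated in the text."""
    return binormal_sauc(mu, sigma2, alpha) > alpha


# Boundary-type cases with mu = 0 (an uninformative marker in terms of location).
ok_at_half = alpha_is_admissible(0.0, 0.5, 0.05)      # variance 0.5
ok_at_small = alpha_is_admissible(0.0, 0.3, 0.05)     # variance 0.3
```

With µ = 0, the check passes at σ² = 0.5 and fails at σ² = 0.3, which is in line with the recommendation to use a smaller α when the diseased variance is small.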
Since the variance of the diseased group in our numerical study is larger than 0.5, we choose $\delta = 1.96\,\sigma_X$. The weighted average of the AUC and the sAUC defines the modified area under the ROC curve (mAUC) of Yu et al. (2013),
$$\mathrm{mAUC}(\lambda) = (1-\lambda)\,\mathrm{AUC} + \lambda\,\mathrm{sAUC}, \qquad (3)$$
where 0 ≤ λ ≤ 1; a similar graphic illustration can be found therein. Note that sAUC = mAUC(1).
Relationship with pAUC. So far, we have illustrated that the sAUC can be related to pAUC and that the shift size can be specified to ascertain this relationship. Now, we demonstrate this relationship under binormal distributions.
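Suppose that the scores of diseased and non-diseased subjects, possibly after monotonic transformations, follow $N(\mu_Y, \sigma_Y^2)$ and $N(\mu_X, \sigma_X^2)$, respectively, for marker $M_1$, and $N(\mu_{\tilde Y}, \sigma_{\tilde Y}^2)$ and $N(\mu_{\tilde X}, \sigma_{\tilde X}^2)$, respectively, for marker $M_2$. Let $\mathrm{AUC}_i$ and $\mathrm{sAUC}_i$ denote the AUC and sAUC for marker $M_i$, i = 1, 2. We can give analytical formulas for the AUC and the sAUC (and hence the mAUC) of each marker: for example, for $M_1$, $\mathrm{AUC}_1 = \Phi\left(\frac{\mu_Y - \mu_X}{\sqrt{\sigma_X^2 + \sigma_Y^2}}\right)$ and $\mathrm{sAUC}_1 = \Phi\left(\frac{\mu_Y - \mu_X - \delta}{\sqrt{\sigma_X^2 + \sigma_Y^2}}\right)$. If $\mathrm{AUC}_1 = \mathrm{AUC}_2$, then $\mathrm{sAUC}_1 > \mathrm{sAUC}_2 \Leftrightarrow \mathrm{ROC}_1(\Phi(A)) \succ \mathrm{ROC}_2(\Phi(A))$, where $\Phi(A)$ is the FPR of the cross point of the two ROC curves, with A determined by the binormal parameters of the two markers (among them $(\mu_{\tilde Y} - \mu_{\tilde X})/\sigma_{\tilde Y}$). The notation $\mathrm{ROC}_1(t) \succ \mathrm{ROC}_2(t)$ denotes that the ROC curve of $M_1$ is above the curve of $M_2$ when 0 ≤ FPR ≤ t. The proof is a special case of the result described in Yu et al. (2013).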
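A small numerical illustration of this relationship may help. The sketch below assumes standardized non-diseased scores N(0, 1), a fixed shift of 1.96, and two hypothetical binormal markers constructed to share an AUC of 0.70; none of these parameter values are taken from the article's tables.

```python
from statistics import NormalDist

PHI = NormalDist()
DELTA = 1.96  # shift for standardized non-diseased scores N(0, 1)


def auc(mu, sigma):
    """Binormal AUC when non-diseased ~ N(0, 1) and diseased ~ N(mu, sigma^2)."""
    return PHI.cdf(mu / (1 + sigma ** 2) ** 0.5)


def sauc(mu, sigma, delta=DELTA):
    """Binormal sAUC = P(Y > X + delta) under the same model."""
    return PHI.cdf((mu - delta) / (1 + sigma ** 2) ** 0.5)


def roc(t, mu, sigma):
    """TPR at false-positive rate t under the same model."""
    return PHI.cdf((mu + PHI.inv_cdf(t)) / sigma)


# Two hypothetical markers (mu, sigma) with the same AUC of 0.70.
target = PHI.inv_cdf(0.70)
m1 = (target * 2 ** 0.5, 1.0)   # diseased SD 1
m2 = (target * 5 ** 0.5, 2.0)   # diseased SD 2

same_auc = abs(auc(*m1) - auc(*m2)) < 1e-12
s1, s2 = sauc(*m1), sauc(*m2)
low_fpr_gap = roc(0.10, *m2) - roc(0.10, *m1)   # positive if M2 is higher at FPR = 0.10
```

Under these assumptions the marker with the larger diseased variance has the larger sAUC and also the higher ROC curve at FPR = 0.10, even though the two AUCs coincide.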

Nonparametric Test Using AUC and sAUC
Suppose that there are n diseased subjects and m non-diseased subjects. Let $Y_i$ and $X_j$ denote the risk scores of the i-th and j-th subjects in the diseased and the non-diseased groups, respectively. As a special case of mAUC, a nonparametric estimator of sAUC would be defined as
$$\widehat{\mathrm{sAUC}} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\psi(Y_i, X_j + \delta),$$
where $\psi(y, x) = 1$ if y > x, 1/2 if y = x, and 0 otherwise. In order to construct the covariance of the $\widehat{\mathrm{sAUC}}$s for paired designs, we use the asymptotic property of the linear combination of empirical AUCs under the paired designs of DeLong et al. (1988). First, we establish the asymptotic distribution for the linear combination of $\widehat{\mathrm{mAUC}}$s. Then, the distribution for $\widehat{\mathrm{sAUC}}$ follows as a special case.
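The estimator described here amounts to a Mann-Whitney count on shifted scores. The following is a minimal sketch, assuming the usual kernel (1 for a win, 1/2 for a tie, 0 otherwise) and a shift estimated as c times the sample standard deviation of the non-diseased scores; the function names are hypothetical.

```python
from statistics import stdev


def psi(y, x):
    """Mann-Whitney kernel: 1 if y > x, 1/2 if tied, 0 otherwise."""
    if y > x:
        return 1.0
    return 0.5 if y == x else 0.0


def empirical_mauc(diseased, healthy, lam=1.0, c=1.96):
    """(1 - lam) * AUC-hat + lam * sAUC-hat; lam = 0 gives the AUC, lam = 1 the sAUC."""
    delta_hat = c * stdev(healthy)          # estimated shift size
    pairs = len(diseased) * len(healthy)
    auc_hat = sum(psi(y, x) for y in diseased for x in healthy) / pairs
    sauc_hat = sum(psi(y, x + delta_hat) for y in diseased for x in healthy) / pairs
    return (1 - lam) * auc_hat + lam * sauc_hat


# Toy data: perfectly separated groups, so the AUC is 1 but the sAUC is smaller.
toy_diseased = [3, 4, 5]
toy_healthy = [0, 1, 2]
toy_auc = empirical_mauc(toy_diseased, toy_healthy, lam=0.0)
toy_sauc = empirical_mauc(toy_diseased, toy_healthy, lam=1.0)
```

In the toy data the sample standard deviation of the non-diseased scores is 1, so the shift is 1.96 and only 8 of the 9 pairs still count as wins after shifting.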
Theorem 2.1 Let m and n be the sample sizes for the diseased and the non-diseased groups, respectively, and let $\theta$, $\hat\theta$, $L$, and $\tilde L$ be defined as above. Let N = m + n. Then, if $\lim_{N\to\infty} m/n$ is bounded and non-zero, the asymptotic distribution of a linear combination of $\widehat{\mathrm{mAUC}}(\lambda)$s is as follows, where $S_{10}$ and $S_{01}$ are $2p \times 2p$ matrices with (r, t)th elements $S_{10}^{rt}$ and $S_{01}^{rt}$, respectively (r, t = 1, 2, ..., 2p).
Because $L'\hat\theta = \tilde L'\hat A$, which is a linear combination of empirical AUCs, Theorem 2.1 follows directly from DeLong et al. (1988). In particular, if $L = (1, -1)'$ and λ = 1 (λ = 0), then $L'\hat\theta$ becomes the difference of the two $\widehat{\mathrm{sAUC}}$s ($\widehat{\mathrm{AUC}}$s). According to Theorem 2.1, we can perform a large-sample z-test using $L'\hat\theta$ and construct a confidence interval for $L'\theta$.
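A large-sample z-test of this kind can be sketched with the structural components of DeLong et al. (1988). The code below is an illustration under stated assumptions, not the authors' implementation: it handles only the contrast L = (1, -1)', uses shifts estimated from each marker's non-diseased scores, and assumes that the lists for the two markers are aligned by subject. All function names and the toy data are hypothetical.

```python
from statistics import NormalDist, stdev, variance


def psi(y, x):
    """Mann-Whitney kernel: 1 if y > x, 1/2 if tied, 0 otherwise."""
    return 1.0 if y > x else (0.5 if y == x else 0.0)


def components(diseased, healthy, c):
    """Structural components for one marker after shifting the healthy scores."""
    delta = c * stdev(healthy) if c else 0.0
    v10 = [sum(psi(y, x + delta) for x in healthy) / len(healthy) for y in diseased]
    v01 = [sum(psi(y, x + delta) for y in diseased) / len(diseased) for x in healthy]
    return v10, v01


def paired_z(d1, h1, d2, h2, c=0.0):
    """z-statistic and two-sided p-value for the AUC (c = 0) or sAUC (c > 0) difference."""
    a10, a01 = components(d1, h1, c)
    b10, b01 = components(d2, h2, c)
    d10 = [p - q for p, q in zip(a10, b10)]   # per-diseased-subject contrasts
    d01 = [p - q for p, q in zip(a01, b01)]   # per-healthy-subject contrasts
    diff = sum(d10) / len(d10)                # difference of the two empirical areas
    # For L = (1, -1)', L'S10L and L'S01L are the sample variances of the contrasts.
    var = variance(d10) / len(d10) + variance(d01) / len(d01)
    z = diff / var ** 0.5
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Toy paired data: position i in d1 and d2 (or h1 and h2) is the same subject.
d1 = [2.0, 3.1, 1.2, 4.0, 2.6]
d2 = [1.0, 2.9, 2.2, 3.5, 0.4]
h1 = [0.1, 1.5, 0.8, 2.2, 0.3, 1.1]
h2 = [0.5, 1.7, 0.2, 2.5, 1.0, 0.9]
z_auc, p_auc = paired_z(d1, h1, d2, h2)             # AUC comparison
z_sauc, p_sauc = paired_z(d1, h1, d2, h2, c=1.96)   # sAUC comparison
```

Working with per-subject contrasts avoids building the full covariance matrices, at the cost of supporting only the two-marker difference. The statistic is undefined when both contrast vectors are constant, for example when a marker is compared with itself.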

A Two-Stage Test
Before defining our two-stage test, we argue that neither a single AUC test nor a single sAUC test is appropriate for detecting the difference between the two ROC curves. As mentioned earlier, AUCs can be similar when two ROC curves are quite different; the same is true for sAUCs. Generally speaking, an ROC curve cannot be characterized by a single parameter. Even for "binormal" ROC curves, the curve is determined by two parameters, a slope and an intercept (Cai and Moskowitz, 2004; Metz et al., 1984, 1998; Pepe, 2004). Therefore, to declare the equivalence of two ROC curves, we require at least two equations. For this reason, single AUC and sAUC tests are both invalid in certain situations (see Section 3.1 for a detailed explanation).
However, the sAUC can be used as an auxiliary index to the AUC when the AUC test is not informative. The following theorem implies that identical sAUCs and AUCs together guarantee identical ROC curves when the normality assumptions are satisfied.
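Theorem 2.2 Suppose there are two markers, $M_1$ and $M_2$. Diseased and non-diseased scores follow $N(\mu_Y, \sigma_Y^2)$ and $N(\mu_X, \sigma_X^2)$, respectively, for marker $M_1$, and $N(\mu_{\tilde Y}, \sigma_{\tilde Y}^2)$ and $N(\mu_{\tilde X}, \sigma_{\tilde X}^2)$, respectively, for marker $M_2$. Let $\delta = c\,\sigma_X$ and $\tilde\delta = c\,\sigma_{\tilde X}$, where c is a positive constant such as 1.96. The following two declarations are equivalent: (i) $\mathrm{AUC}_1 = \mathrm{AUC}_2$ and $\mathrm{sAUC}_1 = \mathrm{sAUC}_2$; (ii) the ROC curves of $M_1$ and $M_2$ are identical.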
The proof is given in the supplemental material. Theorem 2.2 explains that, if both scores of diseased and non-diseased subjects follow (or after a monotonic transformation follow) normal distributions, identical sAUCs and AUCs simultaneously guarantee the identical binormal ROC curves, and vice versa.
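The two-stage, two-sided testing procedure that we propose is composed of two nonparametric tests, as follows:
1. DeLong's AUC test: $H_{10}$: $\mathrm{AUC}_1 = \mathrm{AUC}_2$ vs. $H_{11}$: $\mathrm{AUC}_1 \neq \mathrm{AUC}_2$. The test statistic is $Z_1 = (\widehat{\mathrm{AUC}}_1 - \widehat{\mathrm{AUC}}_2)/\mathrm{se}(\widehat{\mathrm{AUC}}_1 - \widehat{\mathrm{AUC}}_2)$.
2. sAUC test: $H_{20}$: $\mathrm{sAUC}_1 = \mathrm{sAUC}_2$ vs. $H_{21}$: $\mathrm{sAUC}_1 \neq \mathrm{sAUC}_2$. The test statistic is $Z_2 = (\widehat{\mathrm{sAUC}}_1 - \widehat{\mathrm{sAUC}}_2)/\mathrm{se}(\widehat{\mathrm{sAUC}}_1 - \widehat{\mathrm{sAUC}}_2)$.
Here the standard errors (se) of $\widehat{\mathrm{AUC}}_1 - \widehat{\mathrm{AUC}}_2$ and $\widehat{\mathrm{sAUC}}_1 - \widehat{\mathrm{sAUC}}_2$ are obtained from the denominator of the left side of equation 2.8. The marginal distributions of $Z_1$ and $Z_2$ (determined by Theorem 2.1) are both standard normal under $H_{10}$ and $H_{20}$, respectively. The proposed two-stage test is illustrated in Fig. 2. If $H_{10}$ is rejected, we can conclude that the two markers have different AUCs and therefore different ROC curves. If we fail to reject $H_{10}$ but reject $H_{20}$, then we can conclude that the two markers have similar AUCs, but the one with the larger $\widehat{\mathrm{sAUC}}$ has a larger partial AUC in a range of low FPRs. Or, to be more precise, if both markers are normally distributed, then by Corollary 2.5 in Yu et al. (2013), the one with the larger sAUC will also have a larger partial AUC at FPR ≤ 1 - $\widehat{\mathrm{AUC}}$.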
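The decision logic of the procedure can be written down in a few lines. This is a sketch only: the function name, the returned labels, and the critical values in the usage lines are hypothetical, and in practice the critical values would come from the joint distribution of the two statistics.

```python
def two_stage_decision(z1, z2, c1, c2):
    """Apply the two-stage rule to the AUC statistic z1 and the sAUC statistic z2.

    Stage 1 rejects H10 when |z1| >= c1. Stage 2 is reached only when stage 1
    does not reject, and rejects H20 when |z2| >= c2.
    """
    if abs(z1) >= c1:
        return "stage 1: AUCs differ"
    if abs(z2) >= c2:
        better = 1 if z2 > 0 else 2   # z2 > 0 means marker 1 has the larger sAUC
        return f"stage 2: similar AUCs, marker {better} has the larger sAUC"
    return "no difference detected"


# Illustrative statistics and an illustrative common critical value of 2.2.
example_a = two_stage_decision(3.1, 0.4, 2.2, 2.2)
example_b = two_stage_decision(0.3, -2.9, 2.2, 2.2)
example_c = two_stage_decision(0.3, 1.0, 2.2, 2.2)
```

Note that the sAUC statistic is never consulted when the AUC test already rejects, which is what makes the procedure hierarchical rather than a simultaneous test of both hypotheses.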

Asymptotic Joint Distribution of Two Test Statistics.
Our procedure is a two-stage hierarchical test. The global test size α is defined as the probability of incorrectly rejecting the null hypothesis H 10 or H 20 . The size α can be split into two parts in various ways; different splitting methods lead to different testing powers. To approximate the global test size α, we require the following theorem, which indicates that the asymptotic joint distribution of Z 1 and Z 2 is a bivariate normal distribution.
We consider two markers: let $\{Y_i\}$, i = 1,..., n, and $\{X_j\}$, j = 1,..., m, denote the first marker's scores for the diseased and non-diseased subjects, respectively. Similarly, let $\{\tilde Y_k\}$, k = 1,..., n, and $\{\tilde X_l\}$, l = 1,..., m, denote the corresponding scores of the second marker. Define $\varphi^{(1)}(Y_i, X_j; \tilde Y_k, \tilde X_l) = \psi(Y_i, X_j) - \psi(\tilde Y_k, \tilde X_l)$ and $\varphi^{(2)}(Y_i, X_j; \tilde Y_k, \tilde X_l) = \psi(Y_i, X_j + \delta) - \psi(\tilde Y_k, \tilde X_l + \tilde\delta)$, where $\psi(y, x)$ was defined in equation 7. Denote $\hat\delta = cS_X$ and $\hat{\tilde\delta} = cS_{\tilde X}$, where $S_X$ and $S_{\tilde X}$ are the sample standard deviations of the $X$s and $\tilde X$s, and c is any given positive constant. Then, we have the following theorem, whose proof is given in the supplemental material.
Moreover, we can estimate the covariance components by extending the estimation method proposed by DeLong et al. (1988): the components are consistently approximated by $S^{rs}_{1000}$, $S^{rs}_{0100}$, $S^{rs}_{0010}$, and $S^{rs}_{0001}$, respectively, where $S^{rs}_{0100}$, $S^{rs}_{0010}$, and $S^{rs}_{0001}$ are defined in a similar way to $S^{rs}_{1000}$.
Suppose now that the two markers are obtained in a paired design, so that the score pairs $W_i = (Y_i, \tilde Y_i)$ and $V_j = (X_j, \tilde X_j)$ (i = 1,..., n and j = 1,..., m) are independent of each other. Define $\varphi^{(1)}(W_i, V_j) = \psi(Y_i, X_j) - \psi(\tilde Y_i, \tilde X_j)$ and $\varphi^{(2)}(W_i, V_j) = \psi(Y_i, X_j + \delta) - \psi(\tilde Y_i, \tilde X_j + \tilde\delta)$, where $\psi(y, x)$ was defined in equation 7. Denote $\hat\delta = cS_X$ and $\hat{\tilde\delta} = cS_{\tilde X}$, where $S_X$ and $S_{\tilde X}$ are the sample standard deviations of the $X$s and $\tilde X$s, and c is any given positive constant. In this case, we have the following theorem.
Theorem 2.4 Let $\{W_i\}$ and $\{V_j\}$ be independent samples from distributions F and G, respectively, with $W_i = (Y_i, \tilde Y_i)$ and $V_j = (X_j, \tilde X_j)$ (i = 1,..., n and j = 1,..., m). Let N = n + m. If n/N is bounded and non-zero as $N \to \infty$, then the joint distribution of $\left(\widehat{\mathrm{AUC}}_1 - \widehat{\mathrm{AUC}}_2 - (\mathrm{AUC}_1 - \mathrm{AUC}_2),\ \widehat{\mathrm{sAUC}}_1 - \widehat{\mathrm{sAUC}}_2 - (\mathrm{sAUC}_1 - \mathrm{sAUC}_2)\right)$, scaled by $\sqrt{N}$, is asymptotically bivariate normal with a mean of zero and covariance matrix $\Sigma = (\sigma_{rs})$, where $\sigma_{rs} = (N/n)\,\xi^{rs}_{10} + (N/m)\,\xi^{rs}_{01}$. Here $\xi^{rs}_{ab}$ is the covariance between $\varphi^{(r)}(W_i, V_j)$ and $\varphi^{(s)}(W_{i'}, V_{j'})$, r, s = 1, 2, with a = I(i = i′) and b = I(j = j′), where I(·) is the indicator function.
Theorems 2.3 and 2.4 are valid whether $\widehat{\mathrm{sAUC}}$ is defined assuming known ($\delta$) or estimated ($\hat\delta$) shift sizes. Proofs for Theorems 2.3 and 2.4 are given in the supplemental material. Similarly, we can estimate the covariance components by extending the estimation method proposed by DeLong et al. (1988), where $\xi^{rs}_{10}$ and $\xi^{rs}_{01}$ can be consistently approximated by $S^{rs}_{10}$ and $S^{rs}_{01}$, respectively. According to Theorem 2.4, the joint distribution of $Z_1$ and $Z_2$ is asymptotically a standard bivariate normal distribution with a correlation γ obtained using the above covariances. Then, the overall test size α is determined by
$$\alpha = 1 - \Pr(|Z_1| < c_1, |Z_2| < c_2 \mid H_{10}, H_{20}) = 1 - \int_{-c_1}^{c_1}\int_{-c_2}^{c_2} f(x, y \mid \gamma)\,dy\,dx,$$
where $f(x, y \mid \gamma)$ is the pdf of the standard bivariate normal distribution with the correlation γ. The test sizes of the AUC and sAUC tests are denoted by $\alpha_1$ and $\alpha_2$, respectively. We can approximate γ using the estimated asymptotic covariance components $S^{rs}_{10}$ and $S^{rs}_{01}$, which gives the estimator $\hat\gamma$. Simulation studies indicate that this asymptotic estimator, $\hat\gamma$, leads to a test size similar to the nominal size (see Table 1).
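Given γ and a global size α, critical values can be obtained numerically from this relation. The sketch below assumes an equal split α₁ = α₂, which for two standard normal marginals means a common critical value c₁ = c₂ = c. It reduces the double integral to a single integral by conditioning on Z₁ and solves for c by bisection. The function names are hypothetical, and |γ| < 1 is required.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()


def acceptance_prob(c1, c2, gamma, steps=2000):
    """P(|Z1| < c1, |Z2| < c2) for a standard bivariate normal with correlation gamma."""
    scale = sqrt(1 - gamma * gamma)

    def integrand(x):
        # Density of Z1 at x times P(|Z2| < c2 | Z1 = x).
        upper = PHI.cdf((c2 - gamma * x) / scale)
        lower = PHI.cdf((-c2 - gamma * x) / scale)
        return PHI.pdf(x) * (upper - lower)

    h = 2 * c1 / steps   # composite Simpson's rule on [-c1, c1]; steps must be even
    total = integrand(-c1) + integrand(c1)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(-c1 + k * h)
    return total * h / 3


def equal_split_critical_value(alpha, gamma, tol=1e-8):
    """Common critical value c with 1 - P(|Z1| < c, |Z2| < c) = alpha."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - acceptance_prob(mid, mid, gamma) > alpha:
            lo = mid     # size still too large: a larger critical value is needed
        else:
            hi = mid
    return (lo + hi) / 2


c_independent = equal_split_critical_value(0.05, 0.0)
c_correlated = equal_split_critical_value(0.05, 0.5)
```

For γ = 0 the result can be checked against the closed form obtained from (1 - α₁)² = 1 - α. A positive correlation gives a smaller critical value than the independent case, but one that still exceeds the single-test value of 1.96.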

NUMERICAL STUDY
To investigate the proposed two-stage test, we verified that the empirical test size was close to the nominal size and then compared the power of the two-stage test (TwoT) with that of four single-stage tests: DeLong's AUC test (AUCT), the sAUC test (sAUCT), the mAUC(0.5) test, and the permutation test.
Note. Two markers follow the distribution N(u, σ²) for diseased subjects with their correlation ρ. The empirical test size is the rejection rate over 5000 simulations.

Simulation Study
Consider two markers $M_1$ and $M_2$ whose correlation coefficient is ρ = 0 or 0.5. Suppose that the scores for non-diseased individuals are generated from N(0, 1), and the scores for diseased ones follow $N(u_i, \sigma_i^2)$ for the marker $M_i$, where i = 1, 2. To examine the test size, we let the two markers be identically distributed. For the power comparison, we first studied four scenarios: in scenario $S_1$, the two corresponding ROC curves crossed each other while having an identical AUC; in scenario $S_2$, one marker was uniformly superior to the other; in scenario $S_3$, two ROC curves crossed and the mAUC(0.5)s were similar; and in scenario $S_4$, two ROC curves crossed and the sAUCs were similar. Examples of the ROC curves in the four scenarios are drawn in Fig. 3.
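Paired data of this kind can be generated as follows. This is a sketch under assumptions that the text does not fully spell out: the same correlation ρ is applied to the non-diseased pairs as to the diseased pairs, and the function name, parameter values, and seed are hypothetical.

```python
import random
from math import sqrt


def simulate_paired_scores(n, m, means, sds, rho, seed=0):
    """Return (diseased, healthy) lists of (marker 1, marker 2) score pairs.

    Diseased pairs have means `means`, standard deviations `sds`, and correlation
    rho. Healthy pairs are standard normal with the same correlation (an assumption).
    """
    rng = random.Random(seed)

    def correlated_pair():
        # Two standard normal draws with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + sqrt(1 - rho * rho) * rng.gauss(0.0, 1.0)
        return z1, z2

    diseased = []
    for _ in range(n):
        z1, z2 = correlated_pair()
        diseased.append((means[0] + sds[0] * z1, means[1] + sds[1] * z2))
    healthy = [correlated_pair() for _ in range(m)]
    return diseased, healthy


# Hypothetical crossing-curve setting with correlated markers.
sim_diseased, sim_healthy = simulate_paired_scores(
    4000, 4000, means=(1.0, 1.5), sds=(1.0, 2.0), rho=0.5, seed=1
)
```

Repeating such draws many times and recording how often a test rejects gives the empirical size (identically distributed markers) or the empirical power (different markers).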
The empirical test sizes are shown in Table 1; in all cases, the global nominal size α is set to 0.05 or 0.1 and α 1 = α 2 . The empirical test size was the rejection rate of 5000 simulations. The empirical test sizes were similar to the nominal levels. The critical values for the two-stage test procedure are listed in Table 2 for the different global test sizes and different correlations between Z 1 and Z 2 .
The powers of the two-stage test procedure and of the AUC, sAUC, mAUC(0.5), and permutation tests are listed in Tables 3-6 for scenarios $S_1$-$S_4$, respectively. Generally speaking, no single test outperformed the others in all of the scenarios. The single-stage tests using AUC, sAUC, and mAUC(0.5) were invalid in scenarios $S_1$, $S_4$, and $S_3$, respectively, whereas the two-stage test was very powerful in all scenarios even though it did not attain the highest power in some of them. The permutation test was far less powerful than the two-stage test in scenarios $S_1$ and $S_2$. In all scenarios, correlated markers led to higher powers than independent markers for all tests. This was consistent with the findings of Venkatraman and Begg (1996) and Bandos et al. (2005). The difference between the AUCs (sAUCs) of correlated markers was less variable than that of independent markers; therefore, the Z-statistics of the AUC (sAUC) became larger than those of the independent markers and the power (i.e., the rejection rate) correspondingly increased.
In scenario $S_1$ (Table 3), the power of the AUC test was close to the nominal test size when the compared AUCs were equal. This means that the AUC test was not appropriate for detecting different ROC curves in this situation. In this case, the permutation test was more powerful than the AUC test but less powerful than the three mAUC-based tests. The sAUC test displayed the highest power in this scenario, and the two-stage test gave slightly less power than the sAUC test.
In scenario S 2 , all three tests based on the AUC or sAUC detected the difference between two ROC curves effectively. The mAUC(0.5) test had the highest power; the two-stage test had a similar power. As seen in Table 4, the permutation test was not as efficient as the conventional AUC test, which had been pointed out by Bandos et al. (2005). Finally, the permutation test was less powerful than all of the other tests.
In scenario $S_3$ (Table 5), the mAUC(0.5) test was less efficient than the others. The two-stage test and the permutation test performed similarly and had higher powers than the others. In scenario $S_4$ (Table 6), the sAUC test was extremely inferior to the others.

We next considered the ROC curves of three scenarios in which the underlying scores are asymmetrically distributed: mildly skewed, moderately skewed, and bimodal scores generated from mixture normal distributions. The two compared markers have scores in the form of $0.4X_1 + 0.6X_2$ and $0.6X_1 + 0.4X_2$ for the diseased, respectively, where $X_1$ and $X_2$ are from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively. The non-diseased follow standard normal distributions. The distribution parameters and the power for each test are given in Table 7. All sAUC-based tests outperform the AUC test and the permutation test, and the two-stage test still achieves the second-highest power.
In summary, the two-stage test attained a high power in all scenarios, while each of the four single-stage tests was extremely uninformative in certain specific scenarios.
Additionally, our two-stage test had a more meaningful characteristic than the other tests. In scenario S 1 , we noticed that marker M 2 with a larger sAUC had a higher partial ROC curve when the FPR was less than 1 − AUC. For instance, in Figure 3(a), the sAUCs were 0.2 and 0.36 for M 1 and M 2 , respectively, and both of the AUCs were equal to 0.70. The M 2 had a higher partial ROC curve when the FPR was less than 0.3, as expected based on Corollary 2.5 in Yu et al. (2013). This characteristic is meaningful, especially when the feature selection method is based on the AUC and shares many markers with similar AUCs. In these cases, our test procedure could be applied to differentiate the markers.
Our two-stage test incorporates the single-stage AUC test and the sAUC test; in each scenario explained above, either the AUC or the sAUC test was very powerful. This might be the reason that the two-stage test performed uniformly well. The two-stage test is a broad testing framework that consists of the AUC and sAUC tests, examining the "path" between them. That is, as $\alpha_2$ increases from 0 to α, the two-stage test moves from the AUC test to the sAUC test. Assigning different sizes to each stage may be better, depending on the scenario. The following are guidelines to split the test size based on these findings: In most practical situations, before implementing the test, people are unsure about which scenario the two compared ROC curves belong to; if they were sure, the two-stage test would not be necessary. Therefore, we highly recommend the proposed two-stage test with identical sizes for each stage.
Note. Two markers follow the distribution $N(u_i, \sigma_i^2)$ for diseased subjects, respectively (i = 1, 2), with their correlation ρ. The power is the rejection rate over 5000 simulations.

Real Examples
Two examples from published studies are provided with our two-stage test in this section. The first one was conducted by Venkatraman and Begg (1996). In the study, two techniques for diagnosing melanoma were compared. Seventy-two lesions were examined: 21 from a diseased group and 51 from a non-diseased group. A clinical scoring scheme without a dermoscope and a dermoscopic scoring scheme were used. The purpose of the analysis was to determine whether the dermoscope contributed to the diagnostic information gathered. The null hypothesis was that the dermoscope contributed no useful information. Further details and data can be found in Venkatraman and Begg (1996).
The test used by Venkatraman and Begg depended on a permutation randomization step. As was mentioned, the two diagnostic scores for each individual were derived from different systems and were not exchangeable, meaning that they needed a second randomization step to construct their test. As such, the test was very complex and computationally burdensome.
We applied our two-stage test and obtained the following results. We plotted the empirical ROC curves and the smoothed binormal ROC curves for both tests using the "pROC" package (Robin et al., 2011) of the R program, changing the coordinate labels to FPR vs. TPR (the original plot uses specificity vs. sensitivity) in Fig. 4. The empirical AUCs were 0.906 and 0.900 for Test X and Test Y, respectively, and the corresponding sAUCs were 0.437 and 0.374, respectively. The sample correlation between Test X and Test Y was 0.84. The two Z-statistics ($Z_1$ and $Z_2$) of our two-stage test were 0.148 and 1.041, respectively. The asymptotically estimated correlation coefficient $\hat\gamma$ between $Z_1$ and $Z_2$ was 0.54. Referring to Table 2, both the AUC and sAUC tests were nonsignificant at α = 0.1; therefore, the null hypothesis was not rejected, indicating that the dermoscope contributed no useful information to the diagnosis of melanoma, which was consistent with the previous results.
In Example 2, yeast data from the University of California, Irvine (UCI) machine learning repository (Asuncion and Newman, 2007) were used. The dataset contained 1484 samples, 8 features, and 10 classes. In order to best illustrate a two-class classification problem, we used the cytosolic or cytoskeletal (CYT) and mitochondrial (MIT) classes as one group and the other eight classes as the other group. Here, CYT and MIT were treated as the "healthy" group, and the other classes were treated as the diseased group. All eight attributes were real numbers. We studied the discrimination characteristics of two linear models. The first one was the linear discriminant analysis (LDA) classifier. The second was the single feature "alm," the score of the ALOM membrane spanning region prediction program (for the details of this dataset, please refer to the UCI machine learning repository Web site). We referred to this single feature model as SFM. We used the first two-thirds of the samples as a training set; the rest were the testing set. For the testing set, the empirical AUCs were 0.739 and 0.775 for SFM and LDA, respectively. Their empirical sAUCs were 0.294 and 0.174, respectively. The sample correlation between the two models was 0.64. When testing these two markers with our two-stage test procedure, the two Z-statistics were -1.703 and 5.013, respectively, and their estimated correlation coefficient was $\hat\gamma = 0.5$. This implies that their AUCs were not significantly different according to the critical values shown in Table 2; however, their sAUCs were significantly different at α = 0.01. This suggests that the single feature "alm" had a higher ROC curve than the full LDA classifier when the FPR was 0.26 or lower, as seen in Fig. 5.

DISCUSSION
The proposed two-stage test provides a powerful comparison of two correlated ROC curves for a broad range of scenarios, especially when comparing two crossed ROC curves with similar AUCs. This procedure consists of an AUC and sAUC test. We derived the asymptotic joint distribution for these two test statistics. Constructed using two nonparametric tests based on Z-statistics, our test outperforms a conventional nonparametric AUC test (DeLong et al., 1988) in terms of power; it outperforms a conventional permutation-based ROC test (Venkatraman and Begg, 1996) in terms of power and its computational efficiency. Furthermore, we have demonstrated that a single-stage test based on either the AUC or the mAUC was not powerful in certain situations. On the other hand, the proposed two-stage test displayed a high level of power in broad scenarios. If two binormal ROC curves were compared, our two-stage test would be equivalent to testing the identity of the entire range of ROC curves.
In this article, we use identical test sizes at each stage for simplicity. However, we conclude that different allocations of the test size at each stage may lead to different overall power, depending on how the two ROC curves are shaped. If there is prior information regarding the curves, it is possible to allocate the test size differently. Otherwise, equal allocation is the best option. Our definition of sAUC relies on a positive shift, δ, which is set to $\delta = 1.96\,\sigma_X$. For ordinal-scale data, this setting might be inappropriate. We will pursue this problem in the future.
We argue that the sAUC is not invariant to a monotonic transformation, such as logarithmic transformation, although it is invariant to location and/or scale transformation. Suppose that we compare two markers, where one is the logarithmic transformation of the other. Then, the two sAUCs could be different. In such cases, our two-stage test might increase Type I errors. This criticism could be alleviated if we note that such comparisons are rare in practice. In addition, since these two markers have exactly the same ROC curves, we could also avoid such comparisons by plotting the curves before performing the test.
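This invariance property can be seen on a small toy data set. The sketch below uses hypothetical scores and a hypothetical helper name. It compares the empirical AUC and sAUC on the raw scores, after a location-scale change, and after an exponential transformation (the inverse of the logarithmic case described above).

```python
from math import exp
from statistics import stdev


def sauc_hat(diseased, healthy, c=1.96):
    """Empirical sAUC with shift c * sample SD of the healthy scores (c = 0 gives the AUC)."""
    delta = c * stdev(healthy)
    wins = sum((y > x + delta) + 0.5 * (y == x + delta) for y in diseased for x in healthy)
    return wins / (len(diseased) * len(healthy))


healthy = [0.0, 0.5, 1.0, 1.5, 2.0]
diseased = [1.0, 2.0, 3.0, 4.0]


def affine(scores):
    """Location-scale change: 2 * score + 3."""
    return [2 * v + 3 for v in scores]


def exponential(scores):
    """Monotone but nonlinear change of scale."""
    return [exp(v) for v in scores]


auc_raw = sauc_hat(diseased, healthy, c=0.0)
auc_exp = sauc_hat(exponential(diseased), exponential(healthy), c=0.0)
sauc_raw = sauc_hat(diseased, healthy)
sauc_affine = sauc_hat(affine(diseased), affine(healthy))
sauc_exp = sauc_hat(exponential(diseased), exponential(healthy))
```

On these toy scores the AUC is unchanged by the exponential transformation and the sAUC is unchanged by the location-scale change, but the exponential transformation moves the sAUC, because the shift is tied to the standard deviation of the transformed non-diseased scores.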
It is well known that specifying a data-driven upper limit of the false-positive rate in pAUC is not trivial. However, if a pre-specified region of the ROC curves is of interest, a statistical test of the partial area under the ROC curve (pAUC) is also available for comparing two correlated ROC curves. In such cases, our test may not be as powerful as a pAUC-based test such as the method proposed in Wieand et al. (1989). However, for detecting differences between entire ROC curves in varying situations, the two-stage test could be more useful.