## Abstract

Comparison studies between physiological tests are often unsatisfactory for assessing their ability to distinguish between subjects. We recommend a simple but comprehensive protocol, using duplicate testing, that compares tests using *1*) the discriminant ratio (DR) between the underlying between- and within-subject SDs, *2*) correlation coefficients adjusted for attenuation due to test imprecision, and *3*) unbiased estimation of the underlying linear relationship between test results. The following five alternative methods for assessing glucose tolerance were compared: fasting plasma glucose (FPG) as a single sample or as the mean of three 5-min samples (FPG_{3}); the 1- and 2-h glucose during a low-dose intravenous glucose infusion (CIG); and the 2-h plasma glucose from a 75-g oral glucose tolerance test (OGTT). All tests had similar DRs ranging from 2.6 to 4.2. The adjusted correlation between FPG and CIG tests approached unity, and those between OGTT and other tests were ∼0.9, showing that FPG_{3} provides similar information to the OGTT. FPG concentrations of 6.0 and 7.1 mmol/l were found equivalent to the 1985 World Health Organization OGTT thresholds for impaired glucose tolerance and diabetes (7.8 and 11.1 mmol/l).

- plasma glucose
- precision

Complex physiological responses, such as glucose tolerance, may be assessed by many different methods. For instance, the control of plasma glucose may be measured by the fasting plasma glucose (FPG), by the response to an oral or an intravenous glucose challenge, by the percentage of glycated hemoglobin (HbA_{1c}), or by the plasma concentration of fructosamine. Different clinical tests are used to assess the same physiological variable because of different clinical circumstances, the application of advances in understanding and technique, or as a result of historical circumstances or fashion. The tests may assess differing or equivalent aspects of a physiological characteristic and may express their results in different units of measurement.

In this paper, we recommend a simple but comprehensive methodology for comparing different tests of a continuous physiological variable, which may be applied irrespective of the scales of measurements used. For such comparisons, it is important to include assessment of both the between- and within-subject variation of each test, as this allows comparison of the ability of tests to discriminate between different subjects, determination of the underlying correlation between tests after adjusting for attenuation due to within-subject variation, and unbiased estimation of the underlying relationships between the results of the different tests.

Omission of any of these components in a comparison study will provide inadequate information for choosing a particular test for a particular situation. The imprecision, or the within-subject variation, of a test is of little use by itself and must be considered in relation to the range of the test results. The “coefficient of variation,” which relates imprecision to the midpoint of the range, is often inappropriate as it does not take account of the dynamic range of a test (1) and is not always comparable between different scales of measurement. The present paper introduces the concept of “discrimination,” i.e., the ability to distinguish between individual subjects within a specified range of interest. This can be expressed as the discriminant ratio (DR), which is defined here as the ratio of the underlying between-subject SD (SD_{U}) to the within-subject SD (SD_{W}). The DR has a defined distribution, and DRs for different tests can be compared statistically.

The correlation between different tests must be included in any comparison. Test imprecision will diminish, or “attenuate,” the measured correlation coefficients in a well-described way (11), and it is important to correct for this so as to be able to assess the underlying “true” correlation, as this represents the degree to which the tests are assessing the same physiological trait. This attenuation adjustment can be expressed in terms of the DRs of the respective tests.

Finally, the comparison of tests must, if possible, relate measurements by one test to those by another. Where the relationship is linear, the gradient of the relationship obtained by least squares regression is underestimated (“regression dilution”) when both tests are subject to appreciable measurement error. An unbiased technique for deriving the relationship is therefore necessary, and, while many methods are available, a suitable method (27) is recommended here.

The assessment of glucose tolerance is a specific area where several methods are available to an investigator, for instance, by the measurement of steady-state plasma glucose or the response to a glucose challenge, either oral or intravenous. The standard test of glucose tolerance has, until recently, been the oral glucose tolerance test (OGTT; see Ref. 33), but, in practice, this test is not often performed. This is partly because of the inconvenience of a 2-h test and partly because of the marked variability of the OGTT. The poor reproducibility of the test (12, 21, 24, 26), which has a reported coefficient of variation of the order of 15–40%, due in part to the variable rate of gastric emptying, has predictable effects on reclassification of subjects on repeat tests, with a change in status on repeat testing in 30–60% of cases of impaired glucose tolerance (IGT; see Refs. 9, 13, 26, 28, 29) and predictable regression to the mean (26). The simple measurement of FPG has been suggested as a preferable measure (4, 8, 15, 20), and a continuous intravenous infusion of glucose (CIG) can assess glucose tolerance and give simultaneous measures of pancreatic β-cell function and insulin resistance (16). The FPG thresholds for diabetes have been set using outcome data (4). We evaluate and compare the performance of this and other tests throughout the physiological range.

This paper outlines the concepts and components of a physiological comparison study and uses as an example the assessment of glucose tolerance by *1*) the FPG (single sample or mean of 3 consecutive samples), *2*) the 1- and 2-h responses to a CIG, and *3*) the standard 2-h response to the OGTT, in repeated tests in 30 subjects spanning the range of glucose tolerance.

## METHODS

### Statistical Methods

This paper considers the following three aspects relating to the assessment and comparison of different tests for measuring an underlying physiological variable such as glucose tolerance: *1*) the ability of a test to discriminate between different subjects and comparison of discrimination between different tests; *2*) the correlation between pairs of tests, adjusting for bias due to within-subject variation. Such variation attenuates measured correlation coefficients so that they underestimate the underlying true correlation. This is important in assessing the degree to which different tests purporting to assess the same underlying trait differ with respect to systematic between-subject factors as opposed to random within-subject variation; and *3*) in cases where the relationship between a pair of tests appears to be linear, unbiased estimation of the underlying line of equivalence between them.

Each of these aspects is required for a comprehensive comparison study and is based on a combination of well-recognized and novel concepts. Our approach is considered on a conceptual level at this point, with a more rigorous statistical treatment reserved for appendixes i-iii.

*Discrimination between subjects*. All physiological measurements *1*) are subject to imprecision, which may derive from biological, sampling, and analytic sources and *2*) relate implicitly to variables taking on values within a particular range of interest. The performance of a particular physiological test will depend on the relationship between both of these characteristics. Absolute measurements of the imprecision of the test are only meaningful in relation to the range of values to which that test will be applied. The smaller the former is in relation to the latter, the greater is the ability of a test to discriminate between individual subjects. In the context of measurements being obtained from a series of individuals representing the physiological spectrum of interest, we propose a novel, simple index of discrimination, the DR, the ratio between the SD of the underlying subject means (SD_{U}), and the SD of repeated measurements on the same subject (SD_{W}).

The discrimination of a test is not a universal property but will relate to the spectrum of values in the population being studied. Hence absolute DR values are not comparable between different populations, but they are essential when comparing the practical application and performance of different tests in the same population.

#### Underlying SD_{U}.

Because a physiological test may be applied to a variety of possibly nonuniform populations, it is important to assess the test in relation to its expected range of application. Subjects in a comparison study should be chosen to represent and to span the range rather than be randomly selected from particular populations of interest, and this range is characterized statistically by the SD_{U}. The measured SD (SD_{B}) will overestimate the underlying SD_{U} due to the presence of within-subject variation, and it is important to adjust for this, using a standard formula, to yield an unbiased estimate of the SD_{U}.

#### SD_{W}.

To relate simply to the between-subject variation, we must be able to assume a common within-subject variation for all of the subjects in the study. This property is called “homoscedasticity” and can be checked by simple plots of the data (see the appendix). Lack of homoscedasticity can often be rectified by an appropriate numerical transformation of the test results.
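The effect of a transformation on homoscedasticity can be illustrated with a small sketch. The duplicate values below are invented for illustration (each second value is 10% above the first), and `difference_vs_mean` is a hypothetical helper name; it simply returns the points one would put on a difference vs. mean plot.

```python
import math

def difference_vs_mean(pairs):
    """Return (mean, first - second) for each pair of duplicate results."""
    return [((a + b) / 2.0, a - b) for a, b in pairs]

# Hypothetical duplicates in which the second result is 10% above the first,
# so the absolute difference grows in proportion to the level.
raw_pairs = [(5.0, 5.5), (8.0, 8.8), (12.0, 13.2)]

raw_points = difference_vs_mean(raw_pairs)
log_points = difference_vs_mean(
    [(math.log10(a), math.log10(b)) for a, b in raw_pairs]
)

for (m, d), (lm, ld) in zip(raw_points, log_points):
    print(f"raw mean {m:6.2f} diff {d:6.2f} | log mean {lm:5.3f} diff {ld:7.4f}")
```

On the raw scale the differences widen as the mean rises, whereas on the logarithmic scale they are constant, which is the pattern one hopes to see after a successful transformation.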

For homoscedastic data, the common within-subject variance is simply the mean of the individual within-subject variances.

*DR*. As outlined above, the DR is defined as the ratio SD_{U}/SD_{W}. In a comparison study where *k* replicate measurements are performed in each subject, the measured SD_{B} is calculated as the SD of the subject mean values (calculated from the *k* replicates). The standard mathematical adjustment to yield SD_{U} is

$$\mathrm{SD}_U = \sqrt{\mathrm{SD}_B^{2} - \frac{\mathrm{SD}_W^{2}}{k}}$$

so that the DR is calculated simply as

$$\mathrm{DR} = \frac{\mathrm{SD}_U}{\mathrm{SD}_W} = \sqrt{\frac{\mathrm{SD}_B^{2}}{\mathrm{SD}_W^{2}} - \frac{1}{k}}$$

This result may also be obtained by an analysis of variance (ANOVA) approach, using a fixed effects model, and this is presented in the appendix. The appropriate equation is then

$$\mathrm{DR} = \sqrt{\frac{\mathrm{MS}_B - \mathrm{MS}_W}{k\,\mathrm{MS}_W}}$$

where MS_{B} is the between-subject mean square, MS_{W} is the within-subject mean square, and *k* is the number of replicate tests in each subject, as above.
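The calculation described above can be sketched in a few lines. This is an illustrative sketch, not the authors' software: the function names and the example duplicates are invented, a balanced design (the same *k* for every subject) is assumed, and SD_{W}^{2} is taken as the mean of the individual within-subject variances, as stated for homoscedastic data.

```python
import math
from statistics import mean, variance

def discriminant_ratio(data):
    """data: one list of k replicate results per subject (balanced design)."""
    k = len(data[0])
    if k < 2 or any(len(row) != k for row in data):
        raise ValueError("need the same number (>= 2) of replicates per subject")
    subject_means = [mean(row) for row in data]
    var_b = variance(subject_means)                # measured SD_B squared
    var_w = mean([variance(row) for row in data])  # pooled SD_W squared
    var_u = var_b - var_w / k                      # adjusted SD_U squared
    if var_u <= 0:
        raise ValueError("between-subject variation not distinguishable from noise")
    return math.sqrt(var_u), math.sqrt(var_w), math.sqrt(var_u / var_w)

def discriminant_ratio_anova(data):
    """Same quantity from the one-way ANOVA mean squares."""
    n, k = len(data), len(data[0])
    grand = mean([x for row in data for x in row])
    ms_b = k * sum((mean(row) - grand) ** 2 for row in data) / (n - 1)
    ms_w = sum((x - mean(row)) ** 2 for row in data for x in row) / (n * (k - 1))
    return math.sqrt((ms_b - ms_w) / (k * ms_w))

# Hypothetical duplicate results (mmol/l) in four subjects.
example = [[5.0, 5.2], [6.1, 5.9], [7.4, 7.8], [9.0, 8.6]]
sd_u, sd_w, dr = discriminant_ratio(example)
print(sd_u, sd_w, dr, discriminant_ratio_anova(example))
```

For balanced data the two routes agree, because MS_{B} equals *k* times the variance of the subject means and MS_{W} equals the pooled within-subject variance.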

Assuming that the within-subject variation is normally distributed, we have used an analytical approach to derive confidence limits for estimated DR values and to test for the significance of differences between the DRs of different tests. We present these in the appendix, where we also discuss our methodology in relation to alternative measures of test “reliability,” in particular the intraclass correlation coefficient (ICC; see Refs. 5, 30).

*Correlation between pairs of tests*. Two tests designed to assess a complex physiological characteristic, such as degree of glycemic control, may use different methodologies, neither of which may perfectly represent the characteristic in question. The results of the tests may differ in systematic ways, independently from their random within-subject variation. The degree to which the tests measure the same characteristic may be assessed by the correlation between their results in a set of subjects. In the absence of within-subject variation, the degree to which their correlation falls short of unity would represent the extent to which the tests either fail to measure precisely the same characteristic or measure aspects of the underlying characteristic that are differentially influenced by other factors that vary between subjects. It is this underlying true correlation that we are interested in here.

Test imprecision, however, further attenuates the observed correlation so that imperfect correlation will usually be due to a combination of the systematic between-subject factors discussed above and the presence of within-subject variation. A study comparing tests must distinguish between these two components, and this can be achieved using a standard formula that corrects measured correlations for attenuation (11). Such a correction requires knowledge of both the within- and between-subject variation of the measured values. Because the DR incorporates both of these elements, the correction can be expressed in terms of the DRs of the two tests (see the appendix). The corrected correlation coefficient, therefore, represents the degree to which the tests represent the same physiological characteristic, independent of test imprecision.
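A minimal sketch of such a correction follows. It assumes the classical attenuation formula (observed correlation divided by the square root of the product of the two reliabilities) and assumes that the reliability of a mean of *k* replicates can be written as DR²/(DR² + 1/*k*). The function name and the `k` argument are illustrative, and the authors' own equation may be written in a different form.

```python
import math

def adjust_for_attenuation(r_observed, dr_a, dr_b, k=1):
    """Correct an observed correlation between two tests for imprecision.

    r_observed : correlation between the subject means of tests A and B
    dr_a, dr_b : discriminant ratios of the two tests
    k          : number of replicates averaged per subject (assumed equal)
    """
    if dr_a <= 0 or dr_b <= 0 or k < 1:
        raise ValueError("DRs must be positive and k at least 1")
    # Reliability of a k-replicate mean: SD_U^2 / (SD_U^2 + SD_W^2 / k),
    # rewritten in terms of DR = SD_U / SD_W.
    reliability_a = dr_a ** 2 / (dr_a ** 2 + 1.0 / k)
    reliability_b = dr_b ** 2 / (dr_b ** 2 + 1.0 / k)
    return r_observed / math.sqrt(reliability_a * reliability_b)

# Hypothetical inputs: an observed correlation of 0.85 between duplicate means
# of two tests with DRs of 3.0 and 4.0.
print(adjust_for_attenuation(0.85, 3.0, 4.0, k=2))
```

Because the correction divides by a quantity below one, the adjusted value always exceeds the observed one in magnitude, and with sampling error it can occasionally exceed unity.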

*Unbiased estimation of a linear relationship*. Where two tests represent the same characteristic, their results will often be found to be linearly related (although this may require numerical transformation). The determination of this relationship will be important in relating the results of one test to those of the other. Within-subject variability will lead to a noisy relationship, and a statistical approach is necessary to estimate the underlying linear equation. Although least-squares linear regression is often used for this, it is not appropriate here since it assumes that the explanatory variable is free from noise. When this is not the case, linear regression will underestimate the slope of the equation, a well-recognized effect termed regression dilution.

There is no perfect method of estimating the true relationship in these circumstances, but the method chosen here is that of least perpendicular distances corrected for scale differences, which has been shown to perform well in relation to other methods (27). The equations for this are presented in the appendix.
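To make regression dilution concrete, the sketch below contrasts ordinary least squares with Deming regression, a related errors-in-variables estimator. Deming regression is used here only as a stand-in and is not claimed to be the method of Ref. 27; the data are invented, and `error_variance_ratio` (variance of the errors in *y* divided by variance of the errors in *x*) is an assumed input.

```python
import math
from statistics import mean

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (treats x as error free)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def deming_line(x, y, error_variance_ratio=1.0):
    """Deming regression: returns (slope, intercept) allowing error in both axes."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxy == 0:
        raise ValueError("x and y are uncorrelated; slope is undefined")
    d = error_variance_ratio
    slope = (syy - d * sxx + math.sqrt((syy - d * sxx) ** 2 + 4 * d * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx

# Invented data: true values 1..6, true relationship y = 2x,
# errors of +/-0.5 added to x and +/-1 added to y (variance ratio 4).
x = [1.5, 1.5, 2.5, 4.5, 5.5, 5.5]
y = [1.0, 5.0, 7.0, 7.0, 11.0, 11.0]

print("OLS slope:", ols_slope(x, y))
print("Deming slope, intercept:", deming_line(x, y, error_variance_ratio=4.0))
```

In this toy dataset the least-squares slope is about 1.78, below the true value of 2, whereas the errors-in-variables estimate recovers a slope of 2 and an intercept of 0.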

*Design of comparison studies*. There is no single measure that can be used to compare physiological tests. Assessment and comparison of test discrimination and the determination of their underlying correlations and inter-relationships are equally important components. These all require simultaneous consideration of both between-subject and within-subject variation within the physiological range of interest. A comparison study should, ideally, assess both of these factors by using replicate tests in subjects chosen to span that range.

### Experimental Protocols

*Subjects*. Thirty white Caucasian subjects were studied, consisting of 10 normoglycemic subjects, 9 subjects with IGT, and 11 with type II diabetes according to 1985 World Health Organization (WHO) definitions (33). All subjects were on a weight-maintaining diet and had not changed their medication for 4 wk before the tests. Subject characteristics are presented in Table 1 by glucose tolerance group.

*Protocols*. Each subject was studied on four occasions within a 6-wk period. After a 12-h fast, subjects went to the hospital and sat on a bed for the duration of the tests. Tests were each performed on two occasions in the same subject and in random order.

*FPG and CIG*. Two cannulas were inserted in the same arm. One, for blood sampling, was placed at the wrist or on the dorsum of the hand, which was heated by an electrical blanket to “arterialize” the venous blood. The other cannula, for infusion of glucose, was placed in an antecubital vein. A blood sample was taken at *time −10 min*, and the plasma glucose concentration at this single time point was termed FPG_{1}. Blood samples were also taken at *times −5* and *0 min*, and the mean of the plasma glucose at the three time points was termed FPG_{3}. At *time 0*, a continuous 5 mg ⋅ kg ideal body wt^{−1} ⋅ min^{−1} infusion (22) of 10% glucose was started and continued for 2 h. One-hour CIG and 2-h CIG glucose were defined as the means of the three plasma glucose concentrations in blood samples taken at 50, 55, and 60 min and at 110, 115, and 120 min, respectively.

*OGTT*. A single cannula was placed in an antecubital vein for blood sampling. Fasting blood samples were taken at −10, −5, and 0 min. At 0 min, the subject consumed a 75-g glucose drink, and blood samples were taken at 30, 60, 90, and 120 min.

### Biochemical Assays

Plasma glucose was determined by a hexokinase-based method (Boehringer Mannheim UK, Lewes, UK) on a centrifugal COBAS MIRA autoanalyzer (Roche, Welwyn Garden City, UK).

## RESULTS

### Estimates of Glucose Tolerance

Values for glucose tolerance using FPG_{1}, FPG_{3}, 1-h CIG, 2-h CIG, and 2-h OGTT are presented in Table 2 as median and ranges. Fasting and CIG measures were homoscedastic, and the within- and between-subject variations are illustrated as plots of difference (first test − second test) vs. mean (of the 2 tests) in Fig. 1. The 2-h OGTT was found to have within-subject variation increasing with mean values, and this was corrected by log transformation, with the transformed data presented as difference vs. mean plots in Fig. 2 and as medians and ranges for the whole group in Table 2. The underlying SD_{U} and SD_{W} and the DRs of the tests are also presented in Table 2.

DR values for the five measures, together with the one SE range of the estimates, are illustrated in Fig. 3. Although the lowest DR was that of the FPG_{1} and the highest that of the 2-h CIG, there was no significant difference between them on the overall statistical test (test statistic = 6.2, *P* = 0.19, using *Eq. 10* in the appendix). Consideration was given to the exclusion of a subject whose inter-test difference in the 2-h CIG was four SDs from the mean of the rest of the group (see Fig. 1). In the absence of an identifiable reason for the large difference between his two test values, this subject was included in the analyses presented here, although, if he were excluded, the DR of the 2-h CIG would rise to 6.1, significantly greater than the DRs of the other tests.

Table 3 shows the Pearson correlations between the tests, both before and after adjustment for attenuation. The correlations were calculated between subject means of duplicate tests, to give the best estimate of the underlying relationships between tests. Adjusted correlations between the fasting and both of the intravenous measures were high, approaching one. Those between the 2-h OGTT results and the other tests were somewhat lower (∼0.9), indicating that there was some biological discordance in their relationship, independent of their within-subject measurement error.

Figure 4 shows the scattergram between the 2-h plasma OGTT (on a logarithmic scale) and the FPG. It also illustrates the line of equivalence, derived as explained above, and the linear regression line of the 2-h OGTT on FPG. The dilution effect of the within-subject variation on the regression line can be seen clearly. Table 4 gives coefficients for the unbiased linear equations relating the test values, and Table 5 gives the points on the various scales that are equivalent to the 1985 WHO OGTT thresholds for IGT and diabetes.

SD_{W} of the logarithm of the 2-h OGTT may be interpreted in relation to a standard interval in the range of glucose tolerance, for instance, the interval between the thresholds for IGT (7.8 mmol/l) and for diabetes (11.1 mmol/l). This interval is 0.153 on a logarithmic scale (*base 10*), whereas the SD_{W} is 0.060. The difference between two individual measurements at either end of this interval would not be significant at the 5% level (*P* = 0.07), given the imprecision of the 2-h OGTT. In other words, it would not be possible to confidently distinguish between two such individual values. The same applies to all other tests examined, using the unbiased equivalent values to these thresholds given in Table 5, apart from the 2-h CIG plasma glucose, for which the two measurements at opposite ends of the equivalent interval would differ significantly (*P* = 0.013).
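The arithmetic behind the OGTT comparison can be retraced as follows. The sketch assumes that two independent single measurements each carry error SD_{W}, so that their difference has SD √2 × SD_{W}, and it uses a normal approximation for the two-sided *P* value; the authors' exact test may differ (for example, by using a *t* distribution). The function name is illustrative.

```python
import math

def two_sided_p_for_difference(value_a, value_b, sd_within):
    """Normal-approximation test of the difference between two single results."""
    sd_difference = math.sqrt(2.0) * sd_within   # SD of a difference of two results
    z = abs(value_a - value_b) / sd_difference
    p = math.erfc(z / math.sqrt(2.0))            # two-sided tail probability
    return z, p

# WHO thresholds for IGT and diabetes (mmol/l) on a log10 scale, SD_W = 0.060.
log_igt = math.log10(7.8)
log_diabetes = math.log10(11.1)
interval = log_diabetes - log_igt
z, p = two_sided_p_for_difference(log_diabetes, log_igt, 0.060)
print(f"interval = {interval:.3f}, z = {z:.2f}, P = {p:.3f}")
```

Under these assumptions the interval comes out at about 0.153 log units and the *P* value at about 0.07, in line with the figures quoted in the text.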

## DISCUSSION

This study has shown that different methods of assessing glucose tolerance were broadly comparable in a range of subjects spanning normal glucose tolerance, IGT, and type II diabetes. This assessment involved the following three separate components: *1*) comparison of the discrimination of the tests, i.e., their ability to distinguish between different subjects, *2*) determination of the degree to which different tests measure the same underlying physiological property, and *3*) estimation of the underlying relationship between the test results. The assessment of the within-subject imprecision of each test is a fundamental requirement for this evaluation, so that comparison studies must involve at least duplicate measurements in all subjects in order to determine both the between- and within-subject measurement variation of each test in the same group of individuals. This is best done in subjects who represent a clinically meaningful range of glucose tolerance. To ascribe a single value to within-subject imprecision requires homoscedasticity, and numerical transformation of results may be necessary to achieve this, as illustrated by the 2-h plasma glucose from the OGTT. A measure of imprecision is important for the assessment of changes within subjects, such as over time or after interventions (10).

The determination of imprecision alone is not adequate, however, for the assessment of the practical value of a test. The standard methods of assessing imprecision, including the coefficient of variation, have little meaning on their own, without reference to the range of measurements to which they are being applied. The ability to distinguish between individuals within this range is here termed the discrimination of the test and is assessed using the DR.

The concept of discrimination should be distinguished from the ability of a test to categorize patients by an external gold-standard dichotomy. This is a particular concern in the field of clinical chemistry, for instance, when a biochemical test is being assessed for its ability to detect the presence of a malignancy. Receiver operating characteristics (34) have been used for this purpose. However, they are not suitable for the assessment of test results as a continuous scale of measurement or for the comparison of tests without reference to an external categorization. In the field of glucose tolerance, categories have been defined on the basis of thresholds in a continuous scale of measurement for the OGTT based on external criteria, but this is a notoriously imprecise test (7), and it would not therefore be appropriate to assess possible alternative tests using a categorical approach based on these thresholds. The use of the DR provides a means of comparing how well the subjects studied can be reliably distinguished by different tests, which is an important component of a comprehensive comparison of imprecise tests. For instance, in many research studies using continuous variables, the statistical power to distinguish between groups of subjects or to determine correlations between variables will depend on discrimination.

A very similar concept, referred to as reliability, has been assessed, particularly in the psychological literature, using the ICC. This relates the covariance of replicate test results to the combined between-subject and within-subject variance and is algebraically related to the DR. However, in the context of discriminating between different subjects, it is not as easy to grasp as the DR, which has a direct intuitive relevance when considering a test’s application. Furthermore, it is not as easy to derive statistical tests comparing different ICC values. A recently described method for the comparison of two ICCs does not extend to the comparison of more than two tests and was only validated for studies employing repeated tests in 100 subjects or more (2).

The choice of a methodology for comparing different tests is intimately related to its theoretical basis and in particular to the availability of statistical criteria for assessing such a comparison. Much of the theoretical discussion of the ICC and treatments of comparative tests between ICCs have been based on random effects models in which the individuals on whom the tests are being performed are assumed to be drawn at random from a known normally distributed population. This presupposes a particular population in which the tests are being applied. Complex populations (for instance those consisting of subgroups) would require appropriately complex statistical models. Unfortunately, even the simplest random effects models are hard to treat analytically, as in the case of the ICC quoted above. Our approach has been to concentrate on evaluating tests across a particular physiological spectrum of interest. In the example presented here, for instance, glucose tolerance represents a physiological and pathological unity irrespective of its distribution in particular populations. For this purpose, we take the view that it is appropriate to perform a comparison study in subjects selected to span the range of interest, which may be analyzed by a fixed-effects statistical model. Such an analysis is presented here for the DR, allowing the derivation of straightforward expressions for the SD and confidence intervals of the DR and the evaluation of the statistical significance of differences between the DRs of different tests.

Although the comparison of the DRs of different tests in the same study is valid, the DR in itself is not a universal characteristic of a test, as it will depend on the choice of subjects on which the comparison is performed. When the subjects cover a greater range of values, the DR will be larger, and vice versa, and the DR calculated when subjects are selected to span a range of interest will generally be larger than if subjects had been chosen randomly from the same population. However, it is a fundamental property when it comes to a test’s practical application, and, unlike imprecision, it can be used as a basis for comparison between tests. The DRs of the five tests examined here were not significantly different, in spite of the increased complexity and expense of the CIG and the OGTT.

Two perfectly precise tests assessing the same physiological variable would be perfectly correlated. Departures from perfect correlation can be due to the following two factors: *1*) underlying differences not directly related to the variable of interest. These will manifest themselves as systematic differences between subjects; for instance, in the assessment of glycemic control, which is determined by the FPG and OGTT in qualitatively different ways, the underlying correlation may fall short of unity because of the influence of factors such as the effect of gastric emptying and the influence of intestinal incretin effects that differ between the two methodological approaches; *2*) diminution of the underlying correlation may also arise from within-subject variation, and this is a well-described statistical effect termed “attenuation,” which may be adjusted for by standard techniques to enable the estimation of the underlying correlation (11). This adjustment depends on both test imprecision and the degree of variation between subjects and so can be expressed in terms of the test DRs.

Adjustment for attenuation will establish the degree to which the underlying correlation differs from unity due to the factors detailed in *factor 1* above.

In this paper, the adjusted correlation coefficients between the fasting glucose and the CIG approached unity and were only slightly lower between these tests and the 2-h OGTT. Although additional factors unrelated to the homeostatic control of plasma glucose, such as variable gastric emptying, would contribute to this, the relatively high overall intercorrelations and the simplicity and cheapness of the FPG would recommend this as the measure of choice for the assessment of glucose tolerance.

When two tests have an underlying structural relationship between their measurements (or transformations of these) that is linear, it can be instructive to determine the equation of the “line of equivalence.” Linear regression, although often used, is unsatisfactory since it assumes perfect precision in the independent variable and is subject to regression dilution. There have been many approaches to deriving an unbiased estimation, as comprehensively reviewed by Riggs et al. (27), and the “weighted least squared perpendicular distance” approach (Riggs’ “PW” method) has been used in this paper. In the present study, the assumption of linearity, within the limits set by the imprecision of the tests, was supported by visual inspection of plots of the relationships (data not shown). The FPG threshold concentrations recommended by the American Diabetes Association (4), based on studies of the prevalence of retinopathy in three distinct populations, were confirmed as equivalent to the established OGTT thresholds for IGT and diabetes.

We also calculated the SD of log_{e}(DR) estimates for different numbers of replicate tests, using the Taylor series expansion, subject to a constraint on the total number of tests performed. For a study comparing two methods using a total of 60 tests, the power to detect a difference between DRs of 2.5 and 4.0 is 58% using 30 subjects and two replicate tests, rising to 72% with 20 subjects and three tests, a relative increase in power of 24%. Further increasing the number of replicates gives smaller increases in power, e.g., for four tests in 15 subjects the power is 78%, with even smaller gains for more than four replicates. There would appear to be some advantage in using three, rather than two, replicate tests in each subject, but there is little advantage in increasing the number beyond three given the need to obtain sufficient subjects to adequately cover the range of interest.

This study showed that, with the OGTT done under carefully controlled conditions, with a reproducibility that is somewhat better than reported elsewhere, it was not possible to distinguish between the WHO thresholds for IGT and diabetes at a 5% significance level, and this was also the case for FPG, even when the mean of three samples at 5-min intervals was assessed (the between-sample variation being relatively small in relation to the between-day variation). It is therefore not surprising, with two thresholds close together, that repeat measurements often give a change of status. Improved classification could be achieved by taking the mean of determinations on more than one day. However, although these classifications may be useful for epidemiological purposes, for practical purposes the actual OGTT or FPG value is more informative (31, 32).

In summary, we have outlined a comprehensive but simple methodology for the comparison of imprecise tests, encouraging *1*) comparison of test discrimination, expressed as the DR, *2*) the evaluation of the degree of agreement between tests based on correlation coefficients adjusted for attenuation, and *3*) in the case of a linear relationship between test results (or their mathematical transformations), the use of an unbiased method for estimating the underlying equation. For such a comparison study, it is important to determine the within-subject variation of each test as well as the variation between subjects. Application of these methods to various tests of glucose tolerance demonstrated similar discrimination, acceptable agreement, and an unbiased estimation of the FPG values equivalent to those of the 2-h OGTT. The latter agree closely with the outcome-derived thresholds currently being recommended by the American Diabetes Association. However, because the thresholds for IGT and diabetes are within measurement error and cannot be reliably distinguished, the absolute 2-h OGTT or FPG is more informative than the categorization.

## Acknowledgments

We are grateful for the assistance from Dr. Sue Manley and Nuala Walravens.

## STATISTICAL METHODS

This section presents a more detailed mathematical treatment of the concepts outlined in methods.

### Discrimination Between Subjects

*Statistical model*. We consider the comparison of different tests, each measuring the same physiological variable. Each test is performed *k* times on each of *n* subjects, with the order of the tests being randomized for each subject.

Considering first a single test in isolation, an appropriate model is

$$X_{ij} = \mu + \alpha_{i} + \varepsilon_{ij} \tag{1}$$

where *X*_{ij} is the result of the test performed for the *j*’th time on the *i*’th subject, μ is the overall mean value of the variable in question on the scale of the current test, and α_{i} is the true value of the *i*’th subject, measured as a deviation from the mean (thus Σ_{i=1,n} α_{i} = 0); ε_{ij} represents day-to-day variation, which includes both biological and assay variation; the ε_{ij} are assumed to be independent, normally distributed random variables with mean zero and variance σ^{2}.

*Equation 1* is a standard one-way ANOVA model.

The assumption of constant variance (or homoscedasticity) of the error term, σ^{2}, can be checked graphically. If *k* ≥ 5, we can calculate the quartiles of the *k* replicate test results for each subject and plot log(interquartile range) against log(median) (14). If 2 < *k* < 5, plot the SD of the *k* replicates against the mean for each subject (23), and if *k* = 2, plot the differences (1st − 2nd replicate) between the pairs of tests against the subject means (6). If the assumption of homoscedasticity holds, the plotted measure of variation [log(interquartile range), SD, or difference] should be approximately constant across the range of subjects. If there appears to be a systematic relationship between the measure of variation and subject medians or means, this can often be removed by mathematically transforming the results *X*_{ij}. A common case in physiological measurements is where the SD increases in direct proportion to the mean, when a log transformation of the *X*_{ij} stabilizes the variance and log(*X*_{ij}) can then be used in place of *X*_{ij} in *Eq. 1*. Other transformations can be considered for different relationships between the subject SDs and means (14, 23).
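As a minimal sketch of the graphical check described above, the following Python function returns the points one would plot when 2 ≤ *k* < 5: each subject's mean against either the difference between the pair of results (*k* = 2) or the SD of the replicates. The function name `spread_vs_level` and the toy values are hypothetical and are not taken from the study.

```python
import statistics

def spread_vs_level(results):
    """results: one list of k replicate values per subject (2 <= k < 5)."""
    points = []
    for reps in results:
        level = statistics.mean(reps)
        if len(reps) == 2:
            spread = reps[0] - reps[1]       # 1st - 2nd replicate
        else:
            spread = statistics.stdev(reps)  # sample SD of the replicates
        points.append((level, spread))
    return points

# Hypothetical duplicate results for three subjects (illustrative only).
demo = spread_vs_level([[5.0, 5.2], [7.0, 6.6], [11.0, 11.9]])
```

If the spread grows with the level in such a plot, the same function can be reapplied to the log-transformed results to see whether the trend disappears.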

It is also possible to check the assumption that the ε_{ij} have a normal distribution by plotting the ordered residuals from fitting *Eq. 1* against standard normal deviates in a “normal probability plot” (3). However, the ANOVA procedures used here are fairly robust to moderate departures from normal distribution and can be used without such sophisticated checking, provided homoscedasticity of variance holds and the data do not exhibit marked skewness.

In these experiments, subjects are selected to span a range of glucose tolerance and are not chosen randomly from a prespecified population. The subject effects α_{i} are therefore considered as “fixed” rather than “random” effects.

*DR*. As a measure of discrimination between subjects, we define the true DR, Δ, as the ratio of the underlying SD_{B} to the SD_{W}

$$\Delta = \frac{\mathrm{SD_B}}{\mathrm{SD_W}} = \frac{\sqrt{\sum_{i=1}^{n}\alpha_{i}^{2}/(n-1)}}{\sigma} \tag{2}$$

Unbiased estimates of the between- and within-subject variances are given by (MS_{B} − MS_{W})/*k* and MS_{W}, respectively, where MS_{B} and MS_{W} are the between- and within-subject mean squares from a standard one-way ANOVA, i.e.

$$\mathrm{MS_B} = \frac{k\sum_{i=1}^{n}(M_{i}-M)^{2}}{n-1} \tag{3}$$

$$\mathrm{MS_W} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{k}(X_{ij}-M_{i})^{2}}{n(k-1)} \tag{4}$$

and M_{i} = Σ_{j=1,k} *X*_{ij}/*k* and M = Σ_{i=1,n} M_{i}/*n* are the subject and overall means.

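We then estimate Δ empirically as the ratio of the between- to within-subject standard deviations

$$\mathrm{DR} = \sqrt{\frac{\mathrm{MS_B}-\mathrm{MS_W}}{k\,\mathrm{MS_W}}} \tag{5}$$

The DR is algebraically related to the ICC, which is commonly used as a measure of the reliability of tests (5, 30)

$$\mathrm{ICC} = \frac{\mathrm{DR}^{2}}{1+\mathrm{DR}^{2}} \tag{6}$$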
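The calculation of the DR from duplicate (or replicate) results can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `discriminant_ratio` and the toy data are hypothetical, the mean squares are the usual one-way ANOVA quantities, and the ICC is obtained from the standard identity ICC = DR^{2}/(1 + DR^{2}).

```python
import math

def discriminant_ratio(x):
    """x: list of n subjects, each a list of k replicate results."""
    n, k = len(x), len(x[0])
    subject_means = [sum(row) / k for row in x]
    grand_mean = sum(subject_means) / n
    # Between- and within-subject mean squares from a one-way ANOVA.
    ms_b = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_w = sum((v - m) ** 2
               for row, m in zip(x, subject_means) for v in row) / (n * (k - 1))
    dr = math.sqrt((ms_b - ms_w) / (k * ms_w))  # requires MS_B > MS_W
    icc = dr ** 2 / (1 + dr ** 2)
    return ms_b, ms_w, dr, icc

# Hypothetical duplicate results for three subjects (illustrative only).
toy = [[4.0, 6.0], [9.0, 11.0], [14.0, 16.0]]
ms_b, ms_w, dr, icc = discriminant_ratio(toy)
```

For these toy data MS_{B} = 50 and MS_{W} = 2, so the DR is √12 ≈ 3.46. The sketch does not handle the case MS_{B} ≤ MS_{W}, in which the estimated between-subject variance is not positive and the DR is undefined.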

However, the methodology developed for ICCs is in the context of a random effects model, rather than the fixed effects model used here, so published results for SDs and confidence intervals cannot be used. The DR gives a measure that is intuitively closer to the idea of discrimination between subjects, whereas the ICC is a measure of correlation. In addition, for tests with good discrimination, ICC values tend to cluster unhelpfully close to their upper limit of one. Furthermore, there is no simple practicable test available for the comparison of ICCs from different tests in a random effects model. A recently described method for the comparison of two ICCs does not extend to the comparison of more than two tests and was only validated for studies employing repeated tests in 100 subjects or more (2). We have derived straightforward expressions for the SD and confidence intervals of the DR in a fixed effects model and a test for the equivalence of DRs in a comparison study.

*Confidence limits for DR*. Confidence limits for the DR can be found by noting that

$$\mathrm{DR} = \sqrt{\frac{F_{0}-1}{k}} \tag{7}$$

where *F*_{0} = MS_{B}/MS_{W} is the standard *F* statistic from the one-way ANOVA. *F*_{0} has a noncentral *F* distribution with degrees of freedom ν_{1} = *n* − 1 and ν_{2} = *n* × (*k* − 1) and noncentrality parameter

$$\lambda = (n-1)\,k\,\Delta^{2}$$

λ can be estimated by (*n* − 1) × *k* × DR^{2}, and a 95% confidence interval for Δ is

$$\sqrt{\frac{F_{L}-1}{k}} \le \Delta \le \sqrt{\frac{F_{U}-1}{k}} \tag{8}$$

where *F*_{L} and *F*_{U} are the lower and upper 2.5% points of the noncentral *F* distribution (17).

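Noncentral *F* tables are not widely available (18), and a reliable approximation to *F*_{L} and *F*_{U} can be made using a central *F* distribution (25)

$$F_{L} \approx \left(1+\frac{\lambda}{\nu_{1}}\right)F'_{L}, \qquad F_{U} \approx \left(1+\frac{\lambda}{\nu_{1}}\right)F'_{U}$$

where $F'_{L}$ and $F'_{U}$ are the lower and upper 2.5% points of a central *F*_{ν,ν_{2}} distribution and

$$\nu = \frac{(\nu_{1}+\lambda)^{2}}{\nu_{1}+2\lambda}$$
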

*Comparison of DRs*. We derived a test for the equivalence of several DRs by assuming the following model

$$X_{ijh} = \mu_{h} + \alpha_{ih} + \varepsilon_{ijh} \tag{9}$$

*X*_{ijh} is now the result of the *h*’th test performed for the *j*’th time on the *i*’th subject, μ_{h} is the mean value of the variable in question on the scale of test *h*, and α_{ih} is the true value of the *i*’th subject measured using test *h* (Σ_{i=1,n} α_{ih} = 0 for each test *h*); ε_{ijh} are again assumed to be independent, normally distributed random variables with zero mean and variance σ_{h}^{2}.

Using this model, the DRs for each test are statistically independent. We used simulations (see SIMULATIONS) to show that the distribution of log_{e}(DR) is approximately normal if the model assumptions hold. We then used Cochran’s theorem (19) to show that, for *r* tests, the statistic *Q* has a χ^{2} distribution with *r* − 1 degrees of freedom, where

$$Q = \sum_{h=1}^{r}\frac{(L_{h}-\bar{L})^{2}}{s_{h}^{2}}, \qquad L_{h} = \log_{e}(\mathrm{DR}_{h}), \qquad \bar{L} = \frac{\sum_{h=1}^{r} L_{h}/s_{h}^{2}}{\sum_{h=1}^{r} 1/s_{h}^{2}} \tag{10}$$

The DRs are unequal at a significance level of 0.05 if *Q* exceeds the 95th percentile of the χ^{2} distribution with *r* − 1 degrees of freedom.

We derived an expression for s_{h}, the estimated SD of L_{h}, from the mean and variance of the noncentral *F* distribution using a Taylor series expansion; details are given in SD OF LOG_{E}(DR) USING TAYLOR SERIES APPROXIMATION.
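A test of this kind can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: it assumes the usual inverse-variance weighted form of Cochran's statistic applied to L_{h} = log_{e}(DR_{h}), the function name `cochran_q` is hypothetical, and the SDs in the example are invented for illustration (only the DR values 2.6 and 4.2 echo the range quoted in the abstract). The tabulated values are the standard upper 5% points of the χ^{2} distribution.

```python
import math

# Upper 5% points of the chi-square distribution for 1-4 degrees of freedom.
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def cochran_q(drs, sds):
    """drs: DR of each test; sds: estimated SD of log_e(DR) for each test."""
    logs = [math.log(d) for d in drs]
    weights = [1.0 / s ** 2 for s in sds]           # inverse-variance weights
    pooled = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
    q = sum(w * (l - pooled) ** 2 for w, l in zip(weights, logs))
    return q, len(drs) - 1                          # statistic, degrees of freedom

# Three tests with hypothetical SDs of log_e(DR).
q, df = cochran_q([2.6, 3.0, 4.2], [0.20, 0.20, 0.20])
reject = q > CHI2_95[df]
```

With these illustrative inputs the statistic falls below the 5% critical value for 2 degrees of freedom, so equality of the DRs would not be rejected.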

*Alternative models*. The models we have used, given by *Eqs. 1* and *9*, for observations from the comparison study have been deliberately chosen for their relative simplicity. Although some of the algebra is intricate, all of the calculations we have presented can be easily implemented using spreadsheet software and do not require the use of specialized statistical packages. However, some of our model assumptions do warrant further discussion.

First, the choice of a fixed rather than a random effects model is unusual in this kind of context. However, subject selection in our study is clearly nonrandom in that we have deliberately chosen roughly equal numbers of normal glucose tolerance, IGT, and diabetic subjects. Even within each of these subpopulations, sampling is unlikely to be random as subjects are sought to span the range of interest as evenly as possible, which is likely to result in oversampling from the extremes of the distribution. Such a sampling scheme is likely to produce a DR that is higher than that which would be obtained from a random sample from the population, and its use is restricted to comparison with other tests in the same study. It is not appropriate to formally compare test DRs that have been derived from different populations.

A population consisting of clearly defined subpopulations might be best treated using a mixed model, or even structural equation modelling. This would require more sophisticated analytical techniques and a larger scale of comparison study than we have presented in this paper. In the particular example presented here, however, the “subgroups” are not clearly separable but are arbitrarily defined by thresholds in a continuous spectrum. In this situation, which is relatively common in physiology, the approach taken here would be adequate, relatively simple, and practical. A formal comparison of the use of more complex models and the simple approach made here is beyond the scope of this paper.

The assumption of independence of the error terms ε_{ijh} is unlikely to be completely true since most biological measurements exhibit some degree of “autocorrelation,” i.e., correlation between successive measurements made on the same subject. In the context of these studies, where repeat measurements are almost always made on different days and often several days or even weeks apart, the magnitude of such autocorrelation is likely to be small compared with the total within-subject variation in which we are interested. Furthermore, accurately estimating autocorrelation coefficients would be difficult in relatively small studies, and the degree of mathematical complexity would increase such that specialized statistical methodology and software would be needed, rendering the procedures inaccessible to many researchers. However, our methodology might not be appropriate where repeat measurements were made within the same day or where other biological reasons existed for suspecting nonnegligible autocorrelation.

### Correlation Between Pairs of Tests

The nature of the relationship between a pair of tests can be examined graphically by plotting the subject means for the first test against those for the second. In many cases, particularly after transformations to ensure homoscedasticity, the relationship will be approximately linear, and the degree of correlation can be assessed using the Pearson product-moment correlation coefficient, *r* (3).

In the model of *Eq. 9* for *r* = 2 tests, we are interested in the correlation between the underlying subject means α_{i1} and α_{i2}. However, in the presence of within-subject variation, the sample correlation coefficient, i.e., the correlation between the two sets of observed subject means, underestimates the true correlation between the tests; this effect is known as attenuation and means that, even if the true subject means α_{i1} and α_{i2} were perfectly correlated, the correlation between the observed subject means would be less than unity because of the random fluctuations due to within-subject measurement variation.

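Standard results from measurement error theory (11) show that the correlation between two measurements, both of which are subject to error, is attenuated by the factor

$$\eta = \sqrt{\kappa_{1}\kappa_{2}}$$

where κ_{1} and κ_{2} are the reliability coefficients of the two tests. From *Eq. 6*

$$\kappa_{h} = \frac{\mathrm{DRM}_{h}^{2}}{1+\mathrm{DRM}_{h}^{2}}, \qquad h = 1, 2$$

where DRM is the DR of the means M_{i}, *i* = 1,...,*n*, of the *k* replicate measurements on each subject, rather than of the individual measurements themselves.

Taking the mean of *Eq. 1* over the *k* replicates yields

$$M_{i} = \mu + \alpha_{i} + \xi_{i}$$

where ξ_{i} is normally distributed with mean of zero and variance σ^{2}/*k*. Thus, in *Eq. 2*, σ must be replaced by σ/√*k*, which we estimate by √(MS_{W}/*k*), and from *Eq. 5*

$$\mathrm{DRM} = \sqrt{\frac{\mathrm{MS_B}-\mathrm{MS_W}}{\mathrm{MS_W}}} = \sqrt{k}\times\mathrm{DR} \tag{11}$$

Thus

$$\eta = \sqrt{\frac{\mathrm{DRM}_{1}^{2}\,\mathrm{DRM}_{2}^{2}}{(1+\mathrm{DRM}_{1}^{2})(1+\mathrm{DRM}_{2}^{2})}}$$

The Pearson correlation coefficient *r* can be adjusted for attenuation by dividing it by η, i.e.

$$r_{\mathrm{adj}} = \frac{r}{\eta}$$

where *r*_{adj} is the adjusted *r*.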
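The adjustment for attenuation can be illustrated with a short Python sketch. It assumes that the reliability of the subject means is DRM^{2}/(1 + DRM^{2}) with DRM^{2} = (MS_{B} − MS_{W})/MS_{W}, as follows from the definitions above; the function names and the duplicate data are hypothetical.

```python
import math

def drm_squared(x):
    """Squared DR of the subject means, (MS_B - MS_W) / MS_W, plus the means."""
    n, k = len(x), len(x[0])
    means = [sum(row) / k for row in x]
    grand = sum(means) / n
    ms_b = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_w = sum((v - m) ** 2 for row, m in zip(x, means) for v in row) / (n * (k - 1))
    return (ms_b - ms_w) / ms_w, means

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def adjusted_correlation(x1, x2):
    d1, m1 = drm_squared(x1)
    d2, m2 = drm_squared(x2)
    eta = math.sqrt((d1 / (1 + d1)) * (d2 / (1 + d2)))  # attenuation factor
    r = pearson(m1, m2)                                  # correlation of subject means
    return r, eta, r / eta

# Hypothetical duplicate results from two tests on the same four subjects.
test1 = [[4.0, 6.0], [9.0, 11.0], [14.0, 16.0], [19.0, 21.0]]
test2 = [[2.0, 3.0], [5.0, 4.0], [4.0, 5.0], [8.0, 7.0]]
r, eta, r_adj = adjusted_correlation(test1, test2)
```

Because η is less than one whenever either test is imprecise, the adjusted coefficient is always larger in magnitude than the observed one; in small samples it can exceed unity, which should be read as sampling error rather than as a meaningful correlation.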

In cases where the relationship between the tests is clearly nonlinear, the Spearman rank correlation coefficient *r*_{s} should be used in place of *r* to assess the comparability of the tests. However, there is no universal formula for the attenuation of *r*_{s} in the presence of measurement error.

### Unbiased Estimation of Linear Relationship

In the case where the relationship between a pair of tests is linear, it may be useful to obtain unbiased estimates of the gradient and intercept of the line. Ordinary linear regression gives biased estimates because it only considers errors in the dependent variable, and clearly both tests here are subject to error; the gradient is always underestimated, and regression of subject means from *test 1* on those from *test 2* clearly gives a different relationship to that of *test 2* on *test 1*.

The method that we have chosen to estimate the linear relationship between the subject mean measurements from *test 1* and those from *test 2* is that of “perpendicular least squares, properly weighted.” This essentially minimizes the sum of the squared perpendicular distances between the observed data and the fitted line, but with an adjustment that makes the method invariant to linear transformations of the measurement scales. If M_{i1} and M_{i2}, *i* = 1,...,*n*, are the subject means (over the *k* replicate tests), then the estimated gradient is a function of the sums of squares and products

$$S_{11} = \sum_{i=1}^{n}(M_{i1}-M_{1})^{2}, \qquad S_{22} = \sum_{i=1}^{n}(M_{i2}-M_{2})^{2}, \qquad S_{12} = \sum_{i=1}^{n}(M_{i1}-M_{1})(M_{i2}-M_{2})$$

and of θ, the ratio of the within-subject variances σ_{1}^{2} and σ_{2}^{2} of the two tests. M_{1} and M_{2} are the overall means for each test, i.e., M_{h} = Σ_{i=1,n} M_{ih}/*n*, *h* = 1, 2.

The σ_{1}^{2} and σ_{2}^{2} are estimated from their respective MS_{W}, so that θ is estimated by the corresponding ratio of the two MS_{W} values. The intercept is then estimated from the gradient and the overall means M_{1} and M_{2}.

This method is described and contrasted with other methods by Riggs et al. (27), where it is shown to perform well under a range of values of θ when the correlation between the M_{i1} and M_{i2} is fairly high (above ∼0.5) and θ is estimated fairly precisely. Such conditions are likely to apply in these experiments: the M_{i1} and M_{i2} are measuring the same underlying physiological variable so the correlation will be high, and σ_{1} and σ_{2}, and hence θ, are directly estimated from the repeat measurements using each test.
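For illustration, a line of this kind can be fitted with the standard errors-in-both-variables (Deming-type) slope formula, sketched below in Python. This is a sketch under stated assumptions and may differ in form from the expression used in the study: it assumes that *test 1* is plotted on the horizontal axis, that *test 2* is on the vertical axis, and that `theta` is the within-subject variance of *test 2* divided by that of *test 1*. The function name `weighted_perpendicular_fit` and the data are hypothetical.

```python
import math

def weighted_perpendicular_fit(m1, m2, theta):
    """Fit M2 = a + b * M1.

    theta (assumed convention): error variance of M2 / error variance of M1.
    """
    n = len(m1)
    mean1, mean2 = sum(m1) / n, sum(m2) / n
    s11 = sum((u - mean1) ** 2 for u in m1)
    s22 = sum((v - mean2) ** 2 for v in m2)
    s12 = sum((u - mean1) * (v - mean2) for u, v in zip(m1, m2))
    d = s22 - theta * s11
    b = (d + math.sqrt(d * d + 4 * theta * s12 * s12)) / (2 * s12)
    a = mean2 - b * mean1            # the line passes through the overall means
    return a, b

# Hypothetical subject means from two tests; theta would be MS_W2 / MS_W1.
m1 = [5.0, 10.0, 15.0, 20.0]
m2 = [2.5, 4.5, 4.5, 7.5]
a, b = weighted_perpendicular_fit(m1, m2, theta=0.25)
# Swapping the roles of the tests (and inverting theta) gives the reciprocal slope.
a_rev, b_rev = weighted_perpendicular_fit(m2, m1, theta=4.0)
```

Unlike ordinary regression, the fitted relationship is the same whichever test is placed on the horizontal axis, which is the property that makes the method suitable for deriving equivalent values between tests.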

## SIMULATIONS

We used simulations to examine the distribution of the DR and log_{e}(DR) and to check the accuracy of the Taylor series formula for the SD of log_{e}(DR), given the form of model described by *Eq. 1*. These were performed for all combinations of a range of values of *n* (number of subjects), *k* (number of replicate tests), and Δ (the true DR). For each of these combinations of *n*, *k*, and Δ, the following procedure was performed.

*1*) An arbitrary overall mean μ was chosen, along with a set of *n* equally spaced subject effects α_{i} chosen symmetrically around zero so that Σ_{i=1,n} α_{i} = 0.

*2*) The within-subject variance, σ^{2}, was calculated as

$$\sigma^{2} = \frac{\sum_{i=1}^{n}\alpha_{i}^{2}}{(n-1)\,\Delta^{2}}$$

*3*) For each *i* = 1,...,*n* and *j* = 1,...,*k*, a random observation ε_{ij} was generated from a normal distribution with mean zero and variance σ^{2}. The *X*_{ij} were then generated from *Eq. 1*.

*4*) The DR and hence log_{e}(DR) were calculated from the *X*_{ij} using *Eqs. 3-5*.

*5*) The SD of log_{e}(DR) was calculated from the Taylor series approximation (*Eq. 17*), using the noncentrality parameter λ evaluated from the DR estimate at the current step of the simulation.

*6*) *Steps 3*–*5* were repeated 500 times, yielding a distribution of 500 values for each of DR, log_{e}(DR), and SD of log_{e}(DR).

*7*) The distributions of DR and log_{e}(DR) were checked for normality using the Shapiro-Wilk test and were plotted as histograms.

*8*) The true SD of log_{e}(DR) was estimated from the simulated distribution of log_{e}(DR).

*9*) The distribution of Taylor series estimates of the SD of log_{e}(DR) was compared with the true SD by plotting the median, upper and lower quartiles, and 5th and 95th percentiles against *n* for different values of *k* and Δ.
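*Steps 1*–*4*, *6*, and *8* of this procedure can be sketched in Python using only the standard library. This is a minimal illustration, not the original simulation code: the function names, the overall mean of 10, the unit spacing of the subject effects, the random seed, and the particular combination *n* = 20, *k* = 2, Δ = 3 are all assumptions made for the example.

```python
import math
import random

def estimate_dr(x):
    n, k = len(x), len(x[0])
    means = [sum(row) / k for row in x]
    grand = sum(means) / n
    ms_b = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_w = sum((v - m) ** 2 for row, m in zip(x, means) for v in row) / (n * (k - 1))
    # DR is undefined when MS_B <= MS_W; flag such replicates as NaN.
    return math.sqrt((ms_b - ms_w) / (k * ms_w)) if ms_b > ms_w else float("nan")

def simulate_log_dr(n, k, delta, reps=500, mu=10.0, seed=1):
    rng = random.Random(seed)
    alpha = [i - (n - 1) / 2 for i in range(n)]   # equally spaced, sums to zero
    # Within-subject SD chosen so that the true DR equals delta.
    sigma = math.sqrt(sum(a * a for a in alpha) / (n - 1)) / delta
    logs = []
    for _ in range(reps):
        x = [[mu + a + rng.gauss(0.0, sigma) for _ in range(k)] for a in alpha]
        dr = estimate_dr(x)
        if not math.isnan(dr):
            logs.append(math.log(dr))
    return logs

logs = simulate_log_dr(n=20, k=2, delta=3.0)
mean_log = sum(logs) / len(logs)
sd_log = math.sqrt(sum((l - mean_log) ** 2 for l in logs) / (len(logs) - 1))
```

The empirical SD of the simulated log_{e}(DR) values (`sd_log`) plays the role of the "true" SD in *step 8*, against which the Taylor series estimates would be compared.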

Examination of *P* values from the Shapiro-Wilk test showed some evidence that log_{e}(DR) was not quite normally distributed (slightly >10% of the *P* values examined were <0.05, but there was no apparent relationship between low *P* values and *n*, *k*, or Δ). However, this was a marked improvement over the DR itself, for which >50% of the *P* values were <0.05. Histograms also showed the distribution of log_{e}(DR) to be symmetric, whereas that of DR was markedly positively skewed (data not shown). Log_{e}(DR) was deemed to be sufficiently close to normal for use in the χ^{2} test for equality of DRs. Plots showed that the median Taylor series estimate for the SD of log_{e}(DR) was generally within ±10% of the true value for *n* ≥ 10. However, for *k* = 2, the SD could be overestimated by as much as 20% for 10 ≤ *n* ≤ 20. The distribution of SDs is positively skewed and, for *n* ≥ 10, the 5th percentile of the SD estimates was at most 10% below the true value. Overestimation could be more marked, but even the upper quartile SDs were within +25% of the true value (data not shown). Because the χ^{2} test statistic is a function of the reciprocal of the SD, the test is conservative with respect to overestimates of the SD, i.e., one is unlikely to wrongly reject the null hypothesis (no difference between the DRs), but the test may not be particularly sensitive to genuine differences for small values of *n*, especially if *k* = 2.

Estimates of the SD become very inaccurate for *n* < 10, and the approximation should not be used in this range. However, we would not recommend performing an evaluation study of this kind on such a small number of subjects, since the objective is to characterize the performance of the tests over a reasonable range of the variable of interest.

## SD OF LOG_{E}(DR) USING TAYLOR SERIES APPROXIMATION

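We derived an estimate of the variance (and hence the SD) of log_{e}(DR) using a first-order Taylor series expansion. From *Eq. 7*

$$\mathrm{DR} = \sqrt{\frac{F_{0}-1}{k}}$$

where *F*_{0} = MS_{B}/MS_{W} has a noncentral *F* distribution with degrees of freedom ν_{1} = *n* − 1 and ν_{2} = *n* × (*k* − 1) and noncentrality parameter

$$\lambda = (n-1)\,k\,\Delta^{2}$$

Let LDR = log_{e}(DR) = ½ log_{e}[(*F*_{0} − 1)/*k*].

Expanding LDR as a Taylor series in *F*_{0} about its mean $\bar{F}_{0}$ gives, to first order

$$\mathrm{LDR} \approx \mathrm{LDR}' + (F_{0}-\bar{F}_{0})\,\frac{d(\mathrm{LDR})}{dF_{0}} \tag{12}$$

where LDR′ is LDR evaluated at $\bar{F}_{0}$ and d(LDR)/d*F*_{0} is also evaluated at $\bar{F}_{0}$. Hence

$$\mathrm{var}(\mathrm{LDR}) \approx \left[\frac{d(\mathrm{LDR})}{dF_{0}}\right]^{2}\mathrm{var}(F_{0}) \tag{13}$$

where var indicates variance. Now

$$\frac{d(\mathrm{LDR})}{dF_{0}} = \frac{1}{2(\bar{F}_{0}-1)} = \frac{1}{2k\,\mathrm{DR}'^{2}} \tag{14}$$

where DR′ is DR evaluated at $\bar{F}_{0}$. From general properties of the noncentral *F* distribution (17)

$$\bar{F}_{0} = \frac{\nu_{2}(\nu_{1}+\lambda)}{\nu_{1}(\nu_{2}-2)} \tag{15}$$

$$\mathrm{var}(F_{0}) = 2\left(\frac{\nu_{2}}{\nu_{1}}\right)^{2}\frac{(\nu_{1}+\lambda)^{2}+(\nu_{1}+2\lambda)(\nu_{2}-2)}{(\nu_{2}-2)^{2}(\nu_{2}-4)} \tag{16}$$

Substituting for DR′ from *Eq. 15* into *Eq. 14* and for d(LDR)/d*F*_{0} from *Eq. 14* and var(*F*_{0}) from *Eq. 16* into *Eq. 13* gives

$$\mathrm{var}(\mathrm{LDR}) \approx \frac{\nu_{2}^{2}\left[(\nu_{1}+\lambda)^{2}+(\nu_{1}+2\lambda)(\nu_{2}-2)\right]}{2(\nu_{2}-4)(\nu_{2}\lambda+2\nu_{1})^{2}} \tag{17}$$

which is evaluated by noting that λ = (*n* − 1) × *k* × Δ^{2} and replacing Δ^{2} by DR^{2}, i.e., λ is estimated by (*n* − 1) × *k* × DR^{2}.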
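The approximation can be evaluated numerically with a few lines of Python. The sketch below follows the derivation step by step, using the standard mean and variance of the noncentral *F* distribution and the first-order relation var(LDR) ≈ var(*F*_{0})/[2(mean − 1)]^{2}; the function name `sd_log_dr` and the example values are hypothetical.

```python
import math

def sd_log_dr(dr, n, k):
    """First-order Taylor estimate of the SD of log_e(DR)."""
    nu1, nu2 = n - 1, n * (k - 1)           # requires nu2 > 4
    lam = nu1 * k * dr ** 2                 # noncentrality estimated from the DR
    # Mean and variance of a noncentral F(nu1, nu2, lam) variable.
    mean_f = nu2 * (nu1 + lam) / (nu1 * (nu2 - 2))
    var_f = (2 * (nu2 / nu1) ** 2
             * ((nu1 + lam) ** 2 + (nu1 + 2 * lam) * (nu2 - 2))
             / ((nu2 - 2) ** 2 * (nu2 - 4)))
    # d(LDR)/dF0 = 1 / (2 * (F0 - 1)), evaluated at the mean of F0.
    return math.sqrt(var_f) / (2 * (mean_f - 1))

# Illustrative values: DR = 3 estimated from duplicate tests in 20 subjects.
s = sd_log_dr(dr=3.0, n=20, k=2)
```

For these illustrative inputs the estimated SD of log_{e}(DR) is about 0.19, so an approximate 95% range for the DR on the log scale would span roughly ±0.38 around log_{e}(3).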

## Footnotes

Address for reprint requests: J. C. Levy, Diabetes Research Laboratories, Radcliffe Infirmary, Woodstock Rd., Oxford OX2 6HE, UK.

This study was done with aid of grants from Servier and the Alan & Babette Sainsbury Trust.

- Copyright © 1999 the American Physiological Society