## Abstract

Parameter reproducibility is necessary to perform longitudinal studies where parameters are assessed to monitor disease progression or effect of therapy but are also useful in powering the study, i.e., to define how many subjects should be studied to observe a given effect. The assessment of parameter reproducibility is usually accomplished by methods that do not take into account the fact that these parameters are estimated with uncertainty. This is particularly relevant in physiological and clinical studies where usually reproducibility cannot be assessed by multiple testing and is usually assessed from a single replication of the test. Working in a suitable stochastic framework, here we propose a new index (*S*) to measure reproducibility that takes into account parameter uncertainty and is particularly suited to handle the normal testing conditions of physiological and clinical investigations. Simulation results prove that *S*, by properly taking into account parameter uncertainty, is more accurate and robust than the methods available in the literature. The new metric is applied to assess reproducibility of insulin sensitivity and β-cell responsivity of a mixed-meal tolerance test from data obtained in the same subjects retested 1 wk apart. Results show that the indices of insulin sensitivity and β-cell responsivity to glucose are well reproducible. We conclude that the oral minimal models provide useful indices that can be used safely in prospective studies or to assess the efficacy of a given therapy.

- pearson correlation coefficient
- intraclass correlation coefficient
- parameter estimation
- minimal models

an important use of models in physiology and medicine is their ability to estimate parameters that are not directly measurable; a structural model of the system is postulated, containing parameters that reflect some physiological function and are estimated from experimental data (3). An example is provided by the intravenous glucose tolerance test (IVGTT), oral glucose tolerance test (OGTT)/meal tolerance (MTT) glucose, and C-peptide minimal models that provide estimates of insulin sensitivity and β-cell responsivity, respectively (4, 5). These model parameters can be estimated only with a certain confidence interval provided by parameter estimation, usually as percent coefficient of variation (%CV), if the measurement error is taken into account.

In designing protocols, it is important to consider parameter reproducibility. This is useful not only in longitudinal studies, where parameters are assessed to monitor the time course, e.g., of β-cell function due to the disease progression or to the effect of therapy, but also in powering the study, i.e., to assess how many subjects should be studied to observe a given effect. In other words, since a subject can undergo physiological variation from one visit to another, the question we want to answer is whether, despite the physiological variations that may arise and the “technical” variations due to measurement error, a test is reproducible or not.

The assessment of parameter reproducibility is usually accomplished by methods, e.g., Pearson correlation coefficient, intraclass correlation coefficient (ICC), and Bland-Altman analysis, that do not take into account the fact that these parameters are estimated with uncertainty. It is true that most of reproducibility indices can be modified in such a way that the most precise parameter estimate contributes more than the less precise ones to the reproducibility assessment. However, we will see that this approach does not satisfactorily solve the problem of quantifying reproducibility. In addition, it is well known that the standard methods used to assess reproducibility (e.g., Pearson correlation coefficient and ICC) are not very robust to the presence of outliers in the data.

This problem is particularly relevant in physiological and clinical studies where usually reproducibility cannot be assessed by multiple testing and is usually investigated from a single replication of the test, say two IVGTTs or MTTs. The literature on the subject is surprisingly scarce in the metabolic literature, and researchers without a very solid statistical background usually ignore the problem. For instance, focusing on IVGTT, OGTT/MTT glucose, and C-peptide minimal models, Ferrari et al. (6) have assessed insulin sensitivity from an IVGTT in 15 healthy subjects on two occasions. They reported an average difference in S_{I} of 8.9%, a mean coefficient of intraindividual variation of 14.4%, and a significant correlation between the measures of the two visits of 0.78, and based on that, they concluded that insulin sensitivity estimated with the minimal model is reproducible. In a previous work (4), we assessed the reproducibility of MTT β-cell responsivity and insulin sensitivity by calculating intraindividual variation of each index with the simple formula abs(index^{1},index^{2})/mean(index^{1},index^{2}) and reported a reproducibility between 20 and 30%. None of the above studies has considered that minimal model parameters bear some uncertainty.

Working in a stochastic framework, here we propose a method to quantify reproducibility that exploits the knowledge of parameter uncertainty and is particularly suited to handle the normal testing conditions of physiological and clinical investigations. The method is first assessed in simulation by comparing its performance with those of classical techniques available in the literature. Then, the new approach is applied to quantify reproducibility of insulin sensitivity and β-cell responsivity from MTT data obtained in the same subjects retested 1 wk apart.

## MATERIALS AND METHODS

### Subjects and Protocol

Eight healthy subjects (4 women and 4 men, mean age 44 ± 2 yr, height 172 ± 4 cm, weight 85 ± 8 kg, body mass index 29.7 ± 2.3 kg/m^{2}, lean body mass 48 ± 5 kg) underwent to two identical standardized mixed-meal tests on two occasions (data already published in Ref. 10). The mixed meals (10 kcal/kg, 45% carbohydrate, 15% protein, and 40% fat) consisted of scrambled eggs, Canadian bacon, 100 ml of water, and Jell-O containing [1-^{13}C]glucose (as part of separate study) consumed within 15 min. The beaker containing the Jell-O was rinsed twice with 20 ml of water, and the rinse solution was consumed. Blood was sampled from the arterialized venous site at −30, −20, −10, 0, 5, 10, 15, 20, 30, 40, 50, 60, 75, 90, 120, 150, 180, 210, 240, 260, 280, 300, and 360 min for measurement of plasma glucose concentration, whereas insulin and C-peptide were measured at −30, 0, 30, 60, 90, 120, 150, 180, 240, 300, and 360 min only. More details on the protocol can be found in Ref. 10. Individual data are shown in Fig. 1.

### Oral Minimal Models

β-Cell responsivity indices were estimated from plasma glucose and C-peptide concentrations by using the oral C-peptide minimal model (Fig. 2, *right*) (4, 5), assuming that insulin secretion is made up of two components. The dynamic component represents secretion of promptly releasable insulin and is proportional to the rate of increase of glucose concentration through a parameter (Φ_{d}; 10^{−9}) that defines the dynamic responsivity index. The static component derives from provision of new insulin to the releasable pool and is characterized by a static index (Φ_{s}; 10^{−9} min) and by a delay time constant (*T*; min). Total β-cell responsivity (Φ_{t}; 10^{−9} min) is calculated as a composite of Φ_{d} and Φ_{s}.

Insulin sensitivity (S_{I}; 10^{−4} dl·kg^{−1}·min^{−1} per μU/ml) was estimated from plasma glucose and insulin concentrations measured during the test by identifying the oral glucose minimal model (Fig. 2, *left*) (4, 5).

### Assessment of Reproducibility

#### Correlation coefficient and t-test.

The method used in Ref. 6 to assess reproducibility of a model parameter consisted of calculating the Pearson correlation coefficient and performing a paired *t*-test between the S_{I} indices estimated in the same subjects on the two occasions. Briefly, given two vectors, *X* and Y, whose i-th components *X(i)* and *Y(i)* contain the estimates of the same parameter obtained exploiting two different experiments performed on the *i*-th subject, the Pearson correlation coefficient is given by
(1)
where *cov*_{XY} is the covariance between *X* and *Y*, with σ_{X} and *σ*_{Y} the standard deviation of *X* and *Y*, respectively. This coefficient assumes values between −1 and 1. The closer that *ρ* is to 1, the stronger the correlation is between the two variables. In practice, *ρ* > 0.7 is usually considered a good correlation. However, a high value of *ρ* does not guarantee reproducibility, since, as already observed in Refs. 1 and 8, *X* and *Y* can be very well correlated even if *X* is systematically higher or lower than *Y*. For instance if *X* = 2*Y*, one would obtain *ρ* = 1, but obviously the parameter is not reproducible. Thus, to claim reproducibility, one also has to demonstrate that there is not a systematic over- or underestimation. To do that, one can use the paired Student's *t*-test (or the Wilcoxon signed-rank test for not normally distributed variables), which assesses whether the means of *X* and *Y* are statistically different from each other.

#### Concordance correlation coefficient.

Another solution is to resort to the concordance Ccorrelation coefficient, *ρ*_{c} (8):
(2)
where *μ*_{X} and *μ*_{Y} are the means of *X* and *Y*, respectively.

It has been proven that *ρ*_{c} = *ρ*·C_{b,} with C_{b} being a bias correction term that measures how far the best fit line deviates from the 45° line.

#### Intraclass correlation coefficient.

A popular method used in statistics to measure reproducibility is the ICC. It is worth noting that a variety of parametric and nonparametric statistical methods aiming to assess the agreement between variables are classified as ICC's methods (see, e.g., Ref. 9 for a critical review). They are based on different models/assumptions and will provide different results if not properly employed. Guidelines on how to choose the most appropriate method to answer a specific question are reported in Ref. 9. For sake of brevity, here we report only one possible formulation of the ICC method.

In its original formulation, ICC is similar to Pearson correlation, but the key difference between ICC and the Pearson correlation is that *X* and *Y* are pooled to estimate the mean (*μ*) and variance (*σ*^{2}):
(3)
(4)
(5)

In this way, the difference in means of *X* and *Y*, if any, is automatically taken into account. For instance, if *X* = 2*Y*, then *ICC* < 1, and the value depends on the mean and variance of *X*.

### Weighted Indices

All of the above indices, in their original formulation, do not take into account uncertainty in individual parameter estimates. However, they can be modified to properly weigh each parameter estimate, based on its precision.

In particular, if one defines the weighted covariance as
(6)
then the weighted correlation coefficient is
(7)
and the weighted concordance correlation coefficient is
(8)
where *μ*_{Xw} is the weighted mean of the vector *X*, *σ*_{Xw}^{2} is the weighted variance, and *σ*_{Xw} is the weighted standard deviation.

Similarly, *ICC* can also be modified as
(9)
where *μ*_{w} and *σ*_{w}^{2} are the weighted pooled mean and variance of the vectors *X* and *Y*.

It is worth noting that, even if the incorporation in the above formulas of the information on parameter uncertainty is rather straightforward, most of the statistical softwares do not offer this option, and if they do (e.g., SAS), most of the users, not being familiar with the issue, do not employ it.

In addition, the choice of the optimal weights (*w*_{i}) is not always trivial. One possibility is to set *w*_{i} as the inverse of the standard deviation of the estimated parameters (assuming them to be Gaussian):
(10)
where *CV*_{Xi} is the coefficient of variation as provided by the estimation routine for the parameter *X(i)*.

### The Similarity Index

To take into account that *X* and *Y* are estimated from noisy data with a certain confidence interval, we resort to a suitable stochastic framework.

For *i* = 1,2,…,*N*, we use *θ(i)* to denote independent and identically distributed random variables. In particular, *θ(i)* represents one of the parameters of the model that is estimated by fitting the model to data coming from the *i*-th subject. Let us assume that the statistical distribution of *θ(i)*, denoted by *p[θ(i)]*, is known, e.g., thanks to population studies or defined by a noninformative prior, e.g., including only information on nonnegativity. Let *A(i)* and *B(i)* denote the data sets used to estimate *θ(i)* and collected during the two experiments. In addition, *p[θ(i)|A(i)]* and *p[θ(i)|B(i)]* indicate the a posteriori probability density of *θ(i)* conditional on *A(i)* and *B(i)*, repsectively. According to Bayes rule, they are
(11)
where *p[A(i)|θ(i)]* and *p[B(i)|θ(i)]* are the likelihood functions.

Hereafter, to simplify the notation, let us denote *p[θ(i)|A(i)]* with *p*_{A}^{i}*(θ)*.

The key idea is that the similarity index, hereby denoted by S, should be a function that *1*) takes values on [0,1] and *2*) acts on the 2*N* probability density functions: {*p*_{A}^{i}*(θ)*, *p*_{B}^{i}*(θ)*}_{i = 1}^{N}.

Note that the experimental data A(*i*) and B(*i*) are all stochastic quantities whose randomness can derive not only from their relationship with θ(*i*) and the measurement noise but also on the type of inputs that are adopted (e.g., intravenous vs. oral administration). Note also that the statistics of all of these quantities do not depend on *i*, so in the sequel we can skip the dependence on *i* when computing some expectations.

The distance (*D*) between the two probability density functions *p*_{A}^{i}*(θ)* and p_{B}^{i}*(θ)* is defined as follows. Think first of *A(i)* and *B(i)* as two fixed deterministic vectors. Then we compute
(12)
where we assume that the realizations from *p*_{A}^{i}*(θ)* and *p*_{B}^{i}*(θ)* have the same sign. *Equation 12* defines a map between any possible couple of data sets to the real line. Now, if we consider *A(i)* and *B(i)* as random vectors, *D* becomes a random variable. In particular, given its definition, the mean of *D*, i.e., *E[D]*, can be interpreted as the estimated coefficient of variation between the repeated estimates (CV^{BRE}). Note also that such expectation does not depend on *i* since we are assuming that the joint distribution of *θ(i)*, *A(i)*, and *B(i)* does not depend on *i*. Now, we will use this expectation to obtain an index that allows us to assess whether, on average, the different experiments lead to similar values of the parameter of interest. Such similarity index is denoted by S and is defined as
(13)
where we stress that *E* denotes the expectation with regard to all of the randomness underlying the experiment [hence, also with regard to the data sets *A(i)* and *B(i)*], *α* is the value of the expectation of *D* at which the similarity is halved, and *n* is a parameter governing how fast *S* decreases with the increase of *D* expectation. The choice of *α* and *n* will be discussed in the following section.

Consider now a real scenario where we have performed a couple of experiments for each of *N* subjects. In this way, we have obtained *N* realizations of the random variable *D*. Thus, one can compute the similarity index as an approximation of *Eq. 13*, where *E[D]* is replaced by the sample mean. This leads to the following equation for the similarity index:
(14)
Notice that in the above expression the *A(i)* and *B(i)*, defining the posterior densities *p*_{A}^{i}*(θ)* and *p*_{B}^{i}*(θ)*, are no more random variables, as when computing *E[D]*, but they are set to their observed values. Hence, each term inside the summation now depends on the value of *i*.

It is useful now to discuss how each single term *D* [*p*_{A}^{i}*(θ)*, *p*_{B}^{i}*(θ)*], defined in *Eq. 12* and present in *Eq. 14*, can be computed in real applications. If we assign a noninformative prior to *θ(i)*, from *Eq. 11* we see that the posterior densities become proportional to the likelihood functions. Using a normal approximation, *p*_{A}^{i}*(θ)* and *p*_{B}^{i}*(θ)* can be thought of as Gaussians with the mean defined by the maximum likelihood estimates *θ̂*_{A} and *θ̂*_{B}, i.e., the maximizers of *p*_{θ}^{i}*(A)* and *p*_{θ}^{i}*(B)*, and with variances *σ*_{A}^{2} and *σ*_{B}^{2}, e.g., computed by using the Fisher information matrix. All of these quantities can be obtained by parameter estimation softwares, e.g., SAAM II (1) or Matlab routine lsqnonlin. Then, a Monte Carlo approximation of *Eq. 11* is
(15)
where *x*_{j} and *y*_{j} are independent realizations from Gaussians of means and variances given by (*θ̂*_{A}, *θ̂*_{B}) and (*σ*_{A}^{2}, *σ*_{B}^{2}), respectively, whereas M is a sufficiently large value, e.g., *M* = 1,000. Often, because of physiological constraints, parameters are known to be nonnegative. In such a case, one can just use in *Eq. 15* the realizations of *x*_{j} and *y*_{j}, which are positive. This corresponds to assume that the posterior densities are truncated Gaussians.

The Matlab code for the calculation of the index is available for free download at www.dei.unipd.it/∼dallaman/similarity. Briefly, the function takes as input four column vectors, the first containing *θ̂*_{A}*(i)* for *i* = 1..*N*, the second containing the corresponding percent coefficient of variation (%CV), the third containing *θ̂*_{B}*(i)* for *i* = 1..*N*, and the fourth containing the corresponding %CV, and gives as output the value of *S* calculated for *α* = 0.5 and *n* = 2 (see *Tuning of parameters α and n)*.

## RESULTS

### Simulation

#### Tuning of parameters α and n.

Parameter α was set to 0.5, since it is reasonable to expect that *S* = 0.5 when the mean CV^{BRE} is 50%. For what concerns the choice of parameter *n*, values between 2 and 4 have been considered; *n* = 1 was discarded since in that case *S* does goes to zero too slowly.

#### Behavior.

The value of the new index *S* clearly depends on the precision with which the parameter *θ* is estimated from the two data sets. This information, usually expressed as %CV, is relevant in the generation of the probability density functions *p*_{A} and *p*_{B}. Thus, it is of interest to investigate how S changes as a function of the estimated CV. To do that, we calculated *S* from 100 pairs of parameter estimates of *θ(i)* at different CV levels (from 1 to 100%) in the case of perfect, good, average, modest, and bad reproducibility (Fig. 3). Figure 3 shows the *S* vs. %CV for *n* = 2. As expected, *S* decreases as far as CV increases.

#### Robustness.

To assess the robustness of the new similarity index (calculated for *n* = 2, 3, and 4) vs. the standard methods (Pearson correlation coefficient + Student's *t*-test, concordance correlation coefficient and ICC, both accounting or not for parameter uncertainty), we generated 1,000 pairs of parameter estimates of *θ(i)* from a bivariate log-normal distribution (representing the whole population) and CV of 20% in three cases: *1*) good reproducibility, *2*) average reproducibility, and *3*) bad reproducibility. The value of the CV was chosen to mimic what is usually achievable in practice with not too frequently sampled experiments.

For each case, we randomly extracted 30 pairs 100 times and calculated all the indices for the 100 random extractions. To be considered a robust estimate of the reproducibility, considering the 100 realizations of the subsampling, the index should be accurate and precise; i.e., it should provide on average a value similar to the one obtained in the whole population (reference): (16) and the distribution of the estimated values around the mean should be as narrow as possible: (17)

Accuracy and precision (as defined in *Eqs. 16* and *17*) are always very desirable properties for a metric. In fact, if data are poorly reproducible, a desirable property of a good method is its provision of a low value of the metric for most of noise realizations. This is guaranteed only if the dispersion of the estimated values around the mean is narrow. The simulation results are summarized in Table 1.

#### Comments.

In the case of good reproducibility, all of the methods perform quite well, presenting accurate and rather precise estimates of the reproducibility. In the case of average reproducibility, the method based on Pearson correlation coefficient + Student's *t*-test is very accurate but shows low precision (between 86 and 130%); *ρ*_{c} and *ICC* are a bit less accurate (5.93 and 7.22%, respectively) than Pearson correlation and not precise (95 and 96%, respectively); on the other hand, *S* shows good accuracy (from 2.04 to 4.29%) and precision ranging from 40 to 79%. Of note, the best performance are achieved with *n* = 2. The weighted indices performed somewhat better than their nonweighted counterparts, albeit not significantly.

Finally, also in the case of bad reproducibility, the new index *S* (particularly for *n* = 2) is more robust than the other methods, especially for what concerns the precision (64 vs. ∼2,000%).

These results suggest that *S*, by exploiting the information on parameter uncertainty in a stochastic framework, is less sensitive to the presence of outliers and/or to the particular random extraction.

The above results prove that the new similarity index is a robust measure of the reproducibility of parameters estimated from noisy experimental data.

### The Oral Minimal Model Method: Reproducibility of β-Cell Responsivity and Insulin Sensitivity

Oral C-peptide and glucose minimal models have been identified using data measured in the eight subjects in the two visits. The model fitted the data well in all occasions (not shown). Estimates of Φ_{t}, Φ_{d}, Φ_{s}, and S_{I} for *visits 1* and *2* are reported in Table 2 together with their CVs.

Pearson correlation coefficient, Student *t*-test *P* value, concordance correlation coefficient, ICC, and the newly defined index *S* are reported in Table 3. All of the methods agree in reporting a good reproducibility of Φ_{t} and S_{I}; in particular, *S* was 0.95 for Φ_{t} and 0.80 for S_{I}, indicating a between visit CV of 12 and 25%, respectively. A bit less reproducible was Φ_{d} (*S* = 0.58, indicating a between visit CV of 42%). This could be due to the fact that, in this experiment, C-peptide was not measured at *t* = 10 and 20 min after the meal, as is usually recommended (4, 5), making Φ_{d} estimates less precise (see Table 2) and accurate. Disagreement among reproducibility indices was found for parameter Φ_{s}. Pearson correlation coefficient (*ρ* = 0.37), concordance correlation coefficient (*ρ*_{c} = 0.33), and *ICC* (*ICC* = 0.29) would suggest a low reproducibility, whereas *S* was 0.73, indicating a between visit CV of 30%. This is likely due to subject s104, which, despite a much lower glucose profile in *visit 2* than *visit 1*, shows similar C-peptide concentrations during the two visits (Fig. 1). This allows us to suspect that some measurement issue may have occurred during one of the two visits. As a matter of fact, if this subject is discarded from the analysis, Pearson correlation coefficient, concordance correlation coefficient, and ICC would be 0.69, 0.33, and 0.50, respectively, whereas *S* would be 0.76. This confirms what was found in simulation, i.e., that the *S* index is less sensitive to outliers than other methods.

## DISCUSSION

Parameter reproducibility is an important prerequisite in prospective studies, e.g., for monitoring disease progression or treatment efficacy and experiment design, e.g., defining the number of subjects necessary for testing a given effect size. In physiological investigations, parameters are often accessible indirectly; i.e., they are estimated by identifying a model from noisy data. This makes the parameter being known with a certain confidence interval, which should be taken into account in assessing parameter reproducibility. Surprisingly, the literature on this important topic is scarce. The most popular method used to assess reproducibility is the *ICC*; however, this does not take into account parameter uncertainty in its original formulation. As shown in this article, *ICC*, Pearson correlation coefficient, and concordance correlation coefficient can be easily modified to weigh parameters based on their precision. However, this does not significantly improve the robustness of the methods. These drawbacks have motivated the development of a different index of reproducibility that has a different structure with regard to all the ones present in the literature. Being based on the coefficient of variation concept, it includes in a more suitable way parameter uncertainty.

In particular, the new index has been obtained by putting the problem in a suitable stochastic framework. The starting point is the calculation for each subject of the two posterior probabilities of the parameter of interest, which can be obtained from two different experiments. Such two densities thus embed all of the information (uncertainty) on the parameter of a subject extracted from a single experiment. The new index (*S*) then measures reproducibility by exploiting a suitable distance applied to all of the couples of posterior distributions obtained from the experimental data. This approach is proven to be particularly suited to assess parameter reproducibility in physiological and clinical investigations. The index has good properties: it takes values in the [0,1] interval, with 1 denoting perfect reproducibility, it is more accurate and precise than other metrics, and it is robust to outliers (as demonstrated in simulation). Another advantage of the index is that it is inversely related to the coefficient of variation between the repeated measures, thus making the interpretation of the results straightforward.

The similarity index has then been applied to the oral minimal models, which provide indices of insulin sensitivity and β-cell responsivity (4, 5). The reproducibility of the indices, already calculated in (4) with a simple formula, has been reassessed with the new metric and with the literature correlation and *t*-test, concordance correlation coefficient, and ICC. Results show that the indices of insulin sensitivity and β-cell responsivity to glucose are well reproducible. Nevertheless, the sampling schedule in Toffolo et al. (10) was not the one recommended for the oral minimal model method (4, 5), since the purpose of that study was to assess glucose fluxes and not parameter reproducibility. The only parameter that suffered somewhat was the dynamic component of β-cell responsivity, which will certainly improve with the recommended sampling schedule (0, 10, 20, 30, etc., instead of 0, 30, etc.).

We conclude that the minimal models provide useful indices that can safely be used in prospective studies or to assess the efficacy of a given therapy.

## GRANTS

This study was supported by the Italian Ministero dell'Università e della Ricerca, FIRB 2008.

## DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the authors.

## AUTHOR CONTRIBUTIONS

C.D.M. and G.P. conception and design of research; C.D.M. and M.R. analyzed data; C.D.M., G.P., and C.C. interpreted results of experiments; C.D.M. and M.R. prepared figures; C.D.M. drafted manuscript; C.D.M., G.P., M.R., and C.C. approved final version of manuscript; G.P. and C.C. edited and revised manuscript.

- Copyright © 2015 the American Physiological Society