Hematopathology / ORIGINAL ARTICLE
Standardization of Reticulocyte Values in an Antidoping Context
Michael J. Ashenden, PhD,1 Ken Sharpe, PhD,2 Rasmus Damsgaard, MD,3 and Lisa Jarvis, PhD4
Key Words: Recombinant human erythropoietin; Reticulocytes; Athletes; Blood tests; Doping; Erythropoiesis
DOI: 10.1309/1FAM1VT3N76GJGXV
Abstract
The lack of standardization of reticulocyte results
hinders the ability of sports authorities to recognize the
telltale fluctuations over time that are typical for
athletes using illegal blood doping to improve their
performance. Therefore, the aim of the present study
was to devise a tenable approach for antidoping
authorities to quantify instrument bias. We evaluated
reticulocyte data derived during a 42-week period from
210 hospital patient blood samples measured in
duplicate simultaneously on up to 11 hematology
analyzers located in a single laboratory.
We found that square root transformation of
reticulocyte values enabled quantification of
interinstrument bias by using the mean reticulocyte
value of a cohort of approximately 54 subjects as a de
facto calibration agent. We also demonstrated that
measurement precision associated with low reticulocyte
values was not inferior to that associated with higher
values.
Am J Clin Pathol 2004;121:816-825
© American Society for Clinical Pathology
An independent examination commissioned by the
World Anti-Doping Agency, Montreal, Canada, recommended that blood testing be used in conjunction with urine
testing to deter the use of recombinant human erythropoietin
(rHuEPO) in sports.1 In addition to the cost saving offered by
the use of blood as a screening tool for subsequent urinalysis,
the rationale for this recommendation included the capacity
to target athletes for follow-up testing on the basis of “suspicious” blood profiles.
For several weeks after ceasing use of rHuEPO, athletes
will have elevated hemoglobin levels in tandem with
suppressed reticulocyte production.2 When hemoglobin (g/L)
and reticulocyte (%) values are substituted into an algorithm
developed to detect athletes who have stopped using
rHuEPO, an elevated OFF-hr model score (OFF-hr score =
hemoglobin value – 60√reticulocyte value) can discriminate
athletes who recently have stopped using rHuEPO from
nonusers.3 However, the benefits associated with the
enhanced “reach back” offered by scrutinizing hematologic
parameters are tempered by concerns about the standardization of reticulocyte results, for which no universally
accepted reference method exists.4 Issues pertinent to antidoping efforts include interchangeability of results derived
from different platforms, the absence of a universal calibration material, and the precision of reticulocyte assays at low
reticulocyte counts. Low reticulocyte counts are a hallmark
of previous rHuEPO use, so that imprecision at such levels
will dilute the ability of authorities to recognize signs of
blood manipulation.
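To make the OFF-hr arithmetic concrete, the sketch below evaluates OFF-hr = hemoglobin − 60√reticulocyte, assuming hemoglobin in g/L and reticulocytes in percent, as stated above. The function name `off_hr_score` is hypothetical, and the input values are illustrative, not study data.

```python
import math


def off_hr_score(hemoglobin_g_per_l: float, retic_percent: float) -> float:
    """OFF-hr model score = hemoglobin (g/L) - 60 * sqrt(reticulocytes, %)."""
    if retic_percent < 0:
        raise ValueError("reticulocyte percentage cannot be negative")
    return hemoglobin_g_per_l - 60 * math.sqrt(retic_percent)


# Same hemoglobin, suppressed vs typical reticulocyte production (illustrative values)
print(off_hr_score(150, 0.25))  # prints 120.0
print(off_hr_score(150, 1.0))   # prints 90.0
```

With hemoglobin held constant, a fall in reticulocytes from 1.0% to 0.25% raises the score by 30 units, which shows why imprecision at low reticulocyte counts feeds directly into the score.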
Thresholds for OFF-hr model scores have been published
that enable practitioners to recognize the probability associated with unusual deviations from expected scores.3 However,
these thresholds were derived using data collected only on the
ADVIA 120 platform (Bayer, Tarrytown, NY), so that if reticulocyte values derived from other platforms are substituted
into the equation, intermethod bias might compromise the
integrity of this approach. The absence of a reticulocyte calibration material that can be used by all methods prevents
technicians from quantifying intermethod bias.
Fresh blood is the ideal material as a calibrator for reticulocyte counts4; however, the poor stability of reticulocytes, which
undergo maturation, and manufacturer-dependent calibration
procedures that cannot be modified easily by the user4
confound the use of a whole-blood calibration material. Even
if universal calibration material were available, it is unlikely
that the precision reported for stabilized cells would replicate
the precision of the instrument when measuring whole blood,
because stabilizing the reticulocytes to confer a useful shelf
life inevitably would alter cellular characteristics. This has
been shown to influence the different dye-detection-algorithm
scenarios used by the manufacturers of reticulocyte platforms
in an unpredictable manner (L.J., unpublished observations).
The aims of the present study were as follows: (1) evaluate
the characteristics of multiple reticulocyte platforms to assess
whether values derived using different platforms conceivably
could be substituted into the OFF-hr model without compromising the integrity of this approach; (2) explore the precision
of low vs high reticulocyte values; and (3) develop an approach
to adjust for bias in results from a specific instrument before its
utilization in an antidoping setting.
Materials and Methods
Parallel Evaluation of 11 Reticulocyte Analyzers
We obtained blood samples from patients in a
Minneapolis, MN, hospital. Five patients were selected
randomly each week (unspecified sex), yielding a total of
210 samples during a 42-week period. Samples were
obtained by venipuncture using K3EDTA tubes, and data
were derived from analysis of these samples on 11 reticulocyte analyzers that were maintained at R&D Systems,
Minneapolis ❚Table 1❚. With a few exceptions, each sample
was measured in duplicate on each instrument. Owing to
sample volume limitations, it was not possible to perform
these measures on each sample using all instruments (on
average, each sample was measured in duplicate on 8.8
instruments). Reticulocyte percentages or square root–transformed reticulocyte percentages are reported for all analyses
as indicated.
Instruments were compared according to the method of
Bland and Altman,5 in which the difference between (the
average of duplicates from) 2 machines is plotted against the
average of the 2 methods for each data point. We used the
square root of reticulocyte percentages for our comparisons.
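A minimal sketch of the Bland and Altman summary described above, on the square root scale. It assumes each input value is already the average of a sample's duplicates, that both lists are in the same sample order, and that the SD of the differences is the ordinary sample SD (n − 1 denominator); the text does not state the denominator. The function name `bland_altman` is hypothetical.

```python
import math
from statistics import mean, stdev


def bland_altman(a_percent, b_percent):
    """Bland-Altman summary on the square root scale.

    a_percent, b_percent: per-sample reticulocyte percentages (each the mean
    of duplicates) from instruments A and B, in the same sample order.
    Returns (points, mean_difference, sd_of_differences); each point is
    (average of the 2 methods, A minus B).
    """
    if len(a_percent) != len(b_percent) or len(a_percent) < 2:
        raise ValueError("need paired data for at least 2 samples")
    a = [math.sqrt(x) for x in a_percent]
    b = [math.sqrt(x) for x in b_percent]
    diffs = [x - y for x, y in zip(a, b)]
    # x-coordinate: average of the 2 methods; y-coordinate: their difference
    points = [((x + y) / 2, d) for x, y, d in zip(a, b, diffs)]
    return points, mean(diffs), stdev(diffs)
```

Plotting `points` with a horizontal line at the mean difference reproduces the layout of Figures 1 and 2; the SD of differences is the number printed in each panel.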
The precision of each instrument is ascertained from the
between-duplicate SD (of square root percentages), given as

$$\mathrm{SD}_{\mathrm{dup}} = \sqrt{\frac{\sum_{i=1}^{n} (x_{i1} - x_{i2})^2}{2n}}$$

where $x_{i1}$ and $x_{i2}$ are duplicate readings on the ith sample and
n is the number of samples analyzed on a given instrument.
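The between-duplicate SD formula can be sketched as follows, assuming raw percentages are passed in and converted to the square root scale inside the function (the name `between_duplicate_sd` is hypothetical). The divisor 2n arises because each squared difference of 2 readings has expectation 2σ², so the sum over n samples divided by 2n estimates σ².

```python
import math


def between_duplicate_sd(pairs_percent):
    """SD between duplicates on the square root scale: sqrt(sum d_i^2 / 2n).

    pairs_percent: list of (first reading, second reading) in percent,
    one tuple per sample measured on a single instrument.
    """
    n = len(pairs_percent)
    if n == 0:
        raise ValueError("no duplicate pairs supplied")
    # Sum of squared within-pair differences after square root transformation
    ss = sum((math.sqrt(x1) - math.sqrt(x2)) ** 2 for x1, x2 in pairs_percent)
    return math.sqrt(ss / (2 * n))
```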
❚Table 1❚
Instruments Used in the Comparison and the Approach Used for Staining of Reticulocytes*

Instrument No. | Make/Model | Approach | Stain | Detection
1 | Bayer ADVIA 120 | SV | Oxazine 750 | Absorbance
2 | Abbott CellDyn CD3200 | SV | New methylene blue | Light scatter
3 | Abbott CellDyn CD3500 | SV† | New methylene blue | Light scatter
4 | Abbott CellDyn CD3700 | SV† | New methylene blue | Light scatter
5 | Abbott CellDyn CD4000 | F | CD4K 530 | Fluorescence
6 | Becton Dickinson ReticCount FACSCount | F† | Thiazole orange | Fluorescence
7 | Bayer H*3 | SV† | Oxazine 750 | Absorbance
8 | Beckman Coulter STKS | SV† | New methylene blue | VCS
9 | Sysmex XE2100 | F | Stromatolyser-NR | Fluorescence
10 | ABX Pentra 120 Retic | F | Thiazole orange | Fluorescence
11 | Beckman Coulter GENS | SV | New methylene blue | VCS

F, fluorescence; SV, supravital; VCS, volume, conductivity, and light scatter.
* Manufacturer locations are as follows: Bayer, Tarrytown, NY; Abbott, Abbott Park, IL; Becton Dickinson, Franklin Lakes, NJ; Sysmex, Kobe, Japan; ABX, Montpellier, France; Beckman Coulter, Miami, FL.
† Depicts semiautomated or manual method.

In addition to the aforementioned evaluations, we also
examined different sources of variation for each instrument.
Because 5 blood samples were analyzed on any given instrument at a single time point and this process was repeated
multiple times throughout the 42-week period, it was
possible to quantify 3 components of variation: variation
between duplicates, variation between samples within weeks,
and variation between weeks. By comparing the
between- and within-weeks variability, it is possible to test
the stability of the instrument throughout the study period. If an
instrument is not stable, the variability between weeks will
be greater than what can be accounted for by the variation
between samples within weeks.
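The text does not name the test used to compare between-week with within-week variability. One conventional choice, assumed here, is a one-way analysis of variance on per-sample values (for example, the mean of each duplicate pair on the square root scale) grouped by week; an unstable instrument inflates the between-week mean square relative to the within-week mean square. The function name is hypothetical; a P value could then be taken from the F distribution (for example, SciPy's `scipy.stats.f.sf(F, df1, df2)`), which is omitted here to keep the sketch dependency free.

```python
from statistics import mean


def week_f_statistic(weeks):
    """One-way ANOVA F statistic for between-week vs within-week variation.

    weeks: one list per week, each holding per-sample values from a single
    instrument (e.g. the mean of duplicate sqrt-retic readings).
    Returns (F, (df_between, df_within)).
    """
    k = len(weeks)
    n_total = sum(len(w) for w in weeks)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 weeks and some within-week replication")
    week_means = [mean(w) for w in weeks]
    grand_mean = sum(sum(w) for w in weeks) / n_total
    # Between-week sum of squares: spread of week means around the grand mean
    ss_between = sum(len(w) * (m - grand_mean) ** 2 for w, m in zip(weeks, week_means))
    # Within-week sum of squares: spread of samples around their own week mean
    ss_within = sum((v - m) ** 2 for w, m in zip(weeks, week_means) for v in w)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, (df_between, df_within)
```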
Precision of Reticulocyte Values
To test the hypothesis that the precision of reticulocyte
measurement increases with the reticulocyte level, we first
determined the median value of each of the 210 blood
samples across the 11 instruments. We then designated each
blood sample as having a low (n = 66; range, <1.3%),
medium (n = 77; range, 1.3%-2.0%), or high (n = 67; range,
>2.0%) reticulocyte percentage. Because SD is the most
useful measure of precision, we calculated the within-sample
SD (ie, between duplicate readings) for each sample across
all instruments. This analysis was performed on the raw data
set and on the data after they had been transformed using log
and square root transformations.
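The classification and per-sample precision calculation described above can be sketched as follows. Assumptions: a median of exactly 1.3% or 2.0% is treated as medium (the text gives the medium range as 1.3%-2.0%), and "log" means the natural logarithm, since the base is not stated. For a single duplicate pair the SD reduces to |x1 − x2|/√2. Both function names are hypothetical.

```python
import math
from statistics import median


def reticulocyte_level(readings_across_instruments):
    """Classify a sample by its median % across instruments (cutoffs from the text)."""
    m = median(readings_across_instruments)
    if m < 1.3:
        return "low"
    if m <= 2.0:
        return "medium"
    return "high"


def duplicate_sd_by_scale(first, second):
    """Within-sample SD of one duplicate pair, |x1 - x2| / sqrt(2), on 3 scales."""
    scales = {"raw": lambda x: x, "log": math.log, "sqrt": math.sqrt}
    return {name: abs(f(first) - f(second)) / math.sqrt(2) for name, f in scales.items()}
```

Averaging `duplicate_sd_by_scale` results within each level, and then across instruments, gives the kind of summary reported in Table 2.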
Quantifying Instrument Bias
Based on the research used to derive the OFF-hr model,
when an ADVIA 120 machine is used to measure reticulocytes in a cohort of male endurance athletes residing at sea
level, the (square root–transformed) reticulocyte mean will
be 1.12 (SD, 0.165, based on n = 192 elite male endurance
athletes6). However, because it is highly improbable that any
instrument will be identical to the “average” ADVIA 120
used to derive these values (data were collected using 13
ADVIA 120 machines), the reported sample mean will vary
with the bias of the particular instrument used, relative to the
expected mean of 1.12. Therefore, comparing the reported
mean value derived from the sample group of athletes with
the value 1.12 enables the technician to quantify instrument
bias. The precision of the estimate of the sample mean will
be influenced by the number of athletes tested and by the SD
for the sampled population. We define a group of statistically
sufficient size comprising elite male endurance athletes
tested at sea level as a “samplator.”
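A minimal sketch of the samplator bias estimate, assuming the candidate instrument's raw reticulocyte percentages for the whole samplator are available as a list. It returns the difference between the sample mean of square root values and the expected 1.12, together with the SE of that mean (SD/√n), which expresses how precisely the bias is estimated. The names are hypothetical.

```python
import math
from statistics import mean, stdev

# Expected sqrt-retic mean for sea-level elite male endurance athletes on an ADVIA 120
EXPECTED_SQRT_MEAN = 1.12


def samplator_bias(retic_percent):
    """Return (bias, standard error) of the samplator mean on the sqrt scale.

    retic_percent: reticulocyte % for each athlete in the samplator,
    all measured on the candidate instrument.
    """
    roots = [math.sqrt(r) for r in retic_percent]
    bias = mean(roots) - EXPECTED_SQRT_MEAN
    se = stdev(roots) / math.sqrt(len(roots))
    return bias, se
```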
An important caveat is the requirement that the SD of
the samplator be comparable with the SD of the 192 elite
male endurance athletes from which the expected values
were derived. This is crucial because a SD beyond the
expected 95% interval for a given sample size would reduce
confidence that the samplator was comparable to the standard, disease- and drug-free population used in the original
study. We recommend a preliminary empiric evaluation of a
potential samplator data set, eg, via a dot plot, so that the
technician can be satisfied the general distribution of square
root reticulocytes is typical (ie, not bimodal or including
unusual outliers). Having satisfied this preliminary screen,
descriptive statistics should be calculated.
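The text does not state how the expected 95% interval for the samplator SD is computed. One standard construction, assumed here, treats (n − 1)s²/σ² as chi-square with n − 1 degrees of freedom for a normal population with σ = 0.165. To avoid a SciPy dependency, the sketch approximates chi-square quantiles with the Wilson–Hilferty cube-root transformation. For n = 54 it gives limits of about 0.134 and 0.196, which match the limits quoted in the Results. The function name is hypothetical.

```python
from statistics import NormalDist


def expected_sd_interval(n, sigma=0.165, level=0.95):
    """Central interval for the SD of n values from a normal population with SD sigma.

    Uses (n-1) s^2 / sigma^2 ~ chi-square(n-1); chi-square quantiles are
    approximated with the Wilson-Hilferty transformation.
    """
    if n < 2:
        raise ValueError("need at least 2 values")
    df = n - 1
    z = NormalDist().inv_cdf(0.5 + level / 2)
    a = 2.0 / (9.0 * df)

    def chi2_quantile(zq):
        # Wilson-Hilferty: chi2_p ~ df * (1 - a + z_p * sqrt(a))^3
        return df * (1.0 - a + zq * a ** 0.5) ** 3

    lower = sigma * (chi2_quantile(-z) / df) ** 0.5
    upper = sigma * (chi2_quantile(z) / df) ** 0.5
    return lower, upper
```

For n = 43 the lower limit is about 0.130. Under this construction, the sea-level cohort of that size in Table 3 (SD, 0.124) therefore falls below the expected interval.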
There is no plausible reason why a samplator population
should have a lower than expected SD for square root reticulocytes; therefore, if the calculated SD is less than the lower
limit, the data should be discarded and the process repeated. If,
however, the SD of the samplator exceeds the upper limit,
outliers should be removed to ascertain whether this brings the
samplator SD back within the expected range. Outliers should
be removed via the box plot criteria: square root reticulocyte
values that lie more than 1.5 times the interquartile range
below the first quartile or above the third quartile are excluded from the sample.7
If this does not reduce the SD to within the acceptable range,
the data should be discarded and the process repeated. Inability
to obtain a samplator with an SD within the expected range
points to a possible fundamental anomaly in the instrument.
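The box plot criterion can be sketched as follows. The quartile convention is an assumption: the text does not specify one, and this sketch uses the default ("exclusive") method of Python's `statistics.quantiles`. Other conventions can shift borderline cases. The function name is hypothetical.

```python
from statistics import quantiles


def remove_boxplot_outliers(sqrt_retics, k=1.5):
    """Drop values more than k * IQR below Q1 or above Q3 (square root scale)."""
    q1, _, q3 = quantiles(sqrt_retics, n=4)  # default 'exclusive' quartile method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in sqrt_retics if low <= v <= high]
```

After removal, the SD would be recomputed and checked again against the expected interval, as the procedure above requires.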
To demonstrate the adherence of actual data to this
theory, data collected from competitors by the International
Ski Federation during the 2001-2002 World Cup season were
evaluated. Samples from all athletes were tested on a single
machine (Sysmex R-500, Sysmex, Kobe, Japan); however, in
many cases, fewer than 50 athletes were tested on a single
day. Therefore, to provide meaningful sample sizes, when
necessary, data obtained during consecutive days from
athletes at the same competition venue were combined to
give sample sizes near 50 (2 events with 73 and 103 samples
were not altered). Means and SDs of square root–transformed
reticulocyte percentages were compared with the expected
mean ± SD value of 1.12 ± 0.165 derived from 192 elite male
endurance athletes measured at sea level on the ADVIA 120
platform.6 We calculated the impact on OFF-hr model scores
by multiplying the instrument bias by –60 (because OFF-hr =
hemoglobin value – 60√reticulocyte value, a lower reported
mean would increase notional OFF-hr model scores).
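The conversion from instrument bias to OFF-hr score follows directly from the formula: an uncorrected bias b on the square root scale shifts the score by −60b. The sketch below reproduces that conversion and also shows one way to subtract the bias before scoring. Both helper names are hypothetical, and hemoglobin is assumed to be in g/L.

```python
import math


def off_hr_delta(bias_sqrt):
    """Change in notional OFF-hr score caused by an uncorrected sqrt-scale bias."""
    return -60.0 * bias_sqrt


def adjusted_off_hr(hemoglobin_g_per_l, retic_percent, bias_sqrt):
    """OFF-hr score after removing the samplator bias from sqrt(retic)."""
    # ADVIA 120-equivalent value = candidate sqrt value minus estimated bias
    return hemoglobin_g_per_l - 60.0 * (math.sqrt(retic_percent) - bias_sqrt)
```

For example, a bias of −0.036 corresponds to a score change of about +2.2 units, and a bias of +0.029 to about −1.7 units.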
Results
Comparison Between Instruments
The Bland and Altman analysis, plotting the difference
between samples analyzed on 2 platforms against the average
value of the respective methods, showed that the ADVIA 120
platform typically reported a higher reticulocyte value than
the candidate instruments we evaluated, whether the candidate instrument relied on fluorescence ❚Figure 1❚ or nonfluorescence ❚Figure 2❚ to detect reticulocytes. The single exception to this tendency was the FACSCount platform, which
read higher than the ADVIA 120 in our hands. Candidate
instrument–ADVIA 120 agreement, as reflected by the SD of
the differences between machines, demonstrated that the
CD3200, CD4000, Sysmex XE2100, and FACSCount
showed most consistent agreement with the ADVIA 120
machine in our hands.
With regard to candidate instrument–candidate instrument comparisons, the CD4000–Sysmex XE2100–FACSCount triumvirate showed notable intermethod agreement,
and results derived from within this subgroup had the
smallest SDs of any comparisons (Figure 1). Furthermore,
whereas the FACSCount gave higher reticulocyte values on
average than the CD4000 or Sysmex XE2100, the latter 2
machines showed virtually no intermethod bias.

❚Figure 1❚ Comparison of estimates of (square root of) reticulocytes in ~100 samples for instruments incorporating fluorescent
reticulocyte enumeration (except the ADVIA 120, which uses absorbance). The plots are according to the Bland and Altman
method,5 including mean difference (broken line) and standard deviation of the differences (upper left corner of graph) for each
separate comparison. For proprietary information, see the text.
[Figure panels: ADVIA v CD4000, ADVIA v FACS, ADVIA v Sysmex 2100, ADVIA v ABX Pentra 120, CD4000 v FACS, CD4000 v Sysmex 2100, CD4000 v ABX Pentra 120, FACS v Sysmex 2100, FACS v ABX Pentra 120, Sysmex 2100 v ABX Pentra 120. Each panel plots the difference (first instrument minus second) against the average of the 2 methods (square root of percentage).]
Precision of Different Instruments
Tests for between-week variation in excess of that
accounted for by within-week variation gave statistically
significant results for the ADVIA 120 and ABX Pentra
120 instruments. Time series plots (not shown) confirmed
that the ADVIA 120 was noticeably aberrant during weeks
34 to 39, and the laboratory confirmed that this instrument
was recalibrated around week 40. Primarily because the
ADVIA 120 is the instrument of central interest, data from
it for weeks 34 to 39 were omitted from all subsequent
analyses. For the ABX Pentra 120, the problem was much
greater, with erratic behavior observed throughout the
study period.
The between-duplicates SD (of square root reticulocyte
percentages) was considerably larger for the ABX Pentra 120
(SD = 0.081) than for most of the other instruments for which
values ranged from 0.028 (CD4000) to 0.050 (ADVIA 120).
The 2 exceptions were the Bayer H*3 (SD = 0.057) and the
Beckman Coulter STKS (SD = 0.076). Values for the remaining
instruments were as follows: 0.033 (Sysmex XE2100), 0.036
(CD3500), 0.041 (FACSCount), 0.044 (CD3700), 0.046
(CD3200), and 0.048 (Beckman Coulter GENS).
❚Figure 2❚ Comparison of estimates of (square root of) reticulocytes in ~100 samples for instruments incorporating
nonfluorescent reticulocyte enumeration (the ADVIA 120 utilizes absorbance). The plots are according to the Bland and Altman
method,5 including mean difference (broken line) and standard deviation of the differences (upper left corner of graph) for each
separate comparison. For proprietary information, see the text.
[Figure panels: ADVIA v CD3200, ADVIA v H*3, ADVIA v STKS, ADVIA v GEN-S, CD3200 v H*3, CD3200 v STKS, CD3200 v GEN-S, H*3 v STKS, H*3 v GEN-S, STKS v GEN-S. Each panel plots the difference (first instrument minus second) against the average of the 2 methods (square root of percentage).]

Precision of Reticulocyte Values at Low, Medium, and
High Reticulocyte Percentages
In ❚Table 2❚, the SDs of duplicate readings for the same
blood sample are reported using 3 approaches to presenting reticulocyte data. The average values across all 11 instruments
demonstrated that the SDs of log-transformed reticulocyte
percentages decreased from low to medium to high values
(Table 2). This tendency was uniform, with few exceptions (1
value each for instruments 2, 4, 8, and 10 did not adhere to this
trend). Raw data showed the opposite trend: the SD between
duplicates for high values was about double that for low values.
SDs of square root–transformed values were consistent across
the 3 levels. For illustrative purposes, coefficients of variation
(CVs) were calculated using raw reticulocyte percentages,
yielding values of 9.1%, 7.6%, and 5.9% for low, medium,
and high values, respectively.
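As a rough consistency check, the sketch below applies a first-order (delta method) approximation to the Table 2 averages. This check is our illustration and was not an analysis reported in the study. On the square root scale the approximation is SD ≈ SD_raw/(2√μ); assuming natural logarithms, on the log scale it is SD ≈ SD_raw/μ, where μ is the category mean. The inputs are the average raw SDs and category means from Table 2.

```python
import math

# Average raw-scale duplicate SDs and category means, taken from Table 2
RAW_SD = {"low": 0.095, "medium": 0.125, "high": 0.170}
MEAN_PERCENT = {"low": 1.08, "medium": 1.66, "high": 2.97}
REPORTED_SQRT_SD = {"low": 0.045, "medium": 0.049, "high": 0.051}
REPORTED_LOG_SD = {"low": 0.098, "medium": 0.080, "high": 0.066}


def approx_sd_sqrt(raw_sd, mu):
    """Delta method: d(sqrt x)/dx = 1 / (2 sqrt x), evaluated at the mean."""
    return raw_sd / (2 * math.sqrt(mu))


def approx_sd_log(raw_sd, mu):
    """Delta method for the natural log (base assumed): d(ln x)/dx = 1 / x."""
    return raw_sd / mu


for level in ("low", "medium", "high"):
    mu, sd = MEAN_PERCENT[level], RAW_SD[level]
    print(f"{level:>6}: sqrt {approx_sd_sqrt(sd, mu):.3f} "
          f"(reported {REPORTED_SQRT_SD[level]:.3f}), "
          f"log {approx_sd_log(sd, mu):.3f} (reported {REPORTED_LOG_SD[level]:.3f})")
```

The approximations (about 0.046, 0.049, and 0.049 on the square root scale, and 0.088, 0.075, and 0.057 on the log scale) follow the reported pattern. This is consistent with raw duplicate SDs that grow roughly with the square root of the reticulocyte level, which the square root transformation removes. Because the inputs are averages across instruments, this is only an approximation and not a formal test.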
Samplator Data
When results obtained from groups of athletes tested at
a moderate altitude (~1,600 m) were compared with values
derived from groups tested at sea level, the mean value of
square root–transformed reticulocyte percentages for the
altitude groups was, on average, 0.05 units higher (P = .017).
The SDs of 2 cohorts measured at altitude and 1 cohort
measured at sea level were found to lie outside the expected
95% interval for the respective sample sizes (Figure 3A).
Only 1 of these groups (n = 103) was found to
contain outliers, and, in this case, the adjustment reduced the
SD to within the expected range. For a sample of 54 subjects,
the lower and upper limits of the 95% interval are 0.134 and
0.196, respectively.
❚Table 3❚ shows the difference between the reported
mean value for square root–transformed reticulocyte percentages and the expected value of 1.12 (the gap between the
data points and the solid line, ❚Figure 3B❚). This difference is
an estimate of the bias associated with this particular instrument and would be the recommended adjustment to allow
for candidate instrument–ADVIA 120 bias had any one of
these groups been used as the samplator. Depending on
which group was used, results from the Sysmex R-500
instrument would have influenced OFF-hr model scores by
–1.7 to 4.8 units (Table 3).

❚Table 2❚
Within-Sample SD (Between Duplicate Readings) for Samples Classified as Having Low, Medium, or High Reticulocyte Percentages*

Instrument No. | Log Low | Log Medium | Log High | Raw Low | Raw Medium | Raw High | Square Root Low | Square Root Medium | Square Root High
1 | 0.077 | 0.076 | 0.062 | 0.096 | 0.133 | 0.195 | 0.042 | 0.049 | 0.054
2 | 0.114 | 0.057 | 0.061 | 0.099 | 0.094 | 0.174 | 0.050 | 0.036 | 0.051
3 | 0.060 | 0.059 | 0.048 | 0.053 | 0.096 | 0.141 | 0.027 | 0.037 | 0.040
4 | 0.092 | 0.051 | 0.067 | 0.082 | 0.083 | 0.192 | 0.042 | 0.032 | 0.056
5 | 0.053 | 0.047 | 0.034 | 0.059 | 0.080 | 0.104 | 0.027 | 0.030 | 0.029
6 | 0.074 | 0.047 | 0.043 | 0.138 | 0.109 | 0.149 | 0.049 | 0.035 | 0.039
7 | 0.154 | 0.106 | 0.099 | 0.088 | 0.133 | 0.167 | 0.054 | 0.057 | 0.061
8 | 0.120 | 0.121 | 0.121 | 0.156 | 0.197 | 0.285 | 0.062 | 0.074 | 0.091
9 | 0.062 | 0.056 | 0.042 | 0.064 | 0.084 | 0.117 | 0.031 | 0.034 | 0.035
10 | 0.125 | 0.165 | 0.103 | 0.132 | 0.243 | 0.214 | 0.063 | 0.097 | 0.073
11 | 0.150 | 0.096 | 0.044 | 0.082 | 0.125 | 0.127 | 0.050 | 0.054 | 0.037
Average SD | 0.098 | 0.080 | 0.066 | 0.095 | 0.125 | 0.170 | 0.045 | 0.049 | 0.051

* Low, <1.3%, mean, 1.08%; medium, 1.3%-2.0%, mean, 1.66%; high, >2%, mean, 2.97%; raw, SD of the values as percentages; log and square root, SD when a log or square root transformation, respectively, is taken of raw readings before calculating SD. Instrument numbers correspond with the instrument numbers in Table 1.
Discussion
Measurement of Reticulocyte Counts
Reticulocytes are transitional RBCs, between nucleated
RBCs and mature RBCs, which contain some stainable
remnant messenger RNA. With the exception of the
FACSCount flow cytometer, which separates reticulocytes
from mature RBCs using a “gate,” all analyzers evaluated in
the present study use a cluster analysis technique to identify,
classify, and separate RBCs from reticulocytes. Separation is
complicated because messenger RNA represents a transient
quantity that degrades within about 3 days after the reticulocyte is released into circulation from the bone marrow.
Therefore, there is a continuum between reticulocytes and
mature cells rather than a separate population of each. What
is classified as a reticulocyte is somewhat arbitrary, and the
algorithms used by manufacturers to classify reticulocytes
typically are proprietary information and unavailable for
general scrutiny. What is clear is that each algorithm uses a
somewhat different approach to establish the cutoff point
between reticulocytes and other blood cells.
As a preliminary step to ascertain whether it is feasible
to interchange reticulocyte results from different platforms,
it is insightful to evaluate the general agreement between
different methods. In our hands, the intermethod comparison plots revealed that the CellDyn 3200, CellDyn 4000,
FACSCount, and Sysmex XE2100 demonstrated sufficiently consistent results with the ADVIA 120 to recommend them as candidate platforms whose results
potentially could be interchanged with ADVIA 120 data.
The criteria applied during this subjective evaluation were
primarily the uniformity of agreement (reflected by a low SD of differences) and the absence of large disparities between samples
measured simultaneously on the candidate instrument and
an ADVIA 120. However, an important secondary consideration was the consistency of the intermethod agreement
among the candidate instruments themselves. As depicted
in Figure 1, the grouping of intermethod comparison values
among the CellDyn 4000, FACSCount, and Sysmex
XE2100 platforms was notably tighter than in any other
candidate instrument–candidate instrument comparison.
Furthermore, each of these platforms uses fluorescent
stains, which are known to demonstrate enhanced sensitivity in the identification of low reticulocyte numbers
compared with supravital stains.8 This further favors
their use in an antidoping setting in which low reticulocyte
values are of primary interest.

❚Table 3❚
Sample SD and Mean for Square Root–Transformed Reticulocyte Percentages of Groups of Elite Male Cross-Country Skiers Tested During the 2001-2002 International Ski Federation World Cup Season Using a Sysmex R-500*

Location | No. of Samples | Collection Date | SD | Mean | Mean – Expected | ∆OFF-hr Model Score
Sea level | 48 | November 23 | 0.158 | 1.084 | –0.036 | 2.2
Sea level | 61 | November 24 | 0.168 | 1.060 | –0.060 | 3.6
Sea level | 43 | November 25 | 0.124 | 1.046 | –0.074 | 4.4
Sea level | 39 | November 26-27 | 0.154 | 1.046 | –0.074 | 4.4
Sea level | 34 | March 9-10 | 0.127 | 1.096 | –0.024 | 1.4
Sea level | 32 | March 15-16 | 0.179 | 1.040 | –0.080 | 4.8
Altitude (~1,600 m) | 103 | December 14 | 0.191 | 1.149 | +0.029 | –1.7
Altitude (~1,600 m) | (100)† | December 14 | 0.169 | 1.142 | +0.022 | –1.3
Altitude (~1,600 m) | 48 | December 15-16 | 0.208 | 1.132 | +0.012 | –0.7
Altitude (~1,600 m) | 73 | January 4 | 0.182 | 1.116 | –0.004 | 0.2
Altitude (~1,600 m) | 52 | January 5-6 | 0.170 | 1.072 | –0.048 | 2.9
Altitude (~1,600 m) | 52 | January 8 | 0.171 | 1.095 | –0.025 | 1.5

* Mean – expected, difference between the reported mean and 1.12 (the estimated mean for elite male endurance athletes tested at sea level on an ADVIA 120 platform); ∆OFF-hr model score (hemoglobin value – 60√reticulocyte value), difference between the Sysmex-derived score and the value expected had the sample been measured on an ADVIA 120 platform. Sysmex R-500, Sysmex, Kobe, Japan; ADVIA 120, Bayer, Tarrytown, NY.
† Values derived after 3 outliers are removed.

❚Figure 3❚ Sample standard deviations (A) and means (B) of square-root reticulocytes for cohorts of elite male cross-country
skiers tested during the 2001-2002 International Ski Federation World Cup season. Solid lines depict the respective values (SD,
0.165; mean, 1.12) derived from a sample of 192 elite male endurance athletes tested at sea level on the ADVIA
platform.6 Broken lines depict the upper and lower limits of the expected 95% interval for different sample sizes. Circles
depict groupings of athletes tested at sea level; triangles depict groupings tested at altitude (~1,600 m above sea level); encircled data points represent the group (n = 103)
before and after removal of outliers.
[Panel A: sample SD (square root of percent reticulocytes) vs sample size. Panel B: sample mean (square root of percent reticulocytes) vs sample size.]
Adjusting for Instrument Bias
It has been assumed that the manufacturing process and
standardization of how each instrument model enumerates
reticulocytes ensure that non–ADVIA 120/ADVIA 120
agreement is relatively reproducible across all machines of a
given model (as distinct from bias due to errors in calibration).
Having established that a particular instrument model shows
sufficient generic agreement with the ADVIA 120 platform
to be considered a potential substitute, it is necessary to
adjust for instrument-specific bias that might arise from
errors in calibration (OFF-hr model thresholds were set
based on the significance of deviations from the population
mean value; therefore, instrument bias has the potential to
invalidate the rationale for these thresholds). Several options
exist, including development of regression equations and
adjustments based on means and SDs of samples measured
on both platforms. However, adjustment using such generic
approaches relies on the presumption that the randomly
selected candidate machine is calibrated identically to the
“sister” instrument on which the adjustment factor was
derived. Because reticulocyte calibration typically is factory-set and problematic for an independent technician to verify,
this assumption seems contentious.
By using the samplator concept, bias between the
specific machine and the ADVIA 120 platform can be established, provided that a sufficiently large number of subjects
are tested to be confident that the resulting sample mean will
closely approximate the mean of the population from which
the sample was drawn. Because a sample population of male
endurance athletes is readily available to antidoping authorities (virtually all blood testing is carried out on athletes
competing in endurance events), this approach provides a
convenient and objective means of adjusting for interinstrument bias, despite the absence of a universal calibration
material. A paper calibration for this difference (correcting reported values arithmetically rather than recalibrating the instrument) will enable
the adjusted value to be substituted into the OFF-hr model
without compromising the model’s integrity.
Statistical theory shows that the SE of the sample mean
is SD/√n, where n is the sample size and SD is the population SD. Because it is not tenable to adjust reticulocyte
percentages by less than 0.1 (values generally are reported to
only 1 decimal place), and a 0.1 difference on the percentage
scale translates into a difference of about 0.045 on the square
root scale (for values around the expected mean value of
1.12), it would not be sensible to estimate the samplator
mean to a greater accuracy than 0.045. Because 95% confidence intervals are given approximately by the mean ± twice the SE, a
samplator of 54 athletes with the expected SD of 0.165
is the largest group needed for a sufficiently precise estimate of
the population mean, since 2 × (0.165/√54) = 0.045. Therefore, if 54 (or thereabouts) male endurance athletes were
tested at sea level using a candidate machine, any difference
between the group mean reported by the candidate machine
and the expected mean of 1.12 (ie, the expected value if the
athletes’ samples had been measured on an average ADVIA
120) represents the candidate instrument–ADVIA 120 bias.
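The sample-size argument above can be checked numerically. The sketch solves multiplier × SD/√n ≤ half-width for the smallest n, using the text's factor of 2 in place of 1.96, and computes the first-order size of a 0.1 percentage-point step on the square root scale near √x = 1.12, where d√x/dx = 1/(2√x). The function names are hypothetical.

```python
import math


def samplator_size(sd=0.165, half_width=0.045, multiplier=2.0):
    """Smallest n such that multiplier * sd / sqrt(n) <= half_width."""
    return math.ceil((multiplier * sd / half_width) ** 2)


def sqrt_scale_step(percent_step=0.1, sqrt_mean=1.12):
    """First-order size of a percent_step change on the sqrt scale near sqrt_mean."""
    return percent_step / (2 * sqrt_mean)
```

With the defaults, `samplator_size()` returns 54 and `sqrt_scale_step()` returns about 0.0446. Using 1.96 instead of 2 as the multiplier would give 52.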
A retrospective evaluation of data collected by the International Ski Federation under field conditions (the analyzer typically was relocated to each competition venue) revealed that
samples of around 50 male endurance athletes yielded SDs that
were comparable with the expected SD of 0.165. However, the
reported mean value tended to be higher when blood samples
were analyzed in venues approximately 1,600 m above sea
level, which we speculate is attributable to the physiologic
effect of altitude on reticulocyte production (the 0.05 difference
was comparable to the 0.08 difference found for altitudes of
1,730-2,220 m).6 This underlines the importance of establishing instrument bias using blood samples collected from
athletes at sea level, to ensure that the reported mean is directly
comparable with the ADVIA 120–derived value of 1.12 appropriate for male endurance athletes at sea level.
Depending on which sea level data were chosen, the
instrument-specific bias as indicated by the samplator
approach would have yielded a bias between 0.02 and 0.08
(on the square root scale) for the Sysmex R-500, which
would have resulted in OFF-hr scores being altered by 1 and
5 units, respectively. To put this in context, a male endurance
athlete will have an OFF-hr model score of approximately
80, with 1 in 10 athletes exceeding a score of 104.6 and 1 in
1,000 athletes exceeding a score of 125.6.3 It is noteworthy
that a change in the OFF-hr model score of 1 to 5 units also
could be the result of a bias of 0.5 g/dL or less (≤5 g/L) in
the hemoglobin assay. According to the most recent College
of American Pathologists survey results,9 the SD of hemoglobin measurements on any given sample typically is
between 0.1 and 0.3 g/dL (1.0 and 3.0 g/L). This would give
a 2 × SD range of ± 0.2 to 0.6 g/dL (2-6 g/L). When
contrasted with these values, our results suggest that were
the samplator concept used to quantify instrument bias, the
magnitude of any subsequent error contributed by the reticulocyte component of the OFF-hr model would be comparable to that associated with the measurement of the hemoglobin concentration.
Convenience dictates that the collection of blood
samples from sufficient numbers of subjects to constitute a
reasonable samplator be undertaken infrequently. To
correctly emulate the approach described herein, blood
samples should be obtained and analyzed at a single, sea
level location, ideally on the same day (or on 2 consecutive
days). It is acceptable to use sample sizes in excess of 54,
and technicians must not subjectively eliminate data points
or select subsets of available data in an effort to achieve
favorable results. We propose that the samplator be run a
minimum of once yearly on candidate machines, while
simultaneously obtaining quality control values from instrument-specific controls as the samplator is run. This enables
the technician to use the samplator to establish candidate
instrument–ADVIA 120 bias, while in parallel using instrument-specific reticulocyte controls to monitor instrument
drift over time. An additional samplator would need to be
run only if values drifted beyond a designated range. For
heightened control sensitivity, each laboratory should establish its own mean and acceptable range instead of using the
generic reference range provided by the control lot manufacturer (which typically has limits that are appropriate in a
clinical setting but not for the heightened sensitivity desirable in an antidoping scenario).
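A laboratory-specific control range might be monitored as sketched below. The text does not specify how the range should be set. This sketch assumes a conventional mean ± 2 SD rule computed from the laboratory's own baseline control readings. The function names and the readings are hypothetical.

```python
import statistics


def control_limits(baseline: list[float], k: float = 2.0) -> tuple[float, float]:
    """Laboratory-specific acceptance range: mean +/- k SD of baseline control runs (assumed rule)."""
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)
    return mean - k * sd, mean + k * sd


def samplator_due(control_value: float, limits: tuple[float, float]) -> bool:
    """True when a control reading falls outside the laboratory's own range."""
    low, high = limits
    return not (low <= control_value <= high)


# Hypothetical control readings (reticulocyte %) collected while the samplator was run
baseline = [1.50, 1.52, 1.48, 1.51, 1.49]
limits = control_limits(baseline)
print(samplator_due(1.51, limits))  # False
print(samplator_due(1.60, limits))  # True
```

A range derived this way is usually much narrower than the manufacturer's generic limits. The narrower range is what gives the heightened sensitivity described above.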
Interpretation and Precision of Reticulocyte Data
Whether the reticulocyte count is reported as a
percentage of all RBCs (percent reticulocytes) or as a number
per volume of blood (× 10³/µL [× 10⁹/L]), the distribution of
values in the general population is skewed, with an extended
tail reflecting people with high reticulocyte values (associated
with numerous causes, including environmental, genetic,
and/or pathologic circumstances). Previous empiric evaluation of data sets has demonstrated that a square root transformation of the reticulocyte percentages leads to (almost)
constant variance and values that are close to being normally
distributed,3 and this was borne out by our present results in
which SD values of duplicates were consistent regardless of
the reticulocyte percentage only if square roots were
analyzed. However, an appraisal of the counting technique
used by flow cytometers yields an additional rationale for the
appropriateness of the square root transformation. Assuming
an accurate count of the number of reticulocytes among a
fixed number of RBCs (eg, 35,000 cells, which is approximately the number of cells counted by contemporary hematology analyzers), the distribution of the count from sample to
© American Society for Clinical Pathology
sample is likely to be close to binomial. When there is a large sample (35,000 is very large) and
a small probability that any 1 cell is a reticulocyte (up to 3%
or 4% of a population is small in this context), the binomial
distribution is closely approximated by a Poisson distribution.
For a Poisson distribution, the mean is equal to the variance,
and statistical theory shows that taking square roots will lead
to (almost) constant variance (which implies constant SD)
and values that are close to being normally distributed.
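The variance-stabilizing effect can be seen with the delta method, SD(√X) ≈ SD(X) / (2√E[X]). The sketch assumes that counting error alone drives the variation and that about 35,000 cells are counted. It therefore describes only the counting component of imprecision, not all sources of error. The function names are hypothetical.

```python
import math

CELLS_COUNTED = 35_000  # approximate number of RBCs counted, as stated in the text


def sd_raw_percent(p: float, n: int = CELLS_COUNTED) -> float:
    """Binomial SD of the reticulocyte percentage 100*K/n, where K ~ Binomial(n, p)."""
    return 100 * math.sqrt(p * (1 - p) / n)


def sd_sqrt_percent(p: float, n: int = CELLS_COUNTED) -> float:
    """Delta-method SD of sqrt(100*K/n): SD(X) / (2 * sqrt(E[X]))."""
    return sd_raw_percent(p, n) / (2 * math.sqrt(100 * p))


for p in (0.005, 0.01, 0.02, 0.04):
    print(f"{100 * p:.1f}%  raw SD={sd_raw_percent(p):.4f}  sqrt-scale SD={sd_sqrt_percent(p):.4f}")
```

Between 0.5% and 4% reticulocytes, the raw SD nearly triples (about 0.038 to 0.105). Over the same range, the square root–scale SD stays close to 0.027.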
The legitimacy of square root transformations for
percentage reticulocyte data has implications at several
levels. First, it demonstrates unequivocally that in an antidoping setting, in which authorities seek to recognize a value
that deviates substantially from an expected normal value,
optimal resolution is provided when the units evaluated are
the square root of the percentage of reticulocytes. While
initially this novel approach might entail some mental
gymnastics for practitioners accustomed to evaluating reticulocytes as an absolute count or a percentage, it demands no
greater dexterity than required to alternate between absolute
counts and percentage values, and once the data have been
transformed, the interpretation of results is straightforward.
Second, our results refute the belief that “the measurement precision associated with low reticulocyte counts is
inferior to the precision associated with higher reticulocyte
counts.” Although precision typically is reported as a CV, a
more useful measure of precision is the SD. Our results
demonstrated that SD is reasonably constant between low
and high reticulocyte values after they have been square root
transformed, revealing that the precision of low reticulocyte
values is not inferior to the precision of high values (the
generality of this observation was demonstrated by the
consistency of between-duplicates SD, even across different
platforms). The belief that precision is reduced at low
reticulocyte levels stems from the universally reported
finding that CVs increase as reticulocyte values decrease.
CVs are most useful for data for which a log transformation
is appropriate, because under these circumstances, the CV is
expected to be constant regardless of the reticulocyte
percentage.10 However, if, as shown by our data, a log transformation is not appropriate for reticulocyte percentages,
then the CV also is inappropriate. The reason that lower CVs
(and a correspondingly false indication of precision) typically are reported for higher reticulocyte counts is that, even
though the SD tends to increase as the mean increases, it increases more slowly than the mean (for approximately Poisson-distributed counts, in proportion to the square root of the mean), so dividing the larger SD by its larger mean generates a smaller
CV for reticulocyte data.
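A short calculation shows how a constant SD on the square root scale turns into a falling CV on the raw scale. The sketch assumes a fixed square root–scale SD of 0.05, the between-duplicate value quoted for the ADVIA 120s. It back-transforms with the delta method, SD(X) ≈ 2√X × SD(√X). The helper names are hypothetical.

```python
import math

SQRT_SCALE_SD = 0.05  # assumed constant SD of sqrt(reticulocyte %) readings


def raw_sd(retic_percent: float, s: float = SQRT_SCALE_SD) -> float:
    """Approximate SD on the raw percentage scale (delta method)."""
    return 2 * math.sqrt(retic_percent) * s


def cv_percent(retic_percent: float, s: float = SQRT_SCALE_SD) -> float:
    """Coefficient of variation (%) implied by that raw-scale SD."""
    return 100 * raw_sd(retic_percent, s) / retic_percent


for r in (0.5, 1.0, 2.0, 4.0):
    print(f"{r:4.1f}%  SD={raw_sd(r):.3f}  CV={cv_percent(r):.1f}%")
```

The CV falls from about 14% at 0.5% reticulocytes to 5% at 4%. The precision on the square root scale is identical at every level.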
Substituting Non–ADVIA 120 Data Into Blood Models
To establish whether a candidate instrument can be
designated as adequate for potential use in the context of
deriving valid OFF-hr model scores, we propose that the
following criteria be applied on a case-by-case basis
(including ADVIA 120 instruments as well) rather than
generically across a given instrument model. First, because
lack of access to another ADVIA 120 makes derivation of a
Bland-Altman comparison problematic for many technicians, an alternative approach is required to lend credence to
the presumption that what the candidate machine recognizes
as a reticulocyte is comparable and consistent with what the
ADVIA 120s (used in the OFF-hr model derivation) detected
as reticulocytes. This criterion can be satisfied by establishing that the SD of between-sample variation of the candidate machine lies between 0.134 and 0.196 (see
SDsamp, Appendix 1). Second, the precision of the candidate
machine should be equal (or superior) to the precision of the
ADVIA 120s used in the model derivation. This can be
accomplished by running samples (eg, the samplator blood
samples) in duplicate and confirming that the between-duplicate SD of the square root–transformed reticulocyte percentages (see SDdup, Appendix 1) is equal to or less than 0.05
(the value for the ADVIA 120s used in model derivation).
The imprecision of the ADVIA 120 platform is inherently
factored into the thresholds for the OFF-hr model; therefore,
provided that the candidate instrument’s precision is within
these bounds, any machine-specific variation should not
(unduly) influence the integrity of the model.
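The two acceptance criteria can be expressed as a simple check. The thresholds are the ones stated above. The class and function names are hypothetical. The sketch assumes that SDsamp and SDdup have already been computed as described in Appendix 1.

```python
from typing import NamedTuple

SD_SAMP_RANGE = (0.134, 0.196)  # between-sample SD on the square root scale
SD_DUP_MAX = 0.05               # between-duplicate SD of the ADVIA 120s in the model derivation


class CandidateAssessment(NamedTuple):
    comparable: bool  # between-sample SD consistent with ADVIA 120 reticulocyte recognition
    precise: bool     # duplicate SD no worse than the ADVIA 120s

    @property
    def acceptable(self) -> bool:
        return self.comparable and self.precise


def assess_candidate(sd_samp: float, sd_dup: float) -> CandidateAssessment:
    """Apply both criteria to one candidate instrument."""
    low, high = SD_SAMP_RANGE
    return CandidateAssessment(
        comparable=low <= sd_samp <= high,
        precise=sd_dup <= SD_DUP_MAX,
    )


print(assess_candidate(0.165, 0.04).acceptable)  # True
print(assess_candidate(0.120, 0.04))  # CandidateAssessment(comparable=False, precise=True)
```

The two criteria are reported separately because they point to different problems. A failed comparability check suggests a difference in what the instrument counts as a reticulocyte. A failed precision check points to instrument imprecision.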
We occasionally encountered readings that were divergent from the sample’s designated “true” value (ie, the
median value for that blood sample measured across multiple
instruments). Although such variation already is incorporated
into the OFF-hr model thresholds, which were derived using
a single measurement of each blood sample, when an
athlete’s OFF-hr score exceeds a nominal threshold, it seems
prudent to repeat the analysis to ratify the initial value. The
average of multiple replicates provides a more precise estimate of the sample’s true value and, therefore, enhances the
specificity of the OFF-hr model approach.
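The gain from repeating an analysis can be quantified if the replicate errors are independent with a common SD. Under that assumption, the standard error of their mean falls as 1/√k for k replicates. This sketch averages on the square root scale and assumes a per-reading SD of 0.05. Both are assumptions, and the function name is hypothetical.

```python
import math
import statistics

PER_READING_SD = 0.05  # assumed SD of one sqrt(reticulocyte %) reading


def replicate_estimate(retic_readings: list[float],
                       per_reading_sd: float = PER_READING_SD) -> tuple[float, float]:
    """Mean of replicate readings on the square root scale and its standard error.

    Assumes replicate errors are independent with a common SD.
    """
    roots = [math.sqrt(r) for r in retic_readings]
    return statistics.fmean(roots), per_reading_sd / math.sqrt(len(roots))


mean_root, se = replicate_estimate([1.21, 1.44, 1.30, 1.35])
```

Under these assumptions, four replicates halve the standard error, from 0.05 to 0.025.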
Summary
We have identified 3 key findings from this research.
First, results derived from several candidate platforms are
sufficiently comparable with the ADVIA 120 method to
allow values to be substituted into the OFF-hr model.
Second, there exists a robust approach to adjust for the bias
of a candidate instrument. Third, measurement precision
for low reticulocyte values is not inferior to that for normal
levels. These findings pave the way for antidoping authorities to harvest hematologic information from a host of
diverse sources. Because the OFF-hr model bestows a
reach-back of several weeks to detect previous rHuEPO
use, frequent blood testing and the commensurate fear of
detection this invokes should be a powerful deterrent
against athletes manipulating their blood to seek an illegal
performance advantage.
Appendix 1
Samplator Protocol: Recommended Approach to Quantify and Adjust for Instrument Bias
1. Obtain blood samples from at least 54 male endurance athletes at sea level at 1 location. All samples preferably should be obtained on the same day, but spread over 2 consecutive days at most. Reticulocyte percentages should be measured in duplicate.
2. Calculate the within-sample (between-duplicates) SD using the formula:
$$\mathrm{SD}_{\mathrm{dup}} = \sqrt{\frac{\sum_{i=1}^{n} (x_{i1} - x_{i2})^2}{2n}}$$
where $x_{i1}$ and $x_{i2}$ are the square roots of the duplicate reticulocyte percentage readings on the ith sample and n is the number of samples. Should the value of SDdup exceed 0.05, the instrument should be checked and the procedure repeated.
3. If SDdup is 0.05 or less, calculate for each sample $sr_i = (x_{i1} + x_{i2})/2$, the average of the duplicate (square root–transformed) reticulocyte values.
4. Calculate the between-samples SD:
$$\mathrm{SD}_{\mathrm{samp}} = \sqrt{\frac{\sum_{i=1}^{n} (sr_i - \overline{sr})^2}{n - 1}}$$
where $\overline{sr}$ is the mean of the $sr_i$ values.
5. If SDsamp is within the range 0.134 to 0.196, use the difference between $\overline{sr}$ and 1.12 as the correction for all subsequent (square root–transformed) percentage reticulocyte readings. For example, if $\overline{sr}$ = 1.084, then 0.036 (1.12 − 1.084) should be added to all subsequent (transformed) values (which will have the effect of reducing the OFF-hr model score [hemoglobin value − 60√reticulocyte value]), whereas if $\overline{sr}$ = 1.149, then 0.029 should be subtracted from all subsequent (transformed) values.
6. If SDsamp is less than 0.134, the instrument should be checked and the whole procedure repeated.
7. If SDsamp is more than 0.196, use the box plot criteria (on the n $sr_i$ values) to identify outliers in the sample. If outliers are identified and SDsamp is reduced to lie within the range (0.134-0.196) when the outliers are omitted, use the average of the remaining $sr_i$ values to determine the bias correction, as in step 5. If no outliers are found or if the removal of outliers does not result in an acceptable value for SDsamp, the instrument should be checked and the whole procedure repeated.
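The Appendix 1 protocol can be sketched end to end in Python. This is a sketch under stated assumptions, not the authors' software. The "box plot criteria" in step 7 are implemented here as Tukey fences at the quartiles ± 1.5 × IQR, using `statistics.quantiles` with its default method. Other box plot conventions (eg, hinge-based quartiles) can give slightly different fences. The names `RecheckInstrument`, `samplator_correction`, and `corrected_off_hr` are hypothetical.

```python
import math
import statistics

EXPECTED_MEAN = 1.12  # ADVIA 120 mean of sqrt(retic %), male endurance athletes at sea level
SD_DUP_MAX = 0.05
SD_SAMP_LOW, SD_SAMP_HIGH = 0.134, 0.196
MIN_SAMPLES = 54


class RecheckInstrument(Exception):
    """Protocol outcome: check the instrument and repeat the whole procedure."""


def sd_dup(pairs: list[tuple[float, float]]) -> float:
    """Step 2: within-sample SD from duplicate square root readings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs)))


def tukey_fences(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Box plot fences at the quartiles -/+ k * IQR (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def samplator_correction(readings: list[tuple[float, float]]) -> float:
    """Return the additive square root-scale correction (1.12 minus the mean sr).

    readings: duplicate reticulocyte percentages (%), one pair per athlete.
    """
    if len(readings) < MIN_SAMPLES:
        raise ValueError("the protocol calls for at least 54 athletes")
    # Steps 1-2: transform each duplicate reading and check precision.
    pairs = [(math.sqrt(r1), math.sqrt(r2)) for r1, r2 in readings]
    if sd_dup(pairs) > SD_DUP_MAX:
        raise RecheckInstrument("SDdup exceeds 0.05")
    # Step 3: average the duplicates for each sample.
    sr = [(a + b) / 2 for a, b in pairs]
    # Steps 4-7: between-samples SD, with outlier handling above the range.
    sd_samp = statistics.stdev(sr)
    if sd_samp < SD_SAMP_LOW:
        raise RecheckInstrument("SDsamp below 0.134")
    if sd_samp > SD_SAMP_HIGH:
        low, high = tukey_fences(sr)
        kept = [v for v in sr if low <= v <= high]
        if len(kept) == len(sr) or not (SD_SAMP_LOW <= statistics.stdev(kept) <= SD_SAMP_HIGH):
            raise RecheckInstrument("SDsamp outside 0.134-0.196 even after outlier removal")
        sr = kept
    # Step 5: the correction to add to later square root-transformed readings.
    return EXPECTED_MEAN - statistics.fmean(sr)


def corrected_off_hr(hb_g_per_l: float, retic_percent: float, correction: float) -> float:
    """OFF-hr score using the bias-corrected square root reticulocyte value."""
    return hb_g_per_l - 60 * (math.sqrt(retic_percent) + correction)
```

Consider a samplator whose duplicate-averaged values have a mean of 1.084. It returns a correction of about +0.036, matching the worked example in step 5. Adding that correction lowers every subsequent OFF-hr score by about 2 units.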
From the 1Science and Industry Against Blood Doping Research
Consortium, Gold Coast, and 2Department of Mathematics and
Statistics, the University of Melbourne, Melbourne, Australia;
3Copenhagen Muscle Research Centre, Copenhagen, Denmark;
and 4R&D Systems, Hematology Research, Minneapolis, MN.
Supported by the World Anti-Doping Agency, Montreal,
Canada.
Address reprint requests to Dr Ashenden: Science and
Industry Against Blood Doping (SIAB), Gold Coast QLD 4217,
Australia.
Acknowledgments: We thank Jennifer Bauer, Hematology
QC Manager, and the staff technologists at R&D Systems, and
Robin Parisotto for preliminary discussions concerning the
manuscript, and we acknowledge the United States Anti-Doping
Agency, whose conference precipitated the intellectual
discussions leading to this collaboration.
References
1. Peltre G, Holmann H. Validation of the Urine Test for
Recombinant Human Erythropoietin. Montreal, Canada: World
Anti-Doping Agency; 2003.
2. Parisotto R, Wu M, Ashenden MJ, et al. Detection of
recombinant human erythropoietin abuse in athletes utilising
markers of altered erythropoiesis. Haematologica. 2001;86:128-137.
3. Gore CJ, Parisotto R, Ashenden MJ, et al. Second-generation
blood tests to detect erythropoietin abuse by athletes.
Haematologica. 2003;88:333-344.
4. Buttarello M, Bulian P, Farina G, et al. Flow cytometer
reticulocyte counting. Am J Clin Pathol. 2001;115:100-111.
5. Bland J, Altman D. Statistical methods for assessing
agreement between two methods of clinical measurement.
Lancet. 1986;1:307-310.
6. Sharpe K, Hopkins W, Emslie K, et al. Development of
reference ranges in elite athletes for markers of altered
erythropoiesis. Haematologica. 2002;87:1248-1257.
7. Velleman PF, Hoaglin DC. Applications, Basics and Computing
of Exploratory Data Analysis. Boston, MA: Duxbury Press;
1981.
8. National Committee for Clinical Laboratory Standards.
Methods for Reticulocyte Counting (Flow Cytometry and
Supravital Dyes); Approved Guideline. H44-A. Wayne, PA:
NCCLS; 1997.
9. College of American Pathologists. 2003 Surveys: Participant
Summaries for Hematology Automated Differentials FH1-B,
FH2-B, FH3-B, FH4-B, FH6-B, FH8-B, FH9-B, FH10-B.
Northfield, IL: College of American Pathologists; 2003.
10. Armitage P, Berry G. Statistical Methods in Medical Research.
Oxford, England: Blackwell Scientific Publications; 1987.