Reliability of Computerized Neurocognitive Tests for Concussion Assessment: A Meta-Analysis
Objective: Although widely used, computerized neurocognitive tests (CNTs) have been criticized because of low reliability and poor sensitivity. A systematic review was published summarizing the reliability of Immediate Post-Concussion Assessment and Cognitive Testing (ImPACT) scores; however, it was limited to a single CNT. Expansion of the previous review to include additional CNTs and a meta-analysis is needed. Therefore, our purpose was to analyze reliability data for CNTs using meta-analysis and to examine moderating factors that may influence reliability.

Data Sources: A systematic literature search (key terms: reliability, computerized neurocognitive test, concussion) of electronic databases (MEDLINE, PubMed, Google Scholar, and SPORTDiscus) was conducted to identify relevant studies.

Study Selection: Studies were included if they met all of the following criteria: used a test-retest design, involved at least 1 CNT, provided sufficient statistical data to allow for effect-size calculation, and were published in English.

Data Extraction: Two independent reviewers investigated each article to assess inclusion criteria. Eighteen studies involving 2674 participants were retained. Intraclass correlation coefficients were extracted to calculate effect sizes and determine overall reliability.

Data Synthesis: The Fisher Z transformation adjusted for sampling error associated with averaging correlations. Moderator analyses were conducted to evaluate the effects of the length of the test-retest interval, intraclass correlation coefficient model selection, participant demographics, and study design on reliability. Heterogeneity was evaluated using the Cochran Q statistic. The proportion of acceptable outcomes was greatest for the Axon Sports CogState Test (75%) and lowest for the ImPACT (25%). Moderator analyses indicated that the type of intraclass correlation coefficient model used significantly influenced effect-size estimates, accounting for 17% of the variation in reliability.

Conclusions: The Axon Sports CogState Test, which has a higher proportion of acceptable outcomes and a shorter test duration relative to other CNTs, may be a reliable option; however, future studies are needed to compare the diagnostic accuracy of these instruments.
Many experts agree that, when combined with symptom and motor-control assessments, computerized neurocognitive tests (CNTs) can aid athletic trainers in managing and evaluating patients with sport-related concussions.1–3 Current estimates suggest that 33% to 39% of athletic trainers include CNTs as part of their return-to-play protocol.4–6 Although widely used, CNTs have been criticized because of low reliability and poor sensitivity. Reliability coefficients as low as 0.22 have been reported for the Immediate Post-Concussion Assessment and Cognitive Testing (ImPACT; ImPACT Applications, Inc, Pittsburgh, PA)7 and 0.10 for the Automated Neuropsychological Assessment Metrics (ANAM; Vista LifeSciences, Parker, CO).8
Reliability is an extremely important concept in concussion testing because of individual serial-testing strategies.9 Reliability refers to the consistency of the scores obtained from a test. When no concussion is present, an athlete's scores should not change between testing periods. A change in scores when no concussion has occurred indicates measurement error. Unfortunately, no test is perfect, and some level of measurement error is expected with all tests. To address this concern, the reliable change index (RCI), a statistic that estimates the magnitude of difference in scores necessary to suggest true change, is often calculated. The size of an RCI depends on the reliability of the test and the desired level of confidence (eg, 80%, 90%, 95%): when reliability is low, the RCI will be large, and when reliability is high, the RCI will be small.
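For illustration, the calculation can be sketched in R (a minimal example using one common formulation of the RCI, with hypothetical values; the reviewed studies and test manuals may compute the RCI slightly differently):

```r
# Reliable change index (RCI) threshold: the score change that must be
# exceeded before true change is inferred (hypothetical values).
rci_threshold <- function(sd_baseline, reliability, conf = 0.90) {
  sem   <- sd_baseline * sqrt(1 - reliability)  # standard error of measurement
  sdiff <- sqrt(2) * sem                        # standard error of the difference
  qnorm(1 - (1 - conf) / 2) * sdiff             # two-tailed critical change score
}

rci_threshold(sd_baseline = 10, reliability = 0.60)  # low reliability: ~14.7 points
rci_threshold(sd_baseline = 10, reliability = 0.90)  # high reliability: ~7.4 points
```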
The 90% RCI reported for the visual memory section of the ImPACT ranges from 18.23 to 26.50 points.10 This indicates that a change in score of at least 18 points is necessary to reflect a true change in visual memory. The intended purpose of the RCI is to minimize the risk of incorrect clinical decisions due to measurement error. Despite the large RCIs reported across each of the ImPACT outcome scores (visual memory, verbal memory, visual-motor speed, and reaction time), 40% to 80% of healthy individuals experienced significant change (ie, a false-positive diagnosis) on at least 2 of 3 trials during serial testing.10 Although this problem is not limited to ImPACT, it highlights a major area of concern in concussion testing.
False-positive diagnoses are problematic because they can lead to unwarranted removal from competition and subject patients to unnecessary medical procedures. Many of the commonly used CNTs have relatively high false-positive (ie, a healthy individual is diagnosed with a concussion) rates. Broglio et al11 identified false-positive rates of 38% on ImPACT and 19% on the Headminder Concussion Resolution Index (Headminder; Headminder Inc, New York, NY). Nelson et al12 found similar false-positive rates for ImPACT, as well as a false-positive rate of 52% for the Axon Sports CogState Test (Axon; CogState Ltd, Melbourne, Australia).
Although false-positive diagnoses can be a nuisance for athletes, false-negative diagnoses are a bigger concern because they can result in an athlete being returned to play prematurely, leading to further injury or worse. Louey et al13 reported a false-negative (ie, a concussed athlete being diagnosed as healthy) rate of 17%. When the RCI for a test is high (ie, poor reliability), a high rate of false-negative diagnoses may occur due to the large change in scores needed to identify true change. Cognitive changes after concussion may be subtle, and even though a change in scores might be identified during postinjury testing, the difference in scores between test periods may not exceed the RCI; thus, an incorrect clinical decision would be made, placing the athlete at risk for further harm.
Schatz et al14 and Ackerman and Kanfer15 proposed that the low reliability reported in some studies may have been the result of inappropriate study designs resulting in test fatigue experienced by participants. When multiple CNTs are administered concurrently, participants may experience cognitive fatigue, which can negatively affect reliability estimates. Alsalaheen et al10 contended that the differences in reliability coefficients were more likely due to differences in analytic methods between studies.
Intraclass correlation coefficients (ICCs) are the preferred method of examining reliability between sets of scores. Baumgartner et al16 suggested that ICCs between 0.70 and 0.79 be considered below-average acceptable; 0.80 to 0.89, average acceptable; and 0.90 to 1.0, above-average acceptable. The ICC, which uses analysis-of-variance techniques to assess variances among sets of scores, includes different models for estimating reliability that depend on the study design. Many of the authors who examined the reliability of CNTs used different ICC models. Inappropriate models can artificially inflate reliability estimates.
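The practical consequence of model choice can be illustrated with a brief simulation (a sketch only; the psych package shown here is one common tool and was not necessarily used by the reviewed studies):

```r
# Single- vs average-measure ICCs for simulated test-retest data:
# rows are participants, columns are the 2 testing sessions.
library(psych)

set.seed(1)
t1 <- rnorm(30, mean = 100, sd = 10)    # baseline scores
t2 <- t1 + rnorm(30, mean = 0, sd = 6)  # retest scores with measurement error

# Prints single-measure (eg, ICC2) and average-measure (eg, ICC2k) estimates;
# the average-measure coefficients are systematically higher.
ICC(cbind(t1, t2))
```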
In addition, test score reliability appears to depend on the length of time between test administrations.11,17–19 For example, large differences in reliability coefficients were identified when the test-retest interval was increased from 1 hour to 1 week.18 The controversy over CNT reliability is further complicated by the wide range of reliability estimates (from as low as 0.10 to as high as 0.93) reported across studies.7,8,11,12,17–30
Because of the conflicting reports, a more thorough investigation of the reliability of CNT scores is necessary. Although a systematic review10 summarizing the reliability of ImPACT scores has been previously published, that study was limited to a single CNT. Expanding the previous review to include additional CNTs would be beneficial for determining which instrument is the most reliable. Furthermore, no currently published studies have summarized reliability data for CNTs using meta-analytic techniques. Meta-analysis is an advanced statistical procedure used to combine the results of many independent studies into a single analysis. Meta-analysis can help to minimize biases associated with small sample sizes and allow for comparison of results across different moderator variables (eg, length of the test-retest interval, ICC model selection, participant demographics, study design). Therefore, our purpose was to compile and analyze current reliability data for CNTs using meta-analytic techniques and to examine moderating factors that may influence the reliability of CNT scores.
METHODS
Literature Search
We conducted a systematic literature search in March 2016 to locate and identify relevant research for the current study. Combinations of the key words reliability, computerized neurocognitive test, and concussion were entered into the following electronic database search engines with no restrictions for year of publication: MEDLINE, PubMed, Google Scholar, and SPORTDiscus. Literature search findings from each set of key words were recorded and screened for inclusion and exclusion criteria. In addition to electronic database searches, we performed a manual search of reference lists from relevant articles to identify any potential studies missing from the online search. Studies were included in the analysis if they met all of the following inclusion criteria: (1) used a test-retest design; (2) involved at least 1 CNT (ANAM, Axon, Concussion Vital Signs [CNS-VS; CNS Vital Signs, LLC, Morrisville, NC], Headminder Concussion Resolution Index, or ImPACT); (3) author(s) reported sufficient descriptive or inferential statistical data to allow for calculation of effect sizes (ESs); and (4) published in the English language. The outcome of interest was the ICC between the initial baseline test assessment and the follow-up assessment. In this instance, the ICC measures the reproducibility (ie, reliability) between the baseline and follow-up assessments. The Pearson correlation coefficient (r) is another option for reporting reliability; however, r is less desirable because it measures the relative reliability between 2 time points, whereas ICC is a measure of absolute agreement. Therefore, for this meta-analysis, only studies that examined reliability using ICCs were included. Reliability studies using alternative statistics such as the κ statistic or regression were excluded from this review. Using these criteria, potentially relevant studies were screened by 2 independent reviewers, and full texts of all studies meeting the inclusion criteria were further assessed for methodologic quality and data extraction. Unpublished abstracts, dissertations, and theses were considered for inclusion in the study as long as they met the inclusion criteria. When disagreements occurred between the reviewers, consensus was achieved through discussion (see the Figure for a Preferred Reporting Items for Systematic Reviews and Meta-Analyses [PRISMA] diagram illustrating the review process).



Figure. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow chart illustrating the review process. Abbreviation: ICC, intraclass correlation coefficient.
Methodologic Quality
Two reviewers independently assessed the methodologic quality of studies using a modified version of the Downs and Black checklist,31 which has been applied in a recent meta-analysis32 and systematic review.33 The original checklist contained 27 items to measure the quality of intervention-based studies. Thirteen of the original items were irrelevant to our study design and were removed. The modified checklist consists of 14 items in 3 domains (ie, reporting, external validity, and internal validity), and scores range from 0 to 14 (higher scores indicate better-quality studies). Any study with low methodologic quality (ie, a score greater than 1.5 × interquartile range [IQR] above the upper quartile or lower than 1.5 × IQR below the lower quartile) was further examined to determine inclusion in or exclusion from our investigation.
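A brief sketch of this outlier rule (with hypothetical quality scores):

```r
# Flag studies whose quality scores fall outside 1.5 x IQR of the quartiles.
quality <- c(6, 11, 12, 13, 13, 14, 14)            # hypothetical quality indices
q       <- quantile(quality, probs = c(.25, .75))  # lower and upper quartiles
fence   <- 1.5 * IQR(quality)
bounds  <- c(q[1] - fence, q[2] + fence)           # acceptable range
quality[quality < bounds[1] | quality > bounds[2]] # flagged for further review
```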
Coding Procedures and Data Extraction
Before coding, we developed a standardized coding form to simplify the extraction process and maintain consistency between the reviewers. Each study was analyzed, and the following data were extracted: the number of participants in each test sample, the type of ICC model used for analyzing test-retest reliability, the average length of the test-retest interval (ie, the average number of days between the first and second testing sessions), the specific CNT(s) used in each study, and the number of CNT(s) administered concurrently in a single session (ie, the number of CNTs each participant completed). Intraclass correlation coefficients were obtained for the reported outcomes of each CNT. When the ICC model was not specified, the author(s) of the study were contacted to determine which model was used. To avoid dependency concerns in studies that reported ICCs for multiple retesting time points, we used only the first time point. If ICCs were reported for multiple subgroups (eg, athletes versus general population, intercollegiate versus high school), each subgroup was assumed to be independent and included in a single meta-analysis.
Key word searches identified 3289 records. Manual searches identified an additional 3 records. After duplicate studies and studies that did not meet the inclusion criteria were removed, 23 studies were available for full review. After reviewing the full-text articles, we removed an additional 5 studies from the analysis (see the Figure), resulting in a final sample of 18 studies. It should be noted that some of these researchers assessed multiple independent samples or administered multiple CNTs to a single sample; therefore, the final number of samples included in the analysis was 27. Detailed study characteristics are provided in Table 1. The quality of the studies was relatively high, with a median quality index of 13 (upper quartile = 14, lower quartile = 11, IQR = 3). Of the 18 studies, 17 had quality scores within acceptable limits. The remaining study, which had a quality index of 6 (acceptable range = 6.5–14), was an unpublished abstract in which space was limited. After careful consideration and consensus among the authors, we retained the abstract because it contained all the necessary information to calculate ES estimates. Publication bias was assessed using the Egger test.

Data Analysis
All analyses were performed with R software (version 3.2.4; R Foundation, Vienna, Austria)34 using the metafor (version 1.9-8)35 package. Our measure of ES was the ICC, which represents the reproducibility of CNT scores. The Fisher Z transformation was conducted on the ICCs to adjust for the sampling error associated with averaging correlations.36 The transformed average Z coefficients were then transformed back to ICCs to allow for interpretation of the results. A recent Monte Carlo simulation confirmed that back-transformed average Z coefficients are less biased than averaged correlation coefficients.37 Although the previous authors examined only the Pearson correlation (r), it should be noted that both the ICC and r are bounded measures, whereas the transformed average Z coefficient is unbounded. The Z transformation can also be used to build confidence intervals for the ICCs.38
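In R, the transformation reduces to the hyperbolic functions (a minimal sketch with hypothetical ICCs):

```r
# Pool ICCs on the Fisher Z scale rather than averaging them directly.
iccs <- c(0.55, 0.70, 0.80)  # hypothetical study-level ICCs
z    <- atanh(iccs)          # r-to-Z transformation: unbounded, approximately normal
tanh(mean(z))                # back-transform the average to the ICC scale (~0.70)
```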
We selected a random-effects model for the current investigation due to the variability among studies. Effect sizes were computed for each outcome on each CNT (eg, 4 ESs were computed for ImPACT: [1] verbal memory, [2] visual memory, [3] visual-motor speed, and [4] reaction time). Effect sizes were estimated using the escalc function, whereas the random-effects model was calculated using the rma function. A restricted maximum-likelihood estimator was used because it has been demonstrated to be unbiased and efficient [model specification: rma(yi, vi, measure = GEN, method = "REML")].39 A detailed description of the metafor package and its available functions is online at the Comprehensive R Archive Network Web page (https://cran.r-project.org).
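A runnable sketch of this model follows (the data frame `studies` is hypothetical; measure = "ZCOR" applies the Fisher Z transformation with sampling variance 1/(n − 3), mirroring the approach described above):

```r
library(metafor)

# Hypothetical per-sample data: reported ICC and sample size.
studies <- data.frame(icc = c(0.55, 0.70, 0.80), n = c(40, 60, 100))

dat <- escalc(measure = "ZCOR", ri = icc, ni = n, data = studies)  # yi = atanh(icc), vi = 1/(n - 3)
res <- rma(yi, vi, data = dat, method = "REML")                    # random-effects model
predict(res, transf = transf.ztor)                                 # pooled estimate on the ICC scale
```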
To compare the reliability of CNTs, we calculated an average ES by averaging the ESs of the outcomes for each CNT. Furthermore, the proportion of outcomes with acceptable reliability was calculated for each CNT. Only 2 studies21,40 examined the reliability of the CNS-VS test, whereas only a single study11 examined the reliability of Headminder. The samples for these CNTs were not included in the overall meta-analysis because of the small sizes. However, these studies were included in moderator analyses.
Due to the high variability in reported ICCs among CNTs, it is important to assess the potential sources of bias in reliability estimates. Understanding these sources of error can help to minimize testing error and improve future studies. Given the small number of studies for some CNTs, it was not possible to evaluate each moderator for each outcome individually. As a result, the ICCs for each outcome were combined into a single analysis. Although combining related outcomes can result in biased estimates of ES,41 the significance of the moderators can still be determined using these methods with a meta-regression analysis. In this manner, it is possible to determine which variables influence the reliability of outcome scores.
We used mixed-effects models with meta-regression procedures to examine the effects of moderator variables on the reliability of CNTs (model specification: rma[yi, vi, mods = ~moderator, measure = GEN, method = "REML"]). However, due to sample-size limitations, only the effects of the moderators were examined. A separate meta-regression analysis examined each of the following moderator variables: (1) length of the test-retest interval, (2) ICC model selection, (3) participant demographics (eg, athlete population versus general population), and (4) study design (eg, number of CNTs completed by each individual in a single study). Because a wide range of test-retest intervals was reported in some studies, the average test-retest interval was used. Heterogeneity was evaluated using the Cochran Q statistics (Qmodel and Qerror), which are based on the χ2 distribution with N − 1 degrees of freedom, where N represents the total number of samples included in the analysis. In general, a significant Qmodel suggests that the ES estimates differ significantly across studies. When both Qmodel and Qerror are significant, the moderator variables explain some but not all of the variation in ES estimates across studies. A nonsignificant Qmodel suggests no difference in ES estimates across studies.
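Continuing the earlier sketch, a single-moderator meta-regression could look like this (the icc_model labels are hypothetical):

```r
# Mixed-effects meta-regression with one categorical moderator.
dat$icc_model <- factor(c("single", "average", "single"))  # hypothetical model labels
mod <- rma(yi, vi, mods = ~ icc_model, data = dat, method = "REML")

mod     # QM corresponds to Qmodel (moderator test); QE to Qerror (residual heterogeneity)
mod$R2  # percentage of heterogeneity accounted for by the moderator
```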
RESULTS
Overall Reliability
Effect-size estimates (ICCs), Q statistics (Qtotal), and I2 for CNT outcomes are provided in Table 2. Stem-and-leaf plots illustrating the distribution of ESs for each CNT are shown in Table 3. Evidence of publication bias was examined using the Egger test (P = .15); no such bias was identified. Seventy-five percent (3 of 4) of the outcomes for Axon were below-average acceptable (0.70–0.79). Twenty-five percent (1 of 4) of the outcomes for ImPACT were below-average acceptable. Forty-three percent (3 of 7) of the outcomes for ANAM were below-average acceptable. All other outcomes had poor reliability (<0.70).


Moderator Analyses
Intraclass Correlation Coefficient Model Selection
Effect-size estimates for studies using average-measure ICC models (ICC = 0.76; 95% confidence interval [CI] = 0.70, 0.80) were significantly higher than those for studies using single-measure ICC models (ICC = 0.61; 95% CI = 0.58, 0.65; Qmodel = 18.40, degrees of freedom [df] = 2, P < .01; Qerror = 697.82, df = 112, P < .01).
Length of Test-Retest Interval
No differences were identified in ES estimates based on average length of the test-retest interval (Qmodel = 0.70, df = 1, P = .40; Qerror = 866.51, df = 110, P < .01).
Study Population
No differences were identified in ES estimates based on the population (athletes versus general population) included in the study (Qmodel = 1.31, df = 1, P = .25; Qerror = 919.34, df = 113, P < .01).
Number of Computerized Neurocognitive Tests in the Study Protocol
No differences were identified in ES estimates based on the number of CNTs evaluated in a single testing session (Qmodel = 2.19, df = 1, P = .14; Qerror = 903.17, df = 113, P < .01).
DISCUSSION
This meta-analysis provides a comprehensive evaluation of the reliability of CNT scores, combining data from 18 studies consisting of 27 data samples and 2674 participants. Debate is ongoing among experts regarding the clinical utility of CNTs as part of the clinical decision-making process. Although many studies have been published examining reliability data for the various commercially available CNTs, the large variability in their reported estimates can make it difficult to determine which CNT is the most reliable. Athletic trainers often have limited budgets and must choose a single test for use in the clinical setting. The goal of our study was to provide a more in-depth evaluation of CNTs and supply athletic trainers with accurate information for making evidence-based decisions regarding the use of CNTs.
One of the main reasons that direct comparisons of CNTs are difficult is that each test evaluates different domains of cognitive function. This situation is further complicated because some instruments report similar domains, yet these domains are assessed using different tasks. Thus, it can be challenging for athletic trainers to determine which test is the most effective tool. Effect-size estimates across CNT outcomes in this study ranged from 0.52 to 0.77. The majority of the outcomes examined in this meta-analysis (53%) had less than desirable reliability. This is alarming considering the widespread use of these tests in clinical practice. For this reason, the National Athletic Trainers' Association recommends the use of a multidimensional concussion-evaluation protocol.3 Our results support this recommendation; overreliance on CNTs could result in false-positive and false-negative diagnoses due to low reliability.
It should also be noted that, although reliability is a clear concern for CNTs, such tests are not alone in this regard, particularly in the context of concussion evaluation and management. An examination of the Balance Error Scoring System, a commonly used balance assessment, indicated that the interrater and intrarater reliability ICCs for the total scores were 0.57 and 0.74, respectively.42 Furthermore, the reliability of scores appeared to be influenced by sex.43 When multiple baseline assessments were used, the reliability of scores improved.43 Use of a double baseline for concussion testing may be 1 method of improving the reliability of scores.
Comparisons of the CNTs examined in this study suggest that Axon may be the most reliable. First, the Axon test had the highest proportion of outcomes with acceptable reliability (3 of 4 [75%]). Second, compared with the ImPACT, administration time for Axon is considerably shorter. Axon takes approximately 8 to 10 minutes to complete and contains 4 tasks to assess processing speed, attention, learning accuracy, and working memory. The ImPACT consists of 4 composite scores measured using 6 modules and takes twice as long to complete (approximately 20 minutes). Athletic trainers often work with large groups of athletes across multiple teams. Therefore, baseline testing all athletes can take considerable time. The shortened administration time of Axon would allow more individuals to be tested in the same period.
Learning accuracy was the lone Axon outcome with poor reliability. The learning accuracy task, which is associated with delayed memory, requires participants to recall whether their card has been displayed previously. To assess processing speed, attention, and working memory, respectively, Axon also requires participants to press a key when their card turns over, determine whether the current card is red, and state whether their card is the same as the most recent card. By comparison, the learning accuracy task seems considerably more challenging. The increased difficulty of the learning accuracy task could explain the low reliability of this particular outcome, especially if the task is too challenging for the patients being assessed.
Another complication that arises when comparing the efficacy of CNTs is related to study design. To increase power and account for small sample sizes, a within-subject study design is often used to compare CNTs. This practice of examining multiple CNTs in a single study11,14,21 has been questioned by some due to the high risk of fatigue from extended test protocols.14 Only 1 of the 3 studies counterbalanced test order to offset these potential biases. We found no differences in ES estimates between studies evaluating a single CNT and those evaluating multiple CNTs in a single population. These findings contrast with those of Schatz et al,14 who proposed that the low reliability in some studies is related to cognitive fatigue and low methodologic scrutiny. It is likely that the differences in ES estimates identified by the studies in question11,21 are related to differences in analytic methods rather than to differences in study design.
Many of the studies included in this meta-analysis used different ICC models for analyzing test-retest reliability. In general, ICCs derived from models using average measures will be higher than ICCs derived from single-measure models. Intraclass correlation coefficient model selection should depend on the type of data used for composite scores and the intended use of the instrument. When a double baseline approach is applied to minimize potential learning and practice effects, average-measure ICC models are used to account for the fact that multiple assessments are being incorporated into a single time point. In most cases, however, CNTs are administered only once at each time point in the test-retest design. This is equivalent to assessing the reliability of a single rater, where the single-measure ICC model would be the most appropriate.
A systematic review10 of the ImPACT reliability studies demonstrated that, when ES estimates are recalculated using average-measure ICC models, coefficients increased by as much as 0.17. Only a single study13 was published on Axon using average-measure ICC models, and ICCs ranged from 0.83 to 0.93 across the 4 tasks. In this meta-analysis, estimated ESs were different between studies using single-measure and average-measure ICC models. The type of ICC model used accounted for 17% of the variation in ICCs across study outcomes. Inappropriate model selection may result in biased estimates of reliability, which could have contributed to the conflicting evidence reported across studies.
Limitations
Our study was limited by the quantity and quality of the research examining the test-retest reliability of CNTs. The overall sample size was relatively small, which may have influenced the power of the results. In addition, some authors failed to designate which ICC models were used to estimate reliability data. Although most investigators described the ICCs used for their studies through online communication, some were unsure which models were used due to the length of time since the study was published. Additionally, some authors did not respond, resulting in their studies being excluded from moderator analyses due to insufficient information.
Our results suggest that the Axon CNT may be superior to other tests, but it should be noted that ImPACT was included in almost double the number of studies as Axon (11 versus 6). It is possible that Axon's indices would be just as unreliable as those of ImPACT if additional studies were to be completed in the future. Furthermore, some studies were published examining the reliability of Headminder and CNS-VS, yet the number of studies investigating these instruments was too small to allow for comparisons with the more popular CNTs. More work is needed to examine the reliability of these instruments.
Practice effects are another potential area of concern that could influence the reliability of CNT scores. Some studies included multiple retesting time points; however, the number of studies that did this was rather small. In addition, the interval between retesting time points was not consistent. This combined with the small sample sizes would make it challenging to separate practice effects from effects related to the test-retest interval. As a result, we were not able to assess this in the current study. Future research is needed to investigate potential practice effects among CNTs.
Last, we used a univariate meta-analytic approach to analyze the data from each outcome independently. Multivariate data, such as those seen with CNT outcomes, should ideally be analyzed under a multivariate model. To conduct a multivariate meta-analysis, the correlations between outcomes are required to calculate the covariance matrix necessary for analyzing the multilevel data. Unfortunately, to our knowledge, no published studies have reported the correlations between outcomes for each CNT, which prohibits the use of a multivariate meta-analysis; the effect of ignoring these potential dependencies on estimated ESs is unknown. In addition, it has been suggested that a large number of studies is needed to produce reliable results with a multivariate meta-analysis.44 Three potential solutions to this problem are (1) ignoring the dependencies and analyzing the data anyway, (2) averaging the ICC values across studies, or (3) conducting a separate analysis for each independent outcome.44 Therefore, for this study, a combination of methods (2) and (3) was used to calculate ESs: first, the ESs were estimated for each outcome independently; second, the ESs were averaged for each test to determine which test was more reliable.
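A sketch of this two-step approach, continuing the earlier hypothetical metafor examples (assuming `dat` holds Fisher Z effect sizes for one CNT with an outcome label per row):

```r
# Method (3): fit a separate random-effects model for each outcome;
# method (2): average the pooled Z estimates and back-transform.
fits <- lapply(split(dat, dat$outcome),
               function(d) rma(yi, vi, data = d, method = "REML"))
z_by_outcome <- sapply(fits, function(f) as.numeric(f$b))  # pooled Z per outcome
tanh(mean(z_by_outcome))                                   # average ES for the CNT, on the ICC scale
```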
CONCLUSIONS
Despite limitations, this meta-analysis provides compelling evidence that the reliability of CNTs is less than desirable. Although no significant differences were identified in average ESs across CNTs, the Axon test, which has a higher proportion of acceptable outcomes and shorter test duration relative to other CNTs, may be a reliable option among popular CNTs. Future studies, however, are needed to compare the diagnostic accuracy of these instruments.
