Journal ArticleDOI

ISLES 2015 - A public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI

01 Jan 2017-Medical Image Analysis (Elsevier)-Vol. 35, pp 250-269
TL;DR: This paper proposes a common evaluation framework for automatic stroke lesion segmentation from MRI, describes the publicly available datasets, and presents the results of the two sub-challenges: Sub-acute Ischemic Stroke lesion Segmentation (SISS) and Stroke Perfusion Estimation (SPES).
About: This article is published in Medical Image Analysis. The article was published on 2017-01-01 and is currently open access. It has received 417 citations to date.

Summary (5 min read)

1. Introduction

  • Still, segmenting stroke lesions from MRI images poses a challenging problem.
  • First, the stroke lesions' appearance varies significantly over time, not only between but even within the clinical phases of stroke development.
  • In the acute phase, DWI shows the infarcted region as a hyperintensity.
  • They may or may not be aligned with the vascular supply territories and multiple lesions can appear at the same time (e.g. caused by an embolic shower).
  • Finally, a good segmentation approach must comply with the clinical workflow.

1.1. Current methods

  • The quantification of stroke lesions has gained increasing interest during the past years ( Fig. 1 ).
  • A recent review of non-chronic stroke lesion segmentation methods is given in Table 1, which lists publications describing non-chronic stroke lesion segmentation in MRI with evaluation on human image data since Rekik et al. (2012).
  • Column Metrics denotes the metrics used in the evaluation.

Method

  • In the present benchmark study, the authors approach the urgent problem of comparability.
  • To this end, the authors planned, organized, and pursued the Ischemic Stroke LEsion Segmentation challenge: a direct, fair, and independently controlled comparison of automatic methods on a carefully selected public dataset.
  • ISLES combined two sub-challenges dealing with different phases of stroke lesion evolution: first, the Stroke Perfusion EStimation (SPES) challenge, dealing with image interpretation in the acute phase of stroke; second, the Sub-acute Ischemic Stroke lesion Segmentation (SISS) challenge, dealing with the later stroke image patterns.

2. Setup of ISLES

  • Interested research teams could register for one or both subchallenges.
  • All submitted algorithms were required to be fully automatic; no other restrictions were imposed.
  • By the day of the challenge, the SMIR platform listed over 120 registered users for ISLES 2015 and a similar number of training dataset downloads.
  • Of these, 14 teams provided testing dataset results for SISS and 7 algorithms participated in SPES.

3.2. SPES image data and ground truth

  • The training dataset is additionally equipped with a manually created DWI segmentation ground truth set, which roughly denotes the stroke's core area.
  • These DWI-based segmentations are not considered in the challenge.

3.3. Evaluation metrics

  • Please note that the ASSD and HD values were computed excluding the failed cases (they do, however, incur the lowest vacant rank for these cases).
  • The last row shows the inter-observer results for comparison.
  • The saturation of the node colors indicates the strength of a method, where better methods are highlighted with more saturated colors.
  • Note that all teams with the same number of incoming and outgoing edges perform, statistically speaking, equally well.

Rank

  • Similarly, the HD is defined as the maximum of all surface distances: $\mathrm{HD}(A,B) = \max\big(\max_{a \in \partial A} \min_{b \in \partial B} d(a,b),\ \max_{b \in \partial B} \min_{a \in \partial A} d(a,b)\big)$, where $\partial A$ and $\partial B$ denote the segmentation surfaces.
  • The distance measure d ( ) employed in both cases is the Euclidean distance, computed taking the voxel size into account.
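The two surface distance measures can be sketched as follows for binary masks. This is an illustrative implementation (function names are mine), not the evaluation code used in the challenge; it assumes 3D boolean masks and a millimetre voxel spacing.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial import cKDTree

def surface_distances(a, b, spacing):
    """Euclidean distances (in mm) from every surface voxel of boolean
    mask `a` to the nearest surface voxel of boolean mask `b`."""
    surf_a = a & ~binary_erosion(a)  # boundary voxels of a
    surf_b = b & ~binary_erosion(b)
    pts_a = np.argwhere(surf_a) * np.asarray(spacing)  # voxel size taken into account
    pts_b = np.argwhere(surf_b) * np.asarray(spacing)
    d, _ = cKDTree(pts_b).query(pts_a)  # nearest-neighbour distances
    return d

def assd_hd(a, b, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance (ASSD) and Hausdorff distance (HD)."""
    d_ab = surface_distances(a, b, spacing)
    d_ba = surface_distances(b, a, spacing)
    assd = (d_ab.sum() + d_ba.sum()) / (d_ab.size + d_ba.size)
    hd = max(d_ab.max(), d_ba.max())
    return assd, hd
```

The ASSD averages both directed distance sets, while the HD takes their maximum, which is why a single outlier voxel can dominate the HD.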

3.4. Ranking

  • The authors chose a third approach based on the ideas of Murphy et al. (2011) that builds on the concept that a ranking reveals only the direction of a relationship between two items (i.e. higher, lower, equal) but not its magnitude.
  • Basically, each participant's results are ranked per case according to each of the three metrics and then the obtained ranks are averaged.
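The rank-averaging scheme described above can be sketched as follows. This is a simplified illustration under my own assumptions about the table layout and tie handling, not the challenge's actual evaluation code.

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(scores, higher_is_better):
    """scores: dict mapping metric name -> (n_cases, n_teams) table.
    Each team is ranked per case and per metric (1 = best; ties share
    an averaged rank); the final score is each team's mean rank."""
    per_metric_ranks = []
    for metric, table in scores.items():
        t = np.asarray(table, float)
        if higher_is_better[metric]:
            t = -t  # rankdata ranks ascending, so negate "bigger is better" metrics
        per_metric_ranks.append(np.apply_along_axis(rankdata, 1, t))
    return np.concatenate(per_metric_ranks).mean(axis=0)
```

Because only rank directions enter the average, a method that wins every case by a tiny margin scores the same as one that wins by a large margin.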

3.5. Label fusion

  • The specific design of each automatic segmentation algorithm will result in certain strengths and weaknesses for particular challenges in the present image data.
  • These algorithms enable a suitable selection and/or fusion to best combine complementary segmentation approaches.
  • First, majority vote ( Xu et al., 1992 ) , which simply counts the number of foreground votes over all classification results for each voxel separately and assigns a foreground label if this number is greater than half the number of algorithms.
  • Second, the STAPLE algorithm ( Warfield et al., 2004 ) , which performs a simultaneous truth and performance level estimation, that calculates a global weight for each rater and attempts to remove the negative influence of poor algorithms during majority voting.
  • Third, the SIMPLE algorithm ( Langerak et al., 2010 ) , which employs a selective and iterative method for performance level estimation by successively removing the algorithms with poorest accuracy as judged by their respective Dice score against a weighted majority vote, where the weights are determined by the previously estimated performances.
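The first fusion strategy, and a strongly simplified variant of the third, can be sketched as follows. Note the real SIMPLE algorithm uses a weighted majority vote and performance-derived thresholds; this unweighted loop with a fixed Dice threshold is only an illustration of the selective-iterative idea.

```python
import numpy as np

def majority_vote(segmentations):
    """Label a voxel foreground if more than half of all algorithms do."""
    stack = np.asarray(segmentations, bool)
    return stack.sum(axis=0) > stack.shape[0] / 2

def dice(a, b):
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def simple_fusion(segmentations, drop_below=0.7, max_iter=5):
    """SIMPLE-like loop (simplified): repeatedly fuse, then drop raters
    whose Dice against the current fused estimate falls below a threshold."""
    segs = [np.asarray(s, bool) for s in segmentations]
    for _ in range(max_iter):
        fused = majority_vote(segs)
        kept = [s for s in segs if dice(s, fused) >= drop_below]
        if len(kept) == len(segs) or len(kept) < 2:
            break
        segs = kept
    return majority_vote(segs)
```

Dropping poor raters before the final vote is what lets such a scheme suppress correlated failed segmentations that would sway a plain majority vote.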

4.1. Inter-observer variance

  • Comparing the two ground truths of SISS against each other provides (1) the baseline above which an automatic method can be considered to produce results superior to a human rater and (2) a measure of the task's difficulty ( Table 7 , last row).
  • The two expert segmentations overlap at least partially for all cases.
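The overlap comparison between the two ground truth sets rests on the Dice coefficient (DC). A minimal implementation for binary masks (the both-empty convention is my assumption, not stated in the paper):

```python
import numpy as np

def dice_coefficient(gt1, gt2):
    """Dice overlap of two binary masks: 1.0 = identical, 0.0 = disjoint."""
    gt1, gt2 = np.asarray(gt1, bool), np.asarray(gt2, bool)
    denom = gt1.sum() + gt2.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (my convention)
    return 2.0 * np.logical_and(gt1, gt2).sum() / denom
```

"Overlap at least partially" then simply means DC > 0 for every case.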

4.2. Leaderboard

  • The evaluation measures and ranking system employed are described in the method part of this article ( Section 3.4 ).
  • No participating method segmented all 36 testing cases successfully (DC > 0), and the best scores remain substantially below human rater performance.
  • Note that for all following experiments, the authors will focus on DC averages only as the ASSD and HD values cannot be readily computed for the failed cases and are thus not suitable for a direct comparison of methods with differing numbers of failure cases.

4.3. Statistical analysis

  • The two highest ranking methods, UK-Imp2 and CN-Neu, show no statistically significant differences with a confidence of 95% (i.e. p < 0.025).
  • No other algorithm performs better than them, and they both are better than the 12 remaining ones.
  • Next comes a group of four methods (FI-Hus, BE-Kul2, US-Odu, DE-UzL) to which only the two winners prove superior.
  • But among these, FI-Hus takes the highest position as it is statistically better than eight other methods, while the other three only prove superior to at most four competitors.
  • The established leaderboard ranking is largely confirmed by the statistical analysis.
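Such pairwise comparisons can be realized with a paired nonparametric test on the per-case ranks. Wilcoxon's signed-rank test appears in the paper's reference list, so a sketch along those lines is plausible, though this is not confirmed to be the authors' exact procedure; the threshold mirrors the p < 0.025 criterion above.

```python
import numpy as np
from scipy.stats import wilcoxon

def significantly_better(ranks_a, ranks_b, alpha=0.025):
    """One-sided Wilcoxon signed-rank test on paired per-case ranks:
    True if method A's ranks are significantly lower (better) than B's."""
    _, p = wilcoxon(ranks_a, ranks_b, alternative="less")
    return p < alpha
```

Two methods count as statistically tied when neither direction is significant, which is how groups with the same number of incoming and outgoing edges arise.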

4.4. Impact of multi-center data

  • Since the training dataset contained only cases from the first center, the difference in performance should reveal the methods' generalization abilities.
  • The authors observed that not a single algorithm reached second center scores comparable to its first center scores.

4.5. Combining the participants' results by label fusion

  • Applying the three label fusion algorithms presented in Section 3.5 led to the results shown at the bottom of Table 7.
  • The authors found that the SIMPLE algorithm performed best and could reduce outliers as evident by a lower Hausdorff distance.
  • When using majority voting or STAPLE, the negative influence of multiple correlated failed segmentations yielded an accuracy lower than that of at least the two top-ranked algorithms.

4.6. Dependency on observer variations

  • The average DC scores of each method differed only slightly over the ground truth sets.
  • Only in a single case, UK-Imp2, was the difference significant (paired Student's t-test with p < 0.05), but the higher results were obtained on the formerly unseen GT02 set.
  • The authors can hence conclude that all algorithms generalized well with respect to expert segmentations of different raters.
  • An additional data analysis showed that the ranking of the methods does not change if only one or the other of the ground truth sets is employed for evaluation.

4.7. Outlier cases

  • The three cases that were successfully processed by nearly all algorithms show large, clearly outlined lesions with a strongly hyperintense FLAIR signal.
  • In two of these cases, the DWI signal is relatively weak, in some areas nearly isointense.
  • Still, for these cases the algorithms displayed the highest confidence.
  • Another case (10), also showing a small lesion, has stronger FLAIR support, but it also displays large periventricular WMHs that seem to confuse most algorithms despite missing DWI hyperintensities.
  • But many algorithms additionally delineated parts of the periventricular WMHs, which again only show up in the FLAIR sequence.

4.8. Correlation with lesion characteristics

  • Significant moderate correlation was found between the lesion volume and the average DC values.
  • A statistically significant difference of means was found when comparing cases with haemorrhage present and cases without, as well as between left hemispheric and right hemispheric lesions.
  • Since the characteristics cannot be assumed to be independent, the authors furthermore tested the last two groupings for significant differences in lesion volumes between the groups.
  • This was found in both cases (see secondary test for each of these two characteristics).
  • The authors could not reliably establish a significant influence on the results for any single parameter.
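An analysis of this kind can be sketched with standard tests, e.g. Pearson correlation for volume vs. DC and an independent t-test for group means. The data below are synthetic and the grouping (a median split on volume standing in for, e.g., haemorrhage presence) is purely illustrative; the paper does not specify that these exact tests were used.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_ind

rng = np.random.default_rng(42)

# synthetic stand-ins for 36 test cases (all values are made up)
volume = rng.uniform(1.0, 200.0, size=36)                    # lesion volume in ml
dc = 0.3 + 0.002 * volume + rng.normal(0.0, 0.05, size=36)   # DC rising with volume

# correlation between lesion volume and segmentation quality
r, p = pearsonr(volume, dc)

# difference of DC means between two groups of cases
# (median split on volume as a hypothetical stand-in grouping)
group = volume > np.median(volume)
t, p_group = ttest_ind(dc[group], dc[~group])
```

The point the authors make is visible here too: when the grouping variable itself correlates with lesion volume, a significant group difference cannot be attributed to that variable alone.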

5.1. Leaderboard

  • The authors opted not to calculate the HD for SPES as it does not reflect the clinical interest of providing volumetric information of the penumbra region.
  • In addition, since some lesions in SPES contained holes, the HD was not a useful metric for gauging segmentation quality.
  • This ranking is the outcome of the challenge event and was used to determine the competition winners.
  • Visualization of significant differences between the 7 participating methods' case ranks.
  • Note that all teams with the same number of incoming and outgoing edges perform, statistically speaking, equally well.

5.2. Statistical analysis

  • The authors do not observe significant differences between the two first ranked methods nor between the third and fourth place.
  • Hence, SPES has two first ranked, two second ranked, and one third ranked method according to the statistical analysis.

5.4. Outlier cases

  • For case 05, two previous embolisms can be observed that cause a compensatory perfusion change, depicted as two hyperintense regions within the lesion area in the diffusion image and as hypoperfused areas in the Tmax map.
  • In summary, the main difficulties faced by the algorithms are related to physiological aspects, such as collateral flow, previous infarcts, etc.

6.1. The most suitable algorithm and the remaining challenges

  • The conclusions drawn here are meant to be general and valid for most of the participating methods.
  • Any interested reader is invited to download the participants' training dataset results and perform her/his own analysis to test whether these findings hold true for a particular algorithm.

6.2. Recommendations and limitations

  • Instead, the findings of this investigation can help them to identify suitable solutions that can serve as support tools.
  • In particular, large and clearly outlined lesions, which are usually tedious to delineate by hand, are already segmented with good results.
  • For smaller and less pronounced lesions the manual approach is still recommended.
  • Furthermore, they should be aware that individual adaptations to each data source are most likely required, either by tuning the hyperparameters or through machine learning.

7. Discussion: SPES

  • The discrepancy between the relatively good results reported by Olivot et al. (2009a ) , Christensen et al. (2010) and Straka et al. (2010) and the poor performance observed in this study can be partially explained by the different end-points (expert segmentation on PWI-MRI vs. follow-up FLAIR/T2), the different evaluation measures (DC/ASSD vs. volume similarity), and the different data.
  • This only serves to highlight the need for a public evaluation dataset.
  • From an image processing point of view, the volume correlation is not a suitable measure to evaluate segmentations as it can lead to good results despite completely missed lesions.

7.1. The most suitable algorithm and the remaining challenges

  • An automated method has to fulfill the strict requirements of clinical routine.
  • With runtimes of 6 min (CH-Insel) and 20 s (DE-UzL), including all pre- and post-processing steps, the two winning methods fit the requirements, DE-UzL even leaving room for overhead.

7.2. Recommendations and limitations

  • While MCA strokes are most common and well suited for mechanical reperfusion therapies ( Kemmling et al., 2015 ) , the restriction to low-noise MCA cases limits the result transfer to clinical routine.
  • The generality of the results is additionally reduced by providing only single-center, single-ground truth data.
  • Finally, voxel-sized errors in the ground truth prevented the evaluation of the HD, which would have provided additional information.

8. Conclusion

  • For the next version of ISLES, the authors would like to focus on the acute segmentation problem from a therapeutical point of view.
  • By modeling a benchmark reflecting the time-critical decision making processes for cerebrovascular therapies, the authors hope to promote the transfer from methods to clinical routine and further the exchange between the disciplines.
  • A multi-center dataset with hundreds of cases will allow the participants to develop complex solutions.


Figures (13)
Citations
Journal ArticleDOI
TL;DR: An efficient and effective dense training scheme which joins the processing of adjacent image patches into one pass through the network while automatically adapting to the inherent class imbalance present in the data, and improves on the state-of-the‐art for all three applications.

2,842 citations


Cites methods from "ISLES 2015 - A public evaluation be..."

  • ...We participated in the 2015 Ischemic Stroke Lesion Segmentation (ISLES) challenge, where our system achieved the best results among all participants on sub-acute ischemic stroke lesions (Maier et al. (2017))....


Posted ContentDOI
Spyridon Bakas1, Mauricio Reyes, Andras Jakab2, Stefan Bauer3  +435 moreInstitutions (111)
TL;DR: This study assesses the state-of-the-art machine learning methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018, and investigates the challenge of identifying the best ML algorithms for each of these tasks.
Abstract: Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumoris a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses thestate-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross tota lresection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.

1,165 citations

Journal ArticleDOI
TL;DR: An auto‐context version of the VoxResNet is proposed by combining the low‐level image appearance features, implicit shape information, and high‐level context together for further improving the segmentation performance, and achieved the best performance in the 2013 MICCAI MRBrainS challenge.

633 citations


Cites background or methods from "ISLES 2015 - A public evaluation be..."

  • ..., 2016) or caused by various lesions (Menze et al., 2015; Maier et al., 2017), the confounding appearance of different...


  • ...…modalities were used in brain tumor segmentation including T1, T1 contrast-enhanced, T2, and T2-FLAIR MRI (Menze et al., 2015) and four imaging modalities including T1weighted, T2-weighted, diffusion weighted imaging (DWI), and FLAIR MRI were employed in brain lesion analysis (Maier et al., 2017)....


  • ..., 2015) and four imaging modalities including T1weighted, T2-weighted, diffusion weighted imaging (DWI), and FLAIR MRI were employed in brain lesion analysis (Maier et al., 2017)....


  • ...…large intra-class variations of these structures among different subjects (Moeskops et al., 2016) or caused by various lesions (Menze et al., 2015; Maier et al., 2017), the confounding appearance of different http://dx.doi.org/10.1016/j.neuroimage.2017.04.041 Accepted 18 April 2017 ⁎…...


Posted Content
TL;DR: A CNN-based method with three-dimensional filters is demonstrated and applied to hand and brain MRI and is validated on data both from the central nervous system as well as the bones of the hand.
Abstract: Convolutional neural networks have been applied to a wide variety of computer vision tasks. Recent advances in semantic segmentation have enabled their application to medical image segmentation. While most CNNs use two-dimensional kernels, recent CNN-based publications on medical image segmentation featured three-dimensional kernels, allowing full access to the three-dimensional structure of medical images. Though closely related to semantic segmentation, medical image segmentation includes specific challenges that need to be addressed, such as the scarcity of labelled data, the high class imbalance found in the ground truth and the high memory demand of three-dimensional images. In this work, a CNN-based method with three-dimensional filters is demonstrated and applied to hand and brain MRI. Two modifications to an existing CNN architecture are discussed, along with methods on addressing the aforementioned challenges. While most of the existing literature on medical image segmentation focuses on soft tissue and the major organs, this work is validated on data both from the central nervous system as well as the bones of the hand.

302 citations


Cites methods from "ISLES 2015 - A public evaluation be..."

  • ...A similar approach is followed by Havaei et al. in [13], also training and testing MR images of brains from the BRATS and ISLES data sets....


  • ...The authors benchmark their approach on the BRATS [31] and ISLES [30] challenges2....


Journal ArticleDOI
TL;DR: An extensive literature review of CNN techniques applied in brain magnetic resonance imaging (MRI) analysis, focusing on the architectures, pre-processing, data-preparation and post-processing strategies available in these works.

264 citations

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations

Proceedings Article
Sergey Ioffe1, Christian Szegedy1
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.

30,843 citations


"ISLES 2015 - A public evaluation be..." refers methods in this paper

  • ...By utilizing small 3×3 kernels that lead to deeper architectures with less trainable parameters, as well as adopting Dropout, Batch Normalization (Ioffe and Szegedy, 2015) and augmenting the database using reflection along the sagittal axis, we heavily regularize our network and show that it is…...


  • ...By utilizing small 3×3 kernels that lead to deeper architectures with less trainable parameters, as well as adopting Dropout, Batch Normalization (Ioffe and Szegedy, 2015) and augmenting the database using reflection along the sagittal axis, we heavily regularize our network and show that it is possible to train such a deep and wide network on a limited database....


Book ChapterDOI
Frank Wilcoxon1
TL;DR: The comparison of two treatments generally falls into one of the following two categories: (a) a number of replications for each of the two treatments, which are unpaired, or (b) we may have a series of paired comparisons, some of which may be positive and some negative as mentioned in this paper.
Abstract: The comparison of two treatments generally falls into one of the following two categories: (a) we may have a number of replications for each of the two treatments, which are unpaired, or (b) we may have a number of paired comparisons leading to a series of differences, some of which may be positive and some negative. The appropriate methods for testing the significance of the differences of the means in these two cases are described in most of the textbooks on statistical methods.

12,871 citations

Journal ArticleDOI
TL;DR: A new tree-based ensemble method for supervised classification and regression problems that consists of randomizing strongly both attribute and cut-point choice while splitting a tree node and builds totally randomized trees whose structures are independent of the output values of the learning sample.
Abstract: This paper proposes a new tree-based ensemble method for supervised classification and regression problems. It essentially consists of randomizing strongly both attribute and cut-point choice while splitting a tree node. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter. We evaluate the robustness of the default choice of this parameter, and we also provide insight on how to adjust it in particular situations. Besides accuracy, the main strength of the resulting algorithm is computational efficiency. A bias/variance analysis of the Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization of the models induced.

5,246 citations

Journal ArticleDOI
TL;DR: A novel approach to correcting for intensity nonuniformity in magnetic resonance (MR) data is described that achieves high performance without requiring a model of the tissue classes present, and is applied at an early stage in an automated data analysis, before a tissue model is available.
Abstract: A novel approach to correcting for intensity nonuniformity in magnetic resonance (MR) data is described that achieves high performance without requiring a model of the tissue classes present. The method has the advantage that it can be applied at an early stage in an automated data analysis, before a tissue model is available. Described as nonparametric nonuniform intensity normalization (N3), the method is independent of pulse sequence and insensitive to pathological data that might otherwise violate model assumptions. To eliminate the dependence of the field estimate on anatomy, an iterative approach is employed to estimate both the multiplicative bias field and the distribution of the true tissue intensities. The performance of this method is evaluated using both real and simulated MR data.

4,613 citations

Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "Isles 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral mri" ?

The authors propose a common evaluation framework for automatic ischemic stroke lesion segmentation from multispectral MRI, describe the publicly available datasets and evaluation scheme of the Ischemic Stroke Lesion Segmentation (ISLES) challenge, and present the results of its two sub-challenges, Sub-acute Ischemic Stroke lesion Segmentation (SISS) and Stroke Perfusion Estimation (SPES). However, no algorithmic characteristic could be clearly linked to superior performance, and the influence of the stroke lesions' characteristics on the results was studied in detail.