Showing papers presented at "Cross-Language Evaluation Forum in 2010"


Proceedings Article
01 Jan 2010
TL;DR: The seventh edition of the ImageCLEF medical retrieval task, run in 2010, comprised three subtasks (modality detection, image-based retrieval and case-based retrieval); mixed visual-textual methods gave the best ad-hoc and modality-detection results, while textual methods were clearly superior for the case-based topics.
Abstract: The seventh edition of the ImageCLEF medical retrieval task was organized in 2010. As in 2008 and 2009, the collection in 2010 uses images and captions from the Radiology and Radiographics journals published by RSNA (Radiological Society of North America). Three subtasks were conducted under the auspices of the medical task: modality detection, image-based retrieval and case-based retrieval. The goal of the modality detection task was to detect the acquisition modality of the images in the collection using visual, textual or mixed methods. The goal of the image-based retrieval task was to retrieve an ordered set of images from the collection that best met the information need specified as a textual statement and a set of sample images, while the goal of the case-based retrieval task was to return an ordered set of articles (rather than images) that best met the information need provided as a description of a "case". The number of registrations to the medical task increased to 51 research groups. However, the number of groups submitting runs remained stable at 16, with the number of submitted runs increasing to 155. Of these, 61 were ad-hoc runs, 48 were case-based runs and the remaining 46 were modality classification runs. The best results for the ad-hoc retrieval topics were obtained using mixed methods, with textual methods also performing well. Textual methods were clearly superior for the case-based topics. For the modality detection task, although textual and visual methods alone were relatively successful, combining these techniques proved most effective.

67 citations


Proceedings Article
01 Jan 2010
TL;DR: This paper summarizes the definition, resources, evaluation methodology and metrics, participation and comparative results for the second task of the WEPS-3 evaluation campaign, the so-called Online Reputation Management task.
Abstract: This paper summarizes the definition, resources, evaluation methodology and metrics, participation and comparative results for the second task of the WEPS-3 evaluation campaign. The so-called Online Reputation Management task consists of filtering Twitter posts containing a given company name depending on whether the post is actually related to the company or not. Five research groups submitted results for the task.

55 citations


Book ChapterDOI
20 Sep 2010
TL;DR: A plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing; experiments show that the method achieves better results with medium and large plagiarized passages.
Abstract: This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate our method, we created a corpus containing artificial plagiarism offenses. Two different experiments were conducted; the first one considers only monolingual plagiarism cases, while the second one considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the plagiarized text length affects the overall performance of the method. This analysis showed that our method achieved better results with medium and large plagiarized passages.
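A minimal Python sketch of two of the phases named above, candidate-document retrieval and plagiarism analysis, under the assumption that language normalization has already mapped both sides into one language; the word-overlap scoring and word 5-gram matching are illustrative stand-ins, not the authors' classifier-based method.
```python
# Hypothetical sketch: candidate retrieval + passage matching for plagiarism
# analysis, assuming suspicious and source documents are already in one language.
from collections import Counter

def ngrams(tokens, n=5):
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def candidate_sources(suspicious, sources, top_k=3):
    """Rank source documents by word overlap with the suspicious document."""
    susp_vocab = Counter(suspicious.lower().split())
    scored = []
    for doc_id, text in sources.items():
        vocab = Counter(text.lower().split())
        overlap = sum((susp_vocab & vocab).values())
        scored.append((overlap, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:top_k]]

def matching_passages(suspicious, source, n=5):
    """Return word n-grams shared by both documents (a crude passage signal)."""
    return ngrams(suspicious.lower().split(), n) & ngrams(source.lower().split(), n)

if __name__ == "__main__":
    sources = {"src1": "the quick brown fox jumps over the lazy dog near the river",
               "src2": "completely unrelated text about information retrieval"}
    suspicious = "he saw the quick brown fox jumps over the lazy dog and left"
    for doc_id in candidate_sources(suspicious, sources):
        print(doc_id, matching_passages(suspicious, sources[doc_id]))
```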

30 citations


Book ChapterDOI
20 Sep 2010
TL;DR: This paper reviews the current state of web-based and component-level evaluation of information retrieval (IR) and proposes a comprehensive framework for web service-based component-level IR system evaluation.
Abstract: Automated component-level evaluation of information retrieval (IR) is the main focus of this paper. We present a review of the current state of web-based and component-level evaluation. Based on these systems, propositions are made for a comprehensive framework for web service-based component-level IR system evaluation. The advantages of such an approach are considered, as well as the requirements for implementing it. Acceptance of such systems by researchers who develop components and systems is crucial for having an impact and requires that a clear benefit is demonstrated.

26 citations


Book ChapterDOI
20 Sep 2010
TL;DR: A simulation framework incorporating existing and new simulation strategies is developed for generating queries and relevance judgments for retrieval system evaluation; incorporating knowledge about document structure in the query generation process helps create more realistic simulators.
Abstract: We design and validate simulators for generating queries and relevance judgments for retrieval system evaluation. We develop a simulation framework that incorporates existing and new simulation strategies. To validate a simulator, we assess whether evaluation using its output data ranks retrieval systems in the same way as evaluation using real-world data. The real-world data is obtained using logged commercial searches and associated purchase decisions. While no simulator reproduces an ideal ranking, there is a large variation in simulator performance that allows us to distinguish those that are better suited to creating artificial testbeds for retrieval experiments. Incorporating knowledge about document structure in the query generation process helps create more realistic simulators.
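The validation step described above can be pictured with a small sketch: score a set of systems once with real judgments and once with simulated ones, then check how similarly the two data sets rank the systems. The per-system scores and the home-grown Kendall tau below are invented for illustration; the paper's real-world data comes from logged commercial searches and purchase decisions.
```python
# Illustrative sketch (not the authors' framework): a simulator is judged by
# whether evaluation on its output ranks systems like evaluation on real data.
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / pairs

# Hypothetical per-system effectiveness scores (e.g., MAP) under both data sets.
real_scores      = [0.31, 0.28, 0.25, 0.22, 0.18]   # from logged searches/purchases
simulated_scores = [0.40, 0.33, 0.36, 0.25, 0.20]   # from simulator output

print("simulator fidelity (Kendall tau):", kendall_tau(real_scores, simulated_scores))
```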

24 citations


Book ChapterDOI
20 Sep 2010
TL;DR: A method for evaluating multilingual multi-document summarisation is presented that saves precious annotation time and makes the evaluation results directly comparable across languages.
Abstract: We present a method for the evaluation of multilingual multi-document summarisation that allows saving precious annotation time and that makes the evaluation results across languages directly comparable. The approach is based on the manual selection of the most important sentences in a cluster of documents from a sentence-aligned parallel corpus, and on projecting the sentence selection to various target languages. We also present two ways of exploiting inter-annotator agreement levels, apply them both to a baseline sentence extraction summariser in seven languages, and discuss the result differences between the two evaluation versions, as well as a preliminary analysis between languages. The same method can in principle be used to evaluate single-document summarisers or information extraction tools.
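A toy sketch of the projection idea, assuming the corpus is stored as index-aligned sentence lists per language; the data layout and the overlap score are invented for illustration. Because the corpus is sentence-aligned, indices selected by annotators in one language directly yield reference summaries, and a language-independent score, in every other language.
```python
# Minimal sketch of projecting a manual sentence selection across a
# sentence-aligned parallel corpus (toy data, not the paper's corpus).
parallel = {   # index-aligned sentences per language
    "en": ["The commission met.", "It adopted the budget.", "Weather was fine."],
    "de": ["Die Kommission tagte.", "Sie nahm den Haushalt an.", "Das Wetter war gut."],
}
selected = {0, 1}            # sentence indices picked by annotators on the English side

def reference_summary(lang):
    """Project the annotators' selection to the target language."""
    return [parallel[lang][i] for i in sorted(selected)]

def sentence_overlap_score(system_indices, gold_indices=selected):
    """Fraction of gold sentences the summariser extracted (language-independent)."""
    return len(set(system_indices) & gold_indices) / len(gold_indices)

print(reference_summary("de"))
print(sentence_overlap_score({1, 2}))    # system picked sentences 1 and 2 -> 0.5
```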

23 citations


Book ChapterDOI
20 Sep 2010
TL;DR: It appears that systems obtain scores that depend not only on the relevance of the retrieved documents but also on document names in case of ties, and the case for fairer tie-breaking strategies is argued.
Abstract: We consider Information Retrieval evaluation, especially at TREC with the trec_eval program. It appears that systems obtain scores that depend not only on the relevance of the retrieved documents, but also on document names in case of ties (i.e., when documents are retrieved with the same score). We consider this tie-breaking strategy as an uncontrolled parameter influencing measure scores, and argue the case for fairer tie-breaking strategies. A study of 22 TREC editions reveals significant differences between TREC's conventional, unfair strategy and the fairer strategies we propose. This experimental result advocates using these fairer strategies when conducting evaluations.
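A small illustration, with invented data rather than figures from the paper, of the effect being described: when several documents share a retrieval score, average precision depends on how the tie is ordered, and averaging over all orderings of the tie group is one example of a fairer, name-independent strategy.
```python
# Minimal illustration: tie ordering changes average precision (AP).
import itertools, statistics

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# One query: (doc, score); the last three documents share the same score.
run = [("d9", 2.0), ("d2", 1.0), ("d5", 1.0), ("d7", 1.0)]
relevant = {"d5"}

tie_group = [d for d, s in run if s == 1.0]
aps = []
for perm in itertools.permutations(tie_group):      # every ordering of the tie
    ranking = ["d9"] + list(perm)
    aps.append(average_precision(ranking, relevant))

print("AP if ties sorted by doc name:", average_precision(["d9"] + sorted(tie_group), relevant))
print("AP averaged over all tie orderings:", statistics.mean(aps))
```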

21 citations


Book ChapterDOI
20 Sep 2010
TL;DR: A small case study is presented in which a cluster of 15 low-cost machines is used to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort.
Abstract: We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
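The sequential-scanning idea can be sketched in a few lines. The snippet below is a pure-Python mimic of the map/reduce pattern, not the MIREX code itself (which runs on a Hadoop cluster), and a crude term-count score stands in for a real retrieval model.
```python
# Pure-Python mimic of MapReduce-style sequential scanning: the map step scores
# every document against every query in one pass; reduce keeps the best per query.
from collections import defaultdict
import heapq

queries = {"q1": {"web", "search"}, "q2": {"robot", "vision"}}

def map_step(doc_id, text):
    terms = text.lower().split()
    for qid, qterms in queries.items():
        score = sum(terms.count(t) for t in qterms)   # crude term-count score
        if score > 0:
            yield qid, (score, doc_id)

def reduce_step(qid, scored_docs, k=10):
    return qid, heapq.nlargest(k, scored_docs)

if __name__ == "__main__":
    corpus = {"d1": "web search on a large web crawl",
              "d2": "robot vision and place recognition"}
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():               # sequential scan of the collection
        for qid, pair in map_step(doc_id, text):
            grouped[qid].append(pair)
    for qid, docs in grouped.items():
        print(reduce_step(qid, docs))
```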

19 citations


Book ChapterDOI
20 Sep 2010
TL;DR: A Persian-English comparable corpus from two independent news collections is built using the similarity of the document topics and their publication dates to align the documents in these sets.
Abstract: Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However, the Persian language does not have rich multilingual resources due to some of its special features and difficulties in constructing the corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We use the similarity of the document topics and their publication dates to align the documents in these sets. We tried several alternatives for constructing the comparable corpora and assessed the quality of the corpora using different criteria. Evaluation results show the high quality of the aligned documents, and using the Persian-English comparable corpus for extracting translation knowledge seems promising.
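A rough sketch of the alignment idea under strong simplifying assumptions: both collections are taken to be already represented in a shared term space (for instance via a bilingual lexicon), so alignment reduces to cosine similarity within a publication-date window. The window size, threshold and vectors below are invented, not the paper's actual criteria.
```python
# Illustrative document alignment by topic similarity and publication date.
import math
from datetime import date

def cosine(a, b):
    dot = sum(a.get(t, 0) * w for t, w in b.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(bbc_docs, hamshahri_docs, window_days=2, threshold=0.3):
    """Pair each English document with its best Persian counterpart."""
    pairs = []
    for e_id, (e_date, e_vec) in bbc_docs.items():
        best = max(
            ((cosine(e_vec, p_vec), p_id)
             for p_id, (p_date, p_vec) in hamshahri_docs.items()
             if abs((e_date - p_date).days) <= window_days),
            default=(0.0, None))
        if best[0] >= threshold:
            pairs.append((e_id, best[1], round(best[0], 2)))
    return pairs

bbc = {"e1": (date(2010, 5, 1), {"election": 0.8, "iran": 0.6})}
hamshahri = {"p1": (date(2010, 5, 2), {"election": 0.7, "iran": 0.7}),
             "p2": (date(2010, 5, 20), {"football": 1.0})}
print(align(bbc, hamshahri))
```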

19 citations


Book ChapterDOI
20 Sep 2010
TL;DR: StaLe deals with the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word forms and is compact, efficient and fast to apply to new languages.
Abstract: We present a dictionary- and corpus-independent statistical lemmatizer StaLe that deals with the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word forms. StaLe can be applied with little effort to languages lacking linguistic resources. We show the performance of StaLe both in lemmatization tasks alone and as a component in an IR system using several datasets and query types in four high resource languages. StaLe is competitive, reaching 88-108 % of gold standard performance of a commercial lemmatizer in IR experiments. Despite competitive performance, it is compact, efficient and fast to apply to new languages.
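The general idea behind candidate lemma generation can be sketched as suffix-rewrite rules with confidences; the rules, confidence values and English examples below are invented for illustration and are not StaLe's actual rule set or scoring.
```python
# Hypothetical sketch of statistical OOV lemmatization: candidate lemmas are
# generated by applying suffix-rewrite rules learned from a training lexicon.
SUFFIX_RULES = [            # (word-form suffix, lemma suffix, confidence)
    ("ies",  "y",  0.9),
    ("es",   "e",  0.5),
    ("es",   "",   0.4),
    ("s",    "",   0.8),
    ("ing",  "e",  0.4),
    ("ing",  "",   0.5),
]

def candidate_lemmas(word, max_candidates=3):
    cands = []
    for form_suf, lemma_suf, conf in SUFFIX_RULES:
        if word.endswith(form_suf) and len(word) > len(form_suf):
            cands.append((word[: -len(form_suf)] + lemma_suf, conf))
    cands.sort(key=lambda c: c[1], reverse=True)
    return cands[:max_candidates] or [(word, 1.0)]    # fall back to the form itself

for w in ["queries", "indexing", "documents"]:
    print(w, "->", candidate_lemmas(w))
```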

19 citations


Book ChapterDOI
20 Sep 2010
TL;DR: A novel evaluation model for IE is proposed and it is argued that it allows a correct appreciation of the degree of overlap between predicted and true segments, and a fair evaluation of the ability of a system to correctly identify segment boundaries.
Abstract: The issue of how to experimentally evaluate information extraction (IE) systems has received hardly any satisfactory solution in the literature. In this paper we propose a novel evaluation model for IE and argue that, among others, it allows (i) a correct appreciation of the degree of overlap between predicted and true segments, and (ii) a fair evaluation of the ability of a system to correctly identify segment boundaries. We describe the properties of this model, also by presenting a re-evaluation of the results of the CoNLL'03 and CoNLL'02 Shared Tasks on Named Entity Extraction.
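An illustrative example of overlap-aware scoring in this spirit, not the authors' exact model: each true segment is credited with its best token-level Jaccard overlap against the predicted segments, so partially correct boundaries earn partial credit instead of none.
```python
# Overlap-aware recall over segments given as (start, end) token offsets.
def jaccard(seg_a, seg_b):
    a, b = set(range(seg_a[0], seg_a[1])), set(range(seg_b[0], seg_b[1]))
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap_recall(true_segments, predicted_segments):
    if not true_segments:
        return 1.0
    return sum(max((jaccard(t, p) for p in predicted_segments), default=0.0)
               for t in true_segments) / len(true_segments)

true_segs = [(3, 6), (10, 12)]          # gold named-entity spans
pred_segs = [(4, 6), (10, 12)]          # system spans; the first misses one token
print(overlap_recall(true_segs, pred_segs))   # (2/3 + 1) / 2
```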

Book ChapterDOI
20 Sep 2010
TL;DR: Experiments carried out using runs from the CLEF-IP 2009 datasets show that PRES and Recall are more robust than MAP to incomplete relevance sets for this task, with a slight preference for PRES as the most robust evaluation metric for patent retrieval with respect to the completeness of the relevance set.
Abstract: Recent years have seen a growing interest in research into patent retrieval. One of the key issues in conducting information retrieval (IR) research is meaningful evaluation of the effectiveness of the retrieval techniques applied to the task under investigation. Unlike many existing well-explored IR tasks where the focus is on achieving high retrieval precision, patent retrieval is to a significant degree a recall-focused task. The standard evaluation metric used for patent retrieval evaluation tasks is currently mean average precision (MAP). However, this does not reflect system recall well. Meanwhile, the alternative of using the standard recall measure does not reflect user search effort, which is a significant factor in practical patent search environments. In recent work we introduced a novel evaluation metric for patent retrieval evaluation (PRES) [13]. This is designed to reflect both system recall and user effort. Analysis of PRES demonstrated its greater effectiveness in evaluating recall-oriented applications than standard MAP and Recall. One dimension of the evaluation of patent retrieval which has not previously been studied is the effect on the reliability of the evaluation metrics when relevance judgements are incomplete. We provide a study comparing the behaviour of PRES against the standard MAP and Recall metrics for varying incomplete judgements in patent retrieval. Experiments carried out using runs from the CLEF-IP 2009 datasets show that PRES and Recall are more robust than MAP for incomplete relevance sets for this task, with a slight preference for PRES as the most robust evaluation metric for patent retrieval with respect to the completeness of the relevance set.
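A compact sketch of both ingredients, under stated assumptions: PRES is computed in the form commonly cited from [13], treating relevant documents not retrieved within N_max as if they appeared just after the cut-off, and incompleteness is mimicked by thinning the relevance judgements; the ranking and qrels are invented toy data, not CLEF-IP runs.
```python
# Sketch: PRES and Recall under full vs. thinned (incomplete) relevance judgements.
import random

def pres(ranking, relevant, n_max):
    ranks = [i + 1 for i, d in enumerate(ranking[:n_max]) if d in relevant]
    n = len(relevant)
    missing = n - len(ranks)
    ranks += [n_max + i + 1 for i in range(missing)]        # assumed placement of misses
    return 1 - (sum(ranks) / n - (n + 1) / 2) / n_max

def recall(ranking, relevant, n_max):
    return len(set(ranking[:n_max]) & relevant) / len(relevant)

random.seed(0)
ranking = [f"d{i}" for i in range(1, 101)]                  # toy system ranking
qrels = {"d2", "d7", "d15", "d40", "d90"}                   # full judgements
sampled = set(random.sample(sorted(qrels), 3))              # incomplete judgements

for name, metric in [("PRES", pres), ("Recall", recall)]:
    print(name, round(metric(ranking, qrels, 100), 3),
          "-> incomplete:", round(metric(ranking, sampled, 100), 3))
```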

Proceedings Article
01 Jan 2010
TL;DR: This paper describes the robot vision track that has been proposed to the ImageCLEF 2010 participants, which addressed the problem of visual place classification, with a special focus on generalization.
Abstract: This paper describes the robot vision track that has been proposed to the ImageCLEF 2010 participants. The track addressed the problem of visual place classification, with a special focus on generalization. Participants were asked to classify rooms and areas of an office environment on the basis of image sequences captured by a stereo camera mounted on a mobile robot, under varying illumination conditions. The algorithms proposed by the participants had to answer the question "where are you?" (I am in the kitchen, in the corridor, etc.) when presented with a test sequence, acquired within the same building but on a different floor than the training sequence. The test data contained images of rooms seen during training, or additional rooms that were not imaged in the training sequence. The participants were asked to solve the problem separately for each test image (obligatory task). Additionally, results could also be reported for algorithms exploiting the temporal continuity of the image sequences (optional task). A total of seven groups participated in the challenge, with 42 runs submitted to the obligatory task, and 13 submitted to the optional task. The best result in the obligatory task was obtained by the Computer Vision and Geometry Laboratory, ETHZ, Switzerland, with an overall score of 677. The best result in the optional task was obtained by the Idiap Research Institute, Martigny, Switzerland, with an overall score of 2052.

Proceedings Article
01 Jan 2010
TL;DR: Standard Information Retrieval techniques comprising indexing and retrieval processes are used to retrieve patent documents for prior-art search, and the best results in terms of precision and recall are achieved with the combination of the invention-title, description and claims fields.
Abstract: In this paper, we report our approach to retrieving patent documents for prior-art search. We use standard Information Retrieval (IR) techniques comprising indexing and retrieval processes. We use various combinations of document fields for the query formulation. Based on the evaluation summary, we achieve the best result, in terms of precision and recall, for the combination of the invention-title, description and claims fields.

Proceedings Article
01 Jan 2010
TL;DR: In this article, the authors present the experiments carried out as part of their participation in the Paragraph Selection (PS) Task and Answer Selection (AS) Task of QA@CLEF 2010 ResPubliQA.
Abstract: The article presents the experiments carried out as part of the participation in the Paragraph Selection (PS) Task and Answer Selection (AS) Task of QA@CLEF 2010 ResPubliQA. Our system uses Apache Lucene for document retrieval. All test documents are indexed using Apache Lucene. Stop words are removed from each question and query words are identified to retrieve the most relevant documents using Lucene. Relevant paragraphs are selected from the retrieved documents based on the TF-IDF of the matching query words along with the n-gram overlap of the paragraph with the original question. Chunk boundaries are detected in the original question and key chunks are identified. Chunk boundaries are also detected in each sentence in a paragraph. The key chunks are matched in each sentence in a paragraph and relevant sentences are identified based on the key chunk matching score. Each question is analyzed to identify its possible answer type. The SRL Tool (Assert Tool Kit) [1] is applied to each sentence in a paragraph to assign semantic roles to each chunk. The Answer Extraction module identifies as the exact answer the chunk in a sentence whose semantic role matches the possible answer type for the question. The tasks have been carried out for English. The Paragraph Selection task has been evaluated on the test data with an overall accuracy score of 0.37 and a c@1 measure of 0.50. The Answer Extraction task has performed poorly, with an overall accuracy score of 0.16 and a c@1 measure of 0.26. Keywords: Lucene Index, Chunk Boundary, n-gram overlapping, SRL Tool
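A rough sketch of the paragraph-scoring idea described above, heavily simplified (the actual system uses Apache Lucene, chunk matching and SRL): paragraphs are ranked by the TF-IDF mass of the query words they contain plus an n-gram overlap bonus. The weighting and example paragraphs are invented.
```python
# Paragraph ranking by query-word TF-IDF plus an n-gram overlap bonus.
import math
from collections import Counter

def ngram_overlap(a_tokens, b_tokens, n=2):
    a = {tuple(a_tokens[i:i + n]) for i in range(len(a_tokens) - n + 1)}
    b = {tuple(b_tokens[i:i + n]) for i in range(len(b_tokens) - n + 1)}
    return len(a & b) / len(a) if a else 0.0

def rank_paragraphs(question, paragraphs, alpha=0.5):
    q_tokens = question.lower().split()
    tokenized = [p.lower().split() for p in paragraphs]
    df = Counter(t for toks in tokenized for t in set(toks))    # paragraph frequencies
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        tfidf = sum(tf[w] * math.log(len(paragraphs) / df[w])
                    for w in set(q_tokens) if w in tf)
        scores.append((tfidf + alpha * ngram_overlap(q_tokens, toks), i))
    return sorted(scores, reverse=True)

paras = ["the council regulation applies to imports of olive oil",
         "member states shall report annually on fisheries quotas"]
print(rank_paragraphs("which regulation applies to olive oil imports", paras))
```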


Book ChapterDOI
20 Sep 2010
TL;DR: This note motivates and proposes a metric that goes beyond mere lexical matching of system-produced descriptors against a ground truth, allowing for graded relevance and rewarding diversity in the list of descriptors returned.
Abstract: Entity profiling is the task of identifying and ranking descriptions of a given entity. The task may be viewed as one where the descriptions being sought are terms that need to be selected from a knowledge source (such as an ontology or thesaurus). In this case, entity profiling systems can be assessed by means of precision and recall values of the descriptive terms produced. However, recent evidence suggests that more sophisticated metrics are needed that go beyond mere lexical matching of system-produced descriptors against a ground truth, allowing for graded relevance and rewarding diversity in the list of descriptors returned. In this note, we motivate and propose such a metric.
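One way such a metric could look, purely as an illustration and not the metric the note proposes: graded gains per descriptor, a rank discount, and a damping factor when a descriptor only repeats an aspect already covered earlier in the list. The aspect labels, grades and weights below are invented.
```python
# Illustrative graded, diversity-aware scoring of a ranked descriptor list.
import math

def graded_diversity_score(ranked_descriptors, relevance, aspects, redundancy=0.5):
    seen_aspects, total = set(), 0.0
    for rank, term in enumerate(ranked_descriptors, start=1):
        gain = relevance.get(term, 0.0)                     # graded, not binary
        novel = aspects.get(term, set()) - seen_aspects
        if not novel:
            gain *= redundancy                              # penalise redundancy
        seen_aspects |= aspects.get(term, set())
        total += gain / math.log2(rank + 1)                 # rank discount
    return total

relevance = {"pianist": 3, "composer": 3, "musician": 2, "teacher": 1}
aspects   = {"pianist": {"music"}, "composer": {"music"}, "musician": {"music"},
             "teacher": {"education"}}
print(graded_diversity_score(["pianist", "composer", "teacher"], relevance, aspects))
```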

Proceedings Article
01 Jan 2010
TL;DR: The general strategy is to find the supporting word context from the query and candidate passages during paragraph selection, using the techniques of state-of-the-art publicly available question answering systems and the random projection implementation in the Semantic Vectors package to evaluate the word context.
Abstract: This year we participated in the English monolingual paragraph selection task at ResPubliQA 2010. Our general strategy is to find the supporting word context from the query and candidate passages during the paragraph selection. We use the techniques of state-of-the-art publicly available question answering systems, i.e. Open Ephyra and JIRS, and the random projection implementation in the Semantic Vectors package to evaluate the word context. To strengthen the paragraph selection, besides the context evaluation, we also use n-gram overlapping and textual containment. Our approach has a c@1 measure of 0.73 for our pattern-based context configuration and 0.64 for our n-gram-based context configuration.

Book ChapterDOI
20 Sep 2010
TL;DR: The single most important step forward for multilingual and multimedia information access which PROMISE will work towards is to provide an open evaluation infrastructure in order to support automation and collaboration in the evaluation process.
Abstract: Participative Research laboratory for Multimedia and Multilingual Information Systems Evaluation (PROMISE) is a Network of Excellence, starting in conjunction with this first independent CLEF 2010 conference, and designed to support and develop the evaluation of multilingual and multimedia information access systems, largely through the activities taking place in Cross-Language Evaluation Forum (CLEF) today, and taking it forward in important new ways. PROMISE is coordinated by the University of Padua, and comprises 10 partners: the Swedish Institute for Computer Science, the University of Amsterdam, Sapienza University of Rome, University of Applied Sciences of Western Switzerland, the Information Retrieval Facility, the Zurich University of Applied Sciences, the Humboldt University of Berlin, the Evaluation and Language Resources Distribution Agency, and the Centre for the Evaluation of Language Communication Technologies. The single most important step forward for multilingual and multimedia information access which PROMISE will work towards is to provide an open evaluation infrastructure in order to support automation and collaboration in the evaluation process.

Book ChapterDOI
20 Sep 2010
TL;DR: It is claimed that IR is still in its scientific infancy, and experimentation should be used for hypothesis generation and testing in general, in order to accumulate a better understanding of the retrieval process and to develop a broader theoretic foundation for the field.
Abstract: Evaluation has always played a major role in IR research, as a means for judging the quality of competing models. Lately, however, we have seen an over-emphasis on experimental results, thus favoring engineering approaches aimed at tuning performance and neglecting other scientific criteria. A recent study investigated the validity of experimental results published at major conferences, showing that for 95% of the papers using standard test collections, the claimed improvements were only relative, and the resulting quality was inferior to that of the top performing systems [AMWZ09]. In this talk, it is claimed that IR is still in its scientific infancy. Despite the extensive efforts in evaluation initiatives, the scientific insights gained are still very limited - partly due to shortcomings in the design of the testbeds. From a general scientific standpoint, using test collections for evaluation only is a waste of resources. Instead, experimentation should be used for hypothesis generation and testing in general, in order to accumulate a better understanding of the retrieval process and to develop a broader theoretic foundation for the field.

Book ChapterDOI
20 Sep 2010
TL;DR: A comparative analysis of different log file types and their potential for gathering information about user behavior in a multilingual information system is presented, and the Europeana Clickstream Logger is introduced.
Abstract: In this paper, a comparative analysis of different log file types and their potential for gathering information about user behavior in a multilingual information system is presented. It starts with a discussion of potential questions to be answered in order to form an appropriate view of user needs and requirements in a multilingual information environment and the possibilities of gaining this information from log files. Based on actual examples from the Europeana portal, we compare and contrast different types of log files and the information gleaned from them. We then present the Europeana Clickstream Logger, which logs and gathers extended information on user behavior, and show first examples of the data collection possibilities.

Proceedings Article
01 Jan 2010
TL;DR: Each image in the test collection was annotated with a set of MeSH headings from two different sources: human-assigned MEDLINE and external data from outside sources.
Abstract: Over the past several years, our team has focused its efforts on improving retrieval precision performance by mixing visual and textual information. This year, we chose to explore ways in which we could use external data to enrich our retrieval system’s data set; specifically, we annotated each image in the test collection with a set of MeSH headings from two different sources: human-assigned MEDLINE

Proceedings Article
01 Jan 2010
TL;DR: In this article, the results of the UIUC-IBM team in participating in the medical case retrieval task of ImageCLEF 2010 were reported, and they experimented with multiple methods to leverage medical ontology and user (physician) feedback.
Abstract: This paper reports the experiment results of the UIUC-IBM team in participating in the medical case retrieval task of ImageCLEF 2010. We experimented with multiple methods to leverage medical ontology and user (physician) feedback; both have worked very well, achieving the best retrieval performance among all the submissions.

Proceedings Article
01 Jan 2010
TL;DR: A new feature set is presented that is a balanced and extended version of the well-known Vector Space Model (VSM), and it is shown that this representation outperforms the original VSM and its attribute-selected version as well.
Abstract: In online communities like Wikipedia, where content editing is available to every visitor, users who deliberately make incorrect, vandal comments are sure to turn up. In this paper we propose a strong feature set and a method that can handle this problem and automatically decide whether an edit is a vandal contribution or not. We present a new feature set that is a balanced and extended version of the well-known Vector Space Model (VSM) and show that this representation outperforms the original VSM and its attribute-selected version as well. Moreover, we describe other features that we used in our vandalism detection system and a parameter estimation method for a weighted voting metaclassifier.
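A hedged sketch of the overall shape of such a system; the feature names, base classifiers, weights and thresholds below are invented and are not the paper's features or its parameter estimation method. It shows hand-crafted edit features feeding two base classifiers whose votes are combined by a weighted-voting metaclassifier.
```python
# Toy vandalism detector: edit features -> base classifiers -> weighted voting.
def edit_features(old_text, new_text):
    added = new_text.replace(old_text, "")
    letters = [c for c in added if c.isalpha()]
    return {
        "size_delta": len(new_text) - len(old_text),
        "upper_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
        "bad_words": sum(w in {"stupid", "idiot"} for w in added.lower().split()),
    }

def rule_classifier(f):       # base classifier 1: shouting / profanity
    return 1 if f["upper_ratio"] > 0.6 or f["bad_words"] > 0 else 0

def size_classifier(f):       # base classifier 2: suspicious mass deletion
    return 1 if f["size_delta"] < -500 else 0

def weighted_vote(f, classifiers=((rule_classifier, 0.7), (size_classifier, 0.3)),
                  threshold=0.5):
    score = sum(w * clf(f) for clf, w in classifiers)
    return "vandalism" if score >= threshold else "regular"

f = edit_features("A sourced paragraph about physics.",
                  "A sourced paragraph about physics. YOU ARE ALL STUPID")
print(f, weighted_vote(f))
```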