
Showing papers presented at "Cross-Language Evaluation Forum in 2008"


Book ChapterDOI
17 Sep 2008
TL;DR: The GeoCLEF 2008 task presented twenty-five geographically challenging search topics in English, German and Portuguese; eleven participants submitted 131 runs based on a variety of approaches, including sample documents, named entity extraction and ontology-based retrieval.
Abstract: GeoCLEF is an evaluation task running under the scope of the Cross Language Evaluation Forum (CLEF). The purpose of GeoCLEF is to test and evaluate cross-language geographic information retrieval (GIR). The GeoCLEF 2008 task presented twenty-five geographically challenging search topics for English, German and Portuguese. Eleven participants submitted 131 runs, based on a variety of approaches, including sample documents, named entity extraction and ontology-based retrieval. The evaluation methodology and results are presented in the paper.

133 citations


Book ChapterDOI
01 May 2008
TL;DR: Experimental results on CLEF-2007 corpora (domain-specific track) show that the dictionary adaptation mechanisms appear quite effective in the CLIR framework, exceeding in certain cases the performance of much more complex Machine Translation systems and even the performance of the monolingual baseline.
Abstract: Our participation in CLEF07 (Domain-specific Track) was motivated this year by assessing several query translation and expansion strategies that we recently designed and developed. One line of research and development was to use our own Statistical Machine Translation system (called Matrax) and its intermediate outputs to perform query translation and disambiguation. Our idea was to benefit from Matrax's flexibility to output more than one plausible translation and to train its Language Model component on the CLEF07 target corpora. The second line of research consisted of designing algorithms to adapt an initial, general probabilistic dictionary to a particular pair (query, target corpus); this constitutes an extreme viewpoint on the "bilingual lexicon extraction and adaptation" topic. For this strategy, our main contributions lie in a pseudo-feedback algorithm and an EM-like optimisation algorithm that realize this adaptation. A third axis was to evaluate the potential impact of "Lexical Entailment" models in a cross-lingual framework, as they had previously been used only in a monolingual setting. Experimental results on CLEF-2007 corpora (domain-specific track) show that the dictionary adaptation mechanisms appear quite effective in the CLIR framework, exceeding in certain cases the performance of much more complex Machine Translation systems and even the performance of the monolingual baseline. In most cases, Lexical Entailment models, used as query expansion mechanisms, also turned out to be beneficial.

119 citations
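
As an illustration of the dictionary adaptation idea described above, here is a minimal Python sketch of pseudo-feedback re-weighting, not the paper's exact EM procedure: translation candidates from a general dictionary gain probability mass when they occur in the top-ranked documents of an initial retrieval pass. All names and the smoothing constant are illustrative.

```python
from collections import Counter

def adapt_dictionary(prior, feedback_docs, smoothing=0.1):
    """Re-weight a general bilingual dictionary against a target corpus.

    prior:         {source_term: {candidate_translation: probability}}
    feedback_docs: tokenized target-language documents, e.g. the
                   top-ranked results of an initial retrieval pass.
    Candidates that occur in the feedback documents gain mass;
    the smoothing constant keeps unseen candidates alive.
    """
    tf = Counter(tok for doc in feedback_docs for tok in doc)
    adapted = {}
    for source, candidates in prior.items():
        scores = {t: p * (smoothing + tf[t]) for t, p in candidates.items()}
        z = sum(scores.values()) or 1.0
        adapted[source] = {t: s / z for t, s in scores.items()}
    return adapted

# Toy usage: a finance-heavy feedback set shifts "bank" toward
# its financial sense.
prior = {"bank": {"banque": 0.5, "rive": 0.5}}
feedback = [["banque", "credit", "taux"], ["banque", "pret"]]
print(adapt_dictionary(prior, feedback))
```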


Proceedings Article
01 Jan 2008
TL;DR: In this cross-lingual extension of ESA, the cross-language links of Wikipedia are used in order to map the ESA vectors between different languages, thus allowing retrieval across languages.
Abstract: We participated in the monolingual and bilingual CLEF Ad-Hoc Retrieval Tasks, using a novel extension of the by now well-known Explicit Semantic Analysis (ESA) approach. We call this extension Cross-Language Explicit Semantic Analysis (CL-ESA), as it allows ESA to be applied in a cross-lingual information retrieval setting. In essence, ESA represents documents as vectors in the space of Wikipedia articles, using the tf-idf measure to capture how "important" a Wikipedia article is for a specific word. The interesting property of ESA is that arbitrary documents can be represented as a vector with respect to the Wikipedia article space. ESA thus replaces the standard bag-of-words (BOW) model for retrieval. In our cross-lingual extension of ESA, the cross-language links of Wikipedia are used to map the ESA vectors between different languages, thus allowing retrieval across languages. Our results are far behind those of other systems on the monolingual and bilingual ad-hoc retrieval tasks, but our motivation was to find out the potential of the CL-ESA approach using a first, unoptimized implementation.

95 citations
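
The ESA representation and its cross-language mapping can be sketched compactly. The following assumes a precomputed inverted index of tf-idf weights over Wikipedia articles and a cross-language link table; both structures and their names are illustrative, not the authors' actual implementation.

```python
import math
from collections import Counter

def esa_vector(tokens, wiki_index):
    """ESA: represent a text as a weighted vector over Wikipedia
    articles. wiki_index maps a term to {article_id: tfidf_weight},
    i.e. how "important" each article is for that term."""
    vec = Counter()
    for term, freq in Counter(tokens).items():
        for article, weight in wiki_index.get(term, {}).items():
            vec[article] += freq * weight
    return vec

def to_target_language(vec, langlinks):
    """CL-ESA: map vector dimensions to the target language via
    Wikipedia's cross-language links ({article_id: linked_id});
    dimensions without a link are dropped."""
    return Counter({langlinks[a]: w for a, w in vec.items() if a in langlinks})

def cosine(u, v):
    """Rank documents against a (mapped) query vector by cosine."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```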


Book ChapterDOI
01 May 2008
TL;DR: The general photographic ad-hoc retrieval task of the ImageCLEF 2007 evaluation campaign provides both the resources and the framework necessary to perform comparative laboratory-style evaluation of visual information retrieval from generic photographic collections.
Abstract: The general photographic ad-hoc retrieval task of the ImageCLEF 2007 evaluation campaign is described. This task provides both the resources and the framework necessary to perform comparative laboratory-style evaluation of visual information retrieval from generic photographic collections. In 2007, the evaluation objective concentrated on retrieval of lightly annotated images, a new challenge that attracted a large number of submissions: a total of 20 participating groups submitted 616 system runs. This paper summarises the components used in the benchmark, including the document collection and the search tasks, and presents an analysis of the submissions and the results.

92 citations


Book ChapterDOI
01 May 2008
TL;DR: The exercise description, changes in the evaluation methodology with respect to the first edition, and the results of this second edition (AVE 2007) show evidence of the potential gain that more sophisticated AV modules introduce in the task of QA.
Abstract: The Answer Validation Exercise at the Cross Language Evaluation Forum is aimed at developing systems able to decide whether the answer of a Question Answering system is correct or not. We present here the exercise description, the changes in the evaluation methodology with respect to the first edition, and the results of this second edition (AVE 2007). The changes in the evaluation methodology had two objectives: the first was to quantify the gain in performance when more sophisticated validation modules are introduced in QA systems; the second was to bring systems based on Textual Entailment to the Automatic Hypothesis Generation problem, which is not itself part of the Recognising Textual Entailment (RTE) task but is a need of the Answer Validation setting. Nine groups participated with 16 runs in four different languages. Compared with the QA systems, the results show evidence of the potential gain that more sophisticated AV modules introduce in the task of QA.

88 citations


Book ChapterDOI
17 Sep 2008
TL;DR: The exercise description, the changes in the evaluation with respect to the last edition, and the results show evidence of the potential gain that more sophisticated AV modules might introduce in the task of QA.
Abstract: The Answer Validation Exercise at the Cross Language Evaluation Forum (CLEF) is aimed at developing systems able to decide whether the answer of a Question Answering (QA) system is correct or not. We present here the exercise description, the changes in the evaluation with respect to the last edition, and the results of this third edition (AVE 2008). Last year's changes allowed us to measure the possible gain in performance obtained by using AV systems as the selection method of QA systems. In this edition we also wanted to reward AV systems able to detect whether all the candidate answers to a question are incorrect. Nine groups participated with 24 runs in five different languages, and compared with the QA systems, the results show evidence of the potential gain that more sophisticated AV modules might introduce in the task of QA.

84 citations


Book ChapterDOI
17 Sep 2008
TL;DR: The medical image retrieval task of ImageCLEF is in its fifth year and participation continues to increase, with a total of 37 registered research groups; the best results in terms of MAP were similar for textual and multi-modal approaches, whereas early precision was better for some multi-modal approaches.
Abstract: The medical image retrieval task of ImageCLEF is in its fifth year and participation continues to increase, to a total of 37 registered research groups. About half the registered groups finally submitted results. The main change in 2008 was the use of a new database containing images from the medical scientific literature (articles from the journals Radiology and Radiographics). Besides the images, the figure captions and the part of the caption referring to a particular sub-figure were supplied, as well as access to the full-text articles in HTML. All texts were in English and the topics were supplied in German, French, and English. Thirty topics were made available, ten in each of the categories visual, mixed, and semantic. Most groups concentrated on fully automatic retrieval. Only three groups submitted a total of six manual or interactive runs, which did not show a performance increase over automatic approaches. In previous years, multi-modal combinations were the most frequent submissions, but in 2008 text-only runs were clearly more numerous. Only very few fully visual runs were submitted, and none of the fully visual runs performed extremely well. Part of these tendencies might be due to the semantic topics and the extremely well annotated database. The best results in terms of MAP were similar for textual and multi-modal approaches, whereas early precision was better for some multi-modal approaches.

83 citations


Book ChapterDOI
17 Sep 2008
TL;DR: Findings include that the choice of annotation language is almost negligible and that the best runs combine concept- and content-based retrieval methods.
Abstract: ImageCLEFphoto 2008 is an ad-hoc photo retrieval task and part of the ImageCLEF evaluation campaign. This task provides both the resources and the framework necessary to perform comparative laboratory-style evaluation of visual information retrieval systems. In 2008, the evaluation task concentrated on promoting diversity within the top 20 results from a multilingual image collection. This new challenge attracted a record number of submissions: a total of 24 participating groups submitted 1,042 system runs. Findings include that the choice of annotation language is almost negligible and that the best runs combine concept- and content-based retrieval methods.

82 citations


Book ChapterDOI
17 Sep 2008
TL;DR: The QA campaign at CLEF 2008 was mainly the same as that proposed last year, but the main task still proved very challenging for participating systems: the best overall accuracy dropped significantly in the multilingual subtasks but increased a little in the monolingual subtasks.
Abstract: The QA campaign at CLEF 2008 [1] was mainly the same as that proposed last year. The results and the analyses reported by last year's participants suggested that the changes introduced in the previous campaign had led to a drop in systems' performance. So for this year's competition it was decided to practically replicate last year's exercise. Following last year's experience, some QA pairs were grouped in clusters. Every cluster was characterized by a topic (not given to participants). The questions from a cluster contained coreferences between one of them and the others. Moreover, as last year, the systems were given the possibility to search for answers in Wikipedia as a document corpus besides the usual newswire collection. In addition to the main task, three additional exercises were offered, namely the Answer Validation Exercise (AVE) and the Question Answering on Speech Transcriptions (QAST), which continued last year's successful pilots, together with the new Word Sense Disambiguation for Question Answering (QA-WSD) task. As a general remark, the main task still proved to be very challenging for participating systems. In a shallow comparison with last year's results, the best overall accuracy dropped significantly in the multilingual subtasks, from 42% to 19%, but increased a little in the monolingual subtasks, going from 54% to 63%.

81 citations


Book ChapterDOI
17 Sep 2008
TL;DR: Two retrieval models based on semantic relatedness, SR-Text and SR-Word, are evaluated by comparing their performance to a statistical model as implemented by Lucene; translating via Wikipedia's cross-language links especially improves retrieval performance in cases where the machine translation system incorrectly translates query terms.
Abstract: The main objective of our experiments in the domain-specific track at CLEF 2008 is to utilize semantic knowledge from collaborative knowledge bases such as Wikipedia and Wiktionary to improve the effectiveness of information retrieval. While Wikipedia has already been used in IR, the application of Wiktionary in this task is new. We evaluate two retrieval models based on semantic relatedness, SR-Text and SR-Word, by comparing their performance to a statistical model as implemented by Lucene. We refer to Wikipedia article titles and Wiktionary word entries as concepts and map query and document terms to concept vectors, which are then used to compute the document relevance. In the bilingual task, we translate the English topics into the document language, i.e. German, by using machine translation. For SR-Text, we alternatively perform the translation process by using cross-language links in Wikipedia, whereby the terms are directly mapped to concept vectors in the target language. The evaluation shows that the latter approach especially improves the retrieval performance in cases where the machine translation system incorrectly translates query terms.

76 citations


Book ChapterDOI
01 May 2008
TL;DR: The results and the analyses reported by the participants suggest that the introduction of Wikipedia and the topic-related questions led to a drop in systems' performance.
Abstract: The fifth QA campaign at CLEF [1], which had its first edition in 2003, offered not only a main task but also an Answer Validation Exercise (AVE) [2], which continued last year's pilot, and a new pilot: the Question Answering on Speech Transcripts (QAST) [3, 15]. The main task was characterized by its focus on cross-linguality, while covering as many European languages as possible. As a novelty, some QA pairs were grouped in clusters. Every cluster was characterized by a topic (not given to participants). The questions from a cluster possibly contain co-references between one of them and the others. Finally, the need for searching for answers in web formats was met by introducing Wikipedia as a document corpus. The results and the analyses reported by the participants suggest that the introduction of Wikipedia and the topic-related questions led to a drop in systems' performance.

Book ChapterDOI
01 May 2008
TL;DR: The CLEF-2007 Cross-Language Speech Retrieval (CL-SR) track included two tasks: to identify topically coherent segments of English interviews in a known-boundary condition and to identify time stamps marking the beginning of topically relevant passages in Czech interviews in an unknown- boundary condition.
Abstract: The CLEF-2007 Cross-Language Speech Retrieval (CL-SR) track included two tasks: to identify topically coherent segments of English interviews in a known-boundary condition, and to identify time stamps marking the beginning of topically relevant passages in Czech interviews in an unknown-boundary condition. Six teams participated in the English evaluation, performing both monolingual and cross-language searches of ASR transcripts, automatically generated metadata, and manually generated metadata. Four teams participated in the Czech evaluation, performing monolingual searches of automatic speech recognition transcripts.

Book ChapterDOI
17 Sep 2008
TL;DR: This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval using only Wikipedia to obtain translations, where queries are mapped to Wikipedia concepts and the corresponding translations of these concepts in the target language are used to create the final query.
Abstract: This paper presents WikiTranslate, a system which performs query translation for cross-lingual information retrieval (CLIR) using only Wikipedia to obtain translations. Queries are mapped to Wikipedia concepts and the corresponding translations of these concepts in the target language are used to create the final query. WikiTranslate is evaluated by searching with topics formulated in Dutch, French and Spanish in an English data collection. The system achieved a performance of 67% compared to the monolingual baseline.
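
A minimal sketch of the translation step described above, assuming two lookup tables extracted from a Wikipedia dump beforehand (a term-to-concept title index and the cross-language links); the names are illustrative and the real system's concept mapping is richer.

```python
def wiki_translate(query_terms, title_index, langlinks):
    """Translate a query using only Wikipedia.

    title_index: {source_term: article_title} -- the term-to-concept
                 mapping (e.g. via title and redirect matching).
    langlinks:   {article_title: target_language_title} -- the
                 cross-language links.
    Terms without a translatable concept are kept as-is.
    """
    translated = []
    for term in query_terms:
        concept = title_index.get(term)
        translated.append(langlinks.get(concept, term) if concept else term)
    return " ".join(translated)

# Toy Dutch -> English example:
title_index = {"geneeskunde": "Geneeskunde"}
langlinks = {"Geneeskunde": "Medicine"}
print(wiki_translate(["geneeskunde", "CLEF"], title_index, langlinks))
# -> Medicine CLEF
```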

Book ChapterDOI
17 Sep 2008
TL;DR: The objectives and organization of the CLEF 2008 Ad Hoc track are described and the main characteristics of the tasks offered to test monolingual and cross-language textual document retrieval systems are discussed.
Abstract: We describe the objectives and organization of the CLEF 2008 Ad Hoc track and discuss the main characteristics of the tasks offered to test monolingual and cross-language textual document retrieval systems. The track was changed considerably this year with the introduction of tasks with new document collections consisting of (i) library catalog records derived from The European Library and (ii) non-European language data, plus a task offering the chance to test retrieval with word-sense-disambiguated data. The track was thus structured in three distinct streams, named TEL@CLEF, Persian@CLEF and Robust WSD. The results obtained for each task are presented and statistical analyses are given.

Book ChapterDOI
01 May 2008
TL;DR: This paper describes an attempt to build a Cross-Lingual Information Retrieval (CLIR) system as part of the Indian language sub-task of the main Ad-Hoc monolingual and bilingual track in the CLEF competition.
Abstract: This paper describes our attempt to build a Cross-Lingual Information Retrieval (CLIR) system as part of the Indian language sub-task of the main Ad-Hoc monolingual and bilingual track in the CLEF competition. In this track, the task required retrieval of relevant documents from an English corpus in response to a query expressed in different Indian languages, including Hindi, Tamil, Telugu, Bengali and Marathi. Groups participating in this track were required to submit an English-to-English monolingual run and a Hindi-to-English bilingual run, with optional runs in the rest of the languages. Our submission consisted of a monolingual English run and a Hindi-to-English cross-lingual run. We used a word alignment table, learnt by a Statistical Machine Translation (SMT) system trained on aligned parallel sentences, to map a query in the source language into an equivalent query in the language of the document collection. Relevant documents were then retrieved using a Language Modeling based retrieval algorithm. On the CLEF 2007 data set, our official cross-lingual performance was 54.4% of the monolingual performance, and in post-submission experiments we found that it can be significantly improved, up to 76.3%.
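
The two components described, query mapping via a word alignment table and language-modeling retrieval, can be sketched as follows. The Dirichlet-smoothed scorer is a standard LM variant chosen for illustration; the paper does not specify its exact formulation, and all names are illustrative.

```python
import math
from collections import Counter

def translate_query(query, align_table, top_k=2):
    """Map each source word to its top-k translations from a word
    alignment table ({source: {target: probability}}) learnt by an
    SMT system on parallel sentences; untranslatable words pass
    through unchanged."""
    out = []
    for w in query:
        cands = align_table.get(w, {w: 1.0})
        best = sorted(cands.items(), key=lambda kv: -kv[1])[:top_k]
        out.extend(t for t, _ in best)
    return out

def lm_score(query, doc, coll_tf, coll_len, mu=2000.0):
    """Query-likelihood with Dirichlet smoothing -- a standard
    language-modeling retrieval scorer."""
    tf, dlen = Counter(doc), len(doc)
    score = 0.0
    for w in query:
        p_coll = coll_tf.get(w, 0) / coll_len
        p = (tf[w] + mu * p_coll) / (dlen + mu)
        score += math.log(max(p, 1e-12))
    return score
```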


Book ChapterDOI
01 May 2008
TL;DR: This paper describes the medical image retrieval and medical image annotation tasks of ImageCLEF 2007, with separate sections describing each task's participation and an evaluation of the major findings from its results.
Abstract: This paper describes the medical image retrieval and medical image annotation tasks of ImageCLEF 2007. Separate sections describe each of the two tasks, giving the participation and an evaluation of major findings from the results of each. A total of 13 groups participated in the medical retrieval task and 10 in the medical annotation task. The medical retrieval task added two new data sets for a total of over 66,000 images. Topics were derived from a log file of the PubMed biomedical literature search system, creating realistic information needs with a clear user model. In 2007 the medical annotation task was organized in a new format: a hierarchical classification had to be performed, and classification could be stopped at any hierarchy level. This required algorithms to change significantly and to integrate a confidence level into their decisions in order to judge where to stop classification and avoid making mistakes in the hierarchy. Scoring took into account errors and unclassified parts.

Book ChapterDOI
17 Sep 2008
TL;DR: An overview of the wikipediaMM task's resources, topics, assessments, participants' approaches, and main results is presented.
Abstract: The wikipediaMM task provides a testbed for the system-oriented evaluation of ad-hoc retrieval from a large collection of Wikipedia images. It became a part of the ImageCLEF evaluation campaign in 2008 with the aim of investigating the use of visual and textual sources in combination for improving retrieval performance. This paper presents an overview of the task's resources, topics, assessments, participants' approaches, and main results.

Book ChapterDOI
01 May 2008
TL;DR: The objectives and organization of the CLEF 2007 Ad Hoc track are described and the main characteristics of the tasks offered to test monolingual and cross-language textual document retrieval systems are discussed.
Abstract: We describe the objectives and organization of the CLEF 2007 Ad Hoc track and discuss the main characteristics of the tasks offered to test monolingual and cross-language textual document retrieval systems. The track was divided into two streams. The main stream offered mono- and bilingual tasks on target collections for central European languages (Bulgarian, Czech and Hungarian). Similarly to last year, a bilingual task that encouraged system testing with non-European languages against English documents was also offered; this year, particular attention was given to Indian languages. The second stream, designed for more experienced participants, offered mono- and bilingual "robust" tasks with the objective of privileging experiments which achieve good stable performance over all queries rather than high average performance. These experiments re-used CLEF test collections from previous years in three languages (English, French, and Portuguese). The performance achieved for each task is presented and discussed.

Book ChapterDOI
01 May 2008
TL;DR: A system for unsupervised morpheme analysis, which takes a plain list of words as input and returns a list of labelled morphemic segments for each word, is presented together with the results it obtained at Morpho Challenge 2007.
Abstract: This paper describes a system for unsupervised morpheme analysis and the results it obtained at Morpho Challenge 2007. The system takes a plain list of words as input and returns a list of labelled morphemic segments for each word. Morphemic segments are obtained by an unsupervised learning process which can directly be applied to different natural languages. Results obtained at competition 1 (evaluation of the morpheme analyses) are better in English, Finnish and German than in Turkish. For information retrieval (competition 2), the best results are obtained when indexing is performed using Okapi (BM25) weighting for all morphemes minus those belonging to an automatic stop list made of the most common morphemes.
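
The indexing scheme described for competition 2 can be sketched as BM25 scoring over morpheme-segmented documents, with the most common morphemes removed via an automatic stop list. A minimal sketch: the cutoff and parameters are illustrative, and the stop list is applied by filtering query and documents before scoring.

```python
import math
from collections import Counter

def automatic_stoplist(docs, cutoff=50):
    """The most common morphemes across the collection; indexing
    then uses all morphemes minus this list (the cutoff of 50 is
    an illustrative choice, not the paper's)."""
    df = Counter(m for doc in docs for m in set(doc))
    return {m for m, _ in df.most_common(cutoff)}

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 over morpheme-segmented documents."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for m in set(query):
        df = sum(1 for d in docs if m in d)
        if df == 0 or tf[m] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[m] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[m] * (k1 + 1) / norm
    return score
```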

Book ChapterDOI
17 Sep 2008
TL;DR: The results show that query-adapted methods are more efficient than non-adapted methods, that visual-only runs are more difficult to diversify than text-only and text-image runs, and finally that only a few methods maximize both the precision and the cluster recall at 20 documents.
Abstract: This article compares eight different diversity methods: 3 based on visual information, 1 based on date information, 3 adapted to each topic based on location and visual information, and finally, for completeness, 1 based on random permutation. To compare the effectiveness of these methods, we apply them to 26 runs obtained with varied methods from different research teams and based on different modalities. We then discuss the results of the more than 200 runs obtained. The results show that query-adapted methods are more efficient than non-adapted methods, that visual-only runs are more difficult to diversify than text-only and text-image runs, and finally that only a few methods maximize both the precision and the cluster recall at 20 documents.
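
The two measures traded off in this comparison can be computed as follows; a minimal sketch with illustrative names, assuming relevance judgments that assign each relevant document to a sub-topic cluster.

```python
def precision_at_k(ranking, relevant, k=20):
    """P@20: fraction of the top-k results that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def cluster_recall_at_k(ranking, doc_cluster, n_clusters, k=20):
    """CR@20: fraction of a topic's sub-topic clusters covered in
    the top k. doc_cluster maps relevant documents to their
    cluster (sub-topic) identifier."""
    covered = {doc_cluster[d] for d in ranking[:k] if d in doc_cluster}
    return len(covered) / n_clusters
```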

Book ChapterDOI
17 Sep 2008
TL;DR: The VideoCLEF track, introduced in 2008, aims to develop and evaluate tasks related to analysis of and access to multilingual multimedia content; future years will aim to expand the corpus and the class label list, as well as extend the track to additional tasks.
Abstract: The VideoCLEF track, introduced in 2008, aims to develop and evaluate tasks related to analysis of and access to multilingual multimedia content. In its first year, VideoCLEF piloted the Vid2RSS task, whose main subtask was the classification of dual-language video (Dutch-language television content featuring English-speaking experts and studio guests). The task offered two additional discretionary subtasks: feed translation and automatic keyframe extraction. Task participants were supplied with Dutch archival metadata, Dutch speech transcripts, English speech transcripts and ten thematic category labels, which they were required to assign to the test set videos. The videos were grouped by class label into topic-based RSS feeds, displaying title, description and keyframe for each video. Five groups participated in the 2008 VideoCLEF track. Participants were required to collect their own training data; both Wikipedia and general web content were used. Groups deployed various classifiers (SVM, Naive Bayes and k-NN) or treated the problem as an information retrieval task. Both the Dutch speech transcripts and the archival metadata performed well as sources of indexing features, but no group succeeded in exploiting combinations of feature sources to significantly enhance performance. A small-scale fluency/adequacy evaluation of the translation task output revealed the translation to be of sufficient quality to make it valuable to a non-Dutch-speaking English speaker. For keyframe extraction, the strategy chosen was to select the keyframe from the shot with the most representative speech transcript content. The automatically selected shots were shown, in a small user study, to be competitive with manually selected shots. Future years of VideoCLEF will aim to expand the corpus and the class label list, as well as to extend the track to additional tasks.
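
As a flavor of the classification subtask, here is a generic multinomial Naive Bayes sketch over tokenized transcripts, one of the classifier families mentioned above; it is not any participating group's actual system, and all names are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(transcripts, labels):
    """Train multinomial Naive Bayes on tokenized speech transcripts
    labelled with thematic categories."""
    class_tf = defaultdict(Counter)   # per-class token counts
    class_n = Counter(labels)         # per-class document counts
    for tokens, label in zip(transcripts, labels):
        class_tf[label].update(tokens)
    vocab = {t for tf in class_tf.values() for t in tf}
    return class_tf, class_n, vocab

def classify_nb(tokens, class_tf, class_n, vocab):
    """Assign the category with the highest posterior probability,
    using Laplace smoothing for unseen tokens."""
    total_docs = sum(class_n.values())
    best_label, best_lp = None, -math.inf
    for label, n in class_n.items():
        lp = math.log(n / total_docs)
        denom = sum(class_tf[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((class_tf[label][t] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label
```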

Book ChapterDOI
01 May 2008
TL;DR: The SINAI research group participated in the multilingual image retrieval subtask, used the MeSH ontology to expand the queries, and processed the set of collections using Information Gain in the same way as in ImageCLEFmed 2006.
Abstract: This paper describes the SINAI team's participation in the ImageCLEFmed campaign. The SINAI research group participated in the multilingual image retrieval subtask. The experiments accomplished are based on the integration of specific knowledge in the topics. We used the MeSH ontology to expand the queries. The expansion consists of searching for terms from the topic query in the MeSH ontology in order to add similar terms. We processed the set of collections using Information Gain (IG) in the same way as in ImageCLEFmed 2006. In our experiments mixing visual and textual information, we obtain better results than using only textual information. The weight of the textual information is very strong in this mixed strategy. In the experiments with a low textual weight, the use of IG improves the results obtained.
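
A minimal sketch of the two ingredients described above, ontology-based expansion and weighted mixing of textual and visual scores. The synonym lookup stands in for terms extracted from MeSH, and the default weight is illustrative, echoing only the observation that the textual weight should be strong.

```python
def expand_query(query_terms, mesh_synonyms):
    """Add MeSH entry terms for any query term found in the
    ontology. mesh_synonyms ({term: [synonyms]}) is a hypothetical
    lookup extracted from MeSH beforehand."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(mesh_synonyms.get(term.lower(), []))
    return expanded

def fuse_scores(text_scores, visual_scores, text_weight=0.8):
    """Linear fusion of textual and visual retrieval scores;
    missing scores default to 0."""
    docs = set(text_scores) | set(visual_scores)
    return {d: text_weight * text_scores.get(d, 0.0)
               + (1.0 - text_weight) * visual_scores.get(d, 0.0)
            for d in docs}
```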

Book ChapterDOI
01 May 2008
TL;DR: Two different question-answering systems on speech transcripts which participated in the QAst 2007 evaluation are presented, one of which replaces handcrafted rules for small text fragment (snippet) selection and answer extraction with an automatically generated research descriptor.
Abstract: In this paper, we present two different question-answering systems on speech transcripts which participated in the QAst 2007 evaluation. These two systems are based on a complete and multi-level analysis of both queries and documents. The first system uses handcrafted rules for small text fragment (snippet) selection and answer extraction. The second one replaces the handcrafting with an automatically generated research descriptor. A score based on those descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The evaluation results range from 17% to 39% accuracy, depending on the task.

Book ChapterDOI
17 Sep 2008
TL;DR: The Visual Concept Detection Task (VCDT) of ImageCLEF 2008 is described, in which a database of 2,827 images was manually annotated with 17 concepts; for each concept, the best runs obtained an AUC of 80% or above.
Abstract: The Visual Concept Detection Task (VCDT) of ImageCLEF 2008 is described. A database of 2,827 images was manually annotated with 17 concepts. Of these, 1,827 were used for training and 1,000 for testing the automated assignment of categories. In total, 11 groups participated and submitted 53 runs. The runs were evaluated using ROC curves, from which the Area Under the Curve (AUC) and Equal Error Rate (EER) were calculated. For each concept, the best runs obtained an AUC of 80% or above.
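
The evaluation measures can be computed from per-image concept scores as follows; a minimal sketch assuming at least one positive and one negative ground-truth label, with the EER approximated by the ROC point closest to the FPR = 1 - TPR diagonal.

```python
def roc_curve(scores, labels):
    """ROC points (FPR, TPR) from per-image concept scores and
    binary ground truth, sweeping the threshold down the ranking."""
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        tp += y
        fp += 1 - y
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def eer(points):
    """Approximate Equal Error Rate: the FPR at the ROC point
    closest to the FPR = 1 - TPR diagonal."""
    return min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))[0]
```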

Book ChapterDOI
17 Sep 2008
TL;DR: The QAST 2008 evaluation framework is described, along with descriptions of the five scenarios and their associated data, the system submissions for this pilot track and the official evaluation results.
Abstract: This paper describes the experience of QAST 2008, the second time a pilot track of CLEF has been held aiming to evaluate the task of Question Answering in Speech Transcripts. Five sites submitted results for at least one of the five scenarios (lectures in English, meetings in English, broadcast news in French, and European Parliament debates in English and Spanish). In order to assess the impact of potential errors of automatic speech recognition, contrastive conditions with manual and automatically produced transcripts are provided for each task. The QAST 2008 evaluation framework is described, along with descriptions of the five scenarios and their associated data, the system submissions for this pilot track and the official evaluation results.

Book ChapterDOI
01 May 2008
TL;DR: This model makes use of the textual part of the ImageCLEFmed corpus and of the medical knowledge found in the Unified Medical Language System (UMLS) knowledge sources to create a graph model for each document.
Abstract: The main idea in this paper is to incorporate medical knowledge into the language modeling approach to information retrieval (IR). Our model makes use of the textual part of the ImageCLEFmed corpus and of the medical knowledge found in the Unified Medical Language System (UMLS) knowledge sources. The use of UMLS allows us to create a conceptual representation of each sentence in the corpus. We use these representations to create a graph model for each document. As in the standard language modeling approach, we evaluate the probability that a document graph model generates the query graph. Graphs are created from medical texts and queries, and are built for different languages with different methods. After developing the graph model, we present our tests, which involve mixing different concept sources (i.e. languages and methods) for the matching of the query and text graphs. Results show that using a language model over concepts provides good results in IR. Multiplying the concept sources further improves the results. Lastly, using relations between concepts (provided by the graphs under consideration) improves results when only a few conceptual sources are used to analyze the query.
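
As a simplified illustration of concept-based language modeling, here is a relation-free unigram sketch over UMLS concept identifiers; the paper's actual model works on document graphs and inter-concept relations, which this deliberately omits, and the smoothing choice is illustrative.

```python
import math
from collections import Counter

def concept_lm_score(query_concepts, doc_concepts, coll_concepts, lam=0.5):
    """Jelinek-Mercer-smoothed unigram language model over UMLS
    concept identifiers (CUIs): score a document's concept
    representation against a query's concept representation."""
    tf, dlen = Counter(doc_concepts), len(doc_concepts)
    ctf, clen = Counter(coll_concepts), len(coll_concepts)
    score = 0.0
    for c in query_concepts:
        p_doc = tf[c] / dlen if dlen else 0.0
        p_coll = ctf[c] / clen if clen else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        score += math.log(max(p, 1e-12))
    return score
```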

Book ChapterDOI
01 May 2008
TL;DR: ParaMor automatically learns morphological paradigms from unlabelled text and uses them to annotate word forms with morpheme boundaries; combining its analyses with Morfessor's placed first in F1 among all algorithms submitted to Morpho Challenge 2007.
Abstract: ParaMor automatically learns morphological paradigms from unlabelled text, and uses them to annotate word forms with morpheme boundaries. ParaMor competed in the English and German tracks of Morpho Challenge 2007 (Kurimo et al., 2008). In English, ParaMor's balanced precision and recall outperform at F1 an already sophisticated baseline induction algorithm, Morfessor (Creutz, 2006). In German, ParaMor suffers from low morpheme recall. But combining ParaMor's analyses with analyses from Morfessor results in a set of analyses that outperform either algorithm alone, and that place first in F1 among all algorithms submitted to Morpho Challenge 2007.

Book ChapterDOI
17 Sep 2008
TL;DR: This paper reports on the GikiP pilot that took place in 2008 in GeoCLEF, which required a combination of methods from geographical information retrieval and question answering to answer queries against Wikipedia.
Abstract: This paper reports on the GikiP pilot that took place in 2008 in GeoCLEF. This pilot task required a combination of methods from geographical information retrieval and question answering to answer queries against Wikipedia. We start with the task description, providing details on topic choice and evaluation measures. Then we offer a brief motivation from several perspectives, and we present results in detail. A comparison of participants' approaches is then presented, and the paper concludes with improvements for the next edition.

Book ChapterDOI
01 May 2008
TL;DR: This paper presents the Hindi to English and Marathi to English CLIR systems developed as part of the participation in the CLEF 2007 Ad-Hoc Bilingual task, and takes a query translation based approach using bi-lingual dictionaries.
Abstract: In this paper, we present our Hindi to English and Marathi to English CLIR systems developed as part of our participation in the CLEF 2007 Ad-Hoc Bilingual task. We take a query translation based approach using bi-lingual dictionaries. Query words not found in the dictionary are transliterated using a simple rule based transliteration approach. The resultant transliteration is then compared with the unique words of the corpus to return the `k' words most similar to the transliterated word. The resulting multiple translation/transliteration choices for each query word are disambiguated using an iterative page-rank style algorithm which, based on term-term co-occurrence statistics, produces the final translated query. Using the above approach, for Hindi, we achieve a Mean Average Precision (MAP) of 0.2366 using title and a MAP of 0.2952 using title and description. For Marathi, we achieve a MAP of 0.2163 using title.