
Showing papers on "Annotation" published in 2017


Journal ArticleDOI
TL;DR: The web application GeSeq combines batch processing with a fully customizable reference sequence selection of organellar genome records from NCBI and/or references uploaded by the user to support high-quality annotations of chloroplast genomes.
Abstract: We have developed the web application GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html) for the rapid and accurate annotation of organellar genome sequences, in particular chloroplast genomes. In contrast to existing tools, GeSeq combines batch processing with a fully customizable reference sequence selection of organellar genome records from NCBI and/or references uploaded by the user. For the annotation of chloroplast genomes, the application additionally provides an integrated database of manually curated reference sequences. GeSeq identifies genes or other feature-encoding regions by BLAT-based homology searches and additionally, by profile HMM searches for protein and rRNA coding genes and two de novo predictors for tRNA genes. These unique features enable the user to conveniently compare the annotations of different state-of-the-art methods, thus supporting high-quality annotations. The main output of GeSeq is a GenBank file that usually requires little curation and is instantly visualized by OGDRAW. GeSeq also offers a variety of optional additional outputs that facilitate downstream analyses, for example comparative genomic or phylogenetic studies.

1,663 citations


Book ChapterDOI
Lin Yang1, Yizhe Zhang1, Jianxu Chen1, Siyuan Zhang1, Danny Z. Chen1 
10 Sep 2017
TL;DR: A deep active learning framework that combines fully convolutional network (FCN) and active learning to significantly reduce annotation effort by making judicious suggestions on the most effective annotation areas is presented.
Abstract: Image segmentation is a fundamental problem in biomedical image analysis. Recent advances in deep learning have achieved promising results on many biomedical image segmentation benchmarks. However, due to large variations in biomedical images (different modalities, image settings, objects, noise, etc.), utilizing deep learning in a new application usually requires a new set of training data. This can incur a great deal of annotation effort and cost, because only biomedical experts can annotate effectively, and often there are too many instances in images (e.g., cells) to annotate. In this paper, we aim to address the following question: With limited effort (e.g., time) for annotation, what instances should be annotated in order to attain the best performance? We present a deep active learning framework that combines fully convolutional network (FCN) and active learning to significantly reduce annotation effort by making judicious suggestions on the most effective annotation areas. We utilize uncertainty and similarity information provided by FCN and formulate a generalized version of the maximum set cover problem to determine the most representative and uncertain areas for annotation. Extensive experiments using the 2015 MICCAI Gland Challenge dataset and a lymph node ultrasound image segmentation dataset show that, using annotation suggestions by our method, state-of-the-art segmentation performance can be achieved by using only 50% of the training data.

438 citations
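
To make the suggestion step concrete, here is a minimal sketch of an uncertainty-weighted greedy set-cover selection over candidate annotation areas. All names and data are illustrative; the paper's actual formulation operates on FCN-derived uncertainty and similarity, and its generalized maximum set cover differs in detail.

```python
import numpy as np

def suggest_annotation_areas(uncertainty, similarity, budget):
    """Greedily pick `budget` areas that are both uncertain and
    representative: each pick maximizes the uncertainty-weighted gain
    in 'coverage' of the remaining areas (a simplified stand-in for
    the paper's generalized maximum set cover formulation)."""
    n = len(uncertainty)
    covered = np.zeros(n)  # how well each area is already represented
    chosen = []
    for _ in range(budget):
        gains = [
            -np.inf if j in chosen
            else uncertainty[j] * np.maximum(similarity[j] - covered, 0).sum()
            for j in range(n)
        ]
        best = int(np.argmax(gains))
        chosen.append(best)
        covered = np.maximum(covered, similarity[best])
    return chosen

# Toy example: 6 candidate areas with random uncertainty and similarity.
rng = np.random.default_rng(0)
unc = rng.random(6)
sim = rng.random((6, 6))
sim = (sim + sim.T) / 2
np.fill_diagonal(sim, 1.0)
print(suggest_annotation_areas(unc, sim, budget=2))
```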


Journal ArticleDOI
TL;DR: The MalaCards human disease database is an integrated compendium of annotated diseases mined from 68 data sources and adopts a ‘flat’ disease-card approach, but each card is mapped to popular hierarchical ontologies and contains information about multi-level relations among diseases, thereby providing an optimal tool for disease representation and scrutiny.
Abstract: The MalaCards human disease database (http://www.malacards.org/) is an integrated compendium of annotated diseases mined from 68 data sources. MalaCards has a web card for each of ∼20 000 disease entries, in six global categories. It portrays a broad array of annotation topics in 15 sections, including Summaries, Symptoms, Anatomical Context, Drugs, Genetic Tests, Variations and Publications. The Aliases and Classifications section reflects an algorithm for disease name integration across often-conflicting sources, providing effective annotation consolidation. A central feature is a balanced Genes section, with scores reflecting the strength of disease-gene associations. This is accompanied by other gene-related disease information such as pathways, mouse phenotypes and GO-terms, stemming from MalaCards' affiliation with the GeneCards Suite of databases. MalaCards' capacity to inter-link information from complementary sources, along with its elaborate search function, relational database infrastructure and convenient data dumps, allows it to tackle its rich disease annotation landscape, and facilitates systems analyses and genome sequence interpretation. MalaCards adopts a 'flat' disease-card approach, but each card is mapped to popular hierarchical ontologies (e.g. International Classification of Diseases, Human Phenotype Ontology and Unified Medical Language System) and also contains information about multi-level relations among diseases, thereby providing an optimal tool for disease representation and scrutiny.

372 citations


Journal ArticleDOI
01 Sep 2017
TL;DR: The results of this project show that high quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for corpora in linguistics pedagogy.
Abstract: This paper presents the methodology, design principles and detailed evaluation of a new freely available multilayer corpus, collected and edited via classroom annotation using collaborative software. After briefly discussing corpus design for open, extensible corpora, five classroom annotation projects are presented, covering structural markup in TEI XML, multiple part of speech tagging, constituent and dependency parsing, information structural and coreference annotation, and Rhetorical Structure Theory analysis. Layers are inspected for annotation quality and together they coalesce to form a richly annotated corpus that can be used to study the interactions between different levels of linguistic description. The evaluation gives an indication of the expected quality of a corpus created by students with relatively little training. A multifactorial example study on lexical NP coreference likelihood is also presented, which illustrates some applications of the corpus. The results of this project show that high quality, richly annotated resources can be created effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for corpora in linguistics pedagogy.

223 citations


Journal ArticleDOI
TL;DR: This paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data and on original tools and pipelines that have been developed and integrated in the MicroScope platform: computation of pan-genomes and prediction of biosynthetic gene clusters.
Abstract: The annotation of genomes from NGS platforms needs to be automated and fully integrated. However, maintaining consistency and accuracy in genome annotation is a challenging problem because millions of protein database entries are not assigned reliable functions. This shortcoming limits the knowledge that can be extracted from genomes and metabolic models. Launched in 2005, the MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Effective comparative analysis requires a consistent and complete view of biological data, and therefore, support for reviewing the quality of functional annotation is critical. MicroScope allows users to analyze microbial (meta)genomes together with any available post-genomic experiment results (e.g. transcriptomics, re-sequencing of evolved strains, mutant collections, phenotype data). It combines tools and graphical interfaces to analyze genomes and to perform the expert curation of gene functions in a comparative context. Starting with a short overview of the MicroScope system, this paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data, and on original tools and pipelines that have been developed and integrated in the platform: computation of pan-genomes and prediction of biosynthetic gene clusters. Today the resource contains data for more than 6000 microbial genomes, and among the 2700 personal accounts (65% of which are now from foreign countries), 14% of the users are performing expert annotations on at least a weekly basis, contributing to improving the quality of microbial genome annotations.

166 citations


Journal ArticleDOI
TL;DR: The Spoken British National Corpus 2014 is introduced, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016.
Abstract: This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014’s creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented to attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.

159 citations


Proceedings ArticleDOI
18 Apr 2017
TL;DR: ChartAccent, a tool that allows people to quickly and easily augment charts via a palette of annotation interactions that generate manual and data-driven annotations, is designed and developed.
Abstract: Annotation plays an important role in conveying key points in visual data-driven storytelling; it helps presenters explain and emphasize core messages and specific data. However, the visualization research community has a limited understanding of annotation and its role in data-driven storytelling, and existing charting software provides limited support for creating annotations. In this paper, we characterize a design space of chart annotations, one informed by a survey of 106 annotated charts published by six prominent news graphics desks. Using this design space, we designed and developed ChartAccent, a tool that allows people to quickly and easily augment charts via a palette of annotation interactions that generate manual and data-driven annotations. We also report on a study in which participants reproduced a series of annotated charts using ChartAccent, beginning with unadorned versions of the same charts. Finally, we discuss the lessons learned during the process of designing and evaluating ChartAccent, and suggest directions for future research.

91 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: The SemEval 2016 stance and sentiment dataset is extended with emotion annotation to investigate annotation reliability and annotation merging, and the relation between emotion annotation and the other annotation layers (stance, sentiment).
Abstract: There is a rich variety of data sets for sentiment analysis (viz., polarity and subjectivity classification). For the more challenging task of detecting discrete emotions following the definitions of Ekman and Plutchik, however, there are much fewer data sets, and notably no resources for the social media domain. This paper contributes to closing this gap by extending the SemEval 2016 stance and sentiment dataset with emotion annotation. We (a) analyse annotation reliability and annotation merging; (b) investigate the relation between emotion annotation and the other annotation layers (stance, sentiment); (c) report modelling results as a baseline for future work.

82 citations
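
Point (a), annotation merging, ultimately comes down to aggregating the label sets produced by individual annotators. A minimal majority-vote sketch (hypothetical; the paper analyses several merging strategies):

```python
from collections import Counter

def merge_annotations(label_sets, min_votes=2):
    """Keep each emotion label that at least `min_votes` annotators
    assigned. `label_sets` holds one set of labels per annotator
    for a single message."""
    counts = Counter(label for labels in label_sets for label in labels)
    return {label for label, n in counts.items() if n >= min_votes}

# Three annotators labelling one tweet with Plutchik-style emotions.
print(merge_annotations([{"anger", "disgust"}, {"anger"}, {"anger", "fear"}]))
# -> {'anger'}
```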


Posted ContentDOI
20 Feb 2017-bioRxiv
TL;DR: FUMA is developed, a web-based platform to facilitate functional annotation of GWAS results, prioritization of genes and interactive visualization of annotated results by incorporating information from multiple state-of-the-art biological databases.
Abstract: A main challenge in genome-wide association studies (GWAS) is to prioritize genetic variants and identify potential causal mechanisms of human diseases. Although multiple bioinformatics resources are available for functional annotation and prioritization, a standard, integrative approach is lacking. We developed FUMA: a web-based platform to facilitate functional annotation of GWAS results, prioritization of genes and interactive visualization of annotated results by incorporating information from multiple state-of-the-art biological databases.

71 citations


Journal ArticleDOI
TL;DR: This paper focuses on annotation of biomedical entity mentions with concepts from relevant biomedical knowledge bases such as UMLS, focusing particularly on general purpose annotators, that is, semantic annotation tools that can be customized to work with texts from any area of biomedicine.
Abstract: The abundance and unstructured nature of biomedical texts, be it clinical or research content, impose significant challenges for the effective and efficient use of information and knowledge stored in such texts. Annotation of biomedical documents with machine intelligible semantics facilitates advanced, semantics-based text management, curation, indexing, and search. This paper focuses on annotation of biomedical entity mentions with concepts from relevant biomedical knowledge bases such as UMLS. As a result, the meaning of those mentions is unambiguously and explicitly defined, and thus made readily available for automated processing. This process is widely known as semantic annotation, and the tools that perform it are known as semantic annotators. Over the last dozen years, the biomedical research community has invested significant efforts in the development of biomedical semantic annotation technology. Aiming to establish grounds for further developments in this area, we review a selected set of state-of-the-art biomedical semantic annotators, focusing particularly on general-purpose annotators, that is, semantic annotation tools that can be customized to work with texts from any area of biomedicine. We also examine potential directions for further improvements of today’s annotators which could make them even more capable of meeting the needs of real-world applications. To motivate and encourage further developments in this area, along the suggested and/or related directions, we review existing and potential practical applications and benefits of semantic annotators.

65 citations


Book ChapterDOI
01 Jan 2017
TL;DR: The paper describes the development of a corpus from social media built with the aim of representing and analysing hate speech against some minority groups in Italy and introduces the issues related to data collection and annotation.
Abstract: The paper describes the development of a corpus from social media built with the aim of representing and analysing hate speech against some minority groups in Italy. The issues related to data collection and annotation are introduced, focusing on the challenges we addressed in designing a multifaceted set of labels where the main features of verbal hate expressions may be modelled. Moreover, an analysis of the disagreement among the annotators is presented in order to carry out a preliminary evaluation of the data set and the scheme.

Journal ArticleDOI
TL;DR: A concept-based automatic semantic annotation method for the documents of online BIM products and a prototype annotation system, named BIMTag, is developed and combined with a search engine for demonstrating the utility and effectiveness of the method.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: A new affect annotation tool, RankTrace, is introduced that allows for the annotation of affect in a continuous yet unbounded fashion; the findings suggest that the relative processing of traces via their mean gradient yields the best and most robust predictions of phasic manifestations of skin conductance.
Abstract: How should annotation data be processed so that it can best characterize the ground truth of affect? This paper attempts to address this critical question by testing various methods of processing annotation data on their ability to capture phasic elements of skin conductance. Towards this goal the paper introduces a new affect annotation tool, RankTrace, that allows for the annotation of affect in a continuous yet unbounded fashion. RankTrace is tested on first-person annotations of tension elicited from a horror video game. The key findings of the paper suggest that the relative processing of traces via their mean gradient yields the best and most robust predictors of phasic manifestations of skin conductance.
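
The mean-gradient processing singled out by the paper is, at its core, a simple statistic over a continuous trace; the sketch below is illustrative rather than the RankTrace implementation:

```python
import numpy as np

def mean_gradient(trace, timestamps=None):
    """Average first difference of an (unbounded) annotation trace.
    A positive value means the annotated tension was, on average,
    rising over the session."""
    trace = np.asarray(trace, dtype=float)
    if timestamps is None:
        return float(np.mean(np.diff(trace)))
    return float(np.mean(np.diff(trace) / np.diff(np.asarray(timestamps))))

# An annotator ratcheting tension up, easing off, then spiking.
print(mean_gradient([0.0, 0.5, 1.5, 1.2, 2.8]))
```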


Journal ArticleDOI
TL;DR: This paper proposes an effective and robust scheme, termed robust multi-view semi-supervised learning (RMSL), for facilitating image annotation task, and exploits both labeled images and unlabeled images to uncover the intrinsic data structural information.
Abstract: Driven by the rapid development of Internet and digital technologies, we have witnessed the explosive growth of Web images in recent years. Seeing that labels can reflect the semantic contents of the images, automatic image annotation, which can further facilitate the procedure of image semantic indexing, retrieval, and other image management tasks, has become one of the most crucial research directions in multimedia. Most of the existing annotation methods heavily rely on well-labeled training data (expensive to collect) and/or a single view of visual features (insufficient representative power). In this paper, inspired by the promising advance of feature engineering (e.g., CNN features and scale-invariant feature transform features) and the inexhaustible image data (associated with noisy and incomplete labels) on the Web, we propose an effective and robust scheme, termed robust multi-view semi-supervised learning (RMSL), for facilitating the image annotation task. Specifically, we exploit both labeled images and unlabeled images to uncover the intrinsic data structural information. Meanwhile, to comprehensively describe an individual datum, we take advantage of the correlated and complementary information derived from multiple facets of image data (i.e., multiple views or features). We devise a robust pairwise constraint on the outcomes of different views to achieve annotation consistency. Furthermore, we integrate a robust classifier learning component via an ℓ2,p loss, which can provide effective noise identification power during the learning process. Finally, we devise an efficient iterative algorithm to solve the optimization problem in RMSL. We conduct comprehensive experiments on three different data sets, and the results illustrate that our proposed approach is promising for automatic image annotation.
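
As a rough illustration of why an ℓ2,p loss can flag noisy labels: for p below 2, large residuals contribute far less than they would under the squared loss. A minimal sketch of the quantity (not the RMSL optimization itself):

```python
import numpy as np

def l2p_loss(residuals, p=1.0):
    """Sum over samples of the row-wise L2 norm raised to the power p.
    For p=2 this is the ordinary squared loss; p<2 dampens outliers,
    which is the noise-identification effect the paper exploits."""
    row_norms = np.linalg.norm(np.atleast_2d(residuals), axis=1)
    return float(np.sum(row_norms ** p))

pred = np.array([[0.9, 0.1], [0.2, 0.7], [5.0, 5.0]])  # last row: an outlier
true = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(l2p_loss(pred - true, p=2.0), l2p_loss(pred - true, p=0.5))
```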

Journal ArticleDOI
TL;DR: This work proposes the 2-pass k-nearest neighbour (2PKNN) algorithm, a two-step variant of the classical k-nearest neighbour algorithm that tries to address issues in the image annotation task, and establishes a new state of the art on the prevailing image annotation datasets.
Abstract: Automatic image annotation aims at predicting a set of semantic labels for an image. Because of the large annotation vocabulary, there exist large variations in the number of images corresponding to different labels ("class-imbalance"). Additionally, due to the limitations of human annotation, several images are not annotated with all the relevant labels ("incomplete-labelling"). These two issues affect the performance of most of the existing image annotation models. In this work, we propose the 2-pass k-nearest neighbour (2PKNN) algorithm. It is a two-step variant of the classical k-nearest neighbour algorithm that tries to address these issues in the image annotation task. The first step of 2PKNN uses "image-to-label" similarities, while the second step uses "image-to-image" similarities, thus combining the benefits of both. We also propose a metric learning framework over 2PKNN. This is done in a large-margin set-up by generalizing a well-known (single-label) classification metric learning algorithm for multi-label data. In addition to the features provided by Guillaumin et al. (2009) that are used by almost all the recent image annotation methods, we benchmark using new features that include features extracted from a generic convolutional neural network model and those computed using modern encoding techniques. We also learn linear and kernelized cross-modal embeddings over different feature combinations to reduce the semantic gap between visual features and textual labels. Extensive evaluations on four image annotation datasets (Corel-5K, ESP-Game, IAPR-TC12 and MIRFlickr-25K) demonstrate that our method achieves promising results, and establishes a new state of the art on the prevailing image annotation datasets.
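
A schematic of the two passes, assuming precomputed test-to-training distances and per-image label sets; the real 2PKNN scoring and its metric learning layer are more elaborate:

```python
import numpy as np

def two_pass_knn(dist_to_train, train_labels, all_labels, k1=3, top=5):
    """Pass 1: per label, keep the k1 nearest training images carrying it
    ('image-to-label'). Pass 2: score labels by similarity-weighted votes
    over that pooled neighbourhood ('image-to-image')."""
    pool = set()
    for label in all_labels:
        idx = [i for i, labs in enumerate(train_labels) if label in labs]
        idx.sort(key=lambda i: dist_to_train[i])
        pool.update(idx[:k1])
    scores = {}
    for i in pool:
        w = np.exp(-dist_to_train[i])  # closer image, larger vote
        for label in train_labels[i]:
            scores[label] = scores.get(label, 0.0) + w
    return sorted(scores, key=scores.get, reverse=True)[:top]

train_labels = [{"sky", "sea"}, {"sky"}, {"car"}, {"sea", "beach"}]
dist = np.array([0.2, 0.9, 1.5, 0.4])  # test image vs. 4 training images
print(two_pass_knn(dist, train_labels, {"sky", "sea", "car", "beach"}))
```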

Journal ArticleDOI
TL;DR: The strengths and weaknesses of approaches for the annotation and classification of important elements of protein- coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation are discussed as a practical guide for clinicians when considering pathogenic sequence variation.
Abstract: The Human Genome Project and advances in DNA sequencing technologies have revolutionized the identification of genetic disorders through the use of clinical exome sequencing. However, in a considerable number of patients, the genetic basis remains unclear. As clinicians begin to consider whole-genome sequencing, an understanding of the processes and tools involved and the factors to consider in the annotation of the structure and function of genomic elements that might influence variant identification is crucial. Here, we discuss and illustrate the strengths and weaknesses of approaches for the annotation and classification of important elements of protein-coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation, as a practical guide for clinicians when considering pathogenic sequence variation. Complete and accurate annotation of structure and function of genome features has the potential to reduce both false-negative (from missing annotation) and false-positive (from incorrect annotation) errors in causal variant identification in exome and genome sequences. Re-analysis of unsolved cases will be necessary as newer technology improves genome annotation, potentially improving the rate of diagnosis.

Posted Content
TL;DR: A new dataset, Functional Map of the World (fMoW), which aims to inspire the development of machine learning models capable of predicting the functional purpose of buildings and land use from temporal sequences of satellite images and a rich set of metadata features.
Abstract: We present a new dataset, Functional Map of the World (fMoW), which aims to inspire the development of machine learning models capable of predicting the functional purpose of buildings and land use from temporal sequences of satellite images and a rich set of metadata features. The metadata provided with each image enables reasoning about location, time, sun angles, physical sizes, and other features when making predictions about objects in the image. Our dataset consists of over 1 million images from over 200 countries. For each image, we provide at least one bounding box annotation containing one of 63 categories, including a "false detection" category. We present an analysis of the dataset along with baseline approaches that reason about metadata and temporal views. Our data, code, and pretrained models have been made publicly available.

Journal ArticleDOI
TL;DR: This paper proposes an image annotation model that incorporates contextual cues collected from sources both intrinsic and extrinsic to images, to bridge the semantic gap, and outperforms the state of the art on the collected data set of approximately 20 000 items.
Abstract: Automatic image annotation methods are extremely beneficial for image search, retrieval, and organization systems. The lack of strict correlation between semantic concepts and visual features, referred to as the semantic gap, is a huge challenge for annotation systems. In this paper, we propose an image annotation model that incorporates contextual cues collected from sources both intrinsic and extrinsic to images, to bridge the semantic gap. The main focus of this paper is a large real-world data set of news images that we collected. Unlike standard image annotation benchmark data sets, our data set does not require human annotators to generate artificial ground truth descriptions after data collection, since our images already include contextually meaningful and real-world captions written by journalists. We thoroughly study the nature of image descriptions in this real-world data set. News image captions describe both visual contents and the contexts of images. Auxiliary information sources are also available with such images in the form of news articles and metadata (e.g., keywords and categories). The proposed framework extracts contextual cues from available sources of different data modalities and transforms them into a common representation space, i.e., the probability space. Predicted annotations are later transformed into sentence-like captions through an extractive framework applied over news articles. Our context-driven framework outperforms the state of the art on the collected data set of approximately 20 000 items, as well as on a previously available smaller news images data set.

Proceedings ArticleDOI
10 Nov 2017
TL;DR: This study examines inter-annotator agreement in multi-class, multi-label sentiment annotation of messages, using several annotation agreement measures, as well as statistical analysis and Machine Learning to assess the resulting annotations.
Abstract: Manual text annotation is an essential part of Big Text analytics. Although annotators work with limited parts of data sets, their results are extrapolated by automated text classification and affect the final classification results. Reliability of annotations and adequacy of assigned labels are especially important in the case of sentiment annotations. In the current study we examine inter-annotator agreement in multi-class, multi-label sentiment annotation of messages. We used several annotation agreement measures, as well as statistical analysis and Machine Learning to assess the resulting annotations.
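
For the simplest single-label case, chance-corrected pairwise agreement can be computed with Cohen's kappa, for example via scikit-learn; the study itself applies several agreement measures to multi-class, multi-label data:

```python
from sklearn.metrics import cohen_kappa_score

# Sentiment labels two annotators assigned to the same ten messages.
ann_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
ann_b = ["pos", "neg", "pos", "pos", "neg", "neu", "neu", "neg", "pos", "neg"]

# Kappa corrects raw percent agreement for agreement expected by chance.
print(cohen_kappa_score(ann_a, ann_b))
```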

Book ChapterDOI
01 Jan 2017
TL;DR: This chapter describes recent and ongoing annotation efforts using the ISO 24617-2 standard for dialogue act annotation, and the construction of corpora of dialogues, annotated according to ISO 246 17-2, is discussed.
Abstract: This chapter describes recent and ongoing annotation efforts using the ISO 24617-2 standard for dialogue act annotation. Experimental studies are reported on the annotation by human annotators and by annotation machines of some of the specific features of the ISO annotation scheme, such as its multidimensional annotation of communicative functions, the recognition of each of its nine dimensions, and the recognition of dialogue act qualifiers for certainty, conditionality, and sentiment. The construction of corpora of dialogues, annotated according to ISO 24617-2, is discussed, including the recent DBOX and DialogBank corpora.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work introduces a method to greatly reduce the amount of redundant annotations required when crowdsourcing annotations such as bounding boxes, parts, and class labels, and develops specialized models and algorithms for binary annotation, part keypoint annotation, and sets of bounding box annotations.
Abstract: We introduce a method to greatly reduce the amount of redundant annotations required when crowdsourcing annotations such as bounding boxes, parts, and class labels. For example, if two Mechanical Turkers happen to click on the same pixel location when annotating a part in a given image (an event that is very unlikely to occur by random chance), it is a strong indication that the location is correct. A similar type of confidence can be obtained if a single Turker happens to agree with a computer vision estimate. We thus incrementally collect a variable number of worker annotations per image based on online estimates of confidence. This is done using a sequential estimation of risk over a probabilistic model that combines worker skill, image difficulty, and an incrementally trained computer vision model. We develop specialized models and algorithms for binary annotation, part keypoint annotation, and sets of bounding box annotations. We show that our method can reduce annotation time by a factor of 4-11 for binary filtering of web search results and 2-4 for annotation of boxes of pedestrians in images, while in many cases also reducing annotation error. We will make an end-to-end version of our system publicly available.
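
The incremental-collection idea can be sketched with a toy binary-annotation loop that stops requesting workers once the posterior is confident enough. The fixed per-worker skill here is an illustrative simplification; the paper jointly models worker skill, image difficulty, and a computer vision prior:

```python
def collect_until_confident(worker_stream, skill=0.8, prior=0.5, thresh=0.95):
    """Ask workers one at a time for a binary label; stop as soon as the
    posterior for either answer exceeds `thresh`. Assumes every worker is
    independently correct with probability `skill` (a toy worker model)."""
    p_true = prior  # P(label = 1 | votes so far)
    n_votes = 0
    for vote in worker_stream:  # each vote is 1 or 0
        n_votes += 1
        like1 = skill if vote == 1 else 1 - skill
        like0 = 1 - skill if vote == 1 else skill
        p_true = like1 * p_true / (like1 * p_true + like0 * (1 - p_true))
        if p_true >= thresh or p_true <= 1 - thresh:
            break
    return (p_true >= 0.5), p_true, n_votes

# With skill 0.8, three agreeing votes push the posterior past 0.95.
print(collect_until_confident(iter([1, 1, 1, 0, 1])))
```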


Dissertation
07 Jul 2017
TL;DR: This thesis investigates the problem of detecting hate speech posted online with an exhaustive and methodical approach, and examines the potential advantages of using hierarchical classes to annotate a dataset.
Abstract: The use of the internet and social networks, in particular for communication, has significantly increased in recent years. This growth has also resulted in the adoption of more aggressive communication. Therefore it is important that governments and social network platforms have tools to detect this type of communication, because it can be harmful to its targets. In this thesis we investigate the problem of detecting hate speech posted online. The first goal of our work was to provide a complete overview of the topic, focusing on the perspective of computer science and engineering. We adopted an exhaustive and methodical approach that we called Systematic Literature Review. As a result, we critically summarized different perspectives on the hate speech concept and complemented our definition with rules, examples, and a comparison with other related concepts, such as cyberbullying and abusive language. Regarding past work on the topic, we observed that the majority of the studies tackle this problem as a machine learning classification task, using either general text mining features (e.g. n-grams, word2vec) or hate speech specific features (e.g. othering discourse). In the majority of these studies new datasets are collected, but those remain private, which makes it more difficult to compare results across different works. We also concluded that this field is still at an early stage, with several open research opportunities. As we found no research on the topic in Portuguese, the second goal of this work was to annotate a dataset for this language and to make it available as well. For the dataset annotation, we built a classification system using a hierarchical structure. The main advantage of this strategy is that it allows us to better consider nuances in the hate speech concept, such as the existence and intersectionality of the subtypes of hate speech. Our data was collected from Twitter and manually annotated by following a set of rules, which are also a valuable product of our work. We annotated a dataset of 5,668 messages from 1,156 distinct users, in which 85 distinct classes of hate speech were considered. Of the total 5,668 messages, around 22% contain some type of hate speech. Regarding annotator agreement, the hierarchical approach allowed us to improve results; however, agreement remained an issue in identifying hate speech. Further analysis pointed out that the several types of hate speech present different characteristics (e.g. distinct numbers of messages, time occurrences, vocabulary sizes, distinct n-grams and POS). A final goal of our thesis was to investigate the potential advantages of using hierarchical classes to annotate a dataset. For this, we used the dataset annotated for Portuguese and conducted an experiment with training, validation and test phases. In this experiment we compare two different approaches: we call the model using only the hate speech class the unimodel, and the model using the several hierarchical classes the multimodel. The main conclusion of our experiment was that the performance of the multimodel seemed to be slightly better than that of the unimodel on the F1 metric; additionally, our method helped to identify a larger number of hate speech messages, because it has better recall, at the expense of precision. Finally, we think that in the future this experiment can be extended in order to better identify hate speech and the respective subtypes.
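
One practical consequence of the hierarchical scheme is that a subtype label implies all of its ancestors, so the flat (unimodel) target can be derived from the hierarchical (multimodel) labels. A sketch with a hypothetical label tree; the thesis's 85 classes are not reproduced here:

```python
def expand_hierarchical(labels, sep="/"):
    """Expand leaf labels such as 'hate/racism' so that every ancestor
    ('hate') is counted too; the label names here are hypothetical."""
    expanded = set()
    for label in labels:
        parts = label.split(sep)
        expanded.update(sep.join(parts[:i]) for i in range(1, len(parts) + 1))
    return expanded

multi = expand_hierarchical({"hate/racism", "hate/sexism"})
print(multi)            # {'hate', 'hate/racism', 'hate/sexism'}
print("hate" in multi)  # the flat (unimodel) target: hate speech or not
```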

Journal ArticleDOI
Jian Zhao1, Michael Glueck1, Simon Breslav1, Fanny Chevalier, Azam Khan1 
TL;DR: This work presents annotation graphs, a dynamic graph visualization that enables meta-analysis of data based on user-authored annotations, applies principles of Exploratory Sequential Data Analysis in designing C8, and links these to an existing task typology in the visualization literature.
Abstract: User-authored annotations of data can support analysts in the activity of hypothesis generation and sensemaking, where it is not only critical to document key observations, but also to communicate insights between analysts. We present annotation graphs, a dynamic graph visualization that enables meta-analysis of data based on user-authored annotations. The annotation graph topology encodes annotation semantics, which describe the content of and relations between data selections, comments, and tags. We present a mixed-initiative approach to graph layout that integrates an analyst's manual manipulations with an automatic method based on similarity inferred from the annotation semantics. Various visual graph layout styles reveal different perspectives on the annotation semantics. Annotation graphs are implemented within C8, a system that supports authoring annotations during exploratory analysis of a dataset. We apply principles of Exploratory Sequential Data Analysis (ESDA) in designing C8, and further link these to an existing task typology in the visualization literature. We develop and evaluate the system through an iterative user-centered design process with three experts, situated in the domain of analyzing HCI experiment data. The results suggest that annotation graphs are effective as a method of visually extending user-authored annotations to data meta-analysis for discovery and organization of ideas.

Journal ArticleDOI
TL;DR: UROPA (Universal RObust Peak Annotator) is a command line based tool, intended for universal genomic range annotation, that can incorporate reference annotation files (GTF) from different sources (Gencode, Ensembl, RefSeq), as well as custom reference annotations files.
Abstract: The annotation of genomic ranges of interest represents a recurring task for bioinformatics analyses. These ranges can originate from various sources, including peaks called for transcription factor binding sites (TFBS) or histone modification ChIP-seq experiments, chromatin structure and accessibility experiments (such as ATAC-seq), but also from other types of predictions that result in genomic ranges. While peak annotation, primarily driven by ChIP-seq, has been extensively explored, many approaches remain simplistic (“most closely located TSS”), rely on fixed pre-built references, or require complex scripting tasks on behalf of the user. An adaptable, fast, and universal tool, capable of annotating genomic ranges in their respective biological context, is critically missing. UROPA (Universal RObust Peak Annotator) is a command line based tool intended for universal genomic range annotation. Based on a configuration file, different target features can be prioritized with multiple integrated queries. These can be sensitive to feature type, distance, strand specificity, feature attributes (e.g. protein_coding) or anchor position relative to the feature. UROPA can incorporate reference annotation files (GTF) from different sources (Gencode, Ensembl, RefSeq), as well as custom reference annotation files. Statistics and plots transparently summarize the annotation process. UROPA is implemented in Python and R.
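
Stripped of configuration handling, GTF parsing and strand logic, the core operation any such annotator performs is matching a genomic range against reference features under query constraints. A simplified pure-Python sketch, with a feature's start standing in for the TSS anchor (not UROPA's actual code or config format):

```python
def annotate_range(peak, features, feature_type="gene", max_distance=1000):
    """Return the reference feature of the requested type whose start
    (a stand-in anchor for the TSS) lies closest to the peak centre,
    within max_distance; None if nothing qualifies."""
    centre = (peak["start"] + peak["end"]) // 2
    best, best_dist = None, max_distance + 1
    for feat in features:
        if feat["type"] != feature_type or feat["chrom"] != peak["chrom"]:
            continue
        dist = abs(feat["start"] - centre)
        if dist < best_dist:
            best, best_dist = feat, dist
    return best

peak = {"chrom": "chr1", "start": 900, "end": 1100}
genes = [{"type": "gene", "chrom": "chr1", "start": 1500, "name": "GENE_A"},
         {"type": "gene", "chrom": "chr1", "start": 5000, "name": "GENE_B"}]
print(annotate_range(peak, genes))  # GENE_A, 500 bp from the peak centre
```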

Book ChapterDOI
01 Jan 2017
TL;DR: The primary motivation for the annotation project was the accumulating body of evidence indicating that the bodies of journal articles contain much information that is not present in the abstracts, and that the textual and structural characteristics of article bodies are different from those of abstracts.
Abstract: The Colorado Richly Annotated Full Text (CRAFT) corpus consists of full-text journal articles. The primary motivation for the annotation project was the accumulating body of evidence indicating that the bodies of journal articles contain much information that is not present in the abstracts, and that the textual and structural characteristics of article bodies are different from those of abstracts. The development of CRAFT was characterized by a “multi-model” annotation task. The sample population was all journal articles that had been used by the Mouse Genome Informatics group as evidence for at least one Gene Ontology or Mouse Phenotype Ontology “annotation.” The linguistic annotation is represented in the widely known Penn Treebank format (Marcus et al., Comput. Linguist. 19(2), 313–330, 1993) [50], with the addition of a small number of tags and phrasal categories to accommodate the idiosyncrasies of the domain.

Journal ArticleDOI
TL;DR: It is demonstrated that FunctionAnnotator can efficiently annotate transcriptomes and greatly benefits studies focusing on non-model organisms or metatranscriptomes, and that it can estimate the taxonomic composition of environmental samples and assist in the identification of novel proteins by combining RNA-Seq data with proteomics technology.
Abstract: Along with the constant improvement in high-throughput sequencing technology, an increasing number of transcriptome sequencing projects are carried out in organisms without decoded genome information and even on environmental biological samples. To study the biological functions of novel transcripts, the very first task is to identify their potential functions. We present a web-based annotation tool, FunctionAnnotator, which offers comprehensive annotations, including GO term assignment, enzyme annotation, domain/motif identification and predictions for subcellular localization. To accelerate the annotation process, we have optimized the computation processes and used parallel computing for all annotation steps. Moreover, FunctionAnnotator is designed to be versatile, and it generates a variety of useful outputs for facilitating other analyses. Here, we demonstrate how FunctionAnnotator can be helpful in annotating non-model organisms. We further illustrate that FunctionAnnotator can estimate the taxonomic composition of environmental samples and assist in the identification of novel proteins by combining RNA-Seq data with proteomics technology. In summary, FunctionAnnotator can efficiently annotate transcriptomes and greatly benefits studies focusing on non-model organisms or metatranscriptomes. FunctionAnnotator, a comprehensive annotation web-service tool, is freely available online at: http://fa.cgu.edu.tw/ . This new web-based annotator will shed light on field studies involving organisms without a reference genome.

Journal ArticleDOI
TL;DR: HaploR as discussed by the authors is an R package for querying web-based genome annotation tools HaploReg and RegulomeDB that gathers information in a data frame which is suitable for downstream bioinformatic analyses.
Abstract: We developed haploR, an R package for querying the web-based genome annotation tools HaploReg and RegulomeDB. haploR gathers information in a data frame that is suitable for downstream bioinformatic analyses. This facilitates streamlined post-genome-wide association study analysis for rapid discovery and interpretation of genetic associations.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors constructed a comprehensive syntactic and semantic corpus of Chinese clinical texts and developed tools trained on the annotated corpus, which supplies baselines for research on Chinese texts in the clinical domain.