Topic
Unicode
About: Unicode is a research topic. Over its lifetime, 1360 publications have been published within this topic, receiving 13934 citations. The topic is also known as: The Unicode Standard and Unicode Standard.
Papers
19 Apr 2016
Abstract: Bidirectional long short-term memory (biLSTM) networks have recently proven successful for various NLP sequence modeling tasks, but little is known about their reliance on input representations, target languages, data set size, and label noise. We address these issues and evaluate biLSTMs with word, character, and Unicode byte embeddings for POS tagging. We compare biLSTMs to traditional POS taggers across languages and data sizes. We also present a novel biLSTM model, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words. The model obtains state-of-the-art performance across 22 languages, and works especially well for morphologically complex languages. Our analysis suggests that biLSTMs are less sensitive to training data size and label corruptions (at small noise levels) than previously assumed.
425 citations
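As a rough illustration of the byte-level input mentioned in the abstract above, the sketch below turns a token into a sequence of byte values that could index a 256-row embedding table. It assumes UTF-8 encoding, which the abstract does not specify, and the function name utf8_byte_ids is hypothetical rather than taken from the paper.

```python
from typing import List


def utf8_byte_ids(token: str) -> List[int]:
    """Return the UTF-8 byte values (0-255) of a token.

    Assumption: bytes come from UTF-8; the paper's exact encoding
    is not stated in the abstract.
    """
    return list(token.encode("utf-8"))


# ASCII characters map to one byte each; non-ASCII characters to several.
print(utf8_byte_ids("né"))
```

A byte vocabulary stays fixed at 256 entries regardless of language, which is one plausible reason to compare it against word and character embeddings.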
06 Jul 2002
TL;DR: This paper presents GATE, a framework and graphical development environment that enables users to develop and deploy language engineering components and resources in a robust fashion. It can be used to develop applications and resources in multiple languages, based on its thorough Unicode support.
Abstract: In this paper we present GATE, a framework and graphical development environment which enables users to develop and deploy language engineering components and resources in a robust fashion. The GATE architecture has enabled us not only to develop a number of successful applications for various language processing tasks (such as Information Extraction), but also to build and annotate corpora and carry out evaluations on the applications generated. The framework can be used to develop applications and resources in multiple languages, based on its thorough Unicode support.
413 citations
TL;DR: Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions, and has been tested on various Unix systems and Windows.
Abstract: We present “Transcriber”, a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on the scripting language Tcl/Tk with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and Windows. The data format follows the XML standard with Unicode support for multilingual transcriptions. Distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, Transcriber has been in use for over a year in several countries. As a result of this collective experience, new requirements arose to support additional data formats, video control, and a better management of conversational speech. Using the annotation graphs framework recently formalized, adaptation of the tool towards new tasks and support of different data formats will become easier.
337 citations
Proceedings Article
01 Aug 2013
TL;DR: An open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages, made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository.
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
275 citations
01 Jan 2005
TL;DR: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement to the Uniform Resource Identifier (URI), together with a mapping from IRIs to URIs, which means that IRIs can be used instead of URIs, where appropriate, to identify resources.
Abstract: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement of the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs, where appropriate, to identify resources. The approach of defining a new protocol element was chosen instead of extending or changing the definition of URIs. This was done in order to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines are provided for the use and deployment of IRIs in various protocols, formats, and software components that currently deal with URIs.
262 citations
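The IRI-to-URI mapping mentioned in the abstract above can be sketched roughly as follows. This is a simplified illustration, not the full procedure from the specification: it assumes the input is already a Unicode string, leaves every ASCII character untouched, and percent-encodes the UTF-8 bytes of every non-ASCII character. The function name iri_to_uri is hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by percent-encoding non-ASCII characters.

    Simplified sketch: each non-ASCII character is converted to its
    UTF-8 bytes and each byte is written as %HH. ASCII characters,
    including existing percent-escapes, are copied unchanged.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


print(iri_to_uri("http://example.org/ros\u00e9"))
```

Note that this sketch treats the host name like any other component. Real software often converts internationalized host names with IDNA instead, so a production converter would likely need to handle that part separately.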