
Unicode

About: Unicode is a research topic. Over its lifetime, 1360 publications have been published within this topic, receiving 13934 citations. The topic is also known as The Unicode Standard and Unicode Standard.
Papers

Proceedings Article
19 Apr 2016
Abstract: Bidirectional long short-term memory (biLSTM) networks have recently proven successful for various NLP sequence modeling tasks, but little is known about their reliance on input representations, target languages, data set size, and label noise. We address these issues and evaluate biLSTMs with word, character, and Unicode byte embeddings for POS tagging. We compare biLSTMs to traditional POS taggers across languages and data sizes. We also present a novel biLSTM model, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words. The model obtains state-of-the-art performance across 22 languages, and works especially well for morphologically complex languages. Our analysis suggests that biLSTMs are less sensitive to training data size and label corruptions (at small noise levels) than previously assumed.

425 citations
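
The abstract above contrasts word, character, and Unicode byte inputs. As a rough illustration of how these three views of one token differ, here is a minimal sketch. It assumes UTF-8 as the byte encoding, which the abstract does not state, and the helper name `input_views` is hypothetical and not taken from the paper.

```python
def input_views(token: str) -> dict:
    """Return word-, character-, and byte-level views of one token.

    Assumption: bytes are taken from the UTF-8 encoding of the token.
    """
    return {
        "word": [token],                       # one symbol per token
        "chars": list(token),                  # one symbol per code point
        "bytes": list(token.encode("utf-8")),  # one symbol per UTF-8 byte
    }


# "na\u00efve" contains one non-ASCII code point (U+00EF),
# which UTF-8 encodes as two bytes.
views = input_views("na\u00efve")
print(len(views["word"]), len(views["chars"]), len(views["bytes"]))
```

A byte-level vocabulary never exceeds 256 symbols, while the character-level vocabulary grows with the scripts covered; this is one reason byte inputs are attractive for multilingual models.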


Proceedings Article
06 Jul 2002
TL;DR: This paper presents GATE, a framework and graphical development environment that enables users to develop and deploy language engineering components and resources in a robust fashion. Based on its thorough Unicode support, it can be used to develop applications and resources in multiple languages.
Abstract: In this paper we present GATE, a framework and graphical development environment which enables users to develop and deploy language engineering components and resources in a robust fashion. The GATE architecture has enabled us not only to develop a number of successful applications for various language processing tasks (such as Information Extraction), but also to build and annotate corpora and carry out evaluations on the applications generated. The framework can be used to develop applications and resources in multiple languages, based on its thorough Unicode support.

413 citations


Journal Article
TL;DR: Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It has been tested on various Unix systems and Windows.
Abstract: We present “Transcriber”, a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on the scripting language Tcl/Tk with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and Windows. The data format follows the XML standard with Unicode support for multilingual transcriptions. Distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, Transcriber has been in use for over a year in several countries. As a result of this collective experience, new requirements arose to support additional data formats, video control, and a better management of conversational speech. Using the annotation graphs framework recently formalized, adaptation of the tool towards new tasks and support of different data formats will become easier.

337 citations


Proceedings Article
01 Aug 2013
TL;DR: An open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages, made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository.
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.

275 citations


01 Jan 2005
TL;DR: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement of the Uniform Resource Identifier (URI), together with a mapping from IRIs to URIs, so that IRIs can be used instead of URIs, where appropriate, to identify resources.
Abstract: This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement of the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs, where appropriate, to identify resources. The approach of defining a new protocol element was chosen instead of extending or changing the definition of URIs. This was done in order to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines are provided for the use and deployment of IRIs in various protocols, formats, and software components that currently deal with URIs.

262 citations
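
The IRI-to-URI mapping mentioned in the abstract above can be sketched in a few lines. This is a simplified illustration, not the full procedure from the document: it only percent-encodes the UTF-8 bytes of each non-ASCII character and leaves ASCII untouched. It assumes the input is already a well-formed, normalized IRI and skips the optional special handling of internationalized host names. The function name `iri_to_uri` is hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by percent-encoding non-ASCII characters.

    Simplified sketch: each non-ASCII character is converted to its
    UTF-8 bytes, and each byte is written as %HH. ASCII characters,
    including existing percent-escapes, are copied unchanged.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
# prints http://example.org/r%C3%A9sum%C3%A9
```

Because only non-ASCII characters are touched, an input that is already a plain ASCII URI passes through unchanged, which matches the goal of avoiding incompatibilities with existing URI software.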


Network Information
Related Topics (5)
Embedding: 16.5K papers, 295.7K citations, 77% related
Levenshtein distance: 778 papers, 15.3K citations, 76% related
Quranic Arabic Corpus: 25 papers, 419 citations, 76% related
Pandemonium architecture: 2 papers, 292 citations, 74% related
Concatenation: 2.8K papers, 53.4K citations, 73% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    1
2021    46
2020    49
2019    60
2018    54
2017    76

Top Attributes

Topic's top 5 most impactful authors

Tony McEnery: 7 papers, 140 citations
Askar Hamdulla: 6 papers, 10 citations
Imdad Ali Ismaili: 5 papers, 25 citations
Sandeep Bele: 5 papers, 153 citations
Sai Krishnam Raju Nadimpalli: 5 papers, 153 citations