
Showing papers on "Chunking (computing) published in 2018"


Proceedings Article
01 Aug 2018
TL;DR: This work reproduces twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conducts a systematic model comparison on three benchmarks, to reach several practical conclusions which can be useful to practitioners.
Abstract: We investigate the design challenges of constructing effective and efficient neural sequence labeling systems by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conducting a systematic model comparison on three benchmarks (i.e. NER, chunking, and POS tagging). Misconceptions and inconsistent conclusions in the existing literature are examined and clarified through statistical experiments. In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners.

169 citations
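
For orientation on how chunking appears as a sequence labeling benchmark in work like this, the following is a minimal illustrative Python sketch (not taken from the paper): each token carries a BIO tag, and chunk spans are recovered from the tag sequence.

    # Minimal illustration of chunking as sequence labeling with BIO tags.
    def bio_to_chunks(tokens, tags):
        chunks, start, label = [], None, None
        for i, tag in enumerate(tags + ["O"]):           # sentinel flushes the last chunk
            if tag.startswith("B-") or tag == "O" or (label and tag[2:] != label):
                if start is not None:
                    chunks.append((label, " ".join(tokens[start:i])))
                start, label = (i, tag[2:]) if tag != "O" else (None, None)
            elif start is None:                          # stray I- tag starts a new chunk
                start, label = i, tag[2:]
        return chunks

    # bio_to_chunks(["He", "reckons", "the", "deficit"], ["B-NP", "B-VP", "B-NP", "I-NP"])
    # -> [("NP", "He"), ("VP", "reckons"), ("NP", "the deficit")]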


Journal ArticleDOI
TL;DR: The results reveal a normative rationale for center-surround connectivity in working memory circuitry, call for reevaluation of memory performance differences that have previously been attributed to differences in capacity, and support a more nuanced view of visual working memory capacity limitations.
Abstract: The nature of capacity limits for visual working memory has been the subject of an intense debate that has relied on models that assume items are encoded independently. Here we propose that instead, similar features are jointly encoded through a "chunking" process to optimize performance on visual working memory tasks. We show that such chunking can: (a) facilitate performance improvements for abstract capacity-limited systems, (b) be optimized through reinforcement, (c) be implemented by center-surround dynamics, and (d) increase effective storage capacity at the expense of recall precision. Human performance on a variant of a canonical working memory task demonstrated performance advantages, precision detriments, interitem dependencies, and trial-to-trial behavioral adjustments diagnostic of performance optimization through center-surround chunking. Models incorporating center-surround chunking provided a better quantitative description of human performance in our study as well as in a meta-analytic dataset, and apparent differences in working memory capacity across individuals were attributable to individual differences in the implementation of chunking. Our results reveal a normative rationale for center-surround connectivity in working memory circuitry, call for reevaluation of memory performance differences that have previously been attributed to differences in capacity, and support a more nuanced view of visual working memory capacity limitations: strategic tradeoffs between storage capacity and memory precision through chunking contribute to flexible capacity limitations that include both discrete and continuous aspects.

71 citations


01 Jan 2018
TL;DR: By improving knowledge retention, micro-learning supports learning through more easily accessible bites of information productively designed for an online environment.
Abstract: This article focuses on micro-learning and its effectiveness in online learning design. Faculty members in many universities incorporate micro-learning in their classes as it engages students with the subject matter and results in deeper learning, by encouraging them to connect the subject matter to their everyday lives as well as the world around them. By improving knowledge retention, micro-learning supports learning through more easily accessible bites of information productively designed for an online environment.

21 citations


Journal ArticleDOI
TL;DR: New techniques to enhance the TTTD chunking algorithm are presented, including a new fingerprint function, a multi-level hashing and matching technique, a new indexing technique to store the metadata, and a new hashing algorithm to solve the collision problem.
Abstract: Due to the fast, indiscriminate increase of digital data, data reduction has attracted increasing attention and become a popular approach in large-scale storage systems. One of the most effective approaches to data reduction is data deduplication, in which redundant data at the file or sub-file level is detected and identified by using a hash algorithm. Data deduplication has proven to be much more efficient than conventional compression techniques in large-scale storage systems in terms of space reduction. The Two Threshold Two Divisor (TTTD) chunking algorithm is one of the popular chunking algorithms used in deduplication, but it needs time and many system resources to compute its chunk boundaries. This paper presents new techniques to enhance the TTTD chunking algorithm: a new fingerprint function, a multi-level hashing and matching technique, and a new indexing technique to store the metadata. These techniques use four hashing algorithms to solve the collision problem and add a new chunk condition to the TTTD chunking conditions in order to increase the number of small chunks, which leads to a higher deduplication ratio. This enhancement improves the deduplication ratio produced by the TTTD algorithm and reduces the system resources it needs. The proposed algorithm is evaluated in terms of deduplication ratio, execution time, and metadata size.

11 citations
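
For context on the baseline being enhanced, here is a minimal, unoptimized Python sketch of the TTTD idea (minimum and maximum chunk-size thresholds plus a main and a backup divisor), assuming a simple additive rolling checksum in place of a real fingerprint function; the paper's multi-level hashing, indexing scheme, and additional chunk condition are not reproduced here.

    # Minimal TTTD-style content-defined chunking sketch (illustrative only).
    def tttd_chunks(data: bytes, t_min=2048, t_max=8192,
                    divisor=4096, backup_divisor=2048, window=48):
        chunks, start, backup, rolling = [], 0, -1, 0
        for i in range(len(data)):
            # Additive checksum over the last `window` bytes (stand-in fingerprint).
            rolling += data[i] - (data[i - window] if i >= window else 0)
            size = i - start + 1
            if size < t_min:                          # enforce the minimum threshold
                continue
            if rolling % backup_divisor == backup_divisor - 1:
                backup = i                            # remember a fallback breakpoint
            if rolling % divisor == divisor - 1:      # main divisor matched: cut here
                chunks.append(data[start:i + 1])
                start, backup = i + 1, -1
            elif size >= t_max:                       # forced cut at the maximum threshold
                cut = backup if backup >= start else i
                chunks.append(data[start:cut + 1])
                start, backup = cut + 1, -1
        if start < len(data):
            chunks.append(data[start:])               # trailing remainder
        return chunks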


Patent
03 May 2018
TL;DR: The authors proposed a joint many-task neural network model to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model.
Abstract: The technology disclosed provides a so-called "joint many-task neural network model" to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model. The model is successively trained by considering linguistic hierarchies, directly connecting word representations to all model layers, explicitly using predictions in lower tasks, and applying a so-called "successive regularization" technique to prevent catastrophic forgetting. Three examples of lower level model layers are part-of-speech (POS) tagging layer, chunking layer, and dependency parsing layer. Two examples of higher level model layers are semantic relatedness layer and textual entailment layer. The model achieves the state-of-the-art results on chunking, dependency parsing, semantic relatedness and textual entailment.

9 citations


Patent
03 May 2018
TL;DR: This article proposed a joint many-task neural network model to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model.
Abstract: The technology disclosed provides a so-called “joint many-task neural network model” to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model. The model is successively trained by considering linguistic hierarchies, directly connecting word representations to all model layers, explicitly using predictions in lower tasks, and applying a so-called “successive regularization” technique to prevent catastrophic forgetting. Three examples of lower level model layers are part-of-speech (POS) tagging layer, chunking layer, and dependency parsing layer. Two examples of higher level model layers are semantic relatedness layer and textual entailment layer. The model achieves the state-of-the-art results on chunking, dependency parsing, semantic relatedness and textual entailment.

6 citations


Proceedings ArticleDOI
TL;DR: This paper introduces a new annotated corpus based on an existing informal text corpus: the NUS SMS Corpus, and explores several graphical models, including a novel variant of the semi-Markov conditional random fields (semi-CRF) for the task of noun phrase chunking.
Abstract: This paper introduces a new annotated corpus based on an existing informal text corpus: the NUS SMS Corpus (Chen and Kan, 2013). The new corpus includes 76,490 noun phrases from 26,500 SMS messages, annotated by university students. We then explored several graphical models, including a novel variant of the semi-Markov conditional random fields (semi-CRF), for the task of noun phrase chunking. We demonstrated through empirical evaluations on the new dataset that the new variant yielded similar accuracy while requiring significantly less running time than the conventional semi-CRF.

5 citations


Posted Content
TL;DR: In this article, the authors propose NCRF transducers, which consist of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels.
Abstract: Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) have been developed to implement non-linear node potentials in CRFs while still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consist of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experimental results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all four tasks, and can improve state-of-the-art results.

4 citations


Proceedings Article
01 Aug 2018
TL;DR: In this article, the authors propose a Multi-Order BiLSTM model that takes longer-distance dependencies between tags into consideration, achieving state-of-the-art results in all-phrase chunking and highly competitive results on NER datasets.
Abstract: Existing neural models usually predict the tag of the current token independently of the neighboring tags. The popular LSTM-CRF model considers the tag dependencies between every two consecutive tags. However, it is hard for existing neural models to take longer-distance dependencies between tags into consideration. The scalability is mainly limited by the complex model structures and the cost of dynamic programming during training. In our work, we first design a new model called “high-order LSTM” to predict multiple tags for the current token, containing not only the current tag but also the previous several tags. We call the number of tags in one prediction the “order”. Then we propose a new method called Multi-Order BiLSTM (MO-BiLSTM) which combines low-order and high-order LSTMs. MO-BiLSTM remains scalable to high-order models thanks to a pruning technique. We evaluate MO-BiLSTM on all-phrase chunking and NER datasets. Experimental results show that MO-BiLSTM achieves the state-of-the-art result in chunking and highly competitive results on two NER datasets.

3 citations
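
To make the “order” notion concrete, here is a small hypothetical Python helper (not from the paper): each position's label becomes the tuple of the current tag plus the previous order-1 tags, padded at the start.

    # Illustrative only: turn an ordinary tag sequence into high-order labels.
    def to_high_order_tags(tags, order=2):
        padded = ["<pad>"] * (order - 1) + list(tags)
        return [tuple(padded[i:i + order]) for i in range(len(tags))]

    # to_high_order_tags(["B-NP", "I-NP", "O"], order=2)
    # -> [('<pad>', 'B-NP'), ('B-NP', 'I-NP'), ('I-NP', 'O')]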



Journal ArticleDOI
TL;DR: It is pointed out that for one such set of primitives, whose quantitative effectiveness has been demonstrated by analysis and computer simulation, emerging technologies for stimulation and recording are making it possible to test directly whether cortex is capable of performing them.

Posted Content
TL;DR: In this article, the authors present a system that helps reduce the effort associated with sourcing reference material and course creation by automatically generating learning objectives from content and creating descriptive content metadata to improve content discoverability.
Abstract: Instructional Systems Design is the practice of creating instructional experiences that make the acquisition of knowledge and skill more efficient, effective, and appealing. Specifically in designing courses, an hour of training material can require between 30 and 500 hours of effort in sourcing and organizing reference data for use in just the preparation of course material. In this paper, we present the first system of its kind that helps reduce the effort associated with sourcing reference material and course creation. We present algorithms for document chunking and automatic generation of learning objectives from content, creating descriptive content metadata to improve content discoverability. Unlike existing methods, the learning objectives generated by our system incorporate pedagogically motivated Bloom's verbs. We demonstrate the usefulness of our methods using real-world data from the banking industry and through a live deployment at a large pharmaceutical company.
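
As a rough illustration of what a document chunking step for course material might look like, the following is a hedged Python sketch assuming a simple sentence-count heuristic; the paper's actual chunking algorithm and Bloom's-verb objective generation are not reproduced here.

    import re

    # Hypothetical splitter: group sentences into fixed-size chunks of a document.
    def chunk_document(text: str, sentences_per_chunk: int = 5):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return [" ".join(sentences[i:i + sentences_per_chunk])
                for i in range(0, len(sentences), sentences_per_chunk)]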

Book ChapterDOI
26 May 2018
TL;DR: This paper presents a recurrent neural network (RNN) framework based on multi-perspective embeddings for Chinese chunking, which takes the character representation, part-of-speech (POS) embeddings and word embeddings as the input features of the RNN layer.
Abstract: Chunking is a crucial step in natural language processing (NLP), which aims to divide a text into syntactically correlated but non-overlapping chunks. The task is typically modeled as a sequence labeling problem. Various machine learning algorithms, such as Conditional Random Fields (CRFs) and Support Vector Machines (SVMs), have been successfully used for this task. However, these state-of-the-art chunking systems largely depend on hand-crafted appropriate features. In this paper, we present a recurrent neural network (RNN) framework based on multi-perspective embeddings for Chinese chunking. This framework takes the character representation, part-of-speech (POS) embeddings and word embeddings as the input features of the RNN layer. On top of the RNN, we use a CRF layer to jointly decode labels for the whole sentence. Experimental results show that various embeddings can improve the performance of the RNN model. Although our model uses these embeddings as the only features, it can be successfully used for Chinese chunking without any feature engineering efforts.
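
Below is a minimal PyTorch-style sketch of the multi-perspective input described above, under the assumption that character, POS and word embeddings are simply concatenated before a bidirectional RNN; the CRF decoding layer and the paper's actual character encoder are omitted.

    import torch
    import torch.nn as nn

    class MultiPerspectiveEncoder(nn.Module):
        def __init__(self, n_words, n_pos, n_chars, dim=64, hidden=128):
            super().__init__()
            self.word_emb = nn.Embedding(n_words, dim)
            self.pos_emb = nn.Embedding(n_pos, dim)
            # Simplification: one character id per token stands in for a
            # character-level encoder (the paper's character representation).
            self.char_emb = nn.Embedding(n_chars, dim)
            self.rnn = nn.LSTM(3 * dim, hidden, batch_first=True, bidirectional=True)

        def forward(self, words, pos, chars):
            # All inputs: LongTensors of shape (batch, sequence_length).
            x = torch.cat([self.word_emb(words),
                           self.pos_emb(pos),
                           self.char_emb(chars)], dim=-1)
            out, _ = self.rnn(x)
            return out  # per-token features, to be scored by a CRF layer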

Journal ArticleDOI
19 Sep 2018
TL;DR: Preliminary results are presented showing that it is feasible to compress linguistic data into chunks without significantly diminishing parsing performance, while potentially increasing parsing speed.
Abstract: We introduce a “Chunk-and-Pass” parsing technique influenced by a psycholinguistic model, where linguistic information is processed not word-by-word but rather in larger chunks of words. We present preliminary results showing that it is feasible to compress linguistic data into chunks without significantly diminishing parsing performance, while potentially increasing parsing speed.


Proceedings ArticleDOI
01 May 2018
TL;DR: A sampling-based chunking method is proposed and a tool named SmartChunker is developed to estimate the optimal chunking configuration for deduplication systems.
Abstract: Data backup is regularly required by both enterprise and individual users to protect their data from unexpected loss. There are also various commercial data deduplication systems or software products that help users eliminate duplicates in their backup data to save storage space. In data deduplication systems, the data chunking process splits data into small chunks. Duplicate data is identified by comparing the fingerprints of the chunks. The chunk size setting has a significant impact on deduplication performance. A variety of chunking algorithms have been proposed in recent studies. In practice, existing systems often set the chunking configuration in an empirical manner: a chunk size of 4KB or 8KB is regarded as the sweet spot for good deduplication performance. However, the data storage and access patterns of users vary and change over time; as a result, the empirical chunk size setting may not lead to a good deduplication ratio and sometimes results in difficulties of storage capacity planning. Moreover, it is difficult to make changes to the chunking settings once they are put into use, as duplicates in data with different chunk size settings cannot be eliminated directly. In this paper, we propose a sampling-based chunking method and develop a tool named SmartChunker to estimate the optimal chunking configuration for deduplication systems. Our evaluations on real-world datasets demonstrate the efficacy and efficiency of SmartChunker.
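
The sketch below illustrates, in rough form, the idea of estimating how a chunk size performs from a sample of the data; it assumes plain fixed-size chunking and SHA-1 fingerprints, and does not reproduce SmartChunker's actual method.

    import hashlib
    import random

    # Hypothetical estimator: deduplication ratio (logical / unique chunks)
    # computed over a random sample of fixed-size chunks.
    def estimated_dedup_ratio(data: bytes, chunk_size: int,
                              sample_rate: float = 0.1, seed: int = 0) -> float:
        rng = random.Random(seed)
        sampled_offsets = [o for o in range(0, len(data), chunk_size)
                           if rng.random() < sample_rate]
        fingerprints = [hashlib.sha1(data[o:o + chunk_size]).hexdigest()
                        for o in sampled_offsets]
        if not fingerprints:
            return 1.0
        return len(fingerprints) / len(set(fingerprints))

    # Comparing several candidate chunk sizes on the same sample seed gives a
    # rough ranking before committing to a configuration.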

01 Jan 2018
TL;DR: This paper will highlight a portion of a student project where information reduction provided understanding beyond initial student impressions and encouraged them to move forward.
Abstract: Studio projects increase from simple & straightforward to complex & indeterminate as undergraduate industrial design students progress through their educational experiences. As project complexity increases, students are faced with information overload and can struggle to move forward in a meaningful way. Complex Problem Solving studies and Cognitive Load Theory suggest information reduction as a way to grasp the critical aspects of a problem and move beyond the impasse inherent with too much information. Segmentation and chunking are common strategies for information reduction, but the abstraction inherent in the chunking process provides better conceptual understanding. The simplified but meaningful results from the chunking process can then be leveraged to create a model or framework that helps students organise and clarify what they have observed as well as point to new opportunities for design activity. Despite the fear of oversimplification, significantly abstracted models have great “explanatory or predictive power” and can lead to rich results. After reviewing the concepts of complexity and cognitive load, and contrasting segmentation with data chunking, this paper highlights a portion of a student project where information reduction provided understanding beyond initial student impressions and encouraged the students to move forward.

Journal ArticleDOI
TL;DR: The main contributions are the construction of an annotated corpus (vnEMR) and lexical resources in the medical domain and the improvement of the quality of the tools for clinical text analysis, including word segmentation, part-of-speech tagging and chunking.
Abstract: Clinical texts contain textual data recorded by doctors during medical examinations. Sentences in clinical texts are generally short, narrative, not strictly adhering to Vietnamese grammar, and contain many medical terms which are not present in general dictionaries. In this paper, we investigate the tasks of lexical analysis and phrase chunking for Vietnamese clinical texts. Although there exist several tools for general Vietnamese text analysis, these tools showed a limited quality in the clinical domain due to the specific grammatical style of clinical texts and the lack of medical vocabulary. Our main contributions are the construction of an annotated corpus (vnEMR) and lexical resources in the medical domain and, in consequence, the improvement of the quality of the tools for clinical text analysis, including word segmentation, part-of-speech tagging and chunking.

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This work discusses the differences between two proposed ideas about sequence learning and compares the ability of the corresponding models to predict inter-press time intervals (IPIs) measured in a discrete sequence production task, finding a greater ability of the chunk model to predict participants’ IPIs throughout learning.
Abstract: Many complex behaviors consist of sequentially ordered actions. When acquiring a novel sequential skill, the transitions between actions can be performed with increasing speed. This observation has led to the idea that the elementary actions are bound together during the learning process. Two ideas for this process have been proposed: first, statistical probabilities between different elementary actions could be acquired; second, discrete groupings of elementary actions, so-called chunks, could emerge with learning. We discuss the differences between these two ideas and compare the ability of the two models to predict inter-press time intervals (IPIs) measured in a discrete sequence production task. We find a greater ability of the chunk model to predict participants’ IPIs throughout learning.

Journal ArticleDOI
TL;DR: This paper thoroughly investigates the method introduced by Indig and Endrédy to find the best lexicalization level for chunking and to explore the behavior of different IOB representations.
Abstract: Lexicalization of the input of sequential taggers has gone a long way since it was invented by Molina and Pla [4]. In this paper we thoroughly investigate the method introduced by Indig and Endrédy [2] to find the best lexicalization level for chunking and to explore the behavior of different IOB representations. Both tasks are applied to the CoNLL-2000 dataset. Our goal is to introduce a transformation method that accommodates the parameters of the development set to the training set using their frequency distributions, from which other tasks like POS tagging or NER could benefit as well.
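
For readers unfamiliar with the IOB representations being compared, here is a small illustrative Python helper (an assumption about the standard IOB1/IOB2 schemes, not code from the paper): IOB2 marks every chunk-initial token with B-, while IOB1 uses B- only when a chunk directly follows another chunk of the same type.

    # Convert IOB2 tags to IOB1 (illustrative; assumes well-formed input).
    def iob2_to_iob1(tags):
        out = []
        for i, tag in enumerate(tags):
            if tag.startswith("B-"):
                prev = tags[i - 1] if i > 0 else "O"
                follows_same_type = prev != "O" and prev[2:] == tag[2:]
                out.append(tag if follows_same_type else "I-" + tag[2:])
            else:
                out.append(tag)
        return out

    # iob2_to_iob1(["B-NP", "I-NP", "B-NP", "O", "B-VP"])
    # -> ['I-NP', 'I-NP', 'B-NP', 'O', 'I-VP']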

Proceedings ArticleDOI
01 Feb 2018
TL;DR: A performance analysis of Rabin-based chunking and Rapid Asymmetric Maximum (RAM) chunking in terms of throughput is proposed for data de-duplication, which reduces storage utilization and handles data replication in the backup environment efficiently.
Abstract: Data de-duplication is an emerging technology that reduces storage utilization and handles data replication in the backup environment efficiently. Chunking is a method of breaking data into multiple pieces, and each chunk has a unique hash identifier. To check for data duplication, the hash identifier of a chunk is compared with those of previously stored chunks. This increases the efficiency of cloud storage. This project proposes a performance analysis of Rabin-based chunking and Rapid Asymmetric Maximum (RAM) chunking in terms of throughput. The analysis considers a case study of a population dataset in which the state and district fields are used as parameters to perform chunking and to analyze the de-duplication.
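
A rough sketch of the kind of throughput comparison described, assuming two chunking functions that share a data-to-chunk-list interface; the Rabin and RAM implementations themselves are not reproduced here.

    import time

    # Hypothetical harness: measure chunking throughput in MB/s for any chunker.
    def throughput_mb_per_s(chunker, data: bytes) -> float:
        start = time.perf_counter()
        chunker(data)
        elapsed = time.perf_counter() - start
        return (len(data) / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")

    # Example: calling throughput_mb_per_s with two different chunking functions
    # on the same payload reproduces the style of comparison reported above.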