
Showing papers on "Chunking (computing) published in 2018"


Proceedings Article
01 Aug 2018
TL;DR: This work reproduces twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conducts a systematic model comparison on three benchmarks, to reach several practical conclusions which can be useful to practitioners.
Abstract: We investigate the design challenges of constructing effective and efficient neural sequence labeling systems by reproducing twelve neural sequence labeling models, which include most of the state-of-the-art structures, and conducting a systematic model comparison on three benchmarks (i.e. NER, chunking, and POS tagging). Misconceptions and inconsistent conclusions in the existing literature are examined and clarified through statistical experiments. In the comparison and analysis process, we reach several practical conclusions which can be useful to practitioners.

169 citations
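
For orientation on how chunking appears as a sequence labeling benchmark in work like this, the following is a minimal illustrative Python sketch (not taken from the paper): each token carries a BIO tag, and chunk spans are recovered from the tag sequence.

    # Minimal illustration of chunking as sequence labeling with BIO tags.
    def bio_to_chunks(tokens, tags):
        chunks, start, label = [], None, None
        for i, tag in enumerate(tags + ["O"]):           # sentinel flushes the last chunk
            if tag.startswith("B-") or tag == "O" or (label and tag[2:] != label):
                if start is not None:
                    chunks.append((label, " ".join(tokens[start:i])))
                start, label = (i, tag[2:]) if tag != "O" else (None, None)
            elif start is None:                          # stray I- tag starts a new chunk
                start, label = i, tag[2:]
        return chunks

    # bio_to_chunks(["He", "reckons", "the", "deficit"], ["B-NP", "B-VP", "B-NP", "I-NP"])
    # -> [("NP", "He"), ("VP", "reckons"), ("NP", "the deficit")]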


Journal ArticleDOI
TL;DR: The results reveal a normative rationale for center-surround connectivity in working memory circuitry, call for reevaluation of memory performance differences that have previously been attributed to differences in capacity, and support a more nuanced view of visual working memory capacity limitations.
Abstract: The nature of capacity limits for visual working memory has been the subject of an intense debate that has relied on models that assume items are encoded independently. Here we propose that instead, similar features are jointly encoded through a "chunking" process to optimize performance on visual working memory tasks. We show that such chunking can: (a) facilitate performance improvements for abstract capacity-limited systems, (b) be optimized through reinforcement, (c) be implemented by center-surround dynamics, and (d) increase effective storage capacity at the expense of recall precision. Human performance on a variant of a canonical working memory task demonstrated performance advantages, precision detriments, interitem dependencies, and trial-to-trial behavioral adjustments diagnostic of performance optimization through center-surround chunking. Models incorporating center-surround chunking provided a better quantitative description of human performance in our study as well as in a meta-analytic dataset, and apparent differences in working memory capacity across individuals were attributable to individual differences in the implementation of chunking. Our results reveal a normative rationale for center-surround connectivity in working memory circuitry, call for reevaluation of memory performance differences that have previously been attributed to differences in capacity, and support a more nuanced view of visual working memory capacity limitations: strategic tradeoffs between storage capacity and memory precision through chunking contribute to flexible capacity limitations that include both discrete and continuous aspects.

71 citations


01 Jan 2018
TL;DR: By improving knowledge retention, micro-learning supports learning through more easily accessible bites of information productively designed for an online environment.
Abstract: This article focuses on micro-learning and its effectiveness in online learning design. Faculty members in many universities incorporate micro-learning in their classes as it engages students with the subject matter and results in deeper learning, by encouraging them to connect the subject matter to their everyday lives as well as the world around them. By improving knowledge retention, micro-learning supports learning through more easily accessible bites of information productively designed for an online environment.

21 citations


Journal ArticleDOI
TL;DR: New techniques to enhance the TTTD chunking algorithm are presented, including a new fingerprint function, a multi-level hashing and matching technique, a new indexing technique to store the metadata, and a new hashing algorithm to solve the collision problem.
Abstract: Due to the fast, indiscriminate increase of digital data, data reduction has attracted increasing attention and become a popular approach in large-scale storage systems. One of the most effective approaches to data reduction is data deduplication, in which redundant data at the file or sub-file level is detected and identified by using a hash algorithm. Data deduplication has proven to be much more efficient than conventional compression techniques in large-scale storage systems in terms of space reduction. The Two Threshold Two Divisor (TTTD) chunking algorithm is one of the popular chunking algorithms used in deduplication, but it needs time and many system resources to compute its chunk boundaries. This paper presents new techniques to enhance the TTTD chunking algorithm: a new fingerprint function, a multi-level hashing and matching technique, and a new indexing technique to store the metadata. These techniques use four hashing algorithms to solve the collision problem and add a new chunk condition to the TTTD chunking conditions in order to increase the number of small chunks, which leads to a higher deduplication ratio. This enhancement improves the deduplication ratio produced by the TTTD algorithm and reduces the system resources it needs. The proposed algorithm is evaluated in terms of deduplication ratio, execution time, and metadata size.

11 citations
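
For context on the baseline being enhanced, here is a minimal, unoptimized Python sketch of the TTTD idea (minimum and maximum chunk-size thresholds plus a main and a backup divisor), assuming a simple additive rolling checksum in place of a real fingerprint function; the paper's multi-level hashing, indexing scheme, and additional chunk condition are not reproduced here.

    # Minimal TTTD-style content-defined chunking sketch (illustrative only).
    def tttd_chunks(data: bytes, t_min=2048, t_max=8192,
                    divisor=4096, backup_divisor=2048, window=48):
        chunks, start, backup, rolling = [], 0, -1, 0
        for i in range(len(data)):
            # Additive checksum over the last `window` bytes (stand-in fingerprint).
            rolling += data[i] - (data[i - window] if i >= window else 0)
            size = i - start + 1
            if size < t_min:                          # enforce the minimum threshold
                continue
            if rolling % backup_divisor == backup_divisor - 1:
                backup = i                            # remember a fallback breakpoint
            if rolling % divisor == divisor - 1:      # main divisor matched: cut here
                chunks.append(data[start:i + 1])
                start, backup = i + 1, -1
            elif size >= t_max:                       # forced cut at the maximum threshold
                cut = backup if backup >= start else i
                chunks.append(data[start:cut + 1])
                start, backup = cut + 1, -1
        if start < len(data):
            chunks.append(data[start:])               # trailing remainder
        return chunks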


Patent
03 May 2018
TL;DR: The authors proposed a joint many-task neural network model to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model.
Abstract: The technology disclosed provides a so-called "joint many-task neural network model" to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model. The model is successively trained by considering linguistic hierarchies, directly connecting word representations to all model layers, explicitly using predictions in lower tasks, and applying a so-called "successive regularization" technique to prevent catastrophic forgetting. Three examples of lower level model layers are part-of-speech (POS) tagging layer, chunking layer, and dependency parsing layer. Two examples of higher level model layers are semantic relatedness layer and textual entailment layer. The model achieves the state-of-the-art results on chunking, dependency parsing, semantic relatedness and textual entailment.

9 citations


Patent
03 May 2018
TL;DR: This article proposed a joint many-task neural network model to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model.
Abstract: The technology disclosed provides a so-called “joint many-task neural network model” to solve a variety of increasingly complex natural language processing (NLP) tasks using growing depth of layers in a single end-to-end model. The model is successively trained by considering linguistic hierarchies, directly connecting word representations to all model layers, explicitly using predictions in lower tasks, and applying a so-called “successive regularization” technique to prevent catastrophic forgetting. Three examples of lower level model layers are part-of-speech (POS) tagging layer, chunking layer, and dependency parsing layer. Two examples of higher level model layers are semantic relatedness layer and textual entailment layer. The model achieves the state-of-the-art results on chunking, dependency parsing, semantic relatedness and textual entailment.

6 citations


Proceedings ArticleDOI
TL;DR: This paper introduces a new annotated corpus based on an existing informal text corpus: the NUS SMS Corpus, and explores several graphical models, including a novel variant of the semi-Markov conditional random fields (semi-CRF) for the task of noun phrase chunking.
Abstract: This paper introduces a new annotated corpus based on an existing informal text corpus: the NUS SMS Corpus (Chen and Kan, 2013). The new corpus includes 76,490 noun phrases from 26,500 SMS messages, annotated by university students. We then explored several graphical models, including a novel variant of the semi-Markov conditional random fields (semi-CRF), for the task of noun phrase chunking. We demonstrated through empirical evaluations on the new dataset that the new variant yielded similar accuracy while requiring significantly less running time than the conventional semi-CRF.

5 citations


Posted Content
TL;DR: In this article, the authors propose NCRF transducers, which consist of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels.
Abstract: Conditional random fields (CRFs) have been shown to be one of the most successful approaches to sequence labeling. Various linear-chain neural CRFs (NCRFs) have been developed to implement non-linear node potentials in CRFs while still keeping the linear-chain hidden structure. In this paper, we propose NCRF transducers, which consist of two RNNs, one extracting features from observations and the other capturing (theoretically infinite) long-range dependencies between labels. Different sequence labeling methods are evaluated over POS tagging, chunking and NER (English, Dutch). Experimental results show that NCRF transducers achieve consistent improvements over linear-chain NCRFs and RNN transducers across all four tasks, and can improve state-of-the-art results.

4 citations


Proceedings Article
01 Aug 2018
TL;DR: In this article, the authors propose a Multi-Order BiLSTM model that takes longer-distance dependencies between tags into consideration, achieving state-of-the-art results in all-phrase chunking and highly competitive results on NER datasets.
Abstract: Existing neural models usually predict the tag of the current token independently of the neighboring tags. The popular LSTM-CRF model considers the tag dependencies between every two consecutive tags. However, it is hard for existing neural models to take longer-distance dependencies between tags into consideration. The scalability is mainly limited by the complex model structures and the cost of dynamic programming during training. In our work, we first design a new model called “high-order LSTM” to predict multiple tags for the current token, containing not only the current tag but also the previous several tags. We call the number of tags in one prediction the “order”. Then we propose a new method called Multi-Order BiLSTM (MO-BiLSTM) which combines low-order and high-order LSTMs. MO-BiLSTM remains scalable to high-order models thanks to a pruning technique. We evaluate MO-BiLSTM on all-phrase chunking and NER datasets. Experimental results show that MO-BiLSTM achieves the state-of-the-art result in chunking and highly competitive results on two NER datasets.

3 citations
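
To make the “order” notion concrete, here is a small hypothetical Python helper (not from the paper): each position's label becomes the tuple of the current tag plus the previous order-1 tags, padded at the start.

    # Illustrative only: turn an ordinary tag sequence into high-order labels.
    def to_high_order_tags(tags, order=2):
        padded = ["<pad>"] * (order - 1) + list(tags)
        return [tuple(padded[i:i + order]) for i in range(len(tags))]

    # to_high_order_tags(["B-NP", "I-NP", "O"], order=2)
    # -> [('<pad>', 'B-NP'), ('B-NP', 'I-NP'), ('I-NP', 'O')]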



Journal ArticleDOI
TL;DR: It is pointed out that for one such set of primitives, whose quantitative effectiveness has been demonstrated by analysis and computer simulation, emerging technologies for stimulation and recording are making it possible to test directly whether cortex is capable of performing them.

Posted Content
TL;DR: In this article, the authors present a system that helps reduce the effort associated with sourcing reference material and course creation by automatically generating learning objectives from content and creating descriptive content metadata to improve content discoverability.
Abstract: Instructional Systems Design is the practice of creating instructional experiences that make the acquisition of knowledge and skill more efficient, effective, and appealing. Specifically in designing courses, an hour of training material can require between 30 and 500 hours of effort in sourcing and organizing reference data for use in just the preparation of course material. In this paper, we present the first system of its kind that helps reduce the effort associated with sourcing reference material and course creation. We present algorithms for document chunking and automatic generation of learning objectives from content, creating descriptive content metadata to improve content discoverability. Unlike existing methods, the learning objectives generated by our system incorporate pedagogically motivated Bloom's verbs. We demonstrate the usefulness of our methods using real-world data from the banking industry and through a live deployment at a large pharmaceutical company.
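
As a rough illustration of what a document chunking step for course material might look like, the following is a hedged Python sketch assuming a simple sentence-count heuristic; the paper's actual chunking algorithm and Bloom's-verb objective generation are not reproduced here.

    import re

    # Hypothetical splitter: group sentences into fixed-size chunks of a document.
    def chunk_document(text: str, sentences_per_chunk: int = 5):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return [" ".join(sentences[i:i + sentences_per_chunk])
                for i in range(0, len(sentences), sentences_per_chunk)]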

Book ChapterDOI
26 May 2018
TL;DR: This paper presents a recurrent neural network (RNN) framework based on multi-perspective embeddings for Chinese chunking, which takes the character representation, part-of-speech (POS) embeddings and word embeddings as the input features of the RNN layer.
Abstract: Chunking is a crucial step in natural language processing (NLP), which aims to divide a text into syntactically correlated but non-overlapping chunks. The task is typically modeled as a sequence labeling problem. Various machine learning algorithms, such as Conditional Random Fields (CRFs) and Support Vector Machines (SVMs), have been successfully used for this task. However, these state-of-the-art chunking systems largely depend on hand-crafted appropriate features. In this paper, we present a recurrent neural network (RNN) framework based on multi-perspective embeddings for Chinese chunking. This framework takes the character representation, part-of-speech (POS) embeddings and word embeddings as the input features of the RNN layer. On top of the RNN, we use a CRF layer to jointly decode labels for the whole sentence. Experimental results show that various embeddings can improve the performance of the RNN model. Although our model uses these embeddings as the only features, it can be successfully used for Chinese chunking without any feature engineering efforts.
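
Below is a minimal PyTorch-style sketch of the multi-perspective input described above, under the assumption that character, POS and word embeddings are simply concatenated before a bidirectional RNN; the CRF decoding layer and the paper's actual character encoder are omitted.

    import torch
    import torch.nn as nn

    class MultiPerspectiveEncoder(nn.Module):
        def __init__(self, n_words, n_pos, n_chars, dim=64, hidden=128):
            super().__init__()
            self.word_emb = nn.Embedding(n_words, dim)
            self.pos_emb = nn.Embedding(n_pos, dim)
            # Simplification: one character id per token stands in for a
            # character-level encoder (the paper's character representation).
            self.char_emb = nn.Embedding(n_chars, dim)
            self.rnn = nn.LSTM(3 * dim, hidden, batch_first=True, bidirectional=True)

        def forward(self, words, pos, chars):
            # All inputs: LongTensors of shape (batch, sequence_length).
            x = torch.cat([self.word_emb(words),
                           self.pos_emb(pos),
                           self.char_emb(chars)], dim=-1)
            out, _ = self.rnn(x)
            return out  # per-token features, to be scored by a CRF layer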

Journal ArticleDOI
19 Sep 2018
TL;DR: Preliminary results are presented showing that it is feasible to compress linguistic data into chunks without significantly diminishing parsing performance, while potentially increasing parsing speed.
Abstract: We introduce a “Chunk-and-Pass” parsing technique influenced by a psycholinguistic model, where linguistic information is processed not word-by-word but rather in larger chunks of words. We present preliminary results showing that it is feasible to compress linguistic data into chunks without significantly diminishing parsing performance, while potentially increasing parsing speed.


Proceedings ArticleDOI
01 May 2018
TL;DR: A sampling-based chunking method is proposed and a tool named SmartChunker is developed to estimate the optimal chunking configuration for deduplication systems.
Abstract: Data backup is regularly required by both enterprise and individual users to protect their data from unexpected loss. There are also various commercial data deduplication systems or software products that help users eliminate duplicates in their backup data to save storage space. In data deduplication systems, the data chunking process splits data into small chunks. Duplicate data is identified by comparing the fingerprints of the chunks. The chunk size setting has a significant impact on deduplication performance. A variety of chunking algorithms have been proposed in recent studies. In practice, existing systems often set the chunking configuration in an empirical manner: a chunk size of 4KB or 8KB is regarded as the sweet spot for good deduplication performance. However, the data storage and access patterns of users vary and change over time; as a result, the empirical chunk size setting may not lead to a good deduplication ratio and sometimes results in difficulties of storage capacity planning. Moreover, it is difficult to make changes to the chunking settings once they are put into use, as duplicates in data with different chunk size settings cannot be eliminated directly. In this paper, we propose a sampling-based chunking method and develop a tool named SmartChunker to estimate the optimal chunking configuration for deduplication systems. Our evaluations on real-world datasets demonstrate the efficacy and efficiency of SmartChunker.
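
The sketch below illustrates, in rough form, the idea of estimating how a chunk size performs from a sample of the data; it assumes plain fixed-size chunking and SHA-1 fingerprints, and does not reproduce SmartChunker's actual method.

    import hashlib
    import random

    # Hypothetical estimator: deduplication ratio (logical / unique chunks)
    # computed over a random sample of fixed-size chunks.
    def estimated_dedup_ratio(data: bytes, chunk_size: int,
                              sample_rate: float = 0.1, seed: int = 0) -> float:
        rng = random.Random(seed)
        sampled_offsets = [o for o in range(0, len(data), chunk_size)
                           if rng.random() < sample_rate]
        fingerprints = [hashlib.sha1(data[o:o + chunk_size]).hexdigest()
                        for o in sampled_offsets]
        if not fingerprints:
            return 1.0
        return len(fingerprints) / len(set(fingerprints))

    # Comparing several candidate chunk sizes on the same sample seed gives a
    # rough ranking before committing to a configuration.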

01 Jan 2018
TL;DR: This paper will highlight a portion of a student project where information reduction provided understanding beyond initial student impressions and encouraged them to move forward.
Abstract: Studio projects increase from simple & straightforward to complex & indeterminate as undergraduate industrial design students progress through their educational experiences. As project complexity increases, students are faced with information overload and can struggle to move forward in a meaningful way. Complex Problem Solving studies and Cognitive Load Theory suggest information reduction as a way to grasp the critical aspects of a problem and move beyond the impasse inherent with too much information. Segmentation and chunking are common strategies for information reduction, but the abstraction inherent in the chunking process provides better conceptual understanding. The simplified but meaningful results from the chunking process can then be leveraged to create a model or framework that helps students organise and clarify what they have observed as well as point to new opportunities for design activity. Despite the fear of oversimplification, significantly abstracted models have great “explanatory or predictive power” and can lead to rich results. After reviewing the concepts of complexity and cognitive load, and contrasting segmentation with data chunking, this paper highlights a portion of a student project where information reduction provided understanding beyond initial student impressions and encouraged the students to move forward.

Journal ArticleDOI
TL;DR: The main contributions are the construction of an annotated corpus (vnEMR) and lexical resources in the medical domain and the improvement of the quality of the tools for clinical text analysis, including word segmentation, part-of-speech tagging and chunking.
Abstract: Clinical texts contain textual data recorded by doctors during medical examinations. Sentences in clinical texts are generally short, narrative, not strictly adhering to Vietnamese grammar, and contain many medical terms which are not present in general dictionaries. In this paper, we investigate the tasks of lexical analysis and phrase chunking for Vietnamese clinical texts. Although there exist several tools for general Vietnamese text analysis, these tools showed a limited quality in the clinical domain due to the specific grammatical style of clinical texts and the lack of medical vocabulary. Our main contributions are the construction of an annotated corpus (vnEMR) and lexical resources in the medical domain and, in consequence, the improvement of the quality of the tools for clinical text analysis, including word segmentation, part-of-speech tagging and chunking.

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This work discusses the differences between two proposed ideas about sequence learning and compares the ability of the corresponding models to predict inter-press time intervals (IPIs) measured in a discrete sequence production task, finding a greater ability of the chunk model to predict participants’ IPIs throughout learning.
Abstract: Many complex behaviors consist of sequentially ordered actions. When acquiring a novel sequential skill, the transitions between actions can be performed with increasing speed. This observation has led to the idea that the elementary actions are bound together during the learning process. Two ideas for this process have been proposed: first, statistical probabilities between different elementary actions could be acquired; second, discrete groupings of elementary actions, so-called chunks, could emerge with learning. We discuss the differences between these two ideas and compare the ability of the two models to predict inter-press time intervals (IPIs) measured in a discrete sequence production task. We find a greater ability of the chunk model to predict participants’ IPIs throughout learning.

Journal ArticleDOI
TL;DR: This paper thoroughly investigates the method introduced by Indig and Endrédy to find the best lexicalization level for chunking and to explore the behavior of different IOB representations.
Abstract: Lexicalization of the input of sequential taggers has gone a long way since it was invented by Molina and Pla [4]. In this paper we thoroughly investigate the method introduced by Indig and Endrédy [2] to find the best lexicalization level for chunking and to explore the behavior of different IOB representations. Both tasks are applied to the CoNLL-2000 dataset. Our goal is to introduce a transformation method that accommodates the parameters of the development set to the training set using their frequency distributions, from which other tasks like POS tagging or NER could benefit as well.
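
For readers unfamiliar with the IOB representations being compared, here is a small illustrative Python helper (an assumption about the standard IOB1/IOB2 schemes, not code from the paper): IOB2 marks every chunk-initial token with B-, while IOB1 uses B- only when a chunk directly follows another chunk of the same type.

    # Convert IOB2 tags to IOB1 (illustrative; assumes well-formed input).
    def iob2_to_iob1(tags):
        out = []
        for i, tag in enumerate(tags):
            if tag.startswith("B-"):
                prev = tags[i - 1] if i > 0 else "O"
                follows_same_type = prev != "O" and prev[2:] == tag[2:]
                out.append(tag if follows_same_type else "I-" + tag[2:])
            else:
                out.append(tag)
        return out

    # iob2_to_iob1(["B-NP", "I-NP", "B-NP", "O", "B-VP"])
    # -> ['I-NP', 'I-NP', 'B-NP', 'O', 'I-VP']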

Proceedings ArticleDOI
01 Feb 2018
TL;DR: A performance analysis of Rabin-based chunking and Rapid Asymmetric Maximum (RAM) chunking in terms of throughput is proposed for data de-duplication, which reduces storage utilization and handles data replication in the backup environment efficiently.
Abstract: Data de-duplication is an emerging technology that reduces storage utilization and handles data replication in the backup environment efficiently. Chunking is a method of breaking data into multiple pieces, and each chunk has a unique hash identifier. To check for data duplication, the hash identifier of a chunk is compared with those of previously stored chunks. This increases the efficiency of cloud storage. This project proposes a performance analysis of Rabin-based chunking and Rapid Asymmetric Maximum (RAM) chunking in terms of throughput. The analysis considers a case study of a population dataset in which the state and district fields are used as parameters to perform chunking and to analyze the de-duplication.
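
A rough sketch of the kind of throughput comparison described, assuming two chunking functions that share a data-to-chunk-list interface; the Rabin and RAM implementations themselves are not reproduced here.

    import time

    # Hypothetical harness: measure chunking throughput in MB/s for any chunker.
    def throughput_mb_per_s(chunker, data: bytes) -> float:
        start = time.perf_counter()
        chunker(data)
        elapsed = time.perf_counter() - start
        return (len(data) / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")

    # Example: calling throughput_mb_per_s with two different chunking functions
    # on the same payload reproduces the style of comparison reported above.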