
Showing papers on "Chunking (computing) published in 2016"


Journal ArticleDOI
Abstract: © 2016 Gobet, Lloyd-Kelly and Lane. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).

36 citations


Journal ArticleDOI
TL;DR: In this article, the structural properties by which language minimizes dependency distance were investigated; it is shown that chunking may significantly reduce the mean dependency distance of linear sequences, suggesting that language may have evolved a mechanism of dynamic chunking to reduce complexity for the sake of efficient communication.
Abstract: Natural language is a complex adaptive system with multiple levels. The hierarchical structure may have much to do with the complexity of language. Dependency distance has been invoked to explain various linguistic patterns regarding syntactic complexity. However, little attention has been paid to how the structural properties of language help to minimize dependency distance. This article computationally simulates several chunked artificial languages and shows, through comparison with Mandarin Chinese, that chunking may significantly reduce the mean dependency distance of linear sequences. These results suggest that language may have evolved a mechanism of dynamic chunking to reduce complexity for the sake of efficient communication. © 2016 Wiley Periodicals, Inc. Complexity 21: 33–41, 2016
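
Mean dependency distance, the quantity these simulations target, is commonly computed as the average absolute difference between the linear position of each dependent and that of its head, with the root excluded. A minimal Python sketch with an invented five-word sentence (the head assignments are illustrative only):

```python
def mean_dependency_distance(heads):
    """Mean dependency distance (MDD) of one sentence.

    heads[i] is the 1-based position of the head of the word at
    position i + 1; 0 marks the root, which is excluded from the mean.
    """
    distances = [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
    return sum(distances) / len(distances)

# Hypothetical 5-word sentence whose second word is the root.
print(mean_dependency_distance([2, 0, 2, 5, 3]))  # -> 1.25
```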

27 citations


Proceedings Article
12 Feb 2016
TL;DR: A joint model that performs segmentation, POS-tagging and chunking simultaneously, and employs a semi-supervised method to derive chunk cluster features from large-scale automatically-chunked data to address the sparsity of full chunk features.
Abstract: Chinese chunking has traditionally been solved by assuming gold standard word segmentation. We find that the accuracies drop drastically when automatic segmentation is used. Inspired by the fact that chunking knowledge can potentially improve segmentation, we explore a joint model that performs segmentation, POS-tagging and chunking simultaneously. In addition, to address the sparsity of full chunk features, we employ a semi-supervised method to derive chunk cluster features from large-scale automatically-chunked data. Results show the effectiveness of the joint model with semi-supervised features.

25 citations


Patent
09 Mar 2016
TL;DR: In this article, small objects are efficiently stored with erasure codes by combining a small object with other small objects and/or large objects to form a single large object for chunking, and providing early notification of permanent storage to the sources of the objects to prevent small objects from becoming stale while waiting for additional objects to be combined.
Abstract: Small objects are efficiently stored with erasure codes by combining a small object with other small objects and/or large objects to form a single large object for chunking, and by providing early notification of permanent storage to the sources of the objects to prevent small objects from becoming stale while waiting for additional objects to be combined.
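
The mechanism described can be illustrated with a toy packing step: accumulate small objects until a size threshold is reached, then hand the combined blob off for chunking and erasure coding. All names, the threshold and the callback below are invented for illustration and are not the patented design:

```python
class SmallObjectPacker:
    """Pack small objects into one large object before chunking/erasure coding."""

    def __init__(self, target_size=4 * 1024 * 1024, flush=lambda blob: None):
        self.target_size = target_size
        self.flush = flush          # called with the combined blob to chunk/encode
        self.pending = []           # (object_id, payload) awaiting packing

    def put(self, object_id, payload):
        self.pending.append((object_id, payload))
        # An early acknowledgement to the source would be sent here,
        # before the object is actually packed and erasure coded.
        if sum(len(p) for _, p in self.pending) >= self.target_size:
            self._pack()

    def _pack(self):
        blob = b"".join(payload for _, payload in self.pending)
        self.flush(blob)            # hand the large object off for chunking
        self.pending.clear()

packer = SmallObjectPacker(target_size=10,
                           flush=lambda blob: print(len(blob), "bytes packed"))
packer.put("a", b"hello")
packer.put("b", b"world!")          # crosses the threshold, triggers packing
```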

23 citations


Journal ArticleDOI
TL;DR: It is found that participants who adopted a consistent chunking strategy during symbolic sequence learning showed a greater improvement in their performance and a larger decrease in cognitive workload over time, indicating that chunking is a cost-saving strategy that enhances the effectiveness of symbolic sequence learning.
Abstract: Chunking, namely the grouping of sequence elements in clusters, is ubiquitous during sequence processing, but its impact on performance remains debated. Here, we found that participants who adopted a consistent chunking strategy during symbolic sequence learning showed a greater improvement in their performance and a larger decrease in cognitive workload over time. Stronger reliance on chunking was also associated with higher scores in a working-memory (WM) updating task, suggesting the contribution of WM gating mechanisms to sequence chunking. Altogether, these results indicate that chunking is a cost-saving strategy that enhances the effectiveness of symbolic sequence learning.

19 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: A technique to extract keywords from educational video transcripts from MOOCs is discussed, based on a regular-expression grammar-rule approach that identifies the noun chunks in the text of the transcript.
Abstract: Keyword extraction is an important task when working with text data. Extracted keywords help the reader judge the important parts of a text without going through the whole text. In this paper a technique to extract keywords from educational video transcripts from MOOCs is discussed. The technique is based on a regular-expression grammar-rule approach that identifies the noun chunks in the transcript text. Extracted keywords help in finding the specifically important parts of the educational material.
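
The core of the method is a regular-expression grammar over POS tags that marks noun chunks. A minimal sketch of that idea with NLTK's RegexpParser; the grammar rule and example sentence are illustrative, not the authors' exact rule (requires the NLTK 'punkt' and 'averaged_perceptron_tagger' data):

```python
from nltk import RegexpParser, pos_tag, word_tokenize

# Illustrative rule: a noun chunk is an optional determiner,
# any number of adjectives, then one or more nouns.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

sentence = "The lecture introduces supervised learning and decision trees."
tree = chunker.parse(pos_tag(word_tokenize(sentence)))

# Keep the noun chunks as candidate keywords.
keywords = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(keywords)
```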

15 citations



Journal ArticleDOI
TL;DR: Christiansen & Chater propose that language comprehenders must immediately compress perceptual data by "chunking" them into higher-level categories; effective language understanding, however, requires maintaining perceptual information long enough to integrate it with downstream cues.
Abstract: Christiansen & Chater (C&C) propose that language comprehenders must immediately compress perceptual data by "chunking" them into higher-level categories. Effective language understanding, however, requires maintaining perceptual information long enough to integrate it with downstream cues. Indeed, recent results suggest comprehenders do this. Although cognitive systems are undoubtedly limited, frameworks that do not take into account the tasks that these systems evolved to solve risk missing important insights.

10 citations


Book ChapterDOI
22 Aug 2016
TL;DR: An intelligent data management framework that can facilitate the development of highly scalable and mobile healthcare applications for remote monitoring of patients is presented; this is achieved through a global log data abstraction that seamlessly leverages the storage and processing capabilities of edge devices and the cloud.
Abstract: We present an intelligent data management framework that can facilitate the development of highly scalable and mobile healthcare applications for remote monitoring of patients. This is achieved through the use of a global log data abstraction that leverages the storage and processing capabilities of the edge devices and the cloud in a seamless manner. In existing log-based storage systems, data is read as fixed-size chunks from the cloud to enhance performance. However, in healthcare applications, where the data access patterns of end users differ widely, this approach leads to unnecessary storage and cost overheads. To overcome these, we propose dynamic log chunking. The experimental results, comparing existing fixed chunking against the H-Plane model, show 13%–19% savings in network bandwidth as well as cost when fetching data from the cloud.
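
A toy sketch of what adapting the chunk size to a reader's recent access pattern might look like; the policy, window length and size bounds below are assumptions for illustration, not the H-Plane design:

```python
class DynamicChunker:
    """Pick the next chunk size from a reader's recent access pattern."""

    def __init__(self, min_size=4_096, max_size=1_048_576):
        self.min_size = min_size
        self.max_size = max_size
        self.recent_reads = []          # bytes actually consumed per request

    def record_read(self, bytes_consumed):
        # Keep only the last ten requests as the access-pattern window.
        self.recent_reads = (self.recent_reads + [bytes_consumed])[-10:]

    def next_chunk_size(self):
        if not self.recent_reads:
            return self.min_size        # no history: fetch conservatively
        # Fetch roughly what this reader has been consuming lately,
        # clamped so we neither thrash nor over-fetch.
        avg = sum(self.recent_reads) / len(self.recent_reads)
        return int(min(self.max_size, max(self.min_size, avg)))

chunker = DynamicChunker()
for consumed in [8_000, 12_000, 9_500]:    # hypothetical reader behaviour
    chunker.record_read(consumed)
print(chunker.next_chunk_size())           # ~9833 bytes instead of a fixed chunk
```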

9 citations


Journal ArticleDOI
TL;DR: This work builds a bio-plausible hierarchical chunking of sequential memory (HCSM) model and uncovers that a chunking mechanism reduces the requirements on synaptic plasticity, since it allows synapses with a narrow dynamic range and low precision to be used for a memory task.
Abstract: Chunking refers to a phenomenon whereby individuals group items together when performing a memory task to improve the performance of sequential memory. In this work, we build a bio-plausible hierarchical chunking of sequential memory (HCSM) model to explain why such improvement happens. We address this issue by linking hierarchical chunking with synaptic plasticity and neuromorphic engineering. We uncover that a chunking mechanism reduces the requirements on synaptic plasticity, since it allows synapses with a narrow dynamic range and low precision to be used for a memory task. We validate a hardware version of the model through simulation, based on measured memristor behavior with narrow dynamic range in neuromorphic circuits, which reveals how chunking works and what role it plays in encoding sequential memory. Our work deepens the understanding of sequential memory and enables it to be incorporated into the investigation of brain-inspired computing on neuromorphic architectures.

9 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: The authors' system (DTSim) uses a Conditional Random Fields based chunker and applies rules blended with semantic similarity methods in order to predict chunk alignments, alignment types and similarity scores.
Abstract: In this paper we describe our system (DTSim) submitted to SemEval-2016 Task 2: Interpretable Semantic Textual Similarity (iSTS). We participated in both the gold chunks category (texts chunked by human experts and provided by the task organizers) and the system chunks category (participants had to automatically chunk the input texts). We developed a Conditional Random Fields based chunker and applied rules blended with semantic similarity methods in order to predict chunk alignments, alignment types and similarity scores. Our system obtained an F1 score of up to 0.648 in predicting the chunk alignment types and scores together and was one of the top-performing systems overall.
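
A minimal sketch of training a CRF chunker over per-token feature dictionaries, using the sklearn-crfsuite package; the toy features and training sentence are illustrative and not necessarily what DTSim used:

```python
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(tokens, pos_tags, i):
    """A deliberately tiny feature set: word form plus local POS context."""
    return {
        "word.lower": tokens[i].lower(),
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i < len(tokens) - 1 else "EOS",
    }

# One toy training sentence with IOB chunk labels.
tokens = ["A", "man", "plays", "the", "guitar"]
pos = ["DT", "NN", "VBZ", "DT", "NN"]
labels = ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP"]

X = [[token_features(tokens, pos, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # predicted chunk labels for the toy sentence
```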

Proceedings ArticleDOI
01 Aug 2016
TL;DR: The computational formulation of chunking in the C-ULM is described, followed by results of simulation studies examining the impact of chunking versus no chunking on agent learning and agent effectiveness.
Abstract: Chunking has emerged as a basic property of human cognition. Computationally, chunking has been proposed as a process for compressing information; it has also been identified in neural processes in the brain and used in models of these processes. Our purpose in this paper is to expand understanding of how chunking impacts both learning and performance using the Computational-Unified Learning Model (C-ULM), a multi-agent computational model. Chunks in C-ULM long-term memory result from the updating of concept connection weights via statistical learning. Concept connection weight values move toward the accurate weight value needed for a task. A confusion interval reflecting certainty in the weight value is shortened each time a concept is attended in working memory and each time a task is solved, and is lengthened when a chunk is not retrieved over a number of cycles and each time a task solution attempt fails. The dynamic tension between these updating mechanisms allows chunks to come to represent the history of relative frequency of co-occurrence for the concept connections present in the environment, thereby encoding the statistical regularities of the environment in the long-term memory chunk network. In this paper, the computational formulation of chunking in the C-ULM is described, followed by results of simulation studies examining the impact of chunking versus no chunking on agent learning and agent effectiveness. Then, conclusions and implications of the work both for understanding human learning and for applications within cognitive informatics, artificial intelligence, and cognitive computing are discussed.
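
As a reading aid only, the weight-plus-confusion-interval mechanism sketched above might look roughly as follows; every name, rate and rule here is an assumption made for illustration and is not the C-ULM implementation:

```python
class ChunkLink:
    """Toy concept-connection weight with a confusion interval."""

    def __init__(self, weight=0.5, interval=1.0):
        self.weight = weight        # current connection weight
        self.interval = interval    # uncertainty about that weight

    def attend_or_solve(self, target_weight, rate=0.1):
        # Attention in working memory / a solved task: move the weight
        # toward the task's accurate value and shrink the interval.
        self.weight += rate * (target_weight - self.weight)
        self.interval *= 0.9

    def miss_or_fail(self):
        # No retrieval for a while, or a failed solution attempt:
        # grow the interval again (certainty decays).
        self.interval = min(1.0, self.interval * 1.1)

link = ChunkLink()
for _ in range(5):
    link.attend_or_solve(target_weight=0.8)
link.miss_or_fail()
print(round(link.weight, 3), round(link.interval, 3))
```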

Proceedings ArticleDOI
27 Jun 2016
TL;DR: It is shown that the gains brought by algorithms that focus aggressively on DER often come at a significant cost in terms of throughput, which argues for future optimizations that take throughput into account and for making balanced tradeoffs between DER and throughput.
Abstract: Data deduplication techniques are often used by cloud storage systems to reduce network bandwidth and storage requirements. As a consequence, the current research literature tends to focus most of its algorithmic efforts on improving the Duplicate Elimination Ratio (DER), which reflects the compression achieved using a given algorithm. Yet, the importance of this indicator tends to be overestimated, while another key indicator, namely throughput, tends to be underestimated. To substantiate this claim, we reimplement a selection of popular Content-Defined Chunking (CDC) algorithms and perform a detailed performance analysis. On this basis, we show that the gains brought by algorithms that focus aggressively on DER often come at a significant cost in terms of throughput. As a consequence, we advocate for future optimizations that take throughput into account and for making balanced tradeoffs between DER and throughput.
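
Content-defined chunking places boundaries where a rolling hash of the recent bytes satisfies a condition, so boundaries survive insertions and deletions. A minimal sketch in the spirit of gear-hash CDC (parameters are invented; this is not a reimplementation of any of the benchmarked algorithms):

```python
import random

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # fixed per-byte hash table

def cdc_chunks(data, mask=0x1FFF, min_size=2048, max_size=65536):
    """Yield (start, end) offsets of content-defined chunks of `data`.

    A chunk boundary is declared where the low bits of a rolling,
    gear-style hash are all zero, subject to minimum and maximum sizes.
    """
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield start, i + 1
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)

data = bytes(random.getrandbits(8) for _ in range(200_000))
sizes = [end - begin for begin, end in cdc_chunks(data)]
print(len(sizes), sum(sizes) == len(data))  # chunk count and a coverage check
```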

Proceedings ArticleDOI
01 Aug 2016
TL;DR: A new Unlimited Content-Defined Chunking (UCDC) algorithm is designed, comprising file chunking, file comparing and file merging, and its effectiveness is evaluated with a simulation example that produces a description of the file.
Abstract: Nowadays, data-centric systems play an increasingly important role in blog sharing, content delivery, news broadcasting, file synchronization, and so on. Because of the amount of data generated within such systems, data backup and archiving have become challenging tasks. A main method for addressing the problem is chunking-based deduplication, which eliminates redundant data and reduces the total storage space. In this paper, we summarize several ways of file differencing and then design a new Unlimited Content-Defined Chunking (UCDC) algorithm, which comprises file chunking, file comparing and file merging. We evaluate the effectiveness of UCDC with a simulation example that produces a description of the file.

Book ChapterDOI
03 Apr 2016
TL;DR: A less studied side of phrase chunking is investigated, i.e. voting between different currently available taggers, the checking of invalid sequences, and how the state-of-the-art method can be adapted to morphologically rich, agglutinative languages.
Abstract: The CoNLL-2000 dataset is the de facto standard dataset for measuring chunkers on the task of chunking base noun phrases (NP) and arbitrary phrases. The state-of-the-art tagging method utilises TnT, an HMM-based part-of-speech (POS) tagger, with simple majority voting over different representations and fine-grained classes created by lexicalising tags. In this paper the state-of-the-art English phrase chunking method was deeply investigated, re-implemented and evaluated with several modifications. We also investigated a less studied side of phrase chunking, i.e. voting between different currently available taggers, the checking of invalid sequences, and how the state-of-the-art method can be adapted to morphologically rich, agglutinative languages.
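
Two of the ingredients examined here, voting between taggers and checking invalid label sequences, are easy to sketch for IOB-encoded chunks; the repair rule below is one common convention, not necessarily the one used in the paper:

```python
from collections import Counter

def majority_vote(predictions):
    """Per-token majority vote over several taggers' IOB sequences."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

def repair_iob(tags):
    """Fix invalid sequences: an I- tag that does not continue a chunk becomes B-."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed

tagger_outputs = [
    ["B-NP", "I-NP", "O",    "I-NP"],
    ["B-NP", "I-NP", "B-VP", "I-NP"],
    ["B-NP", "O",    "O",    "I-NP"],
]
voted = majority_vote(tagger_outputs)
print(voted)              # ['B-NP', 'I-NP', 'O', 'I-NP']
print(repair_iob(voted))  # ['B-NP', 'I-NP', 'O', 'B-NP']
```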

Patent
16 Jun 2016
TL;DR: In this paper, the authors present a system and methods for providing distinct conversations within a file activity feed for display on a user interface of a client computing device, where a file created with an application may be rendered on the user interface.
Abstract: Aspects of the present disclosure relate to systems and methods for providing distinct conversations within a file activity feed for display on a user interface of a client computing device. A file created with an application may be rendered on the user interface. The file may include at least a chat pane comprising a plurality of chat messages and a file activity feed including one or more activities associated with the file. It may be determined when a distinct conversation begins and ends within the chat pane. The distinct conversation may include at least some of the plurality of chat messages. In response to determining when the distinct conversation begins and ends, the distinct conversation may be recorded as a distinct conversation activity associated with the file. The distinct conversation activity may be displayed within the file activity feed.

Patent
29 Apr 2016
TL;DR: In this article, the query is chunked or broken down into a sequence of smaller chunked queries and the chunked results of those smaller queries are streamed back to the requestor.
Abstract: Instead of processing a query as-is, the query is chunked or broken down into a sequence of smaller chunked queries and the chunked results of those smaller queries are streamed back to the requestor. Chunking the query and streaming the chunked results can substantially decrease the user's time to value when running a query by returning some immediate results for display which are refined and eventually converge on the full results as each chunked query runs.
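
The pattern is: run a sequence of smaller queries and stream each partial result as soon as it is ready. A toy sketch of that pattern over a date-range query; everything here, including the per-day split, is an invented illustration of the idea rather than the patented method:

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List

def run_query_chunked(start: date, end: date,
                      run_one_day: Callable[[date], List[dict]]) -> Iterator[List[dict]]:
    """Split a [start, end) date-range query into per-day chunked queries
    and stream each chunk's rows back as soon as that chunk completes."""
    day = start
    while day < end:
        yield run_one_day(day)     # partial results appear immediately
        day += timedelta(days=1)

# Hypothetical per-day backend call.
def fake_backend(day: date) -> List[dict]:
    return [{"day": day.isoformat(), "rows": 42}]

for partial in run_query_chunked(date(2016, 1, 1), date(2016, 1, 4), fake_backend):
    print(partial)   # results refine toward the full answer chunk by chunk
```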

Patent
06 Jul 2016
TL;DR: In this article, a Vietnamese chunking method based on conditional random fields and transformation-based learning is proposed to improve chunking performance for Vietnamese sentences; it can support work such as phrase trees, semantic analysis, machine translation and the like.
Abstract: The invention relates to a Vietnamese chunking method based on conditional random fields and transformation-based learning, and belongs to the technical field of natural language processing. The method comprises the following steps: first, preprocessing Vietnamese corpora to obtain sentence-level Vietnamese chunking training corpora; extracting the sentence-level training corpora from a database and performing chunking modeling on them to obtain a Vietnamese chunking conditional random field model; obtaining a set of transformation rules; and performing chunk labelling on the sentence-level Vietnamese test corpora to be chunked, using the established conditional random field model and the obtained transformation rules, to obtain the Vietnamese chunk labelling result. The method realizes effective chunking analysis of Vietnamese sentences and paves the way for work such as phrase trees, semantic analysis, machine translation and the like; compared with an existing Vietnamese chunking tool, the method is markedly improved in accuracy, recall and F value.

Book ChapterDOI
21 Sep 2016
TL;DR: The realization of the proposed model offers a new view of the task of automatic spelling correction, and allows the elimination of alternatives generated by the system according to a morphological dictionary.
Abstract: The present paper concerns syntactic support for spelling correction in Russian and English. The syntactic model used in the research is chunking over dependency trees, since chunking has great potential for the goal of our study. In particular, it does not require a complete description of the syntactic model. Moreover, it allows the elimination of alternatives generated by the system according to a morphological dictionary. Thus, the realization of the proposed model offers a new view of the task of automatic spelling correction.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: Presented at Interspeech 2016, San Francisco, United States of America, 9–12 September 2016.

01 Jun 2016
TL;DR: This document specifies a chunking protocol for dividing a user payload into CCNx Content Objects, including the naming convention to use for the chunked payload and a field added to a Content Object to represent the last chunk of an object.
Abstract: This document specifies a chunking protocol for dividing a user payload into CCNx Content Objects. This includes a specification of the naming convention to use for the chunked payload and of a field added to a Content Object to represent the last chunk of an object.
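
A hedged sketch of the general scheme: each Content Object carries a chunk-number name component, and a field records the last chunk number so the consumer knows when the object is complete. The names, sizes and dictionary layout below are illustrative, not normative values from the specification:

```python
def chunk_payload(name, payload, chunk_size=1200):
    """Split `payload` into named chunks, marking the last chunk number."""
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    last = len(chunks) - 1
    objects = []
    for number, body in enumerate(chunks):
        obj = {"name": f"{name}/chunk={number}", "payload": body}
        if number == last:
            obj["end_chunk_number"] = last   # field identifying the final chunk
        objects.append(obj)
    return objects

objects = chunk_payload("ccnx:/example/video", b"x" * 3000)
print([o["name"] for o in objects])
print(objects[-1]["end_chunk_number"])   # 2
```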

Journal Article
TL;DR: The proposed approach to the Source Retrieval task of an External Plagiarism Detection System includes chunking of documents based on paragraphs, along with part-of-speech tagging and an efficient download filtering method, and exhibited improved efficiency in PAN 2015, conducted by the PAN CLEF evaluation lab.
Abstract: Source retrieval is an important task in an external plagiarism detection system: it involves identifying a set of candidate source documents for a given suspicious document. It is crucial not to lose any actual source document while reducing the size of the candidate source document set. This paper describes an approach to the source retrieval task of an external plagiarism detection system. The approach includes chunking of documents based on paragraphs, along with part-of-speech tagging and an efficient download filtering method. The proposed system is evaluated against the PAN 2011-12, PAN 2012-13 and PAN 2014-15 test data sets, and the results are analysed and compared using standard PAN measures: recall, precision, F-measure, and the average numbers of queries and downloads. The proposed approach exhibited improved efficiency in PAN 2015, conducted by the PAN CLEF evaluation lab, acquiring the highest values for F-measure and precision along with the fewest downloads. The results are further improved by incorporating efficient query and download filtering mechanisms into the proposed system. The effect of the enhanced system is also discussed and analysed in this paper.
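
The paragraph-based chunking used for query formulation can be sketched as follows: split the suspicious document into paragraph chunks and build one query per chunk from its most frequent content words. The paper additionally uses part-of-speech tagging; this sketch substitutes a plain stop-word filter, and the stop-word list and query length are arbitrary:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def paragraph_queries(document, words_per_query=5):
    """One search query per paragraph chunk of the suspicious document."""
    queries = []
    for paragraph in document.split("\n\n"):
        words = [w.lower().strip(".,;:") for w in paragraph.split()]
        content = [w for w in words if w and w not in STOPWORDS]
        top = [w for w, _ in Counter(content).most_common(words_per_query)]
        if top:
            queries.append(" ".join(top))
    return queries

doc = ("Plagiarism detection compares documents.\n\n"
       "Source retrieval finds candidate source documents for a suspicious document.")
print(paragraph_queries(doc))
```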

Proceedings Article
01 Dec 2016
TL;DR: A method to use chunkers to develop a cross-lingual parser for Bengali which results in an improvement of unlabelled attachment score (UAS) from 65.1 (baseline parser) to 78.2.
Abstract: While statistical methods have been very effective in developing NLP tools, the use of linguistic tools and understanding of language structure can make these tools better. Cross-lingual parser construction has been used to develop parsers for languages with no annotated treebank. Delexicalized parsers that use only POS tags can be transferred to a new target language. But the success of a delexicalized transfer parser depends on the syntactic closeness between the source and target languages. The understanding of the linguistic similarities and differences between the languages can be used to improve the parser. In this paper, we use a method based on cross-lingual model transfer to transfer a Hindi parser to Bengali. The technique does not need any parallel corpora but makes use of chunkers of these languages. We observe that while the two languages share broad similarities, Bengali and Hindi phrases do not have identical construction. We can improve the transfer based parser if the parser is transferred at the chunk level. Based on this we present a method to use chunkers to develop a cross-lingual parser for Bengali which results in an improvement of unlabelled attachment score (UAS) from 65.1 (baseline parser) to 78.2.

Book ChapterDOI
01 Jan 2016
TL;DR: This paper presents an experimental study of various chunking algorithms, since chunking plays a very important role in data redundancy elimination systems.
Abstract: Data deduplication, also known as data redundancy elimination, is a technique for saving storage space. Data deduplication systems are highly successful in backup storage environments, where a large number of redundancies may exist. These redundancies can be eliminated by finding and comparing fingerprints. The comparison of fingerprints may be done at the file level, or the files may be split into chunks and the comparison done at the chunk level. File-level deduplication yields poorer results than chunk-level deduplication, since it computes a hash over the entire file and eliminates only duplicate files. This paper focuses on an experimental study of various chunking algorithms, since chunking plays a very important role in data redundancy elimination systems.

Proceedings ArticleDOI
01 Jan 2016
TL;DR: Based on a comparison over different parameters, variable-size chunking algorithms are the deduplication techniques best suited to the backup operation.
Abstract: Data is the most imperative asset of any organization, whether for productive use or to make more profit. The rapid growth of varied data is a serious issue to handle and process: data is being generated at such a high rate that it needs to be stored in databases without duplication. Deduplication is a technique that removes duplicated data from databases and supports backup of the data. In data deduplication, numerous algorithms are available that detect and eliminate redundant data and store a single unique copy of the data contents. Various chunking techniques are used for backup operations and to perform deduplication of the data, such as fixed-size chunking, whole-file chunking and content-defined chunking. Backup operations are performed using these chunking techniques, and the techniques are compared with each other to determine the one best suited to the backup job. In this paper we present a performance evaluation of various deduplication techniques over a matrix of performance parameters: deduplication ratio, deduplication time, hashing time, chunking time and throughput. The analysis results provide guidelines for adopting the best deduplication technique to clear away duplicate data. After comparing these chunking techniques on the different parameters, it is concluded that variable-size chunking algorithms are the deduplication techniques best suited to the backup operation.
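
One of the parameters compared above, the deduplication ratio, can be illustrated by fingerprinting fixed-size chunks with SHA-256 and measuring how much of the raw data still has to be stored (papers often report the inverse, raw size over stored size). The chunk size and synthetic data are arbitrary choices for illustration:

```python
import hashlib

def dedup_ratio_fixed(data: bytes, chunk_size: int = 4096) -> float:
    """Fraction of raw bytes that must still be stored after fixed-size chunking."""
    seen = set()
    stored = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in seen:        # only unique chunks are stored
            seen.add(fingerprint)
            stored += len(chunk)
    return stored / len(data)

# A backup stream with heavy repetition deduplicates well.
data = (b"A" * 4096 + b"B" * 4096) * 100
print(dedup_ratio_fixed(data))   # 0.01: only two unique 4 KiB chunks are kept
```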

Proceedings ArticleDOI
05 Jul 2016
TL;DR: The application of this dynamic log chunking mechanism, based on reader access patterns and domain-specific data characteristics, to remote patient monitoring in bandwidth-starved rural areas is shown to result in bandwidth and cost savings of 14% without affecting prefetch performance.
Abstract: Time series data from sensor devices are increasingly stored in log data structures across the cloud and mobile devices. Currently, log data is accessed as chunks of fixed size, which enhances performance by prefetching of data. However, in applications such as remote monitoring of patients using mobile devices, the data requirements of end users vary significantly depending upon their roles. The fixed chunking approach would lead to unnecessary data downloads due to the dynamic variability of data access. Also, requests are more often than not based on fixed time chunks that do not necessarily translate to a fixed data size. To overcome this challenge, we present a dynamic log chunking mechanism based on reader access patterns and domain-specific data characteristics. The application of this method in the area of remote patient monitoring in bandwidth-starved rural areas is shown to result in bandwidth and cost savings of 14% without affecting the prefetch performance.


29 Sep 2016
TL;DR: This paper compares and discusses notions such as chunking, intuition, emergent grammar and ad-hoc constructions, based on chosen texts from the respective fields of study.
Abstract: Located at the intersection of applied linguistics and more formal language theory, this paper draws a parallel between concepts applied to grasp ELF and increasingly influential usage-based approaches to grammar. More precisely, I compare and discuss notions such as chunking, intuition, emergent grammar and ad-hoc constructions. The discussion is based on chosen texts from the respective fields of study: basically, Sinclair and Mauranen’s book on Linear Unit Grammar and the work of Joan Bybee.