
Showing papers on "Chunking (computing)" published in 2013


Patent
11 Jan 2013
TL;DR: In this patent, a method and system for providing unified local storage support for file and cloud access is described, which comprises writing a chunk on a storage server and replicating the chunk to other selected storage servers when necessary.
Abstract: A method and system for providing unified local storage support for file and cloud access is disclosed. The method comprises writing a chunk on a storage server, and replicating the chunk to other selected storage servers when necessary. The method and system further comprise writing a version manifest on the storage server; replicating the version manifest to other selected storage servers when necessary. Object puts or appends are implemented by first chunking the object, determining if the chunks are new, transferring the chunks if required, followed by creation of a new version manifest referencing the chunks. Finally, the method and system include providing concurrent file-oriented read and write access consistent with the stored version manifests and chunks.
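The put/append flow described above (chunk the object, determine whether each chunk is new, transfer only the new chunks, then create a version manifest referencing them) can be illustrated with a minimal sketch. This is not the patented implementation: the fixed chunk size, SHA-256 content addressing, and the in-memory chunk_store / manifest_store dictionaries are illustrative assumptions.

```python
import hashlib

CHUNK_SIZE = 1 << 20  # illustrative 1 MiB fixed-size chunks

def put_object(name, data, chunk_store, manifest_store):
    """Sketch of a chunked object put: store only chunks not seen before,
    then record a version manifest that references them in order."""
    manifest = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in chunk_store:      # "transfer" the chunk only if it is new
            chunk_store[chunk_id] = chunk
        manifest.append(chunk_id)
    version = len(manifest_store.get(name, []))
    manifest_store.setdefault(name, []).append(manifest)
    return version

def get_object(name, version, chunk_store, manifest_store):
    """Reassemble an object version from its manifest."""
    return b"".join(chunk_store[cid] for cid in manifest_store[name][version])

chunk_store, manifest_store = {}, {}
v0 = put_object("report.bin", b"A" * 3_000_000, chunk_store, manifest_store)
assert get_object("report.bin", v0, chunk_store, manifest_store) == b"A" * 3_000_000
```

Replication of chunks and manifests to other storage servers, and the concurrent file-oriented access the patent claims, are outside the scope of this sketch.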

70 citations


Journal ArticleDOI
TL;DR: This study investigates high-frequency organizing chunks in ELF corpora using the Linear Unit Grammar (LUG) framework and finds that the lower-frequency organizing chunks showed a higher rate of approximation and a larger number of unique forms, while the higher-frequency chunks were primarily attested in conventional forms in both written and spoken ELF.
Abstract: An ongoing discussion in ELF research is the ability of ELF speakers to store and retrieve holistic chunks of language, facilitating efficient and fluent production of speech. These questions involve the frequency effects of formulaic chunks of language and their varying degrees of entrenchment for ELF users. In addition, the variable forms in which these chunks may be attested can be treated as approximations of conventional chunks, while serving identical functions. This study addresses these issues by investigating high-frequency organizing chunks in ELF corpora using the Linear Unit Grammar (LUG) framework (Sinclair and Mauranen 2006). Drawing data from the ELFA corpus of spoken academic ELF, the study also considers organizing chunks in written academic ELF from the nascent WrELFA corpus. With ENL comparison data taken from the Michigan Corpus of Academic Spoken English (MICASE), findings are presented on the forms and frequencies of textual and interactive organizing chunks in ELF, with implications for the reality of frequency effects and their connection to distributions of approximated chunks. The lower-frequency organizing chunks showed a higher rate of approximation and a larger number of unique forms, while the higher-frequency chunks were primarily attested in conventional forms in both written and spoken ELF.

56 citations


Journal ArticleDOI
TL;DR: A general procedure for the analysis of naturalistic driving data, called chunking, is presented that can support many such analyses by increasing their robustness and sensitivity and create a solid basis for further data analyses.

36 citations


Journal ArticleDOI
TL;DR: This paper studies the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs, applied to two datasets of RNA with known secondary structures as well as a family of viral genome RNAs whose structures have not been predicted before.
Abstract: Ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Their secondary structures are crucial for the RNA functionality, and the prediction of the secondary structures is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs that apply to two datasets of RNA with known secondary structures, which include both pseudoknotted and non-pseudoknotted sequences, as well as a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment. On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted using chunking is more similar to the real structure than the secondary structure predicted by using the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program using the centered chunking method. The performance analysis for 14 long RNA sequences from the Nodaviridae virus family outlines how the coarse-grained mapping of chunking and predictions in the MapReduce framework exhibits shorter turnaround times for short RNA sequences. However, as the lengths of the RNA sequences increase, the fine-grained mapping can surpass the coarse-grained mapping in performance. By using our MapReduce framework together with statistical analysis on the accuracy retention results, we observe how the inversion-based chunking methods can outperform predictions using the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.
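As a rough illustration of the chunk-predict-reconstruct pipeline the abstract describes, the sketch below cuts a sequence into consecutive chunks, "predicts" each chunk independently, and concatenates the results. The chunk length, the trivial placeholder predictor, and the plain concatenation step are assumptions; the paper's actual pipeline uses real thermodynamic predictors (e.g., NUPACK), several chunking variants, and a Hadoop MapReduce reconstruction stage.

```python
def chunk_sequence(seq, chunk_len=120):
    """Cut a long RNA sequence into consecutive chunks (simplified; the paper
    also evaluates centered and inversion-based chunking schemes)."""
    return [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]

def predict_chunk(chunk):
    """Placeholder for an external predictor such as NUPACK or RNAfold.
    Here it simply returns an unpaired structure of the same length."""
    return "." * len(chunk)

def predict_by_chunking(seq, chunk_len=120):
    """Predict each chunk independently (the parallelizable 'map' step) and
    reconstruct the full dot-bracket string by concatenation."""
    return "".join(predict_chunk(c) for c in chunk_sequence(seq, chunk_len))

structure = predict_by_chunking("AUGC" * 500)
assert len(structure) == 2000
```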

21 citations


Journal ArticleDOI
Ider Lkhagvasuren, Jungmin So, Jeong-Gun Lee, Chuck Yoo, Young Woong Ko
TL;DR: An algorithm and data structure are presented for a deduplication method that efficiently eliminates identical data between files residing on different machines, at a high deduplication rate and in a short time.
Abstract: This paper presents an algorithm and data structure for a deduplication method that efficiently eliminates identical data between files residing on different machines, at a high deduplication rate and in a short time. The algorithm quickly predicts identical regions between source and destination files, verifies them, and transfers only those blocks that prove to be unique. The key to fast, scalable duplicate detection is that data are expressed as fixed-size block chunks which are distributed into an “Index-table” according to the chunks’ boundary values on both sides. The “Index-table” is a fixed-size table structure in which a chunk’s boundary byte values serve as its cell’s row and column numbers. Experimental results show that the proposed solution improves data deduplication performance and substantially reduces the required storage capacity.
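A minimal sketch of the boundary-value index idea described above: chunks are placed in a 256x256 table addressed by their first and last byte values, so a cheap boundary lookup can rule out most non-duplicates before any full comparison. The 4KB block size, SHA-1 hashing, and set-valued cells are illustrative assumptions, not details from the paper.

```python
import hashlib

BLOCK = 4096  # illustrative fixed chunk size; the paper's size may differ

def build_index_table(data):
    """Sketch of the 'Index-table': a 256x256 grid addressed by a chunk's
    first and last byte values, each cell holding the hashes of the chunks
    that landed there."""
    table = [[set() for _ in range(256)] for _ in range(256)]
    for off in range(0, len(data), BLOCK):
        chunk = data[off:off + BLOCK]
        if chunk:
            table[chunk[0]][chunk[-1]].add(hashlib.sha1(chunk).hexdigest())
    return table

def likely_duplicate(chunk, table):
    """Fast check: only chunks whose boundary bytes hit a non-empty cell need
    a full hash comparison at all."""
    cell = table[chunk[0]][chunk[-1]]
    return bool(cell) and hashlib.sha1(chunk).hexdigest() in cell

src = bytes(range(256)) * 64
table = build_index_table(src)
assert likely_duplicate(src[:BLOCK], table)
```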

10 citations


Book ChapterDOI
20 Oct 2013
TL;DR: Three distinct approaches are tried for chunking transcribed oral data with labeling tools learnt from a corpus of written texts, aiming to reach the best possible results with the least possible manual correction or re-learning effort.
Abstract: In this paper, we try three distinct approaches to chunk transcribed oral data with labeling tools learnt from a corpus of written texts. The purpose is to reach the best possible results with the least possible manual correction or re-learning effort.

8 citations


Proceedings ArticleDOI
08 Sep 2013
TL;DR: This paper addresses the issues of how incrementally chunking learned action rules of increasing length and complexity can assist in solving problems of ever greater complexity by employing a micro-world with simple objects and simplified physical behaviors.
Abstract: In this paper we address the issues of how incrementally chunking learned action rules of increasing length and complexity can assist in solving problems of ever greater complexity. To this end, we employ a micro-world with simple objects and simplified physical behaviors. The agent first learns some basic elemental rules capturing the fundamental physical behaviors of the agent itself, the objects and their interactions. Then, some moderately complex problems such as going from a start state to a goal state that do not require too many steps are given to the system and the system uses a standard search process (e.g., A*) to find solutions which do not require too much search time because the problems are relatively simple. The solutions are then remembered as "chunked" rules of taking a sequence of actions to achieve a certain goal. Later, when a more complex problem - one that requires many steps to solve - is encountered, the chunked rules discovered earlier can be used to greatly reduce the search space by providing chunked sub-steps. Problem solving for complex problems without the chunking process would be impossible, as the search space would be combinatorially large.
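A toy sketch of the chunking idea described above, under heavy simplifying assumptions: a one-dimensional world, breadth-first search instead of A*, and macro-actions reduced to their net effect. The only point it illustrates is that a solution found for an easy problem can be reused as a single "chunked" action that shortens the search for harder problems.

```python
from collections import deque

# Primitive actions in a 1-D toy world; the state is just an integer position.
PRIMITIVES = {"left": -1, "right": +1}

def search(start, goal, macros=None):
    """Breadth-first search over primitive actions plus any 'chunked'
    macro-actions (action sequences remembered from earlier, easier problems)."""
    actions = dict(PRIMITIVES)
    for name, seq in (macros or {}).items():
        actions[name] = sum(PRIMITIVES[a] for a in seq)   # net effect of the chunk
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for name, delta in actions.items():
            nxt = state + delta
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [name]))
    return None

# A simple problem is solved first and its solution is remembered as a chunk...
learned = {"right_x3": search(0, 3)}            # ['right', 'right', 'right']
# ...which then shortens the plans found for a harder problem.
print(search(0, 9, macros=learned))             # e.g. ['right_x3', 'right_x3', 'right_x3']
```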

4 citations




Journal ArticleDOI
TL;DR: This work proposes a novel Improved Frequency Based Chunking (IFBC) algorithm for data de-duplication based on the FBC algorithm, and shows that IFBC significantly improves performance compared with FBC.
Abstract: Inspired by the idea of hierarchical substring caching, we propose a novel Improved Frequency Based Chunking (IFBC) algorithm for data de-duplication, based on the FBC algorithm proposed in "Frequency Based Chunking for Data De-Duplication". We then conducted extensive experiments, which show that the IFBC algorithm substantially improves performance compared with the FBC algorithm.
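The abstract gives no algorithmic detail of FBC or IFBC, so the sketch below only illustrates the general frequency-based chunking idea: count sliding-window fingerprints in a first pass, then cut wherever one of the most frequent fingerprints occurs. The window width, the top-k selection, and the direct use of raw byte windows as fingerprints are all illustrative assumptions.

```python
from collections import Counter

WINDOW = 8   # illustrative fingerprint width
TOP_K = 2    # illustrative number of 'frequent' fingerprints used as cut points

def frequency_based_chunks(data):
    """Minimal illustration of frequency-based chunking: a first pass counts
    sliding-window fingerprints, and a second pass cuts wherever one of the
    most frequent fingerprints occurs, so recurring content aligns on chunk
    boundaries."""
    windows = [data[i:i + WINDOW] for i in range(len(data) - WINDOW + 1)]
    frequent = {w for w, _ in Counter(windows).most_common(TOP_K)}
    chunks, start = [], 0
    for i, w in enumerate(windows):
        if w in frequent and i > start:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

data = (b"HEADERxx" + b"payload-123") * 5
print([len(c) for c in frequency_based_chunks(data)])
```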

2 citations


Proceedings Article
01 Sep 2013
TL;DR: This approach attacks the difficulty of acquiring more complex longer rules when inducing inversion transduction grammars via unsupervised bottom-up chunking, by augmenting its model search with top-down segmentation that minimizes CDL, resulting in significant translation accuracy gains.
Abstract: We present an unsupervised learning model that induces phrasal inversion transduction grammars by introducing a minimum conditional description length (CDL) principle to drive search over a space defined by two opposing extreme types of ITGs. Our approach attacks the difficulty of acquiring more complex longer rules when inducing inversion transduction grammars via unsupervised bottom-up chunking, by augmenting its model search with top-down segmentation that minimizes CDL, resulting in significant translation accuracy gains. Chunked rules tend to be relatively short; long rules are hard to learn through chunking, as the smaller parts of the long rules may not necessarily be good translations themselves. Our objective criterion is a conditional adaptation of the notion of description length, that is conditioned on a fixed preexisting model, in this case the initial chunked ITG. The notion of minimum CDL (MCDL) facilitates a novel strategy for avoiding the pitfalls of premature pruning in chunking approaches, by incrementally splitting an ITG with reference to a second ITG that conditions this search.

01 Jan 2013
TL;DR: This paper proposes and evaluates two ways of combining a symbolic model and a statistical model learnt by a CRF, and shows that in both cases they benefit from one another.
Abstract: Symbolic and statistical learning for chunking: comparison and combinations. We describe in this paper how to use grammatical inference algorithms for chunking, then compare and combine them with CRFs (Conditional Random Fields), which are known to be efficient for this task. Our corpus is extracted from the French Treebank. We propose and evaluate two ways of combining a symbolic model and a statistical model learnt by a CRF, and show that in both cases they benefit from one another.


01 Jun 2013
TL;DR: The use of grammatical inference algorithms for the chunking task, which are then compared with and combined with CRFs (Conditional Random Fields), whose effectiveness for this task is well established.
Abstract: We describe in this article the use of grammatical inference algorithms for the chunking task, then compare and combine them with CRFs (Conditional Random Fields), whose effectiveness for this task is well established. Our corpus is extracted from the French Treebank. We propose and evaluate two different ways of combining a symbolic model and a statistical model learnt by a CRF, and show that in both cases they benefit from one another.
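The two entries above combine a symbolic chunker with a CRF without the abstracts specifying how. One simple, hypothetical combination is sketched below: convert each model's BIO output to chunk spans and keep only the chunks both models agree on (an intersection vote). The span encoding and the intersection rule are assumptions for illustration, not the paper's actual combination strategies.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence to a set of (start, end, label) chunk spans."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last chunk
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.add((start, i, label))
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # an "I-" tag simply extends the current chunk
    return spans

def intersect_chunkers(symbolic_tags, crf_tags):
    """One simple combination: keep only the chunks both models agree on."""
    return sorted(bio_to_spans(symbolic_tags) & bio_to_spans(crf_tags))

symbolic = ["B-NP", "I-NP", "O", "B-VP", "O"]
crf      = ["B-NP", "I-NP", "B-PP", "B-VP", "O"]
print(intersect_chunkers(symbolic, crf))   # [(0, 2, 'NP'), (3, 4, 'VP')]
```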


Proceedings Article
01 Aug 2013
TL;DR: This work uses crowdsourcing to obtain query and sentence chunking and shows that entailment can not only be used as an effective evaluation metric to assess the quality of annotations, but it can also be employed to filter out noisy annotations.
Abstract: Hierarchical or nested annotation of linguistic data often co-exists with simpler non-hierarchical or flat counterparts, a classic example being that of annotations used for parsing and chunking. In this work, we propose a general strategy for comparing across these two schemes of annotation using the concept of entailment that formalizes a correspondence between them. We use crowdsourcing to obtain query and sentence chunking and show that entailment can not only be used as an effective evaluation metric to assess the quality of annotations, but it can also be employed to filter out noisy annotations.
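The abstract does not spell out the entailment formalization, so the following is one plausible, hypothetical reading: a flat chunking is entailed by a nested annotation if every chunk span coincides with a constituent span of the nested structure. The nested-list tree encoding is an assumption made for the sketch.

```python
def nested_spans(tree, start=0):
    """Collect the (start, end) spans of all constituents in a nested
    annotation given as nested lists of tokens, e.g. [['the', 'cat'], ['sat']]."""
    spans, pos = set(), start
    for node in tree:
        if isinstance(node, list):
            child, pos = nested_spans(node, pos)
            spans |= child
        else:
            pos += 1
    spans.add((start, pos))
    return spans, pos

def chunking_entailed(chunks, tree):
    """A flat chunking is 'entailed' by the nested annotation if every chunk
    span coincides with some constituent span of the tree (one possible
    formalization; the paper's own definition may differ)."""
    spans, _ = nested_spans(tree)
    return all(span in spans for span in chunks)

tree = [["the", "cat"], ["sat", ["on", "the", "mat"]]]
print(chunking_entailed([(0, 2), (3, 6)], tree))   # True
print(chunking_entailed([(1, 3)], tree))           # False: crosses constituents
```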

01 Jan 2013
TL;DR: This paper links the structuring of decision problems in MCDM to the theory of chunking, which describes how human cognition structures and perceives environmental information, and proposes that the validity of models representing multi-criteria decision problems can be assessed by evaluating the degree to which they match the structures formed by chunking.
Abstract: The first steps of multi-criteria decision making (MCDM) are typically the decomposition and structuring of the decision problem at hand. As all subsequent process steps of MCDM are based on the initial structuring of the decision problem, the validity of the structure representing the decision problem is of particular importance for the quality of the decision making process. This paper seeks to further develop our understanding of validity in structuring multi-criteria decisions. For this purpose, we link the structuring of decision problems in MCDM to the theory of chunking, which describes how human cognition structures and perceives environmental information. Based on this, we propose that the validity of models representing multi-criteria decision problems can be assessed by evaluating the degree to which they match the structures formed by chunking. We discuss a preliminary framework of how the match between the cognitive and the MCDM model can be tested. To demonstrate how this framework can be utilized in research practice, we apply it to empirically show that algorithmic, bottom-up structuring of MCDM problems leads to valid goal-criteria hierarchies.

Patent
07 May 2013
TL;DR: In this article, a computer-implemented method for parallel content-defined data chunking may include identifying a data stream to be chunked, splitting the data stream into a plurality of data sub-streams by alternatingly dividing consecutive bytes of the data streams among the plurality of substreams.
Abstract: A computer-implemented method for parallel content-defined data chunking may include (1) identifying a data stream to be chunked, (2) splitting the data stream into a plurality of data sub-streams by alternatingly dividing consecutive bytes of the data stream among the plurality of data sub-streams, and (3) chunking, in parallel, each data sub-stream within the plurality of data sub-streams into a plurality of data segments using a content-defined chunking algorithm. Various other methods, systems, and computer-readable media are also disclosed.
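A minimal sketch of the patent's two steps: byte-interleave the stream into sub-streams, then apply a content-defined chunker to each sub-stream (here serially; each call could run on a separate worker). The MD5-based cut condition, divisor, and minimum chunk length are stand-ins for whatever rolling-hash scheme an actual implementation would use.

```python
import hashlib

def split_round_robin(data, n):
    """Byte-interleave the stream into n sub-streams: byte i goes to
    sub-stream i mod n, as in the alternating split described above."""
    return [data[i::n] for i in range(n)]

def content_defined_chunks(stream, divisor=64, min_len=16):
    """A minimal stand-in for a content-defined chunker: cut at positions
    where a hash of the last 4 bytes is 0 modulo `divisor`, subject to a
    minimum chunk length (real implementations use a rolling hash)."""
    chunks, start, i = [], 0, min_len
    while i < len(stream):
        fp = int.from_bytes(hashlib.md5(stream[i - 4:i]).digest()[:2], "big")
        if i - start >= min_len and fp % divisor == 0:
            chunks.append(stream[start:i])
            start = i
        i += 1
    chunks.append(stream[start:])
    return chunks

data = bytes(range(256)) * 8
sub_streams = split_round_robin(data, 4)
# Each sub-stream could be handed to a separate worker; done serially here.
print([len(content_defined_chunks(s)) for s in sub_streams])
```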



Proceedings ArticleDOI
01 Nov 2013
TL;DR: This paper proposes an approach that should identify copied fragments of text faster and more accurately than standard plagiarism detection approaches: it first identifies topic-related pairs of text documents and then selects for further processing only those pairs that discuss a similar topic.
Abstract: Plagiarism is a serious problem, especially in the academic environment. Basically, we define this problem as the theft of somebody else's work or ideas. In this paper we focus on plagiarism in the domain of student assignments written in natural language. We propose an approach that should identify copied fragments of text data faster and more accurately than standard approaches. We first identify topic-related pairs of text documents and then select for further processing those pairs that discuss a similar topic. We experimented with different chunking methods in the comparison process to overcome typical problems such as shorter fragments of text copied from other documents. The results show that our approach is more suitable for plagiarism detection than a standard n-gram method.
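A compact sketch of the two-stage idea above, with hypothetical details: a cheap vocabulary-overlap filter selects topic-related document pairs, and only those pairs are compared chunk by chunk (overlapping 5-word windows here) to surface verbatim copied fragments. The Jaccard threshold and chunk size are illustrative, not the paper's settings.

```python
import re

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def topic_similarity(a, b):
    """Cheap topic filter: Jaccard overlap of the two documents' vocabularies.
    Only pairs above a threshold go on to the costlier chunk comparison."""
    wa, wb = set(words(a)), set(words(b))
    return len(wa & wb) / max(1, len(wa | wb))

def word_chunks(text, size=5):
    """Overlapping word chunks (5-word windows here) used for comparison."""
    toks = words(text)
    return {" ".join(toks[i:i + size]) for i in range(len(toks) - size + 1)}

def shared_chunks(a, b, size=5):
    """Return the word chunks that occur verbatim in both documents."""
    return word_chunks(a, size) & word_chunks(b, size)

doc1 = "Chunking splits a data stream into pieces before deduplication."
doc2 = "Before deduplication, chunking splits a data stream into pieces."
if topic_similarity(doc1, doc2) > 0.3:      # illustrative threshold
    print(shared_chunks(doc1, doc2))
```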

Journal ArticleDOI
TL;DR: Wang et al., as discussed by the authors, proposed an adaptive chunking method based on the application locality and file-name locality of written data in SSD-based server storage, which can reduce the overhead of chunking and hash key generation and prevent duplicated data from being written.
Abstract: NAND flash-based SSDs (Solid State Drives) offer fast input/output performance and low power consumption, so they are widely used as storage in tablets, desktop PCs, smartphones, and servers. However, SSDs suffer from wear caused by an increasing number of writes. To improve SSD lifespan, a variety of data deduplication techniques have been introduced. The general fixed-size splitting method allocates fixed-size chunks without considering data locality, so it may perform unnecessary chunking and hash key generation, while the variable-size splitting method incurs excessive computation because it compares data byte-by-byte for deduplication. This paper proposes an adaptive chunking method based on the application locality and file-name locality of written data in SSD-based server storage. The proposed method adaptively splits data into 4KB or 64KB chunks according to the application locality and file-name locality of duplicated data, thereby reducing the overhead of chunking and hash key generation and preventing duplicated data from being written. The experimental results show that the proposed method improves write performance and reduces power consumption and operation time compared with the existing variable-size splitting method and a fixed-size splitting method using 4KB chunks.
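A simplified sketch of the adaptive write path described above. The paper bases the 4KB/64KB choice on measured application locality and file-name locality; in this sketch a hypothetical file-extension table stands in for that decision, and an in-memory hash index stands in for the SSD-side deduplication store.

```python
import hashlib

SMALL, LARGE = 4 * 1024, 64 * 1024
LARGE_CHUNK_EXTENSIONS = {".iso", ".vmdk", ".mp4"}   # hypothetical examples

def pick_chunk_size(filename):
    """Choose 64KB chunks for file types that tend to be written in large
    duplicated regions, and 4KB chunks otherwise (illustrative policy only)."""
    return LARGE if any(filename.endswith(e) for e in LARGE_CHUNK_EXTENSIONS) else SMALL

def dedup_write(filename, data, store):
    """Chunk with the adaptively chosen size and skip chunks already stored,
    so duplicated data never reaches the device again."""
    size, written = pick_chunk_size(filename), 0
    for off in range(0, len(data), size):
        chunk = data[off:off + size]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:
            store[key] = chunk
            written += len(chunk)
    return written

store = {}
print(dedup_write("disk.vmdk", b"\0" * 256 * 1024, store))   # 65536: one unique 64KB chunk
print(dedup_write("notes.txt", b"\0" * 8 * 1024, store))     # 4096: one unique 4KB chunk
```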

Proceedings Article
01 Jan 2013
TL;DR: It is found that the acoustic properties of the syllables had a larger impact on the non-learners' decisions, since they could not operate on linguistic knowledge of German, and Chinese and Mexican non-learners show a preference to mark an accent when the syllable is followed by a word boundary.
Abstract: This study concerns the perception of boundaries and accented syllables by native German subjects as compared to foreign non-speakers and learners of the language at different proficiency levels. To this effect six-syllable sequences excised from a context of three poly-syllabic words of German were presented to participants who had to select the syllables they perceived as accented, as well as the locations of word boundaries. Results show that German native subjects perform well at the word boundary task, but mark correctly less than two thirds of accented syllables. Chinese and Mexican nonlearners still detect a considerable number of word boundaries and accented syllables. Learners of German show improvement at the task with growing experience though they often pick legal subword units that do not necessarily form a plausible sequence. Correlation analysis of factors for syllable and boundary selection performed for non-learners and German subjects – as expected – shows considerably different behaviours. Whereas the boundary location does not influence the Germans’ decision on the accent location, Chinese and Mexican non-learners show a preference to mark an accent when the syllable is followed by a word boundary. We also found that the acoustic properties of the syllables had a larger impact on the non-learners’ decisions since they could not operate on linguistic knowledge of German.


Patent
12 Apr 2013
TL;DR: In this patent, the authors propose a method for the chunking of data and the delivery of high-bandwidth chunks to a requesting user at times that are more convenient for the network.
Abstract: The present invention provides for the chunking of data and the delivery of high-bandwidth chunks to a requesting user at times that are more convenient for the network.
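A minimal, hypothetical sketch of the scheduling idea: high-bandwidth chunks are transferred immediately only when the current time falls inside a configured off-peak window, and are queued otherwise. The window bounds and the queue-based deferral are illustrative assumptions; the patent abstract does not disclose a specific policy.

```python
from datetime import datetime, time

OFF_PEAK = (time(1, 0), time(6, 0))   # hypothetical 01:00-06:00 low-load window

def deliverable_now(now=None, window=OFF_PEAK):
    """Return True if a high-bandwidth chunk may be sent now, i.e. the current
    time falls inside the network's configured off-peak window."""
    now = (now or datetime.now()).time()
    start, end = window
    return start <= now < end

def schedule_chunks(chunks, send, queue, now=None):
    """Send chunks immediately during the off-peak window; otherwise hold them
    in a queue for the next convenient delivery opportunity."""
    for chunk in chunks:
        (send if deliverable_now(now) else queue.append)(chunk)

pending = []
schedule_chunks([b"chunk-1", b"chunk-2"], send=print,
                queue=pending, now=datetime(2013, 1, 1, 3, 0))
print(len(pending))   # 0: both chunks were sent inside the off-peak window
```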