
Showing papers on "Chunking (computing)" published in 2019


Proceedings ArticleDOI
12 Sep 2019
TL;DR: The effectiveness of multi-dataset multi-task learning is studied for training neural models on four sequence tagging tasks for Twitter data, namely part-of-speech tagging, chunking, super sense tagging, and named entity recognition.
Abstract: Multi-task learning is effective in reducing the data required for learning a task while ensuring accuracy competitive with single-task learning. We study the effectiveness of multi-dataset multi-task learning in training neural models for four sequence tagging tasks for Twitter data, namely part-of-speech (POS) tagging, chunking, super sense tagging, and named entity recognition (NER). We utilize publicly available tagged datasets: 7 for POS, 10 for NER, 1 for chunking, and 2 for super sense tagging. We use a multi-dataset multi-task neural model based on pre-trained contextual text embeddings and compare it against single-dataset single-task and multi-dataset single-task models. Even within a task, the tagging schemes may differ across datasets; the model learns from this tagging diversity across all datasets for a task. The models are more effective than single data/task models, leading to significant improvements for POS (1-2% acc., 7 datasets), NER (1-10% F1, 9 datasets), and chunking (4%). For super sense tagging there is a 2% improvement in F1 on out-of-domain data. Our models and tools can be found at https://socialmediaie.github.io/
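As a concrete illustration of the tagging-scheme diversity the model has to absorb, the sketch below converts BIO chunk tags into the BIOES scheme. This is a generic illustration, not code from the paper; the tags are invented examples.

```python
def bio_to_bioes(tags):
    """Convert BIO chunk tags to BIOES; a generic illustration of scheme diversity."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
        elif tag.startswith("B-"):
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        else:                                   # an I- tag
            out.append(tag if nxt == "I-" + tag[2:] else "E-" + tag[2:])
    return out

print(bio_to_bioes(["B-NP", "I-NP", "O", "B-VP"]))   # ['B-NP', 'E-NP', 'O', 'S-VP']
```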

19 citations


Journal ArticleDOI
TL;DR: This article shows that in German WhatsApp dialogues, users apply both a chronological and a reversed ordering of second pair parts (SPPs) in order to foreground particular topics in extended, chat-like dialogues.
Abstract: In computer-mediated communication, users cannot ensure that responsive postings are placed in a directly adjacent position. Yet, paired actions are discernible in which a first pair part (FPP) mak...

16 citations


Proceedings ArticleDOI
22 May 2019
TL;DR: SS-CDC, a two-stage parallel CDC, is proposed; it enables (almost) full parallelism in chunking a file without compromising the deduplication ratio and exploits the instruction-level SIMD parallelism available in today's processors.
Abstract: Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its capability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts on accelerating it by segmenting a file and running the algorithm independently on each segment in parallel come at a cost of substantial degradation of deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC, that enables (almost) full parallelism on chunking of a file without compromising deduplication ratio. Further, SS-CDC exploits instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods which only achieve up to a 7.7X speedup on an 8-core processor with the deduplication ratio degraded by up to 40%, SS-CDC can achieve up to a 25.6X speedup with no loss of deduplication ratio.
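A minimal sketch of the two-stage idea, not the authors' implementation: stage one scans file segments independently (and is therefore parallelisable across cores or SIMD lanes), recording every offset whose fingerprint satisfies the chunking condition, and stage two sequentially enforces the minimum/maximum chunk sizes so the final boundaries match those of a purely sequential CDC pass. The toy fingerprint, thresholds, and segment size below are placeholders.

```python
# Sketch of a two-stage CDC; a faithful version would also overlap segments by the
# hash-window length so cut points near segment borders are not missed.
import os

MIN_CHUNK, MAX_CHUNK, MASK = 2048, 65536, (1 << 13) - 1

def candidate_offsets(data, start, end):
    """Stage 1: boundary condition only; each segment can be processed in parallel."""
    fp, out = 0, []
    for i in range(start, end):
        fp = ((fp << 1) + data[i]) & 0xFFFFFFFF     # toy rolling fingerprint
        if (fp & MASK) == MASK:
            out.append(i + 1)
    return out

def select_boundaries(candidates, length):
    """Stage 2: sequential pass applying the min/max chunk-size constraints."""
    boundaries, last = [], 0
    for off in candidates:
        while off - last > MAX_CHUNK:               # no candidate in range: force a cut
            last += MAX_CHUNK
            boundaries.append(last)
        if off - last >= MIN_CHUNK:
            boundaries.append(off)
            last = off
    if not boundaries or boundaries[-1] != length:
        boundaries.append(length)                   # final chunk ends at end of file
    return boundaries

def ss_cdc_like(data, segment_size=1 << 20):
    candidates = []
    for s in range(0, len(data), segment_size):     # independent segments
        candidates.extend(candidate_offsets(data, s, min(s + segment_size, len(data))))
    return select_boundaries(candidates, len(data))

print(ss_cdc_like(os.urandom(1 << 18))[:5])         # quick smoke test on 256 KiB of noise
```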

15 citations


Journal ArticleDOI
TL;DR: This paper presents the first constant-time chunking algorithm that divides every packet into a predefined number of chunks, irrespective of the packet size, and presents the best implementation practice for packet-level deduplication by selecting an optimal combination of chunking, fingerprinting, and hash table algorithms.

14 citations


Journal ArticleDOI
TL;DR: Two pause thresholds were tested, aimed at chunking the translation task workflow into task segments and at classifying pauses into different kinds; pauses below 200 ms were dubbed delays, while longer ones were treated as pauses.
Abstract: Two pause thresholds were tested, aimed at chunking the translation task workflow into task segments and classifying pauses into different kinds. Pauses below 200 ms were dubbed delays and excluded...

11 citations


Book ChapterDOI
01 Jan 2019
TL;DR: AE significantly improves chunking throughput by using the local extreme value in a variable-sized asymmetric window to overcome the boundary-shift problem of Rabin and TTTD, while achieving nearly the same deduplication ratio (DR).
Abstract: For efficient chunking, we propose a Differential Evolution (DE) based approach, TTTD-P, which optimizes Two Thresholds Two Divisors (TTTD) Content Defined Chunking (CDC) to reduce the number of computing operations by using a single dynamic optimal divisor D with an optimal threshold value, exploiting the multi-operation nature of TTTD. To reduce chunk-size variance, the TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points; however, the additional divisor decreases chunking throughput. Asymmetric Extremum (AE), by contrast, significantly improves chunking throughput by using the local extreme value in a variable-sized asymmetric window to overcome the boundary-shift problem of Rabin and TTTD, while achieving nearly the same deduplication ratio (DR). We therefore propose DE-based TTTD-P optimized chunking to maximize chunking throughput with increased DR, and a scalable bucket indexing approach that reduces the hash-value judgment time for identifying and declaring redundant chunks by about 16 times compared with Rabin CDC, 5 times compared with AE CDC, and 1.6 times compared with FastCDC on the Hadoop Distributed File System (HDFS).
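For reference, below is a compact sketch of the baseline TTTD loop being optimised here, with a toy rolling hash and illustrative parameters rather than the paper's DE-tuned values: the main divisor D declares natural cut points, the backup divisor D′ records fallback positions, and the two thresholds bound the chunk size.

```python
import os

T_MIN, T_MAX = 2048, 16384          # two thresholds bounding chunk size (illustrative)
D, D_BACKUP = 1000, 500             # two divisors; the backup matches more often

def tttd_boundaries(data: bytes):
    boundaries, last, backup, h = [], 0, -1, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF      # stand-in for the real rolling hash
        size = i - last + 1
        if size < T_MIN:
            continue                          # chunk still too small to cut
        if h % D_BACKUP == D_BACKUP - 1:
            backup = i + 1                    # remember a fallback cut point
        if h % D == D - 1:                    # main divisor found a natural cut
            boundaries.append(i + 1)
            last, backup = i + 1, -1
        elif size >= T_MAX:                   # no natural cut: use backup, else force
            cut = backup if backup > 0 else i + 1
            boundaries.append(cut)
            last, backup = cut, -1
    return boundaries

print(len(tttd_boundaries(os.urandom(1 << 18))))   # number of chunks in 256 KiB of noise
```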

10 citations


Journal ArticleDOI
TL;DR: Experimental evidence is provided suggesting that frames also play a role in explaining certain long-distance dependency phenomena, as originally proposed by Deane (1991), and that complex structures can evoke complex frames as well, if sufficiently frequent and semantically coherent, and therefore more easily license deeper subextractions.
Abstract: The idea that conventionalized general knowledge – sometimes referred to as a frame – guides the perception and interpretation of the world around us has long permeated various branches of cognitive science, including psychology, linguistics, and artificial intelligence. In this paper we provide experimental evidence suggesting that frames also play a role in explaining certain long-distance dependency phenomena, as originally proposed by Deane (1991). We focus on a constraint that restricts the extraction of an NP from another NP, called subextraction, which Deane (1991) claims is ultimately a framing effect. In Experiment 1 we provide evidence showing that referents are extractable to the degree that they are deemed important for the proposition expressed by the utterance. This suggests that the world knowledge that the main verb evokes plays a key role in establishing which referents are extractable. In Experiment 2 we offer evidence suggesting that the acceptability of deep subextractions is correlated with the overall plausibility of the proposition, suggesting that complex structures can evoke complex frames as well, if sufficiently frequent and semantically coherent, and therefore more easily license deeper subextractions.

9 citations


Journal ArticleDOI
TL;DR: In this paper, the cut and chip (CC) effect in rubber is studied; the authors propose an approach to understanding this effect and apply it to the development of new products for tires used in off-road or poor road conditions and for other demanding applications of rubber.
Abstract: Understanding the cut and chip (CC) effect in rubber is important for successful product development for tires used in off-road or poor road conditions and for other demanding applications of rubbe...

9 citations


Book ChapterDOI
11 Nov 2019
TL;DR: The Pangeo ecosystem, as described in this paper, is an interactive computing software stack for HPC and public cloud infrastructures; it is benchmarked here with geoscience operations on two different HPC systems.
Abstract: The Pangeo ecosystem is an interactive computing software stack for HPC and public cloud infrastructures. In this paper, we show benchmarking results of the Pangeo platform on two different HPC systems. Four different geoscience operations were considered in this benchmarking study with varying chunk sizes and chunking schemes. Both strong and weak scaling analyses were performed. Chunk sizes between 64 MB and 512 MB were considered, with the best scalability obtained for 512 MB. Compared to certain manual chunking schemes, the auto chunking scheme scaled well.
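In the Pangeo stack, chunk size and chunking scheme are typically controlled through dask. The snippet below is a generic illustration of the two schemes compared in the paper (manual versus auto chunking, with a 512 MB target); the array shape is invented rather than taken from the benchmark's datasets.

```python
import dask
import dask.array as da

# Manual chunking scheme: explicit chunk shape per dimension
# (64 x 1440 x 720 float64 values is roughly 530 MB per chunk).
x_manual = da.random.random((8760, 1440, 720), chunks=(64, 1440, 720))

# Auto chunking scheme: let dask pick chunk shapes close to a target byte size,
# here the 512 MB that scaled best in the paper's runs.
with dask.config.set({"array.chunk-size": "512MiB"}):
    x_auto = da.random.random((8760, 1440, 720), chunks="auto")

print(x_manual.chunksize, x_auto.chunksize)        # inspect the resulting chunk shapes

# A reduction over time, analogous to a simple geoscience aggregation; computing
# only a small slice keeps this sketch cheap to run.
print(x_auto[:64].mean(axis=0).compute().shape)
```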

6 citations


Journal ArticleDOI
TL;DR: An associative memory and recall (AMR) model is proposed that stores associative knowledge from sensor data and organizes human activity knowledge in a manner that is efficient and effective to store and recall.

5 citations


01 Mar 2019
TL;DR: This article explored the effects of utilizing lexical chunks in individualized coaching on university students' practical English skills and found that lexical chunk usage had a positive effect on students' performance.
Abstract: The present case study explored the effects of utilizing lexical chunks in individualized coaching on university students’ practical English...

Journal ArticleDOI
30 Dec 2019
TL;DR: The new grammar rule and pause-break rule from this research achieve better prediction accuracy than the earlier research, with the proportion of correctly predicted sentences increasing by 23% over the earlier rule.
Abstract: Pause breaks are one of the indicators that make speech easy to understand in a text-to-speech system. This research aims to improve the accuracy of pause prediction in Pontianak Malay sentences, building on earlier research that used phrase chunking. The work is part of an effort to preserve the Pontianak Malay language so that it does not become extinct as a local language. The chunking method uses the RegexpParser function in the Natural Language Toolkit to split sentences into phrases based on part-of-speech types. The authors developed a new grammar and pause-break rule, different from the earlier research, to increase the accuracy of pause prediction. The data used are 500 Pontianak Malay sentences recorded by a native speaker of Pontianak Malay to obtain the pause-break analysis. Pauses consist of a short pause (symbolized as "/1") and a long pause (symbolized as "/2"). The tests were a pause-break compatibility test within a sentence and an evaluation using the f-measure, recall, and precision parameters. Based on the tests, the new grammar rule and pause-break rule achieve better prediction accuracy than the earlier research, with the proportion of correctly predicted sentences increasing by 23% over the earlier rule.
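The paper's chunking step relies on NLTK's RegexpParser. The sketch below shows the general usage pattern with an illustrative English noun-phrase rule and hand-tagged tokens, since the actual Pontianak Malay grammar and pause-break rules are not reproduced in this summary.

```python
import nltk

# Illustrative (hypothetical) noun-phrase rule, not the grammar developed in the study.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

# Hand-tagged tokens stand in for the output of a real POS tagger.
tagged = [("the", "DT"), ("long", "JJ"), ("pause", "NN"),
          ("follows", "VBZ"), ("the", "DT"), ("phrase", "NN")]

tree = chunker.parse(tagged)   # returns an nltk.Tree with NP subtrees
print(tree)
```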

Journal ArticleDOI
TL;DR: This article describes the data analyzed in the paper "Implicit sequence learning of chunking and abstract structures" and includes reaction times in the serial reaction time task and generation performance for each confidence rating or attribution under the inclusion and exclusion tests from three experiments.

Proceedings Article
01 Jan 2019
TL;DR: The results support the notion that the segmentation and generalization of linguistic structure occurs in parallel, using similar computations, and that chunked representations of the nonadjacent dependencies are flexible enough to accommodate novel instances.
Abstract: Nonadjacent dependencies are dependencies between linguistic units that occur over one or more variable intervening units (e.g., AXC, where units A and C reliably co-occur, but the identity of X varies). These dependencies are a common feature of many natural languages, and are acquired by both infants and adults using statistical learning. However, despite the wealth of studies examining the acquisition of nonadjacent dependencies, a number of outstanding debates about this form of learning remain. For example, it is unclear whether participants in nonadjacent dependency experiments have learned the relative positions of syllables in these sequences (Endress & Bonatti, 2007), or if they remember specific items from the input (Perruchet, Tyler, Galland & Peereman, 2004). Moreover, substantial debate exists as to whether the segmentation and generalization of structure are two distinct processes that rely on separate computations (Peña, Bonatti, Nespor & Mehler, 2002), or whether they occur in tandem, using the same statistical learning computations (Frost & Monaghan, 2016). Here, we investigate these questions by testing the segmentation and generalization of nonadjacent dependencies in adults. We hypothesized that chunking – which has been shown to account for the statistical learning of adjacent dependencies (Isbilen, McCauley, Kidd & Christiansen, 2017) – may also play a role in the acquisition of nonadjacent dependencies (Isbilen, Frost, Monaghan & Christiansen, 2018). Following the method of Frost and Monaghan (2016), participants were presented with an artificial language composed of three nonadjacent dependencies. Following exposure, participants' ability to segment and generalize these structures was tested using two different tasks: a two-alternative forced-choice task (2AFC), and the statistically-induced chunking recall task (SICR; Isbilen et al., 2017). We predicted that while both tasks would show evidence of learning, SICR may provide clearer insights into the resulting output representations of learning. Our results confirm that participants successfully segmented and generalized nonadjacent structures on both types of task. However, while 2AFC performance on the generalization trials was significantly lower than on the segmentation trials, the results of SICR revealed no difference between the two, suggesting that the difference between segmentation and generalization found in previous studies may in part stem from the task demands of 2AFC (i.e., making familiarity judgements), rather than differences in learning. Taken together, our results support the notion that the segmentation and generalization of linguistic structure occurs in parallel, using similar computations, and that chunked representations of the nonadjacent dependencies are flexible enough to accommodate novel instances. The Seventh Conference of the Scandinavian Association for Language and Cognition, Aarhus University, May 22–24, 2019.

Journal ArticleDOI
TL;DR: It is suggested that the posterior rhythmic activities in the gamma band may underlie the processes that are directly associated with perceptual manipulations of chunking, while the subsequent beta-gamma activation over frontal areas appears to reflect a post-evaluation process such as reinforcement of the selected rules over alternative solutions, which may be an important characteristic of goal-directed chunking.
Abstract: Previous studies have revealed a specific role of the prefrontal-parietal network in rapid goal-directed chunking (RGDC), which dissociates prefrontal activity related to chunking from parietal working memory demands. However, it remains unknown how the prefrontal and parietal cortices collaborate to accomplish RGDC. To this end, a novel experimental design was used that presented Chinese characters in a chunking task, testing eighteen undergraduate students (9 females, mean age = 22.4 years) while recording the electroencephalogram (EEG). In the experiment, radical-level chunking was accomplished in a timely stringent way (RT = 1485 ms, SD = 371 ms), whereas the stroke-level chunking was accomplished less coherently (RT = 3278 ms, SD = 1083 ms). By comparing the differences between radical-level chunking vs. stroke-level chunking, we were able to dissociate the chunking processes in the radical-level chunking condition within the analyzed time window (-200 to 1300 ms). The chunking processes resulted in an early increase of gamma band synchronization over parietal and occipital cortices, followed by enhanced power in the beta-gamma band (25-38 Hz) over frontal areas. We suggest that the posterior rhythmic activities in the gamma band may underlie the processes that are directly associated with perceptual manipulations of chunking, while the subsequent beta-gamma activation over frontal areas appears to reflect a post-evaluation process such as reinforcement of the selected rules over alternative solutions, which may be an important characteristic of goal-directed chunking.

Journal Article
TL;DR: This paper presents the design of an efficient chunking algorithm that achieves high throughput and reduces processing time, and shows that backup storage space and processing speed can be improved using deduplication and variable-size chunking.
Abstract: A large amount of data is generated every day, and storing that data efficiently becomes a heuristic task. Backup storage is the most prominently used medium for storing the data generated every day. A significant amount of the data stored in backup storage is redundant and leads to wasted storage space. Storage space can be saved and the processing speed of backup media improved using deduplication and variable-size chunking. Various chunking algorithms have been presented in the past to improve the deduplication process. This paper presents the design of an efficient chunking algorithm to achieve high throughput and reduce processing time.

Proceedings Article
01 Jan 2019
TL;DR: This work presents a way to calculate n weak rolling hashes at a time using single instruction multiple data (SIMD) instructions available on today’s processors and shows how to calculate chunk boundaries cheaply using other instructions also available on these processors.
Abstract: Deduplication is a special case of data compression where repeated chunks of data are stored only once. The input data is divided into chunks using a chunking algorithm and a cryptographically strong hash is calculated on each chunk and used as its unique identifier for further searching and duplicate elimination. As the input stream is processed, a chunk boundary is declared at a byte address in the input stream if some weak hash of a fixed number of preceding bytes (the “hash window”) satisfies some criterion. Commonly, a rolling hash like Karp-Rabin [6] or some cyclic polynomial [7] is used for the weak hash since these cheaply support moving the hash window forward one byte in the input stream. This work presents a way to calculate n weak rolling hashes at a time using single instruction multiple data (SIMD) instructions available on today’s processors. Furthermore, it shows how to calculate chunk boundaries cheaply using other instructions also available on these processors. Empirical results show that the proposed algorithm is four times as fast as previous algorithms, and that these optimizations save up to 25% of the computation required for deduplication.
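For context, below is a scalar sketch of the Karp-Rabin rolling hash and boundary test that this work vectorises; the base, modulus, window length, and boundary mask are illustrative choices, and the paper's contribution is evaluating n such hashes per SIMD instruction rather than one at a time as here.

```python
import os

BASE, MOD = 257, (1 << 31) - 1   # illustrative base and modulus
WINDOW = 48                      # bytes in the hash window
MASK = (1 << 13) - 1             # boundary criterion: ~8 KiB expected chunk size

def chunk_boundaries(data: bytes):
    h, pow_w = 0, pow(BASE, WINDOW - 1, MOD)
    boundaries = []
    for i, b in enumerate(data):
        if i >= WINDOW:                                   # slide: drop the oldest byte
            h = (h - data[i - WINDOW] * pow_w) % MOD
        h = (h * BASE + b) % MOD
        if i + 1 >= WINDOW and (h & MASK) == MASK:        # weak hash meets the criterion
            boundaries.append(i + 1)                      # declare a chunk boundary
    return boundaries

print(len(chunk_boundaries(os.urandom(1 << 18))))         # boundaries in 256 KiB of noise
```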

Patent
01 Aug 2019
TL;DR: In this paper, a chunking engine and a policy engine are employed to evaluate one or more storage policies relating to, for example, cost, security, and network conditions in view of services and/or requirements of the multiple cloud storage providers.
Abstract: Techniques for chunking data in data storage systems that provide increased data storage security across multiple cloud storage providers. The techniques employ a chunking engine and a policy engine, which evaluates one or more storage policies relating to, for example, cost, security, and/or network conditions in view of services and/or requirements of the multiple cloud storage providers. Having evaluated such storage policies, the policy engine generates and provides operating parameters to the chunking engine, which uses the operating parameters when chunking and/or distributing the data across the multiple cloud storage providers, thereby satisfying the respective storage policies. In this way, users of data storage systems obtain the benefits of cloud storage resources and/or services while reducing their data security concern and optimizing the total cost of data storage.

Patent
09 May 2019
TL;DR: In this article, the authors proposed a lightweight complexity-based packet-level deduplication apparatus, which consists of a chunk dividing unit for performing an N-way chunking operation on a specific packet and dividing the chunk into Nway chunks.
Abstract: The present invention relates to a lightweight complexity-based packet-level deduplication apparatus and a method thereof. The lightweight complexity-based packet-level deduplication apparatus comprises: a chunk dividing unit for performing an N-way chunking operation on a specific packet and dividing the chunk into N-way chunks; a chunk extracting unit for extracting at least one target chunk used for deduplication among the N-way chunks; and a deduplication processing unit for determining duplication of the specific packet based on the at least one target chunk and performing deduplication. Accordingly, network bandwidth can be saved by removing the duplicated portion of a packet at the packet level.
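Below is a toy sketch of the pipeline the claims describe, with invented parameters: the packet is split into N equal chunks, a subset of "target" chunks is fingerprinted, and the packet is treated as a duplicate only if every target fingerprint has been seen before. It illustrates the idea, not the patented method itself.

```python
import hashlib

def n_way_chunks(packet: bytes, n: int = 4):
    """Split a packet into n roughly equal chunks (the last takes any remainder)."""
    size = max(1, len(packet) // n)
    return [packet[i * size:(i + 1) * size] for i in range(n - 1)] + [packet[(n - 1) * size:]]

seen_fingerprints = set()

def is_duplicate(packet: bytes, n: int = 4, targets=(0, 2)):
    """Fingerprint only the target chunks; call the packet a duplicate if all match."""
    chunks = n_way_chunks(packet, n)
    fps = [hashlib.sha1(chunks[t]).digest() for t in targets]
    duplicate = all(fp in seen_fingerprints for fp in fps)
    seen_fingerprints.update(fps)
    return duplicate

print(is_duplicate(b"payload-A" * 100), is_duplicate(b"payload-A" * 100))  # False, then True
```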


Patent
13 Jun 2019
TL;DR: In this paper, the authors present techniques for applying fine-grained client-specific rules to divide (e.g., chunk) data statements to achieve cost reduction and failure rate reduction associated with executing the data statements over a subject dataset.
Abstract: Techniques are presented for applying fine-grained client-specific rules to divide (e.g., chunk) data statements to achieve cost reduction and/or failure rate reduction associated with executing the data statements over a subject dataset. Data statements for the subject dataset are received from a client. Statement attributes derived from the data statements are processed with respect to fine-grained rules and/or other client-specific data to determine whether a data statement chunking scheme is to be applied to the data statements. If a data statement chunking scheme is to be applied, further analysis is performed to select a data statement chunking scheme. A set of data operations are generated based at least in part on the selected data statement chunking scheme. The data operations are issued for execution over the subject dataset. The results from the data operations are consolidated in accordance with the selected data statement chunking scheme and returned to the client.

Journal ArticleDOI
TL;DR: In this paper, the authors address how three accepted and researched motor learning stages, as well as the concept of mentally chunking information, relate to acquiring and accelerating the learning process in tennis.
Abstract: The goal of this article is to address how three accepted and researched motor learning stages, as well as the concept of mentally chunking information, relate to acquiring and accelerating the learning process in tennis. Stages of learning, the role of playing vs. practicing tennis, and the interaction between biomechanics and motor learning are discussed. Specific coaching tips are provided.

30 Jun 2019
TL;DR: Chunking theory from cognitive science provides a basis for analyzing micro-behaviours in human performance in order to build models of individuals’ understanding of domain content that are richer than those available from current methods used for human-machine communication in AI systems.
Abstract: Chunking theory from cognitive science provides a basis for analyzing micro-behaviours in human performance in order to build models of individuals’ understanding of domain content that are richer than those available from current methods used for human-machine communication in AI systems.

Proceedings ArticleDOI
14 Jul 2019
TL;DR: A modified model is presented that can convert the lateral weights of the trained network into a primacy gradient usable by methods such as Competitive Queueing, thus granting the network a method for executing learned sequences.
Abstract: A Self-Organising Temporal Pooling (SOTP) network has been shown to be capable of forming declarative parallel representations of sequential events and chunking these events without supervision. However, such a network currently cannot take these declarative representations and execute the associated sequence; it is strictly a one-way sequence chunker and encoder. We present a modified model that can convert the lateral weights of the trained network into a primacy gradient usable by methods such as Competitive Queueing (CQ), thus granting the network a method for executing learned sequences. The resulting model has several benefits over traditional CQ. We further present an advanced method of executing sequences via SOTP itself, resulting in less error than CQ whilst being more flexible in replaying sequences from datasets with variable sequence lengths.

25 Sep 2019
TL;DR: This paper proposes to let a model learn to chunk in a more flexible way via reinforcement learning: the model can decide the next chunk it wants to process in either direction, and recurrent mechanisms are applied to allow information to be transferred between chunks.
Abstract: In this paper, we focus on the conversational machine reading comprehension (MRC) problem, where the input to a model could be a lengthy document and a series of interconnected questions. To deal with long inputs, previous approaches usually chunk them into equally-spaced segments and predict answers based on each chunk independently without considering the information from other chunks. As a result, they may form chunks that fail to cover complete answers or have insufficient contexts around the correct answer required for question answering. Moreover, they are less capable of answering questions that need cross-chunk information. We propose to let a model learn to chunk in a more flexible way via reinforcement learning: a model can decide the next chunk that it wants to process in either direction. We also apply recurrent mechanisms to allow information to be transferred between chunks. Experiments on two conversational MRC tasks – CoQA and QuAC – demonstrate the effectiveness of our recurrent chunking mechanisms: we can obtain chunks that are more likely to contain complete answers and at the same time provide sufficient contexts around the ground truth answers for better predictions.

Patent
18 Jun 2019
TL;DR: In this paper, the authors describe a system that identifies a length of a sliding window that a data chunking routine applies to a data buffer to create data chunks, and adjusts the expected chunk boundary based on the length of the sliding window.
Abstract: Targeted chunking of data is described. A system identifies a length of a sliding window that a data chunking routine applies to a data buffer to create data chunks. The system identifies an expected chunk boundary in the data buffer. The system adjusts the expected chunk boundary, based on the length of the sliding window. The system enables the data chunking routine to start applying the sliding window at the adjusted expected chunk boundary in the data buffer instead of starting application of the sliding window at a beginning of the data buffer.


Proceedings ArticleDOI
12 Dec 2019
TL;DR: It is argued that the basic nature of phrase-labelling is spatial rather than temporal, and the hypothesis is proposed and tested that a CNN-based model that directly extracts labelled n-grams from the input sentence would outperform a standard RNN-based model.
Abstract: Modern approaches address the task of phrase-labelling within an input sentence (e.g., NER, chunking, etc.) as a variant of the word-tagging problem. These approaches extract the desired phrases as word sequences that are mapped to specific tag sequences (with Begin, Intermediate and End tag types). However, we argue that the basic nature of phrase-labelling is not temporal but spatial. Thus we propose and test the hypothesis that a CNN-based model that directly extracts labelled n-grams from the input sentence would outperform a standard RNN-based model.


01 Nov 2019
TL;DR: In this paper, a chunking approach based on coroutines is presented to mitigate the potential penalty to batch performance during migrations to microservices, since chunking is otherwise difficult to integrate into existing batch jobs, which are traditionally executed sequentially.
Abstract: When migrating enterprise software towards microservices, batch jobs are particularly sensitive to communication overhead introduced by the distributed nature of microservices. As it is not uncommon for a single batch job to process millions of data items, even an additional millisecond of overhead per item may lead to a significant increase in runtime. A common strategy for reducing the average overhead per item is called chunking, which means that individual requests for different data items are grouped into larger requests. However, chunking is difficult to integrate into existing batch jobs, which are traditionally executed sequentially. In this paper, we present a chunking approach based on coroutines, and investigate whether it can be used to mitigate the potential penalty to batch performance during migrations to microservices.