
Showing papers on "String (computer science) published in 2016"


Proceedings ArticleDOI
01 Nov 2016
TL;DR: This work investigates whether a neural, encoder-decoder translation system learns syntactic information on the source side as a by-product of training and proposes two methods to detect whether the encoder has learned local and global source syntax.
Abstract: We investigate whether a neural, encoder-decoder translation system learns syntactic information on the source side as a by-product of training. We propose two methods to detect whether the encoder has learned local and global source syntax. A fine-grained analysis of the syntactic structure learned by the encoder reveals which kinds of syntax are learned and which are missing.

352 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: The authors built a multi-source machine translation model and trained it to maximize the probability of a target English string given French and German sources using the neural encoder-decoder framework.
Abstract: We build a multi-source machine translation model and train it to maximize the probability of a target English string given French and German sources. Using the neural encoder-decoder framework, we explore several combination methods and report up to +4.8 BLEU increases on top of a very strong attention-based neural translation model.

289 citations


Proceedings Article
04 Nov 2016
TL;DR: Neuro-Symbolic Program Synthesis (NSPS) as discussed by the authors is based on two neural modules: the cross correlation I/O network and the Recursive-Reverse-Recursive Neural Network (R3NN).
Abstract: Recent years have seen the proposal of a number of neural architectures for the problem of Program Induction. Given a set of input-output examples, these architectures are able to learn mappings that generalize to new test inputs. While achieving impressive results, these approaches have a number of important limitations: (a) they are computationally expensive and hard to train, (b) a model has to be trained for each task (program) separately, and (c) it is hard to interpret or verify the correctness of the learnt mapping (as it is defined by a neural network). In this paper, we propose a novel technique, Neuro-Symbolic Program Synthesis, to overcome the above-mentioned problems. Once trained, our approach can automatically construct computer programs in a domain-specific language that are consistent with a set of input-output examples provided at test time. Our method is based on two novel neural modules. The first module, called the cross correlation I/O network, given a set of input-output examples, produces a continuous representation of the set of I/O examples. The second module, the Recursive-Reverse-Recursive Neural Network (R3NN), given the continuous representation of the examples, synthesizes a program by incrementally expanding partial programs. We demonstrate the effectiveness of our approach by applying it to the rich and complex domain of regular expression based string transformations. Experiments show that the R3NN model is not only able to construct programs from new input-output examples, but it is also able to construct new programs for tasks that it had never observed before during training.

248 citations


Journal ArticleDOI
TL;DR: This paper presents a distributed finite-time adaptive integral-sliding-mode (ISM) control approach for a platoon of vehicles consisting of a leader and multiple followers subjected to bounded unknown disturbances to overcome string instability caused by nonzero initial spacing errors.
Abstract: This paper presents a distributed finite-time adaptive integral-sliding-mode (ISM) control approach for a platoon of vehicles consisting of a leader and multiple followers subjected to bounded unknown disturbances. In order to avoid collisions among the vehicles, control protocols have to be designed to ensure string stability of the whole vehicle platoon. First, the constant time headway (CTH) policy known to improve string stability is applied to the case of zero initial spacing errors. Contrary to requiring zero initial spacing and zero initial velocity errors simultaneously in existing methods based on constant spacing (CS) policy, initial velocity errors here are not required to be zero. Then, since string stability condition can fail at the initial conditions, a modified CTH policy is constructed to overcome string instability caused by nonzero initial spacing errors. Moreover, the proposed adaptive ISM control schemes can be implemented without the requirement that the bounds of the disturbances be known in advance. In addition, one effective method is proposed to reduce the chattering phenomenon caused by the indicator function. Finally, simulation results are included to demonstrate its effectiveness and advantages over existing methods.

159 citations


Book ChapterDOI
08 May 2016
TL;DR: A new definition of computationally binding commitment schemes in the quantum setting, which is called "collapse-binding", applies to string commitments, composes in parallel, and works well with rewinding-based proofs.
Abstract: We present a new definition of computationally binding commitment schemes in the quantum setting, which we call "collapse-binding". The definition applies to string commitments, composes in parallel, and works well with rewinding-based proofs. We give simple constructions of collapse-binding commitments in the random oracle model, giving evidence that they can be realized from hash functions like SHA-3. We evidence the usefulness of our definition by constructing three-round statistical zero-knowledge quantum arguments of knowledge for all NP languages.

107 citations


Book ChapterDOI
04 Dec 2016
TL;DR: In this paper, the security of NIZKs in the presence of a maliciously chosen common reference string is studied, with definitions given for subversion soundness, subversion witness indistinguishability, and subversion zero knowledge; both negative and positive results are provided, showing that certain combinations of goals are unachievable while giving protocols that achieve other combinations.
Abstract: Motivated by the subversion of "trusted" public parameters in mass-surveillance activities, this paper studies the security of NIZKs in the presence of a maliciously chosen common reference string. We provide definitions for subversion soundness, subversion witness indistinguishability and subversion zero knowledge. We then provide both negative and positive results, showing that certain combinations of goals are unachievable but giving protocols to achieve other combinations.

96 citations


Journal ArticleDOI
TL;DR: This study provides a comparison of 13 different ligand similarity functions, each of which utilizes the SMILES string representation of a molecule, and proposes cosine similarity based SMILES kernels that make use of the Term Frequency and Term Frequency-Inverse Document Frequency weighting approaches.
Abstract: Molecular structures can be represented as strings of special characters using SMILES. Since each molecule is represented as a string, the similarity between compounds can be computed using SMILES-based string similarity functions. Most previous studies on drug-target interaction prediction use 2D-based compound similarity kernels such as SIMCOMP. To the best of our knowledge, using SMILES-based similarity functions, which are computationally more efficient than the 2D-based kernels, has not been investigated for this task before. In this study, we adapt and evaluate various SMILES-based similarity methods for drug-target interaction prediction. In addition, inspired by the vector space model of Information Retrieval we propose cosine similarity based SMILES kernels that make use of the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting approaches. We also investigate generating composite kernels by combining our best SMILES-based similarity functions with the SIMCOMP kernel. With this study, we provided a comparison of 13 different ligand similarity functions, each of which utilizes the SMILES string of molecule representation. Additionally, TF and TF-IDF based cosine similarity kernels are proposed. The more efficient SMILES-based similarity functions performed similarly to the more complex 2D-based SIMCOMP kernel in terms of AUC-ROC scores. The TF-IDF based cosine similarity obtained a better AUC-PR score than the SIMCOMP kernel on the GPCR benchmark data set. The composite kernel of TF-IDF based cosine similarity and SIMCOMP achieved the best AUC-PR scores for all data sets.

95 citations
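The TF-IDF cosine similarity between SMILES strings described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: treating character 3-grams as the "terms" and using a plain logarithmic inverse-document-frequency weight are assumptions made for the example.

```python
from collections import Counter
from math import log, sqrt

def char_ngrams(smiles, n=3):
    """Character n-gram counts of a SMILES string (n=3 is an assumption)."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))

def tfidf_cosine_matrix(smiles_list, n=3):
    """Pairwise TF-IDF-weighted cosine similarities between compounds,
    treating each SMILES string as a 'document' of character n-grams."""
    docs = [char_ngrams(s, n) for s in smiles_list]
    N = len(docs)
    df = Counter()                      # document frequency of each n-gram
    for d in docs:
        df.update(d.keys())
    idf = {g: log(N / df[g]) for g in df}
    vecs = [{g: tf * idf[g] for g, tf in d.items()} for d in docs]

    def cos(u, v):
        dot = sum(w * v[g] for g, w in u.items() if g in v)
        nu = sqrt(sum(w * w for w in u.values()))
        nv = sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return [[cos(u, v) for v in vecs] for u in vecs]

# Toy compound set: ethanol, ethylamine, benzene (SMILES notation).
sims = tfidf_cosine_matrix(["CCO", "CCN", "c1ccccc1"])
```

The same vectors could be fed to any kernel-based predictor; the point of the abstract is that this string-level similarity is far cheaper to compute than 2D graph-matching kernels such as SIMCOMP.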


Journal ArticleDOI
TL;DR: In this article, a physically motivated Lyapunov function is employed to design a boundary control law that ensures vibration suppression and guarantees the stability of the closed-loop system with input backlash.
Abstract: In this study, the authors are concerned with the active vibration control of a flexible string system with input backlash. For vibration suppression, active control is applied at the right boundary of the flexible string. To deal with the input backlash, a novel ‘disturbance-like’ term is proposed in the control design. A physically motivated Lyapunov function is employed to design boundary control law to ensure the vibration suppression and guarantee the stability of the closed-loop system. Numerical simulations illustrate the effectiveness of the proposed control method.

88 citations


Journal ArticleDOI
TL;DR: This paper describes a general hybrid metaheuristic for combinatorial optimization labelled Construct, Merge, Solve & Adapt, a specific instantiation of a framework known from the literature as Generate-And-Solve.

82 citations


Journal ArticleDOI
TL;DR: In this paper, a simple linear-time algorithm is presented for constructing a context-free grammar of size $4g \log_{3/2}(N/g)$ for the input string, where $N$ is the size of the input text and $g$ is the length of the optimal grammar generating this text.

74 citations


Journal ArticleDOI
TL;DR: A novel method for clustering words in micro-blogs, based on the similarity of the related temporal series, using the Symbolic Aggregate ApproXimation algorithm to discretize the temporal series of terms into a small set of levels, leading to a string for each.
Abstract: In this paper we present a novel method for clustering words in micro-blogs, based on the similarity of the related temporal series. Our technique, named SAX*, uses the Symbolic Aggregate ApproXimation algorithm to discretize the temporal series of terms into a small set of levels, leading to a string for each term. We then define a subset of "interesting" strings, i.e. those representing patterns of collective attention. Sliding temporal windows are used to detect co-occurring clusters of tokens with the same or similar string. To assess the performance of the method we first tune the model parameters on a 2-month 1 % Twitter stream, during which a number of world-wide events of differing type and duration (sports, politics, disasters, health, and celebrities) occurred. Then, we evaluate the quality of all discovered events in a 1-year stream, "googling" with the most frequent cluster n-grams and manually assessing how many clusters correspond to published news in the same temporal slot. Finally, we perform a complexity evaluation and we compare SAX* with three alternative methods for event discovery. Our evaluation shows that SAX* is at least one order of magnitude less complex than other temporal and non-temporal approaches to micro-blog clustering.
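The SAX discretization step that turns each term's temporal series into a string can be sketched as follows. This is a simplified illustration of the standard algorithm (z-normalization, piecewise aggregate approximation, then symbol assignment via Gaussian breakpoints); the 4-symbol alphabet and the toy series are illustrative choices, not the paper's settings.

```python
import statistics

# Breakpoints splitting a standard Gaussian into 4 equiprobable regions
# (the usual SAX convention for a 4-symbol alphabet).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax_word(series, word_len):
    """Discretize a numeric series into a short SAX string:
    z-normalize, average over word_len segments (PAA), then map each
    segment mean to a symbol via the Gaussian breakpoints."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0   # guard against flat series
    z = [(x - mu) / sigma for x in series]
    seg = len(z) / word_len
    paa = [statistics.fmean(z[int(i * seg):int((i + 1) * seg)])
           for i in range(word_len)]

    def symbol(v):
        for bp, ch in zip(BREAKPOINTS, ALPHABET):
            if v < bp:
                return ch
        return ALPHABET[-1]

    return "".join(symbol(v) for v in paa)

# A term that is quiet, then suddenly bursts: low then high activity.
word = sax_word([0, 0, 0, 0, 10, 10, 10, 10], word_len=2)
```

Terms whose series map to the same (or similar) short strings within a sliding window are then candidates for the same cluster of collective attention.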

Proceedings ArticleDOI
01 Nov 2016
TL;DR: An efficient technique for segmented oracles is presented that computes information leakage for multiple runs using only the path constraints generated from a single-run symbolic execution.
Abstract: We present an automated approach for detecting and quantifying side channels in Java programs, which uses symbolic execution, string analysis and model counting to compute information leakage for a single run of a program. We further extend this approach to compute information leakage for multiple runs for a type of side channels called segmented oracles, where the attacker is able to explore each segment of a secret (for example each character of a password) independently. We present an efficient technique for segmented oracles that computes information leakage for multiple runs using only the path constraints generated from a single run symbolic execution. Our implementation uses the symbolic execution tool Symbolic PathFinder (SPF), SMT solver Z3, and two model counting constraint solvers LattE and ABC. Although LattE has been used before for analyzing numeric constraints, in this paper, we present an approach for using LattE for analyzing string constraints. We also extend the string constraint solver ABC for analysis of both numeric and string constraints, and we integrate ABC in SPF, enabling quantitative symbolic string analysis.

Proceedings ArticleDOI
11 Jan 2016
TL;DR: The main contribution is to show that the "straight-line fragment" of the logic is decidable; this fragment can express the program logics of straight-line string-manipulating programs with concatenations and transductions as atomic operations, which arise when performing bounded model checking or dynamic symbolic executions.
Abstract: We study the fundamental issue of decidability of satisfiability over string logics with concatenations and finite-state transducers as atomic operations. Although restricting to one type of operations yields decidability, little is known about the decidability of their combined theory, which is especially relevant when analysing security vulnerabilities of dynamic web pages in a more realistic browser model. On the one hand, word equations (string logic with concatenations) cannot precisely capture sanitisation functions (e.g. htmlescape) and implicit browser transductions (e.g. innerHTML mutations). On the other hand, transducers suffer from the reverse problem of being able to model sanitisation functions and browser transductions, but not string concatenations. Naively combining word equations and transducers easily leads to an undecidable logic. Our main contribution is to show that the "straight-line fragment" of the logic is decidable (complexity ranges from PSPACE to EXPSPACE). The fragment can express the program logics of straight-line string-manipulating programs with concatenations and transductions as atomic operations, which arise when performing bounded model checking or dynamic symbolic executions. We demonstrate that the logic can naturally express constraints required for analysing mutation XSS in web applications. Finally, the logic remains decidable in the presence of length, letter-counting, regular, indexOf, and disequality constraints.

Book ChapterDOI
17 Jul 2016
TL;DR: A progressive search algorithm to not only mitigate the problem of non-terminating reasoning but also guide the search towards a “minimal solution” when the input formula is in fact satisfiable.
Abstract: We consider the problem of reasoning over an expressive constraint language for unbounded strings. The difficulty comes from “recursively defined” functions such as replace, making state-of-the-art algorithms non-terminating. Our first contribution is a progressive search algorithm to not only mitigate the problem of non-terminating reasoning but also guide the search towards a “minimal solution” when the input formula is in fact satisfiable. We have implemented our method using the state-of-the-art Z3 framework. Importantly, we have enabled conflict clause learning for string theory so that our solver can be used effectively in the setting of program verification. Finally, our experimental evaluation shows leadership in a large benchmark suite, and a first deployment for another benchmark suite which requires reasoning about string formulas of a class that has not been solved before.

Journal ArticleDOI
TL;DR: A new shift rule leads to a de Bruijn sequence construction that can be generated in $O(1)$-amortized time per bit.

Journal ArticleDOI
TL;DR: A generalization of the diminishing-return property is introduced by defining the elemental forward curvature, along with the notion of a string-matroid; two applications of string submodular functions with curvature constraints are investigated: choosing a string of actions to maximize the expected fraction of accomplished tasks, and designing a string of measurement matrices such that the information gain is maximized.
Abstract: Consider the problem of choosing a string of actions to optimize an objective function that is string submodular. It was shown in previous papers that the greedy strategy, consisting of a string of actions that only locally maximizes the step-wise gain in the objective function, achieves at least a $(1-e^{-1})$ -approximation to the optimal strategy. This paper improves this approximation by introducing additional constraints on curvature, namely, total backward curvature, total forward curvature, and elemental forward curvature. We show that if the objective function has total backward curvature $\sigma$ , then the greedy strategy achieves at least a $(1/\sigma)(1-e^{-\sigma})$ -approximation of the optimal strategy. If the objective function has total forward curvature $\epsilon$ , then the greedy strategy achieves at least a $(1-\epsilon)$ -approximation of the optimal strategy. Moreover, we consider a generalization of the diminishing-return property by defining the elemental forward curvature. We also introduce the notion of string-matroid and consider the problem of maximizing the objective function subject to a string-matroid constraint. We investigate two applications of string submodular functions with curvature constraints: 1) choosing a string of actions to maximize the expected fraction of accomplished tasks; and 2) designing a string of measurement matrices such that the information gain is maximized.
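The greedy strategy analyzed above, which repeatedly appends the action with the largest step-wise gain in the objective, can be sketched as follows. The task-coverage objective is a toy stand-in for the string submodular functions studied in the paper, chosen because repeating an action yields zero marginal gain.

```python
def greedy_string(actions, f, k):
    """Greedy strategy: repeatedly append the action with the largest
    marginal gain f(s + [a]) - f(s), up to string length k."""
    s = []
    for _ in range(k):
        s.append(max(actions, key=lambda a: f(s + [a]) - f(s)))
    return s

# Toy objective (illustrative): the value of a string of actions is the
# number of distinct tasks covered by the actions taken so far.
cover = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
f = lambda s: len(set().union(*(cover[x] for x in s))) if s else 0

plan = greedy_string(["a", "b", "c"], f, 2)
```

For such monotone string submodular objectives, the abstract's guarantee says the value of `plan` is within a curvature-dependent factor, at worst $(1-e^{-1})$, of the best length-2 string of actions.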

Patent
Jang Min Sik1
04 Aug 2016
TL;DR: In this paper, a semiconductor device consisting of source select lines, word lines, drain select lines and bit lines, which are stacked on a substrate where a first string area and a second string area are defined, is described.
Abstract: The present invention relates to a semiconductor device and a manufacturing method thereof, wherein the semiconductor device comprises: source select lines, word lines, drain select lines, and bit lines, which are stacked on a substrate where a first string area and a second string area are defined; channel films and memory films which vertically penetrate the source select lines, the word lines, and the drain select lines in the first string area and the second string area; and a common source line which vertically penetrates the source select lines, the word lines, and the drain select lines at the centers of the first string area and the second string area, and which extends to a lower part of the source select lines Thereby, the capacity of a memory device can be enhanced and electric properties can be improved

Journal ArticleDOI
01 Jun 2016
TL;DR: A set of algebraic techniques for solving constraints over a rich theory of unbounded strings natively, without reduction to other problems, is presented and implemented in the SMT solver cvc4, making it the first solver able to accept a rich set of mixed constraints over strings, integers, reals, arrays and algebraic datatypes.
Abstract: An increasing number of applications in verification and security rely on or could benefit from automatic solvers that can check the satisfiability of constraints over a diverse set of data types that includes character strings. Until recently, satisfiability solvers for strings were standalone tools that could reason only about fairly restricted fragments of the theory of strings and regular expressions (e.g., strings of bounded lengths). These solvers were based on reductions to satisfiability problems over other data types such as bit vectors or to automata decision problems. We present a set of algebraic techniques for solving constraints over a rich theory of unbounded strings natively, without reduction to other problems. These techniques can be used to integrate string reasoning into general, multi-theory SMT solvers based on the common DPLL(T) architecture. We have implemented them in our SMT solver cvc4, expanding its already large set of built-in theories to include a theory of strings with concatenation, length, and membership in regular languages. This implementation makes cvc4 the first solver able to accept a rich set of mixed constraints over strings, integers, reals, arrays and algebraic datatypes. Our initial experimental results show that, in addition, on pure string problems cvc4 is highly competitive with specialized string solvers accepting a comparable input language.

Journal ArticleDOI
TL;DR: A full view of the string kernels approach is given and insights into two kinds of language transfer effects, namely, word choice (lexical transfer) and morphological differences are offered.
Abstract: The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. Recently, an approach that uses only character p-grams as features has been proposed for the task of native language identification (NLI). The approach obtained state-of-the-art results by combining several string kernels using multiple kernel learning. Despite the fact that the approach based on string kernels performs so well, several questions about this method remain unanswered. First, it is not clear why such a simple approach can compete with far more complex approaches that take words, lemmas, syntactic information, or even semantics into account. Second, although the approach is designed to be language independent, all experiments to date have been on English. This work is an extensive study that aims to systematically present the string kernel approach and to clarify the open questions mentioned above. A broad set of native language identification experiments were conducted to compare the string kernels approach with other state-of-the-art methods. The empirical results obtained in all of the experiments conducted in this work indicate that the proposed approach achieves state-of-the-art performance in NLI, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the results obtained on both the Arabic and the Norwegian corpora demonstrate that the proposed approach is language independent. In the Arabic native language identification task, string kernels show an increase of more than 17% over the best accuracy reported so far. The results of string kernels on Norwegian native language identification are also significantly better than the state-of-the-art approach. In addition, in a cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving the state-of-the-art system by 32.3%.
To gain additional insights about the string kernels approach, the features selected by the classifier as being more discriminating are analyzed in this work. The analysis also offers information about localized language transfer effects, since the features used by the proposed model are p-grams of various lengths. The features captured by the model typically include stems, function words, and word prefixes and suffixes, which have the potential to generalize over purely word-based features. By analyzing the discriminating features, this article offers insights into two kinds of language transfer effects, namely, word choice (lexical transfer) and morphological differences. The goal of the current study is to give a full view of the string kernels approach and shed some light on why this approach works so well.
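The core similarity measure behind such character p-gram string kernels can be sketched as a p-spectrum kernel: the inner product of p-gram count vectors. The specific p values and the raw-count weighting below are illustrative assumptions; the paper combines several such kernels (and normalized variants) via multiple kernel learning.

```python
from collections import Counter

def p_grams(text, p):
    """Counts of the character p-grams occurring in a text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def spectrum_kernel(a, b, p_values=(2, 3, 4)):
    """Sum over p of the inner product of character p-gram count
    vectors of the two texts (an unnormalized spectrum kernel)."""
    total = 0
    for p in p_values:
        ca, cb = p_grams(a, p), p_grams(b, p)
        total += sum(ca[g] * cb[g] for g in ca if g in cb)
    return total

k = spectrum_kernel("this is an example", "this was an example")
```

Because the features are raw character sequences, the kernel needs no tokenizer, tagger, or parser, which is what makes the approach language independent.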

Journal ArticleDOI
TL;DR: This paper revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques, and introduces some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes.

Journal ArticleDOI
01 Sep 2016
TL;DR: This work provides a method to decide on the lowest possible order of the Padé approximation that is sufficiently accurate in view of CACC (string) stability analysis, and compares the minimum string-stable time gaps for a CACC system with both exact and approximated delays.
Abstract: Cooperative adaptive cruise control (CACC) improves road throughput by employing intervehicle wireless communications. The inherent communication time delay and vehicle actuator delay significantly limit the minimum intervehicle distance in view of string stability requirements. Hence, controller design needs to consider both delays, which result in a nonrational transfer function representation of the CACC-controlled string. Padé approximations can be applied to arrive at a finite-dimensional model, which allows for many standard control methods. Our objective is to provide a method to decide on the lowest possible order of the Padé approximation that is sufficiently accurate in view of CACC (string) stability analysis. The constant time gap strategy and a one-vehicle look-ahead topology are adopted to develop a CACC stable string. First, based on the stable controller parameter region, a suitable order of the Padé approximation of the vehicle actuator delay can be selected in view of individual vehicle stability. Then, the minimum string-stable time gaps for a CACC system with both exact and approximated delays have been compared. The procedure with a proportional-derivative controller to choose the approximation order of delays has been given, followed by time-domain simulation validation.
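The building block whose order the paper's method selects, the Padé approximation of a pure delay, can be illustrated at first order. The delay and frequency values below are arbitrary choices for the example, not the paper's parameters.

```python
import cmath

def pade1_delay(s, theta):
    """First-order Pade approximation of the pure delay e^{-s*theta}:
    (1 - s*theta/2) / (1 + s*theta/2). Like the exact delay, it is
    all-pass (unit magnitude); its phase matches at low frequencies."""
    return (1 - s * theta / 2) / (1 + s * theta / 2)

# Compare against the exact delay at omega = 0.5 rad/s for a 0.1 s
# delay; the approximation is accurate while omega * theta << 1.
theta = 0.1
s = 1j * 0.5
exact = cmath.exp(-s * theta)
approx = pade1_delay(s, theta)
err = abs(exact - approx)
```

The design question the paper answers is how large the order must be so that stability conclusions drawn from the rational approximation carry over to the true, infinite-dimensional delay system.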

Patent
06 Jul 2016
TL;DR: In this article, a desktop integration framework is proposed to optimize retrieval of custom string resources from resource bundles hosted by server computer systems by using a document as a user interface to a web-server application hosted by a server computer system.
Abstract: In various embodiments, methods, systems, and non-transitory computer-readable media are disclosed that allow a desktop integration framework to optimize retrieval of custom string resources from resource bundles hosted by server computer systems. A client computer that uses a document as a user interface to a web-server application hosted by a server-computer system can determine which custom string resources are to be utilized in the document. The client computer system can request only the custom string resources that are determined to be utilized in the document from the server-computer system in a single request thereby optimizing retrieval without requesting entire resource bundles.

Patent
16 May 2016
TL;DR: In this paper, a method for determining if a user of a computer system is a human was proposed, where a processor receives an indication that a computer security program is needed and acquires at least one image depicting a first string of characters including at least a first and second set of one or more characters.
Abstract: A method for determining if a user of a computer system is a human. A processor receives an indication that a computer security program is needed and acquires at least one image depicting a first string of characters including at least a first and second set of one or more characters. A processor assigns a substitute character to be used as input for each of the second set of one or more characters. A processor presents the at least one image and an indication of the substitute character and when to use the substitute character to the user. A processor receives a second string of characters from the user. A processor determines whether the second string of characters substantially matches the first string of characters based on the substitute character assigned to each of the second set of one or more characters and determines whether the user is a human.

Proceedings ArticleDOI
19 Oct 2016
TL;DR: This work presents a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string examples, and designs an expressive DSL to represent disjunctive filter expressions needed for several real-world data filtering tasks.
Abstract: Data filtering in spreadsheets is a common problem faced by millions of end-users. The task of data filtering requires a computational model that can separate intended positive and negative string instances. We present a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string examples. There are two key ideas of our approach. First, we design an expressive DSL to represent disjunctive filter expressions needed for several real-world data filtering tasks. Second, we develop an efficient synthesis algorithm for incrementally learning consistent filter expressions in the DSL from very few positive and negative examples. A DAG-based data structure is used to succinctly represent a large number of filter expressions, and two corresponding operators are defined for algorithmically handling positive and negative examples, namely, the intersection and subtraction operators. FIDEX is able to learn data filters for 452 out of 460 real-world data filtering tasks in real time (0.22s), using only 2.2 positive string instances and 2.7 negative string instances on average.

Posted Content
TL;DR: The compressed suffix array and the compressed suffix tree of a string $T$ can be built in $O(n)$ deterministic time using $O(n\log\sigma)$ bits of space, where $n$ is the string length and $\sigma$ is the alphabet size.
Abstract: We show that the compressed suffix array and the compressed suffix tree of a string $T$ can be built in $O(n)$ deterministic time using $O(n\log\sigma)$ bits of space, where $n$ is the string length and $\sigma$ is the alphabet size. Previously described deterministic algorithms either run in time that depends on the alphabet size or need $\omega(n\log \sigma)$ bits of working space. Our result has immediate applications to other problems, such as yielding the first linear-time LZ77 and LZ78 parsing algorithms that use $O(n \log\sigma)$ bits.
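To make the object being constructed concrete, here is a deliberately naive Python sketch of a plain (uncompressed) suffix array; sorting explicit suffixes costs O(n^2 log n) time in the worst case, so it is unrelated to the paper's linear-time, O(n log sigma)-bit construction beyond producing the same array.

```python
def suffix_array(text):
    """Suffix array: the starting positions of all suffixes of text,
    listed in lexicographic order of the suffixes. Naive construction
    by sorting the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])

sa = suffix_array("banana")
```

Compressed suffix arrays store the same information in space close to the text's entropy, which is why reducing the working space of their construction matters.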

Journal ArticleDOI
TL;DR: An effective variable neighborhood search (VNS) is proposed to solve the type-II two-sided assembly line balancing problem (TALBP-II), which is to minimize cycle time for a given number of stations, and the computational results show the promising advantage of VNS on the considered TALBP-II.

Proceedings Article
01 Dec 2016
TL;DR: This paper presents a method that uses only character p-grams as features for the Arabic Dialect Identification (ADI) Closed Shared Task of the DSL 2016 Challenge and has an important advantage in that it is language independent and linguistic theory neutral, as it does not require any NLP tools.
Abstract: The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. Unlike the common approach, we present a method that uses only character p-grams (also known as n-grams) as features for the Arabic Dialect Identification (ADI) Closed Shared Task of the DSL 2016 Challenge. The proposed approach combines several string kernels using multiple kernel learning. In the learning stage, we try both Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR), and we choose KDA as it gives better results in a 10-fold cross-validation carried out on the training set. Our approach is shallow and simple, but the empirical results obtained in the ADI Shared Task prove that it achieves very good results. Indeed, we ranked in second place with an accuracy of 50.91% and a weighted F1 score of 51.31%. We also present improved results in this paper, which we obtained after the competition ended. Simply by adding more regularization into our model to make it more suitable for test data that comes from a different distribution than training data, we obtain an accuracy of 51.82% and a weighted F1 score of 52.18%. Furthermore, the proposed approach has an important advantage in that it is language independent and linguistic theory neutral, as it does not require any NLP tools.

Book ChapterDOI
17 Jul 2016
TL;DR: This paper proposes a new string analysis method based on a scalable logic circuit representation for (nondeterministic) finite automata that supports various string and automata manipulation operations, enabling both counterexample generation and filter synthesis in string constraint solving.
Abstract: Many severe security vulnerabilities in web applications can be attributed to string manipulation mistakes, which can often be avoided through formal string analysis. String analysis tools are indispensable and under active development. Prior string analysis methods are primarily automata-based or satisfiability-based. The two approaches exhibit distinct strengths and weaknesses. Specifically, existing automata-based methods have difficulty in generating counterexamples at system inputs to witness vulnerability, whereas satisfiability-based methods are inadequate to produce filters amenable for firmware or hardware implementation for real-time screening of malicious inputs to a system under protection. In this paper, we propose a new string analysis method based on a scalable logic circuit representation for (nondeterministic) finite automata to support various string and automata manipulation operations. It enables both counterexample generation and filter synthesis in string constraint solving. By using the new data structure, automata with large state spaces and/or alphabet sizes can be efficiently represented. Empirical studies on a large set of open source web applications and well-known attack patterns demonstrate the unique benefits of our method compared to prior string analysis tools.
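The automata manipulation underlying such analyses can be illustrated with plain subset simulation of an NFA, tracking the set of reachable states per input symbol. This is only the textbook construction; the paper's contribution is a scalable logic-circuit encoding of such automata, which is not reproduced here.

```python
def nfa_accepts(transitions, start, accepting, word):
    """Subset simulation: track the set of NFA states reachable on the input.
    transitions maps (state, symbol) -> iterable of successor states."""
    current = {start}
    for ch in word:
        current = {q2 for q in current for q2 in transitions.get((q, ch), ())}
        if not current:          # dead: no run can be extended
            return False
    return bool(current & accepting)

# Example NFA over {a, b} accepting exactly the strings that end in "ab"
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
print(nfa_accepts(delta, 0, {2}, "bbaab"))  # True
print(nfa_accepts(delta, 0, {2}, "aba"))    # False
```

Explicit state sets like this blow up as the automaton and alphabet grow, which is precisely the scalability problem a circuit-level representation is meant to address.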

Posted Content
TL;DR: In this paper, a signature-based sequential kernel framework for learning with sequential data, such as time series, sequences of graphs, or strings, is presented; string kernels arise as a special case, and a modification resolves the open non-definiteness issue of the closely related alignment kernels.
Abstract: We present a novel framework for kernel learning with sequential data of any kind, such as time series, sequences of graphs, or strings. Our approach is based on signature features which can be seen as an ordered variant of sample (cross-)moments; it allows to obtain a "sequentialized" version of any static kernel. The sequential kernels are efficiently computable for discrete sequences and are shown to approximate a continuous moment form in a sampling sense. A number of known kernels for sequences arise as "sequentializations" of suitable static kernels: string kernels may be obtained as a special case, and alignment kernels are closely related up to a modification that resolves their open non-definiteness issue. Our experiments indicate that our signature-based sequential kernel framework may be a promising approach to learning with sequential data, such as time series, that allows to avoid extensive manual pre-processing.
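The "ordered variant of sample (cross-)moments" can be illustrated with a discrete iterated-sums sketch of the signature: for each index word, sum products of path increments over strictly increasing time indices. This is only an order-sensitivity illustration under that simplified discrete definition; the paper's kernels and sampling approximation are not reproduced.

```python
import itertools

def truncated_signature(path, depth=2):
    """Discrete iterated-sums signature of a d-dimensional path, truncated
    at level `depth`: for each index word (a1, ..., ak), sum the products
    of increments over strictly increasing time indices i1 < ... < ik."""
    incs = [[b - a for a, b in zip(p, q)] for p, q in zip(path, path[1:])]
    d = len(path[0])
    sig = {}
    for k in range(1, depth + 1):
        for word in itertools.product(range(d), repeat=k):
            total = 0.0
            for idx in itertools.combinations(range(len(incs)), k):
                prod = 1.0
                for pos, axis in zip(idx, word):
                    prod *= incs[pos][axis]
                total += prod
            sig[word] = total
    return sig

# Right-then-up path: level-1 terms are plain moments of the increments,
# but the level-2 terms (0,1) vs (1,0) distinguish the order of the moves.
sig = truncated_signature([(0, 0), (1, 0), (1, 1)])
print(sig[(0, 1)], sig[(1, 0)])  # 1.0 0.0
```

Reversing the path would swap the two level-2 values, which is exactly the ordering information that unordered moment features discard.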

Posted Content
TL;DR: This paper proposes a novel technique, Neuro-Symbolic Program Synthesis, that can automatically construct computer programs in a domain-specific language that are consistent with a set of input-output examples provided at test time and demonstrates the effectiveness of the approach by applying it to the rich and complex domain of regular expression based string transformations.
Abstract: Recent years have seen the proposal of a number of neural architectures for the problem of Program Induction. Given a set of input-output examples, these architectures are able to learn mappings that generalize to new test inputs. While achieving impressive results, these approaches have a number of important limitations: (a) they are computationally expensive and hard to train, (b) a model has to be trained for each task (program) separately, and (c) it is hard to interpret or verify the correctness of the learnt mapping (as it is defined by a neural network). In this paper, we propose a novel technique, Neuro-Symbolic Program Synthesis, to overcome the above-mentioned problems. Once trained, our approach can automatically construct computer programs in a domain-specific language that are consistent with a set of input-output examples provided at test time. Our method is based on two novel neural modules. The first module, called the cross correlation I/O network, given a set of input-output examples, produces a continuous representation of the set of I/O examples. The second module, the Recursive-Reverse-Recursive Neural Network (R3NN), given the continuous representation of the examples, synthesizes a program by incrementally expanding partial programs. We demonstrate the effectiveness of our approach by applying it to the rich and complex domain of regular expression based string transformations. Experiments show that the R3NN model is not only able to construct programs from new input-output examples, but it is also able to construct new programs for tasks that it had never observed before during training.
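The synthesis-from-examples setting can be illustrated with brute-force enumeration over a toy DSL of unary string transformations: return the first composition consistent with every input-output pair. The DSL, primitives, and search below are illustrative assumptions, not the paper's regular-expression-based DSL or its R3NN model, which guides this expansion with a neural network instead of exhaustive search.

```python
import itertools

# A toy DSL of unary string transformations (illustrative only)
PRIMITIVES = {
    "lower":   str.lower,
    "upper":   str.upper,
    "reverse": lambda s: s[::-1],
    "first3":  lambda s: s[:3],
    "drop1":   lambda s: s[1:],
}

def run_program(names, s):
    """Apply the named primitives left to right."""
    for n in names:
        s = PRIMITIVES[n](s)
    return s

def synthesize(examples, max_depth=3):
    """Return the shortest composition consistent with all I/O examples."""
    for depth in range(1, max_depth + 1):
        for names in itertools.product(PRIMITIVES, repeat=depth):
            if all(run_program(names, x) == y for x, y in examples):
                return names
    return None

examples = [("Hello", "OLL"), ("World", "DLR")]
prog = synthesize(examples)
print(prog)  # ('upper', 'reverse', 'first3')
```

The search space grows exponentially in program depth, which is why guiding the incremental expansion of partial programs with a learned model, as R3NN does, matters in practice.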