
Showing papers on "String (computer science) published in 2019"


Journal ArticleDOI
TL;DR: The latest version of STRING more than doubles the number of organisms it covers, and offers an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input.
Abstract: Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.

10,584 citations


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This work proposes a method that dynamically aggregates contextualized embeddings of each unique string the authors encounter and uses a pooling operation to distill a "global" word representation from all contextualized instances.
Abstract: Contextual string embeddings are a recent type of contextualized word embedding that were shown to yield state-of-the-art results when utilized in a range of sequence labeling tasks. They are based on character-level language models which treat text as distributions over characters and are capable of generating embeddings for any string of characters within any textual context. However, such purely character-based approaches struggle to produce meaningful embeddings if a rare string is used in an underspecified context. To address this drawback, we propose a method in which we dynamically aggregate contextualized embeddings of each unique string that we encounter. We then use a pooling operation to distill a "global" word representation from all contextualized instances. We evaluate these "pooled contextualized embeddings" on common named entity recognition (NER) tasks such as CoNLL-03 and WNUT and show that our approach significantly improves the state-of-the-art for NER. We make all code and pre-trained models available to the research community for use and reproduction.
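The pooling idea can be sketched in a few lines (a toy illustration, not the authors' implementation; the word "Fung", the vectors, and mean pooling are all made up for the example):

```python
from collections import defaultdict

def mean_pool(vectors):
    """Distill one "global" vector from many contextual instances."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

memory = defaultdict(list)  # word -> all contextualized embeddings seen so far

def add_instance(word, vector):
    memory[word].append(vector)

def pooled_embedding(word):
    return mean_pool(memory[word])

# the same rare string observed in two different contexts
add_instance("Fung", [1.0, 0.0])
add_instance("Fung", [0.0, 1.0])
print(pooled_embedding("Fung"))  # -> [0.5, 0.5]
```

In the paper the pooled vector is concatenated with the current contextual embedding before sequence labeling; here only the memory-and-pool step is shown.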

269 citations


Proceedings ArticleDOI
06 Nov 2019
TL;DR: A zero-knowledge SNARK, Sonic, which supports a universal and continually updatable structured reference string that scales linearly in size, and a generally useful technique in which untrusted "helpers" can compute advice that allows batches of proofs to be verified more efficiently.
Abstract: Ever since their introduction, zero-knowledge proofs have become an important tool for addressing privacy and scalability concerns in a variety of applications. In many systems each client downloads and verifies every new proof, and so proofs must be small and cheap to verify. The most practical schemes require either a trusted setup, as in (pre-processing) zk-SNARKs, or verification complexity that scales linearly with the complexity of the relation, as in Bulletproofs. The structured reference strings required by most zk-SNARK schemes can be constructed with multi-party computation protocols, but the resulting parameters are specific to an individual relation. Groth et al. discovered a zk-SNARK protocol with a universal structured reference string that is also updatable, but the string scales quadratically in the size of the supported relations. Here we describe a zero-knowledge SNARK, Sonic, which supports a universal and continually updatable structured reference string that scales linearly in size. We also describe a generally useful technique in which untrusted "helpers" can compute advice that allows batches of proofs to be verified more efficiently. Sonic proofs are constant size, and in the "helped" batch verification context the marginal cost of verification is comparable with the most efficient SNARKs in the literature.

235 citations


Journal ArticleDOI
TL;DR: A stacking model by combining GBDT, XGBoost and LightGBM in multiple layers is devised, which enables different models to be complementary, thus improving the performance on phishing webpage detection.

139 citations


Journal ArticleDOI
Ji Sun1, Guoliang Li1
01 Nov 2019
TL;DR: This work proposes an effective end-to-end learning-based cost estimation framework based on a tree-structured model, which can estimate both cost and cardinality simultaneously, and is likely to be the first end-to-end cost estimator based on deep learning.
Abstract: Cost and cardinality estimation is vital to the query optimizer, which uses it to guide query plan selection. However, traditional empirical cost and cardinality estimation techniques cannot provide high-quality estimates, because they may not effectively capture the correlation between multiple tables. Recently the database community has shown that learning-based cardinality estimation outperforms the empirical methods. However, existing learning-based methods have several limitations. Firstly, they focus on estimating the cardinality, but cannot estimate the cost. Secondly, they are either too heavy or hard to represent complicated structures, e.g., complex predicates. To address these challenges, we propose an effective end-to-end learning-based cost estimation framework based on a tree-structured model, which can estimate both cost and cardinality simultaneously. We propose effective feature extraction and encoding techniques, which consider both queries and physical operations in feature extraction. We embed these features into our tree-structured model. We propose an effective method to encode string values, which can improve the generalization ability for predicate matching. As it is prohibitively expensive to enumerate all string values, we design a pattern-based method, which selects patterns to cover string values and utilizes the patterns to embed string values. We conducted experiments on real-world datasets, and the results showed that our method outperformed baselines.

136 citations


Journal ArticleDOI
TL;DR: In this article, a serial distributed model predictive control (MPC) approach for connected automated vehicles (CAVs) is developed with local stability (disturbance dissipation over time) and multi-criteria string stability (disturbance attenuation through a vehicular string).
Abstract: In this paper, a serial distributed model predictive control (MPC) approach for connected automated vehicles (CAVs) is developed with local stability (disturbance dissipation over time) and multi-criteria string stability (disturbance attenuation through a vehicular string). Two string stability criteria are considered within the proposed MPC: (i) the l∞-norm string stability criterion for attenuation of the maximum disturbance magnitude and (ii) the l2-norm string stability criterion for attenuation of disturbance energy. The l∞-norm string stability is achieved by formulating constraints within the MPC based on the future states of the leading CAV, and the l2-norm string stability is achieved by proper weight matrix tuning over a robust positive invariant set. For rigor, mathematical proofs for asymptotic local stability and multi-criteria string stability are provided. Simulation experiments verify that the distributed serial MPC proposed in this study is effective for disturbance attenuation and performs better than a traditional MPC without stability guarantees.
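The two criteria can be written compactly (a standard paraphrase, not necessarily the paper's exact formulation; here e_i denotes the tracking error of vehicle i and Γ_i(s) the disturbance transfer function from vehicle i−1 to vehicle i):

```latex
\underbrace{\;\|e_i\|_{\infty} \le \|e_{i-1}\|_{\infty}\;}_{\ell_\infty\text{ string stability}}
\qquad\qquad
\underbrace{\;\|e_i\|_{2} \le \|e_{i-1}\|_{2}
\;\Longleftrightarrow\;
\sup_{\omega}\,\bigl|\Gamma_i(j\omega)\bigr| \le 1\;}_{\ell_2\text{ string stability}}
```

The ℓ∞ form bounds the worst-case disturbance magnitude along the string, while the ℓ2 form bounds the disturbance energy and is equivalent to a frequency-domain gain condition.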

131 citations


Journal ArticleDOI
TL;DR: An artificial intelligence agent known as an asynchronous advantage actor-critic is utilized to explore type IIA compactifications with intersecting D6-branes to solve various string theory consistency conditions simultaneously, phrased in terms of non-linear, coupled Diophantine equations.
Abstract: We propose deep reinforcement learning as a model-free method for exploring the landscape of string vacua. As a concrete application, we utilize an artificial intelligence agent known as an asynchronous advantage actor-critic to explore type IIA compactifications with intersecting D6-branes. As different string background configurations are explored by changing D6-brane configurations, the agent receives rewards and punishments related to string consistency conditions and proximity to Standard Model vacua. These are in turn utilized to update the agent’s policy and value neural networks to improve its behavior. By reinforcement learning, the agent’s performance in both tasks is significantly improved, and for some tasks it finds a factor of $$ \mathcal{O}(200) $$ more solutions than a random walker. In one case, we demonstrate that the agent learns a human-derived strategy for finding consistent string models. In another case, where no human-derived strategy exists, the agent learns a genuinely new strategy that achieves the same goal twice as efficiently per unit time. Our results demonstrate that the agent learns to solve various string theory consistency conditions simultaneously, which are phrased in terms of non-linear, coupled Diophantine equations.

88 citations


Journal Article
TL;DR: The experiments indicate that the signature-based sequential kernel framework may be a promising approach to learning with sequential data, such as time series, that allows one to avoid extensive manual pre-processing.
Abstract: We present a novel framework for kernel learning with sequential data of any kind, such as time series, sequences of graphs, or strings. Our approach is based on signature features which can be seen as an ordered variant of sample (cross-)moments; it allows one to obtain a "sequentialized" version of any static kernel. The sequential kernels are efficiently computable for discrete sequences and are shown to approximate a continuous moment form in a sampling sense. A number of known kernels for sequences arise as "sequentializations" of suitable static kernels: string kernels may be obtained as a special case, and alignment kernels are closely related up to a modification that resolves their open non-definiteness issue. Our experiments indicate that our signature-based sequential kernel framework may be a promising approach to learning with sequential data, such as time series, that allows one to avoid extensive manual pre-processing.
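The "ordered moments" intuition can be illustrated with a toy computation (a sketch under simplifying assumptions: 1-D sequences, truncation at level two, and a plain linear kernel on the features; none of this is the paper's exact construction):

```python
from itertools import combinations

def signature_features(seq, depth=2):
    """Toy signature features of a 1-D sequence: iterated sums of
    increments, an *ordered* analogue of sample moments."""
    dx = [b - a for a, b in zip(seq, seq[1:])]   # increments
    feats = [1.0]                                # level 0
    feats.append(sum(dx))                        # level 1: total increment
    if depth >= 2:                               # level 2: ordered pairs i < j
        feats.append(sum(dx[i] * dx[j]
                         for i, j in combinations(range(len(dx)), 2)))
    return feats

def sequential_kernel(s, t):
    """A 'sequentialized' kernel: here simply a linear kernel on the
    truncated signature features of the two sequences."""
    return sum(a * b for a, b in zip(signature_features(s),
                                     signature_features(t)))

print(signature_features([0, 1, 3]))  # -> [1.0, 3, 2]
```

Because the level-2 term sums only over ordered index pairs, reversing a sequence changes the features even though the multiset of increments is unchanged, which is exactly what plain moments cannot capture.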

77 citations


Journal ArticleDOI
TL;DR: NetGO is proposed, a web server that is able to further improve the performance of the large-scale AFP by incorporating massive protein-protein network information and significantly outperforms GOLabeler and other competing methods.
Abstract: Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a problem of the large-scale multi-label classification where a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler-a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of the large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO are threefold in using network information: (i) NetGO relies on a powerful learning to rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (other than only some specific species) and (iii) NetGO still can use network information to annotate a protein by homology transfer, even if it is not contained in STRING. Separating training and testing data with the same time-delayed settings of CAFA, we comprehensively examined the performance of NetGO. Experimental results have clearly demonstrated that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.

73 citations


Journal ArticleDOI
TL;DR: In this paper, a robust car-following control strategy under uncertainty for connected and automated vehicles (CAVs) is presented, designed as a decentralized linear feedback and feedforward controller with a focus on robust local and string stability under (i) time-varying uncertain vehicle dynamics and (ii) time-varying uncertain communication delay.
Abstract: This paper presents a robust car-following control strategy under uncertainty for connected and automated vehicles (CAVs). The proposed control is designed as a decentralized linear feedback and feedforward controller with a focus on robust local and string stability under (i) time-varying uncertain vehicle dynamics and (ii) time-varying uncertain communication delay. The former uncertainty is incorporated into the general longitudinal vehicle dynamics (GLVD) equation that regulates the difference between the desired acceleration (prescribed by the control model) and the actual acceleration by compensating for nonlinear vehicle dynamics (e.g., due to aerodynamic drag force). The latter uncertainty is incorporated into acceleration information received from the vehicle immediately ahead. As a primary contribution, this study derives and proves (i) a sufficient and necessary condition for local stability and (ii) sufficient conditions for robust string stability in the frequency domain using the Laplace transform. Simulation experiments verify the correctness of the mathematical proofs and demonstrate that the proposed control is effective for ensuring stability against uncertainties.

70 citations


Journal ArticleDOI
TL;DR: A model predictive control approach in combination with a feed-forward control design, which is based on a shared vector of predicted accelerations over a finite time horizon, is shown to be applicable to a heterogeneous sequence of vehicles, while the vehicle parameters remain confidential.
Abstract: Cooperative adaptive cruise control (CACC) is a potential solution to decrease traffic jams caused by shock waves, increase the road capacity, decrease fuel consumption and improve safety. This paper proposes an integrated solution to a combination of four challenges in these CACC systems. One of the technological challenges is how to guarantee string stability (the ability to avoid amplification of dynamic vehicle responses along the string of vehicles) under nominal operational conditions. The second challenge is how to apply this solution to heterogeneous vehicles. The third challenge is how to maintain confidentiality of the vehicle parameters. Finally, the fourth challenge is to find a method which improves robustness against wireless packet loss. This paper proposes a model predictive control approach in combination with a feed-forward control design, which is based on a shared vector of predicted accelerations over a finite time horizon. This approach is shown to be applicable to a heterogeneous sequence of vehicles, while the vehicle parameters remain confidential. In previous work, such an approach has been shown to increase robustness against packet losses. Conditions for string stability are presented for the nominal operational conditions. Experimental results are presented and indeed demonstrate string stable behavior.

Posted Content
31 May 2019
TL;DR: SELFIES (SELF-referencIng Embedded Strings) is introduced: a 100%-robust string-based representation of molecules in which every string corresponds to a valid molecule and every molecule can be represented, so it can be used in generative models without modification.
Abstract: The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally, matter engineering -- generally denoted as inverse design -- relied heavily on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard string-based molecular representation, SMILES, shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than in a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal workings of the generative models.

Journal ArticleDOI
TL;DR: This work generalizes two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment, and applies it to five different types of graphs and observes a speedup between 3-fold and 20-fold compared with a previous alignment algorithm.
Abstract: Motivation: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers' bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V| + ⌈m/w⌉|E| log w) for acyclic graphs and O(|V| + m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm. Availability and implementation: https://github.com/maickrau/GraphAligner. Supplementary information: Supplementary data are available at Bioinformatics online.
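The Shift-And half of the generalization is easy to state for plain strings (a textbook sketch of the sequence-to-sequence base case, not the paper's graph version):

```python
def shift_and(pattern, text):
    """Shift-And exact matching: the pattern is encoded as per-character
    bitmasks, so up to w pattern positions advance per word operation."""
    m = len(pattern)
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)   # bit i set where pattern[i] == c
    state, found = 0, []
    for j, c in enumerate(text):
        # shift in a fresh match attempt and keep only surviving prefixes
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & (1 << (m - 1)):            # full pattern matched
            found.append(j - m + 1)           # start position of the match
    return found

print(shift_and("aba", "ababa"))  # -> [0, 2]
```

The graph generalization in the paper propagates such bitvector states along edges instead of a single left-to-right scan.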

Journal ArticleDOI
02 Jan 2019
TL;DR: The OSTRICH solver provides a decision procedure for checking path feasibility in string-manipulating programs, which can be used to detect XSS vulnerabilities in web applications.
Abstract: The design and implementation of decision procedures for checking path feasibility in string-manipulating programs is an important problem, with such applications as symbolic execution of programs with strings and automated detection of cross-site scripting (XSS) vulnerabilities in web applications. A (symbolic) path is given as a finite sequence of assignments and assertions (i.e. without loops), and checking its feasibility amounts to determining the existence of inputs that yield a successful execution. Modern programming languages (e.g. JavaScript, PHP, and Python) support many complex string operations, and strings are also often implicitly modified during a computation in some intricate fashion (e.g. by some autoescaping mechanisms). In this paper we provide two general semantic conditions which together ensure the decidability of path feasibility: (1) each assertion admits regular monadic decomposition (i.e. is an effectively recognisable relation), and (2) each assignment uses a (possibly nondeterministic) function whose inverse relation preserves regularity. We show that the semantic conditions are expressive since they are satisfied by a multitude of string operations including concatenation, one-way and two-way finite-state transducers, replaceall functions (where the replacement string could contain variables), string-reverse functions, regular-expression matching, and some (restricted) forms of letter-counting/length functions. The semantic conditions also strictly subsume existing decidable string theories (e.g. straight-line fragments, and acyclic logics), and most existing benchmarks (e.g. most of Kaluza’s, and all of SLOG’s, Stranger’s, and SLOTH’s benchmarks). 
Our semantic conditions also yield a conceptually simple decision procedure, as well as an extensible architecture of a string solver in that a user may easily incorporate his/her own string functions into the solver by simply providing code for the pre-image computation without worrying about other parts of the solver. Despite these, the semantic conditions are unfortunately too general to provide a fast and complete decision procedure. We provide strong theoretical evidence for this in the form of complexity results. To rectify this problem, we propose two solutions. Our main solution is to allow only partial string functions (i.e., prohibit nondeterminism) in condition (2). This restriction is satisfied in many cases in practice, and yields decision procedures that are effective in both theory and practice. Whenever nondeterministic functions are still needed (e.g. the string function split), our second solution is to provide a syntactic fragment that provides a support of nondeterministic functions, and operations like one-way transducers, replaceall (with constant replacement string), the string-reverse function, concatenation, and regular-expression matching. We show that this fragment can be reduced to an existing solver SLOTH that exploits fast model checking algorithms like IC3. We provide an efficient implementation of our decision procedure (assuming our first solution above, i.e., deterministic partial string functions) in a new string solver OSTRICH. Our implementation provides built-in support for concatenation, reverse, functional transducers (FFT), and replaceall and provides a framework for extensibility to support further string functions. We demonstrate the efficacy of our new solver against other competitive solvers.

Journal ArticleDOI
TL;DR: In this paper, the authors developed space and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance.
Abstract: In the string correction problem, we are to transform one string into another using a set of prescribed edit operations. In string correction using the Damerau-Levenshtein (DL) distance, the permissible edit operations are: substitution, insertion, deletion and transposition. Several algorithms for string correction using the DL distance have been proposed. The fastest and most space-efficient of these algorithms is due to Lowrance and Wagner. It computes the DL distance between strings of length m and n, respectively, in O(mn) time and O(mn) space. In this paper, we focus on the development of algorithms whose asymptotic space complexity is less and whose actual runtime and energy consumption are less than those of the algorithm of Lowrance and Wagner. We develop space- and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance. Our algorithms require O(s min{m,n}+m+n) space, where s is the size of the alphabet and m and n are, respectively, the lengths of the two strings. Previously known algorithms require O(mn) space. The space- and cache-efficient algorithms of this paper are demonstrated, experimentally, to be superior to earlier algorithms for the DL distance problem on time, space, and energy metrics using three different computational platforms. Our benchmarking shows that our algorithms are able to handle much larger sequences than earlier algorithms due to the reduction in space requirements. On a single core, we are able to compute the DL distance and an optimal edit sequence faster than known algorithms by as much as 73.1% and 63.5%, respectively. Further, we reduce energy consumption by as much as 68.5%. Multicore versions of our algorithms achieve a speedup of 23.2 on 24 cores.
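For reference, the classic quadratic-space table that these algorithms improve upon can be sketched as follows (note this is the restricted "optimal string alignment" variant with adjacent transpositions, not the full Lowrance-Wagner algorithm):

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance (optimal string alignment):
    substitution, insertion, deletion and adjacent transposition, computed
    with the classic O(mn)-space dynamic-programming table."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("ca", "ac"))  # -> 1 (one transposition)
```

The paper's contribution is replacing this O(mn) table with O(s min{m,n}+m+n)-space, cache-friendly computations.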

Journal ArticleDOI
TL;DR: The effectiveness of the approach is illustrated via a numerical example, where it is shown how the result can be recast as an optimization problem, allowing one to design the control protocol for each vehicle independently of the other vehicles and hence leading to a bottom-up approach for the design of string-stable systems able to track a time-varying reference speed.

Proceedings ArticleDOI
01 Jan 2019
TL;DR: A conditional lower bound is proved stating that, for any constant ε > 0, an O(|E|^(1−ε) m)-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false.
Abstract: Exact string matching in labeled graphs is the problem of searching paths of a graph G=(V,E) such that the concatenation of their node labels is equal to the given pattern string P[1..m]. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks. We prove a conditional lower bound stating that, for any constant epsilon>0, an O(|E|^{1 - epsilon} m)-time, or an O(|E| m^{1 - epsilon})-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. This holds even if restricted to undirected graphs with maximum node degree two, i.e. to zig-zag matching in bidirectional strings, or to deterministic directed acyclic graphs whose nodes have maximum sum of indegree and outdegree three. These restricted cases make the lower bound stricter than what can be directly derived from related bounds on regular expression matching (Backurs and Indyk, FOCS'16). In fact, our bounds are tight in the sense that lowering the degree or the alphabet size yields linear-time solvable problems. An interesting corollary is that exact and approximate matching are equally hard (quadratic time) in graphs under SETH. In comparison, the same problems restricted to strings have linear-time vs quadratic-time solutions, respectively (approximate pattern matching having also a matching SETH lower bound (Backurs and Indyk, STOC'15)).
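The problem itself is easy to state with a brute-force matcher (a naive sketch assuming one character per node; the paper's point is that no algorithm can substantially beat quadratic time under SETH, so this is for illustrating the definition only):

```python
def graph_string_match(labels, edges, pattern):
    """Find all start nodes of paths in a node-labeled graph whose
    concatenated labels equal the pattern. labels: node -> character;
    edges: list of directed (u, v) pairs."""
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)

    def walk(v, i):
        # recursion depth is bounded by len(pattern), so cycles are safe
        if labels[v] != pattern[i]:
            return False
        if i == len(pattern) - 1:
            return True
        return any(walk(w, i + 1) for w in adj[v])

    return sorted(v for v in labels if walk(v, 0))

labels = {0: "a", 1: "b", 2: "a"}
edges = [(0, 1), (1, 2), (2, 1)]   # cycle between nodes 1 and 2
print(graph_string_match(labels, edges, "aba"))  # -> [0, 2]
```

The cycle in the example is the "zig-zag" situation the lower bound covers even for maximum degree two.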

Journal ArticleDOI
TL;DR: In this article, a Gamma-Poisson matrix factorization on substring counts and a min-hash encoder are proposed for low-dimensional encoding of high-cardinality string categorical variables.
Abstract: Statistical models usually require vector representations of categorical variables, using, for instance, one-hot encoding. This strategy breaks down when the number of categories grows, as it creates high-dimensional feature vectors. Additionally, for string entries, one-hot encoding does not capture information in their representation. Here, we seek low-dimensional encoding of high-cardinality string categorical variables. Ideally, these should be: scalable to many categories; interpretable to end users; and facilitate statistical analysis. We introduce two encoding approaches for string categories: a Gamma-Poisson matrix factorization on substring counts, and the min-hash encoder, for fast approximation of string similarities. We show that min-hash turns set inclusions into inequality relations that are easier to learn. Both approaches are scalable and streamable. Experiments on real and simulated data show that these methods improve supervised learning with high-cardinality categorical variables. We recommend the following: if scalability is central, the min-hash encoder is the best option as it does not require any data fit; if interpretability is important, the Gamma-Poisson factorization is the best alternative, as it can be interpreted as one-hot encoding on inferred categories with informative feature names. Both models enable autoML on the original string entries as they remove the need for feature engineering or data cleaning.
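The min-hash property that turns set inclusion into inequalities is easy to demonstrate (a toy sketch using a salted MD5 in place of a proper hash family; not the paper's actual encoder):

```python
import hashlib

def char_ngrams(s, n=3):
    """Character n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def _h(k, gram):
    # k-th hash function, built from a salted, stable digest
    return int(hashlib.md5(f"{k}:{gram}".encode()).hexdigest(), 16)

def minhash_encode(s, dims=8, n=3):
    """Min-hash encoding of a string category: component k is the minimum
    of hash_k over the string's n-grams. If the n-grams of A are a subset
    of those of B, every component of encode(B) is <= that of encode(A)."""
    grams = char_ngrams(s, n)
    return [min(_h(k, g) for g in grams) for k in range(dims)]

a = minhash_encode("Paris")
b = minhash_encode("city of Paris")        # n-grams of "Paris" are a subset
print(all(x >= y for x, y in zip(a, b)))   # -> True
```

This component-wise inequality is what makes substring-included categories linearly separable for downstream learners.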

Posted Content
TL;DR: The consequences of the string-unstable ACC system on synthetic and empirical lead vehicle disturbances are identified, highlighting that commercial ACC platoons of moderate size can dampen some disturbances even while being string unstable.
Abstract: This article is motivated by the lack of empirical data on the performance of commercially available Society of Automotive Engineers level one automated driving systems. To address this, a set of car following experiments are conducted to collect data from a 2015 luxury electric vehicle equipped with a commercial adaptive cruise control (ACC) system. Velocity, relative velocity, and spacing data collected during the experiments are used to calibrate an optimal velocity relative velocity car following model for both the minimum and maximum following settings. The string stability of both calibrated models is assessed, and it is determined that the best-fit models are string unstable, indicating they are not able to prevent all traffic disturbances from amplifying into phantom jams. Based on the calibrated models, we identify the consequences of the string unstable ACC system on synthetic and empirical lead vehicle disturbances, highlighting that some disturbances can be dampened even with string unstable commercial ACC platoons of moderate size.
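The calibrated model family has a common form (a standard optimal velocity relative velocity model; the symbols α, β, and V(·) here are generic, not the paper's fitted parameters):

```latex
\dot{v}_i(t) \;=\; \alpha\,\bigl(V(s_i(t)) - v_i(t)\bigr) \;+\; \beta\,\Delta v_i(t)
```

where s_i is the spacing to the lead vehicle, Δv_i the relative velocity, and V(·) the optimal velocity function; string stability is then assessed from the transfer function mapping the lead vehicle's speed perturbation to the follower's, with gain above one at some frequency indicating amplification.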

Proceedings ArticleDOI
23 Jun 2019
TL;DR: This paper proposes the first algorithm that breaks the O(n)-time barrier for BWT construction, based on a novel concept of string synchronizing sets, which is of independent interest. The technique also yields a data structure of the optimal size O(n/log n) that answers Longest Common Extension (LCE) queries in O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/log n) time.
Abstract: Burrows–Wheeler transform (BWT) is an invertible text transformation that, given a text T of length n, permutes its symbols according to the lexicographic order of suffixes of T. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length n, occupying O(n/log n) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n) time and O(n/log n) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n) time. Despite the clearly suboptimal running time, the existing techniques appear to have reached their limits. In this paper, we propose the first algorithm that breaks the O(n)-time barrier for BWT construction. Given a binary string of length n, our procedure builds the Burrows–Wheeler transform in O(n/√(log n)) time and O(n/log n) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(m√(log m))-time solution by Chan and Patrascu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n/log n) that answers Longest Common Extension queries (LCE queries) in O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/log n) time.
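What is being constructed can be shown with the textbook quadratic method (sorted rotations; the paper's contribution is building the same output in o(n) time for binary strings):

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations: far from the
    paper's construction, but it defines the object being built. The
    sentinel is assumed absent from the text and smaller than all symbols."""
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)   # last column

print(bwt("banana"))  # -> "annb$aa"
```

Sorting the rotations is equivalent to sorting the suffixes of the sentinel-terminated text, which is why BWT construction is tied to suffix sorting.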

Journal ArticleDOI
TL;DR: This paper introduces a general framework of syntactic similarity measures for matching short text by dividing them into three components: character-level similarity, string segmentation, and matching technique, and provides an open-source Java toolkit of the proposed framework.
Abstract: Similarity measure is an essential component of information retrieval, document clustering, text summarization, and question answering, among others. In this paper, we introduce a general framework of syntactic similarity measures for matching short text. We thoroughly analyze the measures by dividing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit of the proposed framework, which integrates the individual components together so that completely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of dataset. For well-maintained datasets, using a token-level measure is important, but the basic (crisp) variant is usually enough. For uncontrolled datasets where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching with q-grams at the character level performs best. A gap between human perception and syntactic measures still remains due to the lack of semantic analysis.
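A minimal sketch of the kind of soft token-level measure the abstract describes, combining character-level q-gram similarity with a greedy token matching (the toolkit's actual matching technique may differ; all names here are illustrative):

```python
def qgrams(s: str, q: int = 2) -> set:
    """Character-level q-grams, padded so short tokens still yield grams."""
    s = f"#{s}#"
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def char_sim(a: str, b: str, q: int = 2) -> float:
    """Character-level similarity: Jaccard coefficient over q-grams."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def soft_token_sim(s1: str, s2: str) -> float:
    """Soft token-level measure: each token of s1 is scored against its
    most similar token of s2 (a greedy sketch, not an optimal matching)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return 0.0
    score = sum(max(char_sim(a, b) for b in t2) for a in t1)
    return score / max(len(t1), len(t2))
```

The soft variant tolerates typos: `soft_token_sim("john smith", "jon smith")` still scores 0.75, whereas a crisp token-set measure would treat "john" and "jon" as entirely different.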

Book ChapterDOI
08 Dec 2019
TL;DR: In this paper, the authors address the drawbacks of privacy-enhanced cryptocurrencies such as Monero and Zcash, which are specifically designed to counteract the tracking analysis possible in currencies like Bitcoin.
Abstract: Despite their usage of pseudonyms rather than persistent identifiers, most existing cryptocurrencies do not provide users with any meaningful levels of privacy. This has prompted the creation of privacy-enhanced cryptocurrencies such as Monero and Zcash, which are specifically designed to counteract the tracking analysis possible in currencies like Bitcoin. These cryptocurrencies, however, also suffer from some drawbacks: in both Monero and Zcash, the set of potentially unspent coins is always growing, which means users cannot store a concise representation of the blockchain. Additionally, Zcash requires a common reference string, and the fact that addresses are reused multiple times in Monero has led to attacks on its anonymity.

Journal ArticleDOI
TL;DR: In this article, the authors proposed a universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation.
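For intuition, a string attractor of a text T is a set of positions Γ such that every substring of T has at least one occurrence crossing a position of Γ. A brute-force checker of this property (illustrative only, and exponentially far from the efficiency of the paper's index) might look like:

```python
def is_string_attractor(text: str, gamma: set) -> bool:
    """Check the string-attractor property by brute force: every distinct
    substring must have some occurrence spanning a position in gamma
    (positions are 0-based)."""
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            # look for an occurrence [k, k + len(sub)) crossing gamma
            if not any(
                any(k <= g < k + len(sub) for g in gamma)
                for k in range(n - len(sub) + 1)
                if text[k:k + len(sub)] == sub
            ):
                return False
    return True
```

For "abab", the set {1, 2} is an attractor, while {0} is not (the substring "b" never crosses position 0). Dictionary compressors implicitly induce small attractors, which is why an attractor-based index subsumes them.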

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This article showed that a context-free grammar of size m that produces a single string w of length n (such a grammar is also called a string straight-line program) can be transformed in linear time into a context free grammar for w of size O(m), whose unique derivation tree has depth O(log n).
Abstract: We show that a context-free grammar of size m that produces a single string w of length n (such a grammar is also called a string straight-line program) can be transformed in linear time into a context-free grammar for w of size O(m), whose unique derivation tree has depth O(log n). This solves an open problem in the area of grammar-based compression, improves many results in this area, and greatly simplifies many existing constructions. Similar results are stated for two formalisms for grammar-based tree compression: top dags and forest straight-line programs. These balancing results can all be deduced from a single meta-theorem stating that the depth of an algebraic circuit over an algebra with a certain finite base property can be reduced to O(log n) at the cost of a constant multiplicative size increase. Here, n refers to the size of the unfolding (or unravelling) of the circuit. In particular, this result applies to standard arithmetic circuits over (non-commutative) semirings. A long version of the paper can be found in [1].
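A string straight-line program can be encoded as rules mapping each nonterminal to either a terminal or a pair of nonterminals. A small sketch (the encoding is hypothetical) of expansion and of the derivation depth that the paper's balancing theorem reduces to O(log n):

```python
def expand(rules, sym, memo=None):
    """Expand an SLP nonterminal into the string it derives.
    Memoized, so shared subtrees are expanded only once."""
    if memo is None:
        memo = {}
    if sym not in memo:
        rhs = rules[sym]
        if isinstance(rhs, str):   # terminal rule
            memo[sym] = rhs
        else:                      # binary rule (left, right)
            memo[sym] = expand(rules, rhs[0], memo) + expand(rules, rhs[1], memo)
    return memo[sym]

def derivation_depth(rules, sym):
    """Depth of the (unique) derivation tree of the grammar."""
    rhs = rules[sym]
    if isinstance(rhs, str):
        return 0
    return 1 + max(derivation_depth(rules, rhs[0]), derivation_depth(rules, rhs[1]))

# A balanced grammar for "abab": S -> XX, X -> AB, A -> a, B -> b
rules = {"A": "a", "B": "b", "X": ("A", "B"), "S": ("X", "X")}
```

Here `expand(rules, "S")` derives "abab" at depth 2; a left-comb grammar for the same string would have linear depth, which is what makes balancing valuable for algorithms that walk the derivation tree.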

01 Jan 2019
TL;DR: This study derives and proves a sufficient and necessary condition for local stability and sufficient conditions for robust string stability in the frequency domain using the Laplace transformation.
Abstract: This paper presents a robust car-following control strategy under uncertainty for connected and automated vehicles (CAVs). The proposed control is designed as a decentralized linear feedback and feedforward controller with a focus on robust local and string stability under (i) time-varying uncertain vehicle dynamics and (ii) time-varying uncertain communication delay. The former uncertainty is incorporated into the general longitudinal vehicle dynamics (GLVD) equation that regulates the difference between the desired acceleration (prescribed by the control model) and the actual acceleration by compensating for nonlinear vehicle dynamics (e.g., due to aerodynamic drag force). The latter uncertainty is incorporated into acceleration information received from the vehicle immediately ahead. As a primary contribution, this study derives and proves (i) a sufficient and necessary condition for local stability and (ii) sufficient conditions for robust string stability in the frequency domain using the Laplace transformation. Simulation experiments verify the correctness of the mathematical proofs and demonstrate that the proposed control is effective for ensuring stability against uncertainties.
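As a simplified illustration of frequency-domain string stability (not the paper's controller), consider the classic constant-time-headway platoon with spacing-error transfer function Γ(s) = (k_d s + k_p) / (s² + (k_d + k_p h)s + k_p); string stability requires sup over ω of |Γ(jω)| ≤ 1. The gains `kp`, `kd` and headway `h` below are hypothetical:

```python
import numpy as np

def string_stability_margin(kp, kd, h, omegas=None):
    """Numerically evaluate sup_w |Gamma(jw)| for the constant-time-headway
    platoon transfer function
        Gamma(s) = (kd*s + kp) / (s^2 + (kd + kp*h)*s + kp).
    Values <= 1 indicate string stability (disturbances attenuate
    upstream); values > 1 indicate amplification along the platoon."""
    if omegas is None:
        omegas = np.logspace(-3, 3, 20000)  # frequency grid in rad/s
    s = 1j * omegas
    gamma = (kd * s + kp) / (s**2 + (kd + kp * h) * s + kp)
    return np.abs(gamma).max()
```

With a sufficiently large headway (e.g. h = 2) the margin stays at or below 1, while a short headway and weak damping (e.g. h = 0.1, kd = 0.1) push it above 1, i.e. string instability.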

Proceedings ArticleDOI
23 Jun 2019
TL;DR: In this article, fast-decodable indexing schemes for edit distance were introduced, which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I.
Abstract: We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I. In particular, for every length n and every ε > 0, one can in near-linear time construct a string I ∈ Σ′^n with |Σ′| = O_ε(1), such that indexing any string S ∈ Σ^n, symbol by symbol, with I results in a string S′ ∈ Σ″^n, where Σ″ = Σ × Σ′, for which edit distance computations are easy, i.e., one can compute a (1+ε)-approximation of the edit distance between S′ and any other string in O(n log n) time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error-correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC '17] and faster decoding algorithms for the list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP '18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.
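For reference, the quadratic-time baseline that such indexing schemes aim to beat is the textbook edit distance dynamic program:

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook O(|a|*|b|) dynamic program for Levenshtein distance,
    keeping only two rows of the DP table."""
    n = len(b)
    prev = list(range(n + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1])) # substitution
        prev = cur
    return prev[n]
```

For example, `edit_distance("kitten", "sitting")` is 3. The abstract's point is that once one string is pre-indexed, a (1+ε)-approximation of this quantity can be computed in near-linear rather than quadratic time.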

Posted Content
TL;DR: This survey gives a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set and hopes it will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
Abstract: The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the last ten years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
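The basic object in question can be built naively in one line; the surveyed data structures represent the same set far more compactly and support fast membership queries:

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}
```

For instance, the 3-mers of "ACGTAC" are {ACG, CGT, GTA, TAC}, and querying is simply `"CGT" in kmer_set("ACGTAC", 3)`. At genomic scale such a hash set is prohibitively large, which motivates the specialized representations the survey compares.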

Book ChapterDOI
08 Apr 2019
TL;DR: This article developed a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors.
Abstract: Extracting causal relationships from observed correlations is a growing area in probabilistic reasoning, originating with the seminal work of Pearl and others from the early 1990s. This paper develops a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors.
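The semantics side can be made concrete: sequential composition of string diagrams is interpreted as multiplication of row-stochastic matrices. A minimal numpy sketch (the matrices f and g are made-up examples):

```python
import numpy as np

# Two "channels": row i of f is the conditional distribution P(Y | X = i),
# and row i of g is P(Z | Y = i).
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])
g = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Plugging the output wire of f into the input wire of g
# (diagram composition) is matrix multiplication: P(Z | X) = f @ g.
h = f @ g
```

The composite `h` is again row-stochastic, mirroring the fact that composing two diagrams yields another diagram; interpretation as matrices is functorial precisely because it sends diagram composition to matrix multiplication.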

Proceedings ArticleDOI
06 Jan 2019
TL;DR: In this paper, the authors present massively parallel algorithms for edit distance and longest common subsequence that achieve an approximation factor of 1 + ϵ, constant round complexity, and O(n²) total running time over all machines.
Abstract: String similarity measures are among the most fundamental problems in computer science. The notable examples are edit distance (ED) and longest common subsequence (LCS). These problems find their applications in various contexts such as computational biology, text processing, compiler optimization, data analysis, image analysis, etc. In this work, we revisit edit distance and longest common subsequence in the parallel setting. We present massively parallel algorithms for both problems that are optimal in the following senses:
• The approximation factor of our algorithms is 1 + ϵ.
• The round complexity of our algorithms is constant.
• The total running time of our algorithms over all machines is O(n²).
This matches the running time of the best-known solutions for approximating edit distance and longest common subsequence within a 1 + ϵ factor in the sequential setting. Our result for edit distance substantially improves the massively parallel algorithm of [15] in terms of approximation factor, round complexity, number of machines, and total running time. Our unified approach to tackling both problems is to divide one of the strings into smaller blocks and try to locally predict which intervals of the other string correspond to each block in an optimal solution. Our main technical contribution is a novel parallel algorithm for computing a set of compositions of functions and recursively decomposing each function into a set of smaller iterative compositions (in terms of the memory needed to solve the problem). These two methods together give us a strong tool for approximating combinatorial problems. For instance, LCS can be formulated as a recursive composition of functions, and therefore this tool enables us to approximate LCS within a factor of 1 + ϵ. Indeed, we recursively decompose the problem until we are able to compute the solution on a single machine. Since our methods are quite general, we expect this technique to find applications in other combinatorial problems as well.
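The sequential baseline whose total work the parallel algorithm matches is the classic quadratic LCS dynamic program:

```python
def lcs_length(a: str, b: str) -> int:
    """Classic O(|a|*|b|) dynamic program for the length of the
    longest common subsequence, keeping only two rows."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` is 4 (e.g., the subsequence "BDAB"). The parallel algorithm distributes a (1+ϵ)-approximation of this computation across machines in a constant number of rounds.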

Proceedings Article
01 Jan 2019
TL;DR: In this paper, it was shown that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions, and that the average Hamming distance of the closest input bit string with a different classification is at least sqrt(n / (2π log n)), where n is the length of the string.
Abstract: We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least sqrt(n / (2π log n)), where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle Perez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms.
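A small numerical sketch in the spirit of the paper's experiments (the architecture, widths, and weight scalings here are illustrative assumptions, not the paper's exact setup): a random two-hidden-layer ReLU network classifying bit strings, and the number of random flips needed to change its output:

```python
import numpy as np

def random_relu_classifier(n, width=128, seed=0):
    """Random two-hidden-layer ReLU network mapping n-bit strings to {0, 1},
    with He-style initialization (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, np.sqrt(2.0 / n), (width, n))
    W2 = rng.normal(0.0, np.sqrt(2.0 / width), (width, width))
    w3 = rng.normal(0.0, np.sqrt(1.0 / width), width)
    def classify(bits):
        h = np.maximum(W1 @ np.asarray(bits, dtype=float), 0.0)
        h = np.maximum(W2 @ h, 0.0)
        return int(w3 @ h > 0.0)
    return classify

def flips_to_change(classify, bits, rng):
    """Flip random bits one at a time (cumulatively) until the
    classification changes; returns the number of flips used."""
    bits = list(bits)
    label = classify(bits)
    for t, i in enumerate(rng.permutation(len(bits)), 1):
        bits[i] ^= 1
        if classify(bits) != label:
            return t
    return len(bits)
```

Averaging `flips_to_change` over many random inputs and seeds is how one would empirically probe the linear-in-n growth claimed by the theorem.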