Showing papers on "String (computer science) published in 2019"

PDF

Open Access

Journal Article•DOI•

STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets.

[...]

Damian Szklarczyk¹, Annika L. Gable¹, David Lyon¹, Alexander Junge², Stefan Wyder¹, Jaime Huerta-Cepas³, Milan Simonovic¹, Nadezhda Tsankova Doncheva², John H. Morris⁴, Peer Bork, Lars Juhl Jensen², Christian von Mering¹ - Show less +8 more•Institutions (4)

Swiss Institute of Bioinformatics¹, University of Copenhagen², Technical University of Madrid³, University of California, San Francisco⁴

08 Jan 2019-Nucleic Acids Research

TL;DR: The latest version of STRING more than doubles the number of organisms it covers, and offers an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input.

...read moreread less

Abstract: Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.

...read moreread less

10,584 citations

Proceedings Article•DOI•

Pooled Contextualized Embeddings for Named Entity Recognition.

[...]

Alan Akbik¹, Tanja Bergmann², Roland Vollgraf•Institutions (2)

IBM¹, Hasso Plattner Institute²

01 Jun 2019

TL;DR: This work proposes a method in which it dynamically aggregate contextualized embeddings of each unique string that the authors encounter and uses a pooling operation to distill a ”global” word representation from all contextualized instances.

...read moreread less

Abstract: Contextual string embeddings are a recent type of contextualized word embedding that were shown to yield state-of-the-art results when utilized in a range of sequence labeling tasks. They are based on character-level language models which treat text as distributions over characters and are capable of generating embeddings for any string of characters within any textual context. However, such purely character-based approaches struggle to produce meaningful embeddings if a rare string is used in a underspecified context. To address this drawback, we propose a method in which we dynamically aggregate contextualized embeddings of each unique string that we encounter. We then use a pooling operation to distill a ”global” word representation from all contextualized instances. We evaluate these ”pooled contextualized embeddings” on common named entity recognition (NER) tasks such as CoNLL-03 and WNUT and show that our approach significantly improves the state-of-the-art for NER. We make all code and pre-trained models available to the research community for use and reproduction.

...read moreread less

269 citations

Proceedings Article•DOI•

Sonic: Zero-Knowledge SNARKs from Linear-Size Universal and Updatable Structured Reference Strings

[...]

Mary Maller¹, Sean Bowe, Markulf Kohlweiss², Sarah Meiklejohn¹•Institutions (2)

University College London¹, University of Edinburgh²

06 Nov 2019

TL;DR: A zero-knowledge SNARK, Sonic, which supports a universal and continually updatable structured reference string that scales linearly in size, and a generally useful technique in which untrusted "helpers" can compute advice that allows batches of proofs to be verified more efficiently.

...read moreread less

Abstract: Ever since their introduction, zero-knowledge proofs have become an important tool for addressing privacy and scalability concerns in a variety of applications. In many systems each client downloads and verifies every new proof, and so proofs must be small and cheap to verify. The most practical schemes require either a trusted setup, as in (pre-processing) zk-SNARKs, or verification complexity that scales linearly with the complexity of the relation, as in Bulletproofs. The structured reference strings required by most zk-SNARK schemes can be constructed with multi-party computation protocols, but the resulting parameters are specific to an individual relation. Groth et al. discovered a zk-SNARK protocol with a universal structured reference string that is also updatable, but the string scales quadratically in the size of the supported relations. Here we describe a zero-knowledge SNARK, Sonic, which supports a universal and continually updatable structured reference string that scales linearly in size. We also describe a generally useful technique in which untrusted "helpers" can compute advice that allows batches of proofs to be verified more efficiently. Sonic proofs are constant size, and in the "helped" batch verification context the marginal cost of verification is comparable with the most efficient SNARKs in the literature.

...read moreread less

235 citations

Journal Article•DOI•

A stacking model using URL and HTML features for phishing webpage detection

[...]

Li Yukun¹, Zhenguo Yang¹, Zhenguo Yang², Chen Xu¹, Yuan Huaping¹, Wenyin Liu¹ - Show less +2 more•Institutions (2)

Guangdong University of Technology¹, City University of Hong Kong²

01 May 2019-Future Generation Computer Systems

TL;DR: A stacking model by combining GBDT, XGBoost and LightGBM in multiple layers is devised, which enables different models to be complementary, thus improving the performance on phishing webpage detection.

...read moreread less

139 citations

Journal Article•DOI•

An end-to-end learning-based cost estimator

[...]

Ji Sun¹, Guoliang Li¹•Institutions (1)

Tsinghua University¹

01 Nov 2019

TL;DR: This work proposes an effective end-to-end learning-based cost estimation framework based on a tree-structured model, which can estimate both cost and cardinality simultaneously, and is likely to be the first end- to-end cost estimator based on deep learning.

...read moreread less

Abstract: Cost and cardinality estimation is vital to query optimizer, which can guide the query plan selection. However traditional empirical cost and cardinality estimation techniques cannot provide high-quality estimation, because they may not effectively capture the correlation between multiple tables. Recently the database community shows that the learning-based cardinality estimation is better than the empirical methods. However, existing learning-based methods have several limitations. Firstly, they focus on estimating the cardinality, but cannot estimate the cost. Secondly, they are either too heavy or hard to represent complicated structures, e.g., complex predicates.To address these challenges, we propose an effective end-to-end learning-based cost estimation framework based on a tree-structured model, which can estimate both cost and cardinality simultaneously. We propose effective feature extraction and encoding techniques, which consider both queries and physical operations in feature extraction. We embed these features into our tree-structured model. We propose an effective method to encode string values, which can improve the generalization ability for predicate matching. As it is prohibitively expensive to enumerate all string values, we design a patten-based method, which selects patterns to cover string values and utilizes the patterns to embed string values. We conducted experiments on real-world datasets and experimental results showed that our method outperformed baselines.

...read moreread less

136 citations

Journal Article•DOI•

Distributed model predictive control approach for cooperative car-following with guaranteed local and string stability

[...]

Yang Zhou¹, Meng Wang², Soyoung Ahn¹•Institutions (2)

University of Wisconsin-Madison¹, Delft University of Technology²

01 Oct 2019-Transportation Research Part B-methodological

TL;DR: In this article, a serial distributed model predictive control (MPC) approach for connected automated vehicles (CAVS) is developed with local stability (disturbance dissipation over time) and multi-criteria string stability.

...read moreread less

Abstract: In this paper, a serial distributed model predictive control (MPC) approach for connected automated vehicles (CAVS) is developed with local stability (disturbance dissipation over time) and multi-criteria string stability (disturbance attenuation through a vehicular string). Two string stability criteria are considered within the proposed MPC: (i) the l∞-norm string stability criterion for attenuation of the maximum disturbance magnitude and (ii) l2-norm string stability criterion for attenuation of disturbance energy. The l∞-norm string stability is achieved by formulating constraints within the MPC based on the future states of the leading CAV, and the l2-norm string stability is achieved by proper weight matrix tuning over a robust positive invariant set. For rigor, mathematical proofs for asymptotical local stability and multi-criteria string stability are provided. Simulation experiments verify that the distributed serial MPC proposed in this study is effective for disturbance attenuation and performs better than traditional MPC without stability guarantee.

...read moreread less

131 citations

Journal Article•DOI•

Branes with Brains: Exploring String Vacua with Deep Reinforcement Learning

[...]

James Halverson¹, Brent D. Nelson¹, Fabian Ruehle², Fabian Ruehle³•Institutions (3)

Northeastern University¹, University of Oxford², CERN³

27 Mar 2019-Journal of High Energy Physics

TL;DR: An artificial intelligence agent known as an asynchronous advantage actor-critic is utilized to explore type IIA compactifications with intersecting D6-branes to solve various string theory consistency conditions simultaneously, phrased in terms of non-linear, coupled Diophantine equations.

...read moreread less

Abstract: We propose deep reinforcement learning as a model-free method for exploring the landscape of string vacua. As a concrete application, we utilize an artificial intelligence agent known as an asynchronous advantage actor-critic to explore type IIA compactifications with intersecting D6-branes. As different string background configurations are explored by changing D6-brane configurations, the agent receives rewards and punishments related to string consistency conditions and proximity to Standard Model vacua. These are in turn utilized to update the agent’s policy and value neural networks to improve its behavior. By reinforcement learning, the agent’s performance in both tasks is significantly improved, and for some tasks it finds a factor of $$ \mathcal{O}(200) $$ more solutions than a random walker. In one case, we demonstrate that the agent learns a human-derived strategy for finding consistent string models. In another case, where no human-derived strategy exists, the agent learns a genuinely new strategy that achieves the same goal twice as efficiently per unit time. Our results demonstrate that the agent learns to solve various string theory consistency conditions simultaneously, which are phrased in terms of non-linear, coupled Diophantine equations.

...read moreread less

88 citations

Journal Article•

Kernels for sequentially ordered data

[...]

Franz J. Király, Harald Oberhauser

01 Jan 2019-Journal of Machine Learning Research

TL;DR: The experiments indicate that the signature-based sequential kernel framework may be a promising approach to learning with sequential data, such as time series, that allows to avoid extensive manual pre-processing.

...read moreread less

Abstract: We present a novel framework for kernel learning with sequential data of any kind, such as time series, sequences of graphs, or strings. Our approach is based on signature features which can be seen as an ordered variant of sample (cross-)moments; it allows to obtain a "sequentialized" version of any static kernel. The sequential kernels are efficiently computable for discrete sequences and are shown to approximate a continuous moment form in a sampling sense. A number of known kernels for sequences arise as "sequentializations" of suitable static kernels: string kernels may be obtained as a special case, and alignment kernels are closely related up to a modification that resolves their open non-definiteness issue. Our experiments indicate that our signature-based sequential kernel framework may be a promising approach to learning with sequential data, such as time series, that allows to avoid extensive manual pre-processing.

...read moreread less

77 citations

Journal Article•DOI•

NetGO: improving large-scale protein function prediction with massive network information

[...]

Ronghui You¹, Ronghui You², Shuwei Yao¹, Shuwei Yao², Yi Xiong³, Xiaodi Huang⁴, Fengzhu Sun¹, Fengzhu Sun², Fengzhu Sun⁵, Hiroshi Mamitsuka⁶, Hiroshi Mamitsuka⁷, Shanfeng Zhu², Shanfeng Zhu¹ - Show less +9 more•Institutions (7)

Fudan University¹, Chinese Ministry of Education², Shanghai Jiao Tong University³, Charles Sturt University⁴, University of Southern California⁵, Aalto University⁶, Kyoto University⁷

02 Jul 2019-Nucleic Acids Research

TL;DR: NetGO is proposed, a web server that is able to further improve the performance of the large-scale AFP by incorporating massive protein-protein network information and significantly outperforms GOLabeler and other competing methods.

...read moreread less

Abstract: Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a problem of the large-scale multi-label classification where a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler-a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of the large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO are threefold in using network information: (i) NetGO relies on a powerful learning to rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (other than only some specific species) and (iii) NetGO still can use network information to annotate a protein by homology transfer, even if it is not contained in STRING. Separating training and testing data with the same time-delayed settings of CAFA, we comprehensively examined the performance of NetGO. Experimental results have clearly demonstrated that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.

...read moreread less

73 citations

Journal Article•DOI•

Robust local and string stability for a decentralized car following control strategy for connected automated vehicles

[...]

Yang Zhou¹, Soyoung Ahn¹•Institutions (1)

University of Wisconsin-Madison¹

01 Jul 2019-Transportation Research Part B-methodological

TL;DR: In this paper, a robust car-following control strategy under uncertainty for connected and automated vehicles (CAVs) is presented, which is designed as a decentralized linear feedback and feed-forward controller with a focus on robust local and string stability under (i) time-varying uncertain vehicle dynamics and (ii) timevanging uncertain communication delay.

...read moreread less

Abstract: This paper presents a robust car-following control strategy under uncertainty for connected and automated vehicles (CAVs). The proposed control is designed as a decentralized linear feedback and feedforward controller with a focus on robust local and string stability under (i) time-varying uncertain vehicle dynamics and (ii) time-varying uncertain communication delay. The former uncertainty is incorporated into the general longitudinal vehicle dynamics (GLVD) equation that regulates the difference between the desired acceleration (prescribed by the control model) and the actual acceleration by compensating for nonlinear vehicle dynamics (e.g., due to aerodynamic drag force). The latter uncertainty is incorporated into acceleration information received from the vehicle immediately ahead. As a primary contribution, this study derives and proves (i) a sufficient and necessary condition for local stability and (ii) sufficient conditions for robust string stability in the frequency domain using the Laplacian transformation. Simulation experiments verify the correctness of the mathematical proofs and demonstrate that the proposed control is effective for ensuring stability against uncertainties.

...read moreread less

70 citations

Journal Article•DOI•

String Stable Model Predictive Cooperative Adaptive Cruise Control for Heterogeneous Platoons

[...]

Ellen van Nunen, Joey Reinders¹, Elham Semsar-Kazerooni, Nathan van de Wouw¹•Institutions (1)

Eindhoven University of Technology¹

20 Mar 2019-IEEE Transactions on Intelligent Transportation Systems

TL;DR: A model predictive control approach in combination with a feed-forward control design, which is based on a shared vector of predicted accelerations over a finite time horizon, is shown to be applicable to a heterogeneous sequence of vehicles, while the vehicle parameters remain confidential.

...read moreread less

Abstract: Cooperative adaptive cruise control (CACC) is a potential solution to decrease traffic jams caused by shock waves, increase the road capacity, decrease fuel consumption and improve safety. This paper proposes an integrated solution to a combination of four challenges in these CACC systems. One of the technological challenges is how to guarantee string stability (the ability to avoid amplification of dynamic vehicle responses along the string of vehicles) under nominal operational conditions. The second challenge is how to apply this solution to heterogeneous vehicles. The third challenge is how to maintain confidentiality of the vehicle parameters. Finally, the fourth challenge is to find a method which improves robustness against wireless packet loss. This paper proposes a model predictive control approach in combination with a feed-forward control design, which is based on a shared vector of predicted accelerations over a finite time horizon. This approach is shown to be applicable to a heterogeneous sequence of vehicles, while the vehicle parameters remain confidential. In previous works such an approach has shown to increase robustness against packet losses. Conditions for string stability are presented for the nominal operational conditions. Experimental results are presented and indeed demonstrate string stable behavior.

...read moreread less

Posted Content•

SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry.

[...]

Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, Alán Aspuru-Guzik - Show less +1 more

31 May 2019

TL;DR: This paper presents a meta-anatomical simulation of the response of the immune system to the presence of Tau, a type of “ghost” animal that is able to communicate with the body through its nervous system.

...read moreread less

Abstract: The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally matter engineering -- generally denoted as inverse design -- was based massively on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard strings molecular representation SMILES shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100\% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without the adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal working of the generative models.

...read moreread less

Journal Article•DOI•

Bit-parallel sequence-to-graph alignment

[...]

Mikko Rautiainen¹, Mikko Rautiainen², Veli Mäkinen³, Tobias Marschall², Tobias Marschall¹ - Show less +1 more•Institutions (3)

Max Planck Society¹, Saarland University², University of Helsinki³

01 Oct 2019-Bioinformatics

TL;DR: This work generalizes two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers’ bitvector algorithm for semi-global alignment, and applies it to five different types of graphs and observes a speedup between 3-fold and 20-fold compared with a previous alignment algorithm.

...read moreread less

Abstract: Motivation Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers' bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V|+⌈mw⌉|E| log w) for acyclic graphs and O(|V|+m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm. Availability and implementation https://github.com/maickrau/GraphAligner. Supplementary information Supplementary data are available at Bioinformatics online.

...read moreread less

Journal Article•DOI•

Decision procedures for path feasibility of string-manipulating programs with complex operations

[...]

Taolue Chen¹, Matthew Hague², Anthony W. Lin³, Philipp Rümmer⁴, Zhilin Wu⁵ - Show less +1 more•Institutions (5)

Birkbeck, University of London¹, Royal Holloway, University of London², University of Oxford³, Uppsala University⁴, Chinese Academy of Sciences⁵

02 Jan 2019

TL;DR: The OSTRICH solver as mentioned in this paper provides a decidable decision procedure for checking path feasibility in string-manipulating programs, which can be used to detect XSS vulnerabilities in web applications.

...read moreread less

Abstract: The design and implementation of decision procedures for checking path feasibility in string-manipulating programs is an important problem, with such applications as symbolic execution of programs with strings and automated detection of cross-site scripting (XSS) vulnerabilities in web applications. A (symbolic) path is given as a finite sequence of assignments and assertions (i.e. without loops), and checking its feasibility amounts to determining the existence of inputs that yield a successful execution. Modern programming languages (e.g. JavaScript, PHP, and Python) support many complex string operations, and strings are also often implicitly modified during a computation in some intricate fashion (e.g. by some autoescaping mechanisms). In this paper we provide two general semantic conditions which together ensure the decidability of path feasibility: (1) each assertion admits regular monadic decomposition (i.e. is an effectively recognisable relation), and (2) each assignment uses a (possibly nondeterministic) function whose inverse relation preserves regularity. We show that the semantic conditions are expressive since they are satisfied by a multitude of string operations including concatenation, one-way and two-way finite-state transducers, replaceall functions (where the replacement string could contain variables), string-reverse functions, regular-expression matching, and some (restricted) forms of letter-counting/length functions. The semantic conditions also strictly subsume existing decidable string theories (e.g. straight-line fragments, and acyclic logics), and most existing benchmarks (e.g. most of Kaluza’s, and all of SLOG’s, Stranger’s, and SLOTH’s benchmarks). Our semantic conditions also yield a conceptually simple decision procedure, as well as an extensible architecture of a string solver in that a user may easily incorporate his/her own string functions into the solver by simply providing code for the pre-image computation without worrying about other parts of the solver. Despite these, the semantic conditions are unfortunately too general to provide a fast and complete decision procedure. We provide strong theoretical evidence for this in the form of complexity results. To rectify this problem, we propose two solutions. Our main solution is to allow only partial string functions (i.e., prohibit nondeterminism) in condition (2). This restriction is satisfied in many cases in practice, and yields decision procedures that are effective in both theory and practice. Whenever nondeterministic functions are still needed (e.g. the string function split), our second solution is to provide a syntactic fragment that provides a support of nondeterministic functions, and operations like one-way transducers, replaceall (with constant replacement string), the string-reverse function, concatenation, and regular-expression matching. We show that this fragment can be reduced to an existing solver SLOTH that exploits fast model checking algorithms like IC3. We provide an efficient implementation of our decision procedure (assuming our first solution above, i.e., deterministic partial string functions) in a new string solver OSTRICH. Our implementation provides built-in support for concatenation, reverse, functional transducers (FFT), and replaceall and provides a framework for extensibility to support further string functions. We demonstrate the efficacy of our new solver against other competitive solvers.

...read moreread less

Journal Article•DOI•

String correction using the Damerau-Levenshtein distance

[...]

Chunchun Zhao¹, Sartaj Sahni¹•Institutions (1)

University of Florida¹

06 Jun 2019-BMC Bioinformatics

TL;DR: In this paper, the authors developed space and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance.

...read moreread less

Abstract: In the string correction problem, we are to transform one string into another using a set of prescribed edit operations. In string correction using the Damerau-Levenshtein (DL) distance, the permissible edit operations are: substitution, insertion, deletion and transposition. Several algorithms for string correction using the DL distance have been proposed. The fastest and most space efficient of these algorithms is due to Lowrance and Wagner. It computes the DL distance between strings of length m and n, respectively, in O(mn) time and O(mn) space. In this paper, we focus on the development of algorithms whose asymptotic space complexity is less and whose actual runtime and energy consumption are less than those of the algorithm of Lowrance and Wagner. We develop space- and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance. Our algorithms require O(s min{m,n}+m+n) space, where s is the size of the alphabet and m and n are, respectively, the lengths of the two strings. Previously known algorithms require O(mn) space. The space- and cache-efficient algorithms of this paper are demonstrated, experimentally, to be superior to earlier algorithms for the DL distance problem on time, space, and enery metrics using three different computational platforms. Our benchmarking shows that, our algorithms are able to handle much larger sequences than earlier algorithms due to the reduction in space requirements. On a single core, we are able to compute the DL distance and an optimal edit sequence faster than known algorithms by as much as 73.1% and 63.5%, respectively. Further, we reduce energy consumption by as much as 68.5%. Multicore versions of our algorithms achieve a speedup of 23.2 on 24 cores.

...read moreread less

Journal Article•DOI•

On L∞ string stability of nonlinear bidirectional asymmetric heterogeneous platoon systems

[...]

Julien Monteil¹, Giovanni Russo², Robert Shorten²•Institutions (2)

IBM¹, University College Dublin²

01 Jul 2019-Automatica

TL;DR: The effectiveness of the approach is illustrated via a numerical example, where it is shown how the result can be recast as an optimization problem, allowing to design the control protocol for each vehicle independently on the other vehicles and hence leading to a bottom-up approach for the design of string stable systems able to track a time-varying reference speed.

...read moreread less

Proceedings Article•DOI•

On the Complexity of String Matching for Graphs

[...]

Massimo Equi¹, Roberto Grossi², Veli Mäkinen³, Alexandru I. Tomescu¹•Institutions (3)

University of Helsinki¹, University of Pisa², Helsinki Institute for Information Technology³

01 Jan 2019

TL;DR: A conditional lower bound is proved stating that, for any constant > 0, an O(|E|1− m)-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false.

...read moreread less

Abstract: Exact string matching in labeled graphs is the problem of searching paths of a graph G=(V,E) such that the concatenation of their node labels is equal to the given pattern string P[1..m]. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks. We prove a conditional lower bound stating that, for any constant epsilon>0, an O(|E|^{1 - epsilon} m)-time, or an O(|E| m^{1 - epsilon})-time algorithm for exact string matching in graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. This holds even if restricted to undirected graphs with maximum node degree two, i.e. to zig-zag matching in bidirectional strings, or to deterministic directed acyclic graphs whose nodes have maximum sum of indegree and outdegree three. These restricted cases make the lower bound stricter than what can be directly derived from related bounds on regular expression matching (Backurs and Indyk, FOCS'16). In fact, our bounds are tight in the sense that lowering the degree or the alphabet size yields linear-time solvable problems. An interesting corollary is that exact and approximate matching are equally hard (quadratic time) in graphs under SETH. In comparison, the same problems restricted to strings have linear-time vs quadratic-time solutions, respectively (approximate pattern matching having also a matching SETH lower bound (Backurs and Indyk, STOC'15)).

...read moreread less

Journal Article•DOI•

Encoding high-cardinality string categorical variables

[...]

Patricio Cerda¹, Gaël Varoquaux¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

03 Jul 2019-arXiv: Learning

TL;DR: In this article, a Gamma-Poisson matrix factorization on substring counts and a min-hash encoder are proposed for low-dimensional encoding of high-cardinality string categorical variables.

...read moreread less

Abstract: Statistical models usually require vector representations of categorical variables, using for instance one-hot encoding. This strategy breaks down when the number of categories grows, as it creates high-dimensional feature vectors. Additionally, for string entries, one-hot encoding does not capture information in their representation.Here, we seek low-dimensional encoding of high-cardinality string categorical variables. Ideally, these should be: scalable to many categories; interpretable to end users; and facilitate statistical analysis. We introduce two encoding approaches for string categories: a Gamma-Poisson matrix factorization on substring counts, and the min-hash encoder, for fast approximation of string similarities. We show that min-hash turns set inclusions into inequality relations that are easier to learn. Both approaches are scalable and streamable. Experiments on real and simulated data show that these methods improve supervised learning with high-cardinality categorical variables. We recommend the following: if scalability is central, the min-hash encoder is the best option as it does not require any data fit; if interpretability is important, the Gamma-Poisson factorization is the best alternative, as it can be interpreted as one-hot encoding on inferred categories with informative feature names. Both models enable autoML on the original string entries as they remove the need for feature engineering or data cleaning.

...read moreread less

Posted Content•

Model based string stability of adaptive cruise control systems using field data

[...]

George Gunter¹, Caroline Janssen¹, William Barbour¹, Raphael Stern², Daniel B. Work¹ - Show less +1 more•Institutions (2)

Vanderbilt University¹, University of Minnesota²

13 Feb 2019-arXiv: Systems and Control

TL;DR: The consequences of the string unstable ACC system on synthetic and empirical lead vehicle disturbances are identified, highlighting that commercial ACC platoons of moderate size can dampen some disturbances even while being string unstable.

...read moreread less

Abstract: This article is motivated by the lack of empirical data on the performance of commercially available Society of Automotive Engineers level one automated driving systems. To address this, a set of car following experiments are conducted to collect data from a 2015 luxury electric vehicle equipped with a commercial adaptive cruise control (ACC) system. Velocity, relative velocity, and spacing data collected during the experiments are used to calibrate an optimal velocity relative velocity car following model for both the minimum and maximum following settings. The string stability of both calibrated models is assessed, and it is determined that the best-fit models are string unstable, indicating they are not able to prevent all traffic disturbances from amplifying into phantom jams. Based on the calibrated models, we identify the consequences of the string unstable ACC system on synthetic and empirical lead vehicle disturbances, highlighting that some disturbances can be dampened even with string unstable commercial ACC platoons of moderate size.

...read moreread less

Proceedings Article•DOI•

String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure

[...]

Dominik Kempa¹, Tomasz Kociumaka²•Institutions (2)

University of Warwick¹, University of Warsaw²

23 Jun 2019

TL;DR: This paper proposes the first algorithm that breaks the O(n)-time barrier for BWT construction, based on a novel concept of string synchronizing sets, which is of independent interest and shows that this technique lets us design a data structure of the optimal size O(N/logn) that answers Longest Common Extension queries (LCE queries) in O(1) time and, furthermore, can be deterministically constructed in the optimal O( n/ logn) time.

...read moreread less

Abstract: Burrows–Wheeler transform (BWT) is an invertible text transformation that, given a text T of length n, permutes its symbols according to the lexicographic order of suffixes of T. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length n, occupying O(n/logn) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n) time and O(n/logn) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n) time. Despite the clearly suboptimal running time, the existing techniques appear to have reached their limits. In this paper, we propose the first algorithm that breaks the O(n)-time barrier for BWT construction. Given a binary string of length n, our procedure builds the Burrows–Wheeler transform in O(n/√logn) time and O(n/logn) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(m√logm)-time solution by Chan and Patrascu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n/logn) that answers Longest Common Extension queries (LCE queries) in O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/logn) time.

...read moreread less

Journal Article•DOI•

Framework for syntactic string similarity measures

[...]

Najlah Gali¹, Radu Mariescu-Istodor¹, Damien Hostettler¹, Pasi Fränti¹•Institutions (1)

University of Eastern Finland¹

01 Sep 2019-Expert Systems With Applications

TL;DR: This paper introduces a general framework of syntactic similarity measures for matching short text by dividing them into three components: character-level similarity, string segmentation, and matching technique, and provides an open-source Java toolkit of the proposed framework.

...read moreread less

Abstract: Similarity measure is an essential component of information retrieval, document clustering, text summarization, and question answering, among others. In this paper, we introduce a general framework of syntactic similarity measures for matching short text. We thoroughly analyze the measures by dividing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit of the proposed framework, which integrates the individual components together so that completely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of the dataset. For well-maintained dataset, using a token-level measure is important but the basic (crisp) variant is usually enough. For uncontrolled dataset where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching and q-grams at the character level perform best. A gap between human perception and syntactic measures still remains due to lacking semantic analysis.

...read moreread less

Book Chapter•DOI•

Quisquis: A New Design for Anonymous Cryptocurrencies

[...]

Prastudy Fauzi, Sarah Meiklejohn¹, Rebekah Mercer, Claudio Orlandi²•Institutions (2)

University College London¹, Aarhus University²

08 Dec 2019

TL;DR: In this paper, the authors proposed the creation of privacyenhanced cryptocurrencies such as Monero and Zcash, which are specifically designed to counteract the tracking analysis possible in currencies like Bitcoin.

...read moreread less

Abstract: Despite their usage of pseudonyms rather than persistent identifiers, most existing cryptocurrencies do not provide users with any meaningful levels of privacy. This has prompted the creation of privacy-enhanced cryptocurrencies such as Monero and Zcash, which are specifically designed to counteract the tracking analysis possible in currencies like Bitcoin. These cryptocurrencies, however, also suffer from some drawbacks: in both Monero and Zcash, the set of potential unspent coins is always growing, which means users cannot store a concise representation of the blockchain. Additionally, Zcash requires a common reference string and the fact that addresses are reused multiple times in Monero has led to attacks to its anonymity.

...read moreread less

Journal Article•DOI•

Universal compressed text indexing

[...]

Gonzalo Navarro¹, Nicola Prezza²•Institutions (2)

University of Chile¹, University of Pisa²

01 Mar 2019-Theoretical Computer Science

TL;DR: In this article, the authors proposed a universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation.

...read moreread less

Proceedings Article•DOI•

Balancing Straight-Line Programs

[...]

Moses Ganardi¹, Artur Jeż², Markus Lohrey¹•Institutions (2)

Folkwang University of the Arts¹, University of Wrocław²

01 Nov 2019

TL;DR: This article showed that a context-free grammar of size m that produces a single string w of length n (such a grammar is also called a string straight-line program) can be transformed in linear time into a context free grammar for w of size O(m), whose unique derivation tree has depth O(log n).

...read moreread less

Abstract: We show that a context-free grammar of size m that produces a single string w of length n (such a grammar is also called a string straight-line program) can be transformed in linear time into a context-free grammar for w of size O(m), whose unique derivation tree has depth O(log n). This solves an open problem in the area of grammar-based compression, improves many results in this area and greatly simplifies many existing constructions. Similar results are stated for two formalisms for grammar-based tree compression: top dags and forest straight-line programs. These balancing results can be all deduced from a single meta theorem stating that the depth of an algebraic circuit over an algebra with a certain finite base property can be reduced to O(log n) with the cost of a constant multiplicative size increase. Here, n refers to the size of the unfolding (or unravelling) of the circuit. In particular, this results applies to standard arithmetic circuits over (non-commutative) semirings.A long version of the paper can be found in [1].

...read moreread less

Robust Local and String Stability for a Decentralized Car Following Control Strategy for Connected Automated Vehicles

[...]

Yang Zhou¹, Soyoung Ahn¹•Institutions (1)

University of Wisconsin-Madison¹

01 Jan 2019

TL;DR: This study derives and proves a sufficient and necessary condition for local stability and sufficient conditions for robust string stability in the frequency domain using the Laplacian transformation.

...read moreread less

Abstract: This paper presents a robust car-following control strategy under uncertainty for connected and automated vehicles (CAVs) The proposed control is designed as a decentralized linear feedback and feedforward controller with a focus on robust local and string stability under (i) time-varying uncertain vehicle dynamics and (ii) time-varying uncertain communication delay The former uncertainty is incorporated into the general longitudinal vehicle dynamics (GLVD) equation that regulates the difference between the desired acceleration (prescribed by the control model) and the actual acceleration by compensating for nonlinear vehicle dynamics (eg, due to aerodynamic drag force) The latter uncertainty is incorporated into acceleration information received from the vehicle immediately ahead As a primary contribution, this study derives and proves (i) a sufficient and necessary condition for local stability and (ii) sufficient conditions for robust string stability in the frequency domain using the Laplacian transformation Simulation experiments verify the correctness of the mathematical proofs and demonstrate that the proposed control is effective for ensuring stability against uncertainties

...read moreread less

Proceedings Article•DOI•

Near-linear time insertion-deletion codes and (1+ε)-approximating edit distance via indexing

[...]

Bernhard Haeupler¹, Aviad Rubinstein², Amirbehshad Shahrasbi¹•Institutions (2)

Carnegie Mellon University¹, Stanford University²

23 Jun 2019

TL;DR: In this article, fast-decodable indexing schemes for edit distance were introduced, which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I.

...read moreread less

Abstract: We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I. In particular, for every length n and every e >0, one can in near linear time construct a string I ∈ Σ′n with |Σ′| = Oe(1), such that, indexing any string S ∈ Σn, symbol-by-symbol, with I results in a string S′ ∈ Σ″n where Σ″ = Σ × Σ′ for which edit distance computations are easy, i.e., one can compute a (1+e)-approximation of the edit distance between S′ and any other string in O(n (logn)) time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC ‘17] and faster decoding algorithms for list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP ‘18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.

...read moreread less

Posted Content•

Data structures to represent a set of k-long DNA sequences

[...]

Rayan Chikhi, Jan Holub¹, Paul Medvedev•Institutions (1)

Czech Technical University in Prague¹

29 Mar 2019-arXiv: Data Structures and Algorithms

TL;DR: This survey gives a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set and hopes it will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.

...read moreread less

Abstract: The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the last ten years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.

...read moreread less

Book Chapter•DOI•

Causal Inference by String Diagram Surgery

[...]

Bart Jacobs¹, Aleks Kissinger¹, Fabio Zanasi²•Institutions (2)

Radboud University Nijmegen¹, University College London²

08 Apr 2019

TL;DR: This article developed a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors.

...read moreread less

Abstract: Extracting causal relationships from observed correlations is a growing area in probabilistic reasoning, originating with the seminal work of Pearl and others from the early 1990s. This paper develops a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors.

...read moreread less

Proceedings Article•DOI•

Massively parallel approximation algorithms for edit distance and longest common subsequence

[...]

MohammadTaghi Hajiaghayi¹, Saeed Seddighin, Xiaorui Sun²•Institutions (2)

University of Maryland, College Park¹, University of Illinois at Chicago²

06 Jan 2019

TL;DR: In this paper, the authors presented a massively parallel algorithm for edit distance and longest common subsequence in the parallel setting, which achieves an approximation factor of 1 + ϵ and round complexity of O(n 2 ).

...read moreread less

Abstract: String similarity measures are among the most fundamental problems in computer science. The notable examples are edit distance (ED) and longest common subsequence (LCS). These problems find their applications in various contexts such as computational biology, text processing, compiler optimization, data analysis, image analysis, etc. In this work, we revisit edit distance and longest common subsequence in the parallel settings. We present massively parallel algorithms for both problems that are optimal in the following senses:• The approximation factor of our algorithms is 1 + ϵ.• The round complexity of our algorithms is constant.• The total running time of our algorithms over all machines is O(n2). This matches the running time of the best-known solutions for approximating edit distance and longest common subsequence within a 1+ϵ factor in the sequential setting.Our result for edit distance substantially improves the massively parallel algorithm of [15] in terms of approximation factor, round complexity, number of machines, and total running time. Our unified approach to tackle both problems is to divide one of the strings into smaller blocks and try to locally predict which intervals of the other string correspond to each block in an optimal solution.Our main technical contribution is a novel parallel algorithm for computing a set of compositions, and recursively decomposing each function into a set of smaller iterative compositions (in terms of memory needed to solve the problem). These two methods together give us a strong tool for approximating combinatorial problems. For instance, LCS can be formulated as a recursive composition of functions and therefore this tool enables us to approximate LCS within a factor 1 + ϵ. Indeed, we recursively decompose the problem until we are able to compute the solution on a single machine. Since our methods are quite general, we expect this technique to find its applications in other combinatorial problems as well.

...read moreread less

Proceedings Article•

Random deep neural networks are biased towards simple functions

[...]

Giacomo De Palma¹, Bobak Toussi Kiani¹, Seth Lloyd¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Jan 2019

TL;DR: In this paper, it was shown that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions and that the average Hamming distance of the closest input bit string with a different classification is at least sqrt(n / (2π log n), where n is the length of the string.

...read moreread less

Abstract: We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least sqrt(n / (2π log n)), where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle Perez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms.

...read moreread less

Collapse