
Showing papers on "String (computer science) published in 2021"


Journal ArticleDOI
TL;DR: Changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks are described.
Abstract: Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.

3,253 citations


Posted Content
TL;DR: The authors survey and organize research works in a new paradigm in natural language processing, which they dub "prompt-based learning", and describe a unified set of mathematical notations that can cover a wide variety of existing work.
Abstract: This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning". Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks, the original input x is modified using a template into a textual string prompt x' that has some unfilled slots, and then the language model is used to probabilistically fill the unfilled information to obtain a final string x̂, from which the final output y can be derived. This framework is powerful and attractive for a number of reasons: it allows the language model to be pre-trained on massive amounts of raw text, and by defining a new prompting function the model is able to perform few-shot or even zero-shot learning, adapting to new scenarios with few or no labeled data. In this paper we introduce the basics of this promising paradigm, describe a unified set of mathematical notations that can cover a wide variety of existing work, and organize existing work along several dimensions, e.g., the choice of pre-trained models, prompts, and tuning strategies. To make the field more accessible to interested beginners, we not only make a systematic review of existing works and a highly structured typology of prompt-based concepts, but also release other resources, e.g., a website (this http URL) including a constantly-updated survey and paperlist.
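The template-then-fill workflow described above can be sketched in a few lines. The template string, the `[MASK]` placeholder, the verbalizer mapping, and the hard-coded top filler below are all illustrative assumptions, not the survey's notation or any real model API:

```python
# Minimal sketch of the prompt-based learning workflow described above.
# The template, slot names, verbalizer, and hard-coded filler are
# illustrative assumptions, not the paper's notation or a real model API.

def make_prompt(x: str, template: str) -> str:
    """Wrap the raw input x into a textual prompt x' with an unfilled slot."""
    return template.format(x=x, z="[MASK]")

def derive_output(filled: str, verbalizer: dict) -> str:
    """Map the token the language model filled in back to a label y."""
    return verbalizer.get(filled, "unknown")

template = "Review: {x} Overall, it was a {z} movie."
prompt = make_prompt("The acting was superb.", template)
# A language model would now score candidate fillers for [MASK];
# here we simply hard-code the top-scoring filler for illustration.
top_filler = "great"
label = derive_output(top_filler, {"great": "positive", "terrible": "negative"})
```

Zero-shot behaviour then amounts to swapping in a new template and verbalizer without retraining.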

93 citations


Journal ArticleDOI
TL;DR: This paper formulates causality extraction as a sequence labeling problem based on a novel causality tagging scheme, and proposes a neural causality extractor with the BiLSTM-CRF model as the backbone, named SCITE (Self-attentive BiLSTM-CRF wIth Transferred Embeddings), which can directly extract cause and effect without extracting candidate causal pairs and identifying their relations separately.

80 citations


Journal ArticleDOI
TL;DR: In this paper, an analytical sufficient condition for the L ∞ string stability of heterogeneous vehicles, which move according to a general class of car-following models, is derived.
Abstract: This paper shows that the heterogeneity of drivers’ and vehicles’ characteristics makes platoons, on average, more string-unstable. However, the string instability degree of unstable platoons is much higher in a homogeneous flow than in a heterogeneous one. These results are based on an L∞ characterization of string stability, which is shown to be the most appropriate one from a traffic safety viewpoint. Mechanisms and conditions are discussed in which an L2 characterization is not able to capture the amplification of a speed drop through a string of vehicles. An analytical sufficient condition for the L∞ string stability of heterogeneous vehicles, which move according to a general class of car-following models, is derived. Above all, a thorough comparison of the L∞ and L2 string stability characterizations between a homogeneous and a heterogeneous flow is performed. To this aim, the Lp norms of heterogeneous platoons are calculated within a quasi-Monte Carlo framework. The variability of the Lp norm values due to the platoon length, the equilibrium speed, and the probability distribution model of the uncertain vehicle parameters is analysed. Overall, it is shown that the platoon stability behaviour changes considerably with the shape and the correlation structure of vehicle model parameter distributions. Therefore, traffic heterogeneity needs to be modelled in order to correctly characterize the string stability of a mixed traffic flow.
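The safety argument for the L∞ characterization can be illustrated with a toy computation: a short, large spike in a follower's speed error (the dangerous case) can have a smaller 2-norm than the leader's sustained error while having a much larger peak. The signals below are invented for illustration and are unrelated to the paper's car-following models:

```python
# Toy illustration of L∞ vs L2 string stability checks on invented,
# discrete-time speed-error signals (not the paper's models or data).
import math

def l2_norm(signal, dt=1.0):
    return math.sqrt(sum(v * v for v in signal) * dt)

def linf_norm(signal):
    return max(abs(v) for v in signal)

leader = [1.0] * 16                  # sustained small error: L2 = 4.0, L∞ = 1.0
follower = [0.0] * 14 + [2.5, 2.5]   # brief large spike: L2 ≈ 3.54, L∞ = 2.5

# The L2 criterion sees no amplification, yet the peak error grew 2.5x.
l2_amplified = l2_norm(follower) > l2_norm(leader)
linf_amplified = linf_norm(follower) > linf_norm(leader)
```

Here the L2 test would declare the pair string stable while the L∞ test flags the safety-relevant spike, which is the discrepancy the paper analyses.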

42 citations


Book ChapterDOI
Bob Coecke1
TL;DR: Both the compositional formalism and suggested meaning model are highly quantum-inspired, and implementation on a quantum computer would come with a range of benefits.
Abstract: In previous work we gave a mathematical foundation, referred to as DisCoCat, for how words interact in a sentence in order to produce the meaning of that sentence. To do so, we exploited the perfect structural match of grammar and categories of meaning spaces. Here, we give a mathematical foundation, referred to as DisCoCirc, for how sentences interact in texts in order to produce the meaning of that text. First we revisit DisCoCat. While in DisCoCat all meanings are fixed as states (i.e. have no input), in DisCoCirc word meanings correspond to a type, or system, and the states of this system can evolve. Sentences are gates within a circuit which update the variable meanings of those words. Like in DisCoCat, word meanings can live in a variety of spaces e.g. propositional, vectorial, or cognitive. The compositional structures are string diagrams representing information flows, and an entire text yields a single string diagram in which word meanings lift to the meaning of the entire text. While the developments in this paper are independent of a physical embodiment (cf. classical vs. quantum computing), both the compositional formalism and suggested meaning model are highly quantum-inspired, and implementation on a quantum computer would come with a range of benefits. We also praise Jim Lambek for his role in mathematical linguistics in general, and the development of the DisCo program more specifically.

33 citations


Journal ArticleDOI
TL;DR: It is found that relational graph convolutional networks and gradient-boosting machines are very effective for this learning task, and a novel reaction-level graph-attention operation is disclosed in the top-performing model.
Abstract: Machine-learned ranking models have been developed for the prediction of substrate-specific cross-coupling reaction conditions. Data sets of published reactions were curated for Suzuki, Negishi, and C-N couplings, as well as Pauson-Khand reactions. String, descriptor, and graph encodings were tested as input representations, and models were trained to predict the set of conditions used in a reaction as a binary vector. Unique reagent dictionaries categorized by expert-crafted reaction roles were constructed for each data set, leading to context-aware predictions. We find that relational graph convolutional networks and gradient-boosting machines are very effective for this learning task, and we disclose a novel reaction-level graph attention operation in the top-performing model.

28 citations


Journal ArticleDOI
TL;DR: It is shown how the size of the smallest string attractor of a word varies when combinatorial operations are applied and it is deduced that such a measure is not monotone.

26 citations


Journal ArticleDOI
08 Feb 2021
TL;DR: DisCoPy, an open-source toolbox for computing with monoidal categories, provides an intuitive syntax for defining string diagrams and monoidal functors, and its modularity allows the efficient implementation of computational experiments in the various applications of category theory.
Abstract: We introduce DisCoPy, an open source toolbox for computing with monoidal categories. The library provides an intuitive syntax for defining string diagrams and monoidal functors. Its modularity allows the efficient implementation of computational experiments in the various applications of category theory where diagrams have become a lingua franca. As an example, we used DisCoPy to perform natural language processing on quantum hardware for the first time.

20 citations


Journal ArticleDOI
TL;DR: This work proves a lower bound on the size of the optimal SPSS and proposes a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to the lower bound.
Abstract: Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10–44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
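The SPSS idea can be sketched with a toy greedy stitcher: merge k-mers that overlap by k-1 characters into longer strings, so that the k-mer spectrum of the output equals the input set. This mimics the spirit of the greedy UST method, not its actual implementation:

```python
# Toy sketch of a spectrum-preserving string set (SPSS): greedily stitch
# k-mers that overlap by k-1 characters into longer strings. This mirrors
# the spirit of the paper's greedy UST method, not its implementation.

def greedy_spss(kmers):
    k = len(next(iter(kmers)))
    remaining = set(kmers)       # copy; each k-mer is used exactly once
    strings = []
    while remaining:
        s = remaining.pop()
        extended = True
        while extended:          # extend to the right by one character
            extended = False
            for c in "ACGT":
                nxt = s[-(k - 1):] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        extended = True
        while extended:          # extend to the left by one character
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in remaining:
                    remaining.remove(prv)
                    s = c + s
                    extended = True
                    break
        strings.append(s)
    return strings

kmers = {"ACG", "CGT", "GTT"}    # the 3-mer spectrum of "ACGTT"
spss = greedy_spss(kmers)
```

The three 3-mers collapse into the single string "ACGTT", whose 3-mer spectrum is exactly the input set.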

19 citations


Proceedings ArticleDOI
19 Jun 2021
TL;DR: WebQA employs a neurosymbolic DSL that incorporates both neural NLP models and standard language constructs for tree navigation and string manipulation, and uses transductive learning to select programs with good generalization power.
Abstract: In this paper, we propose a new technique based on program synthesis for extracting information from webpages. Given a natural language query and a few labeled webpages, our method synthesizes a program that can be used to extract similar types of information from other unlabeled webpages. To handle websites with diverse structure, our approach employs a neurosymbolic DSL that incorporates both neural NLP models as well as standard language constructs for tree navigation and string manipulation. We also propose an optimal synthesis algorithm that generates all DSL programs that achieve optimal F1 score on the training examples. Our synthesis technique is compositional, prunes the search space by exploiting a monotonicity property of the DSL, and uses transductive learning to select programs with good generalization power. We have implemented these ideas in a new tool called WebQA and evaluate it on 25 different tasks across multiple domains. Our experiments show that WebQA significantly outperforms existing tools such as state-of-the-art question answering models and wrapper induction systems.

18 citations


Journal ArticleDOI
TL;DR: In this article, a complete description of a basis of the extension spaces between indecomposable string and quasi-simple band modules in the module category of a gentle algebra is given.
Abstract: We give a complete description of a basis of the extension spaces between indecomposable string and quasi-simple band modules in the module category of a gentle algebra.

Journal ArticleDOI
15 Oct 2021
TL;DR: In this paper, the authors propose a framework for integrating inductive synthesis with few-shot learning language models to combine the strength of these two popular technologies, and demonstrate the generality of their approach via a case study in the domain of string profiling.
Abstract: The ability to learn programs from few examples is a powerful technology with disruptive applications in many domains, as it allows users to automate repetitive tasks in an intuitive way. Existing frameworks on inductive synthesis only perform syntactic manipulations, where they rely on the syntactic structure of the given examples and not their meaning. Any semantic manipulations, such as transforming dates, have to be manually encoded by the designer of the inductive programming framework. Recent advances in large language models have shown these models to be very adept at performing semantic transformations of their input by simply providing a few examples of the task at hand. When it comes to syntactic transformations, however, these models are limited in their expressive power. In this paper, we propose a novel framework for integrating inductive synthesis with few-shot learning language models to combine the strengths of these two popular technologies. In particular, the inductive synthesis is tasked with breaking down the problem into smaller subproblems, among which those that cannot be solved syntactically are passed to the language model. We formalize three semantic operators that can be integrated with inductive synthesizers. To minimize invoking expensive semantic operators during learning, we introduce a novel deferred query execution algorithm that considers the operators to be oracles during learning. We evaluate our approach in the domain of string transformations: the combination methodology can automate tasks that cannot be handled using either technology by itself. Finally, we demonstrate the generality of our approach via a case study in the domain of string profiling.

Proceedings ArticleDOI
15 Jun 2021
TL;DR: For any distinct x, y ∈ {0,1}^n, there is a deterministic finite automaton with O(n^{1/3}) states that accepts x but not y.
Abstract: We prove that for any distinct x, y ∈ {0,1}^n, there is a deterministic finite automaton with O(n^{1/3}) states that accepts x but not y. This improves Robson’s 1989 bound of O(n^{2/5}). Using a similar complex analytic technique, we improve the upper bound on worst case trace reconstruction, showing that any unknown string x ∈ {0,1}^n can be reconstructed with high probability from exp(O(n^{1/5})) independently generated traces.
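The O(n^{1/3}) construction for equal-length strings is an involved complex-analytic argument; the classic easy case, where |x| ≠ |y|, is handled by a small DFA that just counts input length modulo a prime p = O(log n). A sketch of that easy case, with illustrative inputs:

```python
# Toy sketch of word separation by a small DFA for the easy case |x| != |y|:
# a p-state cycle automaton tracking input length mod p separates the two
# strings. The paper's O(n^{1/3}) bound concerns the much harder
# equal-length case and uses a different, analytic argument.

def small_separating_modulus(m: int, n: int) -> int:
    """Smallest p >= 2 with m mod p != n mod p (known to be O(log n))."""
    p = 2
    while m % p == n % p:
        p += 1
    return p

def accepts_by_length_mod(w: str, p: int, residue: int) -> bool:
    """Run the p-state cycle DFA that accepts iff |w| mod p == residue."""
    state = 0
    for _ in w:
        state = (state + 1) % p
    return state == residue

x, y = "0101", "010101"                       # distinct lengths 4 and 6
p = small_separating_modulus(len(x), len(y))  # p = 3 here
accept_x = accepts_by_length_mod(x, p, len(x) % p)
accept_y = accepts_by_length_mod(y, p, len(x) % p)
```

The resulting 3-state automaton accepts x but rejects y.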

Journal ArticleDOI
TL;DR: In this paper, a deep learning neural machine translation approach is proposed to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation.
Abstract: Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset, a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Even incorrect predictions show a remarkable similarity between true and predicted compounds.
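The Tanimoto index mentioned above is the Jaccard similarity of binary molecular fingerprints. A minimal sketch, with plain Python sets standing in for fingerprint bit positions (the feature sets below are invented):

```python
# Minimal sketch of the Tanimoto (Jaccard) index used to compare a
# predicted molecule with the ground truth. Real pipelines compute it on
# binary molecular fingerprints; invented feature sets stand in here.

def tanimoto(a: set, b: set) -> float:
    if not a and not b:
        return 1.0                      # convention: two empty fingerprints match
    return len(a & b) / len(a | b)

true_fp = {1, 4, 7, 9, 12}              # hypothetical set bits of the true molecule
pred_fp = {1, 4, 7, 9, 13}              # prediction differs in one bit
sim = tanimoto(true_fp, pred_fp)        # 4 shared bits out of 6 set overall
```

A score above 0.9, as reported in the abstract, means predicted and true fingerprints overlap almost completely.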

Proceedings ArticleDOI
19 Jun 2021
TL;DR: This work proposes a new theory of derivatives of symbolic extended regular expressions that both handles intersection and complement and works symbolically over an arbitrary character theory, and unifies existing approaches involving derivatives of extended regular expressions, alternating automata and Boolean automata by lifting them to a common symbolic platform.
Abstract: The manipulation of raw string data is ubiquitous in security-critical software, and verification of such software relies on efficiently solving string and regular expression constraints via SMT. However, the typical case of Boolean combinations of regular expression constraints exposes blowup in existing techniques. To address solvability of such constraints, we propose a new theory of derivatives of symbolic extended regular expressions (extended meaning that complement and intersection are incorporated), and show how to apply this theory to obtain more efficient decision procedures. Our implementation of these ideas, built on top of Z3, matches or outperforms state-of-the-art solvers on standard and handwritten benchmarks, showing particular benefits on examples with Boolean combinations. Our work is the first formalization of derivatives of regular expressions which both handles intersection and complement and works symbolically over an arbitrary character theory. It unifies existing approaches involving derivatives of extended regular expressions, alternating automata and Boolean automata by lifting them to a common symbolic platform. It relies on a parsimonious augmentation of regular expressions: a construct for symbolic conditionals is shown to be sufficient to obtain relevant closure properties for derivatives over extended regular expressions.

Posted Content
TL;DR: This work shows how an analogous correspondence may be established for arbitrary SMTs, once an appropriate notion of DPO rewriting (which the authors call convex) is identified, and uses the approach to show termination of two SMTs of interest: Frobenius semi-algebras and bialgebras.
Abstract: Symmetric monoidal theories (SMTs) generalise algebraic theories in a way that makes them suitable to express resource-sensitive systems, in which variables cannot be copied or discarded at will. In SMTs, traditional tree-like terms are replaced by string diagrams, topological entities that can be intuitively thought of as diagrams of wires and boxes. Recently, string diagrams have become increasingly popular as a graphical syntax to reason about computational models across diverse fields, including programming language semantics, circuit theory, quantum mechanics, linguistics, and control theory. In applications, it is often convenient to implement the equations appearing in SMTs as rewriting rules. This poses the challenge of extending the traditional theory of term rewriting, which has been developed for algebraic theories, to string diagrams. In this paper, we develop a mathematical theory of string diagram rewriting for SMTs. Our approach exploits the correspondence between string diagram rewriting and double pushout (DPO) rewriting of certain graphs, introduced in the first paper of this series. Such a correspondence is only sound when the SMT includes a Frobenius algebra structure. In the present work, we show how an analogous correspondence may be established for arbitrary SMTs, once an appropriate notion of DPO rewriting (which we call convex) is identified. As proof of concept, we use our approach to show termination of two SMTs of interest: Frobenius semi-algebras and bialgebras.

Journal ArticleDOI
TL;DR: In this article, the authors present a survey of the algorithmic developments that have led to these data structures, including the distinct compression paradigms that have been used to exploit repetitiveness, and algorithmic techniques that provide direct access to the compressed strings.
Abstract: Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this first part, we describe the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings. In the quest for an ideal measure of repetitiveness, we uncover a fascinating web of relations between those measures, as well as the limits up to which the data can be recovered, and up to which direct access to the compressed data can be provided. This is the basic aspect of indexability, which is covered in the second part of this survey.
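One of the repetitiveness measures such surveys relate is the number of phrases z in a greedy Lempel-Ziv parse. A toy sketch, using a simple LZ77-style variant in which each phrase is a longest previously-occurring piece plus one fresh character (the survey covers several more refined parses and measures):

```python
# Toy repetitiveness measure: count phrases z in a greedy LZ-style parse.
# Each phrase extends while its text already occurs fully inside the
# prefix scanned so far, then consumes one fresh character. This is a
# simple variant of LZ77; the survey treats several refinements.

def lz_phrase_count(s: str) -> int:
    i, z = 0, 0
    while i < len(s):
        ell = 0
        # grow the copied part while it occurs inside the earlier prefix
        while i + ell < len(s) and s[: i + ell].find(s[i : i + ell + 1]) != -1:
            ell += 1
        i += ell + 1   # phrase = copied part plus one fresh character
        z += 1
    return z

low = lz_phrase_count("abababababababab")   # highly repetitive: few phrases
high = lz_phrase_count("abcdefgh")          # no repetition: one phrase per char
```

The repetitive string collapses into 3 phrases while the non-repetitive one needs 8, illustrating why z can be orders of magnitude smaller than the plain size on repetitive collections.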

Proceedings ArticleDOI
12 Jul 2021
TL;DR: In this paper, the authors introduce approximate trace reconstruction, a relaxed version of the original trace reconstruction problem: instead of learning a binary string perfectly from noisy samples, the goal is to output a string that is close in edit distance to the original string using few traces. Several algorithms are presented that can approximately reconstruct strings belonging to certain classes, where the estimate is within n/polylog(n) edit distance and only polylog(n) traces are used.
Abstract: We introduce approximate trace reconstruction, a relaxed version of the trace reconstruction problem. Here, instead of learning a binary string perfectly from noisy samples, as in the original trace reconstruction problem, the goal is to output a string that is close in edit distance to the original string using few traces. We present several algorithms that can approximately reconstruct strings that belong to certain classes, where the estimate is within n/polylog(n) edit distance and where we only use polylog(n) traces (or sometimes just a single trace). These classes contain strings that require a linear number of traces for exact reconstruction and that are quite different from a typical random string. From a technical point of view, our algorithms approximately reconstruct consecutive substrings of the unknown string by aligning dense regions of traces and using a run of a suitable length to approximate each region. A full version of this paper is accessible at: https://arxiv.org/abs/2012.06713.pdf
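The trace model used above can be sketched directly: a trace is the output of a deletion channel that drops each bit of x independently with some probability q, so every trace is a subsequence of x. The string, deletion rate, and seed below are arbitrary illustrations:

```python
# Toy sketch of the deletion-channel trace model: each bit of the unknown
# string survives independently with probability 1 - q. The string, rate,
# and seed are arbitrary; the paper's reconstruction algorithms that
# consume such traces are far more involved.
import random

def sample_trace(x: str, q: float, rng: random.Random) -> str:
    """One trace: drop each bit of x independently with probability q."""
    return "".join(b for b in x if rng.random() >= q)

def is_subsequence(t: str, x: str) -> bool:
    """Greedy left-to-right embedding check."""
    it = iter(x)
    return all(b in it for b in t)

rng = random.Random(0)
x = "0110100110010110"
traces = [sample_trace(x, 0.2, rng) for _ in range(5)]
all_subseq = all(is_subsequence(t, x) for t in traces)
```

Exact reconstruction must identify x from such samples; the approximate version only needs an output within n/polylog(n) edit distance of x.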

Book ChapterDOI
25 Oct 2021
TL;DR: The notion of absent subsequences is introduced in this paper: a string u is an absent subsequence of a string w if u does not occur as a subsequence (a.k.a. scattered factor) inside w.
Abstract: An absent factor of a string w is a string u which does not occur as a contiguous substring (a.k.a. factor) inside w. We extend this well-studied notion and define absent subsequences: a string u is an absent subsequence of a string w if u does not occur as subsequence (a.k.a. scattered factor) inside w. Of particular interest to us are minimal absent subsequences, i.e., absent subsequences whose every subsequence is not absent, and shortest absent subsequences, i.e., absent subsequences of minimal length. We show a series of combinatorial and algorithmic results regarding these two notions. For instance: we give combinatorial characterisations of the sets of minimal and, respectively, shortest absent subsequences in a word, as well as compact representations of these sets; we show how we can test efficiently if a string is a shortest or minimal absent subsequence in a word, and we give efficient algorithms computing the lexicographically smallest absent subsequence of each kind; also, we show how a data structure for answering shortest absent subsequence-queries for the factors of a given string can be efficiently computed.
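The definitions can be made concrete with a brute-force toy: test embeddability left to right, and enumerate strings by increasing length until an absent one appears. The paper gives efficient algorithms and compact representations; this exponential enumeration is only for illustration on a tiny binary example:

```python
# Brute-force illustration of shortest absent subsequences over {0,1}.
# The paper's algorithms are efficient; this enumeration is exponential
# and only meant to make the definition concrete.
from itertools import product

def is_subsequence(u: str, w: str) -> bool:
    """u occurs as a scattered factor of w (greedy embedding)."""
    it = iter(w)
    return all(c in it for c in u)

def shortest_absent_subsequences(w: str, alphabet="01"):
    """All absent subsequences of minimal length, by exhaustive search."""
    k = 1
    while True:
        absent = ["".join(p) for p in product(alphabet, repeat=k)
                  if not is_subsequence("".join(p), w)]
        if absent:
            return absent
        k += 1

sas = shortest_absent_subsequences("0101")
```

For w = "0101" every binary string of length 2 embeds into w, so the shortest absent subsequences have length 3 (for instance "000", since w contains only two zeros).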

Journal ArticleDOI
TL;DR: It is shown that the proposed controller always guarantees the string stability of the platoon without any limitations; a constant spacing strategy is used to adjust the inter-vehicle spacing.
Abstract: This study deals with internal and string stability analyses of vehicular platoons with centralized multi-look ahead network topology in the presence of communication and parasitic delays and rando...

Proceedings ArticleDOI
06 Sep 2021
TL;DR: In this paper, the authors investigate variants of the Jump function where the gap is shifted and appears in the typical search trajectory, and derive limits on the gap size allowing efficient runtimes for the EDA.
Abstract: The benefits of using crossover in crossing fitness gaps have been studied extensively in evolutionary computation. Recent runtime results show that majority-vote crossover is particularly efficient at optimizing the well-known Jump benchmark function that includes a fitness gap next to the global optimum. Also estimation-of-distribution algorithms (EDAs), which use an implicit crossover, are much more efficient on Jump than typical mutation-based algorithms. However, the allowed gap size for polynomial runtimes with EDAs is at most logarithmic in the problem dimension n. In this paper, we investigate variants of the Jump function where the gap is shifted and appears in the middle of the typical search trajectory. Such gaps can still be overcome efficiently in time O(n log n) by majority-vote crossover and an estimation-of-distribution algorithm, even for gap sizes almost [EQUATION]. However, if the global optimum is located in the gap instead of the usual all-ones string, majority-vote crossover would nevertheless approach the all-ones string and be highly inefficient. In sharp contrast, an EDA can still find such a shifted optimum efficiently. Thanks to a general property called fair sampling, the EDA will with high probability sample from almost every fitness level of the function, including levels in the gap, and sample the global optimum even though the overall search trajectory points towards the all-ones string. Finally, we derive limits on the gap size allowing efficient runtimes for the EDA.
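The Jump_k benchmark discussed above has a standard textbook definition, sketched below; the shifted variants the paper studies move the gap elsewhere, and only classic Jump with illustrative n and k is shown here:

```python
# Standard Jump_k benchmark on bit strings: OneMax-like slope with a
# fitness gap just below the all-ones optimum. The shifted variants
# studied in the paper relocate this gap; only classic Jump is sketched.

def jump(bits, k: int) -> int:
    n = len(bits)
    ones = sum(bits)
    if ones <= n - k or ones == n:
        return k + ones          # smooth slope, global optimum at the all-ones string
    return n - ones              # inside the gap: fitness drops sharply

n, k = 8, 3
all_ones = [1] * n               # global optimum, fitness k + n = 11
in_gap = [1] * 7 + [0]           # 7 ones: n - k < 7 < n, so fitness n - 7 = 1
on_slope = [1] * 5 + [0] * 3     # 5 ones = n - k, last point before the gap
```

Mutation-based algorithms must flip k bits at once to cross from the slope to the optimum, which is why the gap size governs their runtime.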

Journal ArticleDOI
TL;DR: This paper proposes a novel sequential relation decoder (SRD) that aims to decode expressions into tree structures for online handwritten mathematical expression recognition and demonstrates how the proposed SRD outperforms state-of-the-art string decoders through a set of experiments on CROHME database.
Abstract: Recently, recognition of online handwritten mathematical expression has been greatly improved by employing encoder-decoder based methods. Existing encoder-decoder models use string decoders to generate LaTeX strings for mathematical expression recognition. However, in this paper, we importantly argue that string representations might not be the most natural for mathematical expressions – mathematical expressions are inherently tree structures rather than flat strings. For this purpose, we propose a novel sequential relation decoder (SRD) that aims to decode expressions into tree structures for online handwritten mathematical expression recognition. At each step of tree construction, a sub-tree structure composed of a relation node and two symbol nodes is computed based on previous sub-tree structures. This is the first work that builds a tree structure based decoder for encoder-decoder based mathematical expression recognition. Compared with string decoders, a decoder that better understands tree structures is crucial for mathematical expression recognition as it brings a more reasonable learning objective and improves overall generalization ability. We demonstrate how the proposed SRD outperforms state-of-the-art string decoders through a set of experiments on the CROHME database, which is currently the largest benchmark for online handwritten mathematical expression recognition.

Book ChapterDOI
18 Jul 2021
TL;DR: In this paper, a length-aware solving algorithm for the quantifier-free first-order theory over regex membership predicate and linear arithmetic over string length is presented, which can be used very effectively to simplify operations on automata representing regular expressions.
Abstract: We present a novel length-aware solving algorithm for the quantifier-free first-order theory over regex membership predicate and linear arithmetic over string length. We implement and evaluate this algorithm and related heuristics in the Z3 theorem prover. A crucial insight that underpins our algorithm is that real-world regex and string formulas contain a wealth of information about upper and lower bounds on lengths of strings, and such information can be used very effectively to simplify operations on automata representing regular expressions. Additionally, we present a number of novel general heuristics, such as the prefix/suffix method, that can be used to make a variety of regex solving algorithms more efficient in practice. We showcase the power of our algorithm and heuristics via an extensive empirical evaluation over a large and diverse benchmark of 57256 regex-heavy instances, almost 75% of which are derived from industrial applications or contributed by other solver developers. Our solver outperforms five other state-of-the-art string solvers, namely, CVC4, OSTRICH, Z3seq, Z3str3, and Z3-Trau, over this benchmark, in particular achieving a speedup of 2.4× over CVC4, 4.4× over Z3seq, 6.4× over Z3-Trau, 9.1× over Z3str3, and 13× over OSTRICH.

Journal ArticleDOI
TL;DR: The focus of this work is to reconsider string stability from a safety perspective and develop an upper limit on the maximum spacing error in a homogeneous platoon as a function of the acceleration maneuver of the lead vehicle.
Abstract: Recent advances in vehicle connectivity have allowed formation of autonomous vehicle platoons for improved mobility and traffic throughput. In order to avoid a pile-up in such platoons, it is important to ensure platoon (string) stability, which is the focus of this work. As per conventional definition of string stability, the power (2-norm) of the spacing error signals should not amplify downstream in a platoon. But in practice, it is the infinity-norm of the spacing error signal that dictates whether a collision occurs. We address this discrepancy in the first part of our work, where we reconsider string stability from a safety perspective and develop an upper limit on the maximum spacing error in a homogeneous platoon as a function of the acceleration maneuver of the lead vehicle. In the second part of this paper, we extend our previous results by providing the minimum achievable time headway for platoons with two-predecessor lookup schemes experiencing burst-noise packet losses. Finally, we utilize throttle and brake maps to develop a longitudinal vehicle model and validate it against a Lincoln MKZ which is then used for numerical corroboration of the proposed time headway selection algorithms.

Journal ArticleDOI
TL;DR: The proposed method can detect faults regardless of mismatch level and PV array size using only one current sensor per string, and it also works with finite fault resistance as well as in the presence or absence of blocking diodes.
Abstract: In this study, photovoltaic (PV) string currents are analyzed to understand the behavior of the PV array under faults. The whole analysis is summed up in just two simple statements, and an algorithm is devised on the basis of these statements. The proposed algorithm can detect, classify, and localize line-to-ground (L-G) and line-to-line (L-L) faults in the PV array. The proposed method can detect faults regardless of mismatch level and PV array size using only one current sensor per string. It is also capable of working with finite fault resistance as well as in the presence or absence of blocking diodes. The accuracy of the algorithm has been thoroughly verified on an experimental setup.
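The abstract does not reproduce its two statements, so the sketch below is a generic stand-in, not the paper's algorithm: with one current sensor per string, a faulted string's current typically deviates from its healthy peers, so comparing each string current against the array median flags suspect strings. The function name, tolerance, and example currents are all hypothetical.

```python
# Generic median-deviation check over per-string current readings.
from statistics import median

def flag_faulty_strings(currents, tol=0.15):
    """Return indices of strings whose current deviates more than `tol`
    (relative) from the median string current."""
    m = median(currents)
    return [i for i, c in enumerate(currents)
            if m > 0 and abs(c - m) / m > tol]

# Eight strings at roughly 8 A; string 2 sags (e.g. an L-G fault pulling
# current down) and string 5 rises (e.g. back-fed current in an L-L fault).
currents = [8.1, 8.0, 5.2, 7.9, 8.2, 9.9, 8.0, 8.1]
suspect = flag_faulty_strings(currents)   # -> [2, 5]
```

A real detector would additionally classify the fault type and account for irradiance changes, which this toy check does not attempt.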


Journal ArticleDOI
TL;DR: In this article, a semi-supervised method for scene text recognition is proposed, where the image feature and string feature are embedded into a common space and the embedding reward is defined by the similarity between the input image and generated string.
Abstract: Scene text recognition has been widely researched with supervised approaches. Most existing algorithms require a large amount of labeled data, and some methods even require character-level or pixel-wise supervision. However, labeled data is expensive, while unlabeled data is relatively easy to collect, especially for the many languages with fewer resources. In this paper, we propose a novel semi-supervised method for scene text recognition. Specifically, we design two global metrics, an edit reward and an embedding reward, to evaluate the quality of the generated string, and adopt reinforcement learning techniques to directly optimize these rewards. The edit reward measures the distance between the ground-truth label and the generated string. In addition, the image feature and string feature are embedded into a common space, and the embedding reward is defined by the similarity between the input image and the generated string. It is natural that the generated string should be nearest to the image it was generated from; therefore, the embedding reward can be obtained without any ground-truth information. In this way, we can effectively exploit a large number of unlabeled images to improve recognition performance without additional laborious annotation. Extensive experimental evaluations on five challenging benchmarks, the Street View Text, IIIT5K, and ICDAR datasets, demonstrate the effectiveness of the proposed approach; our method significantly reduces annotation effort while maintaining competitive recognition performance.
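An edit reward of the kind the abstract describes can be sketched as follows. The exact reward formula is not given in the abstract, so the normalization below (1 minus edit distance divided by the longer length) is a plausible assumption, not the paper's definition.

```python
# Sketch of an "edit reward": score a generated string by its
# (normalized, negated) Levenshtein distance to the ground-truth label.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_reward(generated, label):
    """1.0 for a perfect match, decreasing toward 0.0 with more edits."""
    denom = max(len(generated), len(label), 1)
    return 1.0 - levenshtein(generated, label) / denom

r = edit_reward("h0tel", "hotel")   # one substitution out of five chars -> 0.8
```

Unlike this edit reward, the paper's embedding reward needs no label at all, which is what lets it exploit unlabeled images.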

Journal ArticleDOI
TL;DR: Tests and security analysis show that encryption with this scheme is effective and that the key space is large enough to withstand common attacks.
Abstract: This paper proposes a novel algorithm for encrypting color images. The innovation in this study is the use of messenger ribonucleic acid (mRNA) encoding as input to deoxyribonucleic acid (DNA) encoding. For permutation of the plain-image bits, we use Arnold's cat map at the bit level. Then, using Non-Adjacent Coupled Map Lattices (NCML), we apply diffusion operations to the permuted color channels, and we further strengthen the diffusion phase with DNA encoding. In the proposed algorithm, these choices are randomized by the secret key, which is implemented using a simple logistic map. The secret key, parameters, and initial values are generated by hashing the user-supplied string with the double-MD5 method. Tests and security analysis show that encryption with this scheme is effective and that the key space is large enough to withstand common attacks.
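The key-to-keystream idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the paper's scheme: it uses a single MD5 (the paper uses double MD5), a plain logistic map x_{n+1} = r·x_n·(1 − x_n) instead of NCML, and XOR diffusion instead of DNA/mRNA encoding; the parameter r = 3.99 and the seed mapping are illustrative choices.

```python
# Hedged sketch: hash a user key to seed a logistic map, then use the
# chaotic orbit as a keystream for a simple XOR diffusion step.
import hashlib

def logistic_keystream(key: str, n_bytes: int, r: float = 3.99) -> bytes:
    digest = hashlib.md5(key.encode()).digest()
    # Map the first 8 hash bytes to an initial value strictly inside (0, 1).
    x = (int.from_bytes(digest[:8], 'big') % (10**8) + 1) / (10**8 + 2)
    stream = bytearray()
    for _ in range(n_bytes):
        x = r * x * (1.0 - x)          # logistic map iteration
        stream.append(int(x * 256) % 256)
    return bytes(stream)

def xor_diffuse(data: bytes, key: str) -> bytes:
    ks = logistic_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))

plain = b"pixel data"
cipher = xor_diffuse(plain, "secret")
assert xor_diffuse(cipher, "secret") == plain   # XOR diffusion is self-inverse
```

Because the keystream is fully determined by the key hash, the same key decrypts; the paper's bit-level permutation and DNA steps would sit around this diffusion core.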

Journal ArticleDOI
TL;DR: In this paper, the authors present a survey of the algorithmic developments that have led to these data structures and discuss the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.
Abstract: Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, string collections experienced a growth that outpaces Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to exploit it properly. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, presented in two parts, we cover the algorithmic developments that have led to these data structures. In this second part, we describe the fundamental algorithmic ideas and data structures that form the basis of all the existing indexes, and the various concrete structures that have been proposed, comparing them in both theoretical and practical aspects and uncovering some new combinations. We conclude with the current challenges in this fascinating field.
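The repetitiveness claim is easy to demonstrate. The example below uses `zlib` only as a generic stand-in for the repetition-aware indexes the survey covers (which additionally support search); the synthetic "collection" of near-identical sequence copies is made up for illustration.

```python
# A collection of near-identical sequences compresses far below its plain
# size -- its information content is much smaller than its length.
import zlib

base = "ACGTTGCAAGCT" * 100    # one 1200-character "genome"
# Fifty copies, each differing from `base` by a single substituted character:
# a highly repetitive collection, like resequenced individuals of one species.
collection = "".join(base[:i * 7] + "T" + base[i * 7 + 1:] for i in range(50))

plain_size = len(collection)                               # 60,000 characters
packed_size = len(zlib.compress(collection.encode(), 9))   # far smaller
ratio = plain_size / packed_size
```

A statistical (entropy-based) compressor keyed only to character frequencies would barely shrink this collection, since all four letters stay roughly equally frequent; it is the long repeated substrings that the new generation of indexes exploits.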

Journal ArticleDOI
TL;DR: This paper presents a unified survey and comparison of the data structures that have been proposed to store and query a k-mer set.
Abstract: The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
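The shared component the survey studies, storing and querying a k-mer set, reduces to the following minimal sketch (the example reads and k = 3 are made up; real tools replace the plain Python set with the compressed, specialized structures the survey compares):

```python
# Build the set of k-mers of a read collection, then answer membership queries.

def kmers(seq: str, k: int):
    """Yield every length-k substring (k-mer) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

reads = ["ACGTAC", "GTACGT"]
k = 3
kmer_set = set()
for read in reads:
    kmer_set.update(kmers(read, k))
# kmer_set == {"ACG", "CGT", "GTA", "TAC"}

# Membership queries against the k-mer set.
assert "ACG" in kmer_set
assert "AAA" not in kmer_set
```

Note how the two overlapping reads contribute the same four k-mers: deduplication like this is one reason a set, rather than a list, is the natural abstraction, and why memory-efficient set representations matter at genome scale.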