
Showing papers on "String (computer science) published in 2021"


Journal ArticleDOI
TL;DR: Changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks are described.
Abstract: Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.

3,253 citations


Posted Content
TL;DR: The authors survey and organize research works in a new paradigm in natural language processing, which they dub "prompt-based learning", and describe a unified set of mathematical notations that can cover a wide variety of existing work.
Abstract: This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning". Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks, the original input x is modified using a template into a textual string prompt x' that has some unfilled slots, and then the language model is used to probabilistically fill the unfilled information to obtain a final string x̂, from which the final output y can be derived. This framework is powerful and attractive for a number of reasons: it allows the language model to be pre-trained on massive amounts of raw text, and by defining a new prompting function the model is able to perform few-shot or even zero-shot learning, adapting to new scenarios with few or no labeled data. In this paper we introduce the basics of this promising paradigm, describe a unified set of mathematical notations that can cover a wide variety of existing work, and organize existing work along several dimensions, e.g., the choice of pre-trained models, prompts, and tuning strategies. To make the field more accessible to interested beginners, we not only make a systematic review of existing works and a highly structured typology of prompt-based concepts, but also release other resources, e.g., a website (this http URL) including a constantly-updated survey and paperlist.
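The template-then-fill workflow described above can be sketched in a few lines. The template string, the `[MASK]` placeholder, the verbalizer mapping, and the hard-coded top filler below are all illustrative assumptions, not the survey's notation or any real model API:

```python
# Minimal sketch of the prompt-based learning workflow described above.
# The template, slot names, verbalizer, and hard-coded filler are
# illustrative assumptions, not the paper's notation or a real model API.

def make_prompt(x: str, template: str) -> str:
    """Wrap the raw input x into a textual prompt x' with an unfilled slot."""
    return template.format(x=x, z="[MASK]")

def derive_output(filled: str, verbalizer: dict) -> str:
    """Map the token the language model filled in back to a label y."""
    return verbalizer.get(filled, "unknown")

template = "Review: {x} Overall, it was a {z} movie."
prompt = make_prompt("The acting was superb.", template)
# A language model would now score candidate fillers for [MASK];
# here we simply hard-code the top-scoring filler for illustration.
top_filler = "great"
label = derive_output(top_filler, {"great": "positive", "terrible": "negative"})
```

Zero-shot behaviour then amounts to swapping in a new template and verbalizer without retraining.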

93 citations


Journal ArticleDOI
TL;DR: This paper formulates causality extraction as a sequence labeling problem based on a novel causality tagging scheme, and proposes a neural causality extractor with the BiLSTM-CRF model as the backbone, named SCITE (Self-attentive BiLSTM-CRF wIth Transferred Embeddings), which can directly extract cause and effect without extracting candidate causal pairs and identifying their relations separately.

80 citations


Journal ArticleDOI
TL;DR: In this paper, an analytical sufficient condition for the L ∞ string stability of heterogeneous vehicles, which move according to a general class of car-following models, is derived.
Abstract: This paper shows that the heterogeneity of drivers’ and vehicles’ characteristics makes platoons, on average, more string-unstable. However, the string instability degree of unstable platoons is much higher in a homogeneous flow than in a heterogeneous one. These results are based on an L∞ characterization of string stability, which is shown to be the most appropriate one from a traffic safety viewpoint. Mechanisms and conditions are discussed in which an L2 characterization is not able to capture the amplification of a speed drop through a string of vehicles. An analytical sufficient condition for the L∞ string stability of heterogeneous vehicles, which move according to a general class of car-following models, is derived. Above all, a thorough comparison of the L∞ and L2 string stability characterizations between a homogeneous and a heterogeneous flow is performed. To this aim, the Lp norms of heterogeneous platoons are calculated within a quasi-Monte Carlo framework. The variability of the Lp norm values due to the platoon length, the equilibrium speed, and the probability distribution model of the uncertain vehicle parameters is analysed. Overall, it is shown that the platoon stability behaviour changes considerably with the shape and the correlation structure of vehicle model parameter distributions. Therefore, traffic heterogeneity needs to be modelled in order to correctly characterize the string stability of a mixed traffic flow.
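The safety argument for the L∞ characterization can be illustrated with a toy computation: a short, large spike in a follower's speed error (the dangerous case) can have a smaller 2-norm than the leader's sustained error while having a much larger peak. The signals below are invented for illustration and are unrelated to the paper's car-following models:

```python
# Toy illustration of L∞ vs L2 string stability checks on invented,
# discrete-time speed-error signals (not the paper's models or data).
import math

def l2_norm(signal, dt=1.0):
    return math.sqrt(sum(v * v for v in signal) * dt)

def linf_norm(signal):
    return max(abs(v) for v in signal)

leader = [1.0] * 16                  # sustained small error: L2 = 4.0, L∞ = 1.0
follower = [0.0] * 14 + [2.5, 2.5]   # brief large spike: L2 ≈ 3.54, L∞ = 2.5

# The L2 criterion sees no amplification, yet the peak error grew 2.5x.
l2_amplified = l2_norm(follower) > l2_norm(leader)
linf_amplified = linf_norm(follower) > linf_norm(leader)
```

Here the L2 test would declare the pair string stable while the L∞ test flags the safety-relevant spike, which is the discrepancy the paper analyses.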

42 citations


Book ChapterDOI
Bob Coecke1
TL;DR: Both the compositional formalism and suggested meaning model are highly quantum-inspired, and implementation on a quantum computer would come with a range of benefits.
Abstract: In previous work we gave a mathematical foundation, referred to as DisCoCat, for how words interact in a sentence in order to produce the meaning of that sentence. To do so, we exploited the perfect structural match of grammar and categories of meaning spaces. Here, we give a mathematical foundation, referred to as DisCoCirc, for how sentences interact in texts in order to produce the meaning of that text. First we revisit DisCoCat. While in DisCoCat all meanings are fixed as states (i.e. have no input), in DisCoCirc word meanings correspond to a type, or system, and the states of this system can evolve. Sentences are gates within a circuit which update the variable meanings of those words. Like in DisCoCat, word meanings can live in a variety of spaces e.g. propositional, vectorial, or cognitive. The compositional structures are string diagrams representing information flows, and an entire text yields a single string diagram in which word meanings lift to the meaning of the entire text. While the developments in this paper are independent of a physical embodiment (cf. classical vs. quantum computing), both the compositional formalism and suggested meaning model are highly quantum-inspired, and implementation on a quantum computer would come with a range of benefits. We also praise Jim Lambek for his role in mathematical linguistics in general, and the development of the DisCo program more specifically.

33 citations


Journal ArticleDOI
TL;DR: It is found that relational graph convolutional networks and gradient-boosting machines are very effective for this learning task, and a novel reaction-level graph-attention operation is disclosed in the top-performing model.
Abstract: Machine-learned ranking models have been developed for the prediction of substrate-specific cross-coupling reaction conditions. Data sets of published reactions were curated for Suzuki, Negishi, and C-N couplings, as well as Pauson-Khand reactions. String, descriptor, and graph encodings were tested as input representations, and models were trained to predict the set of conditions used in a reaction as a binary vector. Unique reagent dictionaries categorized by expert-crafted reaction roles were constructed for each data set, leading to context-aware predictions. We find that relational graph convolutional networks and gradient-boosting machines are very effective for this learning task, and we disclose a novel reaction-level graph attention operation in the top-performing model.

28 citations


Journal ArticleDOI
TL;DR: It is shown how the size of the smallest string attractor of a word varies when combinatorial operations are applied and it is deduced that such a measure is not monotone.

26 citations


Journal ArticleDOI
08 Feb 2021
TL;DR: DisCoPy, an open-source toolbox for computing with monoidal categories, provides an intuitive syntax for defining string diagrams and monoidal functors, and its modularity allows the efficient implementation of computational experiments in the various applications of category theory.
Abstract: We introduce DisCoPy, an open source toolbox for computing with monoidal categories. The library provides an intuitive syntax for defining string diagrams and monoidal functors. Its modularity allows the efficient implementation of computational experiments in the various applications of category theory where diagrams have become a lingua franca. As an example, we used DisCoPy to perform natural language processing on quantum hardware for the first time.

20 citations


Journal ArticleDOI
TL;DR: This work proves a lower bound on the size of the optimal SPSS and proposes a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to the lower bound.
Abstract: Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10–44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
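The SPSS idea can be sketched with a toy greedy stitcher: merge k-mers that overlap by k-1 characters into longer strings, so that the k-mer spectrum of the output equals the input set. This mimics the spirit of the greedy UST method, not its actual implementation:

```python
# Toy sketch of a spectrum-preserving string set (SPSS): greedily stitch
# k-mers that overlap by k-1 characters into longer strings. This mirrors
# the spirit of the paper's greedy UST method, not its implementation.

def greedy_spss(kmers):
    k = len(next(iter(kmers)))
    remaining = set(kmers)       # copy; each k-mer is used exactly once
    strings = []
    while remaining:
        s = remaining.pop()
        extended = True
        while extended:          # extend to the right by one character
            extended = False
            for c in "ACGT":
                nxt = s[-(k - 1):] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        extended = True
        while extended:          # extend to the left by one character
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in remaining:
                    remaining.remove(prv)
                    s = c + s
                    extended = True
                    break
        strings.append(s)
    return strings

kmers = {"ACG", "CGT", "GTT"}    # the 3-mer spectrum of "ACGTT"
spss = greedy_spss(kmers)
```

The three 3-mers collapse into the single string "ACGTT", whose 3-mer spectrum is exactly the input set.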

19 citations


Proceedings ArticleDOI
19 Jun 2021
TL;DR: WebQA employs a neurosymbolic DSL that incorporates both neural NLP models and standard language constructs for tree navigation and string manipulation, and uses transductive learning to select programs with good generalization power.
Abstract: In this paper, we propose a new technique based on program synthesis for extracting information from webpages. Given a natural language query and a few labeled webpages, our method synthesizes a program that can be used to extract similar types of information from other unlabeled webpages. To handle websites with diverse structure, our approach employs a neurosymbolic DSL that incorporates both neural NLP models as well as standard language constructs for tree navigation and string manipulation. We also propose an optimal synthesis algorithm that generates all DSL programs that achieve optimal F1 score on the training examples. Our synthesis technique is compositional, prunes the search space by exploiting a monotonicity property of the DSL, and uses transductive learning to select programs with good generalization power. We have implemented these ideas in a new tool called WebQA and evaluate it on 25 different tasks across multiple domains. Our experiments show that WebQA significantly outperforms existing tools such as state-of-the-art question answering models and wrapper induction systems.

18 citations


Journal ArticleDOI
TL;DR: In this article, a complete description of a basis of the extension spaces between indecomposable string and quasi-simple band modules in the module category of a gentle algebra is given.
Abstract: We give a complete description of a basis of the extension spaces between indecomposable string and quasi-simple band modules in the module category of a gentle algebra.

Journal ArticleDOI
15 Oct 2021
TL;DR: In this paper, the authors propose a framework for integrating inductive synthesis with few-shot learning language models to combine the strength of these two popular technologies, and demonstrate the generality of their approach via a case study in the domain of string profiling.
Abstract: The ability to learn programs from few examples is a powerful technology with disruptive applications in many domains, as it allows users to automate repetitive tasks in an intuitive way. Existing frameworks on inductive synthesis only perform syntactic manipulations, where they rely on the syntactic structure of the given examples and not their meaning. Any semantic manipulations, such as transforming dates, have to be manually encoded by the designer of the inductive programming framework. Recent advances in large language models have shown these models to be very adept at performing semantic transformations of their input by simply providing a few examples of the task at hand. When it comes to syntactic transformations, however, these models are limited in their expressive power. In this paper, we propose a novel framework for integrating inductive synthesis with few-shot learning language models to combine the strengths of these two popular technologies. In particular, the inductive synthesis is tasked with breaking down the problem into smaller subproblems, among which those that cannot be solved syntactically are passed to the language model. We formalize three semantic operators that can be integrated with inductive synthesizers. To minimize invoking expensive semantic operators during learning, we introduce a novel deferred query execution algorithm that considers the operators to be oracles during learning. We evaluate our approach in the domain of string transformations: the combination methodology can automate tasks that cannot be handled using either technology by itself. Finally, we demonstrate the generality of our approach via a case study in the domain of string profiling.

Proceedings ArticleDOI
15 Jun 2021
TL;DR: For any distinct x, y ∈ {0,1}^n, there is a deterministic finite automaton with O(n^{1/3}) states that accepts x but not y.
Abstract: We prove that for any distinct x, y ∈ {0,1}^n, there is a deterministic finite automaton with O(n^{1/3}) states that accepts x but not y. This improves Robson’s 1989 bound of O(n^{2/5}). Using a similar complex analytic technique, we improve the upper bound on worst case trace reconstruction, showing that any unknown string x ∈ {0,1}^n can be reconstructed with high probability from exp(O(n^{1/5})) independently generated traces.
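The O(n^{1/3}) construction for equal-length strings is an involved complex-analytic argument; the classic easy case, where |x| ≠ |y|, is handled by a small DFA that just counts input length modulo a prime p = O(log n). A sketch of that easy case, with illustrative inputs:

```python
# Toy sketch of word separation by a small DFA for the easy case |x| != |y|:
# a p-state cycle automaton tracking input length mod p separates the two
# strings. The paper's O(n^{1/3}) bound concerns the much harder
# equal-length case and uses a different, analytic argument.

def small_separating_modulus(m: int, n: int) -> int:
    """Smallest p >= 2 with m mod p != n mod p (known to be O(log n))."""
    p = 2
    while m % p == n % p:
        p += 1
    return p

def accepts_by_length_mod(w: str, p: int, residue: int) -> bool:
    """Run the p-state cycle DFA that accepts iff |w| mod p == residue."""
    state = 0
    for _ in w:
        state = (state + 1) % p
    return state == residue

x, y = "0101", "010101"                       # distinct lengths 4 and 6
p = small_separating_modulus(len(x), len(y))  # p = 3 here
accept_x = accepts_by_length_mod(x, p, len(x) % p)
accept_y = accepts_by_length_mod(y, p, len(x) % p)
```

The resulting 3-state automaton accepts x but rejects y.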

Journal ArticleDOI
TL;DR: In this paper, a deep learning neural machine translation approach is proposed to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation.
Abstract: Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset, a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Even incorrect predictions show a remarkable similarity between true and predicted compounds.
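The Tanimoto index mentioned above is the Jaccard similarity of binary molecular fingerprints. A minimal sketch, with plain Python sets standing in for fingerprint bit positions (the feature sets below are invented):

```python
# Minimal sketch of the Tanimoto (Jaccard) index used to compare a
# predicted molecule with the ground truth. Real pipelines compute it on
# binary molecular fingerprints; invented feature sets stand in here.

def tanimoto(a: set, b: set) -> float:
    if not a and not b:
        return 1.0                      # convention: two empty fingerprints match
    return len(a & b) / len(a | b)

true_fp = {1, 4, 7, 9, 12}              # hypothetical set bits of the true molecule
pred_fp = {1, 4, 7, 9, 13}              # prediction differs in one bit
sim = tanimoto(true_fp, pred_fp)        # 4 shared bits out of 6 set overall
```

A score above 0.9, as reported in the abstract, means predicted and true fingerprints overlap almost completely.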

Proceedings ArticleDOI
19 Jun 2021
TL;DR: This work proposes a new theory of derivatives of symbolic extended regular expressions that both handles intersection and complement and works symbolically over an arbitrary character theory, and unifies existing approaches involving derivatives of extended regular expressions, alternating automata and Boolean automata by lifting them to a common symbolic platform.
Abstract: The manipulation of raw string data is ubiquitous in security-critical software, and verification of such software relies on efficiently solving string and regular expression constraints via SMT. However, the typical case of Boolean combinations of regular expression constraints exposes blowup in existing techniques. To address solvability of such constraints, we propose a new theory of derivatives of symbolic extended regular expressions (extended meaning that complement and intersection are incorporated), and show how to apply this theory to obtain more efficient decision procedures. Our implementation of these ideas, built on top of Z3, matches or outperforms state-of-the-art solvers on standard and handwritten benchmarks, showing particular benefits on examples with Boolean combinations. Our work is the first formalization of derivatives of regular expressions which both handles intersection and complement and works symbolically over an arbitrary character theory. It unifies existing approaches involving derivatives of extended regular expressions, alternating automata and Boolean automata by lifting them to a common symbolic platform. It relies on a parsimonious augmentation of regular expressions: a construct for symbolic conditionals is shown to be sufficient to obtain relevant closure properties for derivatives over extended regular expressions.

Posted Content
TL;DR: This work shows how an analogous correspondence may be established for arbitrary SMTs, once an appropriate notion of DPO rewriting (which the authors call convex) is identified, and uses the approach to show termination of two SMTs of interest: Frobenius semi-algebras and bialgebras.
Abstract: Symmetric monoidal theories (SMTs) generalise algebraic theories in a way that makes them suitable to express resource-sensitive systems, in which variables cannot be copied or discarded at will. In SMTs, traditional tree-like terms are replaced by string diagrams, topological entities that can be intuitively thought of as diagrams of wires and boxes. Recently, string diagrams have become increasingly popular as a graphical syntax to reason about computational models across diverse fields, including programming language semantics, circuit theory, quantum mechanics, linguistics, and control theory. In applications, it is often convenient to implement the equations appearing in SMTs as rewriting rules. This poses the challenge of extending the traditional theory of term rewriting, which has been developed for algebraic theories, to string diagrams. In this paper, we develop a mathematical theory of string diagram rewriting for SMTs. Our approach exploits the correspondence between string diagram rewriting and double pushout (DPO) rewriting of certain graphs, introduced in the first paper of this series. Such a correspondence is only sound when the SMT includes a Frobenius algebra structure. In the present work, we show how an analogous correspondence may be established for arbitrary SMTs, once an appropriate notion of DPO rewriting (which we call convex) is identified. As proof of concept, we use our approach to show termination of two SMTs of interest: Frobenius semi-algebras and bialgebras.

Journal ArticleDOI
TL;DR: In this article, the authors present a survey of the algorithmic developments that have led to these data structures, including the distinct compression paradigms that have been used to exploit repetitiveness, and algorithmic techniques that provide direct access to the compressed strings.
Abstract: Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, formed by two parts, we cover the algorithmic developments that have led to these data structures. In this first part, we describe the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings. In the quest for an ideal measure of repetitiveness, we uncover a fascinating web of relations between those measures, as well as the limits up to which the data can be recovered, and up to which direct access to the compressed data can be provided. This is the basic aspect of indexability, which is covered in the second part of this survey.
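One of the repetitiveness measures such surveys relate is the number of phrases z in a greedy Lempel-Ziv parse. A toy sketch, using a simple LZ77-style variant in which each phrase is a longest previously-occurring piece plus one fresh character (the survey covers several more refined parses and measures):

```python
# Toy repetitiveness measure: count phrases z in a greedy LZ-style parse.
# Each phrase extends while its text already occurs fully inside the
# prefix scanned so far, then consumes one fresh character. This is a
# simple variant of LZ77; the survey treats several refinements.

def lz_phrase_count(s: str) -> int:
    i, z = 0, 0
    while i < len(s):
        ell = 0
        # grow the copied part while it occurs inside the earlier prefix
        while i + ell < len(s) and s[: i + ell].find(s[i : i + ell + 1]) != -1:
            ell += 1
        i += ell + 1   # phrase = copied part plus one fresh character
        z += 1
    return z

low = lz_phrase_count("abababababababab")   # highly repetitive: few phrases
high = lz_phrase_count("abcdefgh")          # no repetition: one phrase per char
```

The repetitive string collapses into 3 phrases while the non-repetitive one needs 8, illustrating why z can be orders of magnitude smaller than the plain size on repetitive collections.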

Proceedings ArticleDOI
12 Jul 2021
TL;DR: In this paper, the authors introduce approximate trace reconstruction, a relaxed version of the original trace reconstruction problem: instead of learning a binary string perfectly from noisy samples, the goal is to output a string that is close in edit distance to the original string using few traces. Several algorithms are presented that can approximately reconstruct strings belonging to certain classes, where the estimate is within n/polylog(n) edit distance and only polylog(n) traces are used.
Abstract: We introduce approximate trace reconstruction, a relaxed version of the trace reconstruction problem. Here, instead of learning a binary string perfectly from noisy samples, as in the original trace reconstruction problem, the goal is to output a string that is close in edit distance to the original string using few traces. We present several algorithms that can approximately reconstruct strings that belong to certain classes, where the estimate is within n/polylog(n) edit distance and where we only use polylog(n) traces (or sometimes just a single trace). These classes contain strings that require a linear number of traces for exact reconstruction and that are quite different from a typical random string. From a technical point of view, our algorithms approximately reconstruct consecutive substrings of the unknown string by aligning dense regions of traces and using a run of a suitable length to approximate each region. A full version of this paper is accessible at: https://arxiv.org/abs/2012.06713.pdf
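The trace model used above can be sketched directly: a trace is the output of a deletion channel that drops each bit of x independently with some probability q, so every trace is a subsequence of x. The string, deletion rate, and seed below are arbitrary illustrations:

```python
# Toy sketch of the deletion-channel trace model: each bit of the unknown
# string survives independently with probability 1 - q. The string, rate,
# and seed are arbitrary; the paper's reconstruction algorithms that
# consume such traces are far more involved.
import random

def sample_trace(x: str, q: float, rng: random.Random) -> str:
    """One trace: drop each bit of x independently with probability q."""
    return "".join(b for b in x if rng.random() >= q)

def is_subsequence(t: str, x: str) -> bool:
    """Greedy left-to-right embedding check."""
    it = iter(x)
    return all(b in it for b in t)

rng = random.Random(0)
x = "0110100110010110"
traces = [sample_trace(x, 0.2, rng) for _ in range(5)]
all_subseq = all(is_subsequence(t, x) for t in traces)
```

Exact reconstruction must identify x from such samples; the approximate version only needs an output within n/polylog(n) edit distance of x.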

Book ChapterDOI
25 Oct 2021
TL;DR: The notion of absent subsequences is introduced in this paper: a string u is an absent subsequence of a string w if u does not occur as a subsequence (a.k.a. scattered factor) inside w.
Abstract: An absent factor of a string w is a string u which does not occur as a contiguous substring (a.k.a. factor) inside w. We extend this well-studied notion and define absent subsequences: a string u is an absent subsequence of a string w if u does not occur as subsequence (a.k.a. scattered factor) inside w. Of particular interest to us are minimal absent subsequences, i.e., absent subsequences whose every subsequence is not absent, and shortest absent subsequences, i.e., absent subsequences of minimal length. We show a series of combinatorial and algorithmic results regarding these two notions. For instance: we give combinatorial characterisations of the sets of minimal and, respectively, shortest absent subsequences in a word, as well as compact representations of these sets; we show how we can test efficiently if a string is a shortest or minimal absent subsequence in a word, and we give efficient algorithms computing the lexicographically smallest absent subsequence of each kind; also, we show how a data structure for answering shortest absent subsequence-queries for the factors of a given string can be efficiently computed.
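The definitions can be made concrete with a brute-force toy: test embeddability left to right, and enumerate strings by increasing length until an absent one appears. The paper gives efficient algorithms and compact representations; this exponential enumeration is only for illustration on a tiny binary example:

```python
# Brute-force illustration of shortest absent subsequences over {0,1}.
# The paper's algorithms are efficient; this enumeration is exponential
# and only meant to make the definition concrete.
from itertools import product

def is_subsequence(u: str, w: str) -> bool:
    """u occurs as a scattered factor of w (greedy embedding)."""
    it = iter(w)
    return all(c in it for c in u)

def shortest_absent_subsequences(w: str, alphabet="01"):
    """All absent subsequences of minimal length, by exhaustive search."""
    k = 1
    while True:
        absent = ["".join(p) for p in product(alphabet, repeat=k)
                  if not is_subsequence("".join(p), w)]
        if absent:
            return absent
        k += 1

sas = shortest_absent_subsequences("0101")
```

For w = "0101" every binary string of length 2 embeds into w, so the shortest absent subsequences have length 3 (for instance "000", since w contains only two zeros).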

Journal ArticleDOI
TL;DR: It is shown that the proposed controller always guarantees the string stability of the platoon without any limitations; a constant spacing strategy is used to adjust the inter-vehicle spacing.
Abstract: This study deals with internal and string stability analyses of vehicular platoons with centralized multi-look ahead network topology in the presence of communication and parasitic delays and rando...

Proceedings ArticleDOI
06 Sep 2021
TL;DR: In this paper, the authors investigate variants of the Jump function where the gap is shifted and appears in the typical search trajectory, and derive limits on the gap size allowing efficient runtimes for the EDA.
Abstract: The benefits of using crossover in crossing fitness gaps have been studied extensively in evolutionary computation. Recent runtime results show that majority-vote crossover is particularly efficient at optimizing the well-known Jump benchmark function that includes a fitness gap next to the global optimum. Also estimation-of-distribution algorithms (EDAs), which use an implicit crossover, are much more efficient on Jump than typical mutation-based algorithms. However, the allowed gap size for polynomial runtimes with EDAs is at most logarithmic in the problem dimension n. In this paper, we investigate variants of the Jump function where the gap is shifted and appears in the middle of the typical search trajectory. Such gaps can still be overcome efficiently in time O(n log n) by majority-vote crossover and an estimation-of-distribution algorithm, even for gap sizes almost [EQUATION]. However, if the global optimum is located in the gap instead of the usual all-ones string, majority-vote crossover would nevertheless approach the all-ones string and be highly inefficient. In sharp contrast, an EDA can still find such a shifted optimum efficiently. Thanks to a general property called fair sampling, the EDA will with high probability sample from almost every fitness level of the function, including levels in the gap, and sample the global optimum even though the overall search trajectory points towards the all-ones string. Finally, we derive limits on the gap size allowing efficient runtimes for the EDA.
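The Jump_k benchmark discussed above has a standard textbook definition, sketched below; the shifted variants the paper studies move the gap elsewhere, and only classic Jump with illustrative n and k is shown here:

```python
# Standard Jump_k benchmark on bit strings: OneMax-like slope with a
# fitness gap just below the all-ones optimum. The shifted variants
# studied in the paper relocate this gap; only classic Jump is sketched.

def jump(bits, k: int) -> int:
    n = len(bits)
    ones = sum(bits)
    if ones <= n - k or ones == n:
        return k + ones          # smooth slope, global optimum at the all-ones string
    return n - ones              # inside the gap: fitness drops sharply

n, k = 8, 3
all_ones = [1] * n               # global optimum, fitness k + n = 11
in_gap = [1] * 7 + [0]           # 7 ones: n - k < 7 < n, so fitness n - 7 = 1
on_slope = [1] * 5 + [0] * 3     # 5 ones = n - k, last point before the gap
```

Mutation-based algorithms must flip k bits at once to cross from the slope to the optimum, which is why the gap size governs their runtime.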

Journal ArticleDOI
TL;DR: This paper proposes a novel sequential relation decoder (SRD) that aims to decode expressions into tree structures for online handwritten mathematical expression recognition and demonstrates how the proposed SRD outperforms state-of-the-art string decoders through a set of experiments on CROHME database.
Abstract: Recently, recognition of online handwritten mathematical expression has been greatly improved by employing encoder-decoder based methods. Existing encoder-decoder models use string decoders to generate LaTeX strings for mathematical expression recognition. However, in this paper, we importantly argue that string representations might not be the most natural for mathematical expressions – mathematical expressions are inherently tree structures rather than flat strings. For this purpose, we propose a novel sequential relation decoder (SRD) that aims to decode expressions into tree structures for online handwritten mathematical expression recognition. At each step of tree construction, a sub-tree structure composed of a relation node and two symbol nodes is computed based on previous sub-tree structures. This is the first work that builds a tree structure based decoder for encoder-decoder based mathematical expression recognition. Compared with string decoders, a decoder that better understands tree structures is crucial for mathematical expression recognition as it brings a more reasonable learning objective and improves overall generalization ability. We demonstrate how the proposed SRD outperforms state-of-the-art string decoders through a set of experiments on the CROHME database, which is currently the largest benchmark for online handwritten mathematical expression recognition.

Book ChapterDOI
18 Jul 2021
TL;DR: In this paper, a length-aware solving algorithm for the quantifier-free first-order theory over regex membership predicate and linear arithmetic over string length is presented, which can be used very effectively to simplify operations on automata representing regular expressions.
Abstract: We present a novel length-aware solving algorithm for the quantifier-free first-order theory over regex membership predicate and linear arithmetic over string length. We implement and evaluate this algorithm and related heuristics in the Z3 theorem prover. A crucial insight that underpins our algorithm is that real-world regex and string formulas contain a wealth of information about upper and lower bounds on lengths of strings, and such information can be used very effectively to simplify operations on automata representing regular expressions. Additionally, we present a number of novel general heuristics, such as the prefix/suffix method, that can be used to make a variety of regex solving algorithms more efficient in practice. We showcase the power of our algorithm and heuristics via an extensive empirical evaluation over a large and diverse benchmark of 57256 regex-heavy instances, almost 75% of which are derived from industrial applications or contributed by other solver developers. Our solver outperforms five other state-of-the-art string solvers, namely, CVC4, OSTRICH, Z3seq, Z3str3, and Z3-Trau, over this benchmark, in particular achieving a speedup of 2.4× over CVC4, 4.4× over Z3seq, 6.4× over Z3-Trau, 9.1× over Z3str3, and 13× over OSTRICH.

Journal ArticleDOI
TL;DR: The focus of this work is to reconsider string stability from a safety perspective and develop an upper limit on the maximum spacing error in a homogeneous platoon as a function of the acceleration maneuver of the lead vehicle.
Abstract: Recent advances in vehicle connectivity have allowed formation of autonomous vehicle platoons for improved mobility and traffic throughput. In order to avoid a pile-up in such platoons, it is important to ensure platoon (string) stability, which is the focus of this work. As per conventional definition of string stability, the power (2-norm) of the spacing error signals should not amplify downstream in a platoon. But in practice, it is the infinity-norm of the spacing error signal that dictates whether a collision occurs. We address this discrepancy in the first part of our work, where we reconsider string stability from a safety perspective and develop an upper limit on the maximum spacing error in a homogeneous platoon as a function of the acceleration maneuver of the lead vehicle. In the second part of this paper, we extend our previous results by providing the minimum achievable time headway for platoons with two-predecessor lookup schemes experiencing burst-noise packet losses. Finally, we utilize throttle and brake maps to develop a longitudinal vehicle model and validate it against a Lincoln MKZ which is then used for numerical corroboration of the proposed time headway selection algorithms.

Journal ArticleDOI
TL;DR: The proposed method can detect faults regardless of mismatch level and PV array size using only one current sensor per string, and it also works with finite fault resistance as well as in the presence or absence of blocking diodes.
Abstract: In this study, photovoltaic (PV) string currents are analyzed to understand the behavior of the PV array under faults. The whole analysis is summed up in just two simple statements, and an algorithm is devised on the basis of these statements. The proposed algorithm can detect, classify, and localize line-to-ground (L-G) and line-to-line (L-L) faults in the PV array. The proposed method can detect faults regardless of mismatch level and PV array size using only one current sensor per string. It is also capable of working with finite fault resistance as well as in the presence or absence of blocking diodes. The accuracy of the algorithm has been thoroughly verified on an experimental setup.
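The abstract does not reproduce its two statements, so the sketch below is a generic stand-in, not the paper's algorithm: with one current sensor per string, a faulted string's current typically deviates from its healthy peers, so comparing each string current against the array median flags suspect strings. The function name, tolerance, and example currents are all hypothetical.

```python
# Generic median-deviation check over per-string current readings.
from statistics import median

def flag_faulty_strings(currents, tol=0.15):
    """Return indices of strings whose current deviates more than `tol`
    (relative) from the median string current."""
    m = median(currents)
    return [i for i, c in enumerate(currents)
            if m > 0 and abs(c - m) / m > tol]

# Eight strings at roughly 8 A; string 2 sags (e.g. an L-G fault pulling
# current down) and string 5 rises (e.g. back-fed current in an L-L fault).
currents = [8.1, 8.0, 5.2, 7.9, 8.2, 9.9, 8.0, 8.1]
suspect = flag_faulty_strings(currents)   # -> [2, 5]
```

A real detector would additionally classify the fault type and account for irradiance changes, which this toy check does not attempt.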


Journal ArticleDOI
TL;DR: In this article, a semi-supervised method for scene text recognition is proposed, where the image feature and string feature are embedded into a common space and the embedding reward is defined by the similarity between the input image and generated string.
Abstract: Scene text recognition has been widely researched with supervised approaches. Most existing algorithms require a large amount of labeled data, and some methods even require character-level or pixel-wise supervision. However, labeled data is expensive, while unlabeled data is relatively easy to collect, especially for the many languages with fewer resources. In this paper, we propose a novel semi-supervised method for scene text recognition. Specifically, we design two global metrics, an edit reward and an embedding reward, to evaluate the quality of the generated string, and adopt reinforcement learning techniques to directly optimize these rewards. The edit reward measures the distance between the ground-truth label and the generated string. In addition, the image feature and string feature are embedded into a common space, and the embedding reward is defined by the similarity between the input image and the generated string. It is natural that the generated string should be nearest to the image it was generated from; therefore, the embedding reward can be obtained without any ground-truth information. In this way, we can effectively exploit a large number of unlabeled images to improve recognition performance without additional laborious annotation. Extensive experimental evaluations on five challenging benchmarks, the Street View Text, IIIT5K, and ICDAR datasets, demonstrate the effectiveness of the proposed approach; our method significantly reduces annotation effort while maintaining competitive recognition performance.
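An edit reward of the kind the abstract describes can be sketched as follows. The exact reward formula is not given in the abstract, so the normalization below (1 minus edit distance divided by the longer length) is a plausible assumption, not the paper's definition.

```python
# Sketch of an "edit reward": score a generated string by its
# (normalized, negated) Levenshtein distance to the ground-truth label.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_reward(generated, label):
    """1.0 for a perfect match, decreasing toward 0.0 with more edits."""
    denom = max(len(generated), len(label), 1)
    return 1.0 - levenshtein(generated, label) / denom

r = edit_reward("h0tel", "hotel")   # one substitution out of five chars -> 0.8
```

Unlike this edit reward, the paper's embedding reward needs no label at all, which is what lets it exploit unlabeled images.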

Journal ArticleDOI
TL;DR: Tests and security analysis show that encryption with this scheme is effective and that the key space is large enough to withstand common attacks.
Abstract: This paper proposes a novel algorithm for encrypting color images. The innovation in this study is the use of messenger ribonucleic acid (mRNA) encoding as input to deoxyribonucleic acid (DNA) encoding. For permutation of the plain-image bits, we use Arnold's cat map at the bit level. Then, using Non-Adjacent Coupled Map Lattices (NCML), we apply diffusion operations to the permuted color channels, and we further strengthen the diffusion phase with DNA encoding. In the proposed algorithm, these choices are randomized by the secret key, which is implemented using a simple logistic map. The secret key, parameters, and initial values are generated by hashing the user-supplied string with the double-MD5 method. Tests and security analysis show that encryption with this scheme is effective and that the key space is large enough to withstand common attacks.
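The key-to-keystream idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the paper's scheme: it uses a single MD5 (the paper uses double MD5), a plain logistic map x_{n+1} = r·x_n·(1 − x_n) instead of NCML, and XOR diffusion instead of DNA/mRNA encoding; the parameter r = 3.99 and the seed mapping are illustrative choices.

```python
# Hedged sketch: hash a user key to seed a logistic map, then use the
# chaotic orbit as a keystream for a simple XOR diffusion step.
import hashlib

def logistic_keystream(key: str, n_bytes: int, r: float = 3.99) -> bytes:
    digest = hashlib.md5(key.encode()).digest()
    # Map the first 8 hash bytes to an initial value strictly inside (0, 1).
    x = (int.from_bytes(digest[:8], 'big') % (10**8) + 1) / (10**8 + 2)
    stream = bytearray()
    for _ in range(n_bytes):
        x = r * x * (1.0 - x)          # logistic map iteration
        stream.append(int(x * 256) % 256)
    return bytes(stream)

def xor_diffuse(data: bytes, key: str) -> bytes:
    ks = logistic_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))

plain = b"pixel data"
cipher = xor_diffuse(plain, "secret")
assert xor_diffuse(cipher, "secret") == plain   # XOR diffusion is self-inverse
```

Because the keystream is fully determined by the key hash, the same key decrypts; the paper's bit-level permutation and DNA steps would sit around this diffusion core.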

Journal ArticleDOI
TL;DR: In this paper, the authors present a survey of the algorithmic developments that have led to these data structures and discuss the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.
Abstract: Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, string collections experienced a growth that outpaces Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to exploit it properly. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey, presented in two parts, we cover the algorithmic developments that have led to these data structures. In this second part, we describe the fundamental algorithmic ideas and data structures that form the basis of all the existing indexes, and the various concrete structures that have been proposed, comparing them in both theoretical and practical aspects and uncovering some new combinations. We conclude with the current challenges in this fascinating field.
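The repetitiveness claim is easy to demonstrate. The example below uses `zlib` only as a generic stand-in for the repetition-aware indexes the survey covers (which additionally support search); the synthetic "collection" of near-identical sequence copies is made up for illustration.

```python
# A collection of near-identical sequences compresses far below its plain
# size -- its information content is much smaller than its length.
import zlib

base = "ACGTTGCAAGCT" * 100    # one 1200-character "genome"
# Fifty copies, each differing from `base` by a single substituted character:
# a highly repetitive collection, like resequenced individuals of one species.
collection = "".join(base[:i * 7] + "T" + base[i * 7 + 1:] for i in range(50))

plain_size = len(collection)                               # 60,000 characters
packed_size = len(zlib.compress(collection.encode(), 9))   # far smaller
ratio = plain_size / packed_size
```

A statistical (entropy-based) compressor keyed only to character frequencies would barely shrink this collection, since all four letters stay roughly equally frequent; it is the long repeated substrings that the new generation of indexes exploits.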

Journal ArticleDOI
TL;DR: This paper presents a unified survey and comparison of the data structures that have been proposed to store and query a k-mer set.
Abstract: The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.
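The shared component the survey studies, storing and querying a k-mer set, reduces to the following minimal sketch (the example reads and k = 3 are made up; real tools replace the plain Python set with the compressed, specialized structures the survey compares):

```python
# Build the set of k-mers of a read collection, then answer membership queries.

def kmers(seq: str, k: int):
    """Yield every length-k substring (k-mer) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

reads = ["ACGTAC", "GTACGT"]
k = 3
kmer_set = set()
for read in reads:
    kmer_set.update(kmers(read, k))
# kmer_set == {"ACG", "CGT", "GTA", "TAC"}

# Membership queries against the k-mer set.
assert "ACG" in kmer_set
assert "AAA" not in kmer_set
```

Note how the two overlapping reads contribute the same four k-mers: deduplication like this is one reason a set, rather than a list, is the natural abstraction, and why memory-efficient set representations matter at genome scale.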