Showing papers on "String (computer science)" published in 2018

Posted Content•

MolGAN: An implicit generative model for small molecular graphs

[...]

30 May 2018-arXiv: Machine Learning

TL;DR: MolGAN is introduced, an implicit, likelihood-free generative model for small molecular graphs that circumvents the need for expensive graph matching procedures or node ordering heuristics of previous likelihood-based methods.

Abstract: Deep generative models for graph-structured data offer a new angle on the problem of chemical synthesis: by optimizing differentiable models that directly generate molecular graphs, it is possible to side-step expensive search procedures in the discrete and vast space of chemical structures. We introduce MolGAN, an implicit, likelihood-free generative model for small molecular graphs that circumvents the need for expensive graph matching procedures or node ordering heuristics of previous likelihood-based methods. Our method adapts generative adversarial networks (GANs) to operate directly on graph-structured data. We combine our approach with a reinforcement learning objective to encourage the generation of molecules with specific desired chemical properties. In experiments on the QM9 chemical database, we demonstrate that our model is capable of generating close to 100% valid compounds. MolGAN compares favorably both to recent proposals that use string-based (SMILES) representations of molecules and to a likelihood-based method that directly generates graphs, albeit being susceptible to mode collapse.

631 citations

Proceedings Article•DOI•

Edit Probability for Scene Text Recognition

[...]

Fan Bai¹, Zhanzhan Cheng, Yi Niu, Shiliang Pu, Shuigeng Zhou¹•Institutions (1)

Fudan University¹

18 Jun 2018

TL;DR: The authors propose a novel method called edit probability (EP) for scene text recognition, which tries to estimate the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while considering the possible occurrences of missing/superfluous characters.

Abstract: We consider the scene text recognition problem under the attention-based encoder-decoder framework, which is the state of the art. The existing methods usually employ a frame-wise maximal likelihood loss to optimize the models. When we train the model, the misalignment between the ground truth strings and the attention's output sequences of probability distribution, which is caused by missing or superfluous characters, will confuse and mislead the training process, and consequently make the training costly and degrade the recognition accuracy. To handle this problem, we propose a novel method called edit probability (EP) for scene text recognition. EP tries to effectively estimate the probability of generating a string from the output sequence of probability distribution conditioned on the input image, while considering the possible occurrences of missing/superfluous characters. The advantage lies in that the training process can focus on the missing, superfluous and unrecognized characters, and thus the impact of the misalignment problem can be alleviated or even overcome. We conduct extensive experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets. Experimental results show that the EP can substantially boost scene text recognition performance.

151 citations

Book Chapter•DOI•

Two-Round Multiparty Secure Computation from Minimal Assumptions

[...]

Sanjam Garg¹, Akshayaram Srinivasan¹•Institutions (1)

University of California¹

29 Apr 2018

TL;DR: New two-round multiparty secure computation (MPC) protocols are provided under the minimal assumption that two-round oblivious transfer (OT) exists; the MPC protocol is secure against semi-honest or malicious adversaries whenever the assumed OT protocol is.

Abstract: We provide new two-round multiparty secure computation (MPC) protocols assuming the minimal assumption that two-round oblivious transfer (OT) exists. If the assumed two-round OT protocol is secure against semi-honest adversaries (in the plain model) then so is our two-round MPC protocol. Similarly, if the assumed two-round OT protocol is secure against malicious adversaries (in the common random/reference string model) then so is our two-round MPC protocol. Previously, two-round MPC protocols were only known under relatively stronger computational assumptions. Finally, we provide several extensions.

135 citations

Proceedings Article•

Distantly Supervised NER with Partial Annotation Learning and Reinforcement Learning

[...]

Yang Yaosheng¹, Wenliang Chen¹, Zhenghua Li¹, Zhengqiu He¹, Min Zhang¹•Institutions (1)

Soochow University (Suzhou)¹

01 Aug 2018

TL;DR: This paper proposes a novel approach which can partially solve the above problems of distant supervision for NER, and applies partial annotation learning to reduce the effect of unknown labels of characters in incomplete and noisy annotations.

Abstract: A bottleneck problem with Chinese named entity recognition (NER) in new domains is the lack of annotated data. One solution is to utilize the method of distant supervision, which has been widely used in relation extraction, to automatically populate annotated training data without human cost. The distant supervision assumption here is that if a string in text is included in a predefined dictionary of entities, the string might be an entity. However, this kind of auto-generated data suffers from two main problems: incomplete and noisy annotations, which affect the performance of NER models. In this paper, we propose a novel approach which can partially solve the above problems of distant supervision for NER. In our approach, to handle the incomplete problem, we apply partial annotation learning to reduce the effect of unknown labels of characters. As for noisy annotation, we design an instance selector based on reinforcement learning to distinguish positive sentences from auto-generated annotations. In experiments, we create two datasets for Chinese named entity recognition in two domains with the help of distant supervision. The experimental results show that the proposed approach obtains better performance than the comparison systems on both datasets.
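
The distant supervision assumption described above can be illustrated with a small sketch. It is not the authors' pipeline: the function name `distant_labels`, the greedy longest-match rule, and the label symbols (`B`, `I`, and `?` for "unknown") are all assumptions made for illustration. Characters covered by a dictionary entry get entity labels, and every other character is left unknown rather than marked as a non-entity, which is the situation partial annotation learning is meant to handle.

```python
def distant_labels(text: str, dictionary: set[str]) -> list[str]:
    """Label characters by greedy longest dictionary match.

    "B"/"I" mark the beginning/inside of a matched entry; "?" means the
    label is unknown (hypothetical label scheme, for illustration only).
    """
    labels = ["?"] * len(text)
    max_len = max(map(len, dictionary), default=0)
    i = 0
    while i < len(text):
        # Try the longest candidate first so "bcd" wins over "bc".
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                labels[i] = "B"
                for k in range(i + 1, i + length):
                    labels[k] = "I"
                i += length
                break
        else:
            i += 1  # no entry starts here; leave the label unknown
    return labels


print(distant_labels("abcde", {"bc", "bcd"}))
```

A string that is missing from the dictionary stays `?` everywhere, which shows why such auto-generated annotations are incomplete.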

130 citations

Journal Article•DOI•

An investigation of byte n-gram features for malware classification

[...]

Edward Raff¹, Richard Zak¹, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean, Charles Nicholas¹•Institutions (1)

University of Maryland, Baltimore County¹

01 Feb 2018-Journal of Computer Virology and Hacking Techniques

TL;DR: This work discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy, and discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways.

Abstract: Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. Byte n-grams previously have been used as features, but little work has been done to explain their performance or to understand what concepts are actually being learned. In contrast to other work using n-gram features, in this work we use orders of magnitude more data, and we perform feature selection during model building using Elastic-Net regularized Logistic Regression. We compute a regularization path and analyze novel multi-byte identifiers. Through this process, we discover significant previously unreported issues with byte n-gram features that cause their benefits and practicality to be overestimated. Three primary issues emerged from our work. First, we discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy. Second, we discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways. Finally, we demonstrate that n-gram features promote overfitting, even with linear models and extreme regularization.
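
For readers unfamiliar with the feature type, byte n-grams are simply counts of every contiguous window of n bytes in a file. The sketch below shows one minimal way to compute them. The function name `byte_ngrams` and the default `n=4` are assumptions for illustration, and the feature selection step the abstract describes is not reproduced.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every contiguous n-byte window in `data`."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slicing a bytes object yields bytes, which are hashable Counter keys.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


sample = b"MZ\x90\x00MZ\x90\x00"  # made-up bytes, not a real executable
counts = byte_ngrams(sample, n=2)
print(counts.most_common(3))
```

An 8-byte input has 7 overlapping 2-grams, so the number of features grows quickly with file size and n, which is one reason feature selection matters here.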

128 citations

Book Chapter•DOI•

A Multi-party Protocol for Constructing the Public Parameters of the Pinocchio zk-SNARK

[...]

Sean Bowe, Ariel Gabizon, Matthew Green¹•Institutions (1)

Johns Hopkins University¹

26 Feb 2018

TL;DR: Recent efficient constructions of zero-knowledge Succinct Non-interactive Arguments of Knowledge (zk-SNARKs) require a setup phase in which a common-reference string (CRS) with a certain structure is generated.

Abstract: Recent efficient constructions of zero-knowledge Succinct Non-interactive Arguments of Knowledge (zk-SNARKs) require a setup phase in which a common-reference string (CRS) with a certain structure is generated. This CRS is sometimes referred to as the public parameters of the system, and is used for constructing and verifying proofs. A drawback of these constructions is that whoever runs the setup phase subsequently possesses trapdoor information enabling them to produce fraudulent pseudoproofs.

116 citations

Proceedings Article•DOI•

Character Level based Detection of DGA Domain Names

[...]

Bin Yu, Jie Pan¹, Jiaming Hu¹, Anderson C. A. Nascimento¹, Martine De Cock¹•Institutions (1)

University of Washington¹

08 Jul 2018

TL;DR: Training and evaluating on a dataset with 2M domain names shows that there is surprisingly little difference between various convolutional neural network and recurrent neural network based architectures in terms of accuracy, prompting a preference for the simpler architectures, since they are faster to train and to score, and less prone to overfitting.

Abstract: Recently several different deep learning architectures have been proposed that take a string of characters as the raw input signal and automatically derive features for text classification. Few studies are available that compare the effectiveness of these approaches for character-based text classification with each other. In this paper we perform such an empirical comparison for the important cybersecurity problem of DGA detection: classifying domain names as either benign or produced by malware (i.e., by a Domain Generation Algorithm). Training and evaluating on a dataset with 2M domain names shows that there is surprisingly little difference between various convolutional neural network (CNN) and recurrent neural network (RNN) based architectures in terms of accuracy, prompting a preference for the simpler architectures, since they are faster to train and to score, and less prone to overfitting.
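
Character-level models of this kind take the domain name itself as input, usually after mapping each character to an integer and padding to a fixed length. The sketch below shows only that preprocessing step. The alphabet, the `max_len` default, and the choice to use id 0 for both padding and unknown characters are assumptions for illustration, not details taken from the paper.

```python
import string

# Assumed alphabet: lowercase letters, digits, hyphen and dot.
ALPHABET = string.ascii_lowercase + string.digits + "-."
# Id 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_domain(domain: str, max_len: int = 16) -> list[int]:
    """Map a domain name to a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in domain.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))  # right-pad with zeros


print(encode_domain("ab.c", max_len=6))
```

The resulting integer sequences are what an embedding layer in a CNN or RNN classifier would typically consume.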

113 citations

Journal Article•DOI•

Infrastructure assisted adaptive driving to stabilise heterogeneous vehicle strings

[...]

Meng Wang¹•Institutions (1)

Delft University of Technology¹

01 Jun 2018-Transportation Research Part C-emerging Technologies

TL;DR: This work develops a novel adaptive driving strategy for CAVs to stabilise heterogeneous vehicle strings by controlling one CAV under vehicle-to-infrastructure (V2I) communications and demonstrates the predictive power of the analytical string stability conditions.

Abstract: Literature has shown potentials of Connected/Cooperative Automated Vehicles (CAVs) in improving highway operations, especially on roadway capacity and flow stability. However, benefits were also shown to be negligible at low market penetration rates. This work develops a novel adaptive driving strategy for CAVs to stabilise heterogeneous vehicle strings by controlling one CAV under vehicle-to-infrastructure (V2I) communications. Assumed is a roadside system with V2I communications, which receives control parameters of the CAV in the string and estimates parameters imperfectly of non-connected automated vehicles. It determines the adaptive control parameters (e.g. desired time gap and feedback gains) of the CAV if a downstream disturbance is identified and sends them to the CAV. The CAV changes its behaviour based on the adaptive parameters commanded by the roadside system to suppress the disturbance. The proposed adaptive driving strategy is based on string stability analysis of heterogeneous vehicle strings. To this end, linearised vehicle dynamics model and control law are used in the controller parametrisation and Laplace transform of the speed and gap error dynamics in time domain to frequency domain enables the determination of sufficient string stability criteria of heterogeneous strings. The analytical string stability conditions give new insights into automated vehicular string stability properties in relation to the system properties of time delays and controller design parameters of feedback gains and desired time gap. It further allows the quantification of a stability margin, which is subsequently used to adapt the feedback control gains and desired time gap of the CAV to suppress the amplification of gap and speed errors through the string. Analytical results are verified via systematic simulation of both homogeneous and heterogeneous strings. Simulation demonstrates the predictive power of the analytical string stability conditions. 
The performance of the adaptive driving strategy under V2I cooperation is tested in simulation. Results show that even when the estimation of control parameters of non-connected automated vehicles is imperfect and there is a mismatch between the model used in analytical derivation and that in simulation, the proposed adaptive driving strategy suppresses disturbances in a wide range of situations.

111 citations

Journal Article•DOI•

Completely Decentralized Active Balancing Battery Management System

[...]

Damien F. Frost¹, David A. Howey¹•Institutions (1)

University of Oxford¹

01 Jan 2018-IEEE Transactions on Power Electronics

TL;DR: In this paper, the authors present a decentralized battery management system with no communication requirement based on a modular multilevel converter topology with a distributed inductor and distributed controller running on a local microprocessor.

Abstract: The performance of a string of series-connected batteries is typically restricted by the worst cell in the string and a single failure point will render the entire string unusable. To address these issues, we present a decentralized battery management system with no communication requirement based on a modular multilevel converter topology with a distributed inductor and distributed controller running on a local microprocessor. This configuration is referred to as a “smart cell.” By sensing the voltage across the local distributed inductor, each smart cell is able to: first, determine its optimal switching pattern in order to minimize the output voltage ripple; and second, adjust its duty cycle to synchronize its state of charge (SOC) with the average SOC of the series string of cells. The decentralized controller is derived using the theory of Kuramoto oscillators, and the stability of a system of smart cells is investigated. We experimentally show that a system of three smart cells with their decentralized controllers can accurately synchronize the SOC while minimizing their output voltage ripple.

100 citations

Posted Content•

Edit Probability for Scene Text Recognition

[...]

Fan Bai¹, Zhanzhan Cheng, Yi Niu, Shiliang Pu, Shuigeng Zhou¹•Institutions (1)

Fudan University¹

09 May 2018-arXiv: Computer Vision and Pattern Recognition

TL;DR: The authors propose a novel method called edit probability (EP) for scene text recognition, which tries to estimate the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while considering the possible occurrences of missing/superfluous characters.

99 citations

Journal Article•DOI•

Neural network based boundary control of a vibrating string system with input deadzone

[...]

Zhijia Zhao¹, Xiaogang Wang¹, Chunliang Zhang¹, Zhijie Liu², Jingfeng Yang•Institutions (2)

Guangzhou University¹, Beihang University²

31 Jan 2018-Neurocomputing

TL;DR: Under the proposed control, the uniformly ultimately bounded stability of the closed loop system is achieved through rigorous Lyapunov analysis without any discretization or simplification of the dynamics in the time and space.

Journal Article•DOI•

Edit Distance Cannot Be Computed in Strongly Subquadratic Time (Unless SETH is False)

[...]

Arturs Backurs, Piotr Indyk

26 Jun 2018-SIAM Journal on Computing

TL;DR: Evidence is provided that the near-quadratic running time bounds known for the problem of computing edit distance might be tight, and it is shown that if the edit distance can be computed in time $O(n^{2-\delta})$ for some constant $\delta>0$, then the satisfiability of conjunctive normal form formulas with $N$ variables and $M$ clauses can be solved in time.

Abstract: The edit distance (a.k.a. the Levenshtein distance) between two strings is defined as the minimum number of insertions, deletions, or substitutions of symbols needed to transform one string into an...
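
The definition above corresponds to the textbook dynamic program, sketched here for reference. This is the standard quadratic-time algorithm whose running time the paper argues is hard to improve substantially; it is not code from the paper, and the function name is chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(len(a) * len(b)) dynamic program."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept, so the memory use is linear in the length of `b` while the time remains quadratic.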

Proceedings Article•DOI•

Genax: a genome sequencing accelerator

[...]

Daichi Fujiki¹, Aran Subramaniyan¹, Tianjun Zhang¹, Yu Zeng¹, Reetuparna Das¹, David Blaauw¹, Satish Narayanasamy¹•Institutions (1)

University of Michigan¹

02 Jun 2018

TL;DR: GenAx is presented, an accelerator for read alignment, a time-consuming step in genome sequencing, which achieves a 31.7× speedup over the standard BWA-MEM sequence aligner running on a 56-thread dual-socket 14-core Xeon E5 server processor, while reducing power consumption and area.

Abstract: Genomics can transform health-care through precision medicine. Plummeting sequencing costs would soon make genome testing affordable to the masses. Compute efficiency, however, has to improve by orders of magnitude to sequence and analyze the raw genome data. Sequencing software used today can take several hundreds to thousands of CPU hours to align reads to a reference sequence. This paper presents GenAx, an accelerator for read alignment, a time-consuming step in genome sequencing. It consists of a seeding and seed-extension accelerator. The latter is based on an innovative automata design that was designed from the ground up to enable hardware acceleration. Unlike conventional Levenshtein automata, it is string independent and scales quadratically with edit distance, instead of string length. It supports critical features commonly used in sequencing such as affine gap scoring and traceback. GenAx provides a throughput of 4,058K reads/s for Illumina 101 bp reads. GenAx achieves a 31.7× speedup over the standard BWA-MEM sequence aligner running on a 56-thread dual-socket 14-core Xeon E5 server processor, while reducing power consumption by 12× and area by 5.6×.

Proceedings Article•DOI•

Automated essay scoring with string kernels and word embeddings.

[...]

Madalina Cozma, Andrei M. Butnaru¹, Radu Tudor Ionescu¹•Institutions (1)

University of Bucharest¹

01 Apr 2018

TL;DR: This article used bag-of-super-word embeddings for automatic essay scoring, which achieved the best performance on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings, surpassing recent state of the art deep learning approaches.

Abstract: In this work, we present an approach based on combining string kernels and word embeddings for automatic essay scoring. String kernels capture the similarity among strings based on counting common character n-grams, which are a low-level yet powerful type of feature, demonstrating state-of-the-art results in various text classification tasks such as Arabic dialect identification or native language identification. To our best knowledge, we are the first to apply string kernels to automatically score essays. We are also the first to combine them with a high-level semantic feature representation, namely the bag-of-super-word-embeddings. We report the best performance on the Automated Student Assessment Prize data set, in both in-domain and cross-domain settings, surpassing recent state-of-the-art deep learning approaches.
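
As a rough illustration of "counting common character n-grams", the sketch below implements one simple string kernel, the spectrum kernel, which sums the products of n-gram counts shared by two strings. Whether the paper uses this exact variant is not stated in the abstract, so treat the choice of kernel, the function names, and the default `n=3` as assumptions.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every contiguous character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(s: str, t: str, n: int = 3) -> int:
    """Sum, over n-grams common to both strings, of the product of their counts."""
    cs, ct = char_ngrams(s, n), char_ngrams(t, n)
    return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())


print(spectrum_kernel("abab", "ab", n=2))  # 2: "ab" occurs twice in "abab", once in "ab"
```

In practice such kernel values are usually normalized and fed to a kernel method; that step is omitted here.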

Journal Article•DOI•

Toponym matching through deep neural networks

[...]

Rui Santos¹, Patricia Murrieta-Flores², Pável Calado¹, Bruno Martins¹•Institutions (2)

Instituto Superior Técnico¹, University of Chester²

01 Feb 2018-International Journal of Geographical Information Science

TL;DR: This article presents a novel matching approach, leveraging a deep neural network to classify pairs of toponyms as either matching or nonmatching, and shows that the proposed method can significantly outperform individual similarity metrics from previous studies, as well as previous methods based on supervised machine learning for combining multiple metrics.

Abstract: Toponym matching, i.e. pairing strings that represent the same real-world location, is a fundamental problem for several practical applications. The current state-of-the-art relies on string similar...

Journal Article•DOI•

Feedforward Strategies for Cooperative Adaptive Cruise Control in Heterogeneous Vehicle Strings

[...]

Ahmed M. H. Al-Jhayyish¹, Klaus Werner Schmidt²•Institutions (2)

Çankaya University¹, Middle East Technical University²

01 Jan 2018-IEEE Transactions on Intelligent Transportation Systems

TL;DR: This paper focuses on the fulfillment of string stability in the practical case of heterogeneous vehicle strings that comprise vehicles with different dynamic properties using the idea of predecessor following, acceleration feedforward, predicted acceleration feedforward, and input signal feedforward.

Abstract: String stability is an essential property to ensure that the fluctuations are attenuated along vehicle strings. This paper focuses on the fulfillment of string stability in the practical case of heterogeneous vehicle strings that comprise vehicles with different dynamic properties. Using the idea of predecessor following, acceleration feedforward, predicted acceleration feedforward, and input signal feedforward are considered as different possible feedforward strategies. For all strategies, the parameter ranges of predecessor vehicles that ensure string stability of a given vehicle are characterized, computed, and validated by simulation.
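
For orientation, one common frequency-domain way to state the attenuation requirement is given below. It is a generic formulation rather than the specific criterion derived in the paper, and the symbols are assumptions introduced here: $A_i(s)$ is the Laplace transform of the acceleration of vehicle $i$, and $\Gamma_i(s)$ is the transfer function from the predecessor's acceleration to that of vehicle $i$.

```latex
\Gamma_i(s) = \frac{A_i(s)}{A_{i-1}(s)},
\qquad
\sup_{\omega} \bigl| \Gamma_i(j\omega) \bigr| \le 1
\quad \text{for every vehicle } i \text{ in the string.}
```

Under this condition a disturbance is not amplified at any frequency as it propagates upstream. In a heterogeneous string each $\Gamma_i$ depends on the dynamics of both vehicle $i$ and its predecessor, which is why the admissible parameter ranges of predecessor vehicles have to be characterized.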

Book Chapter•DOI•

Subversion-Zero-Knowledge SNARKs

[...]

Georg Fuchsbauer¹•Institutions (1)

PSL Research University¹

25 Mar 2018

TL;DR: SNARKs are proof systems with succinct proofs, which are at the core of the cryptocurrency Zcash, whose anonymity relies on ZK-SNARKs; they are also used for ZK contingent payments in Bitcoin.

Abstract: Subversion zero knowledge for non-interactive proof systems demands that zero knowledge (ZK) be maintained even when the common reference string (CRS) is chosen maliciously. SNARKs are proof systems with succinct proofs, which are at the core of the cryptocurrency Zcash, whose anonymity relies on ZK-SNARKs; they are also used for ZK contingent payments in Bitcoin.

Posted Content•

Fast Prefix Search in Little Space, with Applications

[...]

Djamal Belazzougui¹, Paolo Boldi², Rasmus Pagh³, Sebastiano Vigna²•Institutions (3)

Paris Diderot University¹, University of Milan², University of Copenhagen³

12 Apr 2018-arXiv: Data Structures and Algorithms

TL;DR: For very large collections stored in slow-access memory, this work proposes extremely compact data structures that solve weak prefix searches--they return the correct result only if some string in S starts with the given prefix.

Abstract: It has been shown in the indexing literature that there is an essential difference between prefix/range searches on the one hand, and predecessor/rank searches on the other hand, in that the former provably allows faster query resolution. Traditionally, prefix search is solved by data structures that are also dictionaries: they actually contain the strings in $S$. For very large collections stored in slow-access memory, we propose much more compact data structures that support weak prefix searches: they return the ranks of matching strings provided that some string in $S$ starts with the given prefix. In fact, we show that our most space-efficient data structure is asymptotically space-optimal. Previously, data structures such as String B-trees (and more complicated cache-oblivious string data structures) have implicitly supported weak prefix queries, but they all have query time that grows logarithmically with the size of the string collection. In contrast, our data structures are simple, naturally cache-efficient, and have query time that depends only on the length of the prefix, all the way down to constant query time for strings that fit in one machine word. We give several applications of weak prefix searches, including exact prefix counting and approximate counting of tuples matching conjunctive prefix conditions.
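
To make the query concrete, the sketch below implements the traditional dictionary-based baseline that the abstract contrasts against: the strings themselves are kept in sorted order and a prefix query returns the half-open range of ranks that match. It is not the compact structure proposed in the paper, its query time grows logarithmically with the collection size, and the function name `prefix_rank_range` is an assumption.

```python
from bisect import bisect_left


def prefix_rank_range(sorted_strings: list[str], prefix: str) -> tuple[int, int]:
    """Return (lo, hi) so that sorted_strings[lo:hi] are exactly the strings
    starting with `prefix`. Requires the input list to be sorted."""
    lo = bisect_left(sorted_strings, prefix)
    # From lo onward, matching strings form one contiguous block;
    # binary-search for the first string that no longer matches.
    left, right = lo, len(sorted_strings)
    while left < right:
        mid = (left + right) // 2
        if sorted_strings[mid].startswith(prefix):
            left = mid + 1
        else:
            right = mid
    return lo, left


S = ["apple", "apply", "banana", "band", "bandana"]
print(prefix_rank_range(S, "ban"))  # (2, 5)
```

Unlike a weak prefix search, this baseline also gives a correct (empty) answer when no string has the prefix, at the cost of storing every string.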

Journal Article•DOI•

Neural ParsCit: a deep learning-based reference string parser

[...]

Animesh Prasad¹, Manpreet Kaur¹, Min-Yen Kan¹•Institutions (1)

National University of Singapore¹

19 May 2018-International Journal on Digital Libraries

TL;DR: A deep learning approach for the core digital libraries task of parsing bibliographic reference strings by deploying the state-of-the-art long short-term memory (LSTM) neural network architecture, a variant of a recurrent neural network to capture long-range dependencies in reference strings.

Abstract: We present a deep learning approach for the core digital libraries task of parsing bibliographic reference strings. We deploy the state-of-the-art long short-term memory (LSTM) neural network architecture, a variant of a recurrent neural network to capture long-range dependencies in reference strings. We explore word embeddings and character-based word embeddings as an alternative to handcrafted features. We incrementally experiment with features, architectural configurations, and the diversity of the dataset. Our final model is an LSTM-based architecture, which layers a linear chain conditional random field (CRF) over the LSTM output. In extensive experiments in both English in-domain (computer science) and out-of-domain (humanities) test cases, as well as multilingual data, our results show a significant gain ($p<0.01$) over the reported state-of-the-art CRF-only-based parser.

Journal Article•DOI•

Attribute CNNs for word spotting in handwritten documents

[...]

Sebastian Sudholt¹, Gernot A. Fink¹•Institutions (1)

Technical University of Dortmund¹

01 Sep 2018-International Journal on Document Analysis and Recognition

TL;DR: By taking a probabilistic perspective on training CNNs, this work derives two different loss functions for binary and real-valued word string embeddings and proposes two different CNN architectures, specifically designed for word spotting.

Abstract: Word spotting has become a field of strong research interest in document image analysis over the last years. Recently, AttributeSVMs were proposed which predict a binary attribute representation (Almazan et al. in IEEE Trans Pattern Anal Mach Intell 36(12):2552-2566, 2014). At their time, this influential method defined the state of the art in segmentation-based word spotting. In this work, we present an approach for learning attribute representations with convolutional neural networks (CNNs). By taking a probabilistic perspective on training CNNs, we derive two different loss functions for binary and real-valued word string embeddings. In addition, we propose two different CNN architectures, specifically designed for word spotting. These architectures are able to be trained in an end-to-end fashion. In a number of experiments, we investigate the influence of different word string embeddings and optimization strategies. We show our attribute CNNs to achieve state-of-the-art results for segmentation-based word spotting on a large variety of data sets.

Book Chapter•DOI•

Non-malleable Secret Sharing for General Access Structures

[...]

Vipul Goyal¹, Ashutosh Kumar²•Institutions (2)

Carnegie Mellon University¹, University of California, Los Angeles²

19 Aug 2018

TL;DR: Goyal and Kumar (STOC'18) introduced non-malleable secret sharing (NMSS) and proposed constructions of t-out-of-n NMSS schemes; prior work on non-malleable codes in the 2 split-state model already implies 2-out-of-2 NMSS schemes.

Abstract: Goyal and Kumar (STOC’18) recently introduced the notion of non-malleable secret sharing. Very roughly, the guarantee they seek is the following: the adversary may potentially tamper with all of the shares, and still, either the reconstruction procedure outputs the original secret, or, the original secret is “destroyed” and the reconstruction outputs a string which is completely “unrelated” to the original secret. Prior works on non-malleable codes in the 2 split-state model imply constructions which can be seen as 2-out-of-2 non-malleable secret sharing (NMSS) schemes. Goyal and Kumar proposed constructions of t-out-of-n NMSS schemes. These constructions have already been shown to have a number of applications in cryptography.
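
To see why non-malleability is an extra requirement, it helps to look at a scheme that lacks it. The sketch below is plain XOR-based 2-out-of-2 secret sharing, not any construction from the paper, and the function names are assumptions. Each share alone is uniformly random, yet flipping a bit in one share flips the same bit of the reconstructed secret, so tampering yields an output that is clearly related to the original.

```python
import secrets


def share(secret: bytes) -> tuple[bytes, bytes]:
    """Split `secret` into two XOR shares; either share alone is uniformly random."""
    pad = secrets.token_bytes(len(secret))
    masked = bytes(a ^ b for a, b in zip(secret, pad))
    return pad, masked


def reconstruct(share1: bytes, share2: bytes) -> bytes:
    """Recombine two shares by XOR."""
    if len(share1) != len(share2):
        raise ValueError("shares must have equal length")
    return bytes(a ^ b for a, b in zip(share1, share2))


s1, s2 = share(b"\x00")
tampered = bytes([s1[0] ^ 0x01])   # adversary flips the lowest bit of one share
print(reconstruct(tampered, s2))   # b'\x01': predictably related to the secret
```

A non-malleable scheme is meant to rule out exactly this kind of predictable relationship between the tampered output and the original secret.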

Proceedings Article•

Optimal dynamic strings

[...]

Paweł Gawrychowski¹, Adam Karczmarz², Tomasz Kociumaka², Jakub Łącki³, Piotr Sankowski²•Institutions (3)

University of Wrocław¹, University of Warsaw², Google³

07 Jan 2018

TL;DR: This paper presents an efficient data structure for maintaining a dynamic collection of strings under concatenation, splitting, comparison, and longest-common-prefix queries, and proves that even if the only possible query is checking equality of two strings, either updates or queries must take amortized $\Omega(\log n)$ time; hence the implementation is optimal.

Abstract: In this paper, we study the fundamental problem of maintaining a dynamic collection of strings under the following operations: • make_string - add a string of constant length, • concat - concatenate two strings, • split - split a string into two at a given position, • compare - find the lexicographical order (less, equal, greater) between two strings, • LCP - calculate the longest common prefix of two strings. We develop a generic framework for dynamizing the recompression method recently introduced by Jez [J. ACM, 2016]. It allows us to present an efficient data structure for the above problem, where an update requires only O(log n) worst-case time with high probability, with n being the total length of all strings in the collection, and a query takes constant worst-case time. On the lower bound side, we prove that even if the only possible query is checking equality of two strings, either updates or queries must take amortized Ω(log n) time; hence our implementation is optimal.
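
The interface listed above can be pinned down with a naive reference implementation. The sketch below is only that: it stores each string explicitly, so every operation costs time linear in the string lengths, and it has none of the recompression machinery or the bounds claimed in the paper. The class name, the use of integer handles, and the choice to keep old strings valid after `concat` and `split` are assumptions for illustration.

```python
class NaiveDynamicStrings:
    """Linear-time reference semantics for the five operations (not the paper's structure)."""

    def __init__(self) -> None:
        self._strings: list[str] = []

    def _add(self, s: str) -> int:
        self._strings.append(s)
        return len(self._strings) - 1  # handle of the new string

    def make_string(self, s: str) -> int:
        return self._add(s)

    def get(self, handle: int) -> str:
        return self._strings[handle]

    def concat(self, a: int, b: int) -> int:
        return self._add(self._strings[a] + self._strings[b])

    def split(self, a: int, pos: int) -> tuple[int, int]:
        s = self._strings[a]
        if not 0 <= pos <= len(s):
            raise IndexError("split position out of range")
        return self._add(s[:pos]), self._add(s[pos:])

    def compare(self, a: int, b: int) -> int:
        """Return -1, 0 or 1 for less, equal, greater in lexicographic order."""
        x, y = self._strings[a], self._strings[b]
        return (x > y) - (x < y)

    def lcp(self, a: int, b: int) -> int:
        """Length of the longest common prefix of the two strings."""
        n = 0
        for cx, cy in zip(self._strings[a], self._strings[b]):
            if cx != cy:
                break
            n += 1
        return n


coll = NaiveDynamicStrings()
x, y = coll.make_string("abra"), coll.make_string("abba")
z = coll.concat(x, y)
print(coll.get(z), coll.lcp(x, y), coll.compare(x, y))
```

A reference like this is mainly useful as a test oracle when checking a faster implementation against the same sequence of operations.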

Journal Article•DOI•

Document spanners: from expressive power to decision problems

[...]

Dominik D. Freydenberger¹, Mario Holldack²•Institutions (2)

Loughborough University¹, Goethe University Frankfurt²

01 May 2018-Theory of Computing Systems / Mathematical Systems Theory

TL;DR: This work examines document spanners, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren, and compares the expressive power of core spanners to three models – namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator).


Abstract: We examine document spanners, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models – namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.
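The idea of mapping matched subexpressions to spans can be illustrated with Python's `re` module. This is only a rough sketch: `re.finditer` returns non-overlapping leftmost matches, whereas a regex formula in the spanner framework is defined over all possible matches, and Python spans are 0-based half-open intervals. The function names `extract_spans` and `select_equal` are hypothetical.

```python
import re


def extract_spans(pattern, document, variables):
    """Return a relation (list of tuples) with one span per named group.

    Approximation only: finditer yields non-overlapping matches, not every
    match that the formal spanner semantics would consider.
    """
    return [tuple(m.span(v) for v in variables)
            for m in re.finditer(pattern, document)]


def select_equal(rows, document, i, j):
    """String equality selection: keep rows whose columns i and j
    cover equal substrings (the spans themselves may differ)."""
    return [row for row in rows
            if document[row[i][0]:row[i][1]] == document[row[j][0]:row[j][1]]]


doc = "ab-ab cd-ce"
rows = extract_spans(r"(?P<x>[a-z]+)-(?P<y>[a-z]+)", doc, ["x", "y"])
# rows == [((0, 2), (3, 5)), ((6, 8), (9, 11))]
same = select_equal(rows, doc, 0, 1)
# same == [((0, 2), (3, 5))]
```

The selection step compares the substrings that two spans cover rather than the spans themselves, which is what distinguishes string equality selection from ordinary relational selection.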


Journal Article•DOI•

Fractional-order-based ACC/CACC algorithm for improving string stability


Carlos Flores¹,², Vicente Milanés¹,²•Institutions (2)

French Institute for Research in Computer Science and Automation¹, Renault²

01 Oct 2018-Transportation Research Part C-emerging Technologies

TL;DR: This paper presents fractional-order-based control algorithms that enhance the car-following and string stability performance of both ACC and CACC vehicle strings, including communication temporal delay effects.


Abstract: Traffic flow optimization and driver comfort enhancement are the main contributions of an Adaptive Cruise Control (ACC) system. If communication links are added, more safety and shorter gaps can be reached performing a Cooperative-ACC (CACC). Although shortening the inter-vehicular distances directly improves traffic flow, it can cause string unstable behavior. This paper presents fractional-order-based control algorithms to enhance the car-following and string stability performance for both ACC and CACC vehicle strings, including communication temporal delay effects. The proposed controller is compared with state-of-the-art implementations, exhibiting better performance. Simulation and real experiments have been conducted for validating the approach.


Posted Content•

Topological Data Analysis for the String Landscape


Alex Cole¹, Gary Shiu¹•Institutions (1)

University of Wisconsin-Madison¹

17 Dec 2018-arXiv: High Energy Physics - Theory

TL;DR: In this article, the authors use persistent homology to characterize distributions of Type IIB flux vacua on moduli space for three examples: the rigid Calabi-Yau, a hypersurface in weighted projective space, and the symmetric six-torus.


Abstract: Persistent homology computes the multiscale topology of a data set by using a sequence of discrete complexes. In this paper, we propose that persistent homology may be a useful tool for studying the structure of the landscape of string vacua. As a scaled-down version of the program, we use persistent homology to characterize distributions of Type IIB flux vacua on moduli space for three examples: the rigid Calabi-Yau, a hypersurface in weighted projective space, and the symmetric six-torus $T^6=(T^2)^3$. These examples suggest that persistence pairing and multiparameter persistence contain useful information for characterization of the landscape in addition to the usual information contained in standard persistent homology. We also study how restricting to special vacua with phenomenologically interesting low-energy properties affects the topology of a distribution.


Journal Article•DOI•

SQL Injection Attack classification through the feature extraction of SQL query strings using a Gap-Weighted String Subsequence Kernel


Paul R. McWhirter¹, Kashif Kifayat¹, Qi Shi¹, Bob Askwith¹•Institutions (1)

Liverpool John Moores University¹

25 Apr 2018

TL;DR: This paper presents a novel solution of classifying SQL queries purely on the features of the initial query string using a Gap-Weighted String Subsequence Kernel algorithm and a Support Vector Machine trained on the similarity metrics between known query strings.


Abstract: SQL Injection Attacks are one of the most common methods behind data security breaches. Previous research has attempted to produce viable detection solutions in order to filter SQL Injection Attacks from regular queries. Unfortunately it has proven to be a challenging problem with many solutions suffering from disadvantages such as being unable to process in real time as a preventative solution, a lack of adaptability to differing types of attack and the requirement for access to difficult-to-obtain information about the source application. This paper presents a novel solution of classifying SQL queries purely on the features of the initial query string. A Gap-Weighted String Subsequence Kernel algorithm is implemented to identify subsequences of shared characters between query strings for the output of a similarity metric. Finally a Support Vector Machine is trained on the similarity metrics between known query strings which are then used to classify unknown test queries. By gathering all feature data from the query strings, additional information from the source application is not required. The probabilistic nature of the learned models allows the solution to adapt to new threats whilst in operation. The proposed solution is evaluated using a number of test datasets derived from the Amnesia testbed datasets. The demonstration software achieved 97.07% accuracy for Select type queries and 92.48% accuracy for Insert type queries. This limited success rate is due to unsanitised quotation marks within legitimate inputs confusing the feature extraction. Using a test dataset that denies legitimate queries the use of unsanitised quotation marks, the Select and Insert query accuracy rose.
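For readers unfamiliar with the similarity measure, the following is a sketch of a gap-weighted string subsequence kernel using the standard dynamic-programming recursion of Lodhi et al. It is not the authors' implementation; the function names, the subsequence length `n`, and the decay parameter `lam` are assumptions chosen for illustration. The kernel sums, over all common subsequences of length `n`, a weight of `lam` raised to the total length spanned by the subsequence in both strings, so gappy matches count less.

```python
import math


def gap_weighted_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel K_n(s, t) with decay lam, n >= 1."""
    S, T = len(s), len(t)
    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (T + 1) for _ in range(S + 1)] for _ in range(n)]
    for a in range(S + 1):
        for b in range(T + 1):
            kp[0][a][b] = 1.0  # K'_0 is 1 for every pair of prefixes
    for i in range(1, n):
        for a in range(1, S + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b])
            for b in range(1, T + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Close each subsequence on a matching final character pair.
    k = 0.0
    for a in range(1, S + 1):
        for b in range(1, T + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k


def normalized_kernel(s, t, n, lam):
    """Kernel normalised by self-similarity (0.0 if either is zero)."""
    denom = math.sqrt(gap_weighted_kernel(s, s, n, lam) * gap_weighted_kernel(t, t, n, lam))
    return gap_weighted_kernel(s, t, n, lam) / denom if denom else 0.0
```

For instance, with `n = 2` the strings "cat" and "car" share only the subsequence "ca", which spans two characters in each string, so the kernel value is `lam ** 4`. A matrix of such values between known query strings is the kind of input a kernel-based classifier such as a Support Vector Machine can be trained on.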


Posted Content•DOI•

Viruses.STRING: A virus-host protein-protein interaction database


Helen Cook¹, Nadezhda Tsankova Doncheva¹, Damian Szklarczyk², Christian von Mering², Lars Juhl Jensen¹•Institutions (2)

University of Copenhagen¹, University of Zurich²

20 Aug 2018-bioRxiv

TL;DR: This work introduces Viruses.STRING, a protein–protein interaction database specifically catering to virus-virus and virus-host interactions, which combines evidence from experimental and text-mining channels to provide combined probabilities for interactions between viral and host proteins.


Abstract: As viruses continue to pose risks to global health, having a better understanding of virus–host protein–protein interactions aids in the development of treatments and vaccines. Here, we introduce Viruses.STRING, a protein–protein interaction database specifically catering to virus-virus and virus-host interactions. This database combines evidence from experimental and text-mining channels to provide combined probabilities for interactions between viral and host proteins. The database contains 177,425 interactions between 239 viruses and 319 hosts. The database is publicly available at viruses.string-db.org, and the interaction data can also be accessed through the latest version of the Cytoscape STRING app.


Journal Article•DOI•

EERTREE: An efficient data structure for processing palindromes in strings


Mikhail Rubinchik¹, Arseny M. Shur¹•Institutions (1)

Ural Federal University¹

01 Feb 2018-European Journal of Combinatorics

TL;DR: A new linear-size data structure is proposed that provides fast access to all palindromic substrings of a string or a set of strings; it inherits some ideas from the construction of both the suffix trie and the suffix tree.
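As an illustration of the data structure, here is a compact Python sketch of the commonly described online construction of an eertree (palindromic tree) for a single string. It follows the usual textbook formulation with two roots and suffix links and is not taken from the paper itself; the class and method names are assumptions.

```python
class Eertree:
    """One node per distinct palindromic substring, plus two roots."""

    def __init__(self):
        # Node 0: imaginary root (length -1); node 1: empty palindrome (length 0).
        self.length = [-1, 0]
        self.link = [0, 0]       # suffix links to the longest proper palindromic suffix
        self.edges = [{}, {}]    # edges[v][c] is the node for the palindrome c + v + c
        self.s = []              # characters processed so far
        self.last = 1            # node of the longest palindromic suffix of s

    def _get_link(self, v):
        """Follow suffix links from v until the palindrome can be extended
        on both sides by the newest character."""
        i = len(self.s) - 1
        while True:
            j = i - 1 - self.length[v]
            if j >= 0 and self.s[j] == self.s[i]:
                return v
            v = self.link[v]

    def add(self, ch):
        """Append ch; return True if a new distinct palindrome appeared."""
        self.s.append(ch)
        v = self._get_link(self.last)
        if ch in self.edges[v]:
            self.last = self.edges[v][ch]
            return False
        new = len(self.length)
        self.length.append(self.length[v] + 2)
        self.edges.append({})
        if self.length[new] == 1:
            self.link.append(1)  # single characters link to the empty palindrome
        else:
            self.link.append(self.edges[self._get_link(self.link[v])][ch])
        self.edges[v][ch] = new
        self.last = new
        return True

    def palindromes(self):
        """List all distinct non-empty palindromic substrings."""
        out, stack = [], [(0, None), (1, "")]
        while stack:
            v, p = stack.pop()
            for ch, w in self.edges[v].items():
                q = ch if v == 0 else ch + p + ch
                out.append(q)
                stack.append((w, q))
        return out
```

Since each appended character creates at most one new node, the number of nodes (minus the two roots) equals the number of distinct palindromic substrings, which is at most the length of the string.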


Book Chapter•DOI•

The Satisfiability of Word Equations: Decidable and Undecidable Theories


Joel D. Day¹, Vijay Ganesh², Paul He², Florin Manea¹, Dirk Nowotka¹•Institutions (2)

University of Kiel¹, University of Waterloo²

24 Sep 2018

TL;DR: It is shown that, when extended with several natural predicates on words, the existential fragment of the theory of word equations becomes undecidable, and that deciding whether solutions exist for a restricted class of equations, augmented with many of the predicates leading to undecidability in the general case, is possible in non-deterministic polynomial time.


Abstract: The study of word equations is a central topic in mathematics and theoretical computer science. Recently, the question of whether a given word equation, augmented with various constraints/extensions, has a solution has gained critical importance in the context of string SMT solvers for security analysis. We consider the decidability of this question in several natural variants and thus shed light on the boundary between decidability and undecidability for many fragments of the first order theory of word equations and their extensions. In particular, we show that when extended with several natural predicates on words, the existential fragment becomes undecidable. On the other hand, the positive $\varSigma _2$ fragment is decidable, and in the case that at most one terminal symbol appears in the equations, remains so even when length constraints are added. Moreover, if negation is allowed, it is possible to model arbitrary equations with length constraints using only equations containing a single terminal symbol and length constraints. Finally, we show that deciding whether solutions exist for a restricted class of equations, augmented with many of the predicates leading to undecidability in the general case, is possible in non-deterministic polynomial time.
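To show what a word equation and a solution look like, here is a small bounded brute-force search in Python. It only tries substitutions up to a fixed word length, so a `None` result means "no solution within the bound" and says nothing about satisfiability in general; it is unrelated to the decision procedures studied in the chapter. The function name and the encoding of equations as lists of symbols are assumptions.

```python
from itertools import product


def solve_word_equation(lhs, rhs, variables, alphabet, max_len):
    """Search for a substitution making lhs and rhs equal as words.

    lhs, rhs: lists of symbols; symbols listed in `variables` are variables,
    all other symbols are terminals. Only words of length <= max_len are tried.
    """
    candidates = [""]
    for n in range(1, max_len + 1):
        candidates += ["".join(p) for p in product(alphabet, repeat=n)]
    names = sorted(variables)
    for values in product(candidates, repeat=len(names)):
        sub = dict(zip(names, values))
        left = "".join(sub.get(x, x) for x in lhs)
        right = "".join(sub.get(x, x) for x in rhs)
        if left == right:
            return sub
    return None


# X ab = ba X has the solution X = "b" (both sides become "bab").
sol = solve_word_equation(["X", "a", "b"], ["b", "a", "X"], {"X"}, "ab", 3)
# X a = b X has no solution: the left side always has one more "a".
none = solve_word_equation(["X", "a"], ["b", "X"], {"X"}, "ab", 3)
```

A length constraint, one of the extensions mentioned in the abstract, could be added to such a search as an extra filter on the candidate substitution, for example requiring `len(sub["X"])` to equal a given value.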


Journal Article•DOI•

Scalable Input-to-State Stability for Performance Analysis of Large-Scale Networks


Bart Besselink¹, Steffi Knorn²•Institutions (2)

University of Groningen¹, Uppsala University²

01 Jun 2018

TL;DR: This letter investigates networks of interconnected systems and introduces the notion of “scalable input-to-state stability” (sISS), which can be interpreted as an extension of the well-known concept of string stability from simple line graphs to general graphs.


Abstract: This letter investigates networks of interconnected systems and introduces the notion of “scalable input-to-state stability” (sISS). This concept is based on input-to-state stability (ISS) and can be interpreted as an extension of the well-known concept of string stability from simple line graphs to general graphs. It guarantees that the trajectories of all states are bounded at all times independently of the network’s size and structure and can hence be regarded as an important performance notion. Further, sufficient conditions are derived to guarantee sISS of homogeneous networks with well-defined interconnection structures. In fact, the conditions depend on local ISS Lyapunov functions but guarantee the global condition of sISS. Hence, a first step is made towards developing suitable extensions of string stability to general networks. Two examples are discussed to illustrate the theoretical result.

