
Showing papers by "Ming-Yang Kao" published in 2006


Proceedings ArticleDOI
Zhichun Li, Manan Sanghi, Yan Chen, Ming-Yang Kao, B. Chavez
21 May 2006
TL;DR: Hamsa is proposed, a network-based automated signature generation system for polymorphic worms which is fast, noise-tolerant and attack-resilient, and significantly outperforms Polygraph in terms of efficiency, accuracy, and attack resilience.
Abstract: Zero-day polymorphic worms pose a serious threat to the security of Internet infrastructures. Given their rapid propagation, it is crucial to detect them at edge networks and automatically generate signatures in the early stages of infection. Most existing approaches for automatic signature generation need host information and are thus not applicable for deployment on high-speed network links. In this paper, we propose Hamsa, a network-based automated signature generation system for polymorphic worms which is fast, noise-tolerant and attack-resilient. Essentially, we propose a realistic model to analyze the invariant content of polymorphic worms which allows us to make analytical attack-resilience guarantees for the signature generation algorithm. Evaluation based on a range of polymorphic worms and polymorphic engines demonstrates that Hamsa significantly outperforms Polygraph (J. Newsome et al., 2005) in terms of efficiency, accuracy, and attack resilience.

313 citations


Proceedings ArticleDOI
22 Jan 2006
TL;DR: This work suggests that temperature change can constitute a natural, dynamic method for providing input to self-assembly systems that is potentially superior to the current technique of designing large tile sets with specific inputs hardwired into the tileset.
Abstract: We consider the tile self-assembly model and how tile complexity can be eliminated by permitting the temperature of the self-assembly system to be adjusted throughout the assembly process. To do this, we propose novel techniques for designing tile sets that permit an arbitrary length m binary number to be encoded into a sequence of O(m) temperature changes such that the tile set uniquely assembles a supertile that precisely encodes the corresponding binary number. As an application, we show how this provides a general tile set of size O(1) that is capable of uniquely assembling essentially any n × n square, where the assembled square is determined by a temperature sequence of length O(log n) that encodes a binary description of n. This yields an important decrease in tile complexity from the required Ω(log n/log log n) for almost all n when the temperature of the system is fixed. We further show that for almost all n, no tile system can simultaneously achieve both o(log n) temperature complexity and o(log n/log log n) tile complexity, showing that both versions of an optimal square building scheme have been discovered. This work suggests that temperature change can constitute a natural, dynamic method for providing input to self-assembly systems that is potentially superior to the current technique of designing large tile sets with specific inputs hardwired into the tileset.

95 citations


Posted Content
TL;DR: In this paper, the authors consider the tile self-assembly model and show how to reduce tile complexity by allowing the temperature of the self-assembling system to be adjusted throughout the assembly process.
Abstract: We consider the tile self-assembly model and how tile complexity can be eliminated by permitting the temperature of the self-assembly system to be adjusted throughout the assembly process. To do this, we propose novel techniques for designing tile sets that permit an arbitrary length $m$ binary number to be encoded into a sequence of $O(m)$ temperature changes such that the tile set uniquely assembles a supertile that precisely encodes the corresponding binary number. As an application, we show how this provides a general tile set of size $O(1)$ that is capable of uniquely assembling essentially any $n\times n$ square, where the assembled square is determined by a temperature sequence of length $O(\log n)$ that encodes a binary description of $n$. This yields an important decrease in tile complexity from the required $\Omega(\frac{\log n}{\log\log n})$ for almost all $n$ when the temperature of the system is fixed. We further show that for almost all $n$, no tile system can simultaneously achieve both $o(\log n)$ temperature complexity and $o(\frac{\log n}{\log\log n})$ tile complexity, showing that both versions of an optimal square building scheme have been discovered. This work suggests that temperature change can constitute a natural, dynamic method for providing input to self-assembly systems that is potentially superior to the current technique of designing large tile sets with specific inputs hardwired into the tileset.

86 citations


Journal ArticleDOI
TL;DR: A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements, and two algorithms for finding an optimal tile path composed of longer sequence tiles are developed.
Abstract: A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.
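
To make the dynamic programming idea concrete, here is a minimal sketch under simplifying assumptions; it is not the paper's algorithm. It assumes the sequence has already been reduced to a boolean mask marking repetitive positions, and it maximizes the number of bases covered by non-overlapping tiles that lie wholly inside non-repetitive stretches. The names `best_tile_coverage`, `repeat_mask`, `min_len`, and `max_len` are hypothetical.

```python
def best_tile_coverage(repeat_mask, min_len, max_len):
    """Max number of bases covered by non-overlapping tiles that avoid repeats.

    repeat_mask[i] is True when position i lies in a repetitive element.
    Every tile must have a length in [min_len, max_len] (hypothetical parameters).
    """
    n = len(repeat_mask)

    # run[i] = length of the non-repetitive run that ends at position i - 1.
    run = [0] * (n + 1)
    for i in range(1, n + 1):
        run[i] = 0 if repeat_mask[i - 1] else run[i - 1] + 1

    # best[i] = max bases covered using only the first i positions.
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option 1: leave position i - 1 uncovered
        longest = min(max_len, run[i])
        for length in range(min_len, longest + 1):
            # option 2: end a tile of this length at position i - 1
            best[i] = max(best[i], best[i - length] + length)
    return best[n]
```

For a fixed range of tile sizes, the table is filled in time and space linear in the sequence length, which is the flavor of bound the abstract reports for its first algorithm.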

60 citations


Proceedings ArticleDOI
23 Apr 2006
TL;DR: Both the analytical and experimental results show that the proposed reverse hashing scheme is able to achieve online traffic monitoring and accurate change/intrusion detection over massive data streams on high speed links, all in a manner that scales to large key space size.
Abstract: A key function for network traffic monitoring and analysis is the ability to perform aggregate queries over multiple data streams. Change detection is an important primitive which can be extended to construct many aggregate queries. The recently proposed sketches (Krishnamurthy, 2003) are among the very few that can detect heavy changes online for high speed links, and thus support various aggregate queries in both temporal and spatial domains. However, they do not preserve the keys (e.g., source IP address) of flows, making it difficult to reconstruct the desired set of anomalous keys. In an earlier abstract we proposed a framework for a reversible sketch data structure that offers hope for efficient extraction of keys (Schweller, 2004). However, this scheme is only able to detect a single heavy change key and places restrictions on the statistical properties of the key space. To address these challenges, we propose an efficient reverse hashing scheme to infer the keys of culprit flows from reversible sketches. There are two phases. The first operates online, recording the packet stream in a compact representation with negligible extra memory and few extra memory accesses. Our prototype single FPGA board implementation can achieve a throughput of over 16 Gbps for 40-byte-packet streams (the worst case). The second phase identifies heavy changes and their keys from the representation in nearly real time. We evaluate our scheme using traces from large edge routers with OC-12 or higher links. Both the analytical and experimental results show that we are able to achieve online traffic monitoring and accurate change/intrusion detection over massive data streams on high speed links, all in a manner that scales to large key space size. To the best of our knowledge, our system is the first to achieve these properties simultaneously.
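
The limitation that motivates reverse hashing can be seen in a small sketch of sketch-based change detection. The code below is an illustrative k-ary-style sketch, not the paper's reversible sketch or its reverse hashing scheme; the class name `KarySketch`, the SHA-256-based row hashes, and the default sizes are all assumptions made for the example.

```python
import hashlib
from statistics import median


class KarySketch:
    """Small k-ary-style sketch: `rows` hash tables with `buckets` counters each."""

    def __init__(self, rows=5, buckets=64):
        self.rows = rows
        self.buckets = buckets
        self.table = [[0.0] * buckets for _ in range(rows)]

    def _bucket(self, row, key):
        # Per-row hash derived from SHA-256 (a stand-in for fast hardware hashes).
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.buckets

    def update(self, key, value):
        """Add `value` to the counter that `key` maps to in every row."""
        for row in range(self.rows):
            self.table[row][self._bucket(row, key)] += value

    def minus(self, other):
        """Counter-wise difference of two sketches with identical shape."""
        diff = KarySketch(self.rows, self.buckets)
        for row in range(self.rows):
            diff.table[row] = [a - b for a, b in zip(self.table[row], other.table[row])]
        return diff

    def estimate(self, key):
        """Median over rows of the bias-corrected counter for `key`."""
        per_row = []
        for row in range(self.rows):
            total = sum(self.table[row])
            value = self.table[row][self._bucket(row, key)]
            per_row.append((value - total / self.buckets) / (1 - 1 / self.buckets))
        return median(per_row)
```

Subtracting the sketches of two time intervals and calling `estimate(key)` reveals how much a given key changed, but only for a key the caller already knows. Recovering the culprit keys from the counters alone is the problem the abstract's reverse hashing addresses.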

47 citations


Book ChapterDOI
11 Sep 2006
TL;DR: A linear-time algorithm, which is optimal, is presented to solve the haplotype inference problem for pedigree data when there are no recombinations and the pedigree has no mating loops.
Abstract: In this paper, a linear-time algorithm, which is optimal, is presented to solve the haplotype inference problem for pedigree data when there are no recombinations and the pedigree has no mating loops. The approach is based on the use of graphs to capture SNP, Mendelian and parity constraints of the given pedigree.

26 citations


Book ChapterDOI
29 May 2006
TL;DR: An approximation algorithm is provided for this bottleneck version of the Traveling Salesman Problem by exploiting the underlying geometry in a novel fashion, achieving an approximation ratio of (2+γ), where γ ≥ f(x)/g(x) ≥ 1/γ for all x; when f(x)=g(x), the approximation ratio is 3.
Abstract: Consider a truck running along a road. It picks up a load Li at point βi and delivers it at αi, carrying at most one load at a time. The speed on the various parts of the road in one direction is given by f(x) and that in the other direction is given by g(x). Minimizing the total time spent to deliver loads L1,...,Ln is equivalent to solving the Traveling Salesman Problem (TSP) where the cities correspond to the loads Li with coordinates (αi, βi) and the distance from Li to Lj is given by $\int^{\beta_j}_{\alpha_i} f(x)dx$ if βj ≥ αi and by $\int^{\alpha_i}_{\beta_j} g(x)dx$ if βj < αi. This case of TSP is polynomially solvable with significant real-world applications. Gilmore and Gomory obtained a polynomial time solution for this TSP [6]. However, the bottleneck version of the problem (BTSP) was left open. Recently, Vairaktarakis showed that BTSP with this distance metric is NP-complete [10]. We provide an approximation algorithm for this BTSP by exploiting the underlying geometry in a novel fashion. This also allows for an alternate analysis of Gilmore and Gomory's polynomial time algorithm for the TSP. We achieve an approximation ratio of (2+γ) where $\gamma \geq \frac{f(x)}{g(x)} \geq \frac{1}{\gamma} \; \forall x$. Note that when f(x)=g(x), the approximation ratio is 3.
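
The distance function in the abstract, and the difference between the TSP and BTSP objectives, can be sketched as follows. This is only an illustration of the definitions, not the approximation algorithm; the helper names (`integrate`, `distance`, `tour_costs`) and the midpoint-rule integration are assumptions chosen for the example.

```python
def integrate(func, lo, hi, steps=1000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (k + 0.5) * width) for k in range(steps)) * width


def distance(load_i, load_j, f, g):
    """Distance from load i to load j; each load is a pair (alpha, beta)."""
    alpha_i, _ = load_i
    _, beta_j = load_j
    if beta_j >= alpha_i:
        return integrate(f, alpha_i, beta_j)
    return integrate(g, beta_j, alpha_i)


def tour_costs(loads, order, f, g):
    """Return (total, bottleneck) cost of visiting the loads in cyclic order."""
    legs = [
        distance(loads[order[k]], loads[order[(k + 1) % len(order)]], f, g)
        for k in range(len(order))
    ]
    # TSP minimizes the sum of the legs; BTSP minimizes the largest leg.
    return sum(legs), max(legs)
```

With f(x) = 1 and g(x) = 2, the loads (0, 1), (2, 3), (4, 0) visited in that order give legs of 3, 4, and 6, so the total is 13 and the bottleneck is 6.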

8 citations


Journal Article
TL;DR: This work considers a generalization of the code word design problem in which an input graph is given which must be labeled with equal length binary strings of minimal length such that the Hamming distance is small between words of adjacent nodes and large between words of non-adjacent nodes.
Abstract: Motivated by emerging applications for DNA code word design, we consider a generalization of the code word design problem in which an input graph is given which must be labeled with equal length binary strings of minimal length such that the Hamming distance is small between words of adjacent nodes and large between words of non-adjacent nodes. For general graphs we provide algorithms that bound the word length with respect to either the maximum degree of any vertex or the number of edges in either the input graph or its complement. We further provide multiple types of recursive, deterministic algorithms for trees and forests, and provide an improvement for forests that makes use of randomization.
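
The labeling constraint can be stated as a short checker. The sketch below only verifies a given labeling; it does not construct one, and it is not any of the algorithms from the paper. The thresholds `near` and `far`, which make "small" and "large" precise, are hypothetical parameters.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def is_valid_labeling(labels, edges, near, far):
    """Check a node -> binary word labeling against the graph.

    Adjacent nodes must be within Hamming distance `near`; non-adjacent
    nodes must be at distance at least `far` (both thresholds are assumed).
    """
    if len({len(word) for word in labels.values()}) > 1:
        return False  # all words must have equal length
    edge_set = {frozenset(edge) for edge in edges}
    for a, b in combinations(labels, 2):
        d = hamming(labels[a], labels[b])
        if frozenset((a, b)) in edge_set:
            if d > near:
                return False
        elif d < far:
            return False
    return True
```

For the path a-b-c, the words 000, 001, 011 satisfy `near=1, far=2`: each edge has distance 1 and the non-adjacent pair a, c has distance 2.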

5 citations


Book ChapterDOI
18 Dec 2006
TL;DR: In this article, a generalization of the code word design problem is considered, in which an input graph is given which must be labeled with equal length binary strings of minimal length such that the Hamming distance is small between words of adjacent nodes and large between words of non-adjacent nodes.
Abstract: Motivated by emerging applications for DNA code word design, we consider a generalization of the code word design problem in which an input graph is given which must be labeled with equal length binary strings of minimal length such that the Hamming distance is small between words of adjacent nodes and large between words of non-adjacent nodes. For general graphs we provide algorithms that bound the word length with respect to either the maximum degree of any vertex or the number of edges in either the input graph or its complement. We further provide multiple types of recursive, deterministic algorithms for trees and forests, and provide an improvement for forests that makes use of randomization.

5 citations


Journal Article
TL;DR: In this article, a polynomial-time greedy algorithm with approximation ratio 6 was proposed for the smallest common AoN-supertree problem, which aims to find the smallest possible node-labeled rooted tree (denoted LCST) such that every tree Ti in T is an all-or-nothing subtree of LCST.
Abstract: A node-labeled rooted tree T (with root r) is an all-or-nothing subtree (called AoN-subtree) of a node-labeled rooted tree T' if (1) T is a subtree of the tree rooted at some node u (with the same label as r) of T', (2) for each internal node v of T, all the neighbors of v in T' are the neighbors of v in T. Tree T' is then called an AoN-supertree of T. Given a set T = {T1, T2, ..., Tn} of n node-labeled rooted trees, the smallest common AoN-supertree problem seeks the smallest possible node-labeled rooted tree (denoted as LCST) such that every tree Ti in T is an AoN-subtree of LCST. It generalizes the smallest superstring problem and it has applications in glycobiology. We present a polynomial-time greedy algorithm with approximation ratio 6.

1 citation


Book ChapterDOI
18 Dec 2006
TL;DR: The smallest common AoN-supertree problem seeks the smallest possible node-labeled rooted tree (denoted as ${\textbf{LCST}}$) such that every tree Ti in ${\mathcal {T}}$ is an AoN-subtree of ${\textbf{LCST}}$.
Abstract: A node-labeled rooted tree T (with root r) is an all-or-nothing subtree (called AoN-subtree) of a node-labeled rooted tree T′ if (1) T is a subtree of the tree rooted at some node u (with the same label as r) of T′, (2) for each internal node v of T, all the neighbors of v in T′ are the neighbors of v in T. Tree T′ is then called an AoN-supertree of T. Given a set ${\mathcal {T}}=\{{T}_1,{T}_2,\cdots, {T}_n\}$ of $n$ node-labeled rooted trees, the smallest common AoN-supertree problem seeks the smallest possible node-labeled rooted tree (denoted as ${\textbf{LCST}}$) such that every tree Ti in ${\mathcal {T}}$ is an AoN-subtree of ${\textbf{LCST}}$. It generalizes the smallest superstring problem and it has applications in glycobiology. We present a polynomial-time greedy algorithm with approximation ratio 6.

Posted Content
TL;DR: In this article, a natural optimization formulation of the DNA code design problem is proposed, in which the goal is to design n strings that satisfy a given set of constraints while minimizing the length of the strings.
Abstract: We consider the problem of efficiently designing sets (codes) of equal-length DNA strings (words) that satisfy certain combinatorial constraints. This problem has numerous motivations including DNA computing and DNA self-assembly. Previous work has extended results from coding theory to obtain bounds on code size for new biologically motivated constraints and has applied heuristic local search and genetic algorithm techniques for code design. This paper proposes a natural optimization formulation of the DNA code design problem in which the goal is to design n strings that satisfy a given set of constraints while minimizing the length of the strings. For multiple sets of constraints, we provide high-probability algorithms that run in time polynomial in n and any given constraint parameters, and output strings of length within a constant factor of the optimal. To the best of our knowledge, this work is the first to consider this type of optimization problem in the context of DNA code design.
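
A randomized approach to this kind of problem can be illustrated with a deliberately simple sketch: draw random words and keep the set only if it satisfies the constraint. The example assumes a single constraint, a minimum pairwise Hamming distance, and does not reproduce the paper's constraint sets, algorithms, or length guarantees. The names `random_code`, `min_dist`, `seed`, and `max_tries` are hypothetical.

```python
import random
from itertools import combinations


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def random_code(n, length, min_dist, seed=0, max_tries=1000):
    """Draw n random DNA words; retry until all pairs differ in >= min_dist places.

    Returns the list of words, or None if no valid set is found in max_tries.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        words = ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]
        if all(hamming(u, v) >= min_dist for u, v in combinations(words, 2)):
            return words
    return None
```

Two uniformly random DNA words of length 20 differ in 15 positions on average, so a call such as `random_code(4, 20, 8)` is very likely to succeed on an early attempt. The optimization question in the abstract is how short the words can be made while still meeting the constraints.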


01 Jan 2006
TL;DR: In this paper, a polynomial-time greedy algorithm with approximation ratio 6 is presented for the smallest common AoN-supertree problem, which generalizes the smallest superstring problem.
Abstract: A node-labeled rooted tree T (with root r) is an all-or-nothing subtree (called AoN-subtree) of a node-labeled rooted tree T′ if (1) T is a subtree of the tree rooted at some node u (with the same label as r) of T′, (2) for each internal node v of T, all the neighbors of v in T′ are the neighbors of v in T. Tree T′ is then called an AoN-supertree of T. Given a set ${\mathcal {T}}=\{{T}_1,{T}_2,\cdots, {T}_n\}$ of $n$ node-labeled rooted trees, the smallest common AoN-supertree problem seeks the smallest possible node-labeled rooted tree (denoted as ${\textbf{LCST}}$) such that every tree Ti in ${\mathcal {T}}$ is an AoN-subtree of ${\textbf{LCST}}$. It generalizes the smallest superstring problem and it has applications in glycobiology. We present a polynomial-time greedy algorithm with approximation ratio 6.