TL;DR: This work presents NP-hardness as well as fixed-parameter tractability results for different variants of Colorful Components and develops an efficient and very accurate heuristic algorithm clearly outperforming a previous min-cut-based heuristic on multiple sequence alignment data.
Abstract: The NP-hard Colorful Components problem is, given a vertex-colored graph, to delete a minimum number of edges such that no connected component contains two vertices of the same color. It has applications in multiple sequence alignment and in multiple network alignment where the colors correspond to species. We initiate a systematic complexity-theoretic study of Colorful Components by presenting NP-hardness as well as fixed-parameter tractability results for different variants of Colorful Components. We also perform experiments with our algorithms and additionally develop an efficient and very accurate heuristic algorithm clearly outperforming a previous min-cut-based heuristic on multiple sequence alignment data.
The authors study a maximum parsimony approach to the discovery of heterogeneous components in vertex-colored graphs. Colorful Components is an edge modification problem originating from biological applications in sequence and network alignment, as described next. Instance: an undirected graph whose vertices are colored. Task: delete a minimum number of edges such that no connected component of the resulting graph contains two vertices of the same color.
The first application of Colorful Components stems from Multiple Sequence Alignment.
Since every pair of same-colored vertices must end up in different components, Colorful Components is a special case of the well-known NP-hard Multicut problem, which has as input an undirected graph and a set of vertex pairs and asks for a minimum number of edge deletions that disconnect each given vertex pair.
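As a concrete illustration (a minimal sketch, not taken from the paper; names are illustrative), the induced Multicut instance can be read off directly from the coloring:

```python
from itertools import combinations

def to_multicut_pairs(vertices, color):
    """Terminal pairs of the Multicut instance induced by a Colorful
    Components instance: all pairs of distinct vertices sharing a color.
    A minimum edge set separating every such pair is exactly a minimum
    solution for Colorful Components."""
    return [(u, v) for u, v in combinations(vertices, 2)
            if color[u] == color[v]]
```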
First, the authors observe that Colorful Components is NP-hard even in trees.
2 Computational Hardness
The authors present hardness results for two restricted variants of Colorful Components.
Proposition 1. Colorful Components is NP-hard even in trees with diameter four.
For each clause Cj of φ containing the variables xp, xq, and xr, the authors connect the three corresponding variable cycles by a clause gadget.
Now, since at least 6m edges are deleted in the variable cycles, for each clause Cj exactly four edges incident with aj must be deleted by S. Consequently, for each variable cycle either all even or all odd edges are deleted.
Altogether, this shows the correctness of the reduction.
3 Algorithms
While Theorem 1 shows that Colorful Components is NP-hard for three colors, for two colors it can be solved in polynomial time by computing a maximum matching in a bipartite graph.
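To see why two colors are easy: any component with three or more vertices repeats one of the two colors, so a colorful component has at most two vertices, and the kept edges must form a matching among the bichromatic edges. Below is a minimal sketch of this reduction (an illustration, not necessarily the paper's exact argument), using a simple augmenting-path bipartite matching:

```python
def min_deletions_two_colors(n, edges, color):
    """color[v] in {0, 1}. Minimum deletions = |E| minus a maximum
    bipartite matching among the bichromatic edges, since every kept
    component has at most two vertices (one per color)."""
    adj = {v: [] for v in range(n) if color[v] == 0}
    for u, v in edges:
        if color[u] != color[v]:           # monochromatic edges must go
            if color[u] == 1:
                u, v = v, u
            adj[u].append(v)
    match = {}                             # color-1 vertex -> partner

    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    matching = sum(augment(u, set()) for u in adj)
    return len(edges) - matching
```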
First, the authors describe a simple O(c^k · m)-time search tree algorithm.
Now, branch into the at most c cases of destroying this bad path (a path connecting two vertices of the same color) by deleting one of its edges, and for each case recursively solve the resulting instance.
In the first case, the authors have visited at most c+1 vertices until a vertex pair with the same color has been found; hence, a bad path consists of at most c edges.
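A sketch of this search tree algorithm (illustrative code, not the authors'; all identifiers are assumptions): a BFS detects a bad path of at most c edges, and the solver branches on each of its edges.

```python
def solve(graph, color, k):
    """True iff at most k edge deletions make every component colorful.
    graph: dict mapping a vertex to its set of neighbors. At most c
    branches per node of the search tree yield O(c^k * m) time."""
    path = find_bad_path(graph, color)
    if path is None:                        # every component is colorful
        return True
    if k == 0:
        return False
    for u, v in zip(path, path[1:]):        # at most c edges on the path
        graph[u].remove(v); graph[v].remove(u)
        ok = solve(graph, color, k - 1)
        graph[u].add(v); graph[v].add(u)    # undo the deletion
        if ok:
            return True
    return False

def find_bad_path(graph, color):
    """BFS from each vertex, stopping once a color repeats; the BFS tree
    then has at most c+1 vertices, so the returned path has <= c edges."""
    for s in graph:
        parent, owner, queue = {s: None}, {color[s]: s}, [s]
        for u in queue:
            for v in graph[u]:
                if v in parent:
                    continue
                parent[v] = u
                if color[v] in owner:       # second vertex of this color
                    up_w = []               # path: earlier vertex -> ... -> s
                    x = owner[color[v]]
                    while x is not None:
                        up_w.append(x)
                        x = parent[x]
                    ancestors = set(up_w)
                    up_v = []               # path: v -> ... -> LCA (excl.)
                    x = v
                    while x not in ancestors:
                        up_v.append(x)
                        x = parent[x]
                    return up_w[:up_w.index(x) + 1] + up_v[::-1]
                owner[color[v]] = v
                queue.append(v)
    return None
```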
The authors note that Rule 1 provides a trivial kernelization [9] for Colorful Components with respect to the combined parameter (k, c): obviously, after exhaustive data reduction, the instance has at most 2kc vertices, since an edge deletion can produce at most two colorful components, each of size at most c.
4 Formulation as Weighted Multi-Multiway Cut
In the Colorful Components formulation, it is not possible to simplify a graph based on the knowledge that two vertices belong to the same connected component; the authors would like to be able to merge two such vertices.
For this, the authors first need to allow not just a single color per vertex, but a set; moreover, they need to allow edge weights.
Using the merge operation, the authors can do a simple branching on an edge [3]: either delete the edge, or merge its endpoints; in the experimental part this will be referred to as edge branching.
Note that merging does not necessarily decrease the parameter; however, it is easy to see that if the authors branch on each edge of a forbidden path successively, then the last edge of the path cannot be merged, since it connects vertices with intersecting color sets.
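A minimal sketch of the merge operation (the data representation and identifiers are assumptions, not the paper's code): vertices carry color sets, edges carry weights, and merging unions the color sets and adds the weights of edges that become parallel. Edge branching then tries, for an edge {u, v}, both deleting the edge and calling merge(graph, colors, weight, u, v).

```python
def merge(graph, colors, weight, u, v):
    """Merge v into u. Assumes graph maps a vertex to its neighbor set,
    colors maps a vertex to its color set, and weight maps
    frozenset({a, b}) to the edge weight. Only color-disjoint endpoints
    may be merged; otherwise the edge must be deleted instead."""
    assert not (colors[u] & colors[v]), "endpoints share a color"
    colors[u] |= colors.pop(v)
    for w in graph.pop(v):
        graph[w].discard(v)
        if w == u:
            weight.pop(frozenset((u, v)))   # the branched-on edge vanishes
            continue
        old, new = frozenset((v, w)), frozenset((u, w))
        weight[new] = weight.get(new, 0) + weight.pop(old)  # parallel edges add up
        graph[u].add(w)
        graph[w].add(u)
```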
The factor 3 has been tuned heuristically.
5 Experiments
The authors performed experiments with instances from the multiple sequence alignment application.
The source code and the test instances are available under the GNU GPL license at http://fpt.akt.tu-berlin.de/colcom/.
To efficiently find data reduction opportunities with Rule 2 and Rule 3, the authors try starting from each vertex and successively add vertices with disjoint colors that minimize the cut to the rest of the graph, until they have either found a reduction opportunity or no more vertices can be added.
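Since Rules 2 and 3 are not reproduced in this excerpt, the following sketch shows only the greedy growth loop, with the rule test left as a callback (all names are illustrative):

```python
def grow_candidate(graph, color, weight, start, rule_applies):
    """Starting from `start`, repeatedly add a neighbor of a new color
    that minimizes the weight of the cut between the growing vertex set
    and the rest of the graph; stop when a reduction rule applies or no
    further vertex can be added."""
    component, used = {start}, {color[start]}

    def cut_weight(vertex_set):
        return sum(weight[frozenset((a, b))]
                   for a in vertex_set for b in graph[a]
                   if b not in vertex_set)

    while True:
        if rule_applies(component):         # Rule 2 / Rule 3 (not shown)
            return component
        candidates = {v for u in component for v in graph[u]
                      if v not in component and color[v] not in used}
        if not candidates:
            return None
        best = min(candidates, key=lambda v: cut_weight(component | {v}))
        component.add(best)
        used.add(color[best])
```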
For the heuristics, the authors compare the solution quality for the 112 instances for which they know the optimal solution.
Finally, for the instances for which an exact solution was found, the authors compared the solution quality of the alignments obtained by using DIALIGN with and without the partial alignment columns indicated by an exact solution for Colorful Components, by the merge heuristic, and by the min-cut heuristic.
6 Outlook
In preliminary experiments with network alignment data, the authors found that allowing only one protein of each species to be matched, while a natural model, was too strict.
Generalizing Colorful Components to allow a constant number of occurrences of each color in each connected component could result in improved network alignments.
TL;DR: This work presents an Õ(2^k n^2 + nm)-time algorithm, based on fast subset convolution, for the Steiner tree problem in graphs with n vertices, k terminals, and m edges with bounded integer weights.
Abstract: We present a fast algorithm for the subset convolution problem: given functions $f$ and $g$ defined on the lattice of subsets of an $n$-element set $N$, compute their subset convolution $f * g$, defined for $S \subseteq N$ by $(f * g)(S) = \sum_{T \subseteq S} f(T)\,g(S \setminus T)$, where addition and multiplication are carried out in an arbitrary ring. Via Möbius transform and inversion, our algorithm evaluates the subset convolution in $O(n^2 2^n)$ additions and multiplications, substantially improving upon the straightforward $O(3^n)$ algorithm. Specifically, if the input functions have an integer range $\{-M, -M+1, \ldots, M\}$, their subset convolution over the ordinary sum-product ring can be computed in $\tilde O(2^n \log M)$ time; the notation $\tilde O$ suppresses polylogarithmic factors. Furthermore, using a standard embedding technique we can compute the subset convolution over the max-sum or min-sum semiring in $\tilde O(2^n M)$ time. To demonstrate the applicability of fast subset convolution, we present the first $\tilde O(2^k n^2 + nm)$ algorithm for the Steiner tree problem in graphs with $n$ vertices, $k$ terminals, and $m$ edges with bounded integer weights, improving upon the $O(3^k n + 2^k n^2 + nm)$ time bound of the classical Dreyfus-Wagner algorithm. We also discuss extensions to recent $O(2^n)$-time algorithms for covering and partitioning problems (Björklund and Husfeldt, FOCS 2006; Koivisto, FOCS 2006).
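The ranked zeta/Möbius scheme behind the $O(n^2 2^n)$ bound can be sketched as follows (a plain illustration over the integer ring, not the authors' implementation):

```python
def subset_convolution(f, g, n):
    """h(S) = sum over T subset of S of f(T) * g(S \\ T), computed with
    O(n^2 * 2^n) ring operations via ranked zeta/Moebius transforms.
    f and g are lists of length 2**n indexed by bitmask."""
    N = 1 << n
    size = [bin(S).count("1") for S in range(N)]

    def zeta(a):                        # a[S] <- sum of a[T] over T subset of S
        a = list(a)
        for j in range(n):
            for S in range(N):
                if S >> j & 1:
                    a[S] += a[S ^ (1 << j)]
        return a

    # Split by subset size ("rank") and zeta-transform every slice.
    zf = [zeta([f[S] if size[S] == i else 0 for S in range(N)])
          for i in range(n + 1)]
    zg = [zeta([g[S] if size[S] == i else 0 for S in range(N)])
          for i in range(n + 1)]

    h = [0] * N
    for k in range(n + 1):
        # Pointwise product of rank-compatible slices...
        hk = [sum(zf[i][S] * zg[k - i][S] for i in range(k + 1))
              for S in range(N)]
        # ...then the Moebius (inverse zeta) transform.
        for j in range(n):
            for S in range(N):
                if S >> j & 1:
                    hk[S] -= hk[S ^ (1 << j)]
        for S in range(N):
            if size[S] == k:
                h[S] = hk[S]
    return h
```

For example, with n = 2 and f = g = [1, 1, 1, 1], the result is [1, 2, 2, 4]: each subset S has 2^|S| ordered splits (T, S \ T).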
TL;DR: This paper presents a new algorithm, C3Part-M, based on the work by Boyer et al.
Abstract: Recent experimental progress is once again producing a huge quantity of data in various areas of biology, in particular on protein interactions. In order to extract meaningful information from these data, researchers typically use a graph representation to which they apply network alignment tools. Because of the combinatorial difficulty of the network alignment problem, most of the algorithms developed so far are heuristics, and the exact ones are of no use in practice on large numbers of networks. In this paper, we propose a unified scheme for the network alignment question and present a new algorithm, C3Part-M, based on the work by Boyer et al. [2], that is much more efficient than the original one in the case of multiple networks. We compare it, on protein-protein interaction networks, to a recently proposed alignment tool, NetworkBLAST-M [10], and show that we recover similar results while using a different but exact approach.
TL;DR: This work identifies a new application of Colorful Components in the correction of Wikipedia interlanguage links, and describes and compares three exact and two heuristic approaches to solve this NP-hard graph partitioning problem.
Abstract: The NP-hard Colorful Components problem is a graph partitioning problem on vertex-colored graphs. We identify a new application of Colorful Components in the correction of Wikipedia interlanguage links, and describe and compare three exact and two heuristic approaches. In particular, we devise two ILP formulations, one based on Hitting Set and one based on Clique Partition. Furthermore, we use the recently proposed implicit hitting set framework [Karp, JCSS 2011; Chandrasekaran et al., SODA 2011] to solve Colorful Components. Finally, we study a move-based and a merge-based heuristic for Colorful Components. We can optimally solve Colorful Components for Wikipedia link correction data; while the Clique Partition-based ILP outperforms the other two exact approaches, the implicit hitting set approach is a simple and competitive alternative. The merge-based heuristic is very accurate and outperforms the move-based one. The above results for Wikipedia data are confirmed by experiments with synthetic instances.
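A Clique Partition-based ILP can plausibly be sketched as follows (an assumption about the paper's exact formulation; the PuLP encoding and all identifiers are illustrative): a binary variable per vertex pair states whether the two vertices end up in the same component, transitivity is enforced on all triples, same-colored pairs are forced apart, and the objective counts edges whose endpoints are separated.

```python
from itertools import combinations
import pulp

def colorful_components_ilp(vertices, edges, color):
    """Clique Partition-style ILP sketch. Vertices are assumed to be
    comparable (e.g., integers); edges is a list of 2-tuples."""
    prob = pulp.LpProblem("ColorfulComponents", pulp.LpMinimize)
    same = {frozenset(p): pulp.LpVariable(f"s_{min(p)}_{max(p)}", cat="Binary")
            for p in combinations(vertices, 2)}
    for u, v in combinations(vertices, 2):
        if color[u] == color[v]:            # colorful: force same-colored pairs apart
            prob += same[frozenset((u, v))] == 0
    for a, b, c in combinations(vertices, 3):   # transitivity on all triples
        ab, bc, ac = (same[frozenset(x)] for x in ((a, b), (b, c), (a, c)))
        prob += ab + bc - ac <= 1
        prob += ab + ac - bc <= 1
        prob += bc + ac - ab <= 1
    # Minimize the number of deleted (= separated) edges.
    prob += pulp.lpSum(1 - same[frozenset(e)] for e in edges)
    prob.solve()
    return [e for e in edges if pulp.value(same[frozenset(e)]) < 0.5]
```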
15 citations
Cites background or methods or result from "Partitioning into colorful componen..."
...Previously, we showed that it is NP-hard even in three-colored graphs with maximum degree six [4], and proposed an exact branching algorithm with running time O((c−1)^k · |E|), where k is the number of deleted edges....
[...]
...2(a), we compare the running times for the three approaches and additionally the branching algorithm from [4], with a time limit of 15 minutes....
[...]
...Similar to our previous results for multiple sequence alignment [4], the merge-based heuristic gives an excellent approximation here....
[...]
...Before starting the solver, we use data reduction as described before [4]....
[...]
...For completeness, we briefly recall this greedy heuristic [4]....
TL;DR: In this article, the problem of supporting queries on a string $S$ of length $n$ within space bounded by the size of a string attractor for $S$ is studied.
Abstract: We study the problem of supporting queries on a string $S$ of length $n$ within a space bounded by the size $\gamma$ of a string attractor for $S$. Recent works showed that random access on $S$ can be supported in optimal $O(\log(n/\gamma)/\log\log n)$ time within $O\left (\gamma\ \rm{polylog}\ n \right)$ space. In this paper, we extend this result to \emph{rank} and \emph{select} queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a $\log\log n$ time-factor in \emph{select} queries. We also provide matching lower and upper bounds for \emph{partial sum} and \emph{predecessor} queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.
TL;DR: In this article, the colorful components framework is investigated: remove a collection of edges from an undirected vertex-colored graph so that all connected components of the resulting graph are colorful. The main result is a polynomial-time algorithm for the first of three objectives studied, minimizing the number of singleton vertices.
Abstract: In this paper we investigate the colorful components framework, motivated by applications emerging from comparative genomics. The general goal is to remove a collection of edges from an undirected vertex-colored graph $G$ such that in the resulting graph $G'$ all the connected components are colorful (i.e., any two vertices of the same color belong to different connected components). We want $G'$ to optimize an objective function, the selection of this function being specific to each problem in the framework. We analyze three objective functions, and thus three different problems, which are believed to be relevant for the biological applications: minimizing the number of singleton vertices, maximizing the number of edges in the transitive closure, and minimizing the number of connected components. Our main result is a polynomial-time algorithm for the first problem. This result disproves the conjecture of Zheng et al. that the problem is NP-hard (assuming $P \neq NP$). Then, we show that the second problem is APX-hard, thus proving and strengthening the conjecture of Zheng et al. that the problem is NP-hard. Finally, we show that the third problem does not admit polynomial-time approximation within a factor of $|V|^{1/14-\epsilon}$ for any $\epsilon > 0$, assuming $P \neq NP$ (or within a factor of $|V|^{1/2-\epsilon}$, assuming $ZPP \neq NP$).
TL;DR: The latest release of the most widely used multiple alignment benchmark, BAliBASE, which provides high quality, manually refined, reference alignments based on 3D structural superpositions is presented, including new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences.
Abstract: Multiple sequence alignment is one of the cornerstones of modern molecular biology. It is used to identify conserved motifs, to determine protein domains, in 2D/3D structure prediction by homology and in evolutionary studies. Recently, high-throughput technologies such as genome sequencing and structural proteomics have led to an explosion in the amount of sequence and structure information available. In response, several new multiple alignment methods have been developed that improve both the efficiency and the quality of protein alignments. Consequently, the benchmarks used to evaluate and compare these methods must also evolve. We present here the latest release of the most widely used multiple alignment benchmark, BAliBASE, which provides high quality, manually refined, reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences. Using a novel, semiautomatic update protocol, the number of protein families in the benchmark has been increased and representative test cases are now available that cover most of the protein fold space. The total number of proteins in BAliBASE has also been significantly increased from 1444 to 6255 sequences. In addition, full-length sequences are now provided for all test cases, which represent difficult cases for both global and local alignment programs. Finally, the BAliBASE Web site (http://www-bio3d-igbmc.u-strasbg.fr/balibase) has been completely redesigned to provide a more user-friendly, interactive interface for the visualization of the BAliBASE reference alignments and the associated annotations.
424 citations
"Partitioning into colorful componen..." refers background in this paper
...0 benchmark [14], using the diafragm 1....
[...]
...0 benchmark [14] each time within five minutes on a standard PC, with up to 5 000 vertices and 13 000 edges....
TL;DR: A brief survey that presents data reduction and problem kernelization as a promising research field for algorithm and complexity theory.
Abstract: To solve NP-hard problems, polynomial-time preprocessing is a natural and promising approach. Preprocessing is based on data reduction techniques that take a problem's input instance and try to perform a reduction to a smaller, equivalent problem kernel. Problem kernelization is a methodology that is rooted in parameterized computational complexity. In this brief survey, we present data reduction and problem kernelization as a promising research field for algorithm and complexity theory.
406 citations
"Partitioning into colorful componen..." refers background in this paper
...We note that Rule 1 provides a trivial kernelization [8] for Colorful Components with respect to the combined parameter (k, c): obviously, after exhaustive data reduction, the instance has at most 2kc vertices, since an edge deletion can produce at most two colorful components, each of size at most c....
TL;DR: This chapter proves lower bounds based on ETH for the time needed to solve various problems, and in many cases these lower bounds match the running time of the best known algorithms for the problem.
Abstract: The Exponential Time Hypothesis (ETH) is a conjecture stating that, roughly speaking, n-variable 3-SAT cannot be solved in time 2^o(n). In this chapter, we prove lower bounds based on ETH for the time needed to solve various problems. In many cases, these lower bounds match (up to small factors) the running time of the best known algorithms for the problem.
TL;DR: It is shown that both the maximum integral multicommodity flow and the minimum multicut problem are NP-hard and MAX SNP-hard on trees, although the maximum integral flow can be computed in polynomial time if the edges have unit capacity.
Abstract: We study the maximum integral multicommodity flow problem and the minimum multicut problem restricted to trees. This restriction is quite rich and contains as special cases classical optimization problems such as matching and vertex cover for general graphs. It is shown that both the maximum integral multicommodity flow and the minimum multicut problem are NP-hard and MAX SNP-hard on trees, although the maximum integral flow can be computed in polynomial time if the edges have unit capacity. We present an efficient algorithm that computes a multicut and integral flow such that the weight of the multicut is at most twice the value of the flow. This gives a 2-approximation algorithm for minimum multicut and a 1/2-approximation algorithm for maximum integral multicommodity flow in trees.
391 citations
"Partitioning into colorful componen..." refers background in this paper
...Note that Multicut is NP-hard and MaxSNP-hard even if the input is a star, that is, a tree consisting of a central vertex with attached degree-1 vertices [7]....
Q1. What contributions have the authors mentioned in the paper "Partitioning into colorful components by minimum edge deletions" ?
The authors initiate a systematic complexity-theoretic study of Colorful Components by presenting NP-hardness as well as fixed-parameter tractability results for different variants of Colorful Components. The authors also perform experiments with their algorithms and additionally develop an efficient and very accurate heuristic algorithm clearly outperforming a previous min-cut-based heuristic on multiple sequence alignment data.