Showing papers by "Nello Cristianini" published in 2008


01 Jan 2008

236 citations


Proceedings Article
19 Jun 2008
TL;DR: An extensive experimental study of the learning capabilities of a Statistical Machine Translation system, Moses, is presented. The research directions it suggests include, most notably, the integration of linguistic rules into the model inference phase and the development of active learning procedures.
Abstract: We present an extensive experimental study of a Statistical Machine Translation system, Moses (Koehn et al., 2007), from the point of view of its learning capabilities. Very accurate learning curves are obtained, by using high-performance computing, and extrapolations are provided of the projected performance of the system under different conditions. We provide a discussion of learning curves, and we suggest that: 1) the representation power of the system is not currently a limitation to its performance, 2) the inference of its models from finite sets of i.i.d. data is responsible for current performance limitations, 3) it is unlikely that increasing dataset sizes will result in significant improvements (at least in the traditional i.i.d. setting), 4) it is unlikely that novel statistical estimation methods will result in significant improvements. The current performance wall is mostly a consequence of Zipf's law, and this should be taken into account when designing a statistical machine translation system. A few possible research directions are discussed as a result of this investigation, most notably the integration of linguistic rules into the model inference phase, and the development of active learning procedures.
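The link the abstract draws between Zipf's law and diminishing returns from more i.i.d. data can be illustrated with a small calculation. The sketch below is not the paper's experiment: it assumes a hypothetical vocabulary of 50,000 words whose probabilities follow a Zipf distribution with exponent 1, and computes the expected probability mass of words that never appear in a sample of n tokens.

```python
def zipf_probs(vocab_size, exponent=1.0):
    """Word probabilities proportional to 1 / rank**exponent (assumed model)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_unseen_mass(probs, n_tokens):
    """Expected total probability of words absent from n_tokens i.i.d. draws.

    A word with probability p is missing from the sample with
    probability (1 - p) ** n_tokens.
    """
    return sum(p * (1.0 - p) ** n_tokens for p in probs)


# Hypothetical vocabulary size and sample sizes, chosen only for illustration.
probs = zipf_probs(50_000)
sample_sizes = (10_000, 100_000, 1_000_000)
masses = [expected_unseen_mass(probs, n) for n in sample_sizes]

for n, mass in zip(sample_sizes, masses):
    print(f"n = {n:>9,d}  expected unseen mass = {mass:.4f}")
```

Under these assumptions the unseen mass shrinks as the sample grows, but the long tail of rare words means some of it persists at every sample size tried here.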

39 citations


Journal Article
TL;DR: Two novel learning algorithms are presented that rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training.
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction.
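To make "the first two moments of the scoring function over the output space" concrete, here is a minimal sketch under a strong simplifying assumption that is not taken from the paper: the linear score decomposes over positions with no interaction between neighbouring labels, and every label sequence is weighted equally. In that case the mean and variance follow from per-position statistics, without enumerating the exponentially large output space. The table `theta` and both function names are hypothetical.

```python
from itertools import product
from statistics import fmean, pvariance


def moments_factorized(theta):
    """Mean and variance of score(y) = sum_t theta[t][y_t] over all y.

    Positions are independent under the uniform distribution, so the
    per-position means and (population) variances simply add up.
    """
    mean = sum(fmean(row) for row in theta)
    variance = sum(pvariance(row) for row in theta)
    return mean, variance


def moments_bruteforce(theta):
    """Same moments, computed by enumerating every label sequence."""
    label_ranges = (range(len(row)) for row in theta)
    scores = [
        sum(row[label] for row, label in zip(theta, labels))
        for labels in product(*label_ranges)
    ]
    return fmean(scores), pvariance(scores)


# Hypothetical per-position label scores: 4 positions, 3 labels each.
theta = [
    [0.5, -1.0, 2.0],
    [1.5, 0.0, -0.5],
    [0.2, 0.3, -2.0],
    [1.0, 1.0, 4.0],
]

fast = moments_factorized(theta)
slow = moments_bruteforce(theta)
print(fast, slow)
```

The brute-force version visits all 3^4 sequences, while the factorized version touches each table entry once. Models with interactions between outputs, such as the alignment and parsing tasks the abstract mentions, would need a different computation than this one.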

12 citations



Journal Article
TL;DR: In this article, a 2-stage detector was designed to find rho-independent transcription terminators in the Escherichia coli genome, which includes a Stochastic Context Free Grammar (SCFG) component and a Support Vector Machine (SVM) component.
Abstract: Terminator Detection by Support Vector Machine Utilizing a Stochastic Context-Free Grammar
Patricia Francis-Lyon, Nello Cristianini: University of California at Davis
Stephen Holbrook: Lawrence Berkeley National Laboratory

Abstract
A 2-stage detector was designed to find rho-independent transcription terminators in the Escherichia coli genome. The detector includes a Stochastic Context Free Grammar (SCFG) component and a Support Vector Machine (SVM) component. To find terminators, the SCFG searches the intergenic regions of nucleotide sequence for local matches to a terminator grammar that was designed and trained utilizing examples of known terminators. The grammar selects sequences that are the best candidates for terminators and assigns them a prefix, stem-loop, suffix structure using the Cocke-Younger-Kasami (CYK) algorithm, modified to incorporate energy effects of base pairing. The parameters from this inferred structure are passed to the SVM classifier, which distinguishes terminators from non-terminators that score high according to the terminator grammar. The SVM was trained with negative examples drawn from intergenic sequences that include both featureless and RNA gene regions (which were assigned prefix, stem-loop, suffix structure by the SCFG), so that it successfully distinguishes terminators from either of these. The classifier was found to be 96.4% successful during testing.

Introduction
Two types of transcription terminators, named for their operating mechanisms, have been found to exist in bacteria: rho-dependent and rho-independent terminators. Detection of terminators has been challenging due to the lack of clear signals in their genetic sequence, such as is provided to protein gene detection by start and stop codons. However, there are structural features present in the class of rho-independent terminators that may be exploited to aid in their detection.

For a rho-independent terminator, the ability to function effectively is largely due to formation of a stem-loop. This secondary structure, rather than sequence, is the phenotype selected for in the evolutionary process. The same structures may result from different sequences of the nucleotides adenine, cytosine, guanine and uracil (a, c, g and u). Therefore sequences may be evolutionarily related while not conserved, as long as their structures are conserved by compensatory mutations. (For example, a stem cg pair can be replaced by a gc, au, ua, gu or ug pair.) Unsurprisingly, it has been found that rho-independent terminators do not share a general consensus sequence [1]. Our approach to terminator detection is to infer structural information from sequence alone, then use both sequence and inferred structural parameters to classify the sequence as terminator or non-terminator.

Background
Terminator Detection
Transcription is the process by which a copy of the coding (nontemplate) strand of a gene is produced, except that thymine (t) in DNA is replaced by uracil (u) in RNA, resulting in an RNA transcript. The final phase of transcription is termination, which can be signaled in rho-independent terminators by the formation of a stem-loop within the RNA polymerase (RNAP), inducing the pausing of the transcription elongation complex (TEC) just as the RNAP encounters weak au bonds at the terminator tail, causing the dissociation of the TEC from the RNAP and the release of the protein or RNA gene. A model attributed to Carafa et al. [2] describes DNA sequence for rho-independent terminators. An RNA hairpin (stem-loop) is followed by a 15 nucleotide (nt) long region rich in thymidine (the nucleoside of thymine), which may be separated from the hairpin by a spacer region of up to 2 nts. An adenosine-rich region was described upstream of the hairpin (but not used in their scoring system). Fig. 0 depicts the canonical terminator, based upon the Carafa model, that was used in this project.
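The structural idea described here, a stem whose arms pair up (including gu wobble pairs) followed by a u-rich tail, can be sketched as a simple check. This is only an illustration and is neither the SCFG/SVM detector nor the Carafa scoring system: it ignores free energy and the optional spacer, takes the stem and loop lengths as given, and the 15 nt tail window with a 50% uracil threshold is a hypothetical cutoff.

```python
# Allowed stem pairs, as listed in the text: cg, gc, au, ua, gu, ug.
PAIRS = {("c", "g"), ("g", "c"), ("a", "u"), ("u", "a"), ("g", "u"), ("u", "g")}


def stem_is_paired(left, right):
    """True if the 5' arm pairs with the 3' arm read in reverse."""
    if len(left) != len(right):
        return False
    return all((a, b) in PAIRS for a, b in zip(left, reversed(right)))


def looks_like_terminator(seq, stem_len, loop_len, tail_len=15, min_u_fraction=0.5):
    """Check for a fully paired stem-loop followed directly by a u-rich tail.

    tail_len and min_u_fraction are illustrative assumptions, not values
    taken from the source.
    """
    seq = seq.lower()
    hairpin_len = 2 * stem_len + loop_len
    if len(seq) < hairpin_len + tail_len:
        return False
    left = seq[:stem_len]
    right = seq[stem_len + loop_len:hairpin_len]
    tail = seq[hairpin_len:hairpin_len + tail_len]
    u_fraction = tail.count("u") / tail_len
    return stem_is_paired(left, right) and u_fraction >= min_u_fraction


tail = "uuuuuuuauuuucuu"
original = "gccgcc" + "uucg" + "ggcggc" + tail
# Compensatory mutation: the outer gc pair becomes an au pair.
compensated = "accgcc" + "uucg" + "ggcggu" + tail
# Single mutation that breaks the outer pair (a opposite c).
broken = "accgcc" + "uucg" + "ggcggc" + tail

print(looks_like_terminator(original, stem_len=6, loop_len=4))
print(looks_like_terminator(compensated, stem_len=6, loop_len=4))
print(looks_like_terminator(broken, stem_len=6, loop_len=4))
```

The `original` and `compensated` sequences differ in sequence but keep the same structure, which is the sense in which terminators can be related without sharing a consensus sequence.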
Carafa et al. developed a 2-stage process to detect and classify candidate terminators which takes into account structural information such as free energy of the RNA hairpin, along with stem and loop length. Sequence information, such as the number and positions of thymidine residues and the fraction of cg pairs in the stem, is also used. This algorithm

1 citation