
Showing papers by "Nello Cristianini published in 2012"


Journal ArticleDOI
TL;DR: A general methodology for inferring the occurrence and magnitude of an event or phenomenon by exploring the rich amount of unstructured textual information on the social part of the Web by investigating two case studies of geo-tagged user posts on the microblogging service of Twitter.
Abstract: We present a general methodology for inferring the occurrence and magnitude of an event or phenomenon by exploring the rich amount of unstructured textual information on the social part of the Web. Having geo-tagged user posts on the microblogging service of Twitter as our input data, we investigate two case studies. The first consists of a benchmark problem, where actual levels of rainfall in a given location and time are inferred from the content of tweets. The second one is a real-life task, where we infer regional Influenza-like Illness rates in an effort to detect an emerging epidemic disease in a timely manner. Our analysis builds on a statistical learning framework, which performs sparse learning via the bootstrapped version of LASSO to select a consistent subset of textual features from a large number of candidates. In both case studies, the selected features indicate close semantic correlation with the target topics, and inference, conducted by regression, achieves significant performance, especially given the short length (approximately one year) of Twitter's data time series.
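
A minimal sketch (not the authors' implementation) of the bootstrapped-LASSO idea described above: LASSO is fitted on many bootstrap resamples of a tweet-term frequency matrix, only the features selected in nearly every resample are kept, and an ordinary regression on that stable subset produces the final estimate. All data, names and thresholds below are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def bolasso_select(X, y, n_bootstraps=100, alpha=0.01, keep_fraction=0.9, seed=0):
    """Return indices of features selected by LASSO in at least
    `keep_fraction` of the bootstrap resamples (bootstrapped-LASSO style)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n_samples, size=n_samples)       # resample with replacement
        model = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx])
        counts += (model.coef_ != 0)
    return np.flatnonzero(counts >= keep_fraction * n_bootstraps)

# X: rows = (location, day) observations, columns = candidate n-gram frequencies.
# y: ground-truth signal for the same observations (e.g. rainfall level or ILI rate).
X = np.random.rand(200, 1000)                                   # placeholder data
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(200)

stable = bolasso_select(X, y)
nowcaster = LinearRegression().fit(X[:, stable], y)             # regression on the stable subset
print("selected features:", stable)
```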

209 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: A collection of 484 million tweets generated by more than 9.8 million users from the United Kingdom over the past 31 months, a period marked by economic downturn and some social tensions, shows that periodic events such as Christmas and Halloween evoke similar mood patterns every year.
Abstract: Large scale analysis of social media content allows for real time discovery of macro-scale patterns in public opinion and sentiment. In this paper we analyse a collection of 484 million tweets generated by more than 9.8 million users from the United Kingdom over the past 31 months, a period marked by economic downturn and some social tensions. Our findings, besides corroborating our choice of method for the detection of public mood, also present intriguing patterns that can be explained in terms of events and social changes. On the one hand, the time series we obtain show that periodic events such as Christmas and Halloween evoke similar mood patterns every year. On the other hand, we see that a significant increase in negative mood indicators coincides with the announcement of the cuts to public spending by the government, and that this effect is still ongoing. We also detect events such as the riots of summer 2011, as well as a possible calming effect coinciding with the run-up to the royal wedding.
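
A toy illustration of the kind of mood time series described above: daily counts of words from mood lexicons, normalised by tweet volume. The word lists here are placeholders; the study relies on established affect word lists rather than these invented examples.

```python
from collections import Counter, defaultdict
from datetime import date

# Placeholder lexicons; the study uses established emotion word lists (anger, fear, joy, sadness).
MOOD_LEXICONS = {
    "negative": {"angry", "fear", "worried", "cuts"},
    "positive": {"happy", "wedding", "christmas", "joy"},
}

def daily_mood_scores(tweets):
    """tweets: iterable of (date, text). Returns {mood: {date: normalised score}}."""
    volume = Counter()
    hits = defaultdict(Counter)
    for day, text in tweets:
        tokens = text.lower().split()
        volume[day] += len(tokens)
        for mood, lexicon in MOOD_LEXICONS.items():
            hits[mood][day] += sum(1 for t in tokens if t in lexicon)
    return {mood: {day: hits[mood][day] / volume[day] for day in volume}
            for mood in MOOD_LEXICONS}

sample = [(date(2010, 12, 25), "happy christmas everyone"),
          (date(2010, 10, 20), "worried about the spending cuts")]
print(daily_mood_scores(sample))
```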

77 citations


Journal ArticleDOI
TL;DR: Vast data‐streams from social networks like Twitter and Facebook contain a people's opinions, fears and dreams and Thomas Lansdall‐Welfare, Vasileios Lampos and Nello Cristianini exploit a whole new tool for social scientists.
Abstract: Vast data-streams from social networks like Twitter and Facebook contain a people's opinions, fears and dreams. Thomas Lansdall-Welfare, Vasileios Lampos and Nello Cristianini exploit a whole new tool for social scientists.

33 citations


Journal ArticleDOI
TL;DR: An extensive experimental study of Phrase-based Statistical Machine Translation, from the point of view of its learning capabilities, which confirms existing and mostly unpublished beliefs about the learning capabilities and provides insight into the way statistical machine translation learns from data.
Abstract: We present an extensive experimental study of Phrase-based Statistical Machine Translation, from the point of view of its learning capabilities. Very accurate learning curves are obtained, using high-performance computing, and extrapolations of the projected performance of the system under different conditions are provided. Our experiments confirm existing and mostly unpublished beliefs about the learning capabilities of statistical machine translation systems. We also provide insight into the way statistical machine translation learns from data, including the respective influence of translation and language models, the impact of phrase length on performance, and various unlearning and perturbation analyses. Our results support and illustrate the fact that performance improves by a constant amount for each doubling of the data, across different language pairs and different systems. This fundamental limitation seems to be a direct consequence of Zipf's law governing textual data. Although the rate of improvement may depend on both the data and the estimation method, it is unlikely that the general shape of the learning curve will change without major changes in the modeling and inference phases. Possible research directions that address this issue include the integration of linguistic rules or the development of active learning procedures.
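
The "constant improvement per doubling of the data" observation corresponds to a learning curve that is linear in the logarithm of the corpus size. A hedged sketch of fitting such a curve to (corpus size, BLEU) measurements and extrapolating; the numbers below are invented for illustration only.

```python
import numpy as np

# Invented (corpus size in sentence pairs, BLEU) measurements, for illustration only.
sizes = np.array([25_000, 50_000, 100_000, 200_000, 400_000, 800_000])
bleu  = np.array([18.1,   19.9,   21.8,    23.7,    25.5,    27.4])

# BLEU ~ a + b * log2(size): b is the gain per doubling of the training data.
b, a = np.polyfit(np.log2(sizes), bleu, deg=1)
print(f"gain per doubling of data: {b:.2f} BLEU")
print(f"extrapolated BLEU at 3.2M pairs: {a + b * np.log2(3_200_000):.1f}")
```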

17 citations




Journal ArticleDOI
TL;DR: The authors' results correctly reproduce all the established major language groups and subgroups present in the dataset, are compatible with the Indo-European benchmark tree and also include some of the supported higher-level structures.
Abstract: We apply to the task of linguistic phylogenetic inference a successful cognate identification learning model based on point accepted mutation (PAM)-like matrices. We train our system and employ the learned parameters for measuring the lexical distance between languages. We estimate phylogenetic trees using distance-based methods on an Indo-European database. Our results correctly reproduce all the established major language groups and subgroups present in the dataset, are compatible with the Indo-European benchmark tree and also include some of the supported higher-level structures. We review and compare other studies reported in the literature with regard to recognized aspects of the Indo-European language family.
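
A hedged sketch of the final, tree-building step only: given a matrix of pairwise lexical distances between languages, a standard distance-based clustering (average linkage here, as a stand-in for the distance-based methods the paper uses) yields a tree. The distances below are invented; in the paper they come from PAM-like cognate-identification scores over word lists.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

languages = ["English", "German", "Dutch", "French", "Spanish", "Italian"]

# Invented pairwise lexical distances, for illustration only.
D = np.array([
    [0.00, 0.45, 0.48, 0.75, 0.78, 0.77],
    [0.45, 0.00, 0.30, 0.74, 0.77, 0.76],
    [0.48, 0.30, 0.00, 0.73, 0.76, 0.75],
    [0.75, 0.74, 0.73, 0.00, 0.40, 0.38],
    [0.78, 0.77, 0.76, 0.40, 0.00, 0.28],
    [0.77, 0.76, 0.75, 0.38, 0.28, 0.00],
])

# Average-linkage tree as a stand-in for the paper's distance-based methods.
tree = linkage(squareform(D), method="average")
dendrogram(tree, labels=languages, no_plot=True)   # set no_plot=False to draw the tree
print(tree)                                        # merge order and heights
```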

9 citations


Proceedings Article
23 Apr 2012
TL;DR: A web tool that allows users to explore news stories concerning the 2012 US Presidential Elections via an interactive interface based on concepts of "narrative analysis", where the key actors of a narration are identified, along with their relations, in what are sometimes called "semantic triplets".
Abstract: We present a web tool that allows users to explore news stories concerning the 2012 US Presidential Elections via an interactive interface. The tool is based on concepts of "narrative analysis", where the key actors of a narration are identified, along with their relations, in what are sometimes called "semantic triplets" (one example of a triplet of this kind is "Romney Criticised Obama"). The network of actors and their relations can be mined for insights about the structure of the narration, including the identification of the key players, of the network of political support of each of them, a representation of the similarity of their political positions, and other information concerning their role in the media narration of events. The interactive interface allows users to retrieve the news reports supporting the relations of interest.
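
A hedged sketch of how subject-verb-object triplets of the kind mentioned above ("Romney Criticised Obama") can be pulled from parsed sentences and aggregated into an actor network. It uses spaCy's dependency labels rather than the authors' own pipeline, and handles only the simplest sentence patterns.

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")   # small English pipeline with a dependency parser

def svo_triplets(text):
    """Extract naive (subject, verb, object) triplets from simple declarative sentences."""
    triplets = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
                for s in subjects:
                    for o in objects:
                        triplets.append((s.text, token.lemma_, o.text))
    return triplets

news = "Romney criticised Obama over the economy. Obama defended the health care reform."
edges = Counter(svo_triplets(news))
print(edges)   # weighted actor-relation-actor edges; a narrative network can be built from these
```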

8 citations


Proceedings Article
01 Jan 2012
TL;DR: This paper compares the effectiveness of various approaches to graph construction by building graphs of 800,000 vertices based on the Reuters corpus, showing that relation-based classification is competitive with Support Vector Machines, which can be considered as state of the art.
Abstract: The efficient annotation of documents in vast corpora calls for scalable methods of text classification. Representing the documents in the form of graph vertices, rather than in the form of vectors in a bag of words space, allows for the necessary information to be pre-computed and stored. It also fundamentally changes the problem definition, from a content-based to a relation-based classification problem. Efficiently creating a graph where nearby documents are likely to have the same annotation is the central task of this paper. We compare the effectiveness of various approaches to graph construction by building graphs of 800,000 vertices based on the Reuters corpus, showing that relation-based classification is competitive with Support Vector Machines, which can be considered as state of the art. We further show that the combination of our relation-based approach and Support Vector Machines leads to an improvement over the methods individually.
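
A minimal stand-in (not the paper's pipeline) for the two routes being compared: documents become TF-IDF vectors, a k-nearest-neighbour graph is built over them and labels are propagated along its edges, while a linear SVM serves as the content-based baseline. The corpus, labels and parameters below are toy values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading
from sklearn.svm import LinearSVC

docs = ["oil prices rise", "crude exports fall", "parliament passes budget",
        "election results announced", "stocks rally on oil news"]
labels = np.array([0, 0, 1, 1, -1])        # -1 marks the unlabelled document

X = TfidfVectorizer().fit_transform(docs)

# Relation-based route: k-NN graph over documents, labels propagated along the edges.
graph_clf = LabelSpreading(kernel="knn", n_neighbors=2).fit(X.toarray(), labels)
print("graph prediction:", graph_clf.transduction_[-1])

# Content-based baseline: a linear SVM trained on the labelled documents only.
svm = LinearSVC().fit(X[labels != -1], labels[labels != -1])
print("SVM prediction:", svm.predict(X[labels == -1]))
```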

7 citations



Journal ArticleDOI
TL;DR: The design of an autonomous agent that can teach itself how to translate from a foreign language, by first assembling its own training set, then using it to improve its vocabulary and language model is described.
Abstract: We describe the design of an autonomous agent that can teach itself how to translate from a foreign language, by first assembling its own training set and then using it to improve its vocabulary and language model. The key idea is that a Statistical Machine Translation package can be used for the Cross-Language Retrieval Task of assembling a training set from a vast amount of available text (e.g. a large multilingual corpus, or the Web), and then trained on that data, repeating the process several times. The stability issues related to such a feedback loop are addressed by a mathematical model connecting statistical and control-theoretic aspects of the system. We test it on controlled-environment and real-world tasks, showing that this agent can indeed improve its translation performance autonomously and in a stable fashion when seeded with a very small initial training set. We develop a multiprocessor version of the agent that directly accesses the Web using a Web search engine, taking advantage of the large amount of data available there. The modelling approach we develop for this agent is general, and we believe that it will be useful for an entire class of self-learning autonomous agents working on the Web.
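
A heavily hedged, structure-only sketch of the feedback loop described above. The three helpers are stubs standing in for an SMT toolkit training run, a held-out evaluation, and a cross-language retrieval step against a large corpus or the Web; only the shape of the loop (retrieve, retrain, check for stability, repeat) reflects the abstract.

```python
# Structural sketch only: train_smt, evaluate and retrieve_bilingual_text are stubs,
# not calls into any real SMT package or search engine.

def train_smt(corpus):
    return {"size": len(corpus)}                        # stub "model"

def evaluate(model):
    return 1.0 - 1.0 / (1 + model["size"])              # stub quality score that saturates

def retrieve_bilingual_text(model, batch=100):
    return [("src sentence", "tgt sentence")] * batch   # stub retrieved sentence pairs

def self_training_loop(seed_corpus, iterations=10, min_gain=1e-3):
    corpus = list(seed_corpus)                           # very small initial training set
    model = train_smt(corpus)
    score = evaluate(model)
    for _ in range(iterations):
        corpus.extend(retrieve_bilingual_text(model))    # model-driven retrieval of new data
        model = train_smt(corpus)                        # retrain on the enlarged set
        new_score = evaluate(model)
        if new_score - score < min_gain:                 # crude stability / stopping check
            break
        score = new_score
    return model

print(self_training_loop([("hello", "bonjour")]))
```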

01 Jan 2012
TL;DR: This work analyses and characterises the way in which the in-domain and out-of-domain performance of PBSMT is impacted when the amount of training data increases, and indicates that the translation model contributes about 30% more to the performance gain than the language model.
Abstract: The performance of Phrase-Based Statistical Machine Translation (PBSMT) systems mostly depends on training data. Many papers have investigated how to create new resources in order to increase the size of the training corpus in an attempt to improve PBSMT performance. In this work, we analyse and characterise the way in which the in-domain and out-of-domain performance of PBSMT is impacted when the amount of training data increases. Two different PBSMT systems, Moses and Portage, two of the largest parallel corpora, the Giga (French-English) and UN (Chinese-English) datasets, and several in- and out-of-domain test sets were used to build high quality learning curves showing consistent logarithmic growth in performance. These results are stable across language pairs, PBSMT systems and domains. We also analyse the respective impact of additional training data for estimating the language and translation models. Our proposed model approximates learning curves very well and indicates that the translation model contributes about 30% more to the performance gain than the language model.
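
A hedged sketch of the kind of analysis described: fit a curve that is linear in the logarithms of the translation-model and language-model training sizes, then compare the two per-doubling coefficients. The grid of points below is invented purely so the fit is readable; it is not the paper's data.

```python
import numpy as np

# Invented grid of (TM size, LM size, BLEU) points, for illustration only.
tm   = np.array([1e5, 2e5, 4e5, 1e5, 2e5, 4e5, 1e5, 2e5, 4e5])
lm   = np.array([1e6, 1e6, 1e6, 2e6, 2e6, 2e6, 4e6, 4e6, 4e6])
bleu = np.array([21.0, 22.3, 23.6, 22.0, 23.3, 24.6, 23.0, 24.3, 25.6])

# BLEU ~ c + b_tm*log2(tm) + b_lm*log2(lm): per-doubling gains of the two models.
A = np.column_stack([np.ones_like(tm), np.log2(tm), np.log2(lm)])
c, b_tm, b_lm = np.linalg.lstsq(A, bleu, rcond=None)[0]
print(f"gain per doubling: TM {b_tm:.2f} BLEU, LM {b_lm:.2f} BLEU "
      f"(ratio {b_tm / b_lm:.2f})")
```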

Journal ArticleDOI
TL;DR: The NetCover algorithm is presented, a method for the reconstruction of networks based on the order of nodes visited by a stochastic branching process, and it is shown that, crucially, the neighbourhood of each node may be inferred in turn, with global consistency between network and data achieved through purely local considerations.
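
A generic stand-in for the local step described in the TL;DR, not the NetCover algorithm itself: each node's incoming neighbourhood is inferred on its own from observed cascades (ordered lists of visited nodes), by greedily choosing a small set of earlier-appearing nodes that accounts for every cascade in which the node occurs. The greedy set-cover heuristic and the toy cascades are assumptions made for illustration.

```python
# Cascades are ordered lists of visited node labels. For a given node, candidate parents
# are the nodes that appear before it in some cascade; we greedily pick parents until
# every cascade containing the node is "covered" by at least one chosen parent.

def infer_parents(node, cascades):
    relevant = [c for c in cascades if node in c and c.index(node) > 0]
    uncovered = set(range(len(relevant)))
    candidates = {u for c in relevant for u in c[:c.index(node)]}
    parents = set()
    while uncovered:
        # pick the candidate that covers the most still-uncovered cascades
        best = max(candidates - parents,
                   key=lambda u: sum(1 for i in uncovered
                                     if u in relevant[i][:relevant[i].index(node)]))
        parents.add(best)
        uncovered = {i for i in uncovered
                     if best not in relevant[i][:relevant[i].index(node)]}
    return parents

cascades = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["d", "a", "b"]]
print({v: infer_parents(v, cascades) for v in "abcd"})
```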

Proceedings Article
01 Jan 2012
TL;DR: It is discovered that UK tabloids and the website of the “People” magazine contain more appealing content for all audiences than broadsheet newspapers, news aggregators and newswires, and that this measure of readers’ preferences correlates with a measure of linguistic subjectivity at the level of outlets.
Abstract: We model readers' preferences for online news, and use these models to compare different news outlets with each other. The models are based on linear scoring functions, and are inferred by exploiting aggregate behavioural information about readers' click choices for textual content of six given news outlets over one year. We generate one model per outlet, and while not extremely accurate, owing to the limited information available, these models are shown to predict the click choices of readers, as well as to be stable over time. We use those six audience preference models in several ways: to compare how the audiences' preferences of different outlets relate to each other; to score different news topics with respect to user appeal; to rank a large number of other news outlets with respect to their content appeal to all audiences; and to explain this measure by relating it to other metrics. We discover that UK tabloids and the website of the "People" magazine contain more appealing content for all audiences than broadsheet newspapers, news aggregators and newswires, and that this measure of readers' preferences correlates with a measure of linguistic subjectivity at the level of outlets.
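
A hedged sketch, not the authors' model, of a linear scoring function learned from aggregate click behaviour: clicked headlines should score above headlines shown alongside them, which can be approximated by classifying the feature difference of each clicked/non-clicked pair. The click pairs, features and headlines below are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical click data: (clicked headline, non-clicked headline shown alongside it).
pairs = [
    ("celebrity wedding shock", "central bank holds interest rates"),
    ("football star transfer rumour", "parliament debates fisheries policy"),
    ("royal baby latest pictures", "quarterly inflation figures released"),
]

vec = CountVectorizer().fit([h for p in pairs for h in p])

# Pairwise reduction: learn w so that w . x_clicked > w . x_shown, by classifying
# the feature difference of each pair (and its negation) as 1 / 0.
diffs, targets = [], []
for clicked, shown in pairs:
    d = (vec.transform([clicked]) - vec.transform([shown])).toarray()[0]
    diffs.extend([d, -d])
    targets.extend([1, 0])

model = LogisticRegression().fit(np.array(diffs), np.array(targets))
w = model.coef_[0]                       # the linear scoring function for this audience

def appeal(headline):
    return float(vec.transform([headline]).toarray()[0] @ w)

print(appeal("celebrity gossip roundup"), appeal("bond market update"))
```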