
Showing papers by "John Platt" published in 2010


Proceedings Article
09 Oct 2010
TL;DR: This work uses discriminative training to create a projection of documents from multiple languages into a single translingual vector space and evaluates these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters.
Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.
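The abstract does not spell out the OPCA computation, so the following is only a minimal sketch under stated assumptions: that OPCA is posed as a generalized eigenproblem maximizing the variance of all documents (signal) relative to the variance of the difference between comparable document pairs (noise), and that both languages are already encoded in one joint vocabulary space. The function name `opca_projection`, the regularizer `reg`, and the toy data are hypothetical.

```python
import numpy as np

def opca_projection(docs_a, docs_b, k, reg=0.1):
    """Sketch of an OPCA-style projection (assumed formulation, not the paper's code).

    docs_a[i] and docs_b[i] are term vectors of a comparable document pair,
    both expressed in the same joint vocabulary space (an assumption).
    Returns a (vocab_size, k) projection matrix.
    """
    docs_a = np.asarray(docs_a, dtype=float)
    docs_b = np.asarray(docs_b, dtype=float)

    # Signal: covariance of all documents, regardless of language.
    all_docs = np.vstack([docs_a, docs_b])
    centered = all_docs - all_docs.mean(axis=0)
    signal = centered.T @ centered / centered.shape[0]

    # Noise: covariance of the differences between paired documents.
    diff = docs_a - docs_b
    noise = diff.T @ diff / diff.shape[0]
    noise += reg * np.eye(noise.shape[0])  # keeps the noise matrix positive definite

    # Solve signal v = lambda * noise v by whitening with the Cholesky factor of noise.
    chol = np.linalg.cholesky(noise)
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ signal @ chol_inv.T
    whitened = (whitened + whitened.T) / 2  # enforce symmetry numerically

    eigvals, eigvecs = np.linalg.eigh(whitened)
    top = np.argsort(eigvals)[::-1][:k]     # k largest signal-to-noise directions
    return chol_inv.T @ eigvecs[:, top]

# Toy usage: 50 synthetic "pairs" in a 6-term joint vocabulary.
rng = np.random.default_rng(0)
lang_a = rng.normal(size=(50, 6))
lang_b = lang_a + 0.1 * rng.normal(size=(50, 6))
projection = opca_projection(lang_a, lang_b, k=2)
vectors_a = lang_a @ projection  # translingual vectors for the first language
```

Under this formulation, directions along which paired documents disagree are down-weighted, which is one way to read "encouraging comparable document pairs to have similar vector representations."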

126 citations


Patent
18 Jun 2010
TL;DR: This patent describes a method that receives a user query over a plurality of documents belonging to a particular domain and provides data for display based at least in part upon a statistical analysis of structured data derived from those documents.
Abstract: A method described herein includes an act of receiving a query from a user, wherein the query is configured to search over a plurality of documents belonging to a particular domain. The method also includes an act of providing data to the user for display on a display screen of a computing apparatus, wherein the data is provided based at least in part upon a statistical analysis undertaken with respect to structured data pertaining to the particular domain, wherein the structured data is based at least in part upon data included in the plurality of documents.

7 citations


Patent
12 May 2010
TL;DR: This patent describes a technology by which a user pastes selected data, including non-textual data, into a command line of a program; a variable name is automatically generated and inserted at the current point in the command line, where it acts as a proxy for the pasted data itself.
Abstract: Described is a technology by which a user pastes selected data into a command line of a program, including when the selected data is non-textual. Upon detecting the paste (or drop) action, a variable name is automatically generated and inserted at the current point in a command line, where it acts as a proxy for the pasted data itself. A data structure comprising the selected data or transformed data corresponding to that selected data is maintained in program storage, e.g., RAM allocated to the program. In one aspect, a handler may be used to transform the data from one format into another that may be used by a particular program. For example, text may be reformatted into an array on which the program operates. The handler may be selected from a plurality of possible handlers, including customized handlers.
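The paste mechanism described in the abstract can be illustrated with a small sketch. Everything here is hypothetical and chosen for illustration: the `CommandLine` class, the `pastedN` naming scheme, the format labels, and the `text_to_array` handler are assumptions, not the patent's actual design.

```python
from typing import Any, Callable, Dict

class CommandLine:
    """Toy command line that inserts a generated variable name on paste."""

    def __init__(self) -> None:
        self.buffer = ""                      # text currently on the command line
        self.cursor = 0                       # insertion point within the buffer
        self.storage: Dict[str, Any] = {}     # program storage holding pasted data
        self.handlers: Dict[str, Callable[[Any], Any]] = {}
        self._counter = 0

    def register_handler(self, fmt: str, handler: Callable[[Any], Any]) -> None:
        # A handler transforms data of one format into a form the program can use.
        self.handlers[fmt] = handler

    def type_text(self, text: str) -> None:
        self._insert(text)

    def paste(self, data: Any, fmt: str = "text") -> str:
        # Transform the data if a handler is registered for its format.
        handler = self.handlers.get(fmt)
        value = handler(data) if handler else data
        # Generate a variable name, keep the data in storage, insert the name.
        self._counter += 1
        name = f"pasted{self._counter}"
        self.storage[name] = value
        self._insert(name)
        return name

    def _insert(self, text: str) -> None:
        self.buffer = self.buffer[:self.cursor] + text + self.buffer[self.cursor:]
        self.cursor += len(text)

def text_to_array(text: str) -> list:
    # Example handler: reformat whitespace-separated text into a 2-D array.
    return [[float(cell) for cell in row.split()]
            for row in text.splitlines() if row.strip()]

# Usage: the pasted table appears on the command line as a variable name.
cli = CommandLine()
cli.register_handler("text", text_to_array)
cli.type_text("total = sum_rows(")
cli.paste("1 2\n3 4")
cli.type_text(")")
```

After these calls, `cli.buffer` holds `total = sum_rows(pasted1)` and `cli.storage["pasted1"]` holds the parsed array, so the name stands in for the data as the abstract describes.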

3 citations