
Showing papers by "John Platt" published in 2009


Patent
23 Jan 2009
TL;DR: A method of identifying a malware file using multiple classifiers is disclosed, which includes receiving a file at a client computer and applying a set of metadata classifier weights to the file's static metadata to generate a first classifier output.
Abstract: A method of identifying a malware file using multiple classifiers is disclosed. The method includes receiving a file at a client computer. The file includes static metadata. A set of metadata classifier weights is applied to the static metadata to generate a first classifier output. A dynamic classifier is initiated to evaluate the file and to generate a second classifier output. The method includes automatically identifying the file as potential malware based on at least the first classifier output and the second classifier output.

75 citations
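
The two-classifier flow in the abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the feature names, the weight values, the logistic squashing, the behavior list used as a stand-in for the dynamic classifier, and the equal-weight averaging rule.

```python
import math

# Hypothetical weights for static metadata features (illustrative values only).
METADATA_WEIGHTS = {"is_signed": -2.0, "packed": 1.5}
BIAS = -1.0

# Hypothetical behaviors a dynamic (run-time) evaluation might flag.
SUSPICIOUS_BEHAVIORS = {"writes_autorun_key", "injects_into_process", "disables_av"}


def metadata_score(metadata):
    """First classifier output: linear weights on static metadata, squashed to (0, 1)."""
    z = BIAS + sum(METADATA_WEIGHTS.get(name, 0.0) * value
                   for name, value in metadata.items())
    return 1.0 / (1.0 + math.exp(-z))


def dynamic_score(observed_behaviors):
    """Second classifier output: fraction of the flagged behaviors that were observed."""
    hits = SUSPICIOUS_BEHAVIORS & set(observed_behaviors)
    return len(hits) / len(SUSPICIOUS_BEHAVIORS)


def is_potential_malware(metadata, observed_behaviors, threshold=0.5):
    """Flag the file based on both outputs (here, a simple average; an assumption)."""
    combined = 0.5 * metadata_score(metadata) + 0.5 * dynamic_score(observed_behaviors)
    return combined >= threshold
```

For example, `is_potential_malware({"is_signed": 0, "packed": 1}, ["disables_av", "injects_into_process", "writes_autorun_key"])` flags the file, while a signed, unpacked file with no observed behaviors is not flagged under these assumed weights.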


Patent
John Platt1, Ilya Sutskever1
19 Jun 2009
TL;DR: A method of creating translingual text representations takes in documents in a first language and in a second language and creates a matrix using the words in the documents to represent which words are present in which language.
Abstract: A method of creating translingual text representations takes in documents in a first language and in a second language and creates a matrix using the words in the documents to represent which words are present in which language. An algorithm is applied to each matrix such that like documents are placed close to each other and unlike documents are moved far from each other.

13 citations


01 Jan 2009
TL;DR: This paper investigates automated traffic in the query stream of a large search engine provider, and develops many different features that distinguish between queries generated by people searching for information, and those generated by automated processes.
Abstract: As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third party systems interact with search engines for a variety of reasons, such as monitoring a web site’s rank, augmenting online games, or possibly to maliciously alter click-through rates. In this paper, we investigate automated traffic (sometimes referred to as bot traffic) in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by automated means. We then develop many different features that distinguish between queries generated by people searching for information, and those generated by automated processes. We categorize these features into two classes, either as interpretations of the physical model of human interactions or as behavioral patterns of automated interactions. Using these detection features, we next classify the query stream using multiple binary classifiers. In addition, a multiclass classifier is then developed to identify subclasses of both normal and automated traffic. An active learning algorithm is used to suggest which user sessions to label to improve the accuracy of the multiclass classifier, while also seeking to discover new classes of automated traffic. Performance analyses are then provided. Finally, the multiclass classifier is used to predict the subclass distribution for the search query stream.

11 citations


Book Chapter
01 Jan 2009
TL;DR: This paper investigates automated traffic in the query stream of a large search engine provider, and develops many different features that distinguish between queries generated by people searching for information, and those generated by automated processes.
Abstract: As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third party systems interact with search engines for a variety of reasons, such as monitoring a web site’s rank, augmenting online games, or possibly to maliciously alter click-through rates. In this paper, we investigate automated traffic (sometimes referred to as bot traffic) in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by automated means. We then develop many different features that distinguish between queries generated by people searching for information, and those generated by automated processes. We categorize these features into two classes, either as interpretations of the physical model of human interactions or as behavioral patterns of automated interactions. Using these detection features, we next classify the query stream using multiple binary classifiers. In addition, a multiclass classifier is then developed to identify subclasses of both normal and automated traffic. An active learning algorithm is used to suggest which user sessions to label to improve the accuracy of the multiclass classifier, while also seeking to discover new classes of automated traffic. Performance analyses are then provided. Finally, the multiclass classifier is used to predict the subclass distribution for the search query stream.

6 citations


Patent
26 May 2009
TL;DR: Indicative features are determined by selecting a first set of features with a document frequency process and a second set with a boosting process, then combining the two sets so that documents can be classified on the combined features.
Abstract: Determining indicative features may be provided. First, a first set of features may be determined using a document frequency process. Then a second set of features may be determined using a boosting process. Using the boosting process may comprise using an approximation for a one-dimensional optimization. The approximation may include an upper bound. Next, the first set of features and the second set of features may be combined into a combined set of features. The combined set of features may comprise a union of the first set of features and the second set of features. At least one document may then be classified based on the combined set of features.

2 citations
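
The combination step in the abstract above (a union of two feature sets, then classification on the combined set) can be sketched as follows. This is a sketch under stated assumptions, not the patented method: the document frequency process is assumed to be a simple top-k selection, the boosting process and its upper-bound approximation are not reproduced (the second set is passed in as given), and all function names are hypothetical.

```python
from collections import Counter


def top_by_document_frequency(docs, k):
    """First feature set: the k terms that occur in the most documents (assumed rule)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return {term for term, _ in ranked[:k]}


def combine_feature_sets(first, second):
    """Combined set: the union of the two feature sets."""
    return first | second


def to_feature_vector(tokens, features):
    """Binary presence vector over the combined features, in sorted order."""
    present = set(tokens)
    return [1 if feature in present else 0 for feature in sorted(features)]


docs = [["cheap", "pills", "now"], ["cheap", "offer"], ["meeting", "now"]]
first_set = top_by_document_frequency(docs, k=2)
second_set = {"offer", "now"}  # stand-in for the output of a boosting process
combined = combine_feature_sets(first_set, second_set)
vector = to_feature_vector(["cheap", "offer"], combined)
```

The resulting vector could then be passed to any document classifier; the choice of classifier is left open here, as it is in the abstract.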