Institution

Amazon.com

Company · Seattle, Washington, United States
About: Amazon.com is a company based in Seattle, Washington, United States. It is known for its research contributions in the topics of Computer science and Service (business). The organization has 13363 authors who have published 17317 publications, receiving 266589 citations.


Papers
Posted Content
TL;DR: signSGD is proven to converge in the large- and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct, and, unlike SGD, majority vote is proven to be robust when up to 50% of workers behave adversarially.
Abstract: Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning: signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that, unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in PyTorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework, with the parameter server housed entirely on one machine, led to a 25% reduction in time for training ResNet-50 on ImageNet when using 15 AWS p3.2xlarge machines.

121 citations
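The abstract describes the core mechanism of signSGD with majority vote: each worker transmits only the sign of its local gradient, and the server aggregates by taking the coordinate-wise majority of those signs. Below is a minimal NumPy sketch of that aggregation step, not the paper's actual distributed system; the worker gradients, learning rate, and function name are illustrative placeholders.

```python
import numpy as np

def sign_sgd_majority_vote(worker_grads, params, lr=0.01):
    """One parameter update in the spirit of signSGD with majority vote.

    worker_grads: list of local gradient vectors, one per worker.
    Each worker would transmit only np.sign(grad) (1 bit per coordinate),
    which is where the roughly 32x communication saving over float32 comes from.
    """
    # Each worker sends the sign of its gradient to the server.
    sign_votes = [np.sign(g) for g in worker_grads]
    # The server takes a coordinate-wise majority vote over the signs.
    majority = np.sign(np.sum(sign_votes, axis=0))
    # The voted sign is applied as the update direction.
    return params - lr * majority

# Toy usage: 5 workers, 4 parameters.
rng = np.random.default_rng(0)
params = np.zeros(4)
grads = [rng.normal(size=4) for _ in range(5)]
params = sign_sgd_majority_vote(grads, params)
print(params)
```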

Patent
21 Mar 2008
TL;DR: A search query term may be received (6902, 7002), and the text of a collection of electronic items stored in memory of the user device may be searched (6912, 7004) for the queried term.
Abstract: Search may be performed on a user device (104), such as a handheld electronic book reader device. A search query term may be received (6902, 7002). Text of a collection of electronic items stored in memory of the user device may be searched (6912, 7004) for the queried term. Search results may be returned (7014) identifying locations in the electronic items at which the queried term appears.

121 citations
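The patent abstract outlines a simple pipeline: receive a query term, scan the text of the items stored on the device, and return the locations where the term appears. The sketch below is only a rough illustration of that flow under assumed data structures; the function and variable names are hypothetical and a real device would likely query a prebuilt index rather than scanning every item.

```python
def search_items(items, query):
    """Return (item_id, character offset) pairs where the query term appears.

    items: mapping of item_id -> full text of an electronic item stored
    on the device (assumed structure for illustration only).
    """
    query = query.lower()
    results = []
    for item_id, text in items.items():
        lowered = text.lower()
        start = lowered.find(query)
        while start != -1:
            results.append((item_id, start))
            start = lowered.find(query, start + 1)
    return results

# Toy usage with two "electronic items".
library = {
    "book-1": "The whale surfaced. The whale dived.",
    "book-2": "No whales here.",
}
print(search_items(library, "whale"))
```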

Proceedings ArticleDOI
27 Oct 2013
TL;DR: Two novel methods are proposed: 1) a KNN-based method that considers the pairwise similarity of two reviewers based on their group-level relational information and selects the k most similar reviewers for voting; 2) a more general graph-based classification method that jointly classifies a set of reviewers based on their pairwise transaction correlations.
Abstract: With the rapid development of China's e-commerce in recent years and the underlying evolution of adversarial spamming tactics, more sophisticated spamming activities may be carried out on Chinese review websites. Empirical analysis of recently crawled product reviews from a popular Chinese e-commerce website reveals the failure of many state-of-the-art spam indicators to detect collusive spammers. Two novel methods are then proposed: 1) a KNN-based method that considers the pairwise similarity of two reviewers based on their group-level relational information and selects the k most similar reviewers for voting; 2) a more general graph-based classification method that jointly classifies a set of reviewers based on their pairwise transaction correlations. Experimental results show that both of our methods outperform the indicator-only classifiers in various settings.

121 citations
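The first method in the abstract is a k-nearest-neighbour vote: a reviewer is labelled by the majority label of the k reviewers most similar to it under a pairwise similarity built from group-level relational information. The sketch below assumes that similarity matrix is already computed and uses 0/1 spammer labels; the feature construction and the paper's graph-based variant are not reproduced.

```python
import numpy as np

def knn_vote(similarity, labels, target, k=5):
    """Classify one reviewer by majority vote of its k most similar labelled reviewers.

    similarity: (n, n) pairwise reviewer similarity matrix (assumed precomputed).
    labels: 0/1 spammer labels for labelled reviewers; -1 marks unlabelled ones.
    target: index of the reviewer to classify.
    """
    labelled = np.where(labels >= 0)[0]
    # Rank labelled reviewers by similarity to the target, most similar first.
    order = labelled[np.argsort(-similarity[target, labelled])]
    neighbours = order[:k]
    votes = labels[neighbours]
    # Majority vote; ties fall back to the non-spammer class here.
    return int(votes.sum() * 2 > len(votes))

# Toy usage: 4 reviewers, reviewer 3 unlabelled.
sim = np.array([[1.0, 0.9, 0.1, 0.8],
                [0.9, 1.0, 0.2, 0.7],
                [0.1, 0.2, 1.0, 0.3],
                [0.8, 0.7, 0.3, 1.0]])
labels = np.array([1, 1, 0, -1])
print(knn_vote(sim, labels, target=3, k=3))  # votes 1 (spammer)
```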

Patent
31 Mar 2005
TL;DR: Techniques are described for enhancing the reliability of information received from users, such as by analyzing votes or other ratings supplied by a user in order to identify fraudulent and other unreliable ratings.
Abstract: Techniques are described for enhancing the reliability of information received from users in various ways, such as by analyzing votes or other ratings supplied by a user in order to identify fraudulent and other unreliable ratings. Such analysis may be based at least in part on prior ratings submitted by the user, such as based on one or more patterns indicating that the user has a bias for or against one or more of various aspects. Unreliable ratings can then be excluded from use in various ways. This abstract is provided to comply with the rules requiring it, and is submitted with the intention that it not limit the scope of the claims.

121 citations
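The patent describes analysing a user's prior ratings for patterns that suggest bias (for example, consistently voting one way regardless of the item) and excluding ratings judged unreliable. The heuristic below is only one illustrative way to operationalise that idea; the one-sidedness test, the threshold, and all names are assumptions, not the claimed method.

```python
def filter_unreliable_ratings(ratings_by_user, min_history=5, bias_threshold=0.9):
    """Drop ratings from users whose rating history looks one-sided.

    ratings_by_user: mapping user_id -> list of (item_id, rating) with ratings
    in {-1, +1} (down-vote / up-vote). A user whose prior votes are almost all
    in one direction is treated as biased and excluded (illustrative rule only).
    """
    kept = []
    for user, history in ratings_by_user.items():
        votes = [r for _, r in history]
        if len(votes) >= min_history:
            positive_share = votes.count(1) / len(votes)
            if positive_share >= bias_threshold or positive_share <= 1 - bias_threshold:
                continue  # pattern suggests bias; exclude this user's ratings
        kept.extend((user, item, r) for item, r in history)
    return kept

# Toy usage: one balanced user, one who always up-votes.
data = {
    "alice": [("a", 1), ("b", -1), ("c", 1), ("d", -1), ("e", 1)],
    "bob": [("a", 1), ("b", 1), ("c", 1), ("d", 1), ("e", 1)],
}
print(filter_unreliable_ratings(data))  # bob's ratings are excluded
```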

Journal ArticleDOI
03 Apr 2020
TL;DR: The authors distill the internal representations of a large model such as BERT into a simplified version of it, and formulate two ways to distill such representations and various algorithms to conduct the distillation.
Abstract: Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.

120 citations
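The paper's core idea is to add a loss that matches the student's internal (hidden) representations to the teacher's, on top of the usual soft-label distillation loss. The PyTorch sketch below shows one such combined objective under simplifying assumptions: the layer pairing, the MSE matching term, the temperature, and the loss weights are illustrative choices, not the paper's specific algorithms.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5):
    """Soft-label KD loss plus an internal-representation matching term.

    student_hidden / teacher_hidden: lists of hidden-state tensors of equal
    shape for the layer pairs being matched (a projection would be needed if
    the student's hidden size differs from the teacher's).
    """
    # Standard soft-label distillation on the output distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Internal distillation: penalise mismatch between paired hidden states.
    internal_loss = sum(
        F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden)
    ) / len(student_hidden)

    return alpha * soft_loss + (1 - alpha) * internal_loss

# Toy usage with random tensors standing in for model outputs.
B, C, H = 4, 3, 8
loss = distillation_loss(
    torch.randn(B, C), torch.randn(B, C),
    [torch.randn(B, H)], [torch.randn(B, H)],
)
print(loss.item())
```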


Authors

Showing all 13498 results

Name | H-index | Papers | Citations
Jiawei Han | 168 | 1233 | 143427
Bernhard Schölkopf | 148 | 1092 | 149492
Christos Faloutsos | 127 | 789 | 77746
Alexander J. Smola | 122 | 434 | 110222
Rama Chellappa | 120 | 1031 | 62865
William F. Laurance | 118 | 470 | 56464
Andrew McCallum | 113 | 472 | 78240
Michael J. Black | 112 | 429 | 51810
David Heckerman | 109 | 483 | 62668
Larry S. Davis | 107 | 693 | 49714
Chris M. Wood | 102 | 795 | 43076
Pietro Perona | 102 | 414 | 94870
Guido W. Imbens | 97 | 352 | 64430
W. Bruce Croft | 97 | 426 | 39918
Chunhua Shen | 93 | 681 | 37468
Network Information
Related Institutions (5)
Microsoft
86.9K papers, 4.1M citations

89% related

Google
39.8K papers, 2.1M citations

88% related

Carnegie Mellon University
104.3K papers, 5.9M citations

87% related

ETH Zurich
122.4K papers, 5.1M citations

82% related

University of Maryland, College Park
155.9K papers, 7.2M citations

82% related

Performance
Metrics
No. of papers from the Institution in previous years
Year | Papers
2023 | 4
2022 | 168
2021 | 2,015
2020 | 2,596
2019 | 2,002
2018 | 1,189