Institution
Amazon.com
Company • Seattle, Washington, United States
About: Amazon.com is a company based in Seattle, Washington, United States. It is known for research contributions in the topics Computer science and Service (business). The organization has 13363 authors who have published 17317 publications receiving 266589 citations.
Topics: Computer science, Service (business), Service provider, Context (language use), Virtual machine
Papers
TL;DR: signSGD is proven to converge in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct, and, unlike SGD, majority vote is proven to be robust when up to 50% of workers behave adversarially.
Abstract: Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in PyTorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training ResNet-50 on ImageNet when using 15 AWS p3.2xlarge machines.
121 citations
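The majority-vote update described in the signSGD abstract above can be illustrated with a minimal sketch. This is not the authors' PyTorch system: the function names `sign` and `majority_vote_step`, the use of plain Python lists, the toy gradients, and the choice to apply no update on a tied vote are all assumptions made for illustration.

```python
from typing import List


def sign(x: float) -> int:
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def majority_vote_step(
    params: List[float], worker_grads: List[List[float]], lr: float
) -> List[float]:
    """One signSGD step with majority vote (illustrative sketch).

    Each worker contributes only the sign of each gradient coordinate.
    The server sums the signs per coordinate and moves each parameter
    against the sign of that sum. A tied vote (sum of 0) leaves the
    parameter unchanged, which is an assumption of this sketch.
    """
    dim = len(params)
    votes = [sum(sign(g[i]) for g in worker_grads) for i in range(dim)]
    return [p - lr * sign(v) for p, v in zip(params, votes)]


# Toy example: three workers, two parameters. The third worker reports an
# inverted gradient on coordinate 0 and is outvoted by the other two.
params = [1.0, -2.0]
worker_grads = [[0.5, -0.1], [0.2, -0.3], [-0.9, -0.4]]
new_params = majority_vote_step(params, worker_grads, lr=0.1)
```

In this toy run the single dissenting worker cannot change the update direction on coordinate 0, which is the intuition behind the statement that no individual worker has too much power.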
21 Mar 2008 • TL;DR: In this paper, a search query term may be received (6902, 7002) and text of a collection of electronic items stored in memory of the user device may be searched (6912, 7004) for the queried term.
Abstract: Search may be performed on a user device (104), such as a handheld electronic book reader device. A search query term may be received (6902, 7002). Text of a collection of electronic items stored in memory of the user device may be searched (6912, 7004) for the queried term. Search results may be returned (7014) identifying locations in the electronic items at which the queried term appears.
121 citations
27 Oct 2013 • TL;DR: Two novel methods are proposed: 1) a KNN-based method that considers the pairwise similarity of two reviewers based on their group-level relational information and selects k most similar reviewers for voting; 2) a more general graph-based classification method that jointly classifies a set of reviewers based on their pairwise transaction correlations.
Abstract: With the rapid development of China's e-commerce in recent years and the underlying evolution of adversarial spamming tactics, more sophisticated spamming activities may be carried out on Chinese review websites. Empirical analysis of recently crawled product reviews from a popular Chinese e-commerce website reveals the failure of many state-of-the-art spam indicators to detect collusive spammers. Two novel methods are then proposed: 1) a KNN-based method that considers the pairwise similarity of two reviewers based on their group-level relational information and selects k most similar reviewers for voting; 2) a more general graph-based classification method that jointly classifies a set of reviewers based on their pairwise transaction correlations. Experimental results show that both methods outperform the indicator-only classifiers in various settings.
121 citations
31 Mar 2005 • TL;DR: In this article, techniques for enhancing the reliability of information received from users in various ways, such as by analyzing votes or other ratings supplied by a user in order to identify fraudulent and other unreliable ratings, are described.
Abstract: Techniques are described for enhancing the reliability of information received from users in various ways, such as by analyzing votes or other ratings supplied by a user in order to identify fraudulent and other unreliable ratings. Such analysis may be based at least in part on prior ratings submitted by the user, such as based on one or more patterns indicating that the user has a bias for or against one or more of various aspects. Unreliable ratings can then be excluded from use in various ways. This abstract is provided to comply with the rules requiring it, and is submitted with the intention that it not limit the scope of the claims.
121 citations
03 Apr 2020 • TL;DR: The authors distill the internal representations of a large model such as BERT into a simplified version of it, and formulate two ways to distill such representations and various algorithms to conduct the distillation.
Abstract: Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
120 citations
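The distinction the distillation abstract above draws between matching soft-labels and matching internal representations can be sketched as two separate loss terms. This is only an illustration under stated assumptions: the function names, the temperature, the weight `alpha`, and the use of mean squared error on a single hidden vector are hypothetical choices, not the formulations from the paper.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(
    student_logits: List[float], teacher_logits: List[float], temperature: float = 2.0
) -> float:
    """Cross-entropy of the student against the teacher's softened probabilities."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def internal_loss(student_hidden: List[float], teacher_hidden: List[float]) -> float:
    """Mean squared error between same-sized hidden vectors (assumed choice)."""
    diffs = [(s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)]
    return sum(diffs) / len(diffs)


def combined_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    student_hidden: List[float],
    teacher_hidden: List[float],
    alpha: float = 0.5,
) -> float:
    """Soft-label term plus a weighted internal-representation term."""
    return soft_label_loss(student_logits, teacher_logits) + alpha * internal_loss(
        student_hidden, teacher_hidden
    )


# Toy values: the student matches the teacher's logits exactly, yet its hidden
# vector differs, so only the internal term separates the two cases.
teacher_logits, teacher_hidden = [2.0, 0.5, -1.0], [0.3, -0.7]
matched = combined_loss(teacher_logits, teacher_logits, teacher_hidden, teacher_hidden)
mismatched = combined_loss(teacher_logits, teacher_logits, [1.3, 0.3], teacher_hidden)
```

With identical logits the soft-label term is the same in both cases, so the gap between `matched` and `mismatched` comes entirely from the internal term. This mirrors the abstract's point that a student can match the soft-labels while its internal representations differ.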
Authors
Showing all 13498 results
Name | H-index | Papers | Citations |
---|---|---|---|
Jiawei Han | 168 | 1233 | 143427 |
Bernhard Schölkopf | 148 | 1092 | 149492 |
Christos Faloutsos | 127 | 789 | 77746 |
Alexander J. Smola | 122 | 434 | 110222 |
Rama Chellappa | 120 | 1031 | 62865 |
William F. Laurance | 118 | 470 | 56464 |
Andrew McCallum | 113 | 472 | 78240 |
Michael J. Black | 112 | 429 | 51810 |
David Heckerman | 109 | 483 | 62668 |
Larry S. Davis | 107 | 693 | 49714 |
Chris M. Wood | 102 | 795 | 43076 |
Pietro Perona | 102 | 414 | 94870 |
Guido W. Imbens | 97 | 352 | 64430 |
W. Bruce Croft | 97 | 426 | 39918 |
Chunhua Shen | 93 | 681 | 37468 |