Showing papers by "Andrei Z. Broder" published in 2014


Proceedings ArticleDOI
24 Feb 2014
TL;DR: This paper shows how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques, and proposes a variant of the WAND algorithm that uses the intermediate results of nearest neighbor computations to improve performance.
Abstract: The k-means clustering algorithm has a long history and a proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high-dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real-life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality. The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low-weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the intermediate results of nearest neighbor computations to improve performance.
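The "centroids picking points" step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes sparse points stored as feature-to-weight dictionaries, dot-product similarity, and a plain posting-list scan in place of the WAND variant. The function names and the parameters `top_l` (points retrieved per centroid) and `keep` (features kept after sparsifying a centroid) are hypothetical.

```python
import heapq
from collections import defaultdict


def build_inverted_index(points):
    """Map each feature to a posting list of (point_id, weight)."""
    index = defaultdict(list)
    for pid, vec in enumerate(points):
        for feat, w in vec.items():
            index[feat].append((pid, w))
    return index


def sparsify(centroid, keep):
    """Keep only the `keep` features with the largest absolute weight."""
    top = heapq.nlargest(keep, centroid.items(), key=lambda kv: abs(kv[1]))
    return dict(top)


def centroids_pick_points(points, centroids, top_l, keep):
    """One assignment step: each centroid queries the index for top_l points.

    Returns a list whose entry i is the centroid id chosen for point i,
    or None if no centroid retrieved that point.
    """
    index = build_inverted_index(points)
    best = [(float("-inf"), None)] * len(points)  # (score, centroid_id)
    for cid, centroid in enumerate(centroids):
        # Accumulate dot products by walking the posting lists of the
        # centroid's surviving features (assumption: no WAND pruning).
        scores = defaultdict(float)
        for feat, cw in sparsify(centroid, keep).items():
            for pid, pw in index.get(feat, ()):
                scores[pid] += cw * pw
        top = heapq.nlargest(top_l, scores.items(), key=lambda kv: kv[1])
        for pid, score in top:
            # A point retrieved by several centroids goes to the best scorer.
            if score > best[pid][0]:
                best[pid] = (score, cid)
    return [cid for _, cid in best]


points = [{"a": 1.0}, {"a": 0.9, "b": 0.1}, {"b": 1.0}, {"c": 1.0}]
centroids = [{"a": 1.0}, {"b": 1.0}]
assignment = centroids_pick_points(points, centroids, top_l=2, keep=2)
print(assignment)  # → [0, 0, 1, None]
```

The last point shares no feature with any centroid, so it stays unassigned (`None`), which mirrors the trade-off the abstract mentions: such points would be assigned separately once the final centroids are fixed.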

53 citations


Proceedings Article
07 Apr 2014
TL;DR: WWW 2014 is hosted in Seoul, Korea for the very first time, and international delegates can experience the advanced information technologies of Korea as well as both the contemporary culture and historical traditions of Seoul.
Abstract: It is my great pleasure to welcome you to the 23rd International World Wide Web Conference, WWW 2014. Since its inception in 1994, the International World Wide Web Conference (WWW) has become the premier forum in advancing Web technologies, introducing these technologies to the industry and to users, and promoting the development of Web standards. WWW is organized by the International World Wide Web Conference Steering Committee (IW3C2) in collaboration with local conference organizers of the host country. This year we have the privilege of hosting WWW in Seoul, Korea for the very first time. International delegates can experience the advanced information technologies of Korea as well as both the contemporary culture and historical traditions of Seoul, which has been the capital city of Korea for the past six hundred years. I hope you are pleased, as I am, with the quality of this year's program. There are 3 keynote speeches by world-class experts. The Research track presents 84 high quality papers; 107 posters of the Poster track report concise summaries of research; the Demo track shows 28 interesting demos; the Ph.D. Symposium track reports 11 papers of doctoral students; the Developers track presents 5 papers and 7 speeches focused on development experiences; the Industry track consists of 12 fascinating speeches from prominent industry leaders; the Web Science track presents 20 papers and 15 posters on novel interdisciplinary research; the W3C track is composed of sessions on the latest Web standards and emerging technologies; the Panel program consists of a plenary panel on the future of the Web and 3 panels on current issues. In addition to the tracks and special programs, workshops and tutorials are organized to report ongoing work and to provide in-depth knowledge on important subjects, respectively; 19 workshops are organized to present 128 papers and provide 25 invited speeches; 13 tutorials are organized as lectures by experts.
The content of the program is contained in the proceedings and the companion proceedings for your future reference.

20 citations


Patent
07 Mar 2014
TL;DR: A system and method determine a first location of a network accessible device, relative to a location associated with a content item provider, when a content item is displayed on the device, and a second location within a predetermined amount of time after the content item is displayed.
Abstract: System and method for determining a first location of a network accessible device from a location associated with a content item provider when a content item is displayed thereon. A second location of the network accessible device from the location associated with the content item provider is also determined during a predetermined amount of time after the content item is displayed on the network accessible device. The method and system can also operate to determine data responsive to a change in distance between the first location and the second location. The data can then be provided as a status value to be used by the content item selector.
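The change-in-distance computation can be sketched as follows. The abstract does not say how locations are represented or how the status value is encoded, so everything here is an assumption: locations are (latitude, longitude) pairs in degrees, distance is the great-circle (haversine) distance, and the status labels, the `tolerance_km` threshold, and all function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_change_km(provider, first_location, second_location):
    """Change in the device's distance from the provider between two fixes.

    Negative means the device moved closer to the provider's location.
    """
    d_first = haversine_km(provider, first_location)
    d_second = haversine_km(provider, second_location)
    return d_second - d_first


def status_value(change_km, tolerance_km=0.05):
    """Hypothetical status label derived from the change in distance."""
    if change_km < -tolerance_km:
        return "approaching"
    if change_km > tolerance_km:
        return "receding"
    return "unchanged"


provider = (0.0, 0.0)         # location associated with the content provider
first_location = (0.0, 1.0)   # device when the content item is displayed
second_location = (0.0, 0.5)  # device within the predetermined time window
change = distance_change_km(provider, first_location, second_location)
print(status_value(change))  # → approaching
```

A content item selector could consume such a label as a rough signal of whether the displayed item was followed by movement toward the provider, though the abstract leaves the actual use of the status value unspecified.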