
Showing papers on "Scalability" published in 2008


Proceedings ArticleDOI
26 Oct 2008
TL;DR: A factor analysis approach based on probabilistic matrix factorization to solve the data sparsity and poor prediction accuracy problems by employing both users' social network information and rating records is proposed.
Abstract: Data sparsity, scalability and prediction quality have been recognized as the three most crucial challenges that every collaborative filtering algorithm or recommender system confronts. Many existing approaches to recommender systems can neither handle very large datasets nor easily deal with users who have made very few ratings or even none at all. Moreover, traditional recommender systems assume that all the users are independent and identically distributed; this assumption ignores the social interactions or connections among users. In view of the exponential growth of information generated by online social networks, social network analysis is becoming important for many Web applications. Following the intuition that a person's social network will affect personal behaviors on the Web, this paper proposes a factor analysis approach based on probabilistic matrix factorization to solve the data sparsity and poor prediction accuracy problems by employing both users' social network information and rating records. The complexity analysis indicates that our approach can be applied to very large datasets since it scales linearly with the number of observations, while the experimental results show that our method performs much better than the state-of-the-art approaches, especially when users have made few or no ratings.
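
As a rough illustration of the linear-time factorization idea, the following NumPy sketch combines a standard matrix-factorization update with a social-regularization pull toward the average of a user's connections; the names, hyperparameters, and exact regularizer are assumptions for illustration, not the paper's model.

```python
import numpy as np

def social_pmf(ratings, trust, n_users, n_items, k=10, lr=0.01,
               lam=0.1, beta=0.5, epochs=20, seed=0):
    """Toy matrix factorization with a social regularizer (illustrative only).

    ratings: list of (user, item, value) observations
    trust:   dict mapping user -> list of socially connected users
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
    V = 0.1 * rng.standard_normal((n_items, k))   # item latent factors
    for _ in range(epochs):
        # cost per epoch is linear in the number of observed ratings
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            U[u] += lr * (err * V[i] - lam * U[u])
            V[i] += lr * (err * U[u] - lam * V[i])
        # pull each user's factors toward the average of their social circle
        for u, friends in trust.items():
            if friends:
                U[u] -= lr * beta * (U[u] - U[friends].mean(axis=0))
    return U, V

# usage: predicted rating for user 0 on item 1
U, V = social_pmf([(0, 1, 4.0), (1, 1, 5.0), (1, 2, 3.0)],
                  {0: [1], 1: [0]}, n_users=2, n_items=3)
print(U[0] @ V[1])
```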

1,395 citations


Proceedings ArticleDOI
22 Sep 2008
TL;DR: A highly efficient and provably secure Provable Data Possession (PDP) technique based entirely on symmetric key cryptography is proposed; unlike its predecessors, it allows outsourcing of dynamic data, i.e., it supports operations such as block modification, deletion and append.
Abstract: Storage outsourcing is a rising trend which prompts a number of interesting security issues, many of which have been extensively investigated in the past. However, Provable Data Possession (PDP) is a topic that has only recently appeared in the research literature. The main issue is how to frequently, efficiently and securely verify that a storage server is faithfully storing its client's (potentially very large) outsourced data. The storage server is assumed to be untrusted in terms of both security and reliability. (In other words, it might maliciously or accidentally erase hosted data; it might also relegate it to slow or off-line storage.) The problem is exacerbated by the client being a small computing device with limited resources. Prior work has addressed this problem using either public key cryptography or requiring the client to outsource its data in encrypted form. In this paper, we construct a highly efficient and provably secure PDP technique based entirely on symmetric key cryptography, while not requiring any bulk encryption. Also, in contrast with its predecessors, our PDP technique allows outsourcing of dynamic data, i.e., it efficiently supports operations such as block modification, deletion and append.
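
The Python sketch below conveys the general flavor of a symmetric-key possession check: the client precomputes HMAC tokens over pseudo-randomly chosen blocks before outsourcing, then later challenges the server to reproduce those blocks. It is a simplified toy under assumed names, not the paper's actual protocol.

```python
import hmac, hashlib, os

def precompute_tokens(key, blocks, n_challenges, blocks_per_challenge=3):
    """Client side: before outsourcing, precompute one token per future challenge."""
    tokens = []
    for c in range(n_challenges):
        # pseudo-randomly pick which blocks this challenge will cover
        idx = [int.from_bytes(hmac.new(key, f"{c}:{j}".encode(),
                                       hashlib.sha256).digest(), "big") % len(blocks)
               for j in range(blocks_per_challenge)]
        mac = hmac.new(key, b"".join(blocks[i] for i in idx), hashlib.sha256).digest()
        tokens.append((idx, mac))
    return tokens

def server_response(blocks, idx):
    """Server side: return the challenged blocks so the client can verify possession."""
    return b"".join(blocks[i] for i in idx)

# usage: client verifies possession for challenge 0
key = os.urandom(32)
blocks = [os.urandom(64) for _ in range(10)]
tokens = precompute_tokens(key, blocks, n_challenges=5)
idx, mac = tokens[0]
proof = server_response(blocks, idx)
assert hmac.compare_digest(mac, hmac.new(key, proof, hashlib.sha256).digest())
```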

1,146 citations


Journal ArticleDOI
17 Aug 2008
TL;DR: The experiments demonstrated that P4P either improves or maintains the same level of application performance of native P2P applications, while, at the same time, it substantially reduces network provider cost compared with either native or latency-based localized P2P applications.
Abstract: As peer-to-peer (P2P) emerges as a major paradigm for scalable network application design, it also exposes significant new challenges in achieving efficient and fair utilization of Internet network resources. Being largely network-oblivious, many P2P applications may lead to inefficient network resource usage and/or low application performance. In this paper, we propose a simple architecture called P4P to allow for more effective cooperative traffic control between applications and network providers. We conducted extensive simulations and real-life experiments on the Internet to demonstrate the feasibility and effectiveness of P4P. Our experiments demonstrated that P4P either improves or maintains the same level of application performance of native P2P applications, while, at the same time, it substantially reduces network provider cost compared with either native or latency-based localized P2P applications.

769 citations


Journal ArticleDOI
01 Jun 2008
TL;DR: This work explores more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count, to achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads on a quad-core processor.
Abstract: Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on our memory-intensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8% performance improvement over our 3D-stacked memory architecture.
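
The Vector Bloom Filter itself is not specified in the abstract; as background only, a plain Bloom filter like the sketch below could let a miss-handling architecture cheaply test whether an address probably already has an outstanding MSHR entry. All names here are illustrative assumptions, not the paper's structure.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter sketch (the paper's Vector Bloom Filter is a variant
    whose details are not given in the abstract)."""
    def __init__(self, m=1024, k=3):
        self.bits = bytearray(m)   # m-bit array
        self.m, self.k = m, k      # k independent hash probes
    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = 1
    def __contains__(self, key):
        return all(self.bits[h] for h in self._hashes(key))

# usage: track outstanding miss addresses
bf = BloomFilter()
bf.add(0x1f40)
print(0x1f40 in bf, 0x2000 in bf)   # True, (almost certainly) False
```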

679 citations


Journal ArticleDOI
17 Aug 2008
TL;DR: The challenges and the architectural design issues of a large-scale P2P-VoD system based on the experiences of a real system deployed by PPLive are discussed and a number of results on user behavior, various system performance metrics, including user satisfaction, are presented.
Abstract: P2P file downloading and streaming have already become very popular Internet applications. These systems dramatically reduce the server loading, and provide a platform for scalable content distribution, as long as there is interest for the content. P2P-based video-on-demand (P2P-VoD) is a new challenge for the P2P technology. Unlike streaming live content, P2P-VoD has less synchrony in the users sharing video content, therefore it is much more difficult to alleviate the server loading and at the same time maintaining the streaming performance. To compensate, a small storage is contributed by every peer, and new mechanisms for coordinating content replication, content discovery, and peer scheduling are carefully designed. In this paper, we describe and discuss the challenges and the architectural design issues of a large-scale P2P-VoD system based on the experiences of a real system deployed by PPLive. The system is also designed and instrumented with monitoring capability to measure both system and component specific performance metrics (for design improvements) as well as user satisfaction. After analyzing a large amount of collected data, we present a number of results on user behavior, various system performance metrics, including user satisfaction, and discuss what we observe based on the system design. The study of a real life system provides valuable insights for the future development of P2P-VoD technology.

618 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: Spade is the System S declarative stream processing engine that allows developers to construct their applications with fine granular stream operators without worrying about the performance implications that might exist, even in a distributed system.
Abstract: In this paper, we present Spade - the System S declarative stream processing engine. System S is a large-scale, distributed data stream processing middleware under development at IBM T. J. Watson Research Center. As a front-end for rapid application development for System S, Spade provides (1) an intermediate language for flexible composition of parallel and distributed data-flow graphs, (2) a toolkit of type-generic, built-in stream processing operators, that support scalar as well as vectorized processing and can seamlessly inter-operate with user-defined operators, and (3) a rich set of stream adapters to ingest/publish data from/to outside sources. More importantly, Spade automatically brings performance optimization and scalability to System S applications. To that end, Spade employs a code generation framework to create highly-optimized applications that run natively on the Stream Processing Core (SPC), the execution and communication substrate of System S, and take full advantage of other System S services. Spade allows developers to construct their applications with fine granular stream operators without worrying about the performance implications that might exist, even in a distributed system. Spade's optimizing compiler automatically maps applications into appropriately sized execution units in order to minimize communication overhead, while at the same time exploiting available parallelism. By virtue of the scalability of the System S runtime and Spade's effective code generation and optimization, we can scale applications to a large number of nodes. Currently, we can run Spade jobs on ≈ 500 processors within more than 100 physical nodes in a tightly connected cluster environment. Spade has been in use at IBM Research to create real-world streaming applications, ranging from monitoring financial market feeds to radio telescopes to semiconductor fabrication lines.

527 citations


Proceedings ArticleDOI
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang
23 Oct 2008
TL;DR: Through empirical study on a large dataset of 802,939 Web pages and 1,021,107 tags, it is demonstrated that PFP can achieve virtually linear speedup and is promising for supporting query recommendation for search engines.
Abstract: Frequent itemset mining (FIM) is a useful tool for discovering frequently co-occurrent items. Since its inception, a number of significant FIM algorithms have been developed to speed up mining performance. Unfortunately, when the dataset size is huge, both the memory use and computational cost can still be prohibitively expensive. In this work, we propose to parallelize the FP-Growth algorithm (we call our parallel algorithm PFP) on distributed machines. PFP partitions computation in such a way that each machine executes an independent group of mining tasks. Such partitioning eliminates computational dependencies between machines, and thereby communication between them. Through empirical study on a large dataset of 802,939 Web pages and 1,021,107 tags, we demonstrate that PFP can achieve virtually linear speedup. Besides scalability, the empirical study demonstrates that PFP is promising for supporting query recommendation for search engines.
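
A hedged sketch of the partitioning idea: items are hashed into groups, and each transaction is projected into group-dependent prefixes so that every group can be mined by FP-Growth independently on a separate machine. Items are assumed pre-sorted by descending frequency; function names are hypothetical and no Hadoop specifics are shown.

```python
from collections import defaultdict

def partition_transactions(transactions, n_groups):
    """Project each transaction into per-group shards that can be mined independently.
    Assumes each transaction's items are already ordered by descending frequency."""
    group_of = lambda item: hash(item) % n_groups
    shards = defaultdict(list)
    for t in transactions:
        seen = set()
        # scan right-to-left: for each group, emit the prefix ending at its last item
        for i in range(len(t) - 1, -1, -1):
            g = group_of(t[i])
            if g not in seen:
                seen.add(g)
                shards[g].append(t[: i + 1])
    return shards  # shard g would be shipped to machine g and mined locally

# usage
shards = partition_transactions([["a", "b", "c"], ["b", "c"], ["a", "c"]], n_groups=2)
for g, txns in shards.items():
    print(g, txns)
```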

472 citations


Proceedings ArticleDOI
30 Oct 2008
TL;DR: An outdoors augmented reality system for mobile phones that matches camera-phone images against a large database of location-tagged images using a robust image retrieval algorithm and shows a smart-phone implementation that achieves a high image matching rate while operating in near real-time.
Abstract: We have built an outdoors augmented reality system for mobile phones that matches camera-phone images against a large database of location-tagged images using a robust image retrieval algorithm. We avoid network latency by implementing the algorithm on the phone and deliver excellent performance by adapting a state-of-the-art image retrieval algorithm based on robust local descriptors. Matching is performed against a database of highly relevant features, which is continuously updated to reflect changes in the environment. We achieve fast updates and scalability by pruning irrelevant features based on proximity to the user. By compressing and incrementally updating the features stored on the phone we make the system amenable to low-bandwidth wireless connections. We demonstrate system robustness on a dataset of location-tagged images and show a smart-phone implementation that achieves a high image matching rate while operating in near real-time.

406 citations


Book ChapterDOI
01 Jan 2008
TL;DR: The major features of the Vampir tool-set are described and the underlying implementation that is necessary to provide low overhead and good scalability is outlined.
Abstract: This paper presents the Vampir tool-set for performance analysis of parallel applications. It consists of the run-time measurement system VampirTrace and the visualization tools Vampir and VampirServer. It describes the major features and outlines the underlying implementation that is necessary to provide low overhead and good scalability. Furthermore, it gives a short overview about the development history and future work as well as related work.

359 citations


Journal ArticleDOI
Werner Vogels
TL;DR: At the foundation of Amazon’s cloud computing are infrastructure services such as Amazon's S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications.
Abstract: At the foundation of Amazon’s cloud computing are infrastructure services such as Amazon’s S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud) that provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are very strict; they need to score high marks in the areas of security, scalability, availability, performance, and cost effectiveness, and they need to meet these requirements while serving millions of customers around the globe, continuously.

356 citations


Journal ArticleDOI
TL;DR: Stage’s scalability is examined to suggest that it may be useful for swarm robotics researchers who would otherwise use custom simulators, with their attendant disadvantages in terms of code reuse and transparency.
Abstract: Stage is a C++ software library that simulates multiple mobile robots. Stage version 2, as the simulation backend for the Player/Stage system, may be the most commonly used robot simulator in research and university teaching today. Development of Stage version 3 has focused on improving scalability, usability, and portability. This paper examines Stage’s scalability.

Proceedings ArticleDOI
22 Aug 2008
TL;DR: Monsoon, a new network architecture that scales and commoditizes data center networking, is described. Monsoon realizes a simple mesh-like architecture using programmable commodity layer-2 switches and servers, which creates a huge, flexible switching domain, supporting any server/any service and unfragmented server capacity at low cost.
Abstract: Applications hosted in today's data centers suffer from internal fragmentation of resources, rigidity, and bandwidth constraints imposed by the architecture of the network connecting the data center's servers. Conventional architectures statically map web services to Ethernet VLANs, each constrained in size to a few hundred servers owing to control plane overheads. The IP routers used to span traffic across VLANs and the load balancers used to spray requests within a VLAN across servers are realized via expensive customized hardware and proprietary software. Bisection bandwidth is low, severely constraining distributed computation. Further, the conventional architecture concentrates traffic in a few pieces of hardware that must be frequently upgraded and replaced to keep pace with demand - an approach that directly contradicts the prevailing philosophy in the rest of the data center, which is to scale out (adding more cheap components) rather than scale up (adding more power and complexity to a small number of expensive components). Commodity switching hardware is now becoming available with programmable control interfaces and with very high port speeds at very low port cost, making this the right time to redesign the data center networking infrastructure. In this paper, we describe Monsoon, a new network architecture which scales and commoditizes data center networking. Monsoon realizes a simple mesh-like architecture using programmable commodity layer-2 switches and servers. In order to scale to 100,000 servers or more, Monsoon makes modifications to the control plane (e.g., source routing) and to the data plane (e.g., hot-spot free multipath routing via Valiant Load Balancing). It disaggregates the function of load balancing into a group of regular servers, with the result that load balancing server hardware can be distributed amongst racks in the data center leading to greater agility and less fragmentation. The architecture creates a huge, flexible switching domain, supporting any server/any service and unfragmented server capacity at low cost.
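
As a toy illustration of the Valiant Load Balancing ingredient, the sketch below bounces each flow off a pseudo-randomly chosen intermediate switch so that no single link becomes a hot spot. The topology and names are hypothetical, not Monsoon's actual addressing or routing machinery.

```python
import random

def vlb_path(src_tor, dst_tor, intermediate_switches, flow_id):
    """Valiant Load Balancing sketch: route a flow via a random intermediate switch."""
    rng = random.Random(flow_id)             # per-flow choice keeps a flow's packets in order
    via = rng.choice(intermediate_switches)  # random intermediate hop spreads load
    return [src_tor, via, dst_tor]

# usage: two flows between the same racks may take different intermediate hops
print(vlb_path("tor-1", "tor-7", ["core-a", "core-b", "core-c"], flow_id=42))
print(vlb_path("tor-1", "tor-7", ["core-a", "core-b", "core-c"], flow_id=43))
```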

Proceedings ArticleDOI
09 Jun 2008
TL;DR: A new schema-mapping technique for multi-tenancy called Chunk Folding is described, where the logical tables are vertically partitioned into chunks that are folded together into different physical multi-tenant tables and joined as needed.
Abstract: In the implementation of hosted business services, multiple tenants are often consolidated into the same database to reduce total cost of ownership. Common practice is to map multiple single-tenant logical schemas in the application to one multi-tenant physical schema in the database. Such mappings are challenging to create because enterprise applications allow tenants to extend the base schema, e.g., for vertical industries or geographic regions. Assuming the workload stays within bounds, the fundamental limitation on scalability for this approach is the number of tables the database can handle. To get good consolidation, certain tables must be shared among tenants and certain tables must be mapped into fixed generic structures such as Universal and Pivot Tables, which can degrade performance. This paper describes a new schema-mapping technique for multi-tenancy called Chunk Folding, where the logical tables are vertically partitioned into chunks that are folded together into different physical multi-tenant tables and joined as needed. The database's "meta-data budget" is divided between application-specific conventional tables and a large fixed set of generic structures called Chunk Tables. Good performance is obtained by mapping the most heavily-utilized parts of the logical schemas into the conventional tables and the remaining parts into Chunk Tables that match their structure as closely as possible. We present the results of several experiments designed to measure the efficacy of Chunk Folding and describe the multi-tenant database testbed in which these experiments were performed.
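
A minimal sketch of the folding idea, under assumed names: heavily used base columns go to a conventional table, while tenant-specific extension columns are folded into rows of a generic Chunk Table keyed by (tenant, row, chunk number) and joined back as needed.

```python
def fold_row(tenant_id, row_id, row, base_columns, chunk_size=2):
    """Split one logical row into a conventional-table part and generic chunk rows.
    Column and table names are hypothetical, for illustration only."""
    base = {c: row[c] for c in base_columns if c in row}
    ext = [(k, v) for k, v in row.items() if k not in base_columns]
    chunks = []
    for chunk_no, start in enumerate(range(0, len(ext), chunk_size)):
        pairs = ext[start:start + chunk_size]
        chunks.append((tenant_id, row_id, chunk_no, pairs))
    return base, chunks

# usage: a tenant extends the base Account schema with two custom fields
base, chunks = fold_row(42, 7,
                        {"name": "ACME", "city": "Kiel", "rating": "A", "region": "EU"},
                        base_columns=["name", "city"])
print(base)    # goes to the conventional Account table
print(chunks)  # goes to a generic Chunk Table, joined back on (tenant, row)
```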

Proceedings ArticleDOI
09 Jun 2008
TL;DR: The purpose of this paper is to demonstrate the opportunities and limitations of using S3 as a storage system for general-purpose database applications which involve small objects and frequent updates.
Abstract: There has been a great deal of hype about Amazon's simple storage service (S3). S3 provides infinite scalability and high availability at low cost. Currently, S3 is used mostly to store multi-media documents (videos, photos, audio) which are shared by a community of people and rarely updated. The purpose of this paper is to demonstrate the opportunities and limitations of using S3 as a storage system for general-purpose database applications which involve small objects and frequent updates. Read, write, and commit protocols are presented. Furthermore, the cost ($), performance, and consistency properties of such a storage system are studied.

Book ChapterDOI
07 Jul 2008
TL;DR: A new join operation for the separation domain is reported that aggressively abstracts information for scalability yet does not lead to false error reports.
Abstract: Pointer safety faults in device drivers are one of the leading causes of crashes in operating systems code. In principle, shape analysis tools can be used to prove the absence of this type of error. In practice, however, shape analysis is not used due to the unacceptable mixture of scalability and precision provided by existing tools. In this paper we report on a new join operation ⊔† for the separation domain which aggressively abstracts information for scalability yet does not lead to false error reports. ⊔† is a critical piece of a new shape analysis tool that provides an acceptable mixture of scalability and precision for industrial application. Experiments on whole Windows and Linux device drivers (firewire, pci-driver, cdrom, md, etc.) represent the first working application of shape analysis to verification of whole industrial programs.

Proceedings ArticleDOI
11 Feb 2008
TL;DR: This work presents a compression scheme for the web graph specifically designed to accommodate community queries and other random access algorithms on link servers, and uses a frequent pattern mining approach to extract meaningful connectivity formations.
Abstract: A link server is a system designed to support efficient implementations of graph computations on the web graph. In this work, we present a compression scheme for the web graph specifically designed to accommodate community queries and other random access algorithms on link servers. We use a frequent pattern mining approach to extract meaningful connectivity formations. Our Virtual Node Miner achieves graph compression without sacrificing random access by generating virtual nodes from frequent itemsets in vertex adjacency lists. The mining phase guarantees scalability by bounding the pattern mining complexity to O(E log E). We facilitate global mining, relaxing the requirement for the graph to be sorted by URL, enabling discovery for both inter-domain as well as intra-domain patterns. As a consequence, the approach allows incremental graph updates. Further, it not only facilitates but can also expedite graph computations such as PageRank and local random walks by implementing them directly on the compressed graph. We demonstrate the effectiveness of the proposed approach on several publicly available large web graph data sets. Experimental results indicate that the proposed algorithm achieves a 10- to 15-fold compression on most real-world web graph data sets.
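
The core compression step can be sketched as follows: once a frequent set of shared out-links is found, a virtual node is introduced and the shared links are stored only once. This is a toy illustration of the virtual-node substitution, not of the Virtual Node Miner's pattern-mining phase.

```python
def add_virtual_node(adj, pattern, virtual_id):
    """Replace a frequent out-link set with a single virtual node in an adjacency map."""
    pattern = set(pattern)
    adj[virtual_id] = sorted(pattern)          # the shared links, stored once
    for v in list(adj):
        if v == virtual_id:
            continue
        links = adj[v]
        if pattern <= set(links):              # this vertex contains the whole pattern
            adj[v] = [x for x in links if x not in pattern] + [virtual_id]
    return adj

# usage: vertices 1 and 2 both link to {5, 6, 7}; those links are factored out
adj = {1: [5, 6, 7, 9], 2: [5, 6, 7], 3: [6]}
print(add_virtual_node(adj, [5, 6, 7], virtual_id="V0"))
```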

Journal ArticleDOI
TL;DR: An updated take on Amdahl's analytical model uses modern design constraints to analyze many-core design alternatives, providing computer architects with a better understanding of many-core design types, enabling them to make more informed tradeoffs.
Abstract: An updated take on Amdahl's analytical model uses modern design constraints to analyze many-core design alternatives. The revised models provide computer architects with a better understanding of many-core design types, enabling them to make more informed tradeoffs.
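
For reference, the symmetric-multicore case of such a revised model is commonly stated as below, where f is the parallelizable fraction, n the chip's resource budget in base-core equivalents, r the resources spent per core, and perf(r) the resulting single-core performance (often modeled as the square root of r):

```latex
\mathrm{Speedup}_{\mathrm{symmetric}}(f, n, r) \;=\;
  \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} \;+\; \dfrac{f \cdot r}{\mathrm{perf}(r)\cdot n}},
\qquad \mathrm{perf}(r) \approx \sqrt{r}.
```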

Journal ArticleDOI
01 Aug 2008
TL;DR: This paper reports on the results of an independent evaluation of the techniques presented in the VLDB 2007 paper "Scalable Semantic Web Data Management Using Vertical Partitioning", as well as a complementary analysis of state-of-the-art RDF storage solutions.
Abstract: This paper reports on the results of an independent evaluation of the techniques presented in the VLDB 2007 paper "Scalable Semantic Web Data Management Using Vertical Partitioning", authored by D. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach [1]. We revisit the proposed benchmark and examine both the data and query space coverage. The benchmark is extended to cover a larger portion of the query space in a canonical way. Repeatability of the experiments is assessed using the code base obtained from the authors. Inspired by the proposed vertically-partitioned storage solution for RDF data and the performance figures using a column-store, we conduct a complementary analysis of state-of-the-art RDF storage solutions. To this end, we employ MonetDB/SQL, a fully-functional open source column-store, and a well-known -- for its performance -- commercial row-store DBMS. We implement two relational RDF storage solutions -- triple-store and vertically-partitioned -- in both systems. This allows us to expand the scope of [1] with the performance characterization along both dimensions -- triple-store vs. vertically-partitioned and row-store vs. column-store -- individually, before analyzing their combined effects. A detailed report of the experimental test-bed, as well as an in-depth analysis of the parameters involved, clarify the scope of the solution originally presented and position the results in a broader context by covering more systems.
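
For readers unfamiliar with the compared layouts, the sketch below shows the reshaping from a single (subject, property, object) triple-store table into one two-column table per property, i.e., the vertically partitioned layout under evaluation; it is only a schematic of the storage layouts, not of either system's implementation.

```python
from collections import defaultdict

def vertically_partition(triples):
    """Reshape a triple-store into one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return tables

# usage
triples = [("art1", "author", "Abadi"), ("art1", "year", "2007"),
           ("art2", "author", "Marcus")]
for prop, rows in vertically_partition(triples).items():
    print(prop, rows)
```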

Proceedings ArticleDOI
Spiros Papadimitriou, Jimeng Sun
15 Dec 2008
TL;DR: The distributed co-clustering (DisCo) framework is proposed, which introduces practical approaches for distributed data pre-processing, and co-Clustering, and it is shown that DisCo can scale well and efficiently process and analyze extremely large datasets on commodity hardware.
Abstract: Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, graph mining. We propose the distributed co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing, and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.

Journal ArticleDOI
TL;DR: Experimental results show that the deployment of EASY on top of an existing SDP, namely Ariadne, enables rich semantic, context- and QoS-aware service discovery, which furthermore performs better than the classical, rigid, syntactic matching, and improves the scalability of Ariadne.

Proceedings ArticleDOI
Seungwoo Kang, Jinwon Lee, Hyukjae Jang, Hyonik Lee, Youngki Lee, Souneil Park, Taiwoo Park, Junehwa Song
17 Jun 2008
TL;DR: This paper presents SeeMon, a scalable and energy-efficient context monitoring framework for sensor-rich, resource-limited mobile environments, and implements and tests a prototype system that achieves a high level of scalability and energy efficiency.
Abstract: Proactively providing services to mobile individuals is essential for emerging ubiquitous applications. The major challenge in providing users with proactive services lies in continuously monitoring their contexts based on numerous sensors. The context monitoring with rich sensors imposes heavy workloads on mobile devices with limited computing and battery power. We present SeeMon, a scalable and energy-efficient context monitoring framework for sensor-rich, resource-limited mobile environments. Running on a personal mobile device, SeeMon effectively performs context monitoring involving numerous sensors and applications. On top of SeeMon, multiple applications on the device can proactively understand users' contexts and react appropriately. This paper proposes a novel context monitoring approach that provides efficient processing and sensor control mechanisms. We implement and test a prototype system on two mobile devices: a UMPC and a wearable device with a diverse set of sensors. Example applications are also developed based on the implemented system. Experimental results show that SeeMon achieves a high level of scalability and energy efficiency.

Journal ArticleDOI
TL;DR: A new algorithm and easily extensible framework for computing MS complexes for large scale data of any dimension where scalar values are given at the vertices of a closure-finite and weak topology (CW) complex, therefore enabling computation on a wide variety of meshes such as regular grids, simplicial meshes, and adaptive multiresolution (AMR) meshes is described.
Abstract: The Morse-Smale (MS) complex has proven to be a useful tool in extracting and visualizing features from scalar-valued data. However, efficient computation of the MS complex for large scale data remains a challenging problem. We describe a new algorithm and easily extensible framework for computing MS complexes for large scale data of any dimension where scalar values are given at the vertices of a closure-finite and weak topology (CW) complex, therefore enabling computation on a wide variety of meshes such as regular grids, simplicial meshes, and adaptive multiresolution (AMR) meshes. A new divide-and-conquer strategy allows for memory-efficient computation of the MS complex and simplification on-the-fly to control the size of the output. In addition to being able to handle various data formats, the framework supports implementation-specific optimizations, for example, for regular data. We present the complete characterization of critical point cancellations in all dimensions. This technique enables the topology based analysis of large data on off-the-shelf computers. In particular we demonstrate the first full computation of the MS complex for a 1 billion (1024³) node grid on a laptop computer with 2 GB of memory.

Proceedings ArticleDOI
01 Nov 2008
TL;DR: In this article, the authors proposed an intelligent method for scheduling usage of available energy storage capacity from plug-in hybrid electric vehicles (PHEV) and electric vehicles (EV). The batteries on these vehicles can either provide power to the grid when parked, known as the vehicle-to-grid (V2G) concept, or take power from the grid to charge the batteries on the vehicles.
Abstract: This paper proposes an intelligent method for scheduling usage of available energy storage capacity from plug-in hybrid electric vehicles (PHEV) and electric vehicles (EV). The batteries on these vehicles can either provide power to the grid when parked, known as the vehicle-to-grid (V2G) concept, or take power from the grid to charge the batteries on the vehicles. A scalable parking lot model is developed with different parameters assigned to fleets of vehicles. The size of the parking lot is assumed to be large enough to accommodate the number of vehicles performing grid transactions. In order to figure out the appropriate charge and discharge times throughout the day, binary particle swarm optimization is applied. Price curves from the California ISO database are used in this study to have realistic price fluctuations. Finding optimal solutions that maximize profits to vehicle owners while satisfying system and vehicle owners' constraints is the objective of this study. Different fleets of vehicles are used to approximate a varying customer base and demonstrate the scalability of parking lots for V2G. The results are compared for consistency and scalability. Discussions on how this technique can be applied to other grid issues such as peaking power are included at the end.
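
A generic textbook step of binary particle swarm optimization is sketched below, where each bit might encode a charge (0) or discharge (1) decision for a fleet in a given hour; the parameters and encoding are assumptions, not the paper's exact setup.

```python
import math, random

def bpso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """One iteration of binary PSO: update velocities, then resample each bit
    with probability sigmoid(velocity)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    for p in range(len(positions)):
        for d in range(len(positions[p])):
            v = (w * velocities[p][d]
                 + c1 * rng.random() * (pbest[p][d] - positions[p][d])
                 + c2 * rng.random() * (gbest[d] - positions[p][d]))
            velocities[p][d] = max(-4.0, min(4.0, v))   # clamp velocity
            positions[p][d] = 1 if rng.random() < sigmoid(velocities[p][d]) else 0
    return positions, velocities

# usage: 3 particles, 24 bits (e.g., one charge/discharge decision per hour)
pos = [[random.randint(0, 1) for _ in range(24)] for _ in range(3)]
vel = [[0.0] * 24 for _ in range(3)]
pos, vel = bpso_step(pos, vel, pbest=[p[:] for p in pos], gbest=pos[0][:])
```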

Proceedings ArticleDOI
25 Oct 2008
TL;DR: Experiments on a wide variety of compute-intensive loops from the multimedia domain show that EMS improves throughput by 25% over traditional iterative modulo scheduling, and achieves 98% of the throughput of simulated annealing techniques at a fraction of the compilation time.
Abstract: Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register files often organized as a two dimensional grid. The most difficult challenge in deploying CGRAs is compiler scheduling technology that can efficiently map software implementations of compute intensive loops onto the array. Traditional schedulers focus on the placement of operations in time and space. With CGRAs, the challenge of placement is compounded by the need to explicitly route operands from producers to consumers. To systematically attack this problem, we take an edge-centric approach to modulo scheduling that focuses on the routing problem as its primary objective. With edge-centric modulo scheduling (EMS), placement is a by-product of the routing process, and the schedule is developed by routing each edge in the dataflow graph. Routing cost metrics provide the scheduler with a global perspective to guide selection. Experiments on a wide variety of compute-intensive loops from the multimedia domain show that EMS improves throughput by 25% over traditional iterative modulo scheduling, and achieves 98% of the throughput of simulated annealing techniques at a fraction of the compilation time.

Proceedings ArticleDOI
23 Jun 2008
TL;DR: Harmony, a runtime supported programming and execution model that provides semantics for simplifying parallelism management, dynamic scheduling of compute intensive kernels to heterogeneous processor resources, and online monitoring driven performance optimization for heterogeneous many core systems is proposed.
Abstract: The emergence of heterogeneous many core architectures presents a unique opportunity for delivering order of magnitude performance increases to high performance applications by matching certain classes of algorithms to specifically tailored architectures. Their ubiquitous adoption, however, has been limited by a lack of programming models and management frameworks designed to reduce the high degree of complexity of software development intrinsic to heterogeneous architectures. This paper proposes Harmony, a runtime supported programming and execution model that provides: (1) semantics for simplifying parallelism management, (2) dynamic scheduling of compute intensive kernels to heterogeneous processor resources, and (3) online monitoring driven performance optimization for heterogeneous many core systems. We are particularly concerned with simplifying development and ensuring binary portability and scalability across system configurations and sizes. Initial results from ongoing development demonstrate the binary compatibility with variable number of cores, as well as dynamic adaptation of schedules to data sets. We present preliminary results of key features for some benchmark applications.

Journal ArticleDOI
TL;DR: This paper presents an algorithm for drawing a sequence of graphs online that strives to maintain the global structure of the graph and, thus, the user's mental map while allowing arbitrary modifications between consecutive layouts.
Abstract: This paper presents an algorithm for drawing a sequence of graphs online. The algorithm strives to maintain the global structure of the graph and, thus, the user's mental map while allowing arbitrary modifications between consecutive layouts. The algorithm works online and uses various execution culling methods in order to reduce the layout time and handle large dynamic graphs. Techniques for representing graphs on the GPU allow a speedup by a factor of up to 17 compared to the CPU implementation. The scalability of the algorithm across GPU generations is demonstrated. Applications of the algorithm to the visualization of discussion threads in Internet sites and to the visualization of social networks are provided.

Proceedings ArticleDOI
07 Jun 2008
TL;DR: Novel regression-based approaches to predicting parallel program scalability are explored; they use several program executions on a small subset of the processors to predict execution time on larger numbers of processors and provide accurate scaling predictions.
Abstract: Many applied scientific domains are increasingly relying on large-scale parallel computation. Consequently, many large clusters now have thousands of processors. However, the ideal number of processors to use for these scientific applications varies with both the input variables and the machine under consideration, and predicting this processor count is rarely straightforward. Accurate prediction mechanisms would provide many benefits, including improving cluster efficiency and identifying system configuration or hardware issues that impede performance. We explore novel regression-based approaches to predict parallel program scalability. We use several program executions on a small subset of the processors to predict execution time on larger numbers of processors. We compare three different regression-based techniques: one based on execution time only; another that uses per-processor information only; and a third one based on the global critical path. These techniques provide accurate scaling predictions, with median prediction errors between 6.2% and 17.3% for seven applications.
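
As a deliberately simple stand-in for the execution-time-only variant, the sketch below fits a log-log regression on runs at small processor counts and extrapolates to a larger count; the paper's actual models also exploit per-processor data and the global critical path.

```python
import numpy as np

def fit_scaling(procs, times):
    """Fit log(T) = a + b*log(p) by ordinary least squares, return a predictor."""
    X = np.column_stack([np.ones(len(procs)), np.log(procs)])
    coef, *_ = np.linalg.lstsq(X, np.log(times), rcond=None)
    return lambda p: float(np.exp(coef[0] + coef[1] * np.log(p)))

# usage: predict the run time on 1024 processors from runs on 8..64 (made-up numbers)
predict = fit_scaling([8, 16, 32, 64], [410.0, 220.0, 125.0, 80.0])
print(round(predict(1024), 1))
```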

Journal ArticleDOI
TL;DR: A novel user model is built that achieves a significant reduction in system complexity and sparsity and makes the neighbor transitivity relationship hold; computational results reveal that the proposed approaches outperform the classical one.
Abstract: The main strengths of collaborative filtering (CF), the most successful and widely used filtering technique for recommender systems, are its cross-genre or 'outside the box' recommendation ability and that it is completely independent of any machine-readable representation of the items being recommended. However, CF suffers from sparsity, scalability, and loss of neighbor transitivity. CF techniques are either memory-based or model-based. While the former is more accurate, its scalability compared to model-based is poor. An important contribution of this paper is a hybrid fuzzy-genetic approach to recommender systems that retains the accuracy of memory-based CF and the scalability of model-based CF. Using hybrid features, a novel user model is built that helped in achieving a significant reduction in system complexity and sparsity, and made the neighbor transitivity relationship hold. The user model is employed to find a set of like-minded users within which a memory-based search is carried out. This set is much smaller than the entire set, thus improving the system's scalability. Besides being scalable and compact in size, our proposed approaches outperform the classical approach, as computational results reveal.

Journal ArticleDOI
TL;DR: It is observed that the overall performance of TM is significantly worse at low levels of parallelism, which is likely to limit the adoption of this programming paradigm.
Abstract: TM (transactional memory) is a concurrency control paradigm that provides atomic and isolated execution for regions of code. TM is considered by many researchers to be one of the most promising solutions to address the problem of programming multicore processors. Its most appealing feature is that most programmers only need to reason locally about shared data accesses, mark the code region to be executed transactionally, and let the underlying system ensure the correct concurrent execution. This model promises to provide the scalability of fine-grain locking, while avoiding common pitfalls of lock composition such as deadlock. In this article we explore the performance of a highly optimized STM and observe that the overall performance of TM is significantly worse at low levels of parallelism, which is likely to limit the adoption of this programming paradigm.

Proceedings Article
01 May 2008
TL;DR: The SemanticVectors package that efficiently creates semantic vectors for words and documents from a corpus of free text articles is described, which can play an important role in furthering research in distributional semantics, and can help to significantly reduce the current gap between good research results and valuable applications in production software.
Abstract: This paper describes the open source SemanticVectors package that efficiently creates semantic vectors for words and documents from a corpus of free text articles. We believe that this package can play an important role in furthering research in distributional semantics, and (perhaps more importantly) can help to significantly reduce the current gap that exists between good research results and valuable applications in production software. Two clear principles that have guided the creation of the package so far include ease-of-use and scalability. The basic package installs and runs easily on any Java-enabled platform, and depends only on Apache Lucene. Dimension reduction is performed using Random Projection, which enables the system to scale much more effectively than other algorithms used for the same purpose. This paper also describes a trial application in the Technology Management domain, which highlights some user-centred design challenges which we believe are also key to successful deployment of this technology.