Abstract: Technological evolution successively reduces accidental constraints and leads to machine organizations dictated by fundamental physical constraints on layout and signal speed. More specifically, machines have to be realized in three-dimensional space, with elementary building blocks and connecting elements that exhibit minimum feature sizes. Furthermore, there is a maximum speed at which information can travel. The conjunction of layout and signal-speed constraints results in lower limits to latencies between points and upper limits to bandwidths across cuts of any machine. In this scenario, communication becomes a dominant factor determining performance. Machine organizations become increasingly parallel and hierarchical, to provide the most efficient support possible for the data-processing and communication requirements of computations. At the same time, the design and implementation of algorithms must strive for increased concurrency and locality, to reduce the communication requirements arising from their execution. While, at the state of the art, we do not have a complete and systematic theory of the optimal design of machines and algorithms under the fundamental physical constraints, we do have a number of insights that can provide valuable guidance toward a unified framework for parallel and hierarchical computation. This talk will discuss some of these insights, mostly based on work coauthored by the speaker over the years, thereby offering a somewhat personal perspective rather than attempting a balanced and complete survey of the field.

For reasons reflecting historical technological tradeoffs between local communication overheads and the flight time of messages, many of the models of computation studied in computer science assume instantaneous communication, irrespective of the distance between source and destination. The latter assumption is, of course, incompatible with the upper limit on the speed of signals. Once this limit is duly recognized and incorporated into the models, a number of interesting consequences emerge, as systematically investigated in [8, 9, 10]. One significant consequence is that slack parallelism cannot be traded off against locality in a scalable way. In other words, the hierarchical nature of machines has to be explicitly taken into account for performance optimization.

The Random Access Machine (RAM) has been a cornerstone of the history of computing, providing the basis for much of the design and analysis of sequential algorithms and inspiring much of the development of uniprocessor computer architecture. However, constant-time access to any location in memory is not achievable with bounded signal speed. This is easily established theoretically and clearly witnessed in the practice of computer design, where memory latencies have become a serious bottleneck, reflected in a widening gap between the maximum number of instructions a processor can execute in unit time and the number actually executed on typical workloads. In [1, 2, 3], a systematic study has been undertaken of how to approximate the ideal RAM under layout and speed-of-signal constraints.
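A minimal sketch of the standard packing argument behind this claim, under illustrative assumptions not spelled out in the abstract (a three-dimensional layout with at least volume $\lambda^3$ per stored word, and signal speed at most $v$):

\[
  \#\{\text{words reachable within time } t\} \;\le\; \frac{(4/3)\,\pi\,(v t)^3}{\lambda^3},
  \qquad\text{hence}\qquad
  t \;\ge\; \frac{\lambda}{v}\left(\frac{3S}{4\pi}\right)^{1/3} \;=\; \Omega\!\left(S^{1/3}\right)
\]

for a memory of $S$ words. Access time must therefore grow with memory size; the exponent $1/3$ is tied to the three-dimensionality assumption, and the constants to the layout model.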
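As a concrete, purely illustrative companion to that gap (not taken from the cited studies), the following C sketch measures average load latency as the working set outgrows successive levels of the memory hierarchy; all sizes, iteration counts, and names are arbitrary choices:

    /* Pointer-chasing microbenchmark (hypothetical illustration): average
     * load latency vs. working-set size. Each load depends on the previous
     * one, so latency cannot be hidden by pipelining or overlap. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile size_t sink;   /* keeps the chase loop from being optimized away */

    static double chase_ns(const size_t *ring, size_t steps) {
        struct timespec t0, t1;
        size_t i = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++)
            i = ring[i];           /* serialized, cache-hostile loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = i;
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
    }

    int main(void) {
        srand(42);
        /* Working sets from 8 KiB to 128 MiB (8-byte words). */
        for (size_t n = 1024; n <= (size_t)1 << 24; n <<= 2) {
            size_t *ring = malloc(n * sizeof *ring);
            if (!ring) return 1;
            for (size_t k = 0; k < n; k++) ring[k] = k;
            /* Sattolo's algorithm: a random single-cycle permutation defeats
             * hardware prefetching (rand() is good enough for illustration). */
            for (size_t k = n - 1; k > 0; k--) {
                size_t j = (size_t)rand() % k, t = ring[k];
                ring[k] = ring[j]; ring[j] = t;
            }
            printf("%9zu words: %6.1f ns/access\n", n, chase_ns(ring, 10 * 1000 * 1000));
            free(ring);
        }
        return 0;
    }

On typical current machines, the measured nanoseconds per access step up markedly each time the working set crosses a cache-capacity boundary, which is precisely the hierarchical behavior that the ideal RAM abstracts away.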
Novel, highly pipelinable hierarchical memory structures as well as novel processor organizations have been proposed and analyzed in this context, but considerable work remains to be done in this direction, to fully exploit the combined potential of concurrency and locality in the memory hierarchy.

Communication requirements arise in a computation whenever an interaction is required between data produced at different places or at different times. The network and the memory system typically handle the data transfers in the two cases. Network and memory data transfers become tightly intertwined in systems built with Chip Multiprocessors (CMPs): indeed, these transfers have to compete for the same bandwidth across the chip boundary. A theory based on the quantitative notion of information exchange has been developed in [6] and applied to the optimization of dedicated machines. In particular, the theory determines the optimal split of the chip area between memory and functional units, given the information-exchange function of the target application.

While CMP organizations further open the machine design space and pose a number of interesting performance-optimization problems, it is recognized that the greatest challenge to the full exploitation of the new architectures comes from the software side. Indeed, managing parallelism and locality adds significant burdens to the process of algorithm design and implementation. Furthermore, the greater variety to be expected in the architectures and resource capacities of available platforms creates a nontrivial obstacle to the portability of software, particularly where performance is at stake.

Several models of computation have been formulated that capture the variations in the properties of both the mem-