Author

Matt Crawford

Bio: Matt Crawford is an academic researcher from Fermilab. The author has contributed to research in topics: Network packet & Scheduling (computing). The author has an h-index of 8 and has co-authored 15 publications receiving 217 citations.

Papers
Journal ArticleDOI
Wenji Wu, Matt Crawford, M. Bowden
TL;DR: A mathematical model is developed to characterize the Linux packet receiving process from NIC to application, and key factors that affect Linux systems' network performance are analyzed.

71 citations
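
To make the staged receive path concrete, here is a toy Python model in the spirit of the paper's pipeline view (NIC ring buffer, softirq protocol processing, application read). All stage names, rates, and buffer sizes below are illustrative assumptions, not values from the paper.

```python
# Toy model of a staged Linux packet-receive path: NIC DMA -> softirq ->
# socket read. Numbers are assumptions for illustration only.

RING_SIZE = 256          # NIC ring buffer slots (assumed)
STAGES = {               # packets each stage can service per tick (assumed)
    "nic_dma": 120,
    "softirq": 100,      # kernel protocol processing
    "socket_recv": 80,   # application-side read rate
}

def sustainable_rate(stages):
    """The end-to-end rate is bounded by the slowest stage in the pipeline."""
    return min(stages.values())

def simulate(arrival_rate, ticks=1000):
    """Count packets dropped when the ring buffer overflows under load."""
    backlog, dropped = 0, 0
    service = sustainable_rate(STAGES)
    for _ in range(ticks):
        backlog += arrival_rate
        backlog -= min(backlog, service)
        if backlog > RING_SIZE:
            dropped += backlog - RING_SIZE
            backlog = RING_SIZE
    return dropped

print(sustainable_rate(STAGES))   # 80: socket_recv is the bottleneck stage
print(simulate(arrival_rate=90))  # sustained overload eventually drops packets
```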

Journal ArticleDOI
Wenji Wu, Phil DeMar, Matt Crawford
TL;DR: Sorting Reordered Packets with Interrupt Coalescing (SRPIC) works in the network device driver; it makes use of the interrupt coalescing mechanism to sort the reordered packets belonging to the same TCP stream in a block of packets before delivering them upward.

26 citations
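
A minimal sketch of the SRPIC idea follows, assuming a simplified Packet type: within one interrupt-coalesced batch, packets of each TCP flow are re-sorted by sequence number before being delivered upward.

```python
# Minimal sketch of the SRPIC idea: within one interrupt-coalesced batch,
# re-sort packets of the same TCP flow by sequence number before handing
# them up the stack. The Packet type and its fields are assumptions.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Packet:
    flow: tuple   # (src_ip, src_port, dst_ip, dst_port)
    seq: int      # TCP sequence number
    payload: bytes = b""

def srpic_sort(batch):
    """Return the batch with each flow's packets in sequence order,
    keeping every flow in the batch positions it originally occupied."""
    by_flow = defaultdict(list)
    for pkt in batch:
        by_flow[pkt.flow].append(pkt)
    for pkts in by_flow.values():
        pkts.sort(key=lambda p: p.seq)
    iters = {flow: iter(pkts) for flow, pkts in by_flow.items()}
    return [next(iters[pkt.flow]) for pkt in batch]

batch = [Packet(("a", 1, "b", 2), s) for s in (3000, 1000, 2000)]
print([p.seq for p in srpic_sort(batch)])  # [1000, 2000, 3000]
```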

Journal IssueDOI
Wenji Wu, Matt Crawford
TL;DR: This paper systematically describes the trip of a TCP packet from its ingress into a Linux network end system to its final delivery to the application, and proposes and test one possible solution to resolve this performance bottleneck in Linux TCP.
Abstract: Transmission control protocol (TCP) is the most widely used transport protocol on the Internet today. Over the years, especially recently, due to requirements of high bandwidth transmission, various approaches have been proposed to improve TCP performance. The Linux 2.6 kernel is now preemptible. It can be interrupted mid-task, making the system more responsive and interactive. However, we have noticed that Linux kernel preemption can interact badly with the performance of the networking subsystem. In this paper, we investigate the performance bottleneck in Linux TCP. We systematically describe the trip of a TCP packet from its ingress into a Linux network end system to its final delivery to the application; we study the performance bottleneck in Linux TCP through mathematical modelling and practical experiments; finally, we propose and test one possible solution to resolve this performance bottleneck in Linux TCP. Copyright © 2007 John Wiley & Sons, Ltd.

24 citations

Proceedings ArticleDOI
24 Sep 2007
TL;DR: Using SRM in a large international high energy physics collaboration, called WLCG, to prepare to handle the large volume of data expected when the Large Hadron Collider goes online at CERN is described.
Abstract: Storage management is one of the most important enabling technologies for large-scale scientific investigations. Having to deal with multiple heterogeneous storage and file systems is one of the major bottlenecks in managing, replicating, and accessing files in distributed environments. Storage resource managers (SRMs), named after their Web services control protocol, provide the technology needed to manage the rapidly growing distributed data volumes that result from faster and larger computational facilities. SRMs are grid storage services providing interfaces to storage resources, as well as advanced functionality such as dynamic space allocation and file management on shared storage systems. They call on transport services to bring files into their space transparently and provide effective sharing of files. SRMs are based on a common specification that emerged over time and evolved into an international collaboration. This approach of an open specification that can be used by various institutions to adapt to their own storage systems has proven to be a remarkable success: the challenge has been to provide a consistent homogeneous interface to the grid, while allowing sites to have diverse infrastructures. In particular, supporting optional features while preserving interoperability is one of the main challenges we describe in this paper. We also describe using SRM in a large international high energy physics collaboration, called WLCG, to prepare to handle the large volume of data expected when the Large Hadron Collider (LHC) goes online at CERN. This intense collaboration led to refinements and additional functionality in the SRM specification, and the development of multiple interoperating implementations of SRM for various complex multi-component storage systems.

22 citations
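
The sketch below illustrates the flavor of service an SRM provides (space reservation plus transparent staging behind a uniform interface); the class and method names are hypothetical and do not reproduce the actual SRM specification.

```python
# Hypothetical sketch of an SRM-like interface: dynamic space allocation
# and file management over shared storage. Names are illustrative only.
class StorageResourceManager:
    def __init__(self):
        self.reserved = {}   # token -> (bytes reserved, lifetime in seconds)
        self.files = {}      # logical name -> site-local path

    def reserve_space(self, token, nbytes, lifetime_s):
        """Dynamic space allocation: pin nbytes for lifetime_s seconds."""
        self.reserved[token] = (nbytes, lifetime_s)
        return token

    def prepare_to_get(self, logical_name):
        """Stage a file and hand back a transfer URL; the underlying
        storage system (disk, tape, ...) stays hidden from the client."""
        path = self.files.get(logical_name)
        if path is None:
            raise FileNotFoundError(logical_name)
        return f"gsiftp://example.org{path}"

srm = StorageResourceManager()
srm.files["/lhc/run1/evt.root"] = "/pnfs/lhc/run1/evt.root"
srm.reserve_space("tok-1", 10**9, lifetime_s=3600)
print(srm.prepare_to_get("/lhc/run1/evt.root"))
```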

Posted Content
TL;DR: An NIC with a data steering mechanism to remedy the RSS and Flow Director limitations, termed "A Transport-Friendly NIC" (A-TFN), is proposed, and experimental results demonstrate the effectiveness of A-TFN in accelerating TCP/IP performance.
Abstract: Receive side scaling (RSS) is a network interface card (NIC) technology. It provides the benefits of parallel receive processing in multiprocessing environments. However, existing RSS-enabled NICs lack a critical data steering mechanism that would automatically steer incoming network data to the same core on which its application process resides. This absence causes inefficient cache usage if an application is not running on the core on which RSS has scheduled the received traffic to be processed. In Linux systems, it cannot even ensure that packets in a TCP flow are processed by a single core, even if the interrupts for the flow are pinned to a specific core. This results in degraded performance. In this paper, we develop such a data steering mechanism in the NIC for multicore or multiprocessor systems. This data steering mechanism is mainly targeted at TCP, but it can be extended to other transport layer protocols. We term a NIC with such a data steering mechanism "A Transport Friendly NIC" (A-TFN). Experimental results have proven the effectiveness of A-TFN in accelerating TCP/IP performance.

21 citations
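
The following sketch captures the steering rule A-TFN argues for, under assumed names: a flow table maps each TCP 4-tuple to the core where the consuming process runs, with an RSS-style hash as the fallback for unknown flows.

```python
# Sketch of the data-steering idea behind A-TFN: remember which core the
# application reads each flow on, and steer later packets there. The RSS
# fallback and all names here are assumptions for illustration.
NUM_CORES = 4
flow_table = {}   # 4-tuple -> core id of the application process

def rss_hash(flow):
    """Stand-in for the NIC's Toeplitz-style receive hash."""
    return hash(flow) % NUM_CORES

def steer(flow):
    """Deliver to the application's core when known, else hash."""
    return flow_table.get(flow, rss_hash(flow))

def on_socket_read(flow, core):
    """Transport-layer feedback: record where the reader runs."""
    flow_table[flow] = core

f = ("10.0.0.1", 5001, "10.0.0.2", 80)
print(steer(f))            # unknown flow: RSS hash picks the core
on_socket_read(f, core=2)  # the app reads this socket on core 2
print(steer(f))            # 2: subsequent packets follow the app
```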


Cited by
Proceedings ArticleDOI
01 Apr 2010
TL;DR: It is shown that the implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput, and ATLAS's performance benefit increases as the number of cores increases.
Abstract: Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and to prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue, it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8% and system throughput by 8.4% compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.

439 citations
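
A compact sketch of the least-attained-service rule, with simplified request and accounting structures: each (long) period, threads are ranked by the service they have attained so far, and queued requests from the least-served thread win.

```python
# Sketch of ATLAS's core rule: rank threads by attained memory service
# and prioritize the least-served. Request format and service accounting
# are simplifications, not the paper's hardware implementation.
attained = {"t0": 0.0, "t1": 0.0, "t2": 0.0}  # service attained per thread

def new_period_ranking():
    """Least-attained-service first; because periods are long, controllers
    rarely need to exchange these totals, which keeps the scheme scalable."""
    return sorted(attained, key=lambda t: attained[t])

def schedule(request_queue):
    """Pick the queued request whose thread ranks highest (least served)."""
    rank = {t: i for i, t in enumerate(new_period_ranking())}
    return min(request_queue, key=lambda req: rank[req[0]])

attained.update({"t0": 5.0, "t1": 1.0, "t2": 3.0})
queue = [("t0", "row 3"), ("t2", "row 9"), ("t1", "row 7")]
print(schedule(queue))   # ('t1', 'row 7'): t1 has attained the least service
```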

Journal ArticleDOI
TL;DR: This evaluation shows how NetVM can compose complex network functionality from multiple pipelined VMs and still obtain throughputs up to 10 Gbps, an improvement of more than 250% compared to existing techniques that use SR-IOV for virtualized networking.
Abstract: NetVM brings virtualization to the Network by enabling high bandwidth network functions to operate at near line speed, while taking advantage of the flexibility and customization of low cost commodity servers. NetVM allows customizable data plane processing capabilities such as firewalls, proxies, and routers to be embedded within virtual machines, complementing the control plane capabilities of Software Defined Networking. NetVM makes it easy to dynamically scale, deploy, and reprogram network functions. This provides far greater flexibility than existing purpose-built, sometimes proprietary hardware, while still allowing complex policies and full packet inspection to determine subsequent processing. It does so with dramatically higher throughput than existing software router platforms. NetVM is built on top of the KVM platform and Intel DPDK library. We detail many of the challenges we have solved such as adding support for high-speed inter-VM communication through shared huge pages and enhancing the CPU scheduler to prevent overheads caused by inter-core communication and context switching. NetVM allows true zero-copy delivery of data to VMs both for packet processing and messaging among VMs within a trust boundary. Our evaluation shows how NetVM can compose complex network functionality from multiple pipelined VMs and still obtain throughputs up to 10 Gbps, an improvement of more than 250% compared to existing techniques that use SR-IOV for virtualized networking.

399 citations
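
The zero-copy pattern NetVM relies on can be sketched as descriptor passing over a shared region; real NetVM uses DPDK and shared huge pages between VMs, so the single-process Python below is only a shape-preserving illustration.

```python
# Conceptual sketch of zero-copy delivery: a packet lives once in shared
# memory, and only small (offset, length) descriptors move between stages.
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=4096)

def write_packet(offset, payload):
    """Producer places the packet payload once in shared memory."""
    shm.buf[offset:offset + len(payload)] = payload
    return (offset, len(payload))          # descriptor, not a copy

def read_packet(desc):
    """Any stage in the pipeline reads via the descriptor: no copy."""
    offset, length = desc
    return bytes(shm.buf[offset:offset + length])

desc = write_packet(0, b"example-payload")
print(read_packet(desc))
shm.close(); shm.unlink()
```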

Proceedings ArticleDOI
17 Aug 2015
TL;DR: A soft-edge load balancing scheme called Presto is designed and implemented; its performance closely tracks that of a single, non-blocking switch over many workloads, and it is adaptive to failures and topology asymmetry.
Abstract: Datacenter networks deal with a variety of workloads, ranging from latency-sensitive small flows to bandwidth-hungry large flows. Load balancing schemes based on flow hashing, e.g., ECMP, cause congestion when hash collisions occur and can perform poorly in asymmetric topologies. Recent proposals to load balance the network require centralized traffic engineering, multipath-aware transport, or expensive specialized hardware. We propose a mechanism that avoids these limitations by (i) pushing load-balancing functionality into the soft network edge (e.g., virtual switches) such that no changes are required in the transport layer, customer VMs, or networking hardware, and (ii) load balancing on fine-grained, near-uniform units of data (flowcells) that fit within end-host segment offload optimizations used to support fast networking speeds. We design and implement such a soft-edge load balancing scheme, called Presto, and evaluate it on a 10 Gbps physical testbed. We demonstrate the computational impact of packet reordering on receivers and propose a mechanism to handle reordering in the TCP receive offload functionality. Presto's performance closely tracks that of a single, non-blocking switch over many workloads and is adaptive to failures and topology asymmetry.

250 citations
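
A small sketch of the flowcell idea, with assumed path count and cell size: consecutive near-uniform chunks of a flow are spread across equal-cost paths instead of hashing the whole flow onto one.

```python
# Sketch of Presto-style flowcells: slice a flow into near-uniform units
# (64 KB here, matching typical segment-offload sizes) and spread the
# cells round-robin across paths. Cell size and path count are assumed.
FLOWCELL = 64 * 1024
PATHS = 4

def flowcells(flow_bytes):
    """Yield (path, start, end) for each flowcell of a flow, assigning
    consecutive cells to paths instead of hashing the whole flow."""
    for i, start in enumerate(range(0, flow_bytes, FLOWCELL)):
        end = min(start + FLOWCELL, flow_bytes)
        yield i % PATHS, start, end

# A 200 KB flow crosses four paths rather than pinning to one hash bucket.
for path, start, end in flowcells(200 * 1024):
    print(f"path {path}: bytes [{start}, {end})")
```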

Proceedings Article
25 Apr 2012
TL;DR: This paper presents the eXpressive Internet Architecture (XIA), an architecture with native support for multiple principals and the ability to evolve its functionality to accommodate new, as yet unforeseen, principals over time.
Abstract: Motivated by limitations in today's host-centric IP network, recent studies have proposed clean-slate network architectures centered around alternate first-class principals, such as content, services, or users. However, much like the host-centric IP design, elevating one principal type above others hinders communication between other principals and inhibits the network's capability to evolve. This paper presents the eXpressive Internet Architecture (XIA), an architecture with native support for multiple principals and the ability to evolve its functionality to accommodate new, as yet unforeseen, principals over time. We describe key design requirements, and demonstrate how XIA's rich addressing and forwarding semantics facilitate flexibility and evolvability, while keeping core network functions simple and efficient. We describe case studies that demonstrate key functionality XIA enables.

156 citations
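
The rich addressing the abstract refers to can be sketched as a small DAG of typed identifiers with fallback edges; the node names and the routing test below are illustrative assumptions, not XIA's wire format.

```python
# Sketch of DAG-style addressing: nodes are typed identifiers (content,
# service, host). A router follows the first edge whose principal type it
# understands and otherwise takes the fallback (last) edge, so legacy
# routers that only know hosts can still make progress.
address = {
    "start":        ["CID:deadbeef", "SID:video"],  # prefer content intent
    "CID:deadbeef": [],                             # intent reached
    "SID:video":    ["HID:host42"],                 # fall back to a host
    "HID:host42":   [],
}

def forward(addr, understands):
    """Walk the DAG until a node with no outgoing edges is reached."""
    node = "start"
    while addr[node]:
        node = next((n for n in addr[node] if n.split(":")[0] in understands),
                    addr[node][-1])   # last edge acts as the fallback
    return node

print(forward(address, understands={"CID", "SID", "HID"}))  # CID:deadbeef
print(forward(address, understands={"HID"}))                # HID:host42
```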