Xulong Tang
Researcher at University of Pittsburgh
Publications - 49
Citations - 666
Xulong Tang is an academic researcher at the University of Pittsburgh. His research focuses on computer science and compilers. He has an h-index of 10 and has co-authored 34 publications receiving 451 citations. Previous affiliations of Xulong Tang include Pennsylvania State University and the University of Science and Technology of China.
Papers
Proceedings ArticleDOI
Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities
Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut Kandemir, Onur Mutlu, Chita R. Das +7 more
TL;DR: Two new runtime techniques are developed: (1) a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to the GPU cores in memory, and (2) a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on the main GPU cores and the GPU cores in memory.
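The affinity prediction idea above can be illustrated with a minimal sketch. This is not the paper's actual model: the features (`mem_intensity`, `compute_per_byte`), the training data, and the `should_offload_to_pim` helper are all hypothetical, standing in for whatever kernel characteristics and profiling data the real mechanism uses. The sketch fits a linear model mapping kernel features to an estimated PIM speedup and offloads only kernels predicted to benefit.

```python
# Hypothetical sketch of a regression-based affinity predictor.
# Feature names, training data, and threshold are illustrative only.

def fit_linear(samples, targets, lr=0.1, iters=20000):
    """Least-squares fit of y ~ w0 + w1*x1 + w2*x2 via gradient descent."""
    w = [0.0, 0.0, 0.0]
    n = len(samples)
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), y in zip(samples, targets):
            err = (w[0] + w[1] * x1 + w[2] * x2) - y
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        for i in range(3):
            w[i] -= lr * grad[i] / n
    return w

# Toy profiles: (memory intensity, compute-per-byte) -> observed PIM speedup.
train_x = [(0.9, 0.2), (0.8, 0.5), (0.2, 0.9), (0.3, 0.4), (0.7, 0.1), (0.1, 0.7)]
train_y = [1.4, 1.2, 0.5, 0.7, 1.2, 0.45]

w = fit_linear(train_x, train_y)

def should_offload_to_pim(mem_intensity, compute_per_byte):
    predicted_speedup = w[0] + w[1] * mem_intensity + w[2] * compute_per_byte
    return predicted_speedup > 1.0  # offload only if PIM is predicted to help

print(should_offload_to_pim(0.85, 0.15))  # memory-bound kernel -> True
print(should_offload_to_pim(0.15, 0.9))   # compute-bound kernel -> False
```

The design point the sketch captures is that the scheduler never measures PIM execution directly; it predicts affinity from cheap kernel features and commits each kernel to the main cores or the in-memory cores accordingly.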
Posted Content
YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design
TL;DR: This work proposes the YOLObile framework, a real-time object detection solution for mobile devices built on compression-compilation co-design, and introduces a novel block-punched pruning scheme applicable to any kernel size.
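The block-punched pruning idea can be sketched as follows. This is an illustrative toy, not the YOLObile implementation: the flattened weight layout, block size, and importance metric are assumptions. The key property shown is that within each block the same weight positions are "punched" (zeroed) across every filter, which keeps the sparsity pattern regular enough for hardware and compiler optimization.

```python
# Hypothetical sketch of block-punched pruning on a flattened weight matrix.
# Layout and importance metric are illustrative, not the paper's scheme.

def block_punched_prune(weights, block_rows, prune_ratio):
    """weights: list of rows (filters) x cols (weight positions).
    Filters are grouped into blocks of `block_rows` rows; within each
    block, the columns with the lowest summed magnitude are zeroed in
    every row of the block ("punched" through the whole block)."""
    cols = len(weights[0])
    pruned = [row[:] for row in weights]
    for start in range(0, len(pruned), block_rows):
        block = pruned[start:start + block_rows]
        # Importance of each column = total magnitude across the block.
        importance = [sum(abs(r[c]) for r in block) for c in range(cols)]
        n_prune = int(cols * prune_ratio)
        punch = sorted(range(cols), key=lambda c: importance[c])[:n_prune]
        for r in block:
            for c in punch:
                r[c] = 0.0
    return pruned

w = [[0.9, 0.1, 0.5, 0.05],
     [0.8, 0.2, 0.6, 0.01],
     [0.02, 0.7, 0.1, 0.9],
     [0.03, 0.6, 0.2, 0.8]]
out = block_punched_prune(w, block_rows=2, prune_ratio=0.5)
# Block 0 (rows 0-1) punches columns 1 and 3; block 1 punches columns 0 and 2.
```

Compared with unstructured pruning, punching whole columns per block trades a little flexibility for a pattern that a mobile compiler can exploit; compared with whole-filter pruning, the per-block granularity preserves more accuracy.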
Proceedings ArticleDOI
Controlled Kernel Launch for Dynamic Parallelism in GPUs
Xulong Tang, Ashutosh Pattnaik, Huaipan Jiang, Onur Kayiran, Adwait Jog, Sreepathi Pai, Mohamed Ibrahim, Mahmut Kandemir, Chita R. Das +8 more
TL;DR: This work proposes SPAWN, a runtime framework that controls dynamically-generated kernels, thereby directly reducing the associated launch overheads and queuing latency; SPAWN achieves 69% and 57% speedup over the flat (non-DP) implementation and baseline DP, respectively.
Proceedings ArticleDOI
Data movement aware computation partitioning
TL;DR: The potential of compiler support for exploiting NDP in emerging manycore systems is explored, and a novel compiler algorithm is proposed that partitions the computations in a given loop nest into sub-computations and schedules the resulting sub-computations on different cores, with the goal of reducing the distance-to-data on the on-chip network.
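The distance-to-data objective above can be sketched with a toy placement routine. This is not the paper's compiler algorithm: the mesh layout, the `data_home` map, and the greedy per-sub-computation assignment are all assumptions made for illustration. It only shows the core idea that each sub-computation should run on the core closest (in network hops) to the data it touches.

```python
# Illustrative sketch: on a 2D-mesh manycore, assign each sub-computation
# to the core minimizing total Manhattan distance to its data's home cores.
# All names and the greedy strategy are hypothetical.

def core_coords(core_id, mesh_width):
    """Map a linear core id to (x, y) on the mesh."""
    return core_id % mesh_width, core_id // mesh_width

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def partition(subcomputations, data_home, mesh_width, n_cores):
    """subcomputations: {name: [data items accessed]}.
    data_home: {data item: core id holding that item on chip}.
    Returns {name: core id} minimizing each task's distance-to-data."""
    schedule = {}
    for name, items in subcomputations.items():
        best = min(
            range(n_cores),
            key=lambda c: sum(
                manhattan(core_coords(c, mesh_width),
                          core_coords(data_home[d], mesh_width))
                for d in items))
        schedule[name] = best
    return schedule

subs = {"S0": ["A[0]", "B[0]"], "S1": ["A[3]", "B[3]"]}
home = {"A[0]": 0, "B[0]": 1, "A[3]": 3, "B[3]": 2}
print(partition(subs, home, mesh_width=2, n_cores=4))
```

Note the sketch places each sub-computation independently and ignores load balance; a real compiler pass would also have to weigh core utilization and data reuse across sub-computations.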
Proceedings ArticleDOI
Opportunistic computing in GPU architectures
Ashutosh Pattnaik, Xulong Tang, Onur Kayiran, Adwait Jog, Asit K. Mishra, Mahmut Kandemir, Anand Sivasubramaniam, Chita R. Das +7 more
TL;DR: This paper develops two offloading techniques, called LLC-Compute and Omni-Compute, which employ simple bookkeeping hardware to enable GPU cores to compute instructions offloaded by other GPU cores.