CSE researchers present four papers at HPCA 2025
Four papers authored by researchers affiliated with CSE were among six papers from Michigan presented at the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA), the premier venue for new research in the area of computer architecture. This year’s conference took place March 1-5 in Las Vegas, NV.
HPCA accepted just over 100 papers for presentation this year, making CSE's four a notable share. The new research presented by CSE authors spans a range of computer architecture topics, including mobile vector processing, protocol-hardware co-design for memory security, and space microdatacenter system architecture.
The papers presented are as follows, with the names of authors affiliated with CSE in bold:
Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing *Distinguished Artifact Award Honorable Mention*
Alireza Khadem, Daichi Fujiki, Hilbert Chen, Yufeng Gu, Nishil Talati, Scott Mahlke, Reetuparna Das

Abstract: In-cache computing technology transforms existing caches into long-vector compute units and offers low-cost alternatives to building expensive vector engines for mobile CPUs. Unfortunately, existing long-vector Instruction Set Architecture (ISA) extensions, such as RISC-V Vector Extension (RVV) and Arm Scalable Vector Extension (SVE), provide only one-dimensional strided and random memory accesses. While this is sufficient for typical vector engines, it fails to effectively utilize the large Single Instruction, Multiple Data (SIMD) widths of in-cache vector engines. This is because mobile data-parallel kernels expose limited parallelism across a single dimension.
Based on our analysis of mobile vector kernels, we introduce a long-vector Multi-dimensional Vector ISA Extension (MVE) for mobile in-cache computing. MVE achieves high SIMD resource utilization and enables flexible programming by abstracting cache geometry and data layout. The proposed ISA features multi-dimensional strided and random memory accesses and efficient dimension-level masked execution to encode parallelism across multiple dimensions. Using a wide range of data-parallel mobile workloads, we demonstrate that MVE offers average performance gains of 2.9x and energy reductions of 8.8x compared to the SIMD units of a commercial mobile processor, at an area overhead of 3.6%.
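The core problem MVE targets can be illustrated with a back-of-the-envelope utilization calculation. The sketch below is a conceptual illustration only (the SIMD width, kernel shape, and numbers are hypothetical, not from the paper): a one-dimensional vector instruction can only fill lanes with the parallelism available along a single dimension, while a multi-dimensional access can fill a very wide in-cache engine with independent elements from the whole array.

```python
# Conceptual illustration (not the actual MVE ISA): why one-dimensional
# vector instructions underutilize very wide in-cache SIMD engines.

SIMD_WIDTH = 4096    # lanes of a hypothetical in-cache vector engine
ROWS, COLS = 64, 64  # a small image-like mobile kernel (hypothetical shape)

# 1D vectorization: parallelism limited to one dimension (one row at a time)
lanes_used_1d = min(COLS, SIMD_WIDTH)
util_1d = lanes_used_1d / SIMD_WIDTH

# Multi-dimensional vectorization: parallelism encoded across both dimensions,
# so all ROWS*COLS independent elements can occupy lanes at once
lanes_used_2d = min(ROWS * COLS, SIMD_WIDTH)
util_2d = lanes_used_2d / SIMD_WIDTH

print(f"1D utilization: {util_1d:.1%}")   # 1.6%
print(f"2D utilization: {util_2d:.1%}")   # 100.0%
```

With a 64-wide row feeding a 4096-lane engine, 1D vectorization leaves over 98% of the lanes idle, which is the gap a multi-dimensional extension is designed to close.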
Palermo: Improving the Performance of Oblivious Memory using Protocol-Hardware Co-Design
Haojie Ye, Yuchen Xia, Yuhan Chen, Kuan-Yu Chen, Yichao Yuan, Shuwen Deng, Baris Kasikci, Trevor Mudge, Nishil Talati
Abstract: Oblivious RAM (ORAM) hides memory access patterns, enhancing data privacy by preventing attackers from discovering sensitive information based on the sequence of memory accesses. The performance of ORAM is often limited by its inherent trade-off between security and efficiency, as concealing memory access patterns imposes significant computational and memory overhead. While prior works focus on improving ORAM performance by prefetching and eliminating ORAM requests, we find that their performance is very sensitive to workload locality behavior and that they incur additional management overhead caused by ORAM stash pressure.
This paper presents Palermo: a protocol-hardware co-design to improve ORAM performance. The key observation in Palermo is that classical ORAM protocols enforce restrictive dependencies between memory operations that result in low memory bandwidth utilization. Palermo introduces a new protocol that overlaps large portions of memory operations, both within a single ORAM request and across multiple requests, without breaking correctness and security guarantees. Subsequently, we propose an ORAM controller architecture that executes the proposed protocol to service ORAM requests. The hardware is responsible for concurrently issuing memory requests as well as imposing the necessary dependencies to ensure a consistent view of the ORAM tree across requests. Using a rich workload mix, we demonstrate that Palermo outperforms the RingORAM baseline by 2.8x, on average, incurring a negligible overhead of 5.78 mm² of area (less than 2% of a 12th-generation Intel CPU after technology scaling) and 2.14 W of power without sacrificing security. We further show that Palermo also outperforms the state-of-the-art works PageORAM, PrORAM, and IR-ORAM.
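The overlap opportunity the abstract describes comes from the structure of tree-based ORAMs. The sketch below is a minimal illustration (not Palermo's actual protocol or controller): in a Path ORAM-style scheme, serving a request touches every bucket on a root-to-leaf path, and because all of those bucket indices are computable up front from the leaf label alone, the reads carry no true data dependency on one another and could, in principle, be issued concurrently rather than serialized.

```python
# A minimal sketch (not Palermo's actual protocol): in tree-based ORAMs such
# as Path ORAM, serving one request reads every bucket on the path from the
# root to a leaf. All of those bucket indices are known as soon as the leaf
# label is known, so the reads have no true dependency on one another.

def path_buckets(leaf: int, levels: int) -> list[int]:
    """Bucket indices (heap numbering, root = 1) on the root-to-leaf path."""
    node = leaf + (1 << (levels - 1))  # the leaf's index in the bottom level
    path = []
    while node >= 1:
        path.append(node)
        node //= 2                     # step up to the parent bucket
    return path[::-1]                  # root first

# A controller that knows the whole path up front can issue all four bucket
# reads for this 4-level tree at once instead of one level at a time.
print(path_buckets(leaf=5, levels=4))  # → [1, 3, 6, 13]
```

Enforcing only the dependencies actually needed for a consistent view of the tree, rather than fully serializing path reads, is what allows higher memory bandwidth utilization.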
Architecting Space Microdatacenters: A System-level Approach
Nathaniel Bleier, Rick Eason, Michael Lembeck, Rakesh Kumar
Abstract: Server-based computing in space has been recently proposed due to potential benefits in terms of capability, latency, security, sustainability, and cost. Despite this, there has been no work asking the question: how should we architect systems for server-based computing in space when considering overall cost? This paper presents a Total Cost of Ownership (TCO)-based approach to the architecture of server-based computing systems for space (space microdatacenters, or SµDCs) for processing data produced by low Earth orbit (LEO)-based Earth observation (EO) satellites. We show that the power of compute is the primary factor in determining SµDC TCO, though the dependence is sublinear. Second, the impact of compute mass, monetary cost, and communication on TCO is relatively insignificant. Third, architectures with the highest FLOPs/W provide much higher performance per TCO dollar even if they have poor FLOPs/$. We leverage these insights to advocate extreme heterogeneity designs for SµDCs. These designs reduce SµDC TCO by 116× in spite of poor FLOPs/$ characteristics. We also show that (a) collaborative compute constellations—constellations in which EO satellites are also equipped with compute hardware—further improve SµDC TCO by 1.31 to 1.74×, (b) a distributed architecture reduces TCO by 10% over a monolithic architecture, and (c) the low monetary cost of compute can be leveraged to provide near-zero-cost compute overprovisioning, which improves an SµDC's availability significantly and supports graceful degradation. Overall, this is the first paper on cost-aware architecture and optimization of a SµDC.
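The intuition behind power dominating TCO can be made concrete with a toy cost model. The sketch below is purely illustrative (the cost model, constants, and numbers are hypothetical and not from the paper): in orbit, every watt of compute demands solar generation, batteries, and radiators, and all of that mass must be launched, so power-driven costs can dwarf the server hardware's own price.

```python
# A purely illustrative TCO sketch (the cost model and all numbers are
# hypothetical, not from the paper): power draw drives mass, and mass
# drives launch cost, so power can dominate a space datacenter's TCO.

LAUNCH_COST_PER_KG = 2000.0       # $/kg to LEO, hypothetical
POWER_SUBSYSTEM_KG_PER_KW = 40.0  # solar/battery/radiator mass per kW, hypothetical

def space_tco(server_cost: float, server_kg: float, power_kw: float) -> float:
    """Hypothetical TCO: hardware price + launch of servers + launch of power subsystem."""
    power_kg = power_kw * POWER_SUBSYSTEM_KG_PER_KW
    return server_cost + LAUNCH_COST_PER_KG * (server_kg + power_kg)

# Halving power draw saves far more than halving the server's sticker price:
base      = space_tco(server_cost=10_000, server_kg=20, power_kw=2.0)
low_power = space_tco(server_cost=10_000, server_kg=20, power_kw=1.0)
cheap_hw  = space_tco(server_cost=5_000,  server_kg=20, power_kw=2.0)
print(base - low_power)  # savings from lower power
print(base - cheap_hw)   # savings from cheaper hardware
```

In this toy model, cutting power in half saves 16× more than cutting the hardware price in half, mirroring the paper's finding that FLOPs/W matters far more than FLOPs/$ for performance per TCO dollar.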

NVMePass: A Lightweight, High-performance and Scalable NVMe Virtualization Architecture with I/O Queues Passthrough
Yiquan Chen, Zhen Jin, Yijing Wang, Yi Chen, Jiexiong Xu, Hao Yu, Jinlong Chen, Wenhai Lin, Kanghua Fang, Keyao Zhang, Chengkun Wei, Qiang Liu, Yuan Xie, Wenzhi Chen
Abstract: Most data-intensive applications currently run on NVMe storage, and virtualization is essential in cloud computing. Existing NVMe virtualization technologies are either software-based or hardware-assisted. Virtio suffers from severe performance degradation, and polling-based solutions consume too many valuable CPU resources. Hardware-assisted solutions provide high performance and no CPU usage but face the challenge of developing dedicated hardware.
In this paper, we propose NVMePass, a novel software-hardware co-designed NVMe passthrough virtualization architecture designed to achieve high performance and no CPU overhead while maintaining high scalability. The key ideas of NVMePass are NVMe I/O queue passthrough for VMs and a mechanism to ensure security. NVMePass supports DMA and interrupt remapping for VMs without hypervisor involvement, eliminating virtualization overhead and providing near-native performance. Isolation is achieved by allocating I/O queues and logical block address resources exclusively to VMs. We propose the NVMe Resource Domain (NRD) and implement it in the NVMe controller to intercept illegal I/O requests. Thus, isolation and security are fully achieved. Results from our experiments show that NVMePass provides performance comparable to VFIO, achieving 100.1%-100.5% of VFIO's IOPS. Furthermore, compared to SPDK-Vhost, NVMePass achieves 40.0% lower latency when running 150 VMs and a 68.0% improvement in OPS performance in a real-world application when running 100 VMs.
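The isolation idea in the abstract can be sketched in a few lines. The code below is a simplified software illustration (not NVMePass's actual hardware NRD implementation; the types and checks are assumptions for illustration): each VM exclusively owns a set of I/O queues and a logical block address (LBA) window, and any request issued on a foreign queue or outside the window is rejected before it reaches storage.

```python
# A simplified sketch (not NVMePass's hardware implementation) of the kind of
# per-VM check an "NVMe Resource Domain" could enforce in the controller:
# each VM exclusively owns a set of I/O queues and an LBA range, and any
# request outside its domain is rejected.

from dataclasses import dataclass

@dataclass
class ResourceDomain:
    queues: set[int]  # I/O queue IDs exclusively assigned to this VM
    lba_start: int    # first LBA the VM may access
    lba_end: int      # one past the last LBA the VM may access

def request_allowed(dom: ResourceDomain, queue_id: int, lba: int, nblocks: int) -> bool:
    """Reject I/O issued on a foreign queue or outside the VM's LBA window."""
    in_queue = queue_id in dom.queues
    in_range = dom.lba_start <= lba and lba + nblocks <= dom.lba_end
    return in_queue and in_range

vm1 = ResourceDomain(queues={1, 2}, lba_start=0, lba_end=1 << 20)
print(request_allowed(vm1, queue_id=1, lba=4096, nblocks=8))  # True
print(request_allowed(vm1, queue_id=3, lba=4096, nblocks=8))  # False: foreign queue
```

Because the check needs only the request's queue ID and LBA range, it can be evaluated per request without involving the hypervisor, which is what lets the passthrough path keep near-native performance.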