
Research

Our mission is to advance the state of the art in computing technology and to improve the performance and efficiency of systems ranging from high-performance servers to low-power edge devices. We pursue this goal through innovative computer architecture design along with software optimizations, especially for parallel and scalable systems.

We closely collaborate with other teams in the Advanced Computing Systems Group for algorithm/software/hardware co-design research projects.

Advanced Computing Systems Group featured in POSTECH Labcumentary (before Prof. Jisung Park joined our group)
Research Areas
  • Computer architecture

  • Near-data processing

  • Systems for machine learning

  • Memory systems

  • GPU computing

  • Large-scale systems

Ongoing Projects

Architectural support for huge GPT models

Large language models (LLMs) like GPT are expected to bring transformative changes to human lives. However, the high cost of training and inference for current and next-generation GPT models poses significant challenges in designing future systems. To significantly enhance the performance and efficiency of deep learning accelerators, we are exploring algorithm-hardware co-design approaches that develop large-scale systems specialized for LLMs and realize greater performance gains from tensor sparsity than existing approaches do.
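
As a concrete (if highly simplified) illustration of why sparsity matters, the sketch below counts the multiply-accumulate (MAC) operations a sparsity-aware engine could skip in a single GEMM. The matrix sizes and the 90% unstructured sparsity level are assumptions chosen for illustration, not results or parameters from our work.

```python
# Minimal sketch: MACs a sparsity-aware dataflow could skip in one GEMM,
# assuming a hypothetical 90%-sparse weight matrix.
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 128, 512, 64                       # assumed GEMM dimensions

weights = rng.standard_normal((M, K))
weights[rng.random((M, K)) < 0.9] = 0.0      # assumed 90% sparsity
activations = rng.standard_normal((K, N))

dense_macs = M * K * N                       # MACs a dense engine performs
sparse_macs = np.count_nonzero(weights) * N  # MACs left after skipping zeros

print(f"dense MACs:  {dense_macs:,}")
print(f"sparse MACs: {sparse_macs:,} ({sparse_macs / dense_macs:.1%} of dense)")

out = weights @ activations                  # result is unchanged; only the work differs
```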

Near-data processing (NDP) in memory expanders for high-throughput DNN training in multi-GPU systems

While the compute performance of GPUs has improved significantly, the memory system and interconnect have lagged far behind. As a result, GPUs spend a significant fraction of time on memory- and communication-bound operations during DNN training. Meanwhile, recent memory-semantic interconnects (e.g., NVLink and Compute Express Link) create the opportunity to introduce memory expanders within a system and to offload computation to the memory expander's controller. This work proposes an NDP architecture with compiler support that offloads memory- and communication-bound operations to memory expanders to substantially improve DNN training performance.
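
The back-of-the-envelope sketch below captures the intuition: for a bandwidth-bound elementwise operation on a tensor resident in a memory expander, offloading to the expander's controller avoids moving the data across the link to the GPU and back. All bandwidth figures are illustrative assumptions, not measurements of any real device.

```python
# Back-of-the-envelope model (illustrative numbers, not measurements):
# cost of a bandwidth-bound op on a 4 GiB tensor held in a memory expander.
bytes_moved = 2 * 4 * (1 << 30)       # read + write of a 4 GiB tensor (bytes)

link_bw     = 64e9                    # assumed CXL/NVLink link bandwidth (B/s)
expander_bw = 400e9                   # assumed expander-internal DRAM bandwidth (B/s)

gpu_time = bytes_moved / link_bw      # tensor crosses the link in and out
ndp_time = bytes_moved / expander_bw  # tensor never leaves the expander

print(f"run on GPU : {gpu_time * 1e3:7.1f} ms")
print(f"offload NDP: {ndp_time * 1e3:7.1f} ms ({gpu_time / ndp_time:.1f}x faster)")
```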

Heterogeneous 3D-stacked memory with Storage Class Memory (SCM) and DRAM for GPUs

SCM is a class of emerging memory devices that provide byte-addressability, non-volatility, and a higher capacity-to-cost ratio than DRAM. However, because SCM offers lower bandwidth and higher latency than DRAM, the memory system must be carefully designed to incorporate it while still achieving better performance. This work proposes a memory hierarchy with an efficient DRAM-caching architecture for 3D-stacked memory that combines SCM and DRAM.
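
As a toy illustration of the trade-off, the sketch below simulates a direct-mapped DRAM cache in front of SCM and reports the hit rate, which determines how often the slower SCM is touched. The cache size, block granularity, and access mix are assumptions for illustration, not our proposed design parameters.

```python
# Toy model: direct-mapped DRAM cache over an SCM-backed address space.
import random

BLOCK = 4096                 # assumed caching granularity (bytes)
SETS  = 1 << 14              # assumed 64 MiB DRAM cache / 4 KiB blocks

tags = [None] * SETS
hits = misses = 0

def access(addr: int) -> None:
    global hits, misses
    block = addr // BLOCK
    idx, tag = block % SETS, block // SETS
    if tags[idx] == tag:
        hits += 1            # served at DRAM latency/bandwidth
    else:
        misses += 1          # fill from slower SCM, install in DRAM cache
        tags[idx] = tag

random.seed(0)
for _ in range(200_000):
    # assumed access mix: 90% to a hot 32 MiB region, 10% uniform over 4 GiB
    region = 32 << 20 if random.random() < 0.9 else 4 << 30
    access(random.randrange(region))

print(f"DRAM cache hit rate: {hits / (hits + misses):.1%}")
```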

Scalable Neural Processing Unit (NPU) system architecture

Datacenters that serve massive numbers of machine learning service requests require NPUs that scale well at the chip level (with many NPU cores), the package level (with multiple dies in a package), the node level (with multiple NPU cards), and the rack level (with multiple NPU nodes in a rack). This project explores software/hardware co-design approaches to enable a highly scalable NPU system for large-scale deep learning systems.
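
The sketch below only illustrates the hierarchy itself, placing requests across nodes, cards, dies, and cores with a simplistic round-robin policy; the level counts and the placement policy are hypothetical stand-ins, not our design.

```python
# Minimal sketch of hierarchical work placement across an NPU system.
from itertools import product

HIERARCHY = {"node": 4, "card": 8, "die": 2, "core": 16}  # assumed counts

def partition(num_requests: int):
    """Assign each request a (node, card, die, core) coordinate round-robin."""
    slots = list(product(*(range(n) for n in HIERARCHY.values())))
    return {req: slots[req % len(slots)] for req in range(num_requests)}

for req, (node, card, die, core) in partition(10).items():
    print(f"request {req} -> node {node}, card {card}, die {die}, core {core}")
```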

Lossless tensor compression for high-performance DNN inference/training

DNN inference and training require very high memory capacity and bandwidth. Meanwhile, the inherent redundancy and sparsity in DNN tensors present an opportunity to significantly reduce tensor sizes and thereby increase the effective memory capacity and bandwidth. In this project, we are developing an effective hardware-based tensor compression algorithm to improve overall system performance and energy efficiency for DNN inference and training.
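
For intuition, the sketch below applies one well-known lossless scheme, zero run-length encoding, to an artificially sparse tensor and estimates the compression ratio. It is a stand-in for illustration, not the algorithm we are developing; the 75% sparsity and the 1-byte run-length field are assumptions.

```python
# Illustrative lossless scheme: zero run-length encoding of a sparse tensor.
import numpy as np

def zrle_encode(tensor: np.ndarray):
    """Encode a flat tensor as (zero-run-length, nonzero-value) pairs."""
    pairs, run = [], 0
    for v in tensor.ravel():
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, None))   # trailing zeros; no value follows
    return pairs

rng = np.random.default_rng(0)
t = rng.standard_normal(4096).astype(np.float32)
t[rng.random(4096) < 0.75] = 0.0    # assumed 75% zeros (e.g., post-ReLU)

encoded = zrle_encode(t)
raw_bytes = t.nbytes
enc_bytes = len(encoded) * (1 + 4)  # assumed 1-byte run length + 4-byte value
print(f"{raw_bytes} B -> ~{enc_bytes} B ({raw_bytes / enc_bytes:.1f}x)")
```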

Architecture for accelerating large-scale Graph Neural Network (GNN) training

Training GNNs is challenging due to the intrinsic irregularity of graphs and the diversity of GNN architectures (GCN, GAT, etc.). Large-scale GNN training is even more challenging because the huge feature data for nodes and edges is distributed across multiple compute nodes, placing a heavy burden on the memory system and the interconnect. In this project, we are architecting a scalable accelerator system specialized for training large-scale GNNs.
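
The toy example below shows where the communication cost comes from: mean-aggregating neighbor features when vertex features are sharded across compute nodes, and counting which feature fetches stay local versus cross the interconnect. The graph, the round-robin sharding, and the sizes are made-up assumptions for illustration.

```python
# Toy sketch: neighbor aggregation over partitioned vertex features.
import numpy as np

N_VERTS, FEAT, N_PARTS = 8, 4, 2
rng = np.random.default_rng(0)

features = rng.standard_normal((N_VERTS, FEAT)).astype(np.float32)
part = np.arange(N_VERTS) % N_PARTS          # assumed round-robin sharding
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]

local = remote = 0
agg = np.zeros_like(features)
deg = np.zeros(N_VERTS)
for u, v in edges:                           # aggregate v's feature into u
    agg[u] += features[v]
    deg[u] += 1
    if part[u] == part[v]:
        local += 1                           # feature fetch stays on-node
    else:
        remote += 1                          # fetch crosses the interconnect

mask = deg > 0
agg[mask] /= deg[mask][:, None]              # mean aggregation
print(f"local fetches: {local}, remote fetches: {remote}")
```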
