
Research

Our mission is to advance the state of the art in computing technology and to improve the performance and efficiency of systems ranging from high-performance servers to low-power edge devices. We pursue this goal through innovative computer architecture design along with software optimizations, especially for parallel and scalable systems.

We closely collaborate with other teams in the Advanced Computing Systems Group for algorithm/software/hardware co-design research projects.

Advanced Computing Systems Group featured in POSTECH Labcumentary (before Prof. Jisung Park joined our group)
Research Areas
  • Computer architecture

  • Near-data processing

  • Systems for machine learning

  • Memory systems

  • GPU computing

  • Large-scale systems

Ongoing Projects

Architectural support for huge GPT models

Large language models (LLMs) like GPT are expected to bring transformative changes to human lives. However, the high cost of training and inference for current and next-generation GPT models poses significant challenges in designing future systems. To significantly enhance the performance and efficiency of deep learning accelerators, we are exploring algorithm-hardware co-design approaches that develop large-scale systems specialized for LLMs and realize greater performance gains from tensor sparsity than existing approaches do.
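
As a concrete (if highly simplified) illustration of why sparsity matters, the sketch below counts the multiply-accumulate (MAC) operations a sparsity-aware engine could skip in a single GEMM. The matrix sizes and the 90% unstructured sparsity level are assumptions chosen for illustration, not results or parameters from our work.

```python
# Minimal sketch: MACs a sparsity-aware dataflow could skip in one GEMM,
# assuming a hypothetical 90%-sparse weight matrix.
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 128, 512, 64                       # assumed GEMM dimensions

weights = rng.standard_normal((M, K))
weights[rng.random((M, K)) < 0.9] = 0.0      # assumed 90% sparsity
activations = rng.standard_normal((K, N))

dense_macs = M * K * N                       # MACs a dense engine performs
sparse_macs = np.count_nonzero(weights) * N  # MACs left after skipping zeros

print(f"dense MACs:  {dense_macs:,}")
print(f"sparse MACs: {sparse_macs:,} ({sparse_macs / dense_macs:.1%} of dense)")

out = weights @ activations                  # result is unchanged; only the work differs
```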

Near-data processing (NDP) in memory expanders for high-throughput DNN training in multi-GPU systems

While the compute performance of GPUs has improved significantly, the memory system and interconnect have lagged far behind. As a result, GPUs spend a significant fraction of time on memory- and communication-bound operations during DNN training. Meanwhile, recent memory-semantic interconnects (e.g., NVLink and Compute Express Link) create the opportunity to introduce memory expanders within a system and to offload computation to the memory expander's controller. This work proposes an NDP architecture with compiler support that offloads memory- and communication-bound operations to memory expanders to substantially improve DNN training performance.
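
The back-of-the-envelope sketch below captures the intuition: for a bandwidth-bound elementwise operation on a tensor resident in a memory expander, offloading to the expander's controller avoids moving the data across the link to the GPU and back. All bandwidth figures are illustrative assumptions, not measurements of any real device.

```python
# Back-of-the-envelope model (illustrative numbers, not measurements):
# cost of a bandwidth-bound op on a 4 GiB tensor held in a memory expander.
bytes_moved = 2 * 4 * (1 << 30)       # read + write of a 4 GiB tensor (bytes)

link_bw     = 64e9                    # assumed CXL/NVLink link bandwidth (B/s)
expander_bw = 400e9                   # assumed expander-internal DRAM bandwidth (B/s)

gpu_time = bytes_moved / link_bw      # tensor crosses the link in and out
ndp_time = bytes_moved / expander_bw  # tensor never leaves the expander

print(f"run on GPU : {gpu_time * 1e3:7.1f} ms")
print(f"offload NDP: {ndp_time * 1e3:7.1f} ms ({gpu_time / ndp_time:.1f}x faster)")
```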

Heterogeneous 3D-stacked memory with Storage Class Memory (SCM) and DRAM for GPUs

SCM is a class of emerging memory devices that provide byte-addressability, non-volatility, and a higher capacity-to-cost ratio than DRAM. However, because SCM offers lower bandwidth and higher latency than DRAM, the memory system must be carefully designed to incorporate it while still achieving better performance. This work proposes a memory hierarchy with an efficient DRAM-caching architecture for 3D-stacked memory that combines SCM and DRAM.
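
As a toy illustration of the trade-off, the sketch below simulates a direct-mapped DRAM cache in front of SCM and reports the hit rate, which determines how often the slower SCM is touched. The cache size, block granularity, and access mix are assumptions for illustration, not our proposed design parameters.

```python
# Toy model: direct-mapped DRAM cache over an SCM-backed address space.
import random

BLOCK = 4096                 # assumed caching granularity (bytes)
SETS  = 1 << 14              # assumed 64 MiB DRAM cache / 4 KiB blocks

tags = [None] * SETS
hits = misses = 0

def access(addr: int) -> None:
    global hits, misses
    block = addr // BLOCK
    idx, tag = block % SETS, block // SETS
    if tags[idx] == tag:
        hits += 1            # served at DRAM latency/bandwidth
    else:
        misses += 1          # fill from slower SCM, install in DRAM cache
        tags[idx] = tag

random.seed(0)
for _ in range(200_000):
    # assumed access mix: 90% to a hot 32 MiB region, 10% uniform over 4 GiB
    region = 32 << 20 if random.random() < 0.9 else 4 << 30
    access(random.randrange(region))

print(f"DRAM cache hit rate: {hits / (hits + misses):.1%}")
```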

Scalable Neural Processing Unit (NPU) system architecture

Datacenters that serve massive numbers of machine learning service requests require NPUs that scale well at the chip level (with many NPU cores), the package level (with multiple dies in a package), the node level (with multiple NPU cards), and the rack level (with multiple NPU nodes in a rack). This project explores software/hardware co-design approaches to enable a highly scalable NPU system for large-scale deep learning systems.
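
The sketch below only illustrates the hierarchy itself, placing requests across nodes, cards, dies, and cores with a simplistic round-robin policy; the level counts and the placement policy are hypothetical stand-ins, not our design.

```python
# Minimal sketch of hierarchical work placement across an NPU system.
from itertools import product

HIERARCHY = {"node": 4, "card": 8, "die": 2, "core": 16}  # assumed counts

def partition(num_requests: int):
    """Assign each request a (node, card, die, core) coordinate round-robin."""
    slots = list(product(*(range(n) for n in HIERARCHY.values())))
    return {req: slots[req % len(slots)] for req in range(num_requests)}

for req, (node, card, die, core) in partition(10).items():
    print(f"request {req} -> node {node}, card {card}, die {die}, core {core}")
```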

Lossless tensor compression for high-performance DNN inference/training

DNN inference and training require very high memory capacity and bandwidth. Meanwhile, the inherent redundancy and sparsity in DNN tensors present an opportunity to significantly reduce tensor sizes and thereby increase the effective memory capacity and bandwidth. In this project, we are developing an effective hardware-based tensor compression algorithm to improve overall system performance and energy efficiency for DNN inference and training.
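
For intuition, the sketch below applies one well-known lossless scheme, zero run-length encoding, to an artificially sparse tensor and estimates the compression ratio. It is a stand-in for illustration, not the algorithm we are developing; the 75% sparsity and the 1-byte run-length field are assumptions.

```python
# Illustrative lossless scheme: zero run-length encoding of a sparse tensor.
import numpy as np

def zrle_encode(tensor: np.ndarray):
    """Encode a flat tensor as (zero-run-length, nonzero-value) pairs."""
    pairs, run = [], 0
    for v in tensor.ravel():
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, None))   # trailing zeros; no value follows
    return pairs

rng = np.random.default_rng(0)
t = rng.standard_normal(4096).astype(np.float32)
t[rng.random(4096) < 0.75] = 0.0    # assumed 75% zeros (e.g., post-ReLU)

encoded = zrle_encode(t)
raw_bytes = t.nbytes
enc_bytes = len(encoded) * (1 + 4)  # assumed 1-byte run length + 4-byte value
print(f"{raw_bytes} B -> ~{enc_bytes} B ({raw_bytes / enc_bytes:.1f}x)")
```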

Architecture for accelerating large-scale Graph Neural Network (GNN) training

Training GNNs is challenging due to the intrinsic irregularity of graphs and the diversity of GNN architectures (GCN, GAT, etc.). Large-scale GNN training is even more challenging because the huge feature data for nodes and edges is distributed across multiple compute nodes, placing a heavy burden on the memory system and the interconnect. In this project, we are architecting a scalable accelerator system specialized for training large-scale GNNs.
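
The toy example below shows where the communication cost comes from: mean-aggregating neighbor features when vertex features are sharded across compute nodes, and counting which feature fetches stay local versus cross the interconnect. The graph, the round-robin sharding, and the sizes are made-up assumptions for illustration.

```python
# Toy sketch: neighbor aggregation over partitioned vertex features.
import numpy as np

N_VERTS, FEAT, N_PARTS = 8, 4, 2
rng = np.random.default_rng(0)

features = rng.standard_normal((N_VERTS, FEAT)).astype(np.float32)
part = np.arange(N_VERTS) % N_PARTS          # assumed round-robin sharding
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]

local = remote = 0
agg = np.zeros_like(features)
deg = np.zeros(N_VERTS)
for u, v in edges:                           # aggregate v's feature into u
    agg[u] += features[v]
    deg[u] += 1
    if part[u] == part[v]:
        local += 1                           # feature fetch stays on-node
    else:
        remote += 1                          # fetch crosses the interconnect

mask = deg > 0
agg[mask] /= deg[mask][:, None]              # mean aggregation
print(f"local fetches: {local}, remote fetches: {remote}")
```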
