Open Source
PyTorchSim: A Comprehensive, Fast, and Accurate NPU Simulation Framework
https://github.com/PSAL-POSTECH/PyTorchSim
PyTorchSim is a fast and cycle-accurate NPU simulation framework with comprehensive feature support:
- Integrated with PyTorch 2, it can simulate existing PyTorch models by simply designating a simulated NPU as the target device (a minimal sketch follows this list).
- Provides an NPU-specific compiler backend based on MLIR and LLVM, enabling compiler optimizations and supporting the simulation of both training and inference.
- Supports multi-core NPUs and multi-model tenancy with detailed interconnect and DRAM models (Booksim and Ramulator 2).
- Can model data-dependent timing behavior, such as that of mixture-of-experts models.
- Implements a custom RISC-V-based ISA with a rich instruction set to express the various operations in AI models.
- Employs a Tile-Level Simulation technique, which enables fast simulation without loss of accuracy.
- Validated against Google TPU v3, showing a mean absolute error (MAE) of 11.5%.
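The sketch below illustrates the PyTorch 2 integration described above. The device string "npu" and the torch.compile backend name "pytorchsim" are illustrative assumptions, not the framework's documented identifiers; consult the PyTorchSim repository for the actual names.

```python
# Minimal sketch, assuming a hypothetical device string ("npu") and
# torch.compile backend name ("pytorchsim"); see the PyTorchSim repository
# for the actual identifiers.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = model.to("npu")                              # target the simulated NPU
model = torch.compile(model, backend="pytorchsim")   # MLIR/LLVM-based NPU backend

x = torch.randn(8, 512, device="npu")
out = model(x)        # forward pass runs through the cycle-accurate simulation
out.sum().backward()  # training is supported as well as inference
```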
Simulator for Memory-Mapped Near-Data Processing (M²NDP)
https://github.com/PSAL-POSTECH/M2NDP-public
This is a cycle-level simulator developed to model the M²NDP architecture proposed in the paper "Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders" (MICRO'24).
Here are some high-level features of the M²NDP architecture:
- General-purpose NDP for CXL memory: Enables general-purpose (rather than application-specific) NDP in CXL memory for diverse real-world workloads.
- Low-overhead offloading: Supports low-overhead NDP offloading and management with M²func (Memory-Mapped function); a conceptual sketch follows this list.
- Cost-effective NDP unit design: The M²μthread (Memory-Mapped μthreading) execution model, based on an extended RISC-V vector extension, efficiently utilizes resources to maximize concurrency.
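To make the memory-mapped offloading idea concrete, the sketch below mimics a memory-mapped function call: the host writes an argument and a doorbell word into a mapped region and then polls a status word, rather than going through a heavyweight driver path. An ordinary file-backed mmap stands in for CXL device memory, and the field layout, offsets, and polling protocol are invented for illustration; they do not reflect the actual M²NDP interface.

```python
# Conceptual sketch only: the general idea behind a memory-mapped function
# call (in the spirit of M²func), with a file-backed mmap standing in for
# CXL device memory. Offsets and protocol are hypothetical.
import mmap
import struct
import tempfile

REGION_SIZE = 4096
ARG_OFF, DOORBELL_OFF, STATUS_OFF = 0, 8, 16   # hypothetical register layout

with tempfile.TemporaryFile() as backing:
    backing.truncate(REGION_SIZE)
    region = mmap.mmap(backing.fileno(), REGION_SIZE)

    # "Offload": the host writes the kernel argument, then rings a doorbell,
    # using plain memory accesses instead of a heavyweight driver call.
    region[ARG_OFF:ARG_OFF + 8] = struct.pack("<Q", 42)           # kernel argument
    region[DOORBELL_OFF:DOORBELL_OFF + 8] = struct.pack("<Q", 1)  # kernel ID / doorbell

    # A real NDP unit would observe the doorbell and execute the kernel;
    # here we simply fake its completion notification.
    region[STATUS_OFF:STATUS_OFF + 8] = struct.pack("<Q", 1)

    # The host polls the status word to detect completion.
    (done,) = struct.unpack("<Q", region[STATUS_OFF:STATUS_OFF + 8])
    print("NDP kernel finished:", bool(done))
    region.close()
```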
ONNXim: A Fast, Cycle-level Multi-core NPU Simulator
https://github.com/PSAL-POSTECH/ONNXim
ONNXim is a fast cycle-level simulator that can model multi-core NPUs for DNN inference. Its features include the following:
- Faster simulation speed in comparison to other detailed NPU simulation frameworks (see the figure below).
- Support for modeling multi-core NPUs.
- Support for cycle-level simulation of memory (through Ramulator) and network-on-chip (through Booksim2), which is important for properly modeling memory-bound operations in deep learning.
- Use of ONNX graphs as DNN model specifications, enabling simulation of DNNs implemented in different deep learning frameworks (e.g., PyTorch and TensorFlow); an export sketch follows this list.
- Support for language models that do not use ONNX graphs, including auto-regressive generation phases and iteration-level batching.
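The sketch below shows how a PyTorch model can be exported to an ONNX graph of the kind a simulator such as ONNXim consumes. torch.onnx.export is the standard PyTorch API; the output file name and the way the graph is passed to ONNXim (command-line options, config files) are assumptions, so see the repository README for the actual usage.

```python
# Minimal sketch: export a PyTorch model to an ONNX graph for use as a
# DNN model specification. How the .onnx file is fed to ONNXim is not
# shown here; refer to the ONNXim README.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
).eval()

dummy_input = torch.randn(1, 3, 32, 32)   # example input shape for tracing

torch.onnx.export(
    model,
    dummy_input,
    "toy_model.onnx",          # ONNX graph used as the model specification
    input_names=["input"],
    output_names=["output"],
)
```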
For more details, please refer to our paper below:
Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, and Gwangsun Kim, "ONNXim: A Fast, Cycle-level Multi-core NPU Simulator." [IEEE Xplore] [arXiv]
Simulator for GPUs with Heterogeneous Memory Stack (HMS) [HPCA'24]
https://github.com/PSAL-POSTECH/accelsim_HMS
This repository contains the source code of our modified Accel-Sim simulator used in the work below, which proposed the Heterogeneous Memory Stack (HMS):
Jeongmin Hong, Sungjun Cho, Geonwoo Park, Wonhyuk Yang, Young-Ho Gong, and Gwangsun Kim, "Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory," HPCA'24.