SoftmaxPIM: An HBM-Based PIM Architecture for Accelerating GeMV–Softmax Execution Pipeline

Taewoon Kang, Namhun Kim, Jihun Lee, Murali Annavaram, Gunjae Koo

BiKD: Bidirectional Kernel Decomposition for Large-Scale GCNs on GPU

Graph convolutional neural networks (GCNs) are representative graph neural network (GNN) models that can be used for analyzing and …

Inje Kim, Jihun Lee, Jonghyun Jeong, Geonwoo Choi, Myung Kuk Yoon, Yunho Oh, Gunjae Koo

SparsePIM+: Accelerating SpMV on HBM-Based PIM via Logic-Die Accumulators with Opportunistic TSV Utilization

Sparse matrix–vector multiplication (SpMV) is a fundamental kernel in many applications, yet its performance is severely limited by …

Taewoon Kang, Namhun Kim, Geonwoo Choi, Taeweon Suh, Gunjae Koo

LogFlex: Flexible-Bit Log Arithmetic Accelerator for Language Models on Edge

Deploying language models on resource-constrained mobile/wearable devices while maintaining output quality is challenging. To address …

Yujin Kim, Faraz Tahmasebi, Gunjae Koo, Hyoukjun Kwon

TM-Training: An Energy-Efficient Tiered Memory System for Deep Learning Training in NPUs

DRAM accounts for a large fraction of the total cost of ownership of memory systems in deep learning acceleration systems. To achieve …

Jaeyong Park, Sangun Choi, Jongmin Kim, Gunjae Koo, Myung Kuk Yoon, Yunho Oh

MOST: Memory Oversubscription-Aware Scheduling for Tensor Migration on GPU Unified Storage

Deep Neural Network (DNN) training demands large memory capacities that exceed the limits of current GPU onboard memory. Expanding GPU …

Junsu Kim, Jaebeom Jeon, Jaeyong Park, Sangun Choi, Minseong Gil, Seokin Hong, Gunjae Koo, Myung Kuk Yoon, Yunho Oh

Beyond VABlock: Improving Transformer Workloads through Aggressive Prefetching

The memory capacity constraint of GPUs is a major challenge in running large deep learning workloads with their ever-increasing memory …

Jane Rhee, Ikyoung Choi, Gunjae Koo, Yunho Oh, Myung Kuk Yoon

TLP Balancer: Predictive Thread Allocation for Multitenant Inference in Embedded GPUs

This letter introduces a novel software technique to optimize thread allocation for merged and fused kernels in multitenant inference …

Minseong Gil, Jaebeom Jeon, Junsu Kim, Sangun Choi, Gunjae Koo, Myung Kuk Yoon, Yunho Oh

SAVector: Vectored Systolic Arrays

Conventional DNN inference accelerators are designed with a few (up to four) large systolic arrays. As such a scale-up architecture …

Sangun Choi, Seongjun Park, Jaeyong Park, Jongmin Kim, Gunjae Koo, Seokin Hong, Myung Kuk Yoon, Yunho Oh

Conflict-Aware Compiler for Hierarchical Register File on GPUs

Modern graphics processing units (GPUs) leverage a high degree of thread-level parallelism, necessitating large-sized register files …

Eunbi Jeong, Eun Seong Park, Gunjae Koo, Yunho Oh, Myung Kuk Yoon
