Optimizing LLM inference for FPGAs
File(s)
asicon25jrdf23.pdf (213.39 KB)
Accepted version
Author(s)
Type
Conference Paper
Abstract
Large Language Models (LLMs) deliver state-of-the-art performance but demand substantial computation and memory, making deployment in resource-limited settings challenging. Field-Programmable Gate Arrays (FPGAs) offer parallelism and efficiency, yet most prior FPGA accelerators rely on low-level, platform-specific flows that hinder portability. This work presents oneLLM, to our knowledge the first FPGA-based LLM inference design using Intel's oneAPI, enabling a unified high-level programming model across CPUs, GPUs, and FPGAs. Our deeply pipelined, multi-kernel hardware architecture connects specialized kernels via oneAPI pipes for on-chip streaming, reducing host–device communication. Implemented on an Intel Agilex 7 FPGA, the design runs 3 times faster than a CPU implementation and 8.8 times faster than a non-pipelined baseline while meeting resource constraints, demonstrating the potential of portable FPGA development for LLM acceleration. Code is available at https://github.com/custom-computing-ic/llm-oneapi-fpga.
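To illustrate the on-chip streaming pattern the abstract describes, the following minimal SYCL sketch connects a producer and a consumer kernel through an Intel oneAPI pipe, so data moves between kernels without a host round-trip. The kernel names, pipe depth, and data layout here are illustrative assumptions, not the oneLLM implementation.

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

// Hypothetical inter-kernel pipe: the identifier and capacity are
// illustrative, not taken from the paper's code.
using TokenPipe = sycl::ext::intel::pipe<class TokenPipeID, float, 64>;

int main() {
  // Target the FPGA emulator so the sketch runs without hardware.
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  // Producer kernel: streams values into the on-chip pipe.
  q.single_task<class Producer>([]() {
    for (int i = 0; i < 16; ++i)
      TokenPipe::write(static_cast<float>(i));
  });

  float result = 0.0f;
  {
    sycl::buffer<float, 1> out_buf{&result, 1};
    // Consumer kernel: reads directly from the pipe, so the
    // intermediate data never leaves the device.
    q.submit([&](sycl::handler& h) {
      sycl::accessor out{out_buf, h, sycl::write_only};
      h.single_task<class Consumer>([=]() {
        float acc = 0.0f;
        for (int i = 0; i < 16; ++i)
          acc += TokenPipe::read();
        out[0] = acc;
      });
    });
  }  // Buffer destruction waits for the consumer and copies back.
  return 0;
}

Because pipe reads and writes synchronize the two kernels, they execute as a pipeline: the consumer starts processing as soon as the first value arrives, which is the behavior the paper exploits across its multi-kernel design.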
Acceptance Date
2025-09-01
Publisher
IEEE
Copyright Statement
This paper is embargoed until publication. Once published, the author's accepted manuscript will be made available under a CC-BY license in accordance with Imperial's Research Publications Open Access policy (www.imperial.ac.uk/oa-policy).
License URL
Source
2025 IEEE 16th International Conference on ASIC
Publication Status
Accepted
Start Date
2025-10-21
Finish Date
2025-10-24
Coverage Spatial
Kunming, China