Publications

Google Scholar Total Citations

Peer-Reviewed Conference Papers

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

First Author ICLR 2026 arXiv HF Math HF Code Citations: 10
  • Authors: Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Ohi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki
  • Venue: International Conference on Learning Representations (ICLR), 2026
  • Contribution: Led the entire project lifecycle—from inception and experimental design to dataset construction.
  • Summary: Proposed “LLM Rewriting” to synthesize high-quality pre-training data in math and code. Demonstrated that improving style and logic (beyond simple rephrasing) significantly boosts performance, achieving state-of-the-art results among open math/code pre-training corpora (a toy illustration of the rewriting step appears below).
  • Datasets: Released Swallow-Math and Swallow-Code (v1 & v2).
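
The snippet below is a toy, hypothetical illustration of the general rewriting idea only: an open-weight LLM is prompted to rewrite a raw code sample into a cleaner, more consistent form before it enters the pre-training corpus. The model name, prompt, and decoding settings are placeholders chosen for illustration; they are not the pipeline, prompts, or models used in the paper.

```python
# Toy sketch of LLM-based rewriting of a pre-training sample.
# The model and prompt are illustrative placeholders, not the paper's setup.
from transformers import pipeline

REWRITE_PROMPT = (
    "Rewrite the following code so that it is self-contained, consistently "
    "styled, and algorithmically clear. Preserve its behavior.\n\n{source}\n"
)

def rewrite_sample(generator, source: str) -> str:
    """Ask an open-weight LLM for a cleaner version of one raw sample."""
    prompt = REWRITE_PROMPT.format(source=source)
    out = generator(prompt, max_new_tokens=512, do_sample=False)
    # The text-generation pipeline returns prompt + completion; keep the completion.
    return out[0]["generated_text"][len(prompt):]

if __name__ == "__main__":
    # Any instruction-tuned open-weight model can stand in here.
    gen = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
    raw = "def f(l):\n s=0\n for i in l: s+=i\n return s"
    print(rewrite_sample(gen, raw))
```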

Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models

Co-Author arXiv COLM 2025 Citations: 4
  • Authors: Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki
  • Venue: Conference on Language Modeling (COLM), 2025

Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

Co-Author arXiv ICLR 2025 Citations: 6
  • Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
  • Venue: International Conference on Learning Representations (ICLR), 2025

Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

First Author arXiv COLM 2024 Hugging Face Code Citations: 123
  • Authors: Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki
  • Venue: Conference on Language Modeling (COLM), 2024

Building a Large Japanese Web Corpus for Large Language Models

Co-Author arXiv COLM 2024 Citations: 25
  • Authors: Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki
  • Venue: Conference on Language Modeling (COLM), 2024

Workshop Papers

Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs

Co-Author arXiv COLM 2025 Award Citations: 1
  • Venue: Multilingual and Equitable Language Technologies Workshop at the Conference on Language Modeling (COLM), 2025
  • Award: Outstanding Paper Award, Natural Language Processing Research Meeting of the Information Processing Society of Japan (IPSJ-NLP)

llm-recipes: A Framework for Seamless Integration and Efficient Continual Pre-Training of Large Language Models

First Author SC 2024 GitHub Slides
  • Venue: SC24 (Supercomputing) Trillion Parameter Consortium (TPC) Workshop
  • Contribution: Led the development of a custom training framework designed for day-0 support of new LLMs not yet supported by Megatron-LM.
  • Tech Stack: Built on PyTorch FSDP (v1), enabling SFT and continual pre-training for any model compatible with Hugging Face Transformers (a minimal sketch follows below).
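
The sketch below shows what I assume to be the core pattern: wrapping an arbitrary Hugging Face causal LM in PyTorch FSDP (v1) so it can be trained without Megatron-style model surgery. It is an illustrative approximation rather than code from llm-recipes, and the mixed-precision settings are placeholder choices.

```python
# Minimal FSDP (v1) wrapping of a Hugging Face causal LM; launch with torchrun.
# This is a sketch of the general approach, not llm-recipes itself.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

def build_fsdp_model(model_name: str) -> FSDP:
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    mp = MixedPrecision(param_dtype=torch.bfloat16,
                        reduce_dtype=torch.bfloat16,
                        buffer_dtype=torch.bfloat16)
    # FSDP only needs an nn.Module, so any Transformers-compatible model can be
    # sharded for SFT or continual pre-training.
    return FSDP(model, mixed_precision=mp, device_id=torch.cuda.current_device())
```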

Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese

Co-Author arXiv CVPR Citations: 15
  • Authors: Yuichi Inoue, Kento Sasaki, Yuma Ochi, Kazuki Fujii, Kotaro Tanahashi, Yu Yamaguchi
  • Venue: CVPR 2024, The 3rd Workshop on Computer Vision in the Wild

Preprints

Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs

First Author arXiv Citations: 3
  • Contribution: Spearheaded the verification of FP8 training for the Swallow Project (a Japanese-English bilingual LLM).
  • Summary: Investigated FP8 stability for continual pre-training of 70B models. While prior work focused on from-scratch training, I found that FP8 introduces instability during continual pre-training of Llama-3-70B and demonstrated that the default DelayedScaling recipe in Transformer Engine v1.x is insufficient for this regime (a configuration sketch follows below).
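
For context, the sketch below shows how an FP8 recipe is configured explicitly with Transformer Engine instead of relying on the defaults. The layer, shapes, and recipe values are illustrative placeholders, not the settings used in this study, and running it requires an FP8-capable GPU such as an H100.

```python
# Illustrative explicit FP8 recipe with Transformer Engine (placeholder values).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # E4M3 in the forward pass, E5M2 in the backward pass
    amax_history_len=1024,      # length of the amax window used for scaling factors
    amax_compute_algo="max",    # take the max over the window rather than the latest amax
)

layer = te.Linear(4096, 4096).cuda()   # requires an FP8-capable GPU (e.g., H100)
x = torch.randn(8, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```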

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

Co-Author arXiv Hugging Face Citations: 18
  • Role: Served as the Lead for Pre-training, Library Development, and Distributed Training (May 2023 - Aug 2024).
  • Summary: Technical report on the LLM-jp initiative. I oversaw the infrastructure and training pipeline for building fully open Japanese LLMs from scratch.
  • Note: Authors are listed in alphabetical order.

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator

First Author arXiv Citations: 3
  • Authors: Kazuki Fujii, Kohei Watanabe, Rio Yokota
  • Summary: Proposed a systematic memory consumption estimator for LLM training with 4D parallelism (TP, PP, DP, CP). The estimator provides accurate per-GPU memory breakdowns covering model states, activations, and communication buffers, enabling practitioners to determine optimal parallelism configurations before launching expensive training runs (a simplified version of the model-state term is sketched below).
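
As a rough back-of-the-envelope companion, the function below approximates only the model-state term: bf16 weights and gradients plus fp32 Adam states, with weights split across TP x PP and optimizer states optionally sharded across data-parallel ranks (ZeRO-1 style). It ignores activations and communication buffers entirely and is my own simplification, not the paper's estimator.

```python
# Simplified per-GPU model-state estimate; an approximation, not the paper's estimator.
def model_state_gib(num_params: float, tp: int, pp: int, dp: int,
                    shard_optimizer: bool = True) -> float:
    params_per_gpu = num_params / (tp * pp)       # weights are split across TP and PP
    weights_and_grads = params_per_gpu * (2 + 2)  # bf16 weights + bf16 gradients (bytes)
    optim_states = params_per_gpu * (4 + 4 + 4)   # fp32 master weights + Adam momentum + variance
    if shard_optimizer:
        optim_states /= dp                        # ZeRO-1 style sharding across DP ranks
    return (weights_and_grads + optim_states) / 2**30

# Example: a 70B-parameter model with TP=8, PP=4, DP=16.
print(f"{model_state_gib(70e9, tp=8, pp=4, dp=16):.1f} GiB per GPU (model states only)")
```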

Japanese Publications

継続事前学習による日本語に強い大規模言語モデルの構築

(Construction of Strong Japanese LLMs through Continual Pre-training)
First Author NLP2024 Award Slides Hugging Face Code
  • Authors: 藤井一喜(Kazuki Fujii), 中村泰士, Mengsay Loem, 飯田大貴, 大井聖也, 服部翔, 平井翔太, 水木栄, 横田理央, 岡崎直観
  • Venue: The 30th Annual Meeting of the Association for Natural Language Processing (NLP2024)
  • Award: 優秀賞 (Outstanding Paper Award; one of 12 papers selected out of 599)

Swallowコーパス: 日本語大規模ウェブコーパス

(Swallow Corpus: A Large-Scale Japanese Web Corpus)
Co-Author NLP2024 Award
  • Authors: 岡崎直観, 服部翔, 平井翔太, 飯田大貴, 大井聖也, 藤井一喜(Kazuki Fujii), 中村泰士, Mengsay Loem, 横田理央, 水木栄
  • Venue: The 30th Annual Meeting of the Association for Natural Language Processing (NLP2024)
  • Award: 優秀賞 (Outstanding Paper Award; one of 12 papers selected out of 599)

大規模言語モデルの分散並列学習

(Distributed Parallel Training of Large Language Models)
First Author IPSJ Award Slides Code
  • Authors: 藤井一喜(Kazuki Fujii), 横田理央
  • Venue: The 86th National Convention of the Information Processing Society of Japan (IPSJ), 2024
  • Award: 大会優秀賞 (IPSJ National Convention Outstanding Paper Award)

Swallowコーパスv2: 教育的な日本語ウェブコーパスの構築

(Swallow Corpus v2: Building an Educational Japanese Web Corpus)
Co-Author NLP2025
  • Authors: 服部 翔, 岡崎 直観, 水木 栄, 藤井 一喜(Kazuki Fujii), 中村 泰士, 大井 聖也, et al.
  • Venue: The 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)

模倣学習による大規模言語モデルの指示チューニング

(Instruction Tuning of Large Language Models via Imitation Learning)
Co-Author NLP2025
  • Authors: Youmi Ma, 水木 栄, 藤井 一喜(Kazuki Fujii), 中村 泰士, 大井 聖也, et al.
  • Venue: The 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)

新聞記事からつくる 時事と社会に強い日本語LLM

(Building Japanese LLMs Strong in Current Events and Society from Newspaper Articles)
Co-Author NLP2025
  • Authors: 服部 翔, 水木 栄, 藤井 一喜(Kazuki Fujii), 中村 泰士, et al.
  • Venue: The 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)

Talks

2025

合成データパイプラインを利用したSwallow ProjectにおけるLLM性能向上

(Improving LLM Performance in the Swallow Project using Synthetic Data Pipelines)
Date Japanese Slides Event
  • Event: AWS AI Frontier Meetup 2025

論文では語られないLLM開発において重要なこと ― Swallow Projectを通して

(Important Aspects of LLM Development Untold in Papers: Through the Swallow Project)
Date Japanese Video Slides Event
  • Event: NLP Colloquium (81st session)

IHPCSS 2025 Lisbon

Date English Slides Event
  • Event: International High Performance Computing Summer School (IHPCSS 2025) in Lisbon

Amazon SageMaker HyperPod を利用した日本語 LLM (Swallow) の構築

(Building the Japanese LLM "Swallow" using Amazon SageMaker HyperPod)
Date Japanese Video Slides Event
  • Event: AWS Summit Japan 2025 (Session CUS-02)
2024

Continual Pre-Training on TSUBAME for a Target Language

Date English Event
  • Event: SC24 (The International Conference for High Performance Computing) Tokyo Tech Booth
  • Summary: Introduced a methodology for continual pre-training of Llama-3 (8B/70B) on the TSUBAME supercomputer, focusing on efficient adaptation strategies.

大規模モデルの学習知見

(Insights on Training Large-Scale Models)
Date Japanese Video Slides Event
  • Event: NVIDIA AI Summit Japan 2024

Google Cloud の AI Hypercomputer で学習を加速させる

(Accelerating Training with Google Cloud AI Hypercomputer)
Date Japanese Video Event
  • Event: Google Cloud Next ‘24 Tokyo
  • Topic: GENIAC 2024, H100 (A3) Cluster with Cluster Toolkit

大規模言語モデルの分散並列学習

(Distributed Parallel Training of Large Language Models)
Date Japanese Event
  • Event: RIKEN AIP (理研AIP) Seminar
  • Summary: Explained technical aspects of Data, Tensor, and Pipeline parallelism (3D Parallelism) for efficient LLM training, including real-world project examples.

自然言語処理のための分散並列学習

(Distributed Parallel Training for Natural Language Processing)
Date Invited Japanese Slides Event
  • Event: NLP2024 Workshop (Invited Talk)

大規模言語モデルの事前学習知見

(Pre-Training Insights of Large Language Models)
Date Japanese Slides Event
  • Event: DSAI Symposium 2023 (Tokyo Tech)

Books

大規模言語モデル入門 Ⅱ 〜生成型LLMの実装と評価

(Introduction to Large Language Models II: Implementation and Evaluation of Generative LLMs)
Role Publisher Date Amazon

Authored Chapter: Distributed Parallel Training

I contributed as a technical author focusing on the scalability infrastructure for Large Language Models. I designed and implemented hands-on tutorials for pre-training Llama-2 from scratch, bridging the gap between theoretical concepts and production-grade engineering.

  • Core Topics: Data Parallelism, DeepSpeed ZeRO, Pipeline Parallelism (PP), Tensor Parallelism (TP), and 3D Parallelism (a minimal ZeRO configuration sketch follows below).
  • Impact: Highly rated on Amazon Japan, serving as a definitive technical resource for LLM engineers in Japan.
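
For flavor, the snippet below is an illustrative DeepSpeed ZeRO stage-2 setup of the kind covered in the chapter; the stand-in model, batch sizes, and optimizer values are placeholders rather than the book's exact configuration.

```python
# Illustrative DeepSpeed ZeRO stage-2 configuration (placeholder values).
import deepspeed
import torch.nn as nn

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                 # shard optimizer states and gradients across DP ranks
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4, "weight_decay": 0.1}},
}

model = nn.Linear(1024, 1024)  # stand-in for a real LLM
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```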

Technical Blogs

Developing a 172B LLM with Strong Japanese Capabilities Using NVIDIA Megatron-LM

NVIDIA Blog 172B FP8 H100

Summary: As part of the GENIAC initiative of Japan’s Ministry of Economy, Trade and Industry (METI), I led the from-scratch training of a 172-billion-parameter model. We used NVIDIA H100 GPUs and achieved a 1.4x speedup (550 TFLOP/s) with FP8 hybrid training based on Megatron-Core and Transformer Engine.

Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

AWS Blog SageMaker Llama 3.3

Summary: This post details the development of Llama 3.3 Swallow (70B), which outperforms GPT-4o-mini on Japanese tasks, covering infrastructure optimization on Amazon SageMaker HyperPod and techniques for efficient continual pre-training.


Technical Articles (Japanese)

A collection of technical articles posted on Zenn (a popular engineering knowledge-sharing platform in Japan).

Zenn Profile Likes

LLM Development & Training Infrastructure

大規模言語モデル(LLM)の作り方 Megatron-DeepSpeed編 Part1
(How to Build LLMs: Megatron-DeepSpeed Edition Part 1)

Likes Megatron

Swallow: LLaMA-2 日本語継続事前学習モデル
(Swallow: LLaMA-2 Continual Pre-training for Japanese)

Likes LLaMA-2

大規模言語モデル(LLM)の作り方 GPT-NeoX編 Part 1
(How to Build LLMs: GPT-NeoX Edition Part 1)

Likes NeoX

GENIAC: 172B 事前学習知見
(GENIAC: Insights from Pre-training a 172B Parameter Model)

Likes 172B

LLM開発の裏で行われるデバッグ作業: PyTorch DCP
(Debugging Behind LLM Development: Deep Dive into PyTorch DCP)

Likes Debug

Advanced Optimization & Techniques

FP8 trainingを支える技術 1
(Technologies Supporting FP8 Training Part 1)

Likes FP8

NVIDIA NeMoを利用したGPT-OSSの学習
(Training GPT-OSS using NVIDIA NeMo)

Likes NeMo

Kotomamba: Mamba State Space Model 分散学習ライブラリ
(Kotomamba: Distributed Training Library for Mamba SSM)

Likes Mamba

Infrastructure & Tips

Google Cloud: HPC Toolkitにて大規模深層学習環境を整備する
(Setting up Large-Scale Deep Learning Environments with Google Cloud HPC Toolkit)

GCP HPC

[Tips] PyTorchにおける動的リンク
(Dynamic Linking in PyTorch)

Likes PyTorch