Publications

Google Scholar

Peer-Reviewed Conference Papers

Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models

Co-Author arXiv COLM 2025
  • Authors: Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki
  • Venue: Conference on Language Modeling (COLM), 2025

Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

Co-Author arXiv ICLR 2025
  • Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
  • Venue: International Conference on Learning Representations (ICLR), 2025

Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

First Author arXiv COLM 2024 Hugging Face Code
  • Authors: Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki
  • Venue: Conference on Language Modeling (COLM), 2024

Building a Large Japanese Web Corpus for Large Language Models

Co-Author arXiv COLM 2024
  • Authors: Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki
  • Venue: Conference on Language Modeling (COLM), 2024

Workshop Papers

Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs

Co-Author arXiv COLM 2025 Award
  • Venue: Conference on Language Modeling (COLM) 2025, Multilingual and Equitable Language Technologies Workshop
  • Award: Outstanding Paper Award, Natural Language Processing Research Meeting of the Information Processing Society of Japan (IPSJ-NLP)

llm-recipes: A Framework for Seamless Integration and Efficient Continual Pre-Training of Large Language Models

First Author SC 2024 GitHub Slides
  • Venue: SC24 (Supercomputing) Trillion Parameter Consortium (TPC) Workshop
  • Contribution: Led the development of a custom training framework designed for day-0 support of new LLMs not yet supported by Megatron-LM.
  • Tech Stack: Built on PyTorch FSDP (v1), enabling supervised fine-tuning (SFT) and continual pre-training for any model compatible with Hugging Face Transformers.
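A minimal sketch of the kind of setup this framework targets: wrapping a Hugging Face causal LM in PyTorch FSDP (v1) for continual pre-training. This is an illustration only, not the llm-recipes code; the checkpoint name, dtypes, and hyperparameters are placeholders.

  import functools
  import torch
  import torch.distributed as dist
  from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
  from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
  from transformers import AutoModelForCausalLM
  from transformers.models.llama.modeling_llama import LlamaDecoderLayer

  dist.init_process_group("nccl")
  # Single-node approximation of local rank selection, for illustration only.
  torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

  # Placeholder checkpoint; any causal LM supported by Transformers works the same way.
  model = AutoModelForCausalLM.from_pretrained(
      "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
  )

  # Shard parameters at the granularity of transformer blocks.
  wrap_policy = functools.partial(
      transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
  )
  model = FSDP(
      model,
      auto_wrap_policy=wrap_policy,
      mixed_precision=MixedPrecision(
          param_dtype=torch.bfloat16,
          reduce_dtype=torch.bfloat16,
          buffer_dtype=torch.bfloat16,
      ),
      device_id=torch.cuda.current_device(),
  )
  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

  def train_step(batch):
      # batch: dict of input_ids / attention_mask / labels tensors already on GPU
      loss = model(**batch).loss
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()
      return loss.item()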

Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese

Co-Author arXiv CVPR
  • Authors: Yuichi Inoue, Kento Sasaki, Yuma Ochi, Kazuki Fujii, Kotaro Tanahashi, Yu Yamaguchi
  • Venue: CVPR 2024, The 3rd Workshop on Computer Vision in the Wild

Preprints

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

First Author arXiv HF Math HF Code
  • Contribution: Led the entire project lifecycle—from inception and experimental design to dataset construction.
  • Summary: Proposed “LLM Rewriting” to synthesize high-quality pre-training data in math and code. Demonstrated that improving style and logic (beyond simple rephrasing) significantly boosts performance, achieving state-of-the-art results among open math/code pre-training corpora (an illustrative sketch of the rewriting step follows this entry).
  • Datasets: Released Swallow-Math and Swallow-Code (v1 & v2).
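The core rewriting step can be sketched as prompting an open-weight instruct model to improve the style and structure of an existing sample rather than merely paraphrase it. The model name, prompt, and generation settings below are placeholders for illustration, not the actual Swallow-Code/Swallow-Math pipeline.

  from transformers import pipeline

  # Placeholder open-weight instruct model; not the model used in the paper.
  rewriter = pipeline(
      "text-generation",
      model="meta-llama/Meta-Llama-3-70B-Instruct",
      device_map="auto",
  )

  # Illustrative prompt: ask for improved style and structure, not a paraphrase.
  PROMPT = (
      "Rewrite the following code so that it is self-contained, idiomatic, and "
      "well-commented, preserving its behavior:\n\n{sample}\n\nRewritten code:\n"
  )

  def rewrite(sample: str, max_new_tokens: int = 1024) -> str:
      out = rewriter(
          PROMPT.format(sample=sample),
          max_new_tokens=max_new_tokens,
          do_sample=False,
          return_full_text=False,
      )
      return out[0]["generated_text"]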

Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs

First Author arXiv
  • Contribution: Spearheaded the verification of FP8 training for the Swallow Project (a Japanese-English bilingual LLM).
  • Summary: Investigated FP8 stability for continual pre-training of 70B-scale models. While prior work focused on from-scratch training, I found that FP8 introduces instability during continual pre-training of Llama-3-70B and demonstrated that the default DelayedScaling recipe in Transformer Engine v1.x is insufficient for this regime.
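For context, a minimal sketch of how an FP8 recipe is configured with Transformer Engine's v1.x API; the recipe values and layer shapes here are illustrative, not the settings studied in the paper.

  import torch
  import transformer_engine.pytorch as te
  from transformer_engine.common.recipe import DelayedScaling, Format

  # Illustrative recipe: HYBRID uses E4M3 for forward tensors and E5M2 for gradients;
  # amax_history_len and amax_compute_algo control how scaling factors are derived.
  fp8_recipe = DelayedScaling(
      fp8_format=Format.HYBRID,
      amax_history_len=1024,
      amax_compute_algo="max",
  )

  layer = te.Linear(4096, 4096, bias=False).cuda()
  x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

  with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
      y = layer(x)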

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

Co-Author arXiv Hugging Face
  • Role: Served as the Lead for Pre-training, Library Development, and Distributed Training (May 2023 - Aug 2024).
  • Summary: Technical report on the LLM-jp initiative. I oversaw the infrastructure and training pipeline for building fully open Japanese LLMs from scratch.
  • Note: Authors are listed in alphabetical order.

Japanese Publications

継続事前学習による日本語に強い大規模言語モデルの構築

(Construction of Strong Japanese LLMs through Continual Pre-training)
First Author NLP2024 Award Slides Hugging Face Code
  • Authors: 藤井一喜(Kazuki Fujii), 中村泰士, Mengsay Loem, 飯田大貴, 大井聖也, 服部翔, 平井翔太, 水木栄, 横田理央, 岡崎直観
  • Venue: 言語処理学会第30回年次大会 (NLP2024)
  • Award: 優秀賞 (Outstanding Paper Award; one of the 12 best papers out of 599 submissions)

Swallowコーパス: 日本語大規模ウェブコーパス

(Swallow Corpus: A Large-Scale Japanese Web Corpus)
Co-Author NLP2024 Award
  • Authors: 岡崎直観, 服部翔, 平井翔太, 飯田大貴, 大井聖也, 藤井一喜(Kazuki Fujii), 中村泰士, Mengsay Loem, 横田理央, 水木栄
  • Venue: 言語処理学会第30回年次大会 (NLP2024)
  • Award: 優秀賞 (Outstanding Paper Award; one of the 12 best papers out of 599 submissions)

大規模言語モデルの分散並列学習

(Distributed Parallel Training of Large Language Models)
First Author IPSJ Award Slides Code
  • Authors: 藤井一喜(Kazuki Fujii), 横田理央
  • Venue: 情報処理学会 第86回全国大会 (2024)
  • Award: 大会優秀賞 (Best Paper Award of IPSJ National Convention)

Swallowコーパスv2: 教育的な日本語ウェブコーパスの構築

(Swallow Corpus v2: Building an Educational Japanese Web Corpus)
Co-Author NLP2025
  • Authors: 服部 翔, 岡崎 直観, 水木 栄, 藤井 一喜(Kazuki Fujii), 中村 泰士, 大井 聖也, et al.
  • Venue: 言語処理学会第31回年次大会 (NLP2025)

模倣学習による大規模言語モデルの指示チューニング

(Instruction Tuning of Large Language Models via Imitation Learning)
Co-Author NLP2025
  • Authors: Youmi Ma, 水木 栄, 藤井 一喜(Kazuki Fujii), 中村 泰士, 大井 聖也, et al.
  • Venue: 言語処理学会第31回年次大会 (NLP2025)

新聞記事からつくる 時事と社会に強い日本語LLM

(Building Japanese LLMs Strong in Current Events and Society from Newspaper Articles)
Co-Author NLP2025
  • Authors: 服部 翔, 水木 栄, 藤井 一喜(Kazuki Fujii), 中村 泰士, et al.
  • Venue: 言語処理学会第31回年次大会 (NLP2025)

Talks

— 2025 —

合成データパイプラインを利用したSwallow ProjectにおけるLLM性能向上

(Improving LLM Performance in the Swallow Project using Synthetic Data Pipelines)
Date Japanese Slides Event
  • Event: AWS AI Frontier Meetup 2025

論文では語られないLLM開発において重要なこと ― Swallow Projectを通して

(Important Aspects of LLM Development Untold in Papers: Through the Swallow Project)
Date Japanese Video Slides Event
  • Event: NLP Colloquium (第81回 NLPコロキウム)

IHPCSS 2025 Lisbon

Date English Slides Event
  • Event: International High Performance Computing Summer School (IHPCSS 2025) in Lisbon

Amazon SageMaker HyperPod を利用した日本語 LLM (Swallow) の構築

(Building the Japanese LLM "Swallow" using Amazon SageMaker HyperPod)
Date Japanese Video Slides Event
  • Event: AWS Summit Japan 2025 (Session CUS-02)

— 2024 —

Continual Pre-Training on TSUBAME for a Target Language

Date English Event
  • Event: SC24 (The International Conference for High Performance Computing) Tokyo Tech Booth
  • Summary: Introduced a methodology for continual pre-training of Llama-3 (8B/70B) on the TSUBAME supercomputer, focusing on efficient adaptation strategies.

大規模モデルの学習知見

(Insights on Training Large-Scale Models)
Date Japanese Video Slides Event
  • Event: NVIDIA AI Summit Japan 2024

Google Cloud の AI Hypercomputer で学習を加速させる

(Accelerating Training with Google Cloud AI Hypercomputer)
Date Japanese Video Event
  • Event: Google Cloud Next '24 Tokyo
  • Topic: GENIAC 2024, H100 (A3) Cluster with Cluster Toolkit

大規模言語モデルの分散並列学習

(Distributed Parallel Training of Large Language Models)
Date Japanese Event
  • Event: RIKEN AIP (理研AIP) Seminar
  • Summary: Explained technical aspects of Data, Tensor, and Pipeline parallelism (3D Parallelism) for efficient LLM training, including real-world project examples.

自然言語処理のための分散並列学習

(Distributed Parallel Training for Natural Language Processing)
Date Invited Japanese Slides Event
  • Event: NLP2024 Workshop (Invited Talk)

大規模言語モデルの事前学習知見

(Pre-Training Insights of Large Language Models)
Date Japanese Slides Event
  • Event: DSAI Symposium 2023 (Tokyo Tech)

Books

大規模言語モデル入門 Ⅱ 〜生成型LLMの実装と評価

(Introduction to Large Language Models II: Implementation and Evaluation of Generative LLMs)
Role Publisher Date Amazon

Authored Chapter: Distributed Parallel Training

I contributed as a technical author focusing on the scalability infrastructure for Large Language Models. I designed and implemented hands-on tutorials for pre-training Llama-2 from scratch, bridging the gap between theoretical concepts and production-grade engineering.

  • Core Topics: Data Parallelism, DeepSpeed ZeRO, Pipeline Parallelism (PP), Tensor Parallelism (TP), and 3D Parallelism (a small rank-mapping sketch follows this list).
  • Impact: Highly rated on Amazon Japan, serving as a definitive technical resource for LLM engineers in Japan.
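As a flavor of the chapter's subject, a tiny sketch of 3D-parallelism bookkeeping: mapping a global rank onto (data, pipeline, tensor) coordinates. The ordering used here (tensor fastest, then pipeline, then data) is assumed for illustration; real frameworks define their own conventions, and this is not code from the book.

  def rank_to_3d(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
      """Map a global rank to (dp_rank, pp_rank, tp_rank) on a TP x PP x DP grid."""
      assert rank < tp * pp * dp, "rank out of range for this grid"
      tp_rank = rank % tp                # tensor-parallel index varies fastest
      pp_rank = (rank // tp) % pp        # then the pipeline stage
      dp_rank = rank // (tp * pp)        # data-parallel replica varies slowest
      return dp_rank, pp_rank, tp_rank

  # Example: 16 GPUs arranged as TP=2, PP=2, DP=4.
  for r in range(16):
      print(r, rank_to_3d(r, tp=2, pp=2, dp=4))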

Technical Blogs

Developing a 172B LLM with Strong Japanese Capabilities Using NVIDIA Megatron-LM

NVIDIA Blog 172B FP8 H100

Summary: As part of the GENIAC initiative by Japan’s METI, I led the training of a 172 billion parameter model from scratch. We utilized NVIDIA H100 GPUs and achieved a 1.4x speedup (550 TFLOP/s) by using FP8 hybrid training with Megatron-Core and Transformer Engine.


Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

AWS Blog SageMaker Llama 3.3

Summary: This post details the development of Llama 3.3 Swallow (70B), which outperforms GPT-4o-mini on Japanese tasks. I discussed infrastructure optimization on Amazon SageMaker HyperPod and techniques for efficient continual pre-training.


Technical Articles (Japanese)

A collection of technical articles posted on Zenn (a popular engineering knowledge sharing platform in Japan).

Zenn Profile Likes

LLM Development & Training Infrastructure

大規模言語モデル(LLM)の作り方 Megatron-DeepSpeed編 Part1
(How to Build LLMs: Megatron-DeepSpeed Edition Part 1)

Likes Megatron

Swallow: LLaMA-2 日本語継続事前学習モデル
(Swallow: LLaMA-2 Continual Pre-training for Japanese)

Likes LLaMA-2

大規模言語モデル(LLM)の作り方 GPT-NeoX編 Part 1
(How to Build LLMs: GPT-NeoX Edition Part 1)

Likes NeoX

GENIAC: 172B 事前学習知見
(GENIAC: Insights from Pre-training a 172B Parameter Model)

Likes 172B

LLM開発の裏で行われるデバッグ作業: PyTorch DCP
(Debugging Behind LLM Development: Deep Dive into PyTorch DCP)

Likes Debug

Advanced Optimization & Techniques

FP8 trainingを支える技術 1
(Technologies Supporting FP8 Training Part 1)

Likes FP8

NVIDIA NeMoを利用したGPT-OSSの学習
(Training GPT-OSS using NVIDIA NeMo)

Likes NeMo

Kotomamba: Mamba State Space Model 分散学習ライブラリ
(Kotomamba: Distributed Training Library for Mamba SSM)

Likes Mamba

Infrastructure & Tips

Google Cloud: HPC Toolkitにて大規模深層学習環境を整備する
(Setting up Large-Scale Deep Learning Environments with Google Cloud HPC Toolkit)

GCP HPC

[Tips] PyTorchにおける動的リンク
(Dynamic Linking in PyTorch)

Likes PyTorch