Publications
Peer-Reviewed Conference Papers
Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models
- Authors: Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki
- Venue: Conference on Language Modeling (COLM), 2025
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
- Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
- Venue: International Conference on Learning Representations (ICLR), 2025
Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities
- Authors: Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki
- Venue: Conference on Language Modeling (COLM), 2024
Building a Large Japanese Web Corpus for Large Language Models
- Authors: Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki
- Venue: Conference on Language Modeling (COLM), 2024
Workshop Papers
Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs
- Venue: Conference on Language Modeling (COLM) Multilingual and Equitable Language Technologies Workshop 2025
- Award: Outstanding Paper Award, Natural Language Processing Research Meeting of the Information Processing Society of Japan (IPSJ-NLP)
llm-recipes: A Framework for Seamless Integration and Efficient Continual Pre-Training of Large Language Models
- Venue: SC24 (Supercomputing) Trillion Parameter Consortium (TPC) Workshop
- Contribution: Led the development of a custom training framework designed for day-zero support of new LLMs not yet supported by Megatron-LM.
- Tech Stack: Built on PyTorch FSDP (v1), enabling SFT and continual pre-training for any model compatible with Hugging Face Transformers (see the sketch below).
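The core pattern behind this setup, shown as a minimal sketch rather than the actual llm-recipes code, is wrapping an arbitrary Hugging Face causal LM with PyTorch FSDP; the model name and hyperparameters below are illustrative placeholders.

```python
# Minimal sketch (not the llm-recipes implementation itself): wrap a Hugging Face
# causal LM with PyTorch FSDP (v1) so that any Transformers-compatible model can be
# fine-tuned or continually pre-trained. Model name and dtype are placeholders.
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")                      # launched via torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

# Shard at decoder-layer granularity, the usual choice for transformer LLMs.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
)
model = FSDP(model, auto_wrap_policy=wrap_policy,
             device_id=torch.cuda.current_device())

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ...standard training loop: forward pass, loss.backward(), optimizer.step()
```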
Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese
- Authors: Yuichi Inoue, Kento Sasaki, Yuma Ochi, Kazuki Fujii, Kotaro Tanahashi, Yu Yamaguchi
- Venue: CVPR 2024, The 3rd Workshop on Computer Vision in the Wild
Preprints
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
- Contribution: Led the entire project lifecycle—from inception and experimental design to dataset construction.
- Summary: Proposed “LLM Rewriting” to synthesize high-quality pre-training data in math and code. Demonstrated that improving style and logic (beyond simple rephrasing) significantly boosts performance, achieving state-of-the-art results among open math/code pre-training corpora.
- Datasets: Released Swallow-Math and Swallow-Code (v1 & v2); an illustrative sketch of the rewriting step follows.
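For illustration only, here is a minimal sketch of the rewriting idea using an open-weight model through the Hugging Face pipeline API; the model name, prompt, and sample are placeholders and this is not the actual Swallow-Code/Swallow-Math pipeline.

```python
# Illustrative sketch of LLM rewriting for pre-training data (not the actual
# Swallow-Code/Swallow-Math pipeline). An open-weight instruct model rewrites a raw
# code sample into a cleaner, self-contained version; all names are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

REWRITE_PROMPT = (
    "Rewrite the following code so it is self-contained, consistently styled, "
    "and algorithmically clear, without changing its behavior.\n\n{sample}\n\nRewritten:\n"
)

def rewrite(sample: str) -> str:
    prompt = REWRITE_PROMPT.format(sample=sample)
    out = generator(prompt, max_new_tokens=512, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):]

raw_docs = ["def f(x):return x*2  # messy, undocumented sample"]
rewritten_corpus = [rewrite(doc) for doc in raw_docs]
```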
Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs
- Contribution: Spearheaded the verification of FP8 training for the Swallow Project (Japanese & English Bilingual LLM).
- Summary: Investigated FP8 stability for continual pre-training of 70B-class models. While prior work focused on from-scratch training, I found that FP8 introduces instability during continual pre-training of Llama-3-70B and showed that the default DelayedScaling recipe in Transformer Engine v1.x is insufficient for this regime (illustrated in the sketch below).
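For context, a minimal sketch of the Transformer Engine delayed-scaling recipe under discussion; the layer shape and recipe values are illustrative placeholders, not the settings evaluated in the study.

```python
# Minimal sketch of FP8 training with Transformer Engine's DelayedScaling recipe,
# the default scaling strategy referenced above. All values are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # E4M3 in the forward pass, E5M2 in the backward pass
    amax_history_len=1024,      # window of past amax values used to derive scales
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=False).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```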
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
- Role: Served as the Lead for Pre-training, Library Development, and Distributed Training (May 2023 - Aug 2024).
- Summary: Technical report on the LLM-jp initiative. I oversaw the infrastructure and training pipeline for building fully open Japanese LLMs from scratch.
- Note: Authors are listed in alphabetical order.
Japanese Publications
継続事前学習による日本語に強い大規模言語モデルの構築 (Building Large Language Models with Strong Japanese Capabilities through Continual Pre-training)
- Authors: 藤井一喜 (Kazuki Fujii), 中村泰士, Mengsay Loem, 飯田大貴, 大井聖也, 服部翔, 平井翔太, 水木栄, 横田理央, 岡崎直観
- Venue: The 30th Annual Meeting of the Association for Natural Language Processing (NLP2024)
- Award: Outstanding Paper Award (優秀賞; one of 12 selected from 599 papers)
Swallowコーパス: 日本語大規模ウェブコーパス (Swallow Corpus: A Large-Scale Japanese Web Corpus)
- Authors: 岡崎直観, 服部翔, 平井翔太, 飯田大貴, 大井聖也, 藤井一喜 (Kazuki Fujii), 中村泰士, Mengsay Loem, 横田理央, 水木栄
- Venue: The 30th Annual Meeting of the Association for Natural Language Processing (NLP2024)
- Award: Outstanding Paper Award (優秀賞; one of 12 selected from 599 papers)
大規模言語モデルの分散並列学習 (Distributed Parallel Training of Large Language Models)
- Authors: 藤井一喜 (Kazuki Fujii), 横田理央
- Venue: The 86th National Convention of the Information Processing Society of Japan (IPSJ, 2024)
- Award: 大会優秀賞 (IPSJ National Convention Outstanding Paper Award)
Swallowコーパスv2: 教育的な日本語ウェブコーパスの構築 (Swallow Corpus v2: Building an Educational Japanese Web Corpus)
- Authors: 服部翔, 岡崎直観, 水木栄, 藤井一喜 (Kazuki Fujii), 中村泰士, 大井聖也, et al.
- Venue: The 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)
模倣学習による大規模言語モデルの指示チューニング (Instruction Tuning of Large Language Models via Imitation Learning)
- Authors: Youmi Ma, 水木栄, 藤井一喜 (Kazuki Fujii), 中村泰士, 大井聖也, et al.
- Venue: The 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)
新聞記事からつくる 時事と社会に強い日本語LLM (Building a Japanese LLM Strong in Current Affairs and Society from Newspaper Articles)
- Authors: 服部翔, 水木栄, 藤井一喜 (Kazuki Fujii), 中村泰士, et al.
- Venue: The 31st Annual Meeting of the Association for Natural Language Processing (NLP2025)
Talks
— 2025 —
合成データパイプラインを利用したSwallow ProjectにおけるLLM性能向上 (Improving LLM Performance in the Swallow Project Using a Synthetic Data Pipeline)
- Event: AWS AI Frontier Meetup 2025
論文では語られないLLM開発において重要なこと ― Swallow Projectを通して (What Matters in LLM Development That Papers Do Not Tell You: Through the Swallow Project)
- Event: The 81st NLP Colloquium (NLPコロキウム)
IHPCSS 2025 Lisbon
- Event: International High Performance Computing Summer School (IHPCSS 2025) in Lisbon
Amazon SageMaker HyperPod を利用した日本語 LLM (Swallow) の構築 (Building the Japanese LLM Swallow with Amazon SageMaker HyperPod)
- Event: AWS Summit Japan 2025 (Session CUS-02)
— 2024 —
Continual Pre-Training on TSUBAME for a Target Language
- Event: SC24 (The International Conference for High Performance Computing) Tokyo Tech Booth
- Summary: Introduced a methodology for continual pre-training of Llama-3 (8B/70B) on the TSUBAME supercomputer, focusing on efficient adaptation strategies.
大規模モデルの学習知見 (Insights from Training Large-Scale Models)
- Event: NVIDIA AI Summit Japan 2024
Google Cloud の AI Hypercomputer で学習を加速させる (Accelerating Training with Google Cloud AI Hypercomputer)
- Event: Google Cloud Next ‘24 Tokyo
- Topic: GENIAC 2024, H100 (A3) Cluster with Cluster Toolkit
大規模言語モデルの分散並列学習 (Distributed Parallel Training of Large Language Models)
- Event: RIKEN AIP Seminar
- Summary: Explained the technical aspects of data, tensor, and pipeline parallelism (3D parallelism) for efficient LLM training, including real-world project examples (see the sketch after this entry).
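As a companion to that explanation, a small back-of-the-envelope sketch of how the three parallelism degrees factor a GPU cluster; all numbers are hypothetical.

```python
# Back-of-the-envelope sketch: how tensor (TP), pipeline (PP), and data (DP)
# parallelism factor the total GPU count in 3D-parallel training. Hypothetical numbers.
world_size = 512            # total GPUs in the job
tensor_parallel = 8         # TP: shards each layer's matmuls, typically within a node
pipeline_parallel = 4       # PP: splits the layer stack across groups of GPUs

assert world_size % (tensor_parallel * pipeline_parallel) == 0
data_parallel = world_size // (tensor_parallel * pipeline_parallel)   # -> 16 model replicas

global_batch_size = 1024    # sequences per optimizer step
micro_batch_size = 2        # sequences per GPU per pipeline micro-step
grad_accum_steps = global_batch_size // (data_parallel * micro_batch_size)  # -> 32

print(f"DP={data_parallel}, gradient accumulation steps={grad_accum_steps}")
```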
自然言語処理のための分散並列学習 (Distributed Parallel Training for Natural Language Processing)
- Event: NLP2024 Workshop (Invited Talk)
大規模言語モデルの事前学習知見 (Insights from Pre-training Large Language Models)
- Event: DSAI Symposium 2023 (Tokyo Tech)
Books
大規模言語モデル入門 Ⅱ 〜生成型LLMの実装と評価 (Introduction to Large Language Models II: Implementation and Evaluation of Generative LLMs)
Authored Chapter: Distributed Parallel Training
I contributed as a technical author focusing on scalability infrastructure for large language models, designing and implementing hands-on tutorials for pre-training Llama-2 from scratch to bridge the gap between theoretical concepts and production-grade engineering.
- Core Topics: Data Parallelism, DeepSpeed ZeRO, Pipeline Parallelism (PP), Tensor Parallelism (TP), and 3D Parallelism (a minimal ZeRO config sketch follows this entry).
- Impact: Highly rated on Amazon Japan, serving as a go-to technical reference for LLM engineers in Japan.
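To give a flavor of the material, here is a minimal DeepSpeed ZeRO configuration of the kind the chapter walks through; the values are illustrative and not the book's tutorial settings.

```python
# Minimal DeepSpeed ZeRO config of the kind covered in the chapter.
# Values are illustrative, not the book's tutorial settings.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                  # shard optimizer states and gradients across DP ranks
        "overlap_comm": True,        # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,
    },
    "gradient_clipping": 1.0,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# ds_config.json is then passed to deepspeed.initialize() or the deepspeed launcher.
```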
Technical Blogs
Featured Articles (International)
Developing a 172B LLM with Strong Japanese Capabilities Using NVIDIA Megatron-LM
- Summary: As part of the GENIAC initiative by Japan's METI, I led the training of a 172-billion-parameter model from scratch on NVIDIA H100 GPUs, achieving a 1.4x speedup (550 TFLOP/s) with FP8 hybrid training using Megatron-Core and Transformer Engine (see the configuration sketch below).
- Links: English Post | Japanese Post
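A hedged sketch of what such a configuration looks like at the flag level; these are generic Megatron-LM options whose availability varies by version, not the actual GENIAC launch script.

```python
# Illustrative Megatron-LM arguments for FP8 hybrid training with Transformer Engine.
# Parallelism degrees and FP8 settings are placeholders, and flag availability
# depends on the Megatron-LM / Megatron-Core version in use.
megatron_args = [
    "--tensor-model-parallel-size", "8",
    "--pipeline-model-parallel-size", "8",
    "--sequence-parallel",
    "--use-distributed-optimizer",
    "--bf16",                               # master weights / gradients stay in BF16
    "--transformer-impl", "transformer_engine",
    "--fp8-format", "hybrid",               # E4M3 forward, E5M2 backward
    "--fp8-amax-history-len", "1024",
    "--fp8-amax-compute-algo", "max",
]
# Appended to the usual pretrain_gpt.py arguments (model size, data paths, optimizer).
```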
Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod
- Summary: This technical report details the development of Llama 3.3 Swallow (70B), which outperforms GPT-4o-mini on Japanese tasks, and discusses infrastructure optimization on Amazon SageMaker HyperPod along with techniques for efficient continual pre-training.
- Links: AWS Blog Post
Technical Articles (Japanese)
A collection of technical articles posted on Zenn (a popular engineering knowledge sharing platform in Japan).
LLM Development & Training Infrastructure
大規模言語モデル(LLM)の作り方 Megatron-DeepSpeed編 Part1
(How to Build LLMs: Megatron-DeepSpeed Edition Part 1)
Swallow: LLaMA-2 日本語継続事前学習モデル
(Swallow: LLaMA-2 Continual Pre-training for Japanese)
大規模言語モデル(LLM)の作り方 GPT-NeoX編 Part 1
(How to Build LLMs: GPT-NeoX Edition Part 1)
GENIAC: 172B 事前学習知見
(GENIAC: Insights from Pre-training a 172B Parameter Model)
LLM開発の裏で行われるデバッグ作業: PyTorch DCP
(Debugging Behind LLM Development: Deep Dive into PyTorch DCP)
Advanced Optimization & Techniques
FP8 trainingを支える技術 1
(Technologies Supporting FP8 Training Part 1)
NVIDIA NeMoを利用したGPT-OSSの学習
(Training GPT-OSS using NVIDIA NeMo)
Kotomamba: Mamba State Space Model 分散学習ライブラリ
(Kotomamba: Distributed Training Library for Mamba SSM)
Infrastructure & Tips
Google Cloud: HPC Toolkitにて大規模深層学習環境を整備する
(Setting up Large-Scale Deep Learning Environments with Google Cloud HPC Toolkit)
[Tips] PyTorchにおける動的リンク
(Dynamic Linking in PyTorch)