I'm a first-year Master's student working at the intersection of HPC and machine learning. My research focuses on distributed training of large models and low-precision (FP8) training. I am a core contributor to the Swallow Project, a Japanese LLM development initiative. I also maintain the LLM pre-training library and conduct LLM training experiments across many projects.

My main interest is efficient training of large models: I regularly profile LLM training runs with the PyTorch profiler and Nsight Systems, and I am particularly interested in low-precision training. In our experiments, FP8 training with delayed scaling was not stable enough for long training runs, as reported in our paper. I am currently researching how to improve the stability of FP8 training using the Microscaling (MX) data format and tile-wise fine-grained quantization.
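
To make the tile-wise idea concrete, the sketch below simulates fine-grained FP8 (E4M3) quantization in PyTorch, where each tile of a matrix gets its own scale derived from the tile's current absolute maximum instead of a single delayed per-tensor scale. This is an illustrative toy, not the implementation used in my experiments; it assumes PyTorch 2.1+ for the torch.float8_e4m3fn dtype, and the function name and tile size are made up for the example.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude of the E4M3 format

def fake_quantize_tilewise(x: torch.Tensor, tile: int = 128) -> torch.Tensor:
    """Quantize a 2-D tensor to FP8 tile by tile, then dequantize it again."""
    rows, cols = x.shape
    assert rows % tile == 0 and cols % tile == 0, "pad to a multiple of the tile size first"
    # View as (row_tiles, tile, col_tiles, tile) so each tile is a separate block.
    blocks = x.view(rows // tile, tile, cols // tile, tile)
    # One scale per tile, computed from that tile's current absolute maximum.
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp_min(1e-12)
    scale = FP8_E4M3_MAX / amax
    # Cast to FP8 and straight back, emulating the precision loss of FP8 storage.
    q = (blocks * scale).to(torch.float8_e4m3fn)
    deq = q.to(torch.float32) / scale
    return deq.view(rows, cols)

x = torch.randn(256, 256)
err = (x - fake_quantize_tilewise(x)).abs().max().item()
print(f"max abs error after tile-wise fake quantization: {err:.5f}")
```

Compared with delayed scaling, which reuses an amax history from previous iterations for the whole tensor, the per-tile scale here always reflects the current values, which is roughly the property that MX-style formats and tile-wise recipes exploit for stability.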

News

Nov 2024

NVIDIA AI Summit 2024 Tokyo Talk

I gave a talk at NVIDIA AI Summit 2024 Tokyo on the topic of 'How to train LLM efficiently with Megatron-LM and TransformerEngine'.

Aug 2024

Google Cloud Next '24 Tokyo Talk

I gave a talk at Google Cloud Next '24 Tokyo on the topic of 'How to use Google Cluster Toolkit and real use-case'.

Mar 2024

NLP 2024 workshop talk

I gave a talk at the NLP 2024 workshop on the topic of 'Distributed Training Technologies for Natural Language Processing'.

Education

2024—Present

Institute of Science Tokyo

Master's in Computer Science

Advisors: Prof. Jun Sakuma and Prof. Rio Yokota

2020—2024

Tokyo Institute of Technology

B.S. in Computer Science

Publications

ICLR 2025

Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki

SC (TPC workshop) 2024

llm-recipes: A Framework for Seamless Integration and Efficient Continual Pre-Training of Large Language Models

Kazuki Fujii, Taishi Nakamura, Rio Yokota

CVPR (workshop) 2024

Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese

Yuichi Inoue, Kento Sasaki, Yuma Ochi, Kazuki Fujii, Kotaro Tanahashi, Yu Yamaguchi

COLM 2024

Building a Large Japanese Web Corpus for Large Language Models

Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki

COLM 2024

Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki

Experience

Apr 2024 - Present

Research Intern SB Intuitions

Manager: Sho Takase

Worked on developing frameworks for training large language models.

Oct 2023 - Present

Research Intern AIST (National Institute of Advanced Industrial Science and Technology)

Manager: Hiroya Takamura

I am involved in selecting and maintaining pre-training and post-training libraries, managing experiments, and setting up experimental environments to develop a Japanese LLM with competitive performance. This initiative, known as the Swallow Project (https://swallow-llm.github.io/index.en.html), has contributed to the development of non-English LLMs, achieving top performance among open Japanese models as of December 2023. As a core contributor to the project, I have been broadly involved in all aspects of the training process, from procuring computational resources and maintaining environment modules to creating synthetic data.

Feb 2023 - Present

Research Intern Turing

Manager: Yu Yamaguchi

Worked on developing frameworks for training vision-language models and large language models.

Apr 2024 - Dec 2024

Intern Sakana AI

Manager: Takuya Akiba

Worked on deploying and maintaining an H100 cluster for research and development of large language models.

Oct 2023 - Feb 2024

Research Intern Kotoba Technologies

Manager: Noriyuki Kojima

Worked on developing an LLM training library and on training large language models. In December 2023, I developed a Mamba training library at a time when Hugging Face Transformers did not yet support Mamba training.

Summer 2023

Intern Preferred Networks, Inc.

Developed an image recognition system for real-world applications.

Research Library

vlm-recipes: VLM Training Framework

Python, PyTorch

A framework for training vision-language models with PyTorch FSDP. As of May 2024, Megatron-LM did not support training vision-language models (VLMs), so I independently extended llm-recipes to enable visual instruction tuning, resulting in vlm-recipes. Development was halted once Megatron-LM added support for LLaVA training.
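
For context, the snippet below is a minimal sketch of the PyTorch FSDP pattern that vlm-recipes builds on: shard the model at the transformer-block granularity with bf16 mixed precision. It is not vlm-recipes' actual code; the checkpoint name is only an example, and a real VLM setup would also list the vision tower's block class in the wrap policy.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Example checkpoint; a VLM run would load the language and vision towers instead.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

# Shard parameters at the decoder-block level so each FSDP unit is one layer.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16
    ),
    device_id=torch.cuda.current_device(),
)
```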

moe-recipes: Mixture-of-Experts LLM Training Framework

Python, PyTorch

As of January 2024, the range of MoE models supported by Megatron-LM was limited, and the version of Megatron-LM that megablocks relied upon was outdated. Consequently, enabling continual pre-training of Mixtral required a custom library. I independently created moe-recipes, a library that uses DeepSpeed as its backend, which supported the development of tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1. The library has also been used in the experiments for the ICLR 2025 paper 'Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization.'
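
As a rough illustration of the DeepSpeed-backend approach, the sketch below wires a Hugging Face MoE checkpoint into a DeepSpeed engine with a ZeRO configuration. The config values, ZeRO stage, and model name are illustrative assumptions, not the settings moe-recipes actually ships with.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Illustrative DeepSpeed config: ZeRO partitioning plus bf16 training.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5, "weight_decay": 0.1}},
    "zero_optimization": {"stage": 3, "overlap_comm": True},
    "gradient_clipping": 1.0,
}

# Example MoE checkpoint for continual pre-training.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)

# deepspeed.initialize wraps the model in an engine that shards optimizer and
# parameter state across ranks and handles backward()/step() internally.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```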

llm-recipes: LLM Continual Pre-training & Post-training Framework

Python, PyTorch

As of January 2024, Megatron-LM did not support training Mistral-7B-v0.1, so I built upon Meta's llama-recipes (now known as llama-cookbook) to develop a library that enables the training of non-Llama models. I modified the DataLoader to handle training at the 100B-token scale, integrated wandb logging, and implemented additional essential training features such as learning rate scheduling. The resulting library, llm-recipes, supports continual pre-training, supervised fine-tuning (SFT), and DPO. This work was submitted to and accepted at the SC24 TPC workshop (https://tpc.dev/tpc-workshop-at-sc24/). The library was used to train tokyotech-llm/Swallow-MS-7b-v0.1 and tokyotech-llm/Swallow-MS-7b-instruct-v0.1 as part of the Swallow Project, where I led the training efforts.
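
To give a flavor of the scheduling and logging features mentioned above, here is a minimal, self-contained sketch of linear warmup followed by cosine decay, with the loss and learning rate logged to wandb. The hyperparameters, project name, and dummy model are assumptions for the example and are not llm-recipes' actual defaults or API.

```python
import math

import torch
import wandb

def warmup_cosine(step: int, warmup_steps: int, total_steps: int,
                  min_ratio: float = 0.1) -> float:
    """Multiplicative LR factor: linear warmup, then cosine decay to min_ratio."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(16, 16)  # stand-in for the LLM being trained
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: warmup_cosine(s, warmup_steps=100, total_steps=1000)
)

wandb.init(project="llm-recipes-sketch")  # project name is hypothetical
for step in range(1000):
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    wandb.log({"train/loss": loss.item(),
               "train/lr": scheduler.get_last_lr()[0]}, step=step)
```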

kotomamba: State Space Model Training Framework

Python, PyTorch

As of December 2023, even popular libraries like Hugging Face Transformers did not support Mamba. To enable both from-scratch training and continual pre-training of Mamba models, I independently developed kotomamba, a distributed training library built on PyTorch FSDP.