I'm a first-year Master's student working at the intersection of HPC and Machine Learning. My research focuses on distributed training of large models and low-precision (FP8) training. I am a core contributor to the Swallow Project, a Japanese LLM development initiative. I am also in charge of maintaining an LLM pre-training library and running LLM training experiments across many projects.
My main interest is the efficient training of large models. I usually profile the LLM training process with the PyTorch profiler or Nsight Systems, and I am also interested in low-precision training. In our experiments, FP8 training with delayed scaling was not sufficiently stable for long training runs, as reported in our paper. I am currently researching how to improve the stability of FP8 training with the Microscaling (MX) data format and tile-wise fine-grained quantization.
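As a concrete illustration, the snippet below is a minimal sketch of how a training step can be wrapped with the PyTorch profiler; the stand-in linear model, step counts, and output directory are illustrative placeholders, not code from any project listed below.

    import torch
    from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

    model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for a real transformer block
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=3),  # skip startup, record a few hot steps
        on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
        record_shapes=True,
        with_stack=True,
    ) as prof:
        for _ in range(8):
            x = torch.randn(8, 4096, device="cuda")
            loss = model(x).square().mean()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            prof.step()  # advance the wait/warmup/active schedule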
News
November 2024
NVIDIA AI Summit 2024 Tokyo Talk
I gave a talk at NVIDIA AI Summit 2024 Tokyo on the topic of 'How to train LLM efficiently with Megatron-LM and TransformerEngine'.
August 2024
Google Cloud Next '24 Tokyo Talk
I gave a talk at Google Cloud Next '24 Tokyo on the topic of 'How to use Google Cluster Toolkit and real use-case'.
March 2024
NLP 2024 Workshop Talk
I gave a talk at the NLP 2024 workshop on the topic of 'Distributed Training Technologies for Natural Language Processing'.
Education
Institute of Science Tokyo
M.S. in Computer Science
Advisors: Prof. Jun Sakuma and Prof. Rio Yokota
Tokyo Institute of Technology
B.S. in Computer Science
Publications
ICLR 2025
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
SC (TPC Workshop) 2024
llm-recipes: A Framework for Seamless Integration and Efficient Continual Pre-Training of Large Language Models
Kazuki Fujii, Taishi Nakamura, Rio Yokota
CVPR (Workshop) 2024
Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese
Yuichi Inoue, Kento Sasaki, Yuma Ochi, Kazuki Fujii, Kotaro Tanahashi, Yu Yamaguchi
COLM 2024
Building a Large Japanese Web Corpus for Large Language Models
Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki
Experience
Research Intern — SB Intuitions
Manager: Sho Takase
Worked on developing frameworks for training large language models.
Research Intern — AIST (National Institute of Advanced Industrial Science and Technology)
Manager: Hiroya Takamura
I am involved in selecting and maintaining pre-training and post-training libraries, managing experiments, and setting up experimental environments to develop a Japanese LLM with competitive performance. This initiative, the Swallow Project (https://swallow-llm.github.io/index.en.html), has contributed to the development of non-English LLMs, achieving top performance among open Japanese models as of December 2023. As a core contributor, I have been broadly involved in all aspects of the training process, from procuring computational resources and maintaining environment modules to creating synthetic data.
Research Intern — Turing
Manager: Yu Yamaguchi
Worked on developing frameworks for training vision-language models and large language models.
Intern — Sakana AI
Manager: Takuya Akiba
Worked on deploying and maintaining an H100 cluster for research and development of large language models.
Research Intern — Kotoba Technologies
Manager: Noriyuki Kojima
Worked on developing an LLM training library and on training large language models. I developed a Mamba training library in December 2023, when Hugging Face Transformers did not yet support Mamba training.
Intern — Preferred Networks, Inc.
Developed an image recognition system for real-world applications.
Research Libraries
vlm-recipes: VLM training Framework
A framework for training vision-language models with PyTorch FSDP. As of May 2024, Megatron-LM did not support training vision-language models (VLMs), so I independently extended llm-recipes to enable Visual Instruction Tuning, resulting in vlm-recipes. Development was halted once Megatron-LM added support for training LLaVA.
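The core pattern behind vlm-recipes is standard PyTorch FSDP sharding with mixed precision. The snippet below is a minimal sketch of that pattern; the tiny encoder layer and dtype choices are placeholders, not the project's actual model or configuration.

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

    # Assumes launch via torchrun, which sets the rank/world-size environment variables.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()  # stand-in block
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    )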
moe-recipes: Mixture of Experts LLM training Framework
As of January 2024, the range of MoE models supported by Megatron-LM was limited, and the version of Megatron-LM that megablocks relied upon was outdated. Consequently, enabling continual pre-training of Mixtral required a custom library. I independently created moe-recipes, a library built on DeepSpeed as the backend, which supported the development of tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1. The library has also been used in the experiments for the ICLR 2025 paper 'Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization.'
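moe-recipes uses DeepSpeed as its backend; the sketch below shows the general engine-initialization pattern that entails. The placeholder model and config values are illustrative, not the project's real settings.

    import torch
    import deepspeed

    model = torch.nn.Linear(4096, 4096)  # stand-in for an MoE transformer
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 3},  # shard parameters, gradients, and optimizer state
        "optimizer": {"type": "AdamW", "params": {"lr": 1.0e-4}},
    }

    # The returned engine replaces the usual loop calls:
    # engine.backward(loss) and engine.step() instead of loss.backward() / optimizer.step().
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )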
llm-recipes: LLM continual pre-training & post-training Framework
As of January 2024, Megatron-LM did not support training Mistral-7B-v0.1, so I built upon Meta's llama-recipes (now known as llama-cookbook) to develop a library that enables training of non-Llama models. I modified the DataLoader to handle training at the 100B-token scale, integrated wandb logging, and implemented additional essential training features such as learning-rate scheduling. The resulting library, llm-recipes, supports continual pre-training, supervised fine-tuning (SFT), and DPO. This work was accepted at the SC24 TPC Workshop (https://tpc.dev/tpc-workshop-at-sc24/). The library was used to train tokyotech-llm/Swallow-MS-7b-v0.1 and tokyotech-llm/Swallow-MS-7b-instruct-v0.1 as part of the Swallow Project, where I led the training efforts.
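As an example of the training features mentioned above, warmup-plus-cosine learning-rate scheduling can be written in plain PyTorch roughly as follows; the warmup length, total steps, and minimum ratio are illustrative placeholders rather than the settings used for the Swallow models.

    import math
    import torch

    def lr_lambda(step, warmup_steps=1000, total_steps=100_000, min_ratio=0.1):
        # Linear warmup, then cosine decay down to min_ratio of the peak learning rate.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr is the peak value scaled by lr_lambda
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)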
kotomamba: State Space Model training Framework
As of December 2023, even popular libraries like Hugging Face Transformers did not support Mamba. To enable both from-scratch training and continual pre-training for Mamba models, I independently developed kotomamba, a distributed training library built on PyTorch FSDP.