Google Scholar GitHub Email: kifish.pro@gmail.com
Experience
- 2023.04-present LLM Researcher at ByteDance Seed LLM
- 2021.07-2023.03 NLP Algorithm Engineer at Kuaishou MMU
Publications
2025
[LLM pretrain] Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang. Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space. arXiv:2512.24617, 2025.10
- Great Team Collaboration
- We propose Dynamic Large Concept Models (DLCM), a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient.
- I design and construct the training data entirely from open-source data.
- arXiv Hugging Face
[LLM pretrain] Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian. Scaling Latent Reasoning via Looped Language Models. arXiv:2510.25741, 2025.10
- Great Team Collaboration
- We scale up Looped Language Models (LoopLM) to 2.6 billion parameters and complete pretraining on 7.7 trillion open-source tokens, following a multi-stage data recipe encompassing Pretraining, Continual Training (CT), Long-CT, and Mid-Training. The resulting model is on par with SOTA language models 2–3× its size. We open-source all model weights and the data recipe (a generic sketch of the looped-layer idea is shown after this entry).
- I design and curate all pretraining data mixtures from open-source data and provide key insights throughout the pretraining process.
- Project Page arXiv Twitter Hugging Face 机器之心
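For illustration, here is a minimal PyTorch sketch of the generic looped-layer idea behind LoopLM-style models: a single parameter-shared transformer block is applied repeatedly, so extra "depth" comes from recurrence rather than extra parameters. The module sizes, loop count, and masking details are placeholders, not the paper's actual architecture or training setup.

```python
import torch
import torch.nn as nn

class ToyLoopedLM(nn.Module):
    """Toy looped LM: one shared transformer block applied `n_loops` times.
    All hyperparameters are illustrative, not the paper's configuration."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_loops=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # A single parameter-shared block, reused across loop iterations.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.n_loops = n_loops
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
        x = self.embed(input_ids)
        # Additive causal mask: -inf above the diagonal, 0 elsewhere.
        seq_len = input_ids.size(1)
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device),
            diagonal=1,
        )
        for _ in range(self.n_loops):  # latent "reasoning" via recurrence
            x = self.block(x, src_mask=mask)
        return self.lm_head(x)

logits = ToyLoopedLM()(torch.randint(0, 32000, (2, 16)))  # -> (2, 16, 32000)
```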
[LLM pretrain] Kai Hua, Steven Wu, Ge Zhang. AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection. arXiv:2505.07293, 2025.05
- LLM Pretraining Data Selection (Idea Originator & Project Leader)
- We propose AttentionInfluence, a training-free and supervision-free method for reasoning-centric data selection. By masking attention heads in a small pretrained model and measuring the resulting loss differences, we identify reasoning-intensive data that significantly improves the performance of larger models. Applied to a 7B model, our approach yields consistent gains on benchmarks such as MMLU, GSM8K, and HumanEval, demonstrating an effective weak-to-strong scaling path for reasoning-focused pretraining (a minimal scoring sketch is shown after this entry).
- arXiv Twitter 量子位 Community Reproduction Submission Log
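As a rough illustration of the loss-difference scoring described above, the sketch below ablates a few attention heads in a small proxy model and ranks documents by the relative loss increase. GPT-2 is used only because its `forward` exposes a `head_mask` argument; the choice of proxy model, the masked head indices, and the normalization are placeholders rather than the paper's actual recipe.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small proxy model; GPT-2 is a stand-in chosen because it exposes `head_mask`.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Hypothetical set of "important" (layer, head) pairs to ablate -- placeholders,
# not the heads actually identified in the paper.
HEADS_TO_MASK = [(3, 2), (5, 7), (9, 1)]

def lm_loss(text, head_mask=None):
    """Token-averaged causal LM loss, optionally with some heads disabled."""
    enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"], head_mask=head_mask)
    return out.loss.item()

def attention_influence(text):
    """Relative loss increase when the selected heads are masked out.
    Larger values are read as more reasoning-intensive samples."""
    head_mask = torch.ones(model.config.n_layer, model.config.n_head)
    for layer, head in HEADS_TO_MASK:
        head_mask[layer, head] = 0.0
    base, masked = lm_loss(text), lm_loss(text, head_mask)
    return (masked - base) / base

docs = ["First compute 17 * 23, then subtract 41 ...", "The weather was nice today."]
keep = sorted(docs, key=attention_influence, reverse=True)[: len(docs) // 2]
```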
[LLM posttrain] Jinrui Liu, Jeff Wu, Xuanguang Pan, Gavin Cheung, Shuai Ma, Chongyang Tao. AIR: Post-training Data Selection for Reasoning via Attention Head Influence. arXiv:2512.13279, 2025.12
- LLM Post-training Data Selection (Idea Originator & Project Leader)
- We propose AIR (Attention Influence for Reasoning), a training-free and unsupervised framework for post-training data selection. AIR measures the influence of attention heads to estimate the reasoning intensity of samples and intermediate steps, enabling more effective data filtering for multi-step reasoning tasks. Our results on Qwen2.5-32B with the s1 dataset show consistent improvements across diverse reasoning benchmarks while maintaining strong generalization (a rough step-level scoring sketch follows below).
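In the same hedged spirit, here is a standalone sketch of how step-level scoring might look: each intermediate step of a response is scored by the relative loss increase under a head ablation, and the per-step scores are averaged into a sample score. The prompt/step splitting, the single placeholder head, and the aggregation are assumptions, not the paper's procedure.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Proxy model and head ablation as in the AttentionInfluence sketch above;
# GPT-2 and the masked head index are stand-ins, not the paper's setup.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[5, 7] = 0.0  # placeholder "reasoning" head

def lm_loss(text, mask=None):
    enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"], head_mask=mask)
    return out.loss.item()

def step_scores(prompt, response, sep="\n"):
    """Relative loss increase per intermediate step when the head is disabled;
    higher values are read as more reasoning-intensive steps."""
    scores, prefix = [], prompt
    for step in filter(None, (s.strip() for s in response.split(sep))):
        prefix = prefix + sep + step
        base, masked = lm_loss(prefix), lm_loss(prefix, head_mask)
        scores.append((step, (masked - base) / base))
    return scores

def sample_score(prompt, response):
    """Average the step scores into one reasoning-intensity estimate per sample."""
    vals = [v for _, v in step_scores(prompt, response)]
    return sum(vals) / max(len(vals), 1)
```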
[LLM posttrain] Xuanguang Pan, Chongyang Tao, Jiayuan Bai, Jianling Gao, Zhengwei Tao, Xiansheng Zhou, Gavin Cheung, Shuai Ma. EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis. arXiv:2601.04875, 2026.01
- Great Team Collaboration
- We propose a structure-aware framework for generating high-quality Text-to-SQL training data. Instead of relying on uncontrolled LLM generation, EvolSQL systematically increases SQL complexity through syntax-tree-based transformation operators, enabling scalable and diverse data synthesis. Experiments show that models trained on EvolSQL data achieve strong performance and generalization with significantly less data (1/18 as much), highlighting the importance of structure-aware data construction for semantic parsing.
[Model] Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Wenhao Huang, Di He, Tianle Cai. In-Place Test-Time Training. ICLR 2026.
- Great Team Collaboration
- ICLR 2026 Oral
[LLM evaluation] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents. arXiv:2512.12730, 2025.12
- Great Team Collaboration
- Contributed through discussion and collaboration
- Labeled example cases for the benchmark
- arXiv Hugging Face 机器之心 Twitter
Seed VLM&LLM Team. Seed-2.0, Technical Report, 2026.02
- VLM&LLM (Team Collaboration)
- Construct all newly added long-context (128K/512K) CT data and long-context evaluation
- paper
Seed VLM&LLM Team. Seed-1.8, Technical Report, 2025.12
- VLM&LLM (Team Collaboration)
- Construct all newly added long-context (128K/512K) CT data and long-context evaluation
- github
Seed Model&LLM&VLM Team. Seed-VWN, Technical Report, 2025.11
- Model&LLM&VLM (Team Collaboration)
- Construct all newly added long-context (128K/512K) CT data and long-context evaluation
- arXiv
Seed LLM Team. Seed OSS 36B, Open Source Model, 2025.08
- LLM Code/Pretrain (Team Collaboration)
- [MASK] the text mid-training and long-context (128K/512K) CT
- Hugging Face 量子位
Seed LLM&VLM Team. Seed-1.6, Technical Blog, 2025.06
- LLM&VLM Pretrain (Team Collaboration)
- [MASK] the multimodal long-context (128K/512K) CT
- Technical Blog 机器之心
Seed VLM&LLM Team. Seed1.5-VL Technical Report. arXiv:2505.07062, 2025.05
Seed LLM Team. Seed-Thinking-v1.5: Advancing Superb Reasoning Models with Reinforcement Learning. arXiv:2504.13914, 2025.04
2024
- [Embedding] Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Kai Hua, Zhengwei Tao, and Shuai Ma. LLMs are also effective embedding models: An in-depth overview. arXiv:2412.12591, 2024.12
- arXiv [TOIS 2025]
2020
- [Retrieval-Based Chatbot] Kai Hua, Zhiyuan Feng, Chongyang Tao, Rui Yan, Lu Zhang. Learning to Detect Relevant Contexts and Knowledge for Response Selection in Retrieval-based Dialogue Systems. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM 2020), 2020.10
- arXiv [CIKM 2020]