🐨 About Me
I am Fan Zhou, currently doing AI research at GAIR Lab (2024 – now), mentored by Prof. Pengfei Liu.
Previously, I obtained my Master's and Bachelor's degrees (IEEE Honor Class) at Shanghai Jiao Tong University (SJTU), majoring in computer science. I interned at Microsoft Research Asia (2021-2022) and the XLang Lab @ HKUNLP (2023), where I spent wonderful times with my mentors and colleagues.
Research Interests
I am broadly interested in Natural Language Processing, and I try to train strong models and build useful tools.
Recently, I have been particularly interested in:
- Data-Centric Methods and Foundation Model Development (ProX, Sailor2)
- Code Generation, Understanding, and Reasoning (M-STaR)
- Agentic Language Models and Applications (OpenAgents, Lemur)
🔥 News
- 2024.12: 🔥 Enjoy Sailor2, a state-of-the-art language model family for Southeast Asia.
- 2024.11: 🔥 We have released M-STaR, a self-evolving training recipe for multimodal reasoning.
- 2024.09: 🔥 We have released ProX, a small-LM-based pre-training data refining framework!
- 2024.09: 📄 OlympicArena paper is accepted by NeurIPS'24.
- 2024.07: 📄 OpenAgents paper is accepted by COLM'24.
- 2024.05: 📄 Preference Dissection paper is accepted by ACL'24.
- 2024.01: 📄 Our Lemur paper (agent model) is accepted by ICLR'24 (Spotlight, top 5%).
- 2023.10: 🔥 We've built OpenAgents, an open platform for language agents in the wild!
- 2023.10: 🙋 We have released Lemur-70B, an agentic language model based on Llama-2!
- 2023.04: 🔥 New preprint on applying symbolic tasks in instruction tuning!
- 2022.10: 📄 Our TaCube paper (Table QA) is accepted by EMNLP'22 (Oral Presentation).
Publications
Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
2024, Project.
Blog /
Code /
Models /
Pre-training Datasets /
Post-training Datasets /
X Thread /
An open state-of-the-art language model family for Southeast Asian languages, continually pre-trained from Qwen-2.5.
Diving into Self-Evolving Training for Multimodal Reasoning
Wei Liu*, Junlong Li*, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He, (*=equal contribution)
2024, Preprint.
PDF /
Code /
Resources /
Project Page /
A self-evolving training recipe for multimodal reasoning, M-STaR.
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Fan Zhou*, Zengzhi Wang*, Qian Liu, Junlong Li, Pengfei Liu, (*=equal contribution)
2024, Preprint.
PDF /
Code /
Dataset (>5K Downloads) /
Project Page /
A small-LLM-based pre-training data refining framework via seamless program generation, with >100B tokens of high-quality data released.
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
NeurIPS 2024 (Datasets and Benchmarks track)
PDF /
Code /
Datasets /
Project Page /
A challenging multimodal Olympic-competition benchmark for LLMs and LVMs.
Dissecting Human and LLM Preferences
Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu
ACL 2024
PDF /
Code /
Datasets /
Disentangling preferred and dispreferred features of LLM responses.
OpenAgents: An Open Platform for Language Agents in the Wild
Tianbao Xie*, Fan Zhou*, Zhoujun Cheng*, Peng Shi*, Luoxuan Weng*, Yitao Liu*, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, Tao Yu, (*=equal contribution)
COLM 2024
PDF /
Code /
Blog (7.5K Users) /
An open platform for using, hosting, and building language agents.
Lemur: Harmonizing Natural Language and Code for Language Agents
Yiheng Xu*, Hongjin Su*, Chen Xing*, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, Zhoujun Cheng, Siheng Zhao, Lingpeng Kong, Bailin Wang, Caiming Xiong, Tao Yu, (*=equal contribution)
ICLR 2024, Spotlight
PDF /
Code /
Models /
Blog /
A 70B agent model pre-trained on balanced code-text corpora, competitive with GPT-3.5.
From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning
Qian Liu*, Fan Zhou*, Zhengbao Jiang, Longxu Dou, Min Lin, (*=equal contribution)
Tech Report 2023
PDF /
Code /
Datasets &
Models /
A symbolic and synthetic method for improving LM instruction tuning.
Reflection of Thought: Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems
Fan Zhou*, Haoyu Dong*, Qian Liu, Zhoujun Cheng, Shi Han, Dongmei Zhang, (*=equal contribution)
NeurIPS 2022, 2nd MATH-AI Workshop
PDF
Inference-time calibration for LLM-based numerical reasoning.
TaCube: Pre-computing Data Cubes for Answering Numerical-Reasoning Questions over Tabular Data
Fan Zhou, Mengkang Hu, Haoyu Dong, Zhoujun Cheng, Fan Cheng, Shi Han, Dongmei Zhang
EMNLP 2022, Oral
PDF
Pre-computing aggregation/arithmetic results to assist numerical reasoning over tables.
Table Pre-training: A Survey on Model Architectures, Pretraining Objectives, and Downstream Tasks
Haoyu Dong, Zhoujun Cheng, Xinyi He, Mengyu Zhou, Anda Zhou, Fan Zhou, Ao Liu, Shi Han, Dongmei Zhang
IJCAI 2022 (survey track)
PDF
A survey of tabular models, with a focus on pre-trained transformers.
Exploring Image Regions Not Well Encoded by an INN
Zenan Ling, Fan Zhou, Meng Wei, Quanshi Zhang
AISTATS 2022
PDF
An analysis of generation flaws in normalizing flows.
Quantification and Analysis of Layer-wise and Pixel-wise Information Discarding
Haotian Ma, Hao Zhang, Fan Zhou, Quanshi Zhang
ICML 2022
PDF /
Code
A quantitative analysis of information discarding in CNNs.
Projects

OpenAgents (2023)
Host your own ChatGPT Plus locally!
- Data Agent: code interpreter augmented with data tools
- Plugins Agent: 200+ plugins for daily life
- Web Agent: autonomous web browsing
Experience
2021.09 - 2024.03, M.S.@SJTU, Computer Science.
2017.09 - 2021.06, B.S.@SJTU, IEEE honor class, Computer Science.
Service and Awards
- Reviewer: COLING 2024–2025, ICLR 2025, Instruction Workshop @ NeurIPS 2023, MATH-AI Workshop @ NeurIPS 2024
- Teaching Assistant: Introduction to Programming (2021), Large Language Models (CS2916, 2024)
- MSRA Stars of Tomorrow (Award of Excellent Intern), 2022
- Outstanding Graduates of SJTU, 2021
- Shanghai City Scholarship (≈ top 5%), 2018