 
						About me
						My name is Yue Zhao* (赵岳 in simplified Chinese). I am a Postdoctoral Scholar at Stanford University, advised by Prof. Ehsan Adeli. I am affiliated with Stanford Translational AI (STAI) in Medicine and Mental Health and the Stanford Vision and Learning Lab. I recently received my PhD from the University of Texas at Austin, supervised by Prof. Philipp Krähenbühl. I obtained my MPhil degree from the Multimedia Laboratory at the Chinese University of Hong Kong, supervised by Prof. Dahua Lin. Before that, I received my Bachelor's degrees from Tsinghua University. My research interests are in computer vision, with an emphasis on video analysis and understanding. I am currently interested in (1) video compression in pursuit of visual intelligence, (2) understanding long-form streaming videos, and (3) learning executable actions from videos. I am a recipient of the 2024-2025 NVIDIA Graduate Fellowship.
News
						 
						[Oct 21, 2025] RayZer won ICCV 2025 Best Student Paper Honorable Mention! 
						[Oct 16, 2025] I was selected as a NeurIPS 2025 Top Reviewer! 
						[Jan 22, 2025] BSQ and ISM are accepted to ICLR 2025! 
						[Jun 11, 2024] One tech report on a generic visual tokenizer is on arXiv.
						[Jun 07, 2024] LaViLa (CVPR 2023) wins an Egocentric Vision (EgoVis) 2022/2023 Distinguished Paper Award!
						[May 01, 2024] VideoPrism accepted to ICML 2024!
						[Apr 20, 2024] Our positive-congruent training paper accepted by TPAMI (finally)! 
						[Feb 26, 2024] One paper accepted to CVPR 2024. See you in Seattle this summer!
						[Feb 20, 2024] One tech report on a foundational video encoder is on arXiv. It is fueled by VIIT's captions.
						[Jan 11, 2024] One tech report on video instruction tuning (VIIT) is available on arXiv.
						[Dec 08, 2023] I was awarded the 2024-2025 NVIDIA Graduate Fellowship. Thank you NVIDIA!
						[Jun 19, 2023] We won the EPIC-Kitchens 2023 Action Recognition and Multi-Instance Retrieval Challenges! I gave a talk on the winning solution at the workshop.
						[Feb 28, 2023] One paper accepted to CVPR 2023 as Highlight. See you in Vancouver this summer!
						[Aug 07, 2022] One paper accepted to ECCV 2022.
						[May 16, 2022] One tech report on positive-congruent training is available on arXiv.
						[Mar 28, 2022] One paper accepted to CVPR 2022 as Oral.
						[Aug 20, 2021] Had a wonderful summer at AWS in Seattle.
						[Mar 09, 2020] Two papers accepted to CVPR 2020 (1 oral + 1 poster).
						[Aug 02, 2019] The extended version of our ICCV 2017 work has been accepted by IJCV.
						[Jun 18, 2019] We launch MMAction, a versatile toolbox for action understanding based on PyTorch. v0.1.0 is now online! 
Selected Preprints
											
					 
						QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
							
								Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang
								arXiv:2502.05178 [cs.CV]
							[pdf][project page][code]
							
 
						Training a Large Video Model on a Single Machine in a Day
							
								Yue Zhao, Philipp Krähenbühl
								arXiv:2309.16669 [cs.CV]
							Winner of the EPIC-Kitchens 2023 Action Recognition and Multi-Instance Retrieval Challenges at CVPR 2023.
							Invited talk at 11th EPIC Workshop at CVPR 2023
							[pdf][code][tech report]
							
Selected Publications
						 
							One-Minute Video Generation with Test-Time Training
								
									Karan Dalal*, Daniel Koceja*, Gashon Hussein*, Jiarui Xu*, Yue Zhao†, Youjin Song†, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang
									IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
									Invited keynote talk at the Workshop on Test-time Scaling for Computer Vision (ViSCALE) at CVPR 2025
								[pdf][project page][code]
								[poster]
								
 
							Image and Video Tokenization with Binary Spherical Quantization
								
									Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
									International Conference on Learning Representations (ICLR), 2025
								[pdf][code]
								[poster]
								
 
							VideoPrism: A Foundational Visual Encoder for Video Understanding
								
									Long Zhao*, Nitesh B. Gundavarapu*, Liangzhe Yuan*, Hao Zhou*, ..., Yue Zhao, ..., Mikhail Sirotenko+, Ting Liu+, Boqing Gong+
									International Conference on Machine Learning (ICML), 2024
								[arXiv]
								[Blog]
								
 
							ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training
								
									Yue Zhao, Yantao Shen, Yuanjun Xiong, Shuo Yang, Wei Xia, Zhuowen Tu, Bernt Schiele, Stefano Soatto
									IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2024
								[arXiv][code]
								
 
							Distilling Vision-Language Models on Millions of Videos
								
									Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
									IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
								[arXiv]
								[project page]
								
 
							Learning Video Representations from Large Language Models
								
									Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
									IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023 
									(Highlight, top-2.5%)
									Invited talk at Joint International 3rd Ego4D and 11th EPIC Workshop
									Egocentric Vision (EgoVis) 2022/2023 Distinguished Paper Award
								[arXiv]
								[code]
								[project page]
								[demo]
								[colab]
								[video (~8min)]
								[poster]
								
 
							Real-time Online Video Detection with Temporal Smoothing Transformers
								
									Yue Zhao, Philipp Krähenbühl
									European Conference on Computer Vision (ECCV), 2022 
								[arXiv]
								[code]
								[poster]
								
 
							Revisiting Skeleton-based Action Recognition
								
									Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, Bo Dai
									IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral, top-4.2%)
								[arXiv]
								[code]
								
 
							FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding
								
									Dian Shao, Yue Zhao, Bo Dai, Dahua Lin
									IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020 (Oral, top-5.0%).
									[arXiv][project page]
								
 
							Temporal Action Detection with Structured Segment Networks
									Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin
									International Conference on Computer Vision (ICCV), 2017.
									[pdf]
									[arXiv]
									[IJCV version]
									[code]
									[project page]
								
Education Experience
					
					 
						The Chinese University of Hong Kong, HK SAR, China
August 2017 - July 2020 
							M.Phil. in Information Engineering.
 
						Israel Institute of Technology (Technion), Haifa, Israel
July 2016 - August 2016
							Visiting student in the Summer School of Engineering and Science, fully funded by Technion. 
 
						Department of Electronic Engineering, Tsinghua University, Beijing, China
August 2012 - July 2016
							Bachelor of Engineering, magna cum laude.
 
						School of Economics and Management, Tsinghua University, Beijing, China
August 2013 - July 2016
							Bachelor of Science (Second Degree) in Economics.
 
						Department of Information Technology and Electrical Engineering (D-ITET), Swiss Federal Institute of Technology (ETH), Zürich, Switzerland
September 2014 - February 2015
							Mobility student fully funded by China Scholarship Council (CSC).
Professional Experience
					
				Other projects
						 
							People tracking using RGB-D videos.
An undergraduate-level class project on people detection and tracking from RGB-D data collected by Kinect. [demo]





