About me

My name is Yue Zhao* (赵岳 in simplified Chinese). I am currently a final-year PhD student at the University of Texas at Austin, supervised by Prof. Philipp Krähenbühl. I am a recipient of the 2024-2025 NVIDIA Graduate Fellowship. I obtained my MPhil degree from the Multimedia Laboratory at the Chinese University of Hong Kong, supervised by Prof. Dahua Lin. Before that, I received my Bachelor's degrees from Tsinghua University. My research interests are in computer vision, with an emphasis on video analysis and understanding. I am currently interested in (1) video compression in pursuit of visual intelligence, (2) understanding long-form streaming videos, and (3) learning executable actions from videos.

News

[Jan 22, 2025] BSQ and ISM are accepted to ICLR 2025!
[Jun 11, 2024] One tech report on a generic visual tokenizer is on arXiv.
[Jun 07, 2024] LaViLa (CVPR 2023) wins an Egocentric Vision (EgoVis) 2022/2023 Distinguished Paper Award!
[May 01, 2024] 🏳️‍🌈⃤ VideoPrism accepted to ICML!
[Apr 20, 2024] Our positive-congruent training paper accepted by TPAMI (finally)!
[Feb 26, 2024] One paper accepted to CVPR 2024. See you in Seattle this summer!
[Feb 20, 2024] One tech report on a foundational video encoder is on arXiv. It is fueled by VIIT's captions.
[Jan 11, 2024] One tech report on video instruction tuning (VIIT) is available on arXiv.
[Dec 08, 2023] I was awarded the 2024-2025 NVIDIA Graduate Fellowship. Thank you NVIDIA!
[Jun 19, 2023] We won EPIC-Kitchens 2023 Action Recognition and Multi-Instance Retrieval Challenges! I gave a talk on the winning solution at the workshop.
[Feb 28, 2023] One paper accepted to CVPR 2023 as Highlight. See you in Vancouver this summer!
[Aug 07, 2022] One paper accepted to ECCV 2022.
[May 16, 2022] One tech report on positive-congruent training is available on arXiv.
[Mar 28, 2022] One paper accepted to CVPR 2022 as Oral.
[Aug 20, 2021] Had a wonderful summer at AWS in Seattle.
[Mar 09, 2020] Two papers accepted to CVPR 2020 (1 oral + 1 poster).
[Aug 02, 2019] The extended version of our ICCV 2017 work has been accepted by IJCV.
[Jun 18, 2019] We launch MMAction, a versatile toolbox for action understanding based on PyTorch. v0.1.0 is now online!

Selected Preprints

QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang
arXiv:2502.05178 [cs.CV]
[pdf][project page][code]

Selected Publications

One-Minute Video Generation with Test-Time Training

Karan Dalal*, Daniel Koceja*, Gashon Hussein*, Jiarui Xu*, Yue Zhao†, Youjin Song†, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
[pdf][project page][code]

Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
International Conference on Learning Representations (ICLR), 2025
[pdf][code] [poster]

VideoPrism: A Foundational Visual Encoder for Video Understanding

Long Zhao*, Nitesh B. Gundavarapu*, Liangzhe Yuan*, Hao Zhou*, ..., Yue Zhao, ..., Mikhail Sirotenko+, Ting Liu+, Boqing Gong+
International Conference on Machine Learning (ICML), 2024
[arXiv] [Blog]

ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training

Yue Zhao, Yantao Shen, Yuanjun Xiong, Shuo Yang, Wei Xia, Zhuowen Tu, Bernt Schiele, Stefano Soatto
IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2024
[arXiv][code]

Distilling Vision-Language Models on Millions of Videos

Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[arXiv] [project page]

Real-time Online Video Detection with Temporal Smoothing Transformers

Yue Zhao, Philipp Krähenbühl
European Conference on Computer Vision (ECCV), 2022
[arXiv] [code] [poster]

Revisiting Skeleton-based Action Recognition

Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, Bo Dai
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral, top-4.2%)
[arXiv] [code]

FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding

Dian Shao, Yue Zhao, Bo Dai, Dahua Lin
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020 (Oral, top-5.0%)
[arXiv][project page]

Temporal Action Detection with Structured Segment Networks

Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin
International Conference on Computer Vision (ICCV), 2017
[pdf] [arXiv] [IJCV version] [code] [project page]

Education Experience

The University of Texas at Austin, TX, USA

August 2020 - present
Ph.D. in Computer Science.

The Chinese University of Hong Kong, HK SAR, China

August 2017 - July 2020
M.Phil. in Information Engineering.

Department of Electronic Engineering, Tsinghua University, Beijing, China

August 2012 - July 2016
Bachelor of Engineering, magna cum laude.

School of Economics and Management, Tsinghua University, Beijing, China

August 2013 - July 2016
Bachelor of Science (Second Degree) in Economics.

Department of Information Technology and Electrical Engineering (D-ITET), Swiss Federal Institute of Technology (ETH), Zürich, Switzerland

September 2014 - February 2015
Mobility student fully funded by China Scholarship Council (CSC).

Professional Experience

NVIDIA Research, Santa Clara, CA, USA

May 2024 - August 2024
Research Scientist Intern

Google Research, Venice, CA, USA

May 2023 - August 2023
Student Researcher

FAIR Labs, New York, NY, USA

May 2022 - August 2022
Research Scientist Intern

Amazon Web Services, Seattle, WA, USA

June 2021 - August 2021
Applied Scientist Intern

Multimedia Laboratory, The Chinese University of Hong Kong, HK SAR, China

September 2016 - August 2017, July 2015 - September 2015
Junior Research Assistant

Other projects

People tracking using RGB-D videos.

An undergraduate-level class project on people detection and tracking from RGB-D data collected by Kinect. [demo]

Miscellaneous

* For non-Chinese speakers, the pronunciation of Zhao Yue (family name first is preferred) is close to ['dʒau 'ju:-eh].