Kevin Qinghong Lin

Ph.D. Student

Show Lab
National University of Singapore

Email: kevin.qh.lin [at]

Photo taken on Rottnest Island.


Hi, I am a Ph.D. student in Show Lab @ NUS, working with Prof. Mike Shou.

My research focuses on Video Understanding and Language Models, aiming to develop assistants that streamline human tasks.



VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Kevin QH. Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Z. Shou.

Preprint, 2024
[project] [paper] [code]
Can an agent recreate PowerPoint animation effects from instructional videos?

CosMo: Contrastive Streamlined Multimodal Model With Interleaved Pre-Training
Alex JP. Wang, Linjie Li, Kevin QH. Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Z. Shou.

Preprint, 2023
[project] [paper] [code] [dataset] [MoE]
Interleaved vision-text pre-training with contrastive and generative modeling, scaled up via MoE.

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao, Lei Ji, Luowei Zhou, Kevin QH. Lin, Joya Chen, Zihan Fan, Mike Z. Shou.

Preprint, 2023
[project] [paper]
A video agent for general video understanding.


VideoLLM-online: Towards Large Video-Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin QH. Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Z. Shou.

CVPR, 2024
[project] [paper] [code]
The first streaming video-language model, achieving 10 FPS online processing of long videos.

Bootstrapping SparseFormers from Vision Foundation Models
Ziteng Gao, Zhan Tong, Kevin QH. Lin, Joya Chen, Mike Z. Shou.

CVPR, 2024
[paper] [code]
An efficient pathway to transform vision foundation models into SparseFormers.

VisorGPT: Learning Visual Prior via Generative Pre-Training
Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin QH. Lin, Yefeng Zheng, Linlin Shen, Mike Z. Shou.

NeurIPS, 2023
[project] [paper] [code]
Modeling visual priors via language modeling.

UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin QH. Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex JP. Wang, Rui Yan, Mike Z. Shou.

ICCV, 2023
[demo] [paper] [code]
The first video temporal grounding pretraining model, unifying diverse temporal labels.

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick, Yale Song, Sayan Nag, Kevin QH. Lin, Hardik Shah, Mike Z. Shou, Rama Chellappa, Pengchuan Zhang

ICCV, 2023
[project] [paper] [code]
The new generation of egocentric video-language pre-training.

Too Large; Data Reduction for Vision-Language Pre-Training
Alex JP. Wang, Kevin QH. Lin, David JH. Zhang, Stan WX. Lei, Mike Z. Shou.

ICCV, 2023
[paper] [code]
Compresses a large-scale vision-text dataset into a small, high-quality set.

Affordance Grounding from Demonstration Video to Target Image
Joya Chen, Difei Gao, Kevin QH. Lin, Mike Z. Shou.

CVPR, 2023
[paper] [code]
Learning where to interact (affordance) from demonstration videos.

All in One: Exploring Unified Video-Language Pre-training
Alex JP. Wang, Yixiao Ge, Rui Yan, Yuying Ge, Kevin QH. Lin, Satoshi Tsutsui, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Z. Shou.

CVPR, 2023
[paper] [code]
The first unified video-language pretrained model.

Egocentric Video-Language Pretraining
Kevin QH. Lin, Alex JP. Wang, M. Soldan, M. Wray, R. Yan, Eric ZC. Xu, D. Gao, R. Tu, W. Zhao, W. Kong, C. Cai, H. Wang, D. Damen, B. Ghanem, W. Liu, Mike Z. Shou.

NeurIPS, 2022. Spotlight (1.7%)
[project] [paper] [code] [poster]
The first egocentric vision-language pretrained model.
EgoVis 2022/2023 Distinguished Paper Award & PREMIA Best Student Paper Award 2023.
Double champions in Ego4D & Epic-Kitchens CVPR 2022 challenges.


VLog: Video as a Long Document

[demo] [code] [twitter]
Given a long video, we convert it into a document containing visual and audio information. By sending this document to ChatGPT, we can chat over the video!




© Kevin