|
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Kevin QH. Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Z. Shou.
NeurIPS D&B, 2024. Spotlight
[project]
[paper]
[code (tbd)]
Can an agent recreate PowerPoint animation effects from instructional videos?
|
|
Learning Video Context as Interleaved Multimodal Sequences
Kevin QH. Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Z. Shou.
ECCV, 2024
[paper]
[code]
Video in-context learning using interleaved sequences of images, videos, plots and dialogues.
|
|
VideoLLM-online: Towards Large Video-Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin QH. Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Z. Shou.
CVPR, 2024
[project]
[paper]
[code]
The first streaming video-language model, achieving online processing of long videos at 10 FPS.
|
|
Bootstrapping SparseFormers from Vision Foundation Models
Ziteng Gao, Zhan Tong, Kevin QH. Lin, Joya Chen, Mike Z. Shou.
CVPR, 2024
[paper]
[code]
An efficient pathway to transform vision foundation models into SparseFormers.
|
|
VisorGPT: Learning Visual Prior via Generative Pre-Training
Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin QH. Lin, Yefeng Zheng, Linlin Shen, Mike Z. Shou.
NeurIPS, 2023
[project]
[paper]
[code]
Modeling the visual prior via language modeling.
|
|
UniVTG: Towards Unified Video-Language Temporal Grounding
Kevin QH. Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex JP. Wang, Rui Yan, Mike Z. Shou.
ICCV, 2023
[demo]
[paper]
[code]
The first video temporal grounding pretraining model, unifying diverse temporal labels.
|
|
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick, Yale Song, Sayan Nag, Kevin QH. Lin, Hardik Shah, Mike Z. Shou, Rama Chellappa, Pengchuan Zhang
ICCV, 2023
[project]
[paper]
[code]
The second generation of egocentric video-language pre-training, with fusion in the backbone.
|
|
Too Large; Data Reduction for Vision-Language Pre-Training
Alex JP. Wang, Kevin QH. Lin, David JH. Zhang, Stan WX. Lei, Mike Z. Shou.
ICCV, 2023
[paper]
[code]
Compressing a large-scale vision-text dataset into a small, high-quality set.
|
|
Affordance Grounding from Demonstration Video to Target Image
Joya Chen, Difei Gao, Kevin QH. Lin, Mike Z. Shou.
CVPR, 2023
[paper]
[code]
Learning where to interact (affordance) from demonstration videos.
|
|
All in One: Exploring Unified Video-Language Pre-training
Alex JP. Wang, Yixiao Ge, Rui Yan, Yuying Ge, Kevin QH. Lin, Satoshi Tsutsui, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Z. Shou.
CVPR, 2023
[paper]
[code]
The first unified video-language pretrained model.
|
|
Egocentric Video-Language Pretraining
Kevin QH. Lin, Alex JP. Wang, M. Soldan, M. Wray, R. Yan, Eric ZC. Xu, D. Gao, R. Tu,
W. Zhao, W. Kong, C. Cai, H. Wang, D. Damen, B. Ghanem, W. Liu, Mike Z. Shou.
NeurIPS, 2022. Spotlight (1.7%)
[project]
[paper]
[code]
[poster]
The first egocentric video-language pretrained model.
EgoVis 2022/2023 Distinguished Paper Award & PREMIA Best Student Paper Award 2023.
Double champion of the Ego4D & EPIC-Kitchens challenges at CVPR 2022.
|