Publications

OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

Video Temporal Grounding (VTG), the task of localizing video segments from natural language queries, faces significant challenges in …

Minghang Zheng, Zihao Yin, Yi Yang, Puxin Peng, Yang Liu

March 2026 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR

Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a …

Minghang Zheng, Puxin Peng, Benyuan Sun, Yi Yang, Yang Liu

July 2025 Proceedings of the IEEE/CVF International Conference on Computer Vision ICCV

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made …

Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

September 2024 Proceedings of the 31st ACM International Conference on Multimedia ACM MM

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language …

Minghang Zheng, Xinhao Cai, Qingchao Chen, Yuxin Peng, Yang Liu

August 2024 Proceedings of the European Conference on Computer Vision ECCV

Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization

Video sentence localization aims to locate moments in an unstructured video according to a given natural language query. A main …

Minghang Zheng, Shaogang Gong, Hailin Jin, Yuxin Peng, Yang Liu

July 2023 Association for Computational Linguistics ACL

Phrase-Level Temporal Relationship Mining for Temporal Sentence Localization

In this paper, we address the problem of video temporal sentence localization, which aims to localize a target moment from videos …

Minghang Zheng, Sizhe Li, Qingchao Chen, Yuxin Peng, Yang Liu

February 2023 The Thirty-Seventh AAAI Conference on Artificial Intelligence AAAI oral

Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning

Temporal sentence grounding aims to detect the most salient moment corresponding to the natural language query from untrimmed videos. …

Minghang Zheng, Yanjie Huang, Qingchao Chen, Yuxin Peng, Yang Liu

Peking University

March 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR

Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining

Video moment localization aims at localizing the video segments which are most related to the given free-form natural language query. …

Minghang Zheng, Yanjie Huang, Qingchao Chen, Yang Liu

Peking University

February 2022 The AAAI Conference on Artificial Intelligence AAAI

End-to-End Object Detection with Adaptive Clustering Transformer

End-to-end Object Detection with Transformer (DETR) performs object detection with Transformer and achieves comparable performance with …

Minghang Zheng, Peng Gao, Renrui Zhang, Kunchang Li, Hongsheng Li, Hao Dong

November 2021 British Machine Vision Conference BMVC oral

Fast Convergence of DETR with Spatially Modulated Co-Attention

The recently proposed Detection Transformer (DETR) model successfully applies Transformer to object detection and achieves comparable …

Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, Hongsheng Li

October 2021 International Conference on Computer Vision ICCV