TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Ling You*, Wenxuan Huang*, Xinni Xie, Xiangyi Wei, Bangyan Li, Yang Li, Shaohui Lin, Changbo Wang
School of Computer Science and Technology, East China Normal University, Shanghai, China
arXiv 2025

*Indicates Equal Contribution

Overview. Traditional methods (left) show limited accuracy in temporal alignment and caption quality, while TimeSoccer (right) produces context-aware descriptions with better alignment to ground-truth events.

Abstract

Soccer is a globally popular sport, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, yet soccer commentary generation requires both precise temporal localization and semantically rich descriptions over long-form video. Existing soccer MLLMs often rely on temporal priors for caption generation and therefore cannot process soccer videos end-to-end, while traditional two-step approaches are complex, fail to capture global context, and yield suboptimal performance. To address these issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long-video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and we incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that TimeSoccer achieves state-of-the-art (SoTA) performance on the SDVC task in an end-to-end fashion, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
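MoFA-Select is only described at a high level here; the minimal sketch below is meant purely to illustrate the coarse-to-fine, motion-aware selection idea. It assumes frame-difference motion scores and a fixed frame budget; the function name, parameters, and scoring choice are illustrative and not the released implementation.

import numpy as np

def mofa_select_sketch(frames: np.ndarray, budget: int, coarse_stride: int = 4) -> np.ndarray:
    """Illustrative coarse-to-fine, motion-aware frame selection.

    frames: (T, H, W, C) array for a long clip.
    budget: number of frames to keep after compression.
    Returns sorted indices of the selected frames.
    """
    # Coarse stage: uniform subsampling to shrink the candidate pool.
    coarse_idx = np.arange(0, len(frames), coarse_stride)

    # Motion score: mean absolute difference to the previous coarse frame.
    gray = frames[coarse_idx].astype(np.float32).mean(axis=-1)
    motion = np.abs(np.diff(gray, axis=0)).mean(axis=(1, 2))
    motion = np.concatenate([[motion.mean()], motion])  # pad score for the first frame

    # Fine stage: keep the highest-motion frames within the budget.
    keep = np.argsort(motion)[-budget:]
    return np.sort(coarse_idx[keep])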

Video Demo

Method

Overview of TimeSoccer. Given a full 45-minute soccer video, frame features are extracted by an Image Encoder and an Image Q-Former, while timestamps are taken from the original frame sequence. Both the features and the timestamps are then processed by the MoFA-Select module. The compressed features are passed through a sliding Video Q-Former to produce video tokens, which are concatenated with timestamp-based text tokens and the user query tokens before being fed into the LLM for the final prediction.
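As a rough illustration of this end-to-end pipeline, a minimal forward-pass sketch is given below. All module arguments are placeholders for the components named in the figure (Image Encoder, Image Q-Former, MoFA-Select, Video Q-Former, LLM); their interfaces are assumptions for illustration, not the actual TimeSoccer code.

import torch

def timesoccer_forward_sketch(frames, timestamps, query_ids,
                              image_encoder, image_qformer, mofa_select,
                              video_qformer, timestamp_tokenizer, llm,
                              window: int = 32, stride: int = 32):
    """Illustrative single-pass pipeline following the overview figure."""
    # 1. Per-frame visual features from the Image Encoder and Image Q-Former.
    feats = image_qformer(image_encoder(frames))            # (T, Q, D)

    # 2. Motion-aware compression of the features and their timestamps.
    feats, timestamps = mofa_select(feats, timestamps)      # (T', Q, D)

    # 3. Sliding Video Q-Former over the compressed features -> video tokens.
    video_tokens = []
    for s in range(0, feats.size(0), stride):
        video_tokens.append(video_qformer(feats[s:s + window]))
    video_tokens = torch.cat(video_tokens, dim=0)

    # 4. Concatenate video tokens, timestamp text tokens, and the user query.
    ts_tokens = timestamp_tokenizer(timestamps)
    inputs = torch.cat([video_tokens, ts_tokens, llm.embed(query_ids)], dim=0)

    # 5. The LLM jointly predicts timestamps and captions in a single pass.
    return llm.generate(inputs_embeds=inputs.unsqueeze(0))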

Results

Quantitative comparison of temporal localization and caption quality across different methods on the SoccerNet-Caption dataset. The best results are bolded and the second-best results are underlined in this and all following tables. M-Score and C-Score denote Qwen2.5-VL-72B-Instruct evaluations of match consistency and overall commentary quality, respectively. “SN-Caption+X” indicates that model X generates commentary based on timestamps predicted by SN-Caption, while “3-minute via 15× inference” refers to generating the final match-level commentary by performing 15 separate inferences on consecutive 3-minute segments.
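For clarity, the “3-minute via 15× inference” baseline protocol can be sketched as follows; video.clip and model.predict are hypothetical interfaces used only to make the segment-wise procedure concrete.

def segmented_inference_sketch(video, model, segment_minutes=3, total_minutes=45):
    """Run the model independently on consecutive 3-minute segments and merge
    the predictions, shifting each segment's timestamps back to the match timeline."""
    events = []
    for i in range(total_minutes // segment_minutes):        # 15 segments for a 45-minute half
        start = i * segment_minutes * 60                      # segment start in seconds
        end = start + segment_minutes * 60
        segment = video.clip(start, end)                      # hypothetical clipping interface
        for t, caption in model.predict(segment):             # per-segment (time, text) predictions
            events.append((start + t, caption))               # shift to global match time
    return events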

Comparison of MoFA-Select with standard SFT and baseline compression methods on the SoccerNet-Caption dataset in the 45-minute video setting.

Ablation study on training paradigms and positional encoding extensions for full-match video understanding on the SoccerNet-Caption dataset.

Qualitative Comparison

Qualitative comparison results across different methods. TimeSoccer demonstrates its advantages from multiple perspectives: (i) more accurate timestamp alignment; (ii) improved event descriptions; (iii) richer, more realistic commentary resembling professional broadcasts.

Application

BibTeX

@article{you2025timesoccer,
  title={TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation},
  author={You, Ling and Huang, Wenxuan and Xie, Xinni and Wei, Xiangyi and Li, Bangyan and Lin, Shaohui and Li, Yang and Wang, Changbo},
  journal={arXiv preprint arXiv:2504.17365},
  year={2025}
}