TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Ling You*, Wenxuan Huang*, Xinni Xie, Xiangyi Wei, Bangyan Li, Yang Li, Shaohui Lin, Changbo Wang
School of Computer Science and Technology, East China Normal University, Shanghai, China
arXiv 2025

*Indicates Equal Contribution

Overview. Traditional methods (left) show limited accuracy in temporal alignment and caption quality, while TimeSoccer (right) produces context-aware descriptions with better alignment to ground-truth events.

Abstract

Soccer is a globally popular sport, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, yet soccer commentary generation requires both precise temporal localization and semantically rich descriptions over long-form video. Existing soccer MLLMs often rely on temporal priors for caption generation and therefore cannot process soccer videos end-to-end, while traditional two-step approaches are complex, fail to capture global context, and yield suboptimal performance. To address these issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long-video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and we incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that TimeSoccer achieves state-of-the-art (SoTA) performance on the SDVC task in an end-to-end fashion, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
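MoFA-Select is only described at a high level here; the minimal sketch below is meant purely to illustrate the coarse-to-fine, motion-aware selection idea. It assumes frame-difference motion scores and a fixed frame budget; the function name, parameters, and scoring choice are illustrative and not the released implementation.

import numpy as np

def mofa_select_sketch(frames: np.ndarray, budget: int, coarse_stride: int = 4) -> np.ndarray:
    """Illustrative coarse-to-fine, motion-aware frame selection.

    frames: (T, H, W, C) array for a long clip.
    budget: number of frames to keep after compression.
    Returns sorted indices of the selected frames.
    """
    # Coarse stage: uniform subsampling to shrink the candidate pool.
    coarse_idx = np.arange(0, len(frames), coarse_stride)

    # Motion score: mean absolute difference to the previous coarse frame.
    gray = frames[coarse_idx].astype(np.float32).mean(axis=-1)
    motion = np.abs(np.diff(gray, axis=0)).mean(axis=(1, 2))
    motion = np.concatenate([[motion.mean()], motion])  # pad score for the first frame

    # Fine stage: keep the highest-motion frames within the budget.
    keep = np.argsort(motion)[-budget:]
    return np.sort(coarse_idx[keep])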

Video Demo

Method

Overview of TimeSoccer. Given a full 45-minute soccer video, frame features are extracted by an Image Encoder and an Image Q-Former, while timestamps are taken from the original frame sequence. Both the features and the timestamps are then processed by the MoFA-Select module. The compressed features are passed through a sliding Video Q-Former to produce video tokens, which are concatenated with timestamp-based text tokens and the user query tokens before being fed into the LLM for the final prediction.
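As a rough illustration of this end-to-end pipeline, a minimal forward-pass sketch is given below. All module arguments are placeholders for the components named in the figure (Image Encoder, Image Q-Former, MoFA-Select, Video Q-Former, LLM); their interfaces are assumptions for illustration, not the actual TimeSoccer code.

import torch

def timesoccer_forward_sketch(frames, timestamps, query_ids,
                              image_encoder, image_qformer, mofa_select,
                              video_qformer, timestamp_tokenizer, llm,
                              window: int = 32, stride: int = 32):
    """Illustrative single-pass pipeline following the overview figure."""
    # 1. Per-frame visual features from the Image Encoder and Image Q-Former.
    feats = image_qformer(image_encoder(frames))            # (T, Q, D)

    # 2. Motion-aware compression of the features and their timestamps.
    feats, timestamps = mofa_select(feats, timestamps)      # (T', Q, D)

    # 3. Sliding Video Q-Former over the compressed features -> video tokens.
    video_tokens = []
    for s in range(0, feats.size(0), stride):
        video_tokens.append(video_qformer(feats[s:s + window]))
    video_tokens = torch.cat(video_tokens, dim=0)

    # 4. Concatenate video tokens, timestamp text tokens, and the user query.
    ts_tokens = timestamp_tokenizer(timestamps)
    inputs = torch.cat([video_tokens, ts_tokens, llm.embed(query_ids)], dim=0)

    # 5. The LLM jointly predicts timestamps and captions in a single pass.
    return llm.generate(inputs_embeds=inputs.unsqueeze(0))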

Results

Quantitative comparison of temporal localization and caption quality across different methods on the SoccerNet-Caption dataset. The best results are bolded and the second-best results are underlined in this and all following tables. M-Score and C-Score denote Qwen2.5-VL-72B-Instruct evaluations of match consistency and overall commentary quality, respectively. “SN-Caption+X” indicates that model X generates commentary based on timestamps predicted by SN-Caption, while “3-minute via 15× inference” refers to generating the final match-level commentary by performing 15 separate inferences on consecutive 3-minute segments.
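For clarity, the “3-minute via 15× inference” baseline protocol can be sketched as follows; video.clip and model.predict are hypothetical interfaces used only to make the segment-wise procedure concrete.

def segmented_inference_sketch(video, model, segment_minutes=3, total_minutes=45):
    """Run the model independently on consecutive 3-minute segments and merge
    the predictions, shifting each segment's timestamps back to the match timeline."""
    events = []
    for i in range(total_minutes // segment_minutes):        # 15 segments for a 45-minute half
        start = i * segment_minutes * 60                      # segment start in seconds
        end = start + segment_minutes * 60
        segment = video.clip(start, end)                      # hypothetical clipping interface
        for t, caption in model.predict(segment):             # per-segment (time, text) predictions
            events.append((start + t, caption))               # shift to global match time
    return events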

Comparison of MoFA-Select with standard SFT and baseline compression methods on the SoccerNet-Caption dataset in the 45-minute video setting.

Ablation study on training paradigms and positional encoding extensions for full-match video understanding on the SoccerNet-Caption dataset.

Qualitative Comparison

Qualitative comparison results across different methods. TimeSoccer demonstrates its advantages from multiple perspectives: (i) more accurate timestamp alignment; (ii) improved event descriptions; (iii) richer, more realistic commentary resembling professional broadcasts.

Application

BibTeX

@article{you2025timesoccer,
  title={TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation},
  author={You, Ling and Huang, Wenxuan and Xie, Xinni and Wei, Xiangyi and Li, Bangyan and Lin, Shaohui and Li, Yang and Wang, Changbo},
  journal={arXiv preprint arXiv:2504.17365},
  year={2025}
}