DOGR

Towards Versatile Visual Document Grounding and Referring

Yinan Zhou* 1,2,3, Yuxin Chen*†2, Haokun Lin2,3,4, Yichen Wu3,5,
Shuyu Yang1, Li Zhu1, Zhongang Qi‡2, Chen Ma‡3, Ying Shan2
*Equal Contribution †Project Lead ‡Corresponding Authors
1Xi’an Jiaotong University, 2ARC Lab, Tencent PCG, 3City University of Hong Kong,
4Institute of Automation, CAS
5Harvard University

Demo Video

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks.

To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data for enhancing fundamental text localization and recognition capabilities, and instruction-tuning data to activate MLLMs' grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGR-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing comprehensive evaluation of fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGR. This pioneering MLLM is capable of accurately referring and grounding texts at multiple granularities within document images. Our code, data, and model will be open-sourced for community development.

DOGR-Engine

Overview

Well-annotated and diverse grounded data are crucial for improving the grounding and referring capabilities of MLLMs. Currently, there is a shortage of comprehensive and accurately labeled grounded document data. Manually annotating raw document images is both time-consuming and labor-intensive, as it requires not only marking the bounding boxes but also accurately transcribing all the text within those boxes. To tackle this challenge, we collect a substantial volume of documents and develop the DOGR-Engine to construct fine-grained grounded document datasets. Our data sources cover three document types: posters, charts, and PDF documents. As shown below, we start by filtering the raw data to remove low-quality samples and those with missing or broken information. An overview of the training data statistics and the two data-construction strategies is also provided below.
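For illustration only, the sketch below shows what such a quality-filtering step could look like; it is not the released DOGR-Engine code, and the thresholds and field names (e.g. `ocr_text`, `min_side`) are assumptions.

```python
# Minimal sketch (not the released DOGR-Engine code): filtering raw document
# images before grounded-data generation. Thresholds and fields are hypothetical.
from dataclasses import dataclass


@dataclass
class RawDocument:
    image_path: str
    width: int
    height: int
    ocr_text: str   # concatenated OCR output (hypothetical field)
    doc_type: str   # "poster", "chart", or "pdf"


def keep_document(doc: RawDocument,
                  min_side: int = 448,
                  min_chars: int = 20) -> bool:
    """Drop low-quality samples: tiny images or documents with missing/broken text."""
    if min(doc.width, doc.height) < min_side:
        return False
    if len(doc.ocr_text.strip()) < min_chars:
        return False
    return True


def filter_corpus(docs: list[RawDocument]) -> list[RawDocument]:
    """Keep only documents that pass the basic quality checks."""
    return [d for d in docs if keep_document(d)]
```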

You can download DOGR-Bench from the DOGR-Bench release page.
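As a rough illustration of how one might load a benchmark split once downloaded, the snippet below assumes a JSON-Lines layout with hypothetical field names (`image`, `task`, `question`, `answer`, `bbox`); the actual released file format may differ, so check the released data first.

```python
# Illustrative only: loading a (hypothetical) JSON-Lines split of DOGR-Bench.
import json
from pathlib import Path


def load_samples(jsonl_path: str) -> list[dict]:
    """Read one benchmark split stored as JSON Lines (assumed format)."""
    samples = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            samples.append(json.loads(line))
    return samples


# Example of what a grounding-task sample might contain (hypothetical schema):
# {
#   "image": "poster_0001.png",
#   "task": "text_grounding",
#   "question": "Where is the phrase 'Opening Night' located?",
#   "answer": "Opening Night",
#   "bbox": [120, 56, 388, 92]   # x1, y1, x2, y2 in pixels
# }
```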

DOGR-Bench & DOGR


7-Task Definition of DOGR-Bench.


Model architecture of DOGR.

Experiment Results


Visualization Examples

BibTeX

@misc{zhou2025dogrversatilevisualdocument,
      title={DOGR: Towards Versatile Visual Document Grounding and Referring}, 
      author={Yinan Zhou and Yuxin Chen and Haokun Lin and Yichen Wu and Shuyu Yang and Zhongang Qi and Chen Ma and Li Zhu and Ying Shan},
      year={2025},
      eprint={2411.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17125}, 
}