In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks.
To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data for enhancing fundamental text localization and recognition capabilities; and
instruction-tuning data to activate MLLMs' grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct
DOGR-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing
comprehensive evaluations for fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model,
DOGR.
This pioneering MLLM is capable of accurately referring to and grounding text at multiple granularities within document images. Our code, data, and model will be open-sourced for community development.
Well-annotated and diverse grounded data are crucial for improving the grounding and referring capabilities of MLLMs. Currently, there is a shortage of comprehensive and accurately labeled document grounded data.
Manually annotating raw document images is both time-consuming and labor-intensive, as it requires not only marking the bounding boxes but also accurately transcribing all the text within those boxes.
To tackle this challenge, we collect a substantial volume of documents and develop the
DOGR-Engine to construct fine-grained document grounded datasets. Our document data sources primarily encompass three document data types:
posters, charts, and PDF documents. As shown below, we start by filtering the raw data to remove low-quality samples and those with missing or broken information. An overview of the training data statistics and the two construction strategies is also provided below.
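As a minimal sketch of the kind of quality filtering described above, consider dropping samples with missing images or broken annotations. The field names (`image`, `ocr_boxes`, `text`) and criteria here are hypothetical illustrations, not taken from the DOGR-Engine implementation:

```python
# Hypothetical quality filter for raw document samples.
# Field names and criteria are assumptions for illustration only,
# not the actual DOGR-Engine schema.

def filter_samples(samples, min_boxes=1):
    """Keep only samples whose required fields are present and intact."""
    kept = []
    for s in samples:
        if not s.get("image"):  # drop samples with a missing image path
            continue
        boxes = s.get("ocr_boxes") or []
        if len(boxes) < min_boxes:  # drop samples with no usable annotations
            continue
        # drop samples containing boxes whose transcription is missing
        if any(not b.get("text") for b in boxes):
            continue
        kept.append(s)
    return kept

raw = [
    {"image": "a.png", "ocr_boxes": [{"text": "Total", "bbox": [10, 10, 80, 30]}]},
    {"image": "", "ocr_boxes": [{"text": "x", "bbox": [0, 0, 5, 5]}]},  # missing image
    {"image": "b.png", "ocr_boxes": []},  # no annotations
]
clean = filter_samples(raw)  # only the first sample survives
```

In practice a real engine would add further checks (image resolution, OCR confidence, layout validity); this sketch only shows the missing/broken-information criterion.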
Pipeline of DOGR-Engine.
Training data composition of DOGR-Dataset.
Re-rendering Strategy for Posters and Charts.
Merge Strategy for PDF Documents.
You can download the benchmark at: DOGR-Bench.
7-Task Definition of
DOGR-Bench.
Model architecture of
DOGR.
Results on Existing General Document Benchmarks.
Results of DOGR on DOGR-Bench.
Quantitative Results of DOGR on DOGR-Bench.
More DOGR inference results on DOGR-Bench.
Other DOGR inference samples.
Failure cases of DOGR.
@misc{zhou2025dogrversatilevisualdocument,
title={DOGR: Towards Versatile Visual Document Grounding and Referring},
author={Yinan Zhou and Yuxin Chen and Haokun Lin and Yichen Wu and Shuyu Yang and Zhongang Qi and Chen Ma and Li Zhu and Ying Shan},
year={2025},
eprint={2411.17125},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.17125},
}