DOGR

Towards Versatile Visual Document Grounding and Referring

Yinan Zhou* 1,2,3, Yuxin Chen*†2, Haokun Lin2,3,4, Yichen Wu3,5,
Shuyu Yang1, Li Zhu1, Zhongang Qi‡2, Chen Ma‡3, Ying Shan2
*Equal Contribution †Project Lead ‡Corresponding Authors
1Xi’an Jiaotong University, 2ARC Lab, Tencent PCG, 3City University of Hong Kong,
4Institute of Automation, CAS
5Harvard University

Demo Video

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks.

To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data for enhancing fundamental text localization and recognition capabilities, and instruction-tuning data to activate MLLMs' grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGR-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing comprehensive evaluation of fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGR. This pioneering MLLM is capable of accurately referring and grounding texts at multiple granularities within document images. Our code, data, and model will be open-sourced for community development.

DOGR-Engine

Overview

Well-annotated and diverse grounded data are crucial for improving the grounding and referring capabilities of MLLMs. Currently, there is a shortage of comprehensive and accurately labeled grounded document data. Manually annotating raw document images is both time-consuming and labor-intensive, as it requires not only marking the bounding boxes but also accurately transcribing all the text within those boxes. To tackle this challenge, we collect a substantial volume of documents and develop the DOGR-Engine to construct fine-grained grounded document datasets. Our data sources cover three document types: posters, charts, and PDF documents. As shown below, we start by filtering the raw data to remove low-quality samples and those with missing or broken information. An overview of the training data statistics and the two data-construction strategies is also provided below.
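For illustration only, the sketch below shows what such a quality-filtering step could look like; it is not the released DOGR-Engine code, and the thresholds and field names (e.g. `ocr_text`, `min_side`) are assumptions.

```python
# Minimal sketch (not the released DOGR-Engine code): filtering raw document
# images before grounded-data generation. Thresholds and fields are hypothetical.
from dataclasses import dataclass


@dataclass
class RawDocument:
    image_path: str
    width: int
    height: int
    ocr_text: str   # concatenated OCR output (hypothetical field)
    doc_type: str   # "poster", "chart", or "pdf"


def keep_document(doc: RawDocument,
                  min_side: int = 448,
                  min_chars: int = 20) -> bool:
    """Drop low-quality samples: tiny images or documents with missing/broken text."""
    if min(doc.width, doc.height) < min_side:
        return False
    if len(doc.ocr_text.strip()) < min_chars:
        return False
    return True


def filter_corpus(docs: list[RawDocument]) -> list[RawDocument]:
    """Keep only documents that pass the basic quality checks."""
    return [d for d in docs if keep_document(d)]
```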

You can download DOGR-Bench from the DOGR-Bench release page.
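As a rough illustration of how one might load a benchmark split once downloaded, the snippet below assumes a JSON-Lines layout with hypothetical field names (`image`, `task`, `question`, `answer`, `bbox`); the actual released file format may differ, so check the released data first.

```python
# Illustrative only: loading a (hypothetical) JSON-Lines split of DOGR-Bench.
import json
from pathlib import Path


def load_samples(jsonl_path: str) -> list[dict]:
    """Read one benchmark split stored as JSON Lines (assumed format)."""
    samples = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            samples.append(json.loads(line))
    return samples


# Example of what a grounding-task sample might contain (hypothetical schema):
# {
#   "image": "poster_0001.png",
#   "task": "text_grounding",
#   "question": "Where is the phrase 'Opening Night' located?",
#   "answer": "Opening Night",
#   "bbox": [120, 56, 388, 92]   # x1, y1, x2, y2 in pixels
# }
```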

DOGR-Bench & DOGR


7-Task Definition of DOGR-Bench.


Model architecture of DOGR.

Experiment Results


Visualization Examples

BibTeX

@misc{zhou2025dogrversatilevisualdocument,
      title={DOGR: Towards Versatile Visual Document Grounding and Referring}, 
      author={Yinan Zhou and Yuxin Chen and Haokun Lin and Yichen Wu and Shuyu Yang and Zhongang Qi and Chen Ma and Li Zhu and Ying Shan},
      year={2025},
      eprint={2411.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.17125}, 
}