A Brief Look at Docling, an Open-Source Framework in the Homogenized Document Parsing Field


With the rise of Retrieval-Augmented Generation (RAG), more and more people have begun to pay attention to the quality of structured document parsing, and the field has become highly homogenized.

In this blog, we’ll briefly explore the technologies within the Docling PDF document parsing framework.

Methods

Layout Analysis Model

Docling employs a layout analysis model, an object detector that predicts the bounding boxes and categories of the elements on a given page image. Its architecture is derived from RT-DETR and retrained on the DocLayNet dataset, with inference running on onnxruntime. As I've mentioned in previous articles, layout analysis depends heavily on in-domain training data. Because this model is trained only on DocLayNet, it falls short in some common Chinese-language scenarios. Moreover, since it is based on RT-DETR, it carries a relatively large parameter count. As an alternative, you can use the previously open-sourced lightweight layout analysis model, which supports fine-grained layout detection (including paragraph-level information) for Chinese scenarios such as research reports and academic papers.

(Repository: https://github.com/360AILAB-NLP/360LayoutAnalysis)
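
To make the onnxruntime dependency concrete, here is a minimal sketch of running an ONNX layout detector over a rendered page image. The model path, input size, and output layout are assumptions for illustration only; the actual export (whether Docling's model or 360LayoutAnalysis) may expect different input names and preprocessing.

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

# Hypothetical model file; the real export may expect extra inputs or sizes.
session = ort.InferenceSession("layout_model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Preprocess the rendered page: resize, scale to [0, 1], CHW layout, add batch dim.
page = Image.open("page.png").convert("RGB").resize((640, 640))
tensor = (np.asarray(page, dtype=np.float32) / 255.0).transpose(2, 0, 1)[None]

# Outputs typically hold boxes, class ids, and scores for layout elements
# (text, title, table, figure, ...), which are then filtered by a score threshold.
outputs = session.run(None, {input_name: tensor})
```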

Table Structure Recognition Model

Second, Docling uses the TableFormer table structure recognition model, which predicts the logical row-and-column structure of a table from the input image. Its inference runs on PyTorch.
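
In Docling, table structure recognition can be toggled and tuned through the PDF pipeline options. The sketch below assumes a recent Docling release; module paths and option names may differ between versions.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable TableFormer and pick the more accurate (but slower) variant.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("report.pdf")
```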

OCR Text Recognition

Docling also offers optional OCR support, for example for scanned PDFs. By default it uses the EasyOCR engine, running OCR on high-resolution (216 dpi) page images so that small fonts are still captured.
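
For reference, this is roughly what the underlying EasyOCR call looks like when applied to a rendered page image; the language list and image path are placeholders, not values taken from Docling itself.

```python
import easyocr

# Docling renders pages at 216 dpi before OCR; here we assume a pre-rendered image.
reader = easyocr.Reader(["ch_sim", "en"], gpu=False)

# Each result is a (bounding_box, text, confidence) tuple.
for bbox, text, confidence in reader.readtext("page_216dpi.png"):
    print(f"{confidence:.2f}  {text}")
```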

Processing Pipeline

Docling implements a linear processing pipeline that executes operations on each document in sequence. Each document is first parsed by the PDF backend, which retrieves the programmatic text tokens and renders a bitmap image of each page. The standard model pipeline is then applied independently to each page to extract features and content such as layout and table structures. Finally, the results of all pages are aggregated and passed through a post-processing stage that enriches metadata, detects the document language, infers the reading order, and assembles a typed document object that can be serialized to JSON or Markdown.
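
The whole pipeline is exposed through a single converter class. A minimal usage sketch, assuming a recent Docling release:

```python
import json

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Runs the full pipeline: backend parsing, layout analysis, table structure,
# OCR (if enabled), then post-processing and reading-order inference.
result = converter.convert("paper.pdf")

# The typed document object can be serialized to Markdown or JSON.
markdown = result.document.export_to_markdown()
as_json = json.dumps(result.document.export_to_dict(), ensure_ascii=False)
```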

Results

Processing Speed

On a MacBook Pro M3 Max with 4 threads, Docling's total processing time is 177 seconds, with a throughput of 1.27 pages per second and a peak memory usage of 6.20 GB. With 16 threads, the processing time drops to 167 seconds and the throughput rises to 1.34 pages per second; peak memory usage was not recorded.

OCR Performance

By default, the OCR engine runs on high-resolution page images, which is rather slow (more than 30 seconds per page).
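
If the inputs are digital-born PDFs rather than scans, OCR can be skipped entirely to avoid this cost. A hedged sketch, again assuming the PdfPipelineOptions interface of a recent Docling version:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Skip OCR for PDFs that already carry a programmatic text layer.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```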

Resource Efficiency

On an Intel Xeon E5-2690 with 4 threads, the total processing time is 375 seconds, with a throughput of 0.60 pages per second; peak memory usage was not recorded. With 16 threads, the processing time is 244 seconds, the throughput is 0.92 pages per second, and the peak memory usage is 6.16 GB.

Summary

Intelligent document parsing is currently highly homogenized, and in practice there are few tools that handle document parsing well in specific scenarios. Still, we can borrow specific optimizations from the Docling project, such as its multi-threaded pipeline, and combine them with other strong open-source or self-developed small models, swapping out individual components to build our own document parsing tools.

References

https://github.com/DS4SD/docling
