OHRBench: Unveiling the Crucial Role of OCR in RAG Systems
In the ever-evolving landscape of artificial intelligence and natural language processing, Retrieval-augmented Generation (RAG) has emerged as a powerful technique to supercharge Large Language Models (LLMs). By integrating external knowledge, RAG systems can mitigate the issue of hallucinations and stay updated with the latest information without the need for time-consuming retraining.
The cornerstone of many RAG systems is the external knowledge base. A common approach to constructing these knowledge bases is by extracting structured data from unstructured PDF documents via Optical Character Recognition (OCR). However, this process is not without its challenges. The imperfect predictions of OCR and the non-uniform representation of structured data mean that knowledge bases often end up with various forms of OCR noise.
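To make that pipeline concrete, here is a minimal sketch of how such a knowledge base is typically built: OCR the PDFs, chunk the extracted text, and embed the chunks into a vector index. The `run_ocr` step is a placeholder for whatever tool you use, and the chunk size and embedding model are illustrative choices, not OHRBench's exact setup. Note how any noise the OCR introduces flows directly into every downstream chunk:

```python
# Minimal KB-construction sketch. `run_ocr` stands in for your OCR tool
# (MinerU, Marker, ...); chunking and the embedding model are illustrative.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def run_ocr(pdf_path: str) -> str:
    """Placeholder: call an OCR solution and return its structured text."""
    raise NotImplementedError

def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    # Naive fixed-size chunking; production systems usually split on layout.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def build_knowledge_base(pdf_paths: list[str]):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = [c for p in pdf_paths for c in chunk(run_ocr(p))]
    vecs = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])  # cosine via normalized inner product
    index.add(np.asarray(vecs, dtype="float32"))
    return index, chunks  # OCR noise in `chunks` is now baked into retrieval
```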
In this blog post, we’re going to introduce you to OHRBench, a revolutionary benchmark that aims to shed light on the cascading impact of OCR on RAG systems. OHRBench comprises 350 meticulously selected unstructured PDF documents sourced from six real-world RAG application domains. Additionally, it includes question and answer pairs derived from the multimodal elements present in these documents, presenting a significant challenge to the existing OCR solutions used in RAG setups.
In a RAG system, imperfect OCR extraction from unstructured PDF documents and the non-uniform representation of structured data introduce OCR noise (Semantic Noise and Formatting Noise) into the knowledge base, which ultimately degrades the performance of the whole system.
Different levels of Semantic Noise applied to plain text, equations, and tables; all perturbations are derived from real OCR results.
OHRBench has therefore been proposed and open-sourced to evaluate how well current OCR solutions serve real-world RAG applications. The key findings:
- Pipeline-based OCR demonstrates the best performance. Among all OCR solutions, Marker achieves the best retrieval performance, while MinerU dominates in generation and overall evaluation.
- All OCR solutions suffer a performance decline compared with ground-truth structured data. Even the best solution drops 1.9 points in EM@1 and 2.93 points in F1@1 in the overall evaluation, and the losses are even greater in the retrieval and generation stages (a worked example of these metrics follows this list).
- There is potential for using Vision-Language Models (VLMs) directly in RAG systems, without OCR at all.
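For readers unfamiliar with the generation metrics cited above: EM (exact match) and token-level F1 are standard QA metrics, with "@1" read here as scoring the top-1 answer. The sketch below is a common formulation in the style of SQuAD scoring, not necessarily OHRBench's exact normalization:

```python
# Worked example of EM and token-level F1 for QA-style evaluation.
# Normalization is deliberately simple; official scripts may differ.
from collections import Counter

def normalize(s: str) -> str:
    return " ".join(s.lower().split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "The Eiffel Tower"))             # 1.0
print(round(token_f1("Eiffel Tower in Paris", "the Eiffel Tower"), 2)) # 0.57
```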
Construction and evaluation protocols of OHRBench:
1. Benchmark dataset: collect PDF documents from six domains, extract manually verified ground-truth structured data, and generate question-answer pairs from multimodal document elements.
2. RAG knowledge bases: OCR-processed structured data for benchmarking current OCR solutions, and perturbed structured data for evaluating the impact of different OCR noise types.
3. Evaluation: measure the impact of OCR on each RAG component and on the end-to-end system (a sketch of this protocol follows the list).
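A minimal sketch of that three-way protocol, with the retriever, generator, and scoring functions left as placeholders (they are assumptions, not OHRBench's code):

```python
# Three-way evaluation sketch: retrieval alone, generation alone (gold
# evidence as context), and the end-to-end system. All callables are
# placeholders for your own retriever, LLM, and scorers.
def evaluate(qa_pairs, retrieve, generate, score_retrieval, score_answer):
    results = {"retrieval": [], "generation": [], "overall": []}
    for qa in qa_pairs:  # each qa: {"question", "answer", "evidence"}
        hits = retrieve(qa["question"], top_k=2)
        # (1) Retrieval: did the (possibly noisy) KB surface the evidence?
        results["retrieval"].append(score_retrieval(hits, qa["evidence"]))
        # (2) Generation: answer quality when the gold evidence is supplied.
        ans_gold = generate(qa["question"], [qa["evidence"]])
        results["generation"].append(score_answer(ans_gold, qa["answer"]))
        # (3) Overall: the full RAG pipeline, retrieval errors included.
        ans_rag = generate(qa["question"], hits)
        results["overall"].append(score_answer(ans_rag, qa["answer"]))
    return {k: sum(v) / len(v) for k, v in results.items()}
```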
Document layouts in OHRBench are complex; in the figure, each number indicates how many PDF pages have that layout attribute.
One of the real table cases used to introduce Semantic Noise. The upper left shows the original ground-truth table; the upper right shows a real example from MinerU's OCR output. The lower left and lower right show the original table under moderate and severe perturbation, guided by the real example. For presentation, some LaTeX code was manually adjusted so that most of the table structures render properly.
To truly understand the implications of OCR for RAG systems, we've identified two main types of OCR noise: Semantic Noise (errors in the recognized content itself) and Formatting Noise (errors in how structure such as tables and equations is represented). By perturbing the ground-truth structured data, we generated versions of the knowledge base with varying degrees of each noise type, enabling in-depth, controlled experiments.
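As a toy illustration of the two noise families, assuming a markdown-like knowledge base: Formatting Noise corrupts only structural markers, while Semantic Noise corrupts the content itself. OHRBench's actual perturbations are guided by real OCR errors; the random version below only conveys the idea:

```python
# Toy perturbations for the two noise families. OHRBench derives its
# perturbations from real OCR output; this random sketch is illustrative.
import random

def add_formatting_noise(md: str, p: float = 0.3, seed: int = 0) -> str:
    """Corrupt structure only: randomly drop table pipes and emphasis markers."""
    rng = random.Random(seed)
    return "".join(c for c in md if c not in "|*#" or rng.random() > p)

def add_semantic_noise(md: str, p: float = 0.05, seed: int = 0) -> str:
    """Corrupt content: swap characters, mimicking misrecognized glyphs."""
    rng = random.Random(seed)
    confusions = {"0": "O", "1": "l", "5": "S", "m": "rn"}
    return "".join(confusions.get(c, c) if rng.random() < p else c for c in md)

table = "| Year | Revenue |\n| 2021 | 105 |"
print(add_formatting_noise(table))  # some table pipes removed
print(add_semantic_noise(table))    # e.g. "105" may become "lO5"
```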
When we used OHRBench to evaluate current OCR solutions comprehensively, the results were quite revealing. None of the existing solutions proved to be fully capable of constructing high-quality knowledge bases for RAG systems. This has significant implications for the overall performance and reliability of RAG applications.
We also systematically evaluated the impact of these two noise types on RAG systems. The findings clearly demonstrated the vulnerability of RAG systems to OCR noise. Even a relatively small amount of noise can lead to a degradation in performance, affecting both the retrieval and generation aspects of the RAG process.
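On the retrieval side, degradation is measured by how much of the gold evidence the retrieved chunks still cover. The paper reports an LCS-based retrieval metric; the word-level longest-common-subsequence scorer below is one plausible reading of it, not the official implementation (it could plug in as `score_retrieval` in the earlier sketch):

```python
# Word-level LCS retrieval scorer: how much of the gold evidence does the
# best retrieved chunk preserve? One reading of an LCS metric, not official.
def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic O(len(a) * len(b)) dynamic program.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def lcs_at_k(retrieved: list[str], evidence: str) -> float:
    """Best normalized LCS between the gold evidence and any retrieved chunk."""
    ev = evidence.lower().split()
    return max(lcs_len(chunk.lower().split(), ev) / len(ev) for chunk in retrieved)

print(lcs_at_k(["the cat sat on the mat", "dogs bark"], "cat sat on mat"))  # 1.0
```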
Finally, we delved into an interesting alternative: using Vision-Language Models (VLMs) in RAG systems without relying on OCR at all. This could bypass some of the issues associated with OCR and open up new avenues for improving the performance and robustness of RAG setups.
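A minimal sketch of that OCR-free route: hand the rendered page image directly to a VLM together with the question. The model name, prompt, and OpenAI-style client here are illustrative assumptions; any VLM with image input would do:

```python
# OCR-free sketch: query a VLM on the raw page image instead of OCR text.
# Model and prompt are illustrative; assumes OPENAI_API_KEY is set.
import base64
from openai import OpenAI

client = OpenAI()

def vlm_answer(question: str, page_image_path: str) -> str:
    with open(page_image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Answer from this page: {question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```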
In conclusion, OHRBench is a crucial step forward in understanding the complex relationship between OCR and RAG systems. As the field continues to progress, it will be essential to address the challenges posed by OCR noise and explore alternative strategies to optimize the performance of RAG systems. Stay tuned as we continue to explore and innovate in this exciting area of research.
Paper: OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation (https://arxiv.org/pdf/2412.02592)
Code: https://github.com/opendatalab/OHR-Bench