Process of parsing a PDF on the UnDatasIO platform


The UnDatasIO platform provides a framework for automated PDF parsing that adapts its parsing operations to the attributes of the uploaded PDF. The specific steps are:
Step 1: PDF pre-parse:
- Analyze the properties of each PDF page: check whether the file is encrypted, and whether each page is a scanned image or an editable (text-based) version.
- Then, perform element layout recognition on each page to identify the important elements (titles, body text, images, tables, formulas), their layout coordinates, and the reading order of the elements on each page, and save this information in JSON format (a minimal pre-parse sketch follows this list).
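The pre-parse step can be illustrated with a small sketch. This is a minimal example, not the platform's implementation: it assumes PyMuPDF (`fitz`) as the PDF library and a simple text/image heuristic to separate scanned pages from editable ones; the layout-recognition model itself is not public, so the `elements` list is left as a placeholder showing the intended JSON shape.

```python
import json
import fitz  # PyMuPDF -- an assumed dependency for this sketch, not necessarily the platform's stack

def pre_parse(pdf_path: str) -> list[dict]:
    """Collect per-page properties: encryption, scanned vs. editable, plus a slot for layout elements."""
    doc = fitz.open(pdf_path)
    if doc.needs_pass:
        # Password-protected file: record the fact and stop; decryption is handled upstream.
        return [{"encrypted": True, "type": None, "elements": []}]
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text").strip()
        images = page.get_images(full=True)
        # Heuristic: a page with (almost) no extractable text but at least one image
        # is treated as scanned; otherwise it is treated as editable.
        page_type = "scanned" if len(text) < 20 and images else "editable"
        pages.append({
            "page": page_num,
            "encrypted": doc.is_encrypted,
            "type": page_type,
            # To be filled by a layout-recognition model (not shown here); each element
            # would carry a category, a bounding box, and a reading-order index, e.g.:
            # {"category": "title", "bbox": [x0, y0, x1, y1], "order": 0}
            "elements": [],
        })
    return pages

if __name__ == "__main__":
    print(json.dumps(pre_parse("sample.pdf"), indent=2))
```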
Step 2: PDF parse:
- Parse each page of the PDF file. If the page is editable, extract the title and body text based on their coordinates. If it is scanned, recognize the text in the title and body areas with an OCR engine and write it into the corresponding JSON fields (a per-element dispatch sketch follows this list).
- Parsing the table areas is the most complex part, and there are many scenarios to consider:
  - In editable PDFs, the first step is to determine whether the table was generated by LaTeX or created as a table object by Word or Excel. If it was created by Word or Excel, it can be restored directly to its original table object format through the API. If it is a LaTeX-generated table, the table area must be processed as an image, similar to how tables are handled in non-editable PDFs.
  - The table handling for scanned PDFs is similar to that for LaTeX-generated tables. The complexity arises because scanned tables may have unclear outlines due to lighting, paper stains, or crumpled and folded pages. In such cases, a table structure recognition model is needed to identify the structure of the table, including whether it contains merged cells, and to recognize the coordinates of the row and column elements. For some complex tables, the structure recognition model may not perform well; depending on the task's objectives, it may then be necessary to bring in a large vision model to analyze the relevant regions, which incurs additional computational requirements and cost (see the table sketch after this list).
- If it is a formula area, it is converted into LaTeX syntax by a formula recognition model, regardless of whether the page is editable or scanned.
- If it is an image, that area is saved directly as an image file in the corresponding folder.
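The per-element parsing logic of Step 2 can be sketched as a simple dispatch over the layout elements produced in Step 1. This is a minimal sketch, not the platform's implementation: it assumes PyMuPDF for coordinate-based extraction and region rendering, pytesseract as the OCR engine, and it stubs out the formula-recognition model (`formula_to_latex`) and the table branch (`parse_table`, shown next) as hypothetical calls.

```python
import fitz                      # PyMuPDF (assumed) for coordinate-based extraction and rendering
import pytesseract               # assumed OCR engine for scanned pages
from PIL import Image

def render_region(page: fitz.Page, bbox: list[float]) -> Image.Image:
    """Render a page region (table, formula, or figure area) to a PIL image at 2x zoom."""
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2), clip=fitz.Rect(*bbox))
    return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

def parse_page(page: fitz.Page, page_info: dict, out_dir: str = "assets") -> dict:
    """Fill each layout element of one page with its parsed content."""
    for idx, el in enumerate(page_info["elements"]):
        bbox, category = el["bbox"], el["category"]
        if category in ("title", "text"):
            if page_info["type"] == "editable":
                # Editable page: pull the text lying inside the element's coordinates.
                el["content"] = page.get_text("text", clip=fitz.Rect(*bbox)).strip()
            else:
                # Scanned page: OCR the rendered region instead.
                el["content"] = pytesseract.image_to_string(render_region(page, bbox)).strip()
        elif category == "table":
            el["content"] = parse_table(page, el, page_info["type"])   # see the table sketch below
        elif category == "formula":
            # Formula-recognition model (hypothetical stub): image in, LaTeX string out.
            el["content"] = formula_to_latex(render_region(page, bbox))
        elif category == "image":
            path = f"{out_dir}/page{page_info['page']}_el{idx}.png"
            render_region(page, bbox).save(path)                       # save the figure crop to disk
            el["content"] = path
    return page_info

def formula_to_latex(img: Image.Image) -> str:
    # Hypothetical placeholder for a formula-recognition model (image in, LaTeX out).
    raise NotImplementedError
```

The table branch, called above as `parse_table`, illustrates the two cases described for editable and scanned pages. Again a sketch under assumptions: PyMuPDF's `find_tables()` stands in for reading back genuine table objects, and `table_structure_model` is a hypothetical placeholder for a table-structure-recognition model.

```python
def parse_table(page: fitz.Page, el: dict, page_type: str) -> list[list[str]]:
    """Restore genuine table objects directly; otherwise fall back to image + structure model."""
    if page_type == "editable":
        region = fitz.Rect(*el["bbox"])
        # Tables created as real objects (e.g. by Word/Excel) can often be read back cell by
        # cell; PyMuPDF's find_tables() is used here as a stand-in for that kind of API.
        for table in page.find_tables().tables:
            if fitz.Rect(table.bbox).intersects(region):
                return table.extract()          # list of rows, each a list of cell strings
    # LaTeX-generated or scanned table: no table object to recover, so hand the rendered
    # region to a table-structure-recognition model (hypothetical placeholder).
    return table_structure_model(render_region(page, el["bbox"]))

def table_structure_model(img: Image.Image) -> list[list[str]]:
    # Hypothetical placeholder: a model that detects merged cells and row/column coordinates
    # and returns the reconstructed rows.
    raise NotImplementedError
```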
Step 3: Output
If the user needs the corresponding parsing results, a compressed archive containing JSON and Markdown versions is generally provided. The Markdown files are generated from the content and order of the JSON files, with tables rendered as tagged HTML for better readability and formulas presented in LaTeX syntax.
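The Markdown generation can be sketched as a pass over the per-page JSON in reading order. This assumes the element records built in the earlier sketches (a `category`, `order`, and `content` per element); tables are emitted as HTML table/row/cell tags and formulas are wrapped in LaTeX math delimiters.

```python
def to_markdown(pages: list[dict]) -> str:
    """Emit Markdown in reading order; tables become HTML, formulas stay in LaTeX."""
    lines = []
    for page_info in pages:
        for el in sorted(page_info["elements"], key=lambda e: e["order"]):
            if el["category"] == "title":
                lines.append(f"# {el['content']}")
            elif el["category"] == "text":
                lines.append(el["content"])
            elif el["category"] == "table":
                # Rows (lists of cell strings) are rendered as an HTML table for readability.
                rows = "".join(
                    "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
                    for row in el["content"]
                )
                lines.append(f"<table>{rows}</table>")
            elif el["category"] == "formula":
                lines.append(f"$$\n{el['content']}\n$$")
            elif el["category"] == "image":
                lines.append(f"![figure]({el['content']})")
    return "\n\n".join(lines)
```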
📖 See Also
- Mastering-RAG-Optimization-The-Ultimate-Guide-to-Unstructured-Document-Parsing
- Leveraging-UnDatasio-and-deepseek-to-Analyze-Tesla-Gen-Report-A-Step-by-Step-Guide
- Leveraging-UnDatasio-and-DeepSeek-to-Analyze-Tesla-Gen-Report-2-Intelligent-Question-Answering-Unveiled
- In-Depth-Analysis-of-API-Services-Graphlit-LlamaParse-UndatasIO-etc-for-Extracting-Complex-PDF-Tables-to-Markdown
- Improving-the-Response-Quality-of-RAG-Systems-High-Quality-Enterprise-Document-Parsing
- Improving-the-Response-Quality-of-RAG-Systems-Excel-and-TXT-Document-Parsing