In-Depth Analysis of API Services (Graphlit, LlamaParse, UndatasIO, etc.) for Extracting Complex PDF Tables to Markdown

Author: xll
Read time: 14 min

In today’s rapidly evolving digital realm, the need for efficient tools to handle complex PDF documents and convert them into user-friendly Markdown format remains crucial. Building on our previous exploration of API services for PDF extraction to Markdown, we are now diving deeper into evaluating and analyzing more complex tables.

As we mentioned earlier, Graphlit’s comprehensive comparison of various API services such as LlamaParse, Zerox, Reducto, and Chunkr provided valuable insights. Our active participation in this evaluation highlighted the capabilities and limitations of different platforms.

Now, with a focus on more intricate table structures that may include merged cells and complex formatting, we are set to conduct a new round of thorough evaluations. The sample tables for this analysis will be significantly more challenging than the standard ones used in our previous assessment.

UndatasIO, which demonstrated remarkable accuracy and relatively fast processing time in our initial tests, will once again be put to the test. We expect it to continue to shine in handling complex tables.

Sample Table

This is the sample table we are using for comparison (converted to PDF format for the testing).

Our evaluation will cover multiple aspects, just as before. We will analyze the parsing results for accuracy, measure the time taken to complete the extraction, and consider the cost implications. By doing so, we aim to provide users with a comprehensive understanding of which service is best suited for their specific needs when dealing with complex PDF tables.

For all third-party API services, we used their freely available demo APIs or online demos.

Parsing more complex tables poses two main difficulties. First, cells are easily parsed out of alignment. Second, the tables themselves are hard to understand: without careful reading, even a human struggles to locate the correct position of each value. UndatasIO takes specific measures to address these challenges. Its process for parsing complex tables is as follows: first crop the table from the PDF, then perform OCR, then parse the header, then parse the leftmost column, and finally parse each row.
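To make the last three stages of that process concrete, here is a minimal Python sketch of how a separately parsed header, leftmost column, and row bodies could be assembled into a Markdown table. It is an illustration under stated assumptions, not UndatasIO’s implementation: the function names `escape_cell` and `assemble_markdown` are hypothetical, and the cropping and OCR stages are assumed to have already produced plain strings.

```python
from typing import List


def escape_cell(text: str) -> str:
    """Make OCR text safe for a Markdown table cell."""
    return text.replace("|", "\\|").replace("\n", " ").strip()


def assemble_markdown(
    header: List[str], first_column: List[str], rows: List[List[str]]
) -> str:
    """Combine a parsed header, leftmost column, and row bodies (hypothetical helper)."""
    if len(first_column) != len(rows):
        raise ValueError("each row needs exactly one first-column label")
    width = len(header)
    lines = [
        "| " + " | ".join(escape_cell(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for label, body in zip(first_column, rows):
        cells = [label] + list(body)
        if len(cells) > width:
            raise ValueError(f"row {label!r} has more cells than the header")
        # Pad short rows with empty cells so later columns do not shift left.
        cells += [""] * (width - len(cells))
        lines.append("| " + " | ".join(escape_cell(c) for c in cells) + " |")
    return "\n".join(lines)


# Placeholder data standing in for OCR output; not taken from the sample table.
header = ["Region", "Q1", "Q2"]
first_column = ["North", "South"]
rows = [["10", "12"], ["7"]]
markdown = assemble_markdown(header, first_column, rows)
print(markdown)
```

Anchoring each row to its leftmost label and padding it to the header width is one way to limit the misalignment problem described above: a missing value leaves an empty cell rather than pulling the rest of the row out of position.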

UndatasIO

Rendered Markdown

Graphlit

Rendered Markdown

LlamaParse

Accurate mode

Rendered Markdown

Premium mode

Rendered Markdown

Zerox (from OmniAI)

Rendered Markdown

Unstructured.IO

Rendered Markdown

Results

The parsing results for the sample table show that every API service struggles to some degree, and that output quality varies between them. For complex tables with merged cells, restoring the original structure is particularly challenging.
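Part of the difficulty is that Markdown pipe tables have no syntax for row or column spans, so a merged cell has to be flattened somehow. One common workaround, shown here only as a sketch and not as what any of the evaluated services do, is to repeat the merged cell’s text in every grid position it covers. The `Cell` tuple layout and the `expand_spans` helper are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Each cell is (text, rowspan, colspan); rows list only the cells that start in them.
Cell = Tuple[str, int, int]


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand row and column spans into a full rectangular grid of strings."""
    filled: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a span that started in a row above.
            while (r, c) in filled:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    if not filled:
        return []
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    return [[filled.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Placeholder table: "Item" spans two rows, "2023" spans two columns.
table = [
    [("Item", 2, 1), ("2023", 1, 2)],
    [("H1", 1, 1), ("H2", 1, 1)],
    [("Revenue", 1, 1), ("5", 1, 1), ("6", 1, 1)],
]
grid = expand_spans(table)
for line in grid:
    print(line)
```

Repeating the text keeps every row the same width, at the cost of losing the visual grouping that the merged cell conveyed in the original PDF.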

Among them, UndatasIO’s Markdown output is closer to the original table in terms of structure. UndatasIO also does a better job on the accuracy of the table text: it recovers most of the text, though the result is still not perfect.

LlamaParse (Premium mode) and Unstructured.IO parse the text well but restore the table structure poorly. Graphlit is worse still at restoring table structure, and its text parsing is also not as good as LlamaParse’s.

Judging from the overall performance, among all the API services evaluated this time (Graphlit, LlamaParse in Accurate and Premium modes, Zerox (from OmniAI), Unstructured.IO, and UndatasIO), Zerox does not perform as well as expected. Its ability to restore table structure is rather limited, and it shows no particular advantage in text parsing either. Compared with the other API services, the gap is noticeable.

Results: Parsing of Table Text

Among the evaluated API services, UndatasIO has better accuracy in parsing table text, though not perfect. LlamaParse (Premium mode) and Unstructured.IO also perform well. Graphlit lags behind, and Zerox (from OmniAI) shows no particular advantage.
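For readers who want to put a number on text accuracy in their own tests, the following sketch compares a service’s Markdown output against a hand-written reference table, cell by cell. It is not the scoring method used in this evaluation, which was based on inspecting the rendered results. The helpers `parse_markdown_table` and `cell_text_accuracy` are hypothetical, and the parser assumes simple pipe tables without escaped pipes.

```python
from difflib import SequenceMatcher
from typing import List


def parse_markdown_table(markdown: str) -> List[List[str]]:
    """Split a simple pipe table into rows of cells, skipping the delimiter row."""
    rows: List[List[str]] = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        if line.endswith("|"):
            line = line[:-1]
        cells = [c.strip() for c in line[1:].split("|")]
        # The delimiter row comes right after the header and contains only - and :
        if len(rows) == 1 and all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    return rows


def cell_text_accuracy(expected: List[List[str]], actual: List[List[str]]) -> float:
    """Mean character-level similarity over expected cells, compared by position."""
    scores = []
    for r, expected_row in enumerate(expected):
        for c, expected_cell in enumerate(expected_row):
            in_range = r < len(actual) and c < len(actual[r])
            got = actual[r][c] if in_range else ""
            scores.append(SequenceMatcher(None, expected_cell, got).ratio())
    return sum(scores) / len(scores) if scores else 1.0


# Placeholder tables: the parsed output is missing the value "2".
reference = parse_markdown_table("| A | B |\n| --- | --- |\n| 1 | 2 |")
output = parse_markdown_table("| A | B |\n|---|---|\n| 1 |  |")
score = cell_text_accuracy(reference, output)
print(f"cell text accuracy: {score:.2f}")
```

Because cells are compared by position, a misaligned row lowers the score even when every string was recognized correctly, so this measure mixes text and structure errors to some extent.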

Results: Parsing of Table Structure

UndatasIO’s markdown result is closer to the original format. LlamaParse (Premium mode) and Unstructured.IO have weak restoration effects. Graphlit is worse, and Zerox (from OmniAI) has limited capability, showing a disparity with other services.

Speed and Cost

In terms of speed on the sample table, LlamaParse (Accurate mode) is the fastest of all the API services. Zerox (from OmniAI) and Unstructured.IO are slower than the others. Graphlit’s speed is average, and LlamaParse (Premium mode) and UndatasIO are also moderate.
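A speed comparison like this can be reproduced with a small timing harness. The sketch below is an assumption about how one might measure it, not the procedure used for the figures above: `time_parser` is a hypothetical helper, and `fake_parse` is a stand-in that should be replaced with a real call to whichever service is being tested.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_parser(
    parse: Callable[[bytes], str], pdf_bytes: bytes, runs: int = 3
) -> Dict[str, float]:
    """Call parse several times and report wall-clock durations in seconds."""
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(pdf_bytes)
        durations.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def fake_parse(pdf_bytes: bytes) -> str:
    """Hypothetical stand-in for a real API client call."""
    time.sleep(0.01)
    return "| A |\n| --- |\n| 1 |"


stats = time_parser(fake_parse, b"%PDF placeholder bytes")
print(stats)
```

Reporting the median over several runs reduces the influence of one-off network delays, which matter when the services being compared are remote APIs.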

Summary

Each of these approaches differs in performance and cost, but when looking for the most accurate extraction of Markdown from complex tables, UndatasIO will yield great results.

UndatasIO shows promising performance in parsing complex tables. In terms of table structure restoration, its markdown result is closer to the original format compared to other services. While not perfect, it also does a better job in parsing table text. In terms of speed, it has a moderate pace, falling between the fastest and the slowest services.

Overall, for those looking for a balance between accuracy and reasonable processing time in extracting Markdown from complex tables, UndatasIO is a viable option. However, as with all services, users should consider their specific needs and priorities when choosing an API for their particular use case.

Contact us

If you’re interested in the UndatasIO platform or in following our evaluations, feel free to try it out and experience its capabilities.

Let’s explore the world of efficient PDF to Markdown conversions together. Stay tuned for our in-depth analysis and recommendations.
