What’s the Best PDF Extractor for RAG? I Tried LlamaParse, Unstructured and Undatas.IO


If you’re building retrieval augmented generation (RAG) applications, you will eventually need to work with documents that are in PDF form.

If you’ve never looked closely to see how PDFs work, you’ll likely be surprised by how difficult it is to reliably extract the data you want from these documents.

That’s because PDFs represent content closer to how a printer thinks about putting ink on paper.

This makes PDFs a great choice for printed documents which require flexibility when defining the page layout. You can print text in multiple columns. In the middle of the page you can insert an image that spans columns. You can use PDFs to print plain text documents. You can use them to make complex user manuals. You can represent brochures with advanced graphic design.

All of this makes PDF a wonderful option for many use cases.

It also makes them a huge pain when you want to turn PDF documents into a more structured format suitable for retrieval augmented generation.

Luckily, the advent of large language models and the popularity of RAG has led to a number of advancements that can make this task easier.

In this post, I tried out several solutions, ranging from standalone libraries to hosted cloud services. In the end, I identified the three best options for PDF extraction for RAG and put them head to head on complex PDFs to see how well they each handled the challenges I threw at them.

The Contenders

After looking at many options, I narrowed the list down to Unstructured, LlamaParse, and Undatas.IO.

All of these options support many document types beyond PDFs, including Microsoft Word, RTF, JSON, images, PowerPoint, and more. (If you’d be interested in a similar comparison of some of these additional document formats, let me know in the comments.)

Additionally, all of these parsers are able to generate a markdown representation of the content. This is important for RAG because a lot of contextual information is communicated in things like headings, images, graphs, and formatting. We want to preserve this information so the LLM can better determine how to “think” about the information provided in a given piece of content.

In this comparison, we’ll be focusing on PDFs only. And I want to state at the outset that all three of these are quality offerings and represent the best of the best.

Unstructured.io

Unstructured started out as a PDF library which gained popularity thanks to its early integration with LangChain.

In the last 2 years, Unstructured has extended their open source library with a cloud service that can be used for document extraction.

Unstructured has 3 extraction options in their cloud: their Basic option is positioned for plain-text documents at $2 / 1,000 pages, their Advanced is $20 / 1,000 pages, and their Platinum is $30 / 1,000 pages and can handle advanced cases like handwriting.

Unstructured recommends using Advanced for PDF documents, so that’s what we’ll compare here.

You can use Unstructured’s parser as part of their cloud data pipelines or as a standalone API.
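For reference, here’s a minimal sketch of calling the open-source library’s layout-aware strategy. The cloud tiers above are separate hosted offerings, and the file name here is a placeholder.

```python
# Minimal sketch using Unstructured's open-source library.
# Assumes `pip install "unstructured[pdf]"` and a local file named report.pdf.
from unstructured.partition.pdf import partition_pdf

# "hi_res" is the layout-aware strategy; "fast" is the cheaper text-only path
elements = partition_pdf(filename="report.pdf", strategy="hi_res")

for element in elements:
    # Each element carries a detected category (Title, NarrativeText, Table, ...)
    print(element.category, "->", element.text[:80])
```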

Llama Parse

Llama Parse is from the makers of LlamaIndex and is included in their Llama Cloud service. Like Unstructured, Llama Parse has multiple options.

Their lower-cost offering starts at $3 / 1,000 pages, while their Llama Parse Premium is much more expensive at $45 / 1,000 pages.

While Llama Parse does include a number of free pages that can be parsed per month, once you hit that limit, Llama Parse becomes the most expensive option of the ones compared.

You can use Llama Parse as part of Llama Cloud pipelines or as a standalone API.
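As a rough sketch, parsing a PDF with the llama-parse Python package looks like this. The file name is a placeholder, and an API key is assumed to be set in the LLAMA_CLOUD_API_KEY environment variable.

```python
# Minimal sketch using the llama-parse package (pip install llama-parse).
from llama_parse import LlamaParse

# result_type can be "markdown" or "text"
parser = LlamaParse(result_type="markdown")

# load_data returns a list of Document objects containing the parsed content
documents = parser.load_data("report.pdf")
print(documents[0].text[:500])
```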

Undatas.io

Undatas.io is a leading enterprise-grade service platform specializing in intelligent document processing and Retrieval-Augmented Generation (RAG). It empowers businesses to extract, analyze, and utilize structured and unstructured data from complex documents with AI-driven precision. The platform offers two core extraction solutions:

Basic Extractor: Free when integrated with undatas.io’s RAG pipeline, it supports fast parsing of structured text, tables, and standard formats, ideal for standardized document workflows.
AI+OCR Pro Extractor: Designed for challenging unstructured documents (e.g., scanned files, handwritten text, multi-language materials), this advanced vision and NLP-powered tool delivers high-precision information extraction at $25 per 1,000 pages, with flexible on-demand scaling.

Key differentiators of undatas.io include:

  • Standalone API & Pipeline Integration: Offers both full RAG workflows and modular extraction capabilities for customized use cases.
  • Free Document Parser Tester: Users can preview extraction results instantly via an online tool without registration.
  • Enterprise Compliance: Ensures GDPR and HIPAA compliance, with private cloud deployment options for sensitive data.
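To give a feel for the standalone-API style of integration, here’s a hypothetical sketch. The endpoint, auth scheme, and parameters below are illustrative placeholders, not Undatas.io’s documented API, so check their docs for the real request shape.

```python
# Hypothetical sketch of a standalone extraction call; the URL and fields
# are placeholders, not Undatas.io's actual documented API.
import requests

ENDPOINT = "https://api.undatas.io/v1/extract"  # placeholder URL
API_KEY = "your-api-key"                        # assumed key-based auth

with open("report.pdf", "rb") as f:
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={"output_format": "markdown"},     # placeholder parameter
    )
resp.raise_for_status()
print(resp.json())
```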

Evaluation Criteria

I wanted to look at challenging PDFs that AI engineers often encounter when building solutions using RAG. To that end, I decided to assess each extractor looking at various categories of PDF documents you might run into.

For each category, I’ll give my personal assessment of how well I think they did along with an explanation where it makes sense. I’ll rank each extractor with a score of:

Excellent — Indicates that the results are well aligned with the output we’d like to see with no major problems.

Good — Indicates the results might not be perfect, but they produce reasonable results. An extractor that mistakes a heading for simple bolded text might be an example here.

Fair — Indicates that the results had some fairly significant issues. Examples might include places where the extractor got confused by a document’s layout and combined two unrelated sections of text.

Poor — Indicates that the results were unusable or severely flawed. For example, if an extractor couldn’t handle a given document type at all, it would be given this score.

Simple Text

To start off, we’ll look at one of the simplest cases you can encounter. These PDFs have text that’s uniform across pages without any complex layouts. For this we’ll use a PDF of Pride and Prejudice from Project Gutenberg.

What we’re looking for here is a baseline: PDF extractors should be able to accurately extract text from plain-text PDFs. We’ll judge the competitors on whether they capture the text exactly or introduce inconsistencies between the original text and the extracted markdown.
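If you want to run this kind of baseline check yourself, a quick similarity score against a known-good transcription catches most discrepancies. Here’s a minimal sketch with Python’s standard difflib; the file names are placeholders.

```python
# Compare extracted markdown to a known-good text of the same passage.
import difflib

original = open("pride_and_prejudice.txt", encoding="utf-8").read()
extracted = open("extracted.md", encoding="utf-8").read()

# 1.0 means a character-for-character match
ratio = difflib.SequenceMatcher(None, original, extracted).ratio()
print(f"similarity: {ratio:.3f}")
```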

Simple Text Results

| Provider | Extractor | Score |
| --- | --- | --- |
| Unstructured.io | High-Res | Excellent |
| LlamaParse | Accurate | Excellent |
| Undatas.io | Accurate | Excellent |

Unsurprisingly, all three of the extractors tested performed very well on this test. The results were essentially identical and accurate, so there’s really no wrong choice if this is the type of PDF you’re primarily working with.

One small item to note is that both Llama Parse and Undatas.io detected new paragraphs and represented them with \n\n while Unstructured more accurately reflected the line spacing in the original doc. From a RAG perspective, representing the structure with \n\n may offer a slight advantage with paragraph chunkers that are looking for this particular character combination to indicate a boundary point for a chunk.
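To make the \n\n point concrete, here’s a minimal sketch of the kind of paragraph chunker that benefits from that convention; real splitters add overlap and smarter merging.

```python
# Naive paragraph chunker: split on blank lines, then pack paragraphs
# into chunks up to a size limit.
def paragraph_chunks(markdown: str, max_chars: int = 1000) -> list[str]:
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```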

That said, I can’t really fault any of the extractors for their performance in this category.

Multi-Column

It’s very common to find PDFs that format their content across multiple columns. This is true with everything from informative brochures to academic papers. We’ll be looking at excerpts from a published article that have a simple heading and text with minimal layout and decoration on the pages.

Here we will primarily be looking at two things:

  1. How well does the extractor represent text from one column to the next?

  2. How well does the extractor handle text across pages? i.e. where column 3 on page 1 continues on column 1 on page 2.

Multi-Column Results

| Provider | Extractor | Score |
| --- | --- | --- |
| Unstructured.io | High-Res | Excellent |
| LlamaParse | Accurate | Fair |
| Undatas.io | Accurate | Good |

Unstructured

For the most part, Unstructured’s High Res extractor handled this case very well. The Unstructured parser did introduce a line break when it reached the bottom of a column and continued at the top of the next, but it’s unlikely this would cause any issues when used with RAG.

Unstructured handled cross-page columns differently than the other two extractors. This is an area where the correct behavior could be interpreted multiple ways. However, I would argue that Unstructured’s behavior is best for RAG.

Let’s look at the break between these two pages:

We have the common situation where the page has a footer and the top of the next page has a header. It could be argued that these should be excluded altogether from a RAG perspective. However, it’s also valid to say that an accurate representation of the document should include these elements because the document includes them.

However, Unstructured’s parser seems to reflect the fact that most markdown splitters use headings to decide where one chunk stops and the next begins. Unstructured’s output therefore does include the footer and “NEWS FEATURE”, but it doesn’t render them as headings, which allows a markdown splitter to correctly provide the LLM with the entire section as a single chunk.

If this was done intentionally, it shows great attention to detail and is likely to benefit RAG systems, so I’m giving Unstructured a score of Excellent in this category.
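To illustrate why demoting a page header matters, here’s a minimal, self-contained sketch of a heading-based markdown splitter (not Unstructured’s internal logic): any line rendered as a heading starts a new chunk, so a stray “# NEWS FEATURE” would cut a section in half.

```python
import re

# Start a new section at every markdown heading line (#, ##, ...).
def split_on_headings(markdown: str) -> list[str]:
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```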

Llama Parse

Llama Parse struggled in several key areas of this evaluation. For example, you can see here a case where a short title sits on one page and the text continues on the next.

However the output from Llama Parse treated a huge portion of the text as a heading. That alone probably wouldn’t cause major problems, but there was a much bigger issue.

Throughout the text, Llama Parse treated text from multiple columns as a continuous whole. This unfortunately means that if we were counting on this output to power our RAG application, we would be passing in gibberish to the LLM.

Even when Llama Parse did correctly extract text from a single column, it would sometimes get confused by headings. For example, in this section you can see the heading “The North Ireland perspective” followed by the text “Northern Ireland delivers…”

Even though Llama Parse was able to recognize the columns correctly, it didn’t detect the heading and seemed to combine the heading text with the paragraph text that followed it.

While there were definitely issues, Llama Parse did extract text reasonably well in places and deserves a Fair in this category.

Undatas.io

Undatas.io performed very well overall in this category as well. Its output was very similar in quality to Unstructured with a few minor issues or possibly philosophical differences.

Throughout the text, Undatas.io kept text from separate columns apart and produced an accurate markdown representation for RAG. The Undatas.io output also includes contextual hints intended to help the LLM make better sense of the chunks it receives.

This approach reflects some of the benefits described by researchers at Anthropic in their 2024 write-up on Contextual Retrieval. Given that this behavior comes built into Undatas.io’s RAG pipelines, it’s a nice touch that helps users get better accuracy without any additional post-processing of their extracted text.

However, the main thing holding Undatas.io back from an Excellent score in this category is the heading splitting. Undatas.io treated the page heading as a markdown heading:

Again, I think there’s an argument to be made that this behavior is a more accurate representation. However, when used with RAG I think it is slightly better to produce plain text rather than a heading here.

Non-English PDFs

Language models and benchmarks are often focused only on English content. Here we’ll look at a document in Arabic.

We’ll look at three main criteria here:

  1. How well does the extractor handle non-English and non-Latin character sets?

  2. How well does the extractor handle languages that are read right-to-left?

  3. How accurately does the extractor represent the text in this language?

Non-English Results

| Provider | Extractor | Score |
| --- | --- | --- |
| Unstructured.io | High-Res | Poor |
| LlamaParse | Accurate | Fair |
| Undatas.io | Accurate | Good |

Unstructured

Unstructured was able to extract text using the Arabic alphabet. However, it was not able to produce an accurate representation of the text: both the spelling of individual words and the flow of the text were incorrect.

We can see here that the spelling of words was mirrored, with the last letter starting each word and the first letter ending it. Likewise, Arabic flows from right to left, but Unstructured produced text that flowed from left to right.

For example the word: ﻣﻮاﻓﻘﺔ

Was extracted as: ﺔﻘﻓاﻮﻣ

Likewise, the entire first heading (which reads “consent to participate in a research study”) was: ﻣﻮاﻓﻘﺔ ﻋﻠﻰ اﻟﻤﺸﺎرآﺔ ﻓﻲ ﺑﺤﺚ دراﺳﻲ

Unstructured extracted it mirrored as: ﻲﺳارد ﺚﺤﺑ ﻲﻓ ﺔآرﺎﺸﻤﻟا ﻰﻠﻋ ﺔﻘﻓاﻮﻣ
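If you’re stuck with mirrored output like this, a crude repair is sometimes possible: normalize the presentation-form glyphs back to standard Arabic letters, then reverse each line. This is a heuristic sketch, not a general bidi solution, and it would mangle any embedded Latin text or digits.

```python
import unicodedata

def unmirror_arabic(line: str) -> str:
    # NFKC maps Arabic presentation forms (U+FB50..U+FEFF) back to base
    # letters; ligatures may expand into multiple characters.
    normalized = unicodedata.normalize("NFKC", line)
    # Undo the left-to-right mirroring by reversing the line.
    return normalized[::-1]
```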

Llama Parse

Llama Parse did a better job of preserving the correct spelling of words but also extracted the text from left to right instead of right to left.

We can see here that the first word of the first heading was correctly extracted as: ﻣﻮاﻓﻘﺔ

However, the original heading in its entirety was: ﻣﻮاﻓﻘﺔ ﻋﻠﻰ اﻟﻤﺸﺎرآﺔ ﻓﻲ ﺑﺤﺚ دراﺳﻲ

While Llama Parse extracted it backwards as: دراﺳﻲ ﺑﺤﺚ ﻓﻲ اﻟﻤﺸﺎرآﺔ ﻋﻠﻰ ﻣﻮاﻓﻘﺔ

Arguably, Llama Parse could be awarded a Poor score here; however, given that it at least got the spelling of words correct, I generously awarded it a Fair instead of a Poor.

Undatas.io

Undatas.io produced by far the best result. It extracted both the spelling of the words and the flow of the language correctly.

One slight gripe with the Undatas.io extraction is that it replaced the ordered numbering of the list with 1, 2, 3 when the original used i, ii, iii.

However, as this would likely not degrade any performance in a RAG system, I still awarded Undatas.io a score of Good in this category.

Complex Layout with Images

These are often scans of printed documents. Here we’ll be using pages from a children’s magazine about video gaming.

Here we want to see how the extractor handles logical chunks of related text. In the image above, that would include the highlighted sections on Fantasy Figures and Vocabulary as well as the main text of the page. Specifically we’ll assess:

  1. Is the extractor able to identify the relevant sections of the page and organize them into some reasonable markdown representation?

  2. Does the extractor recognize layout boundaries?

  3. How does the extractor handle images included in the page? Does it ignore them? Perform OCR? Perform image captioning of non-text images?

Complex Layout Results

| Provider | Extractor | Score |
| --- | --- | --- |
| Unstructured.io | High-Res | Fair |
| LlamaParse | Accurate | Good |
| Undatas.io | Accurate | Excellent |

Unstructured

Unstructured struggled with this test and produced mostly garbled results that lacked awareness of layout. For example, on this page, we have the main body text on the left with information boxes on the right:

Ideally, we would like to see an extractor recognize that these are separate sections and parse them accordingly. However, Unstructured was unable to do that here. The extraction output combined these two unrelated sections of text:

Text from different parts of the layout is combined to create an incoherent extraction. Passing this to your LLM for RAG is unlikely to produce very useful results, so I gave Unstructured a Fair rating here.

Llama Parse

Llama Parse did a good job of recognizing separate blocks of content. Here you can see that this page had two clear boxes of content:

Llama Parse did a nice job of handling this case, accurately separating them into sections in the markdown.

It’s debatable whether the content in the second block should really be represented as a table. From a RAG perspective, it’s not likely to cause issues. However, the fact that the first row is represented as a table heading could potentially confuse the LLM.

Also, Llama Parse did change the casing of the heading from the nearly-all-caps “FANTASY FIGURES: WoW IN NUMBERS” to the adjusted “FANTASY FIGURES: WoW in numbers”. You can also see this with the heading “VOCABULARY”. It also incorrectly shortened the heading “CLUB’S GAMING DICTIONARY” to just “Gaming Dictionary”.

Llama Parse did miss a few key items in this example. For instance, there was a prominent text heading in the middle of the page:

While Llama Parse was able to extract the heading and body text correctly, it missed this pretty major feature of the page.

Overall though, given the complex layout and how well Llama Parse did at extracting text from this page, I still gave it a very respectable score of Good in this category.

Undatas.io

Undatas.io also did very well in this example. It created clean sections for the individual blocks of content:

While Undatas.io did correctly extract the text and preserve the casing of the headings, it also injected a backslash into the text “2.8 million”, making it “2\.8 million”. When rendered, this correctly displays as 2.8 million; however, since “.” is not a special character in markdown, there’s no reason to escape it this way.
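If stray escapes like this bother your pipeline, a small post-processing pass can strip backslashes before characters that don’t need escaping. A sketch, with the character class tuned to your own documents:

```python
import re

# Remove backslashes before "." and "-", which rarely need escaping.
def strip_needless_escapes(markdown: str) -> str:
    return re.sub(r"\\([.\-])", r"\1", markdown)

print(strip_needless_escapes(r"2\.8 million"))  # -> 2.8 million
```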

Nitpicking aside, Undatas.io did a nice job of accurately extracting the text along with captions of images to help RAG systems understand the non-textual information contained on the page. The formatting of sections such as the Vocabulary section was lost, but from a RAG perspective this likely wouldn’t be detrimental:

And given that it preserved the casing and accurately captured all the content on the page, I’m going to score Undatas.io an Excellent in this category.

Scanned Documents

Here we want to look not just at cleanly scanned documents, but real world documents which can sometimes be messy. For example, images that were obtained by manual scans or via fax can be skewed and at times distorted. We’ll be using examples like this one to look at this criteria:

Here we’ll compare how each extractor handles common situations:

  1. Can the extractor accurately generate markdown representations of a scanned PDF?

  2. Can the extractor handle cases where documents are off center or skewed?

  3. Can the extractor handle cases where there’s a mixture of type written and handwritten content?

Scanned Documents Results

| Provider | Extractor | Score |
| --- | --- | --- |
| Unstructured.io | High-Res | Poor |
| LlamaParse | Accurate | Good |
| Undatas.io | Accurate | Excellent |

Unstructured

Unfortunately, Unstructured was unable to process this poorly scanned document. It generated a blank output and therefore is awarded a score of Poor for this category.

Llama Parse

Given the difficulty of this input, Llama Parse generated respectable but not perfect results. For example, we can see here the text at the top of the document versus the text generated by Llama Parse:

You can see that (Bharat Sarkar) was extracted as just (Bharat).

The original date of 19.02.2018 was misextracted as 19.10.2018.

The text “№2017/E(LR)I/NM1–10” came out slightly differently with an additional X and missing a slash.

We also see places in the body where the URL became truncated, leaving off the path after the main URL:

These discrepancies matter to a RAG system, given that the entire goal is to get accurate results from your LLM. If your extractor produces the wrong date and the wrong URL, you can imagine situations where this causes your LLM to give your users incorrect answers.

That said, this was a particularly challenging PDF, and given that the output was generally accurate, I still feel Llama Parse earns a score of Good here.

Undatas.io

Undatas.io passed this test with flying colors. It basically took a paper jam in a fax machine and produced everything exactly correct.

Comparing the same sections of text we looked at above, you can see that Undatas.io got all the details of the heading correct:

The full name, the date, and the text “№2017/E(LR)I/NM1–10” are all exactly in agreement with the original text.

Likewise, the content of the scan and the URL are all a word for word match. Undatas.io clearly is the best of the pack in this category and earns a score of Excellent.

Table-Heavy

One of the most common requirements people have when building RAG systems that require processing of PDFs is accurate representations of tabular data. For this, we’ll use SEC filings which have a number of challenging characteristics to them:

  • Tables aren’t clearly outlined with solid borders to indicate lines and columns.

  • Tables have indentation and cells that can span multiple columns.

  • Tables have rows that show totals for a set of rows above them.

Here we’ll be looking at the following behavior:

  1. Can the extractor accurately identify tables in content?

  2. How well does the extractor handle formatting within a table such as text that spans multiple columns?

  3. How well does the extractor handle tables that span across multiple pages?
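One practical way to check criteria like these is to parse the emitted pipe tables back into rows and spot-check the cells. Here’s a minimal sketch that handles only simple, well-formed markdown tables:

```python
# Parse a simple markdown pipe table into a list of row dicts.
def parse_pipe_table(table_md: str) -> list[dict[str, str]]:
    lines = [l for l in table_md.strip().splitlines() if l.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows
```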

Table-Heavy Results

| Provider | Extractor | Score |
| --- | --- | --- |
| Unstructured.io | High-Res | Fair |
| LlamaParse | Accurate | Excellent |
| Undatas.io | Accurate | Excellent |

Unstructured

Unstructured was able to extract the text from the quarterly report we used, but completely lost the formatting of tables. For instance, the table shown above was extracted as:

While accurate, it’s unlikely an LLM would correctly interpret these quarterly results.

Overall, the complete loss of layout information leads me to give Unstructured a score of Fair in this category.

Llama Parse

Llama Parse does a great job extracting tables in general.

In the original PDF, there were no clear column markers:

Given this input, I think Llama Parse made some reasonable artistic choices when generating this table. It’s a bit easier to see how Llama Parse handles these tables in the rendered markdown:

There are a couple of debatable characteristics of this output. First, the “Three Months Ended” doesn’t really seem to fit as the heading of column 2. That said, markdown is limited in its ability to handle things like column spanning cells so it’s a reasonable choice.

Likewise, the items in column 1 were on their own lines in the original; here they have been included on the same line with other data. Overall, I think this is still a good representation and would likely be handled just fine by the LLM if provided as context in a RAG system.

As we move deeper into the quarterly report, we find more subtle situations arise around table boundaries.

For example, in the section on consolidated balance sheets, we have what is arguably one big table:

However, Llama Parse broke it up into two, treating “Assets” as a table heading in the first table but promoting “Liability and Shareholders’ Equity” to a markdown heading, even though both were formatted the same in the original.

Interestingly, even though they’re not present in the original table, Llama Parse was smart enough to repeat the column headings at the top of the second, broken-out table.

Given the ambiguous nature of these tables, I have to say that Llama Parse still deserves a score of Excellent here.

Undatas.io

Undatas.io is neck and neck with Llama Parse in this example. While making slightly different decisions, Undatas.io also accurately preserved the table structure in a reasonable manner.

Here Undatas.io more accurately represented items that were on separate lines in the original table. However, it does lose the fact that the following lines were indented in the original to communicate a hierarchy in the data.

Looking at the longer table that we examined above with Llama Parse, we can see here that Undatas.io also split this up into two tables, carrying forward the table headers. Undatas.io did a slightly better job of consistently representing the headings compared to Llama Parse:

Conclusion

Each of the PDF extractors I looked at here have their strengths and weaknesses and different price points. Depending on your requirements, all three could be a very good choice for your next RAG system.

Looking at the results from each category, my experience was as follows:

| Category | Unstructured.io High-Res | Llama Parse Accurate | Undatas.io Accurate |
| --- | --- | --- | --- |
| Plain Text | Excellent | Excellent | Excellent |
| Multi-Column | Excellent | Fair | Good |
| Non-English | Poor | Fair | Good |
| Complex Layout | Fair | Good | Excellent |
| Poorly Scanned Document | Poor | Good | Excellent |
| Table-Heavy | Fair | Excellent | Excellent |

It’s impossible to test every scenario you could encounter when dealing with PDF processing, so I recommend trying all three of these options yourself and seeing how your results compare to the ones I got in my tests.
