Chunking Text for RAG: Improving Information Retrieval with Semantic Approaches


In the era of advanced AI and machine learning, effective document processing is crucial for optimizing information retrieval systems. Undatas.io provides powerful tools for analyzing and transforming documents, enabling users to extract meaningful insights from vast amounts of text. By leveraging techniques such as semantic chunking, we can significantly enhance the performance of Retrieval-Augmented Generation (RAG) systems.
After a document is processed through our platform, the extracted content is transformed into a structured Markdown format.
Sample Markdown Text
### 2. Related Work
### 2.1. Proprietary Commercial MLLMs
Large language models (LLMs) [1, 4, 7, 8, 11, 25, 104, 106, 108, 112, 113, 122, 123, 141] have greatly advanced AGI by enabling complex language tasks previously thought human-exclusive. Building on this, the development of proprietary commercial MLLMs represents a significant evolution. For example, OpenAI’s GPT-4V [87] extends GPT-4’s capabilities by incorporating visual inputs, allowing it to handle both text and image content, which stands as a significant development in the domain of MLLMs. Afterward, Google’s Gemini series progresses from Gemini 1.0 [107] to Gemini 1.5 [92], enhancing MLLMs with the ability to process text, images, and audio and support up to 1 million tokens, which boosts performance significantly. The Qwen-VL-Plus/Max are Alibaba’s leading models in the Qwen-VL series [5], renowned for superior capacity in multimodal tasks without needing OCR tools. Recent advancements in proprietary MLLMs include Anthropic’s Claude-3V series [3], HyperGAI’s HPT Pro [35], Apple’s MM1 [84], StepFun’s Step-1V [102], and xAI’s Grok-1.5V [125].
### 2.2. Open-Source MLLMs
The development of open-source MLLMs [2, 13, 43, 48, 51, 55, 56, 69, 70, 103, 110, 118, 120, 124, 138, 139] has significantly influenced the AGI landscape by integrating and enhancing capabilities in processing both visual and textual data. Over the past year, many open-source MLLMs have become well-known, including the LLaVA series [62–64], MiniGPT-4 [142], VisionLLM [116], Qwen-VL [5], CogVLM [117], Shikra [15], and others [18, 23, 90, 119]. However, these models are typically trained on images with small, fixed resolutions such as $336\!\times\!336$ or $448\!\times\!448$, which leads to sub-optimal performance on images with unusual aspect ratios or document data. To address this issue, many approaches have been explored for training on high-resolution images. Currently, there are two common technical routes: one involves designing a dual-branch image encoder [32, 53, 76, 77, 121], and the other involves dividing a high-resolution image into many low-resolution tiles [24, 33, 47, 55, 57, 64, 68, 126, 127]. Despite these explorations in high-resolution training, these open-source models still exhibit significant gaps in understanding documents, charts, and infographics, as well as recognizing scene texts, compared to leading commercial models.
### 2.3. Vision Foundation Models for MLLMs
Vision foundation models (VFMs) are a focal point of research within the MLLM community. Currently, models like CLIP-ViT [91] and SigLIP [136] are prevalently utilized; however, many studies have been conducted to find the most suitable vision encoders for MLLMs [57, 71, 76, 111]. For instance, Tong et al. [111] observed notable differences in the visual patterns of CLIP and DINOv2 [88], leading to the development of a mixture-of-features module that combines these two VFMs. LLaVA-HR [76] introduced a dual-branch vision encoder utilizing CLIP-ViT for low-resolution pathways and CLIP-ConvNext for high-resolution pathways. Similarly, DeepSeek-VL [71] adopted a dual vision encoder design, using SigLIP-L for low-resolution images and SAM-B for high-resolution images. In this report, we propose a continuous learning strategy for our vision foundation model, InternViT-6B [18], which continuously boosts the visual understanding capabilities and can be transferred and reused across different LLMs.

Figure 3. Overall Architecture. InternVL 1.5 adopts the ViT-MLP-LLM architecture similar to popular MLLMs [62, 64], combining a pre-trained InternViT-6B [18] with InternLM2-20B [11] through an MLP projector. Here, we employ a simple pixel shuffle to reduce the number of visual tokens to one-quarter.
### 3. InternVL 1.5
### 3.1. Overall Architecture
As illustrated in Figure 3, InternVL 1.5 employs an architecture akin to widely-used open-source MLLMs, specifically the “ViT-MLP-LLM” configuration referenced in various existing studies [18, 23, 62–64, 71, 142]. Our implementation of this architecture integrates a pre-trained InternViT-6B [18] with a pre-trained InternLM2-20B [11] using a randomly initialized MLP projector.
During training, we implemented a dynamic resolution strategy, dividing images into tiles of $448\!\times\!448$ pixels in numbers ranging from 1 to 12, based on the aspect ratio and resolution of the input images. During testing, this can be zero-shot scaled up to 40 tiles (i.e., 4K resolution). To enhance scalability for high resolution, we simply employed a pixel shuffle operation to reduce the number of visual tokens to one-quarter of the original. Therefore, in our model, a $448\!\times\!448$ image is represented by 256 visual tokens.
Rendered Markdown
This conversion not only enhances readability but also allows for easier integration into various applications and systems. By utilizing Markdown, users can benefit from a clean and organized presentation of information, making it simpler to share and collaborate on findings.
This blog explores innovative strategies for chunking text, focusing on how these approaches can improve the quality of information retrieval and reduce the occurrence of irrelevant results.
Context
Text splitting, or chunking, is usually the first step in a RAG (Retrieval-Augmented Generation) workflow. It simply means transforming long text documents into smaller chunks that are embedded, indexed, and stored, then later used for information retrieval.
A typical RAG system
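To make that workflow concrete, here is a minimal sketch of the embed-index-retrieve loop. The sentence-transformers library, the all-MiniLM-L6-v2 model, and the plain in-memory list standing in for a vector index are all illustrative assumptions, not part of any particular platform:

```python
# A minimal sketch of the chunk -> embed -> index -> retrieve loop.
# The model choice and the in-memory "index" are illustrative only.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def index_chunks(chunks: list[str]) -> np.ndarray:
    # Embed every chunk once; a real system would persist these vectors.
    return model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    # With normalized embeddings, the dot product is cosine similarity.
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q
    return [chunks[i] for i in np.argsort(-scores)[:k]]

chunks = [
    "Cats sleep most of the day.",
    "Python is a programming language.",
    "RAG retrieves relevant chunks before generation.",
]
vectors = index_chunks(chunks)
print(retrieve("How does retrieval augmented generation work?", chunks, vectors, k=1))
```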
Some “naive chunking strategies” include:
- Size-based chunking: you simply split the document into chunks of a specific size, regardless of semantics.
- Paragraph-based chunking: you split your document on “end of paragraph” markers such as “\n\n”, “\n”, or “;”. This is an approximation of semantic chunking, where you assume (or hope) that each paragraph holds semantically distinct information. A minimal sketch of both strategies follows.
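Both naive strategies fit in a few lines of Python; the chunk size below is an arbitrary example value, and the “\n\n” separator is the one mentioned above:

```python
def size_based_chunks(text: str, size: int = 500) -> list[str]:
    # Fixed-size slices, blind to sentence and paragraph boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

def paragraph_chunks(text: str) -> list[str]:
    # Split on blank lines ("\n\n"); other separators work the same way.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```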
Problem
The chunking approaches mentioned above are purely syntactic. However, you would prefer to split your document into semantically distinct chunks.
Why?
Because at retrieval (query) time, you want to return the chunk or chunks that are semantically closest to your query. If your chunks are not semantically distinct enough, you may return information that was not asked about in the query, leading to lower-quality results and a higher LLM hallucination rate.
Solution: Semantic Chunking
The following are some strategies for more useful text splitting.
Sentence clustering-based chunking (needs a better name!):
The idea is to build your semantic chunks from the ground up.
- Start by splitting your document into sentences. A sentence is usually a semantic unit, as it contains a single idea about a single topic.
- Embed the sentences.
- Cluster close sentences together to form chunks, while respecting sentence order (see the sketch after the figure below).
Semantic chunking — semantic sentence clustering
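Here is one possible greedy, order-preserving implementation of this idea. The regex sentence splitter and the 0.6 similarity threshold are illustrative stand-ins; a production system would use a proper sentence tokenizer and tune the threshold on its own data:

```python
# Greedy, order-preserving sentence clustering into semantic chunks.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text: str, threshold: float = 0.6) -> list[str]:
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [0]
    for i in range(1, len(sentences)):
        # Compare each sentence to the running centroid of the open chunk;
        # a similarity drop below the threshold starts a new chunk.
        centroid = vectors[current].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        if float(vectors[i] @ centroid) >= threshold:
            current.append(i)
        else:
            chunks.append(" ".join(sentences[j] for j in current))
            current = [i]
    chunks.append(" ".join(sentences[j] for j in current))
    return chunks
```

The greedy centroid comparison keeps sentence order by construction: a chunk only ever grows at its right edge, and a semantic break starts a new chunk.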
Propositional chunking
The idea is to iteratively build chunks with the help of an external LLM.
- Start with a syntactic chunking pass, e.g., paragraph-based.
- For each paragraph, generate standalone statements (propositions) using an LLM, with a prompt like “What are the topics discussed in this text?” Propositions must be semantically self-contained, distinct statements.
- Remove redundant propositions.
- Index and store the generated propositions.
- At query time, retrieve from the propositions corpus instead of the original documents corpus (a sketch of the generation step follows the figure below).
Semantic chunking — Propositional chunking
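Here is a sketch of the proposition-generation and deduplication steps, using the OpenAI chat API. The model name, the prompt wording, and the exact-string deduplication are illustrative assumptions; the paper linked below describes a more rigorous formulation:

```python
# One possible proposition-generation step via an LLM call.
# Model name and prompt are illustrative, not a fixed recipe.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Decompose the following text into standalone, self-contained "
    "statements (propositions). Return one proposition per line.\n\n{text}"
)

def propositions(paragraph: str, model: str = "gpt-4o-mini") -> list[str]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=paragraph)}],
    )
    lines = response.choices[0].message.content.splitlines()
    # Deduplicate while preserving order, per the "remove redundant
    # propositions" step; exact matching is the simplest possible filter.
    seen, out = set(), []
    for line in (l.strip() for l in lines):
        if line and line not in seen:
            seen.add(line)
            out.append(line)
    return out
```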
This paper proposes a propositional chunking algorithm similar to the one described here: https://arxiv.org/pdf/2312.06648.pdf
Conclusion
Naively splitting documents into chunks may result in suboptimal performance in downstream tasks like Q&A. We discussed two semantic chunking approaches that can greatly improve the quality of your RAG system.
📖See Also
- Undatas-io-2025-New-Upgrades-and-Features
- UndatasIO-Feature-Upgrade-Series1-Layout-Recognition-Enhancements
- UndatasIO-Feature-Upgrade-Series2-OCR-Multilingual-Expansion
- Undatas-io-Feature-Upgrade-Series3-Advanced-Table-Processing-Capabilities
- Undatas-io-2025-New-Upgrades-and-Features-French
- Undatas-io-2025-New-Upgrades-and-Features-Korean