Improving the Response Quality of RAG Systems: Excel and TXT Document Parsing

xll · 10 min read

In the previous article, we discussed why enterprise private data should be turned into knowledge, the variety of media it is stored on, and how to parse DOC and DOCX documents. In this article, we turn to Excel and TXT documents, focusing on how to organize knowledge into these formats and how to parse them.

Which kinds of knowledge are suitable for storage in Excel?

The biggest advantage of a RAG (Retrieval-Augmented Generation) system is that it can quickly turn the knowledge stored in all kinds of documents into an enterprise knowledge base without any system integration. However, the documents held by different departments are scattered and come in many formats, which makes document parsing the biggest challenge for a RAG system: if the documents are fed in without any prior organization and the system alone is left to parse and identify them, it is hard to end up with a high-quality knowledge base.

In some scenarios, Excel and TXT documents are a good way to organize this data. Manual screening and sorting ensures the knowledge is complete and accurate in structure, so that when the documents are imported into the RAG system they form high-quality "chunks". After vectorization, these chunks achieve high recall in retrieval, which in turn helps the RAG system generate high-quality answers.

So which kinds of knowledge are worth sorting manually? After all, consuming large amounts of manpower defeats the purpose of a RAG system. Based on feedback from enterprises, we have summarized the business scenarios that are best suited to being organized into Excel documents:

  1. Company rules and regulations, especially for group companies or enterprises where large departments have independent regulations.
  2. Enterprise financial data (with proper internal access control).
  3. Customer information management.
  4. Market research data.
  5. Knowledge that enterprises and institutions need to study frequently (sorted according to chapter content).
  6. Common knowledge about family life (for example, how to clean a toilet).

There are many other scenarios that we will not list one by one. What these scenarios have in common is that the knowledge consists mainly of text, supplemented by pictures and tables, so once it has been sorted out manually it can be reused indefinitely, which justifies the labor-intensive approach.

How to Parse Excel and TXT Documents

Excel Document Parsing

The core of Excel document parsing lies in extracting tabular data from spreadsheet files (such as .xls or .xlsx) and transforming it into data structures that can be used for further processing (such as Python lists and dictionaries).

Principles of Excel Document Parsing:

  1. File Reading and Parsing Libraries: By using libraries like openpyxl, pandas, xlrd, etc., Python programs can directly read Excel files and operate on the data within them. These libraries are capable of parsing the basic elements of Excel files, such as rows, columns, and cells.
  2. Multi-sheet and Multi-dimensional Parsing: Excel usually contains multiple worksheets. When parsing, it is necessary to handle the situation of multiple tables. For each worksheet, the parsing program reads the data row by row and column by column and converts it into a structure that can be manipulated by the program.
  3. Cell Data Type Identification: The data in Excel cells may be text, numbers, dates, formulas, etc. When parsing, these data types must be handled correctly to ensure the accuracy of the extracted information (see the short sketch after this list).
  4. Data Cleaning and Standardization: During the parsing process, data cleaning is usually also required, such as removing blank rows, duplicate rows, and unifying the data format. This helps with subsequent semantic segmentation and the construction of the knowledge base.
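
As a small illustration of point 3 above, the sketch below shows how openpyxl exposes each cell's value and data type. The file name is a placeholder, and `data_only=True` asks openpyxl to return cached formula results rather than formula strings:

from openpyxl import load_workbook

wb = load_workbook('your_excel_file.xlsx', data_only=True)
ws = wb.active

for row in ws.iter_rows(min_row=1, max_row=5):
    for cell in row:
        # cell.value may be str, int, float, datetime or None;
        # cell.data_type is a one-letter code such as 's' (string) or 'n' (number)
        print(cell.coordinate, type(cell.value).__name__, cell.data_type, cell.value)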

Example Code for Excel Document Parsing

The following code demonstrates how to use pandas and openpyxl to parse an Excel file and extract its data into a Python DataFrame for processing:

import pandas as pd

def parse_excel(file_path):
    # Open the workbook and enumerate all of its worksheets
    xls = pd.ExcelFile(file_path)
    all_data = {}

    for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name=sheet_name)
        df.dropna(how='all', inplace=True)   # drop rows that are completely empty
        df.drop_duplicates(inplace=True)     # drop duplicate rows
        all_data[sheet_name] = df            # keyed by worksheet name
    return all_data

file_path = 'your_excel_file.xlsx'
parsed_data = parse_excel(file_path)

print(parsed_data[list(parsed_data.keys())[0]].head())

Code Explanation:

  1. pandas.ExcelFile: First, open the Excel file with a `pandas.ExcelFile` object to obtain all of its worksheets.
  2. pd.read_excel(): Read each worksheet in turn and convert its data into a `DataFrame` object for subsequent processing.
  3. Data Cleaning: Remove blank rows through the `dropna()` method and remove duplicate data through the `drop_duplicates()` method.
  4. Data Storage: The data of each worksheet is stored in a dictionary, with the worksheet name as the key and the data as the value.
  5. Result Display: You can view the parsed data through the `head()` method.
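
As a side note on the design, pandas can also read every worksheet in a single call by passing `sheet_name=None`, which returns a dictionary keyed by sheet name. The snippet below is a minimal equivalent of the loop above (the file name is a placeholder):

# Reading all worksheets at once: sheet_name=None returns {sheet name: DataFrame}
all_sheets = pd.read_excel('your_excel_file.xlsx', sheet_name=None)
for name, df in all_sheets.items():
    print(name, df.shape)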

The above is the parsing principle and a simple example of Excel. Next, we will present a piece of real parsing code in enterprise RAG.

# Assumed imports for this snippet; ExcelParser and the callback convention come
# from the surrounding project and are not shown here.
import re
from io import BytesIO
from openpyxl import load_workbook


class Excel(ExcelParser):
    def __call__(self, fnm, binary=None, callback=None):
        if not binary:
            wb = load_workbook(fnm)
        else:
            wb = load_workbook(BytesIO(binary))
        total = 0
        for sheetname in wb.sheetnames:
            total += len(list(wb[sheetname].rows))

        res, fails = [], []
        # The remaining extraction logic is omitted here for brevity.

def rmPrefix(txt):
    return re.sub(
        r"^(user|assistant|Q|A|Question|Answer|)[\t::]+", "", txt.strip(), flags=re.IGNORECASE)

This code snippet defines a class named `Excel` that inherits from `ExcelParser` and is used to extract question and answer (Q&A) pairs from Excel files, along with a callback function for progress feedback. Here is a detailed explanation of the code:

1. `__call__`method

This method implements `__call__`, so an instance of the class can be invoked like a function; the file is processed by passing in the Excel file name or binary data.

Parameters:

  • fnm: The file name, representing the path of the Excel file.
  • binary: The binary data of the Excel file. If provided, it will be used instead of the file path.
  • callback: The callback function used for real-time feedback on the processing progress.

Loading the workbook:

  • wb = load_workbook(fnm): When binary data is not passed in, the Excel file is loaded using the file path.
  • wb = load_workbook(BytesIO(binary)): If binary data is passed in, it is wrapped into a `BytesIO` object and parsed using `load_workbook`.

2. Counting the total number of rows.

total = 0
for sheetname in wb.sheetnames:
    total += len(list(wb[sheetname].rows))

Calculate the total number of rows in all tables by looping through each worksheet so that it can be used for progress calculation later.

3. Process the row data in the worksheets

Extract question and answer pairs by traversing the rows of each worksheet: Use `q` and `a` to represent the question and the answer respectively. Read the value of each cell in each row in sequence. If the question (`q`) and the answer (`a`) are found, add them to the `res` list. If a question or an answer is not correctly found in a certain row, the index of that row will be added to the `fails` list.

if q and a:
    res.append((q, a))
else:
    fails.append(str(i + 1))

Every time 999 question-and-answer pairs have been processed, the `callback` function is invoked once to report the current progress; a sketch of what this traversal loop might look like is shown below.
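
The loop itself is omitted from the snippet above; here is a minimal, hypothetical version. It assumes the question sits in the first column, the answer in the second, and that `callback` accepts a progress fraction and a message; the actual project code may differ.

res, fails = [], []
for sheetname in wb.sheetnames:
    ws = wb[sheetname]
    for i, row in enumerate(ws.rows):
        # Assumption: column 1 holds the question, column 2 holds the answer
        q = row[0].value if len(row) > 0 else None
        a = row[1].value if len(row) > 1 else None
        if q and a:
            res.append((q, a))
        else:
            fails.append(str(i + 1))
        # Periodically report progress; 0.6 is the weight of this stage (see below)
        if callback and res and len(res) % 999 == 0:
            callback(0.6 * len(res) / total, "")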

4. Callback function

The `callback` function is responsible for updating the progress and displaying the processing progress, showing the number of successfully processed Q&A pairs and the row numbers of the processing failures. When calculating the progress, divide the number of successfully processed rows by the total number of rows and multiply by 0.6, which represents a processing weight of 60%.

5. Language detection

After the extraction of question and answer pairs is completed, use the `is_english` method to detect whether the extracted questions are in English. `random_choices(res, k = 30)` will randomly select 30 questions from `res` for detection to determine whether it is an English document.
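
The `is_english` helper itself is part of the project and is not shown here. As a purely hypothetical illustration, a check of this kind can be as simple as sampling some of the extracted questions and measuring the proportion of ASCII letters, as sketched below; the real helper may work quite differently.

import random

def looks_english(texts, threshold=0.8):
    # Hypothetical stand-in for is_english: sample up to 30 entries and treat the
    # batch as English if most characters are ASCII letters or whitespace.
    sample = random.choices(texts, k=min(30, len(texts)))
    joined = "".join(sample)
    if not joined:
        return False
    hits = sum(1 for ch in joined if ch.isascii() and (ch.isalpha() or ch.isspace()))
    return hits / len(joined) >= threshold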

6. `rmPrefix` method

This method is used to remove specific prefixes in front of questions and answers, such as “Question”, “Answer”, etc., to ensure that the format of the extracted question and answer pairs is clean.
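
For example, applied to lines that carry such prefixes (the strings here are purely illustrative), `rmPrefix` strips the prefix and returns the bare text:

print(rmPrefix("Q:How do I apply for annual leave?"))
# -> How do I apply for annual leave?
print(rmPrefix("Answer:Submit the request in the HR system."))
# -> Submit the request in the HR system.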

Summary of the code’s functionality:

This class is used to extract question and answer pairs from Excel files and comes with a progress feedback feature. The main steps include:

  1. Loading the Excel file.
  2. Traversing the worksheets and extracting the question and answer pairs.
  3. Performing data cleaning on the extracted question and answer pairs (such as removing prefixes).
  4. Providing real-time processing progress and feeding it back to the caller through the callback function.

Parsing of TXT Documents

TXT documents are the simplest type to parse, because they are plain text files without structured information (tables, pictures, styles, and so on). In a RAG (Retrieval-Augmented Generation) system, parsing a TXT document mainly involves text segmentation, cleaning, tokenization, and semantic extraction.

  1. Reading the file:

    Use Python’s built-in functions or other libraries like `open()` to read the text content.

  2. Data cleaning:

    This includes removing blank lines, invalid symbols or dealing with format issues, such as punctuation marks, line breaks, etc.

  3. Segmentation:

    Segment the document by lines, paragraphs or preset rules (such as semantics).

  4. Tokenization and preprocessing:

    Tokenize the text and apply lemmatization or stemming as needed.

  5. Semantic extraction:

    Use NLP (Natural Language Processing) models (such as Word2Vec, BERT) to conduct vectorization processing on the text and capture its semantic information.

  6. Chunk segmentation:

    Divide the text into small pieces of content suitable for retrieval by the RAG system according to preset rules (such as semantic similarity).

Code Example

The following is a simple Python-based example code for parsing TXT files, which is used to parse TXT documents into semantic chunks and prepare them for the Retrieval-Augmented Generation (RAG) task.

import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


class TXTParser:
    def __init__(self, file_path):
        self.file_path = file_path
        self.text_data = self.read_file()

    def read_file(self):
        if not os.path.exists(self.file_path):
            raise FileNotFoundError(f"{self.file_path} does not exist.")

        with open(self.file_path, 'r', encoding='utf-8') as f:
            return f.readlines()

    def clean_data(self, lines):
        cleaned_lines = [line.strip() for line in lines if line.strip()]
        return cleaned_lines

    def vectorize_text(self, cleaned_data):
        vectorizer = TfidfVectorizer(stop_words='english')
        vectors = vectorizer.fit_transform(cleaned_data)
        return vectors

    def semantic_clustering(self, vectors, num_clusters=5):
        # Cap the cluster count at the number of lines so KMeans does not fail on short files
        num_clusters = min(num_clusters, vectors.shape[0])
        kmeans = KMeans(n_clusters=num_clusters, random_state=42)
        kmeans.fit(vectors)
        return kmeans.labels_

    def create_chunks(self, cleaned_data, labels):
        chunks = {}
        for i, label in enumerate(labels):
            if label not in chunks:
                chunks[label] = []
            chunks[label].append(cleaned_data[i])
        return chunks

    def parse(self):
        cleaned_data = self.clean_data(self.text_data)
        vectors = self.vectorize_text(cleaned_data)
        labels = self.semantic_clustering(vectors)
        chunks = self.create_chunks(cleaned_data, labels)
        return chunks

if __name__ == "__main__":
    parser = TXTParser('example.txt')
    chunks = parser.parse()

    for cluster, lines in chunks.items():
        print(f"Chunk {cluster}:")
        for line in lines:
            print(line)
        print("-" * 50)  # separator between chunks, not between lines

Code Explanation

  • Reading the file (read_file):

    Read the content of the file and return a list of all lines.

  • Data cleaning (clean_data):

    Remove blank lines and excessive white spaces.

  • Text vectorization (vectorize_text):

    Use the TF-IDF (Term Frequency-Inverse Document Frequency) model to convert each line of text into a vector, facilitating subsequent semantic analysis.

  • Semantic clustering (semantic_clustering):

    Use the K-means clustering algorithm to cluster text vectors, categorizing sentences or paragraphs with similar semantics into the same group.

  • Generating chunks (create_chunks):

    Based on the clustering labels, group similar text paragraphs to generate the final chunks.
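
The example above uses TF-IDF for simplicity, while step 5 of the workflow mentions models such as Word2Vec and BERT. The sketch below shows one way the `vectorize_text` step could be swapped for dense sentence embeddings; the `sentence-transformers` package and the model name used here are assumptions rather than part of the original example.

from sentence_transformers import SentenceTransformer

def vectorize_text_embeddings(cleaned_data, model_name='all-MiniLM-L6-v2'):
    # encode() returns one dense vector per input line; the resulting array can be
    # passed to KMeans.fit() in place of the TF-IDF matrix.
    model = SentenceTransformer(model_name)
    return model.encode(cleaned_data)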
