Document Loading and Chunking

Purpose

Load documents and chunk their content as completely as possible, preserving content integrity for subsequent retrieval and ensuring the accuracy of Q&A.

Loader Handling for Each File Format

| File Extension | Extraction Logic |
| --- | --- |
| .pdf | Default: extract text and images page by page. OCR mode: extract text and convert to Markdown |
| .docx | Convert to Markdown, removing headers and footers |
| .doc | Convert to .docx for processing |
| .xlsx / .xls | Concatenate the content of each sheet and each cell |
| .pptx / .ppt | Convert to PDF for processing |
| .txt | Read text directly |
| .md | Read text directly |
| .htm / .html | Not supported yet |
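The per-format handling above amounts to dispatching on the file extension. A minimal sketch of such a dispatcher follows; the function names and registry are illustrative assumptions, not the product's actual API, and only the trivial .txt/.md path is implemented here:

```python
from pathlib import Path


def read_text(path: str) -> str:
    """Loader for .txt and .md: read the file's text directly."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical loader registry; the commented entries stand in for the
# heavier converters described in the table above.
LOADERS = {
    ".txt": read_text,
    ".md": read_text,
    # ".pdf": load_pdf,    # page-by-page extraction, or OCR -> Markdown
    # ".docx": load_docx,  # convert to Markdown, strip headers/footers
    # ".doc": load_doc,    # convert to .docx first
    # ".xlsx": load_xlsx,  # concatenate sheets and cells
    # ".pptx": load_pptx,  # convert to PDF first
}


def load_document(path: str) -> str:
    """Dispatch to the loader matching the file extension."""
    ext = Path(path).suffix.lower()
    if ext in (".htm", ".html"):
        raise NotImplementedError("HTML is not supported yet")
    if ext not in LOADERS:
        raise ValueError(f"unsupported file type: {ext}")
    return LOADERS[ext](path)
```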

Chunking Strategies

Fixed-Length Chunking

The most straightforward strategy is to split based on document length. This simple and effective method ensures that each chunk does not exceed the set size limit. The main advantages of length-based splitting include:

  1. Simple to implement
  2. Consistent chunk size
  3. Easily adaptable to different model requirements
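A minimal sketch of a character-based fixed-length splitter (the function name and default sizes are illustrative, not a fixed API); an overlap between consecutive chunks is commonly added so that context straddling a boundary is not lost:

```python
def split_fixed_length(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    sharing `overlap` characters between consecutive chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # advance by this much each iteration
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```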

Recursive Chunking

Natural language text is usually composed of hierarchical units such as paragraphs, sentences, and words. We can leverage this structure to formulate splitting strategies that preserve semantic coherence while adapting to different levels of text granularity. Recursive chunking prioritizes keeping larger units (such as paragraphs) intact. If a unit exceeds the chunk size limit, splitting proceeds to the next level (such as sentences); if further splitting is still needed, it continues down to the word level.
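The hierarchy can be encoded as an ordered list of separators (paragraph break, line break, space) that the splitter works through recursively. This is a simplified sketch of that idea (comparable in spirit to LangChain's RecursiveCharacterTextSplitter, but the name and signature here are assumptions):

```python
def split_recursive(text, chunk_size=200, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first (paragraphs), recursing to
    finer ones (lines, words) only for pieces that are still too large.
    Falls back to a hard character cut when no separator remains."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        parts = text.split(sep)
        chunks, current = [], ""
        for part in parts:
            candidate = part if not current else current + sep + part
            if len(candidate) <= chunk_size:
                current = candidate  # merge small pieces back together
            elif len(part) <= chunk_size:
                chunks.append(current)
                current = part
            else:
                if current:
                    chunks.append(current)
                    current = ""
                # piece too big for this level: recurse with finer separators
                chunks.extend(split_recursive(part, chunk_size, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # no separator applies: hard cut by length
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```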

Semantic Chunking

Unlike the previous methods, semantic-based splitting focuses on the content of the text itself. Where other methods use document or text structure as an indirect proxy for semantics, this approach analyzes the semantic content directly. There are various ways to implement it, but the core idea is to split wherever the meaning of the text changes significantly. For example, we can use a sliding window technique to generate text embeddings and compare them to detect major semantic differences:

  1. Generate embeddings for the initial group of sentences.
  2. Slide the window to the next group of sentences and generate new embeddings.
  3. Compare these embeddings; a significant difference indicates a possible semantic "breakpoint."

This technique can produce more semantically coherent text chunks, which may improve the quality of downstream tasks such as retrieval or summarization.
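A minimal sketch of the breakpoint idea, using a window of one sentence: adjacent sentence embeddings are compared by cosine similarity, and a new chunk starts when similarity drops below a threshold. The bag-of-words `embed` function here is a toy stand-in for a real embedding model (e.g., a sentence-transformer), and the threshold value is an assumption to be tuned per corpus:

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: a bag-of-words count vector over lowercase tokens.
    A production system would call a real embedding model instead."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_semantic(sentences: list[str], threshold: float = 0.3) -> list[str]:
    """Group consecutive sentences; start a new chunk whenever the
    similarity between adjacent embeddings falls below `threshold`."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))  # semantic breakpoint found
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```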