Document Loading and Chunking
Purpose
Load documents and segment their content while preserving as much of it as possible, so that retrieval is comprehensive and question answering is accurate.
Loader Processing for Each File Format
| File Extension | Processing Logic |
|---|---|
| .pdf | Default: extract text and images page by page; OCR mode: convert the extracted text to Markdown |
| .docx | Convert to Markdown, removing headers and footers |
| .doc | Convert to .docx, then process as .docx |
| .xlsx/.xls | Concatenate the content of each sheet and cell |
| .pptx/.ppt | Convert to PDF, then process as PDF |
| .txt | Read the text directly |
| .md | Read the text directly |
| .htm/.html | Not currently supported |
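The table above implies a dispatch on file extension. Below is a minimal sketch of such a loader registry in Python; the library choices (pypdf, python-docx, openpyxl) and the function names are illustrative assumptions, not the project's actual implementation.

```python
from pathlib import Path

def load_txt(path: Path) -> str:
    # .txt / .md: read the text directly
    return path.read_text(encoding="utf-8")

def load_pdf(path: Path) -> str:
    # Default PDF mode: extract text page by page (pypdf as a stand-in parser)
    from pypdf import PdfReader
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def load_docx(path: Path) -> str:
    # .docx: body paragraphs only, so headers and footers are dropped
    from docx import Document
    doc = Document(str(path))
    return "\n".join(p.text for p in doc.paragraphs)

def load_xlsx(path: Path) -> str:
    # .xlsx: concatenate the content of each sheet and cell
    # (legacy .xls would need a separate reader such as xlrd)
    from openpyxl import load_workbook
    wb = load_workbook(str(path), read_only=True, data_only=True)
    cells = []
    for ws in wb.worksheets:
        for row in ws.iter_rows(values_only=True):
            cells.extend(str(v) for v in row if v is not None)
    return "\n".join(cells)

LOADERS = {
    ".txt": load_txt,
    ".md": load_txt,
    ".pdf": load_pdf,
    ".docx": load_docx,
    ".xlsx": load_xlsx,
}

def load_document(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext not in LOADERS:
        # .doc and .pptx/.ppt are first converted to .docx / PDF;
        # .htm/.html is not currently supported.
        raise ValueError(f"Unsupported or conversion-required extension: {ext}")
    return LOADERS[ext](Path(path))
```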
Chunking Strategies
Fixed-Length Chunking
The simplest strategy splits the text by length. This straightforward and effective approach guarantees that no chunk exceeds a specified size limit; a minimal sketch follows the list below.
Key advantages of length-based splitting include:
- Simple implementation
- Consistent chunk size
- Easy to adapt to the requirements of different models
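A minimal sketch of length-based splitting, measuring length in characters (a token-based limit works the same way):

```python
def fixed_length_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```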
Recursive Chunking
Natural language text is usually composed of hierarchical units, such as paragraphs, sentences, and words. We can leverage this structure to create a chunking strategy that maintains semantic coherence while adapting to different levels of text granularity. This strategy prioritizes keeping larger units (such as paragraphs) intact. If a unit exceeds the chunk size limit, the chunking proceeds to the next level (such as sentences). If further splitting is needed, it continues down to the word level.
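A simplified sketch of the recursive strategy follows. The separator choices (blank lines for paragraphs, ". " for sentences, spaces for words) and the character-based size limit are illustrative assumptions; production splitters such as LangChain's RecursiveCharacterTextSplitter follow the same idea.

```python
def recursive_chunks(text: str, chunk_size: int = 500,
                     separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split text hierarchically: paragraphs first, then sentences, then words.

    A unit is kept intact if it fits within chunk_size; otherwise it is split
    again using the next, finer-grained separator.
    """
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No finer unit left: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # The unit itself is too large; recurse with finer separators.
                chunks.extend(recursive_chunks(piece, chunk_size, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```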
Semantic Chunking
Unlike the previous methods, semantic-based chunking focuses more on the content of the text. While other methods use document or text structure to represent semantics indirectly, this method directly analyzes the semantic content of the text. There are various ways to implement this approach, but the core idea is to split the text when there is a significant change in meaning. For example, we can use a sliding window technique to generate text embeddings and compare them to identify major semantic differences:
- First, generate embeddings for the initial sentences.
- Then, slide the window to the next set of sentences and generate new embeddings.
- Compare these embeddings to detect significant differences, which may indicate a semantic "breakpoint."
This technique can produce more semantically coherent text chunks, potentially improving the performance of downstream tasks such as retrieval or summarization.
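A minimal sketch of the sliding-window comparison, assuming an `embed` callable that maps a string to a vector (any sentence-embedding model will do); the window size and similarity threshold below are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences: list[str], embed, window: int = 3,
                    threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, starting a new chunk at semantic breakpoints.

    A breakpoint is declared when the similarity between adjacent sliding
    windows of sentences drops below `threshold`.
    """
    if len(sentences) <= window:
        return [" ".join(sentences)]
    # Embedding of each sliding window of `window` consecutive sentences.
    window_embs = [embed(" ".join(sentences[i:i + window]))
                   for i in range(len(sentences) - window + 1)]
    chunks, current = [], list(sentences[:window])
    for i in range(1, len(window_embs)):
        sim = cosine_similarity(window_embs[i - 1], window_embs[i])
        new_sentence = sentences[i + window - 1]
        if sim < threshold:
            # Significant semantic shift: close the current chunk.
            chunks.append(" ".join(current))
            current = [new_sentence]
        else:
            current.append(new_sentence)
    chunks.append(" ".join(current))
    return chunks
```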