# Document Loading and Chunking

## Purpose

Load documents and split their content into chunks that preserve as much context as possible, so that subsequent retrieval remains complete and Q&A answers remain accurate.

## Loader Handling for Each File Format
| File Extension | Extraction Logic |
|---|---|
| .pdf (default) | Extract text and images page by page |
| .pdf (OCR) | Extract text via OCR and convert to MD |
| .docx | Convert to MD format, remove headers and footers |
| .doc | Convert to .docx for processing |
| .xlsx/.xls | Concatenate content from each sheet and each cell |
| .pptx/.ppt | Convert to PDF for processing |
| .txt | Read text directly |
| .md | Read text directly |
| .htm/.html | Not supported yet |
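The per-format handling in the table can be sketched as an extension-to-loader dispatch map. The loader functions below are hypothetical placeholders standing in for real extraction code (a PDF parser, a docx-to-Markdown converter, and so on), not a specific library's API:

```python
from pathlib import Path

# Hypothetical loaders, one per format family in the table above.
def load_pdf(path: Path) -> str:
    return f"pdf text from {path.name}"  # would extract text/images page by page

def load_docx(path: Path) -> str:
    return f"markdown from {path.name}"  # would convert to MD, drop headers/footers

def load_excel(path: Path) -> str:
    return f"cells from {path.name}"  # would concatenate each sheet and cell

def load_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")  # read directly

LOADERS = {
    ".pdf": load_pdf,
    ".docx": load_docx,
    ".doc": load_docx,   # in practice: convert .doc -> .docx first
    ".xlsx": load_excel,
    ".xls": load_excel,
    ".pptx": load_pdf,   # in practice: convert slides -> PDF first
    ".ppt": load_pdf,
    ".txt": load_text,
    ".md": load_text,
}

def load_document(path: str) -> str:
    p = Path(path)
    loader = LOADERS.get(p.suffix.lower())
    if loader is None:  # covers .htm/.html and anything else unsupported
        raise ValueError(f"Unsupported file type: {p.suffix}")
    return loader(p)
```

Routing `.doc` and `.ppt/.pptx` through the converters' target loaders mirrors the "convert for processing" rows of the table.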
## Chunking Strategies

### Fixed-Length Chunking

The most straightforward strategy is to split the document by length. This simple and effective method guarantees that no chunk exceeds the configured size limit. The main advantages of length-based splitting are:
- Simple to implement
- Consistent chunk size
- Easily adaptable to different model requirements
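A minimal character-based sketch of fixed-length chunking. The `overlap` parameter (carrying a slice of each chunk into the next to soften cut-off context) is a common addition, not something stated above:

```python
def fixed_length_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so a sentence cut at a
    boundary still appears intact in one of the two chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping the overlap
    return chunks
```

The same loop works for token-based limits by splitting on a tokenizer's output instead of raw characters.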
### Recursive Chunking

Natural language text is usually composed of hierarchical units: paragraphs, sentences, and words. Recursive chunking leverages this structure to keep chunks semantically coherent while adapting to different levels of text granularity. It first tries to keep the largest units (such as paragraphs) intact; if a unit exceeds the chunk size limit, it splits at the next level down (such as sentences), and, if still necessary, continues to the word level.
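The paragraph-then-sentence-then-word descent can be sketched as a recursive splitter over an ordered list of separators. The separator list and size limit here are illustrative assumptions:

```python
def recursive_chunks(
    text: str,
    chunk_size: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text hierarchically: paragraphs first, then sentences, then words."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No structural units left: fall back to a hard length split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep = separators[0]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # unit still fits; keep accumulating
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # This unit alone is too big: descend to the next level.
                chunks.extend(recursive_chunks(part, chunk_size, separators[1:]))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks
```

Greedily re-merging small units back up to the size limit (the `candidate` step) is what keeps intact paragraphs together instead of emitting one chunk per sentence.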
### Semantic Chunking

Unlike the previous methods, semantic chunking focuses on the content of the text itself. Where other methods use document or text structure as an indirect proxy for meaning, this approach analyzes the semantics directly. There are many ways to implement it, but the core idea is to split wherever the meaning of the text changes significantly. For example, a sliding-window technique can compare text embeddings to detect major semantic shifts:

1. Generate embeddings for the first group of sentences.
2. Slide the window to the next group of sentences and generate new embeddings.
3. Compare consecutive embeddings; a large difference marks a likely semantic "breakpoint."

This technique produces more semantically coherent chunks, which can improve the quality of downstream tasks such as retrieval and summarization.
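The sliding-window comparison above can be sketched as follows. To stay self-contained, `embed` here is a toy bag-of-words vector standing in for a real embedding model (e.g. a sentence transformer), and the similarity threshold is an illustrative assumption:

```python
import math
import re

def embed(sentence: str) -> dict[str, float]:
    # Toy stand-in for a real embedding model: lowercase word counts.
    vec: dict[str, float] = {}
    for word in re.findall(r"\w+", sentence.lower()):
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a.get(k, 0.0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.1) -> list[str]:
    """Group consecutive sentences; cut where adjacent embeddings diverge."""
    if not sentences:
        return []
    chunks: list[str] = []
    current = [sentences[0]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        cur = embed(sent)
        if cosine(prev, cur) < threshold:  # semantic "breakpoint" detected
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev = cur
    chunks.append(" ".join(current))
    return chunks
```

With a real embedding model the same loop applies unchanged; only `embed` (and a threshold tuned on real similarity scores) would differ.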