Document Loading and Chunking

Purpose

Load documents and chunk their content as completely as possible, preserving content integrity for subsequent retrieval and ensuring the accuracy of Q&A.

Loader Handling for Each File Format

| File Extension | Extraction Logic |
| --- | --- |
| .pdf | Default: extract text and images page by page. OCR mode: extract text and convert to Markdown |
| .docx | Convert to Markdown, removing headers and footers |
| .doc | Convert to .docx for processing |
| .xlsx / .xls | Concatenate the content of each sheet and each cell |
| .pptx / .ppt | Convert to PDF for processing |
| .txt | Read text directly |
| .md | Read text directly |
| .htm / .html | Not supported yet |
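The per-format handling above amounts to dispatching on the file extension. A minimal sketch of such a dispatcher follows; the function names and registry are illustrative assumptions, not the product's actual API, and only the trivial .txt/.md path is implemented here:

```python
from pathlib import Path


def read_text(path: str) -> str:
    """Loader for .txt and .md: read the file's text directly."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical loader registry; the commented entries stand in for the
# heavier converters described in the table above.
LOADERS = {
    ".txt": read_text,
    ".md": read_text,
    # ".pdf": load_pdf,    # page-by-page extraction, or OCR -> Markdown
    # ".docx": load_docx,  # convert to Markdown, strip headers/footers
    # ".doc": load_doc,    # convert to .docx first
    # ".xlsx": load_xlsx,  # concatenate sheets and cells
    # ".pptx": load_pptx,  # convert to PDF first
}


def load_document(path: str) -> str:
    """Dispatch to the loader matching the file extension."""
    ext = Path(path).suffix.lower()
    if ext in (".htm", ".html"):
        raise NotImplementedError("HTML is not supported yet")
    if ext not in LOADERS:
        raise ValueError(f"unsupported file type: {ext}")
    return LOADERS[ext](path)
```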

Chunking Strategies

Fixed-Length Chunking

The most straightforward strategy is to split based on document length. This simple and effective method ensures that each chunk does not exceed the set size limit. The main advantages of length-based splitting include:

  1. Simple to implement
  2. Consistent chunk size
  3. Easily adaptable to different model requirements
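A minimal sketch of a character-based fixed-length splitter (the function name and default sizes are illustrative, not a fixed API); an overlap between consecutive chunks is commonly added so that context straddling a boundary is not lost:

```python
def split_fixed_length(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    sharing `overlap` characters between consecutive chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # advance by this much each iteration
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```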

Recursive Chunking

Natural language text is usually composed of hierarchical units such as paragraphs, sentences, and words. We can leverage this structure to formulate splitting strategies that preserve semantic coherence while adapting to different levels of text granularity. Recursive chunking prioritizes keeping larger units (such as paragraphs) intact. If a unit exceeds the chunk size limit, splitting proceeds to the next level (such as sentences); if further splitting is still needed, it continues down to the word level.
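The hierarchy can be encoded as an ordered list of separators (paragraph break, line break, space) that the splitter works through recursively. This is a simplified sketch of that idea (comparable in spirit to LangChain's RecursiveCharacterTextSplitter, but the name and signature here are assumptions):

```python
def split_recursive(text, chunk_size=200, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first (paragraphs), recursing to
    finer ones (lines, words) only for pieces that are still too large.
    Falls back to a hard character cut when no separator remains."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        parts = text.split(sep)
        chunks, current = [], ""
        for part in parts:
            candidate = part if not current else current + sep + part
            if len(candidate) <= chunk_size:
                current = candidate  # merge small pieces back together
            elif len(part) <= chunk_size:
                chunks.append(current)
                current = part
            else:
                if current:
                    chunks.append(current)
                    current = ""
                # piece too big for this level: recurse with finer separators
                chunks.extend(split_recursive(part, chunk_size, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # no separator applies: hard cut by length
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```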

Semantic Chunking

Unlike the previous methods, semantic-based splitting focuses on the content of the text itself. Where other methods use document or text structure as an indirect proxy for semantics, this approach analyzes the semantic content directly. There are various ways to implement it, but the core idea is to split wherever the meaning of the text changes significantly. For example, we can use a sliding window technique to generate text embeddings and compare them to detect major semantic differences:

  1. Generate embeddings for the initial group of sentences.
  2. Slide the window to the next group of sentences and generate new embeddings.
  3. Compare these embeddings; a significant difference indicates a possible semantic "breakpoint."

This technique can produce more semantically coherent text chunks, which may improve the quality of downstream tasks such as retrieval or summarization.
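A minimal sketch of the breakpoint idea, using a window of one sentence: adjacent sentence embeddings are compared by cosine similarity, and a new chunk starts when similarity drops below a threshold. The bag-of-words `embed` function here is a toy stand-in for a real embedding model (e.g., a sentence-transformer), and the threshold value is an assumption to be tuned per corpus:

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: a bag-of-words count vector over lowercase tokens.
    A production system would call a real embedding model instead."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_semantic(sentences: list[str], threshold: float = 0.3) -> list[str]:
    """Group consecutive sentences; start a new chunk whenever the
    similarity between adjacent embeddings falls below `threshold`."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))  # semantic breakpoint found
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```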