
Document Loading and Chunking

Purpose

To load documents and segment their content while preserving it as completely as possible, enabling comprehensive retrieval and accurate question answering.

Loader Processing for Each File Format

| File Extension | Processing Logic |
| --- | --- |
| .pdf | Default: extract text and images page by page; OCR: convert extracted text to MD format |
| .docx | Convert to MD format, remove headers and footers |
| .doc | Convert to .docx for processing |
| .xlsx / .xls | Concatenate content from each sheet and cell |
| .pptx / .ppt | Convert to PDF for processing |
| .txt | Read text directly |
| .md | Read text directly |
| .htm / .html | Not currently supported |
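
As a rough sketch of how this dispatch might look in code, the snippet below routes a file to a loader based on its extension. The function names are illustrative only; apart from the plain-text branch, the format-specific loaders are left as placeholders rather than the actual implementation.

```python
from pathlib import Path

def read_text(path: Path) -> str:
    """.txt and .md files are read directly as text."""
    return path.read_text(encoding="utf-8")

def load_document(path_str: str) -> str:
    """Dispatch to a format-specific loader based on the file extension (illustrative sketch)."""
    path = Path(path_str)
    suffix = path.suffix.lower()
    if suffix in {".txt", ".md"}:
        return read_text(path)
    if suffix in {".htm", ".html"}:
        raise ValueError(f"HTML files are not currently supported: {path}")
    if suffix in {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"}:
        # A real pipeline would call the corresponding parser here
        # (PDF/Office extraction, conversion to Markdown-like text, etc.).
        raise NotImplementedError(f"{suffix} loader not shown in this sketch")
    raise ValueError(f"Unsupported file extension: {suffix}")
```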

Chunking Strategies

Fixed-Length Chunking

The simplest strategy is to split the text by length. This straightforward and effective approach ensures that each chunk stays within a specified size limit.
Key advantages of length-based splitting include:

  1. Simple implementation
  2. Consistent chunk size
  3. Easy to adapt to the requirements of different models
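
A minimal sketch of fixed-length chunking with optional overlap is shown below; the function name and parameter defaults are illustrative, not taken from any particular library.

```python
def split_by_length(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` characters,
    with `overlap` characters shared between consecutive chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Example: 1,200 characters with chunk_size=500 and overlap=50
# yields chunks covering [0:500], [450:950], and [900:1200].
```

The overlap keeps some shared context between neighboring chunks, which often helps retrieval near chunk boundaries.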

Recursive Chunking

Natural language text is usually composed of hierarchical units, such as paragraphs, sentences, and words. We can leverage this structure to create a chunking strategy that maintains semantic coherence while adapting to different levels of text granularity. This strategy prioritizes keeping larger units (such as paragraphs) intact. If a unit exceeds the chunk size limit, the chunking proceeds to the next level (such as sentences). If further splitting is needed, it continues down to the word level.
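
The sketch below illustrates this recursion using character-count limits and a fixed separator hierarchy (paragraphs, lines, sentences, words). It is a simplified illustration rather than a production implementation.

```python
def recursive_split(text: str, chunk_size: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Recursively split `text`, preferring larger units (paragraphs) and only
    descending to finer separators (sentences, words) when a piece is too long."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

A production splitter would typically also merge adjacent small pieces back together so that chunks stay close to the size limit; the recursion above only captures the core descent from coarser to finer units.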

Semantic Chunking

Unlike the previous methods, semantic-based chunking focuses more on the content of the text. While other methods use document or text structure to represent semantics indirectly, this method directly analyzes the semantic content of the text. There are various ways to implement this approach, but the core idea is to split the text when there is a significant change in meaning. For example, we can use a sliding window technique to generate text embeddings and compare them to identify major semantic differences:

  1. First, generate embeddings for the initial sentences.
  2. Then, slide the window to the next set of sentences and generate new embeddings.
  3. Compare these embeddings to detect significant differences, which may indicate a semantic "breakpoint."

This technique can produce more semantically coherent text chunks, potentially improving the performance of downstream tasks such as retrieval or summarization.
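
The sketch below illustrates the sliding-window idea, assuming the caller supplies an embedding function (for example, a sentence-embedding model's encode method). The function names, window size, and similarity threshold are all illustrative choices, not part of a specific implementation.

```python
import numpy as np

def semantic_split(sentences: list[str], embed, window: int = 3,
                   threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences into chunks, starting a new chunk whenever
    the cosine similarity between adjacent sliding windows drops below `threshold`.
    `embed` is any callable mapping a string to a 1-D numpy vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    chunks, current = [], []
    prev_vec = None
    for i in range(len(sentences)):
        window_text = " ".join(sentences[i:i + window])
        vec = embed(window_text)
        if prev_vec is not None and cosine(prev_vec, vec) < threshold:
            # Significant semantic change: close the current chunk here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In practice the threshold is often tuned per corpus, or breakpoints are placed at percentiles of the observed similarity drops rather than at a fixed value.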