Skip to main content

Preprocessing

The preprocessing Pipeline is mainly used to define how documents are processed when they are added to the knowledge base. It takes effect automatically during document upload and ingestion, including document parsing, text chunking, vectorization, and other stages.

Users can customize processing strategies based on document type or business requirements to meet differentiated processing needs for multi-source document ingestion, ensuring that knowledge content is correctly parsed, segmented, and indexed during ingestion, thereby improving the recall quality of subsequent retrieval.

Usage

A knowledge base can be associated with multiple preprocessing Pipelines to adapt to the processing requirements of different file types. When files are uploaded, the system matches applicable preprocessing rules in order. If none are matched, it falls back to the default Pipeline.

  • Platform built-in default rules: The system provides out-of-the-box default preprocessing Pipelines, which can be used directly or imported for reference.
  • Customization and override: Supports creating new custom Pipelines, and also allows copying the default Pipeline for modification; the default Pipeline supports deletion.
  • Rule matching mechanism: Preprocessing rules are matched in priority order. Once a match is found, the corresponding process is executed; if no match is found, default processing is used.

Recommendation: After adjusting preprocessing configurations, you can verify the processing effect by uploading test files.

Create a preprocessing Pipeline

  1. On the preprocessing list page, click the "Add" button to open the creation window.
  2. Fill in the basic information:
    • Name: The name of the preprocessing Pipeline.
    • Enabled: After checking, the preprocessing takes effect and can be associated with a knowledge base for use.
    • Description: Supplementary notes on the applicable scenarios or configuration points of this preprocessing.
  3. Click "Confirm" to complete creation, and the system will automatically navigate to the preprocessing editing canvas interface.

Detailed node functions

After entering the canvas editing interface, you can drag the required nodes from the node library onto the canvas and connect them to form a complete file preprocessing workflow.

The node library is divided into the following categories by function: Text Extraction, Text Chunking, Field Extraction, Post-processing, Plugins, and Data Processing.

Note:

  • A corresponding storage node must be added at the end of each preprocessing workflow to ensure that the processing results of each stage are correctly persisted to the database.
  • For more detailed descriptions of each node, click the "" icon in the upper-right corner of any node's configuration page to view the documentation.

Text extraction nodes

Extract raw text content from various file formats as the basis for subsequent processing.

Node NameFunction Description
Store File TextStore the content extracted from the file into the database.
DOCX File TextUse the pandoc library to extract content from docx files.
Video File TextExtract content from video files.
Image Description GenerationExtract content from image files.
Audio File TextExtract content from audio files.
Spreadsheet File TextUse the pandas library to extract content from spreadsheet files.
PDF File TextUse the pypdf library to extract content from PDF files.
Markdown File TextExtract content from markdown files.
TXT File TextExtract content from txt files.
Azure-OCR Parse PDF FileUse Azure Document Intelligence layout/read mode to extract content. Only .pdf format is supported, and noise data can be cleaned automatically.
Multimodal LLM Parse PDF FileUse the LLM OCR model to extract content.
Spire File ConversionUse the Spire library for file format conversion.
LibreOffice File ConversionUse the LibreOffice library for file format conversion.

Text chunking nodes

Split extracted long text into multiple paragraphs or fragments according to a specified strategy to facilitate subsequent indexing and retrieval.

Node NameFunction Description
Fixed Character Count ChunkingSplit documents by a fixed size.
Fixed Character Count Splitting (with Page Number Information)Split documents by a fixed size while carrying page start position information.
Spreadsheet File ChunkingSplit spreadsheet documents into paragraphs.
Page-based ChunkingSplit documents into paragraphs by page.
Heading-based ChunkingSplit documents into paragraphs by headings.
Store File SegmentsStore segmented data into the database.

Field extraction nodes

Extract key information from document content or metadata to generate summaries, keywords, or structured fields.

Node NameFunction Description
Store Paragraph Enhancement DataStore extended enhancement data of document paragraphs into the document index.
Store Document MetadataStore extracted document information into the document index.
Metadata ExtractionUse LLM to extract metadata from documents.
Keyword ExtractionUse LLM to extract keywords from each paragraph of the document.
Paragraph Metadata ExtractionUse LLM to extract metadata from each document paragraph.
Paragraph Summary GenerationSummarize paragraphs.
Advanced Table Summary GenerationUse LLM to generate table-level summaries and group-level narrative summaries.
Image Description GenerationUse image descriptions to enhance paragraphs.
Document Summary GenerationSummarize the entire document.
Table Row Record Summary GenerationUse table descriptions to enhance paragraphs.

Post-processing nodes

Perform subsequent processing such as tokenization and vectorization on chunked text to complete preparation before indexing.

Node NameFunction Description
Store Chunk TokensStore segmented token metadata into the database.
Tokenize Chunks Based on SpaCyUse the SpaCy tokenizer for tokenization.
Vectorize Chunk Data and StoreUse a model to embed paragraphs and store the embedding vectors into the vector database.

Data processing nodes

Provide workflow control and variable processing capabilities for building more complex preprocessing logic.

Node NameFunction Description
Variable AggregatorGroup and aggregate multiple variables into output variables. Supports two strategies: "take the first non-empty value" and "merge into a list". Aggregation behavior is dynamically configured through set_output_mapping().
Conditional NodePerform branch control of the workflow based on conditions. The condition evaluation logic is handled externally by the pipeline engine, and the node itself does not produce output data.
TemplateUse Jinja2 template syntax to process and format variables.

Trial Run

After configuration is completed, you can verify whether the preprocessing workflow executes as expected through the trial run feature. The system supports uploading files locally or selecting files from the knowledge base for testing.

Note: To ensure testing efficiency, it is recommended that uploaded files do not exceed 5MB in size and 20 pages in length.

  • View logs: Click "View Logs" to expand the detailed input and output content of each node, making it easier to troubleshoot issues node by node and accurately locate the specific stage where processing exceptions occur.
  • Segment preview: Supports previewing processed text segments, allowing intuitive judgment of whether the effects of chunking, extraction, and other stages meet expectations.
  • Data download: Due to display limitations, the preview area shows only the first 10 records by default. If complete data is needed, click the "Download" button to obtain all processing results.