Preprocessing

A preprocessing Pipeline defines how documents are processed when they are added to the knowledge base. It runs automatically during document upload and ingestion and covers stages such as document parsing, text chunking, and vectorization.

Users can customize processing strategies by document type or business requirement, so that documents from different sources are each parsed, segmented, and indexed correctly during ingestion. This improves recall quality in subsequent retrieval.

Usage

A knowledge base can be associated with multiple preprocessing Pipelines to adapt to the processing requirements of different file types. When files are uploaded, the system matches applicable preprocessing rules in order. If none are matched, it falls back to the default Pipeline.

- Platform built-in default rules: The system provides out-of-the-box default preprocessing Pipelines, which can be used directly or imported for reference.
- Customization and override: You can create new custom Pipelines or copy the default Pipeline and modify the copy. The default Pipeline can also be deleted.
- Rule matching mechanism: Preprocessing rules are matched in priority order. The first rule that matches determines the Pipeline that is executed; if no rule matches, default processing is used.
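
The matching behavior described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `PipelineRule` class, the `select_pipeline` function, the numeric `priority` field (lower value is checked first), and matching on file extension are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineRule:
    """Hypothetical rule: a Pipeline name, a priority, and a file-name predicate."""
    name: str
    priority: int  # assumption: a lower number is checked earlier
    matches: Callable[[str], bool]


def select_pipeline(filename: str, rules: list[PipelineRule], default: str = "default") -> str:
    """Return the first matching rule's Pipeline in priority order, else the default."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(filename):
            return rule.name
    return default


# Hypothetical rules for two file types.
rules = [
    PipelineRule("spreadsheets", 2, lambda f: f.lower().endswith((".xlsx", ".csv"))),
    PipelineRule("pdf-ocr", 1, lambda f: f.lower().endswith(".pdf")),
]

print(select_pipeline("report.PDF", rules))  # pdf-ocr
print(select_pipeline("notes.txt", rules))   # default
```

Because only the first match is used, a broad rule with a high priority can shadow more specific rules below it, which is worth checking when several Pipelines are associated with one knowledge base.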

Recommendation: After adjusting a preprocessing configuration, upload test files to verify the processing result.

Create a preprocessing Pipeline

On the preprocessing list page, click the "Add" button to open the creation window.
Fill in the basic information:
- Name: The name of the preprocessing Pipeline.
- Enabled: When checked, the preprocessing Pipeline takes effect and can be associated with a knowledge base.
- Description: Supplementary notes on the applicable scenarios or key configuration points of this preprocessing Pipeline.
Click "Confirm" to complete creation. The system then navigates automatically to the preprocessing editing canvas.

Detailed node functions

After entering the canvas editing interface, you can drag the required nodes from the node library onto the canvas and connect them to form a complete file preprocessing workflow.

The node library is divided into the following categories by function: Text Extraction, Text Chunking, Field Extraction, Post-processing, Plugins, and Data Processing.

Note:

A corresponding storage node must be added at the end of each preprocessing workflow to ensure that the processing results of each stage are correctly persisted to the database.

For a more detailed description of a node, click the icon in the upper-right corner of that node's configuration page to view its documentation.

Text extraction nodes

Extract raw text content from various file formats as the basis for subsequent processing.

Node Name	Function Description
Store File Text	Store the content extracted from the file into the database.
DOCX File Text	Use the pandoc library to extract content from docx files.
Video File Text	Extract content from video files.
Image Description Generation	Extract content from image files.
Audio File Text	Extract content from audio files.
Spreadsheet File Text	Use the pandas library to extract content from spreadsheet files.
PDF File Text	Use the pypdf library to extract content from PDF files.
Markdown File Text	Extract content from markdown files.
TXT File Text	Extract content from txt files.
Azure-OCR Parse PDF File	Use Azure Document Intelligence layout/read mode to extract content. Only `.pdf` format is supported, and noise data can be cleaned automatically.
Multimodal LLM Parse PDF File	Use a multimodal LLM as the OCR model to extract content.
Spire File Conversion	Use the Spire library for file format conversion.
LibreOffice File Conversion	Use the LibreOffice library for file format conversion.

Text chunking nodes

Split extracted long text into multiple paragraphs or fragments according to a specified strategy to facilitate subsequent indexing and retrieval.

Node Name	Function Description
Fixed Character Count Chunking	Split documents by a fixed size.
Fixed Character Count Splitting (with Page Number Information)	Split documents by a fixed size while carrying page start position information.
Spreadsheet File Chunking	Split spreadsheet documents into paragraphs.
Page-based Chunking	Split documents into paragraphs by page.
Heading-based Chunking	Split documents into paragraphs by headings.
Store File Segments	Store segmented data into the database.
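
As a rough illustration of what fixed character count chunking does, the sketch below splits a string into consecutive pieces of at most `size` characters. The function name `chunk_fixed` and its parameters are hypothetical; the actual node may expose additional options that are not shown here.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Step through the text in strides of `size`; the last chunk may be shorter.
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = chunk_fixed("abcdefghij", 4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

Fixed-size splitting ignores document structure, so a chunk boundary can fall in the middle of a sentence. Page-based or heading-based chunking is usually a better fit when the source has reliable page or heading information.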

Field extraction nodes

Extract key information from document content or metadata to generate summaries, keywords, or structured fields.

Node Name	Function Description
Store Paragraph Enhancement Data	Store extended enhancement data of document paragraphs into the document index.
Store Document Metadata	Store extracted document information into the document index.
Metadata Extraction	Use LLM to extract metadata from documents.
Keyword Extraction	Use LLM to extract keywords from each paragraph of the document.
Paragraph Metadata Extraction	Use LLM to extract metadata from each document paragraph.
Paragraph Summary Generation	Summarize paragraphs.
Advanced Table Summary Generation	Use LLM to generate table-level summaries and group-level narrative summaries.
Image Description Generation	Use image descriptions to enhance paragraphs.
Document Summary Generation	Summarize the entire document.
Table Row Record Summary Generation	Use table descriptions to enhance paragraphs.

Post-processing nodes

Perform subsequent processing such as tokenization and vectorization on chunked text to complete preparation before indexing.

Node Name	Function Description
Store Chunk Tokens	Store segmented token metadata into the database.
Tokenize Chunks Based on SpaCy	Use the SpaCy tokenizer for tokenization.
Vectorize Chunk Data and Store	Use a model to embed paragraphs and store the embedding vectors into the vector database.

Data processing nodes

Provide workflow control and variable processing capabilities for building more complex preprocessing logic.

Node Name	Function Description
Variable Aggregator	Group and aggregate multiple variables into output variables. Supports two strategies: "take the first non-empty value" and "merge into a list". Aggregation behavior is dynamically configured through `set_output_mapping()`.
Conditional Node	Perform branch control of the workflow based on conditions. The condition evaluation logic is handled externally by the pipeline engine, and the node itself does not produce output data.
Template	Use Jinja2 template syntax to process and format variables.
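
The two Variable Aggregator strategies can be pictured with the sketch below. It illustrates the strategies only and does not reproduce the node's `set_output_mapping()` configuration; the helper names and the definition of "empty" (None or a zero-length string, list, tuple, or dict) are assumptions, as is the choice to skip empty values when merging.

```python
from typing import Any, Optional


def is_empty(value: Any) -> bool:
    """Assumed definition of empty: None or a zero-length container or string."""
    return value is None or (isinstance(value, (str, list, tuple, dict)) and len(value) == 0)


def first_non_empty(values: list[Any]) -> Optional[Any]:
    """Strategy 1: return the first value that is not empty, or None if all are empty."""
    for value in values:
        if not is_empty(value):
            return value
    return None


def merge_into_list(values: list[Any]) -> list[Any]:
    """Strategy 2: collect all non-empty values into one list, preserving order."""
    return [value for value in values if not is_empty(value)]


# Hypothetical outputs from three upstream extraction branches.
branch_outputs = [None, "", "text from OCR branch", "text from fallback branch"]

print(first_non_empty(branch_outputs))  # text from OCR branch
print(merge_into_list(branch_outputs))  # ['text from OCR branch', 'text from fallback branch']
```

"Take the first non-empty value" suits workflows where only one conditional branch is expected to produce output, while "merge into a list" suits workflows where several branches each contribute results.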

Trial Run

After completing the configuration, use the trial run feature to verify that the preprocessing workflow executes as expected. You can test with files uploaded from your local machine or with files selected from the knowledge base.

Note: To keep trial runs fast, it is recommended that uploaded files be no larger than 5 MB and no longer than 20 pages.

- View logs: Click "View Logs" to expand the detailed input and output of each node. This makes it easier to troubleshoot node by node and locate the specific stage where a processing exception occurs.
- Segment preview: Preview the processed text segments to judge whether chunking, extraction, and other stages produce the expected results.
- Data download: Due to display limitations, the preview area shows only the first 10 records by default. To obtain the complete data, click the "Download" button to download all processing results.