In RAG systems built on DPR (Dense Passage Retrieval), chunking is a critical step of the offline indexing stage and has a significant impact on the final quality of the RAG system.

Chunking divides a document into smaller, more manageable pieces, so that the online retrieval stage only needs to return the relevant chunks rather than the whole document, keeping the input within the LLM's context window. However, returning chunks that are highly relevant to a query presupposes that the document was split sensibly in the first place. Otherwise, useful information may be drowned out by noise from poorly placed chunk boundaries, or its semantics and context may be broken apart, either of which degrades retrieval effectiveness.

This article surveys the main chunking strategies, covering their underlying ideas, implementation details, and usage.

(Figure: where document chunking sits in the RAG pipeline)

Chunking Strategies

1. Fixed-size Chunking

This is the simplest, most brute-force strategy: the document is cut into blocks of a fixed size, with no regard for its content or structure. The fixed size can be measured in characters, tokens, words, or sentences. It is the easiest approach to implement, but its drawback is equally obvious: it readily breaks semantic completeness and coherence, and the finer the granularity, the worse the damage. To mitigate this, adjacent chunks usually share some overlap (chunk overlap); this introduces some data redundancy, yet can only partially repair the semantic damage done at chunk boundaries.

Within fixed-size chunking, besides the unit of splitting, the chunk size (chunk_size) and the overlap between chunks (chunk_overlap) are the main factors that affect quality. As a rule of thumb, smaller chunks and larger overlaps tend to improve retrieval metrics (Recall/Precision/F1, etc.).

In LangChain, CharacterTextSplitter implements fixed-size chunking by character count; a minimal sketch follows.
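A minimal CharacterTextSplitter sketch, assuming splitting on blank-line separators; the sizes below are illustrative choices, not tuned recommendations:

```python
from langchain.text_splitter import CharacterTextSplitter

text = (
    "Chunking splits documents into retrievable units.\n\n"
    "Chunk size and overlap control how much context each unit keeps."
)

splitter = CharacterTextSplitter(
    separator="\n\n",     # split on blank lines
    chunk_size=100,       # maximum characters per chunk
    chunk_overlap=20,     # characters shared by adjacent chunks
    length_function=len,
)
chunks = splitter.split_text(text)
```

The Chonkie library can be used for the same idea measured in tokens. The following code chunks a document into fixed-size token chunks, using the same tokenizer as GPT-2: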

```python
from chonkie import TokenChunker

chunker = TokenChunker(
    tokenizer="gpt2",  # Supports string identifiers
    chunk_size=16,     # Maximum tokens per chunk
    chunk_overlap=4    # Overlap between chunks
)

text = """In Chinese mythology, there are five islands in the Bohai Sea, inhabited by immortal beings who have discovered the elixir of life. Many have searched for the islands, but no one have yet found them. I came close, however, four years ago this very week, when I travelled deep into the Nevada desert, to go to Burning Man
"""

chunks = chunker(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({chunk.token_count} tokens): \n{chunk.text}")
```

Output:

```
Chunk 0 (16 tokens):
In Chinese mythology, there are five islands in the Bohai Sea, inhabited by
Chunk 1 (16 tokens):
Sea, inhabited by immortal beings who have discovered the elixir of life. Many
Chunk 2 (16 tokens):
of life. Many have searched for the islands, but no one have yet found
Chunk 3 (16 tokens):
one have yet found them. I came close, however, four years ago this
Chunk 4 (16 tokens):
four years ago this very week, when I travelled deep into the Nevada desert,
Chunk 5 (10 tokens):
the Nevada desert, to go to Burning Man
```

2. Recursive Chunking

Recursive chunking improves on fixed-size chunking. The basic idea is to split the document hierarchically and iteratively, using separator characters of different priorities: if a piece produced at one level still violates the chunk_size constraint, it is split again at the next, finer-grained level; meanwhile, small adjacent pieces are merged back together.

LangChain provides the recursive splitter class RecursiveCharacterTextSplitter, which by default uses ["\n\n", "\n", " ", ""] as the separator hierarchy. Usage:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    length_function=len,
)
# split_text expects a string (use split_documents for Document lists)
chunks = text_splitter.split_text(text)
```
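To make the separator hierarchy concrete, here is a small sketch (the input text is made up for illustration). Because "\n\n" is tried before "\n" and " ", the blank line below should be preferred as a chunk boundary over a mid-sentence cut:

```python
text = (
    "RAG systems retrieve relevant context before generating an answer.\n\n"
    "Chunking decides what a retrievable unit looks like, and poor "
    "boundaries can bury the evidence a query needs."
)

# Each paragraph fits within chunk_size, so each becomes its own chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=0)
for chunk in splitter.split_text(text):
    print(repr(chunk))
```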

3. Document-structure-based Chunking

This approach uses the document's structure, formatting, and content flow to decide chunk boundaries (it is not applicable to documents that lack a clear structure).

For example, a Markdown file can be split on its heading markers, and an HTML file on paragraph tags (<p>), and so on. LangChain provides splitters tailored to different document types; a Markdown sketch is shown below.
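As a minimal sketch, LangChain's MarkdownHeaderTextSplitter splits on a configurable set of heading markers (the headers_to_split_on mapping below is an illustrative choice):

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = "# Title\n\nIntro paragraph.\n\n## Section 1\n\nSection details."

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")],
)
# Each returned Document keeps its enclosing headers in metadata,
# preserving the document structure for retrieval
md_docs = markdown_splitter.split_text(markdown_text)
```

For source code, LangChain has preset separators built in for a number of programming languages. For example, the following code demonstrates how to split Python code: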

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

# Inspect the built-in separators for Python
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
```

Output:

```
[Document(page_content='def hello_world():\n    print("Hello, World!")'),
 Document(page_content='# Call the function\nhello_world()')]
```

For more details, see the LangChain guide How to split code.

For special document structures such as tables, or non-text modalities such as images, audio, and video, one option is to have an LLM produce a textual summary first and then chunk the summary.
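As a rough sketch of this summarize-then-chunk idea (the prompt and model choice here are illustrative, reusing the ChatGroq setup from the agentic example later in this article):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq

# A Markdown table that would chunk poorly as raw text
table_md = (
    "| year | revenue |\n"
    "|------|---------|\n"
    "| 2023 | 10M     |\n"
    "| 2024 | 14M     |"
)

summarize = ChatPromptTemplate.from_messages([
    ("system", "Summarize the following table in plain prose, keeping every figure."),
    ("human", "{table}"),
]) | ChatGroq(model="llama-3.1-70b-versatile", temperature=0)

# The prose summary can then go through any of the text chunkers above
summary = summarize.invoke({"table": table_md}).content
```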

4. Semantic Chunking

All of the strategies above split the document on specified separators and usually require a chunk size to be set. Semantic chunking instead uses the semantic relatedness of the surrounding context to decide where boundaries go. It first performs a base split (e.g., into sentences) and embeds each piece. Then, walking through the pieces in order, it computes the cosine distance between the current piece's embedding and the mean embedding of the pieces in a preceding window. Wherever that distance exceeds a threshold (an absolute value, a percentile, etc.), a chunk boundary is placed.
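A minimal from-scratch sketch of this idea (the embed() function is a placeholder for any sentence-embedding model; the window size and percentile threshold are illustrative, and at least two sentences are assumed):

```python
import numpy as np

def embed(sentences: list[str]) -> np.ndarray:
    """Placeholder: return one embedding vector per sentence
    from any sentence-embedding model."""
    raise NotImplementedError

def semantic_chunk(sentences: list[str], window: int = 3, pct: float = 90):
    emb = embed(sentences)
    # Cosine distance between each sentence and the mean embedding
    # of the preceding `window` sentences
    dists = []
    for i in range(1, len(sentences)):
        ctx = emb[max(0, i - window):i].mean(axis=0)
        cos = emb[i] @ ctx / (np.linalg.norm(emb[i]) * np.linalg.norm(ctx))
        dists.append(1.0 - cos)
    cut = np.percentile(dists, pct)  # percentile-based threshold
    chunks, cur = [], [sentences[0]]
    for i, d in enumerate(dists, start=1):
        if d > cut:                  # distance spike => new chunk boundary
            chunks.append(" ".join(cur))
            cur = []
        cur.append(sentences[i])
    chunks.append(" ".join(cur))
    return chunks
```

LangChain's experimental SemanticChunker implements a similar scheme with configurable breakpoint thresholds (percentile, standard deviation, etc.).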

Compared with the strategies above, semantic chunking wins on neither performance nor cost. In the paper Is Semantic Chunking Worth the Computational Cost?, Renyi Qu et al. challenge whether its quality gains justify the expense, finding that semantic chunking only does well on stitched-together datasets with high topical diversity; on non-synthetic datasets, fixed-size chunking is not only faster and cheaper but also more effective.

5. Agentic Chunking

Agentic chunking uses an LLM to decide both what information a chunk should contain and how the chunks are produced.

To obtain the initial pieces, it first uses an LLM to extract standalone statements, called propositions, from the raw text. A proposition is an atomic expression in the text: it encapsulates a single distinct fact in a concise, self-contained natural-language form (see the paper "Dense X Retrieval: What Retrieval Granularity Should We Use?").

A LangChain example:

```python
from typing import List
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_groq import ChatGroq

# Data model
class GeneratePropositions(BaseModel):
    """List of all the propositions in a given document"""

    propositions: List[str] = Field(
        description="List of propositions (factual, self-contained, and concise information)"
    )


# LLM with function call
llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)
structured_llm = llm.with_structured_output(GeneratePropositions)

# Few shot prompting --- We can add more examples to make it good
proposition_examples = [
    {
        "document": "In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.",
        "propositions": "['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']",
    },
]

example_proposition_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{document}"),
        ("ai", "{propositions}"),
    ]
)

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_proposition_prompt,
    examples=proposition_examples,
)

# Prompt
system = """Please break down the following text into simple, self-contained propositions. Ensure that each proposition meets the following criteria:

1. Express a Single Fact: Each proposition should state one specific fact or claim.
2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        few_shot_prompt,
        ("human", "{document}"),
    ]
)

proposition_generator = prompt | structured_llm
```
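The generator can then be run over each base chunk. Here is a hedged sketch of how the `propositions` list used below might be produced, tagging each proposition with the id of its source chunk (`doc_splits` is assumed to hold the base chunks, e.g., from RecursiveCharacterTextSplitter):

```python
from langchain_core.documents import Document

propositions = []
for i, split in enumerate(doc_splits):
    result = proposition_generator.invoke({"document": split.page_content})
    for prop in result.propositions:
        # Record which base chunk the proposition came from
        propositions.append(Document(page_content=prop, metadata={"chunk_id": i + 1}))
```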

These propositions are then fed to an LLM grader, which scores each one on accuracy, clarity, completeness, and conciseness.

```python
# Data model
class GradePropositions(BaseModel):
    """Grade a given proposition on accuracy, clarity, completeness, and conciseness"""

    accuracy: int = Field(
        description="Rate from 1-10 based on how well the proposition reflects the original text."
    )

    clarity: int = Field(
        description="Rate from 1-10 based on how easy it is to understand the proposition without additional context."
    )

    completeness: int = Field(
        description="Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers)."
    )

    conciseness: int = Field(
        description="Rate from 1-10 based on whether the proposition is concise without losing important information."
    )

# LLM with function call
llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)
structured_llm = llm.with_structured_output(GradePropositions)

# Prompt
evaluation_prompt_template = """
Please evaluate the following proposition based on the criteria below:
- **Accuracy**: Rate from 1-10 based on how well the proposition reflects the original text.
- **Clarity**: Rate from 1-10 based on how easy it is to understand the proposition without additional context.
- **Completeness**: Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers).
- **Conciseness**: Rate from 1-10 based on whether the proposition is concise without losing important information.

Example:
Docs: In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.

Propositions_1: Neil Armstrong was an astronaut.
Evaluation_1: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Propositions_2: Neil Armstrong walked on the Moon in 1969.
Evaluation_2: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Propositions_3: Neil Armstrong was the first person to walk on the Moon.
Evaluation_3: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Propositions_4: Neil Armstrong walked on the Moon during the Apollo 11 mission.
Evaluation_4: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Propositions_5: The Apollo 11 mission occurred in 1969.
Evaluation_5: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Format:
Proposition: "{proposition}"
Original Text: "{original_text}"
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", evaluation_prompt_template),
        ("human", "{proposition}, {original_text}"),
    ]
)

proposition_evaluator = prompt | structured_llm
```

Only propositions whose scores exceed the thresholds in every category are kept.

```python
# Define evaluation categories and thresholds
evaluation_categories = ["accuracy", "clarity", "completeness", "conciseness"]
thresholds = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}

# Function to evaluate proposition
def evaluate_proposition(proposition, original_text):
    response = proposition_evaluator.invoke({"proposition": proposition, "original_text": original_text})
    # Extract the per-category scores from the structured LLM response
    scores = {"accuracy": response.accuracy, "clarity": response.clarity,
              "completeness": response.completeness, "conciseness": response.conciseness}
    return scores

# Check if the proposition passes the quality check
def passes_quality_check(scores):
    for category, score in scores.items():
        if score < thresholds[category]:
            return False
    return True

evaluated_propositions = []  # Store all the propositions from the document

# Loop through generated propositions and evaluate them
# (`propositions` and `doc_splits` come from the proposition-generation step above)
for idx, proposition in enumerate(propositions):
    scores = evaluate_proposition(proposition.page_content, doc_splits[proposition.metadata['chunk_id'] - 1].page_content)
    if passes_quality_check(scores):
        # Proposition passes quality check, keep it
        evaluated_propositions.append(proposition)
    else:
        # Proposition fails, discard or flag for further review
        print(f"{idx+1}) Proposition: {proposition.page_content} \n Scores: {scores}")
        print("Fail")
```

The propositions that survive are then encoded by the embedding model and added to the index.

```python
from langchain_community.vectorstores import FAISS

# Add to vectorstore (embedding_model is defined elsewhere in the pipeline)
vectorstore_propositions = FAISS.from_documents(evaluated_propositions, embedding_model)
retriever_propositions = vectorstore_propositions.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 4},  # number of documents to retrieve
)
```

The code examples above come from: RAG_Techniques

Summary

This article surveyed the common document chunking strategies for RAG, along with their core ideas and some practical code. In real applications, the right strategy depends on a combination of document structure, corpus size, cost, and quality requirements.

References

  1. Five Levels of Chunking Strategies in RAG | Notes from Greg's Video
  2. Understanding LangChain's RecursiveCharacterTextSplitter
  3. Semantic Chunking: Improving AI Information Retrieval
  4. 从零开始优化 RAG:7 种 Chunking 方法让你的系统更智能
  5. Greg Kamradt's RAG tutorial code
  6. Dense X Retrieval: What Retrieval Granularity Should We Use?