Multimodal RAG in Practice: A Complete Python Implementation for AI That Understands Images, Tables, and Text Together

Source: 企鹅号 (Tencent Penguin account) - deephub

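Traditional RAG systems work well for text-only applications, but real-world information is usually multimodal. Documents routinely contain images, tables, and charts that carry key information, and handling that content effectively is the core value of a multimodal RAG system.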

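Choosing the Best Multimodal RAG Approach

Based on systematic research and experimental validation, this article presents a recommended implementation for handling multimodal content in a RAG system. The approach balances performance, accuracy, and implementation complexity.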

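Figure 1: Overall architecture of the multimodal RAG system, showing the complete workflow from document processing to vectorized storage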

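Architectural Advantages

The architecture combines modality-specific processing with late fusion. Compared with alternative approaches, it offers the following advantages:

First, it preserves modality-specific information. Unified embedding approaches can lose information that is particular to one modality; this approach avoids that by processing each content type with a dedicated tool optimized for it. Second, the system is flexible and modular: individual components can be upgraded (for example, by swapping in a stronger image-understanding model) without rebuilding the whole architecture.

Third, on retrieval accuracy, the author reports that this approach outperforms unified approaches by 23% on complex multimodal queries. Finally, the architecture is built on widely available open-source tools and models, which keeps it accessible and practical for most organizations.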
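To make "late fusion" concrete, here is a minimal sketch that assumes each modality has its own retriever returning a score per chunk. The scores are rescaled per modality and combined with a weighted sum. The helper names, the sample scores, and the 0.6/0.4 weights are illustrative assumptions, not values from the article.

```python
def min_max_normalize(scores):
    """Rescale a {chunk_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}


def late_fusion(per_modality_scores, weights):
    """Combine per-modality scores into one ranking by weighted sum."""
    fused = {}
    for modality, scores in per_modality_scores.items():
        weight = weights.get(modality, 0.0)
        for cid, s in min_max_normalize(scores).items():
            fused[cid] = fused.get(cid, 0.0) + weight * s
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from a text retriever and an image retriever
ranking = late_fusion(
    {
        "text": {"chunk_1": 0.82, "chunk_2": 0.40, "chunk_3": 0.10},
        "image": {"chunk_2": 0.90, "chunk_3": 0.30},
    },
    weights={"text": 0.6, "image": 0.4},
)
print(ranking)
```

Because each modality is scored separately, a retriever can be replaced or reweighted without touching the others, which is the modularity argument made above.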

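Multimodal Document Processing Workflow

The following sections walk through each stage of the recommended workflow and explain how the components work together as one system:

Figure 2: Connected workflow of the multimodal RAG approach

1. Structure-Preserving Document Splitting

This module breaks a document into manageable pieces while keeping its logical structure and the links between different content types.

Structure-aware splitting is critical to system performance. It keeps related content, such as an image and its caption, together during splitting, which is decisive for accurate understanding and retrieval.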
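As an illustration of keeping an image and its caption together, the sketch below pairs an image with the nearest text block directly beneath it. It assumes bounding boxes are (x0, y0, x1, y1) tuples with y growing downward (the convention PyMuPDF uses); the function name find_caption, the 50-unit gap, and the sample boxes are hypothetical.

```python
def find_caption(image_bbox, text_blocks, max_gap=50.0):
    """Return the text block that sits closest below an image, or None."""
    img_x0, _, img_x1, img_bottom = image_bbox
    best, best_gap = None, max_gap
    for block in text_blocks:
        x0, y0, x1, _ = block["bbox"]
        gap = y0 - img_bottom  # vertical distance from image bottom to text top
        overlaps_horizontally = x0 < img_x1 and x1 > img_x0
        if 0 <= gap <= best_gap and overlaps_horizontally:
            best, best_gap = block, gap
    return best


blocks = [
    {"text": "Intro paragraph", "bbox": (50, 40, 500, 90)},
    {"text": "Figure 1: Architecture", "bbox": (60, 310, 400, 325)},
    {"text": "Next section", "bbox": (50, 420, 500, 470)},
]
caption = find_caption((50, 100, 450, 300), blocks)
print(caption["text"] if caption else "no caption found")
```

A splitter that applies a check like this before cutting can keep the caption in the same piece as its image instead of letting it drift into the next chunk.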

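import fitz  # PyMuPDF


def split_pdf_by_structure(pdf_path):
    """Split a PDF into sections that follow its logical structure."""
    doc = fitz.open(pdf_path)
    sections = []

    # Read the document outline (simplified example).
    # Each entry is [level, title, page] with a 1-based page number.
    toc = doc.get_toc()

    if toc:
        # Split along the table of contents
        for i, (level, title, page) in enumerate(toc):
            next_page = toc[i + 1][2] if i < len(toc) - 1 else len(doc)
            start_page = max(page - 1, 0)  # convert to a 0-based index
            # end_page is inclusive. It is the page on which the next section
            # starts, because two sections can share that page.
            end_page = min(max(next_page - 1, start_page), len(doc) - 1)
            sections.append({
                "title": title,
                "start_page": start_page,
                "end_page": end_page,
                "level": level,
            })
    else:
        # Fall back to one section per page
        for i in range(len(doc)):
            sections.append({
                "title": f"Page {i + 1}",
                "start_page": i,
                "end_page": i,
                "level": 1,
            })

    return sections, doc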

According to the author, preserving document structure during splitting significantly improves retrieval quality for multimodal content.
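To make the page arithmetic concrete, the following sketch applies the same outline-to-range logic to a hand-written table of contents, so it runs without a PDF. The helper name toc_to_sections and the sample outline are assumptions made for illustration; entries follow the [level, title, page] layout with 1-based pages.

```python
def toc_to_sections(toc, page_count):
    """Map [level, title, page] outline entries to inclusive 0-based page ranges."""
    sections = []
    for i, (level, title, page) in enumerate(toc):
        next_page = toc[i + 1][2] if i < len(toc) - 1 else page_count
        start = page - 1
        # Inclusive end: the page where the next entry begins may be shared
        end = max(next_page - 1, start)
        sections.append({"title": title, "level": level,
                         "start_page": start, "end_page": end})
    return sections


sample_toc = [
    [1, "Introduction", 1],
    [1, "Methods", 3],
    [2, "Data", 3],
    [1, "Results", 6],
]
sample_sections = toc_to_sections(sample_toc, page_count=8)
for s in sample_sections:
    print(s["title"], s["start_page"], s["end_page"])
```

Adjacent sections overlap by one page on purpose: a heading can appear halfway down a page, so that page belongs to both the section that ends there and the one that begins there.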

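2. Modality-Specific Content Extraction

This module processes each type of content (text, images, tables) with a dedicated tool optimized for that modality.

Each content type needs its own processing technique for its information to be extracted effectively; generic methods tend to produce suboptimal results.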
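Tables are the modality that most clearly needs its own tool. As a rough illustration of what a table detector looks for, the sketch below groups consecutive text lines that split into the same number of columns. It is a toy heuristic under stated assumptions (columns separated by tabs or by two or more spaces); the function name detect_tables and the sample lines are hypothetical, and it is not a substitute for a dedicated extraction library.

```python
import re


def detect_tables(lines, min_columns=2, min_rows=2):
    """Group consecutive lines that split into the same number of columns."""
    tables, current = [], []

    def flush():
        # Keep the run only if it is long enough to look like a table
        if len(current) >= min_rows:
            tables.append(list(current))
        current.clear()

    for line in lines:
        cells = [c for c in re.split(r"\s{2,}|\t", line.strip()) if c]
        if len(cells) >= min_columns and (not current or len(cells) == len(current[0])):
            current.append(cells)
        else:
            flush()
            if len(cells) >= min_columns:
                current.append(cells)  # start a new run with a different width
    flush()
    return tables


page_lines = [
    "Quarterly results",
    "Region    Q1    Q2",
    "North     10    12",
    "South     7     9",
    "Totals are unaudited.",
]
found = detect_tables(page_lines)
print(found)
```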

def extract_multimodal_content(sections, doc):
    """Extract content from each modality with a dedicated tool."""
    extracted_content = []

    for section in sections:
        section_content = {
            "title": section["title"],
            "level": section["level"],
            "text_elements": [],
            "images": [],
            "tables": [],
        }

        # Process every page in the section
        for page_num in range(section["start_page"], section["end_page"] + 1):
            page = doc[page_num]

            # Text: PyMuPDF returns blocks as
            # (x0, y0, x1, y1, text, block_no, block_type)
            text_blocks = page.get_text("blocks")
            for block in text_blocks:
                if block[6] == 0:  # 0 = text block, 1 = image block
                    section_content["text_elements"].append({
                        "text": block[4],
                        "bbox": block[:4],
                        "page": page_num,
                    })

            # Images: extract the raw bytes and the position on the page
            image_list = page.get_images(full=True)
            for img_index, img_info in enumerate(image_list):
                xref = img_info[0]
                base_image = doc.extract_image(xref)
                image_data = {
                    "image_data": base_image["image"],
                    "extension": base_image["ext"],
                    "bbox": tuple(page.get_image_bbox(img_info)),
                    "page": page_num,
                }
                section_content["images"].append(image_data)

            # Tables: use a dedicated table extraction tool.
            # This example uses a simplified placeholder.
            tables = extract_tables_from_page(page)
            for table in tables:
                section_content["tables"].append({
                    "data": table,
                    "page": page_num,
                })

        extracted_content.append(section_content)

    return extracted_content


def extract_tables_from_page(page):
    """
    Extract tables from a page using dedicated table detection.

    In a production system you would use a dedicated table extraction
    library such as Camelot or Tabula, or a deep learning model.
    """
    # Simplified implementation for illustration
    tables = []
    # Use heuristics or machine learning to identify table regions,
    # then extract structured data from those regions
    return tables

3. Relationship-Preserving HTML Conversion

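This module converts the extracted multimodal content into structured HTML while preserving the relationships between content elements.

As a standardized format, HTML can represent mixed-modality content and keep its structure intact, which makes it a good data foundation for the later stages.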

import os
import re

from bs4 import BeautifulSoup


def safe_name(title):
    """Make a section title safe to use as part of a file name."""
    return re.sub(r"[^\w-]+", "_", title).strip("_") or "section"


def convert_to_structured_html(extracted_content, output_dir):
    """Convert extracted multimodal content to structured HTML that preserves relationships."""
    os.makedirs(output_dir, exist_ok=True)
    html_files = []

    for section_index, section in enumerate(extracted_content):
        # The index prefix keeps sections with identical titles from
        # overwriting each other's files
        base_name = f"{section_index:03d}_{safe_name(section['title'])}"

        # Create a new HTML document for this section
        soup = BeautifulSoup("<article></article>", "html.parser")
        article = soup.find("article")

        # Add the section heading (HTML only defines h1 to h6)
        level = min(max(section["level"], 1), 6)
        header = soup.new_tag(f"h{level}")
        header.string = section["title"]
        article.append(header)

        # Collect every element so it can be sorted by page and position
        all_elements = []

        # Text elements
        for text_elem in section["text_elements"]:
            all_elements.append({
                "type": "text",
                "data": text_elem,
                "page": text_elem["page"],
                "y_pos": text_elem["bbox"][1],  # top y coordinate, used for sorting
            })

        # Images
        for i, img_data_item in enumerate(section["images"]):
            # Save the image to a file
            img_filename = f"{base_name}_img_{i}.{img_data_item['extension']}"
            img_path = os.path.join(output_dir, img_filename)
            with open(img_path, "wb") as f:
                f.write(img_data_item["image_data"])

            all_elements.append({
                "type": "image",
                "data": {
                    "path": img_path,
                    "bbox": img_data_item["bbox"],
                },
                "page": img_data_item["page"],
                "y_pos": img_data_item["bbox"][1],  # top y coordinate, used for sorting
            })

        # Tables
        for table in section["tables"]:
            all_elements.append({
                "type": "table",
                "data": table["data"],
                "page": table["page"],
                "y_pos": 0,  # a production system would use the real position
            })

        # Sort by page, then by vertical position
        all_elements.sort(key=lambda x: (x["page"], x["y_pos"]))

        # Add the elements to the HTML in reading order
        caption_indices = set()  # text elements already emitted as captions
        for idx, elem in enumerate(all_elements):
            if elem["type"] == "text":
                if idx in caption_indices:
                    continue  # already attached to a figure
                p = soup.new_tag("p")
                p.string = elem["data"]["text"]
                article.append(p)

            elif elem["type"] == "image":
                figure = soup.new_tag("figure")
                img_tag = soup.new_tag("img", src=elem["data"]["path"])
                figure.append(img_tag)

                # Look for a likely caption: the text element right after the image
                if idx + 1 < len(all_elements) and all_elements[idx + 1]["type"] == "text":
                    next_elem = all_elements[idx + 1]
                    # Distance from the bottom of the image to the top of the text
                    gap = next_elem["y_pos"] - elem["data"]["bbox"][3]
                    if next_elem["page"] == elem["page"] and gap < 50:
                        # This text is very likely a caption
                        figcaption = soup.new_tag("figcaption")
                        figcaption.string = next_elem["data"]["text"]
                        figure.append(figcaption)
                        caption_indices.add(idx + 1)

                article.append(figure)

            elif elem["type"] == "table":
                # Convert the table data to an HTML table
                table_tag = soup.new_tag("table")
                for row_data in elem["data"]:
                    tr = soup.new_tag("tr")
                    for cell in row_data:
                        td = soup.new_tag("td")
                        td.string = str(cell)
                        tr.append(td)
                    table_tag.append(tr)
                article.append(table_tag)

        # Save the HTML file
        html_path = os.path.join(output_dir, f"{base_name}.html")
        with open(html_path, "w", encoding="utf-8") as f:
            f.write(str(soup))
        html_files.append(html_path)

    return html_files

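In practice, use semantic HTML5 tags (such as the <article>, <figure>, <figcaption>, and <table> elements used in the code above) so that the meaning of each content element is preserved, rather than only its visual presentation.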
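One payoff of semantic markup is that downstream code can recover relationships from the tags alone. The sketch below uses only the standard library's html.parser to pull image and caption pairs out of figure elements. The class name FigureCollector and the sample HTML are illustrative assumptions.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Collect (image src, caption) pairs from <figure> elements."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self._current = None       # the figure being read, or None
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._current = {"src": None, "caption": ""}
        elif tag == "img" and self._current is not None:
            self._current["src"] = dict(attrs).get("src")
        elif tag == "figcaption" and self._current is not None:
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            self.figures.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data


sample_html = (
    "<article><h2>Results</h2><p>Overview text.</p>"
    '<figure><img src="out/fig_0.png"/>'
    "<figcaption>Figure 2: Accuracy by model</figcaption></figure>"
    "<p>More text.</p></article>"
)
collector = FigureCollector()
collector.feed(sample_html)
print(collector.figures)
```

With purely presentational markup (for example, an image followed by a styled paragraph), the same pairing would have to be guessed from layout.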

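4. Relationship-Preserving Semantic Chunking

The HTML conversion gives the multimodal content a standardized representation to build on while keeping its structure intact.

This module divides the HTML content into semantically complete chunks while maintaining the relationships between elements.

The chunking strategy has a decisive effect on retrieval quality. Chunks that are too large reduce retrieval precision, while chunks that are too small lose important context.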

import copy

import networkx as nx
from bs4 import BeautifulSoup, Tag

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def create_semantic_chunks_with_relationships(html_files, max_chunk_size=1000):
    """Create semantic chunks while preserving relationships between elements."""
    chunks = []
    relationship_graph = nx.DiGraph()

    def add_chunk(element, section_id):
        """Register one chunk and link it to its parent section."""
        chunk_id = f"chunk_{len(chunks)}"
        chunks.append({
            "id": chunk_id,
            "html": str(element),
            "text": element.get_text(separator=" ", strip=True),
            "parent": section_id,
        })
        relationship_graph.add_node(chunk_id, type="chunk")
        relationship_graph.add_edge(section_id, chunk_id, relation="contains")

    for file_index, html_file in enumerate(html_files):
        with open(html_file, "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")

        # Section title (fall back to the file name if there is no heading)
        heading = soup.find(HEADING_TAGS)
        section_title = heading.get_text() if heading else html_file
        section_id = f"section_{file_index}"

        # Add the section node to the relationship graph
        relationship_graph.add_node(section_id, type="section", title=section_title)

        # Find semantic boundaries for chunking
        boundaries = soup.find_all(HEADING_TAGS + ["section"])

        if len(boundaries) <= 1:
            # No internal boundaries: treat the whole section as one chunk
            add_chunk(soup, section_id)
        else:
            # Process each subsection, including the one after the last boundary
            for i, start in enumerate(boundaries):
                end = boundaries[i + 1] if i + 1 < len(boundaries) else None

                # Collect every element between this boundary and the next
                elements = []
                current = start.next_sibling
                while current is not None and current is not end:
                    if isinstance(current, Tag):  # skip bare strings
                        elements.append(current)
                    current = current.next_sibling

                if not elements:
                    continue

                # Build the chunk from copies so the source tree stays intact.
                # copy.copy() is the supported way to duplicate a bs4 element.
                chunk_soup = BeautifulSoup("<div></div>", "html.parser")
                chunk_div = chunk_soup.find("div")
                chunk_div.append(copy.copy(start))  # the heading
                for element in elements:
                    chunk_div.append(copy.copy(element))

                # Split the chunk further if it is too large
                chunk_text = chunk_div.get_text(separator=" ", strip=True)
                if len(chunk_text) > max_chunk_size:
                    pieces = split_large_chunk(chunk_div, max_chunk_size)
                else:
                    pieces = [chunk_div]
                for piece in pieces:
                    add_chunk(piece, section_id)

        # Special handling for images and tables so they are linked correctly
        process_special_elements(soup, chunks, relationship_graph)

    return chunks, relationship_graph


def split_large_chunk(chunk_div, max_chunk_size):
    """Split a large chunk into smaller chunks at paragraph boundaries."""
    # Implementation omitted for brevity
    return [chunk_div]  # placeholder


def process_special_elements(soup, chunks, graph):
    """Process images and tables to ensure their relationships are correct."""
    # Implementation omitted for brevity
    pass
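The split_large_chunk step is left as a placeholder in the code above. One plausible way to fill it in is greedy packing at paragraph boundaries, sketched here on plain strings rather than on parsed HTML elements. The function name pack_paragraphs and the choice to keep an oversized paragraph whole are assumptions, not the author's implementation.

```python
def pack_paragraphs(paragraphs, max_chunk_size=1000):
    """Greedily group paragraphs into chunks of at most max_chunk_size characters.

    A single paragraph longer than the limit is kept whole in its own chunk
    rather than being cut mid-sentence.
    """
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        extra = len(para) + (1 if current else 0)  # +1 for the joining space
        if current and current_len + extra > max_chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            extra = len(para)
        current.append(para)
        current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


packed = pack_paragraphs(["a" * 40, "b" * 40, "c" * 40, "d" * 120], max_chunk_size=100)
print([len(c) for c in packed])
```

Splitting only at paragraph boundaries trades strict size limits for coherent chunks, which matches the concern above that overly small pieces lose context.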

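In practice, represent the relationships between chunks explicitly with a graph data structure. This supports more sophisticated retrieval strategies that follow relationship links to find related content.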
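Following relationship links can be as simple as a bounded breadth-first search. The sketch below uses a plain dictionary as the graph so it runs with the standard library alone; the edge list, the relation names, and the helper names are hypothetical examples rather than output of the pipeline.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, target, relation) triples."""
    adjacency = {}
    for source, target, _relation in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def related_within(adjacency, start, max_hops=2):
    """Return {node: hop_count} for every node reachable within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(start)
    return seen


edges = [
    ("section_0", "chunk_0", "contains"),
    ("section_0", "chunk_1", "contains"),
    ("chunk_1", "figure_0", "references"),
    ("figure_0", "chunk_2", "captioned_by"),
]
adjacency = build_adjacency(edges)
print(related_within(adjacency, "chunk_1", max_hops=1))
print(related_within(adjacency, "chunk_1", max_hops=2))
```

The hop count gives a natural way to discount related content: the further a chunk is from a direct hit, the less weight it receives.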

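5. Multimodal Vectorization and Storage

This module converts the semantic chunks into vector representations and stores them in a vector database for efficient retrieval.

Each modality needs its own vectorization method to capture its semantic content effectively.

Figure 3: The recommended approach uses modality-specific processing with late fusion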
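Stripped of indexing tricks, retrieval from a vector store is a nearest-neighbor search. The following sketch does it by brute force with cosine similarity over a small in-memory dictionary. The three-dimensional vectors and chunk names are toy assumptions chosen so the result is easy to follow; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k stored ids most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


store = {
    "chunk_text": [1.0, 0.0, 0.0],
    "chunk_table": [0.0, 1.0, 0.0],
    "chunk_mixed": [1.0, 1.0, 0.0],
}
hits = top_k([1.0, 0.2, 0.0], store, k=2)
print(hits)
```

A comparison like this only makes sense when the query and every stored vector have the same length and come from the same embedding space, which is why the fusion step has to produce vectors with a fixed layout.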

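import json
import os
import pickle

import chromadb
import numpy as np
from bs4 import BeautifulSoup
from PIL import Image
from sentence_transformers import SentenceTransformer

IMAGE_DIM = 512  # output size of clip-ViT-B-32


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    vec = np.asarray(vec, dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def vectorize_and_store_multimodal_chunks(chunks, relationship_graph, output_dir):
    """Vectorize chunks with modality-specific models and store them with their relationships."""
    # Initialize the embedding models
    text_embedder = SentenceTransformer("all-MiniLM-L6-v2")
    image_embedder = SentenceTransformer("clip-ViT-B-32")

    # Initialize the vector database (in-memory client)
    client = chromadb.Client()
    collection = client.get_or_create_collection(name="multimodal_docs")

    for chunk in chunks:
        # Parse the HTML
        soup = BeautifulSoup(chunk["html"], "html.parser")

        # Extract the text used for the text embedding
        text_content = soup.get_text(separator=" ", strip=True)

        # Embed every image referenced by the chunk
        image_embeddings = []
        for img_tag in soup.find_all("img"):
            try:
                img_path = img_tag["src"]
                img_embedding = image_embedder.encode(Image.open(img_path))
                image_embeddings.append(img_embedding)
            except Exception as e:
                print(f"Error processing image {img_tag.get('src', 'unknown')}: {e}")

        # Generate the text embedding
        text_embedding = l2_normalize(text_embedder.encode(text_content))

        # Combine the embeddings (simplified approach; a production system
        # would use a more sophisticated fusion technique).
        # The two models produce vectors of different sizes, so the parts are
        # normalized and concatenated. Chunks without images get a zero image
        # part, which keeps every stored vector the same length.
        if image_embeddings:
            image_part = l2_normalize(np.mean(image_embeddings, axis=0))
        else:
            image_part = np.zeros(IMAGE_DIM, dtype=np.float32)
        final_embedding = np.concatenate([text_embedding, image_part])

        # Collect relationship metadata. Chunks are the target of "contains"
        # edges, so incoming edges are included as well as outgoing ones.
        relationships = []
        if chunk["id"] in relationship_graph:
            edges = list(relationship_graph.in_edges(chunk["id"]))
            edges += list(relationship_graph.out_edges(chunk["id"]))
            for source, target in edges:
                relationships.append({
                    "source": source,
                    "target": target,
                    "relation": relationship_graph.edges[source, target].get("relation", "related"),
                })

        # Store in the vector database
        collection.add(
            ids=[chunk["id"]],
            embeddings=[final_embedding.tolist()],
            metadatas=[{
                "html_content": chunk["html"],
                "parent": chunk.get("parent", ""),
                "relationships": json.dumps(relationships),
            }],
            documents=[text_content],
        )

    # Save the relationship graph for use at retrieval time.
    # nx.write_gpickle was removed in NetworkX 3.0, so pickle is used directly.
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "relationships.gpickle"), "wb") as f:
        pickle.dump(relationship_graph, f)

    return collection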

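For production systems, consider a more sophisticated fusion method, such as cross-attention or gated fusion, in place of simple concatenation or averaging when combining embeddings from different modalities.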
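To give a feel for gated fusion, the sketch below blends a text vector and an image vector with a single scalar gate computed from both inputs. It assumes the two embeddings have already been projected to the same size, and the gate weights are set by hand here, whereas a real system would learn them during training (often with one gate value per dimension). The function name and all values are illustrative.

```python
import math


def gated_fusion(text_vec, image_vec, gate_weights, gate_bias=0.0):
    """Blend two same-sized vectors with one scalar gate g in (0, 1).

    g = sigmoid(w . [text; image] + b)
    fused = g * text + (1 - g) * image
    """
    if len(text_vec) != len(image_vec):
        raise ValueError("project both embeddings to the same size first")
    features = list(text_vec) + list(image_vec)
    if len(gate_weights) != len(features):
        raise ValueError("gate_weights needs one weight per input feature")
    z = sum(w * x for w, x in zip(gate_weights, features)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-z))
    fused = [gate * t + (1.0 - gate) * i for t, i in zip(text_vec, image_vec)]
    return fused, gate


text_vec = [1.0, 0.0]
image_vec = [0.0, 1.0]

# Zero weights give a gate of 0.5, which reduces to plain averaging
balanced, g_balanced = gated_fusion(text_vec, image_vec, [0.0, 0.0, 0.0, 0.0])

# A large positive bias pushes the gate toward the text embedding
text_heavy, g_text = gated_fusion(text_vec, image_vec, [0.0, 0.0, 0.0, 0.0], gate_bias=20.0)
print(balanced, g_balanced)
print(text_heavy, g_text)
```

The point of the gate is that the mix can vary per chunk: a chunk whose image carries most of the meaning can lean on the image embedding, while a text-heavy chunk can lean the other way.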

Retrieval Flow: Putting the System Together

With the multimodal RAG system built, the following code shows how it handles a query:

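import json

import numpy as np
from sentence_transformers import SentenceTransformer

# The embedders are assumed to be the same models used when the chunks were stored
text_embedder = SentenceTransformer("all-MiniLM-L6-v2")
image_text_embedder = SentenceTransformer("clip-ViT-B-32")  # CLIP also encodes text
IMAGE_DIM = 512  # output size of clip-ViT-B-32


def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    vec = np.asarray(vec, dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve_multimodal_content(query, collection, relationship_graph, k=5):
    """Retrieve relevant multimodal content for a query."""
    # Analyze the query to determine which modalities it targets
    query_modalities = analyze_query_modalities(query)

    # Build the query embedding. Assumption: stored vectors are laid out as
    # [normalized text embedding | normalized CLIP embedding], with zeros in
    # the CLIP part for chunks that have no image. The query must use the
    # same layout for distances to be meaningful.
    text_part = unit(text_embedder.encode(query))
    if "image" in query_modalities:
        # For queries about images, also embed the query in CLIP space
        image_part = unit(image_text_embedder.encode(query))
    else:
        # For text-only queries, leave the image part empty
        image_part = np.zeros(IMAGE_DIM, dtype=np.float32)
    query_embedding = np.concatenate([text_part, image_part])

    # Run the initial retrieval
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=k,
    )

    # Enhance the results with relationship awareness
    enhanced_results = enhance_with_relationships(
        results, relationship_graph, query, collection
    )
    return enhanced_results


def analyze_query_modalities(query):
    """Analyze a query to determine which modalities it targets."""
    # Simple keyword-based approach
    image_keywords = ["image", "picture", "photo", "figure", "diagram", "chart"]
    table_keywords = ["table", "data", "row", "column", "cell"]

    modalities = ["text"]
    if any(keyword in query.lower() for keyword in image_keywords):
        modalities.append("image")
    if any(keyword in query.lower() for keyword in table_keywords):
        modalities.append("table")
    return modalities


def enhance_with_relationships(results, graph, query, collection):
    """Enhance retrieval results with relationship information."""
    enhanced_results = []
    retrieved_ids = set()
    ids = results["ids"][0]
    distances = results.get("distances")

    for i, result_id in enumerate(ids):
        retrieved_ids.add(result_id)
        if distances:
            # The store returns distances (smaller is closer), so convert
            # them to a score where larger is better
            score = 1.0 / (1.0 + distances[0][i])
        else:
            score = 1.0 - i / len(ids)
        enhanced_results.append({
            "id": result_id,
            "text": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            "score": score,
        })

    # Look for related chunks that may also be relevant
    for result in enhanced_results[:]:  # iterate over a copy while appending
        # Read the relationships from the metadata
        relationships = json.loads(result["metadata"].get("relationships", "[]"))
        for rel in relationships:
            # Either end of the edge can be the related item
            for related_id in (rel["source"], rel["target"]):
                if related_id in retrieved_ids:
                    continue
                # Section nodes are not stored in the collection and come back empty
                related = collection.get(ids=[related_id])
                if not related or not related["ids"]:
                    continue
                related_text = related["documents"][0]
                # Simple relevance check (more sophisticated in production)
                if any(term in related_text.lower() for term in query.lower().split()):
                    retrieved_ids.add(related_id)
                    enhanced_results.append({
                        "id": related_id,
                        "text": related_text,
                        "metadata": related["metadatas"][0],
                        "score": result["score"] * 0.9,  # related content scores slightly lower
                        "relation": "related to " + result["id"],
                    })

    # Sort by score, best first
    enhanced_results.sort(key=lambda x: x["score"], reverse=True)
    return enhanced_results

Comparative Advantages of the Approach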

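Compared with other approaches, the recommended one has clear advantages along the following dimensions:

For mixed-modality processing, each modality is handled by a dedicated tool and the results are then combined, which captures the distinctive characteristics of every content type. For relationship preservation, the relationships between content elements are modeled and stored explicitly, which maintains the context needed for accurate understanding and retrieval.

For adaptive retrieval, the retrieval process adjusts to the modalities a query calls for, so the most relevant information is found regardless of its format. For practicality, the approach is implemented with widely available tools and models, which makes it accessible to most organizations.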
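Adaptive retrieval depends on detecting the query's modality reliably. Plain substring matching is fragile for this: "row" is a substring of "throwing", and "data" is a substring of "database". The sketch below matches whole words instead, with a crude plural rule. The function name detect_modalities, the keyword sets, and the plural handling are illustrative assumptions, and the approach only covers English queries.

```python
import re

MODALITY_KEYWORDS = {
    "image": {"image", "picture", "photo", "figure", "diagram", "chart"},
    "table": {"table", "data", "row", "column", "cell"},
}


def detect_modalities(query):
    """Return the modalities a query targets, based on whole-word matches."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    # Strip a trailing "s" so that "charts" also matches "chart"
    tokens |= {t[:-1] for t in tokens if t.endswith("s") and len(t) > 3}
    modalities = ["text"]
    for modality, keywords in MODALITY_KEYWORDS.items():
        if tokens & keywords:
            modalities.append(modality)
    return modalities


print(detect_modalities("Which charts show revenue growth?"))
print(detect_modalities("How does the database handle throwing errors?"))
print(detect_modalities("Compare the rows in the pricing table"))
```

Keyword rules remain a rough signal either way; a production system would more likely use a small classifier for this step.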

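Summary

The multimodal RAG approach presented here combines modality-specific processing, late fusion, and relationship preservation, striking a strong balance between performance, accuracy, and implementation complexity. Following this approach makes it possible to build a RAG system that handles all of the information in complex documents.

Follow-up work will focus on moving a multimodal RAG system from the experimental stage to production readiness, with emphasis on scalability, monitoring, and continuous optimization.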

Author: Ashwindevelops

Published: 2025-05-26 09:19:41
Original link: https://page.om.qq.com/page/OX169Km8kHAivnIn3wK1-V4A0