RAG: how to speed up slow retrieval queries

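My Chroma DB holds a little over 5,000 vectors, and both the LLM and the embedding model are third-party online APIs. Retrieval queries are too slow, though: each one takes more than 10 seconds. How should I deal with this?

The code is as follows: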

import textwrap
import time

import chromadb
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.vector_stores.chroma import ChromaVectorStore


# ================== Initialize models ==================
def init_models():
    """Initialize the embedding model and the LLM, then sanity-check the embedding."""
    embed_model = OpenAILikeEmbedding(
        model_name="BAAI/bge-m3",
        api_base="https://api.siliconflow.cn/v1",
        api_key="sk-sss",
        embed_batch_size=10,
    )

    llm = OpenAILike(
        model="deepseek-ai/DeepSeek-V3",
        api_base="https://api.siliconflow.cn/v1",
        api_key="sk-ssss",
        context_window=128000,
        is_chat_model=True,
        is_function_calling_model=False,
    )

    # Register both models as the global defaults
    Settings.embed_model = embed_model
    Settings.llm = llm

    # Sanity check: one embedding call over the remote API
    test_embedding = embed_model.get_text_embedding("测试文本")  # "test text"
    print(f"Embedding dimension check: {len(test_embedding)}")

    return embed_model


print("Initializing models...")
start_time = time.time()
init_models()
print(f"Model initialization time: {time.time() - start_time:.2f}s")

# Note: this timer covers opening Chroma, wrapping the index,
# retrieval, and LLM answer synthesis, not retrieval alone.
start_time = time.time()

# Connect to the persistent Chroma database:
# create the client and get (or create) the collection
chroma_client = chromadb.PersistentClient(path=r"D:\Test\LLMTrain\pgvector_test\chroma_db")
chroma_collection = chroma_client.get_or_create_collection("erp_qa_data")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Load the index from chromadb
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

query_engine = index.as_query_engine()
response = query_engine.query("2017有什么好看的小说")  # "What are some good novels from 2017?"

print(textwrap.fill(str(response), 100))
print(f"Query time: {time.time() - start_time:.2f}s")

Output:

Initializing models...
Embedding dimension check: 1024
Model initialization time: 0.95s
Novels worth recommending from 2017 include 《将夜》《择天记》《冒牌大英雄》《无限恐怖》《恐怖搞校》《大国医》《龙魔导》
《大唐悬疑录：长恨歌密码》《风雪追击》《草原动物园》《有匪2：离恨楼》《我们住在一起》《月都花落，沧海花开》
《天定风华》《寻找爱情的邹小姐》《应许之日》《星光的彼端》《他来了，请闭眼》, among others.
Query time: 17.18s

2025-05-16 | 651 views

Answers | 10 total

聚客AI-挽风

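On the first call, LlamaIndex builds or loads the Chroma collection and the index structure. That step includes indexing the nodes and initializing the storage context, and it is usually fairly slow. Measure first how many seconds that step takes, and then how many seconds the LLM takes to produce the final response.

2025-05-16

沐雪 | 2025-05-16

The second query is slow too.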

聚客AI-Li

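I suggest locating the cause first. For example, cut the data down to 100 records, or even 10, and see how long a query takes.

Check whether another API or model shows the same latency, to rule out the API or the model as the problem.

You can also print the time spent in each subsequent step. Most likely the data access is what takes the time.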
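A minimal sketch of the per-step timing suggested above, using only the standard library. The three `fake_*` functions are hypothetical stand-ins (they just sleep) so the snippet runs without network access. In the asker's script the corresponding real calls would presumably be `Settings.embed_model.get_query_embedding(...)`, `index.as_retriever().retrieve(...)`, and `query_engine.query(...)`.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def stage(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


# Hypothetical stand-ins for the real pipeline steps (assumptions, not LlamaIndex APIs).
def fake_embed_query(text):
    time.sleep(0.02)  # pretend this is the remote embedding API call
    return [0.0] * 1024


def fake_vector_search(vector, top_k=2):
    time.sleep(0.01)  # pretend this is the local Chroma lookup
    return ["chunk-1", "chunk-2"][:top_k]


def fake_llm_answer(question, chunks):
    time.sleep(0.05)  # pretend this is the remote LLM call
    return f"answer based on {len(chunks)} chunks"


question = "2017有什么好看的小说"
with stage("embed query"):
    vector = fake_embed_query(question)
with stage("vector search"):
    chunks = fake_vector_search(vector)
with stage("llm synthesis"):
    answer = fake_llm_answer(question, chunks)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s")
```

Whichever stage dominates the printout is the one worth optimizing. With two remote APIs in the loop, it is quite possible that the local vector search is the smallest of the three.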

2025-05-16

沐雪 | 2025-05-16

I just tried it with 20 records, and it is still slow.

沐雪 | 2025-05-16

My local machine has no discrete GPU. Could that be related?

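聚客AI-Li | 2025-05-17

Reply to @沐雪: From what I found, query_engine.query() uses the LLM by default when Settings.llm is set. Is the content you get back the raw stored data, or data processed by the LLM? Adding the parameter verbose=True will show what is being used. Try removing the LLM.

沐雪 | 2025-05-17

Reply to @聚客AI-Li: Could it be that the LLM's text synthesis is what takes the time? And how can I get LlamaIndex to print its internal logs?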
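On the logging question, one common approach is Python's standard `logging` module. The sketch below assumes that LlamaIndex's modules log under logger names starting with `llama_index`, and that the underlying HTTP client reports each request at INFO level. Both are assumptions worth verifying against the installed versions. The logger name `llama_index.demo` is hypothetical and only shows that the configuration took effect.

```python
import logging
import sys


def enable_debug_logging(stream=sys.stdout):
    """Send log records to `stream`, with timestamps, and turn on DEBUG for llama_index."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)  # other libraries: INFO and above
    logging.getLogger("llama_index").setLevel(logging.DEBUG)  # LlamaIndex: everything
    return handler


enable_debug_logging()

# Hypothetical logger name, used only to confirm that DEBUG records get through.
logging.getLogger("llama_index.demo").debug("debug logging is on")
```

Call `enable_debug_logging()` before building the index. The timestamps on consecutive records then show where the seconds go, for example the gap between the embedding request and the chat completion request.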

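聚客AI-Li | 2025-05-17

Reply to @沐雪: The LLM certainly takes time to return its content. But your question is about retrieval latency, and RAG retrieval does not involve the LLM. So remove the LLM and check the latency again.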

沐雪

The second query is slow too.

Here is the code snippet:

# Load the index from chromadb
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

query_engine = index.as_query_engine()
response = query_engine.query("2017有什么好看的小说")  # "What are some good novels from 2017?"

print(textwrap.fill(str(response), 100))
print(f"Query time: {time.time() - start_time:.2f}s")

print("====== Second query ======")
start_time = time.time()
response = query_engine.query("爬行垫什么材质的好")  # "What material is best for a crawling mat?"
print(textwrap.fill(str(response), 100))
print(f"Second query time: {time.time() - start_time:.2f}s")

2025-05-16
