GraphRAG in Practice (OpenAI + LangChain + neo4j)

Last published: 2025-03-15 21:49:53


This article walks through an implementation of GraphRAG (Graph-based Retrieval Augmented Generation), using OpenAI for natural language processing and neo4j as the graph database. The workflow has three steps:

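  • First, convert the text into a graph structure
  • Then, store the graph structure in neo4j
  • Finally, extract the entities from the user's question, use them to retrieve related entities and their relationships, and let the llm generate the answer

The following text serves as the running example:
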
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.

We use OpenAI to convert the text above into a graph representation and store it in neo4j.

[Figure: the graph generated from the text above, as displayed in neo4j]

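In the graph visualization, the purple node (df48cdaf) represents the document, the red node (Nobel Prize) represents the Nobel Prize, the two blue nodes represent people (Marie Curie and Pierre Curie), and the gray node (University Of Paris) represents the University of Paris. The document node is connected to every other node by a "mentions" relationship.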

Implementing GraphRAG

To get a quick feel for the logic behind GraphRAG, you can experiment with the OpenAI API and a neo4j sandbox.

Install and import packages

!pip install langchain
!pip install -U langchain-community
!pip install sentence-transformers
!pip install pypdf
# Install only one FAISS build; the CPU build is enough for this example
!pip install faiss-cpu
!pip install langchain-openai
!pip install langchain-experimental
!pip install json-repair
!pip install neo4j

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.chat_models import ChatOllama
from langchain_community.vectorstores import Neo4jVector
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

Configure the neo4j connection

graph = Neo4jGraph(
    url= "bolt://44.204.252.192" ,
    username="neo4j", #default
    password="quarterdeck-gross-dials" #change accordingly
)

Converting text to a graph with OpenAI

Wrap the text in a Document

text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""
documents = [Document(page_content=text)]

Load the LLM and convert the text to a graph

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo", api_key="YOUR_OPENAI_API_KEY")  # replace with your own key
llm_transformer_filtered = LLMGraphTransformer(llm=llm)
graph_documents = llm_transformer_filtered.convert_to_graph_documents(documents)

The contents of graph_documents are as follows:

[GraphDocument(nodes=[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={}), Node(id='Nobel Prize', type='Award', properties={})], relationships=[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Nobel Prize', type='Award', properties={}), type='WINNER', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='PROFESSOR', properties={}), Relationship(source=Node(id='Pierre Curie', type='Person', properties={}), target=Node(id='Nobel Prize', type='Award', properties={}), type='CO-WINNER', properties={})], source=Document(metadata={}, page_content='\nMarie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.\nShe was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.\nHer husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.\nShe was, in 1906, the first woman to become a professor at the University of Paris.\n'))]
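
To make the shape of this output easier to read, here is a minimal sketch that mirrors it with plain dataclasses and flattens the relationships into triples. The `Node`, `Relationship`, and `to_triples` names are stand-ins invented for the illustration, not the LangChain classes themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Stand-in for a graph node: an id plus a type label."""
    id: str
    type: str


@dataclass(frozen=True)
class Relationship:
    """Stand-in for a directed, typed edge between two nodes."""
    source: Node
    target: Node
    type: str


def to_triples(relationships):
    """Flatten relationships into (source id, type, target id) tuples."""
    return [(r.source.id, r.type, r.target.id) for r in relationships]


marie = Node("Marie Curie", "Person")
pierre = Node("Pierre Curie", "Person")
paris = Node("University Of Paris", "Organization")
nobel = Node("Nobel Prize", "Award")

relationships = [
    Relationship(marie, nobel, "WINNER"),
    Relationship(marie, paris, "PROFESSOR"),
    Relationship(pierre, nobel, "CO-WINNER"),
]

triples = to_triples(relationships)
for source, rel_type, target in triples:
    print(f"({source}) -[{rel_type}]-> ({target})")
```

Each triple corresponds to one Relationship entry in the output above, which is also the form later used as graph context for the llm.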

Store the generated graph in neo4j

graph.add_graph_documents(
      graph_documents,
      baseEntityLabel=True,
      include_source=True
  )
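
Judging from the node data shown later in this article, `baseEntityLabel=True` gives every entity node an extra `__Entity__` label, and `include_source=True` stores the source text as a `Document` node that is linked to each entity by a mentions relationship. The sketch below imitates that layout in plain Python without touching neo4j. The `store` helper is hypothetical, and deriving the document id from an MD5 hash of the text is an assumption based on the 32-character hexadecimal id seen in the data.

```python
import hashlib


def store(text, entities, base_entity_label=True, include_source=True):
    """Return (nodes, edges) imitating what ends up in the graph.

    entities: list of (id, type) pairs extracted from the text.
    """
    nodes, edges = [], []
    for entity_id, entity_type in entities:
        labels = [entity_type]
        if base_entity_label:
            # Shared label so all entities can be matched with one pattern.
            labels.append("__Entity__")
        nodes.append({"id": entity_id, "labels": labels})

    if include_source:
        # Assumption: the document id is an MD5 digest of its content.
        doc_id = hashlib.md5(text.encode("utf-8")).hexdigest()
        nodes.append({"id": doc_id, "labels": ["Document"], "text": text})
        for entity_id, _ in entities:
            edges.append((doc_id, "MENTIONS", entity_id))
    return nodes, edges


entities = [("Marie Curie", "Person"), ("Nobel Prize", "Award")]
nodes, edges = store("Marie Curie won a Nobel Prize.", entities)
print(nodes[0]["labels"])
print(len(edges))
```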

Create embeddings in neo4j to support more complex queries

embed = OpenAIEmbeddings(model="text-embedding-3-large", base_url="https://xiaoai.plus/v1", api_key="YOUR_OPENAI_API_KEY")  # replace with your own key
vector_index = Neo4jVector.from_existing_graph(
    embedding=embed,
    search_type="hybrid",
    node_label="Document",
    text_node_properties=["text"],
    embedding_node_property="embedding",
    url="bolt://44.204.252.192",
    username="neo4j", #default
    password="quarterdeck-gross-dials" #change accordingly
)
vector_retriever = vector_index.as_retriever()
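
With `search_type="hybrid"`, retrieval combines vector similarity with keyword (full-text) search. As a conceptual illustration only, and not Neo4jVector's actual implementation, the following sketch scores toy documents both ways, normalizes each score list, and keeps the better score per document. The 3-dimensional vectors and the `hybrid_search` function are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query words that also occur in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def normalize(scores):
    """Scale scores so that the best one becomes 1.0."""
    top = max(scores.values(), default=0.0)
    return {k: (v / top if top else 0.0) for k, v in scores.items()}


def hybrid_search(query, query_vec, docs, k=2):
    """docs maps doc id -> (text, embedding). Returns top-k (id, score)."""
    vec = normalize({d: cosine(query_vec, e) for d, (_, e) in docs.items()})
    kw = normalize({d: keyword_score(query, t) for d, (t, _) in docs.items()})
    combined = {d: max(vec[d], kw[d]) for d in docs}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


docs = {
    "curie": ("Marie Curie won the Nobel Prize", [0.9, 0.1, 0.0]),
    "paris": ("The University of Paris is in France", [0.1, 0.9, 0.0]),
    "other": ("Bananas are yellow", [0.0, 0.0, 1.0]),
}
results = hybrid_search("Who is Marie Curie", [1.0, 0.0, 0.0], docs)
print(results)
```

In this toy run the "paris" document ranks second mainly through its keyword overlap, which is the kind of match a pure vector search could miss.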

At this point, the following data can be seen in neo4j:

{
  "identity": 0,
  "labels": [
    "Document"
  ],
  "properties": {
    "id": "df48cdafbdaada2de04aaeb7c6a271a0",
    "text": "
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.....",
    "embedding": [
      0.013757660053670406,
      -0.035230763256549835,
      -0.014454838819801807,
      ...
    ]
  },
  "elementId": "4:56545626-8926-4df0-bdb3-73bbd4de10d6:0"
}
{
  "identity": 1,
  "labels": [
    "Person",
    "__Entity__"
  ],
  "properties": {
    "id": "Marie Curie"
  },
  "elementId": "4:56545626-8926-4df0-bdb3-73bbd4de10d6:1"
}

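Querying entities in neo4j

Once the graph is stored in neo4j, we can extract the entities from the user's question and look up the related entities and their relationships in the graph.

Define the model for entities extracted from text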

class Entities(BaseModel):
    names: list[str] = Field(..., description="All entities from the text")

Define the entity-extraction prompt

prompt = ChatPromptTemplate.from_messages([
        ("system", "Extract organization and person entities from the text."),
        ("human", "Extract entities from: {question}")
    ])

Combine the prompt and the llm into an entity-extraction chain; its output is structured to match the entity model

entity_chain = prompt | llm.with_structured_output(Entities, include_raw=True)
response = entity_chain.invoke({"question": "Who are Marie Curie and Pierre Curie?"})
entities =  response['raw'].tool_calls[0]['args']['names']

The contents of response are as follows:

{'raw': AIMessage(content='', additional_kwargs={'tool_calls': [{'id': 'chatcmpl-WKYa1IBDY3cBqBgp8JbP6KtvlHniV', 'function': {'arguments': '{"names":["Marie Curie","Pierre Curie"]}', 'name': 'Entities'}, 'type': 'function'}], 'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 13, 'prompt_tokens': 72, 'total_tokens': 85, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': 'gpt-4-turbo', 'system_fingerprint': 'fp_5b26d85e12', 'finish_reason': 'stop', 'logprobs': None}, id='run-41e2ac51-9573-4366-bc80-3080bd464fa6-0', tool_calls=[{'name': 'Entities', 'args': {'names': ['Marie Curie', 'Pierre Curie']}, 'id': 'chatcmpl-WKYa1IBDY3cBqBgp8JbP6KtvlHniV', 'type': 'tool_call'}], usage_metadata={'input_tokens': 72, 'output_tokens': 13, 'total_tokens': 85, 'input_token_details': {}, 'output_token_details': {}}),
 'parsed': Entities(names=['Marie Curie', 'Pierre Curie']),
 'parsing_error': None}
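
In the raw tool call the entities arrive as a JSON string under `arguments`, while `parsed` holds the same data validated against the `Entities` model. A minimal sketch of that parsing step follows, using a standard-library dataclass as a stand-in for the Pydantic model; the `parse_entities` helper is hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Entities:
    """Stand-in for the Pydantic model: a list of entity names."""
    names: list


def parse_entities(arguments: str) -> Entities:
    """Decode the tool-call arguments and check their shape."""
    data = json.loads(arguments)
    names = data.get("names")
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("expected 'names' to be a list of strings")
    return Entities(names=names)


# The arguments string as it appears in the raw response.
arguments = '{"names":["Marie Curie","Pierre Curie"]}'
parsed = parse_entities(arguments)
print(parsed.names)
```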

Iterate over the extracted entities and query neo4j for their related entities and relationships

graph_lines = []
for entity in entities:
    query_response = graph.query(
        """MATCH (p:Person {id: $entity})-[r]->(e)
        RETURN p.id AS source_id, type(r) AS relationship, e.id AS target_id
        LIMIT 50""",
        {"entity": entity}
    )
    # One line per relationship, so results for different entities do not run together
    graph_lines.extend(f"{el['source_id']} - {el['relationship']} -> {el['target_id']}" for el in query_response)
graph_data = "\n".join(graph_lines)
graph_data

The contents of graph_data are as follows:

Marie Curie - WINNER -> Nobel Prize
Marie Curie - PROFESSOR -> University Of Paris
Pierre Curie - CO-WINNER -> Nobel Prize
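
The Cypher pattern `(p:Person {id: $entity})-[r]->(e)` matches every outgoing relationship of the named person. As a rough in-memory analogue (plain Python over a hand-written triple list, not a neo4j query), the same lookup could look like this; `TRIPLES`, `PERSONS`, and `outgoing` are names invented for the sketch.

```python
# (source id, relationship type, target id), taken from the example graph.
TRIPLES = [
    ("Marie Curie", "WINNER", "Nobel Prize"),
    ("Marie Curie", "PROFESSOR", "University Of Paris"),
    ("Pierre Curie", "CO-WINNER", "Nobel Prize"),
]
PERSONS = {"Marie Curie", "Pierre Curie"}


def outgoing(entity, limit=50):
    """Outgoing edges of a Person node, like MATCH (p:Person {id})-[r]->(e)."""
    if entity not in PERSONS:
        return []  # the label filter excludes non-Person nodes
    matches = [t for t in TRIPLES if t[0] == entity]
    return matches[:limit]


lines = []
for entity in ["Marie Curie", "Pierre Curie", "Nobel Prize"]:
    lines.extend(f"{s} - {r} -> {t}" for s, r, t in outgoing(entity))
graph_data = "\n".join(lines)
print(graph_data)
```

Because the pattern only follows outgoing edges from Person nodes, an entity such as Nobel Prize contributes nothing here; a broader query might also match incoming edges or other labels.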

Vector search

vector_data = [el.page_content for el in vector_retriever.invoke( "Who are Marie Curie and Pierre Curie?")]
vector_data

The contents of vector_data are as follows:

['\ntext: \nMarie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.\nShe was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.\nHer husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.\nShe was, in 1906, the first woman to become a professor at the University of Paris.\n']

Generating an answer from the combined graph and vector search results

context= f"Graph data: {graph_data}\nVector data: {'#Document '.join(vector_data)}"

Define a prompt template for generating an answer from the context

template = """Answer the question based only on the following context:
{context}
Question: {question}
Answer:"""

Create the prompt from the template; it takes the context and the question as inputs

prompt = ChatPromptTemplate.from_template(template)
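
Assuming the default f-string-style template format, filling the template amounts to substituting the two placeholders. The sketch below uses plain `str.format` as a stand-in for the prompt object, with shortened sample values for `graph_data` and `vector_data`, to show what text the model would receive.

```python
template = """Answer the question based only on the following context:
{context}
Question: {question}
Answer:"""

# Shortened sample values standing in for the real search results.
graph_data = "Marie Curie - WINNER -> Nobel Prize"
vector_data = ["Marie Curie was a physicist and chemist."]
context = f"Graph data: {graph_data}\nVector data: {'#Document '.join(vector_data)}"

# Plain string formatting stands in for the prompt template here.
filled = template.format(
    context=context,
    question="Who are Marie Curie and Pierre Curie?",
)
print(filled)
```

Note that `'#Document '.join(...)` only inserts the separator between documents, so a single retrieved document appears without any marker.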

Create the processing chain, which:

  • uses the results generated above as the context input
  • applies the prompt template to build the final prompt
  • uses the llm to generate the answer
  • uses StrOutputParser to format the output as a string

chain = (
    {
        "context": lambda input: context,  # Supply the context assembled above (the input is ignored)
        "question": RunnablePassthrough(),  # Pass the question through without modification
    }
    | prompt  # Apply the prompt template
    | llm  # Use the language model to answer the question based on context
    | StrOutputParser()  # Parse the model's response as a string
)
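
The `|` operator chains steps so that each step's output becomes the next step's input, and a dict step runs each of its values on the same input. The following toy re-creation of that idea uses a hypothetical `Step` class, a `fan_out` helper, and a fake model function; it is not LangChain's implementation, only a sketch of the data flow.

```python
class Step:
    """Wraps a function so steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def fan_out(mapping):
    """Dict step: call every value on the same input, keep the keys."""
    return Step(lambda value: {key: fn(value) for key, fn in mapping.items()})


context = "Marie Curie - WINNER -> Nobel Prize"
template = "Context: {context}\nQuestion: {question}\nAnswer:"


def fake_llm(prompt):
    """Placeholder for a model call; returns a dict shaped like a message."""
    return {"content": f"(answer based on {len(prompt)} prompt characters)"}


chain = (
    fan_out({
        "context": lambda _question: context,  # precomputed context
        "question": lambda question: question,  # passed through unchanged
    })
    | (lambda inputs: template.format(**inputs))  # fill the prompt
    | fake_llm  # "generate" an answer
    | (lambda message: message["content"])  # keep only the text
)

answer = chain.invoke("Who are Marie Curie and Pierre Curie?")
print(answer)
```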

Invoking the chain with the question "Who are Marie Curie and Pierre Curie?" gives the following result:

Marie Curie was a Polish and naturalised-French physicist and chemist known for her research on radioactivity. 
She was the first woman to win a Nobel Prize, the first person to win it twice, and the only person to win in two scientific fields. 
She also became the first woman professor at the University of Paris. Pierre Curie, her husband, was a co-winner of her first Nobel Prize. 
Together, they were the first married couple to win the Nobel Prize.

Comparing the results

Output without GraphRAG

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo", base_url="https://xiaoai.plus/v1", api_key="YOUR_OPENAI_API_KEY")  # replace with your own key
response = llm.invoke("Who are Marie Curie and Pierre Curie?")
print(response.content)  # print only the answer text, not the whole message object

Marie Curie and Pierre Curie were a married couple who were both pioneering scientists in the field of radioactivity.
Marie Curie, originally from Poland, was the first woman to win a Nobel Prize and the only person to win Nobel Prizes in two different scientific fields, physics and chemistry. 
Pierre Curie was a French physicist who made significant contributions to the study of crystallography, magnetism, and radioactivity. 
Together, they discovered the elements polonium and radium, and conducted groundbreaking research on the properties of radioactive materials. 
Their work laid the foundation for the development of nuclear physics and the use of radiation in medicine.

Output with vector-based RAG

text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""
docs = [Document(page_content=text)]
embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key="YOUR_OPENAI_API_KEY")  # replace with your own key
# Create FAISS vector store
vectorstore = FAISS.from_documents(docs, embeddings)
# Save and reload the vector store
vectorstore.save_local("faiss_index_")
persisted_vectorstore = FAISS.load_local("faiss_index_", embeddings, allow_dangerous_deserialization=True)
# Create a retriever
retriever = persisted_vectorstore.as_retriever()
# qa is assumed to be a RetrievalQA chain built on the retriever
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa.invoke("Who are Marie Curie and Pierre Curie?")
print(result["result"])  # the answer text is stored under the "result" key

Marie Curie was a Polish and naturalised-French physicist and chemist known for her research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win twice, and the only person to win in two different scientific fields. 
Pierre Curie was her husband and a co-winner of her first Nobel Prize. 
Together, they conducted significant research in the field of radioactivity, and their collaboration marked them as the first-ever married couple to win the Nobel Prize.
