About the BioC (XML) and PubTator Formats

Last published: 2025-09-30 08:31:18

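1️⃣ BioC (XML) Format

1.1 Concept

BioC is a data exchange standard for biomedical literature proposed by the National Center for Biotechnology Information (NCBI).
It is mainly used for:
- Sentence segmentation of documents
- Entity annotation (diseases, genes, chemicals, etc.)
- Relation annotation
It supports two serializations, XML and JSON (XML is the more common one).

1.2 Structure (XML version)

The core structure of BioC XML: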

<collection>
  <source>PubMed</source>
  <date>2025-09-29</date>
  <document>
    <id>PMID123456</id>
    <passage>
      <offset>0</offset>
      <text>Alzheimer Disease is a neurodegenerative disorder.</text>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="0" length="17"/>
        <text>Alzheimer Disease</text>
      </annotation>
    </passage>
  </document>
</collection>

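collection: the set of documents, with source and date as collection-level metadata (a complete BioC file also carries a key element after date)
document: one article, identified by its id
passage: a paragraph or other text span of the document; it can be divided further into sentence elements
annotation: an entity annotation
- type → entity category (Disease, Gene, Chemical…), stored in an infon element
- location → position in the text, given as an offset plus a length
- text → entity text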
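
As a rough check of how location maps onto the text, the sketch below parses a small BioC fragment with Python's standard xml.etree.ElementTree and slices the passage text by offset and length. It assumes the BioC convention that annotation offsets count from the start of the document, so the passage offset is subtracted first. The helper name annotation_spans is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Small BioC fragment modelled on the structure described above.
BIOC_XML = """<collection>
  <source>PubMed</source>
  <date>2025-09-29</date>
  <document>
    <id>PMID123456</id>
    <passage>
      <offset>0</offset>
      <text>Alzheimer Disease is a neurodegenerative disorder.</text>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <location offset="0" length="17"/>
        <text>Alzheimer Disease</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def annotation_spans(xml_string):
    """Yield (entity_type, annotation_text, text_sliced_from_passage)."""
    root = ET.fromstring(xml_string)
    for passage in root.iter("passage"):
        passage_offset = int(passage.findtext("offset"))
        passage_text = passage.findtext("text")
        for ann in passage.iter("annotation"):
            entity_type = ann.findtext("infon[@key='type']")
            loc = ann.find("location")
            # Assumption: location offsets are document-level, so shift by
            # the passage offset before slicing the passage text.
            start = int(loc.get("offset")) - passage_offset
            end = start + int(loc.get("length"))
            yield entity_type, ann.findtext("text"), passage_text[start:end]


spans = list(annotation_spans(BIOC_XML))
for entity_type, ann_text, sliced in spans:
    print(entity_type, repr(ann_text), repr(sliced), ann_text == sliced)
```

If the sliced text and the annotation text disagree, the usual suspects are an off-by-one length or an offset that was computed against a different text (for example, the abstract alone instead of the whole document).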

1.3 Features

Supports a multi-level structure (collection → document → passage → sentence → annotation)
Easy to extend; relation annotations can be stored in relation elements
Commonly used for BioNLP / NER / RE datasets
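
To make the relation element concrete, here is a hedged sketch. A BioC relation carries its own infon elements plus node children whose refid attribute points at annotation ids. The sentence, the relation type "Association", and the role names below are invented for illustration and do not come from any real corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical passage: the text, ids, relation type and roles are made up.
PASSAGE_XML = """<passage>
  <offset>0</offset>
  <text>APP is linked to Alzheimer Disease.</text>
  <annotation id="1">
    <infon key="type">Gene</infon>
    <location offset="0" length="3"/>
    <text>APP</text>
  </annotation>
  <annotation id="2">
    <infon key="type">Disease</infon>
    <location offset="17" length="17"/>
    <text>Alzheimer Disease</text>
  </annotation>
  <relation id="R1">
    <infon key="type">Association</infon>
    <node refid="1" role="gene"/>
    <node refid="2" role="disease"/>
  </relation>
</passage>"""

passage = ET.fromstring(PASSAGE_XML)

# Index annotation text by annotation id so relation nodes can be resolved.
annotations = {
    ann.get("id"): ann.findtext("text") for ann in passage.findall("annotation")
}

relations = []
for rel in passage.findall("relation"):
    rel_type = rel.findtext("infon[@key='type']")
    # Map each role to the text of the annotation that the node refers to.
    args = {
        node.get("role"): annotations[node.get("refid")]
        for node in rel.findall("node")
    }
    relations.append((rel_type, args))

print(relations)
```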

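2️⃣ PubTator Format

2.1 Concept

PubTator is an automatic annotation tool for biomedical literature provided by NCBI.
Its output format mainly carries:
- The article text (title and abstract), keyed by PMID
- Entity annotations (Gene, Disease, Chemical, Species)
- Input for downstream relation extraction tasks

2.2 File Structure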

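PubTator's common plain-text format has two parts per article. The text lines are pipe-delimited and the annotation lines are tab-delimited:

123456|t|Alzheimer Disease is a neurodegenerative disorder.
123456|a|...
123456	0	17	Alzheimer Disease	Disease
123456	55	58	APP	Gene

Part 1: PMID + text, one line for the title (|t|) and one for the abstract (|a|)
Part 2: entity annotations, with the columns PMID, Start, End, Mention, Type (a concept identifier often follows as a sixth column)
- Start / End: character offsets of the entity in the title + abstract text; End is exclusive, so End - Start equals the mention length
- Mention: entity text
- Type: entity type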
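
A small sketch of how Start and End relate to the text. It assumes the |t| and |a| line prefixes used in PubTator exports, offsets counted over the title and abstract joined by one separator character, and an exclusive End. The record is invented, including the abstract sentence, and check_offsets is a hypothetical helper name.

```python
# Hypothetical PubTator record; the abstract sentence is invented.
RECORD = (
    "123456|t|Alzheimer Disease is a neurodegenerative disorder.\n"
    "123456|a|The APP gene is often studied in this context.\n"
    "123456\t0\t17\tAlzheimer Disease\tDisease\n"
    "123456\t55\t58\tAPP\tGene\n"
)


def check_offsets(record):
    """Return (mention, type, offsets_match) for every annotation line."""
    title = abstract = ""
    rows = []
    for line in record.splitlines():
        parts = line.split("\t")
        if len(parts) >= 5:
            # Annotation line: PMID, start, end, mention, type[, identifier]
            rows.append((int(parts[1]), int(parts[2]), parts[3], parts[4]))
        elif "|t|" in line:
            title = line.split("|t|", 1)[1]
        elif "|a|" in line:
            abstract = line.split("|a|", 1)[1]
    # Assumption: one separator character sits between title and abstract.
    text = title + " " + abstract
    return [
        (mention, etype, text[start:end] == mention)
        for start, end, mention, etype in rows
    ]


results = check_offsets(RECORD)
print(results)
```

Running a check like this over a whole file is a cheap way to catch offset drift before the annotations are used for training.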

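2.3 Features

Flat, table-like layout that is easy to parse
Well suited to quickly building entity and relation datasets
Structurally simpler than BioC, but it carries less relation information

3️⃣ BioC vs. PubTator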

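Feature	BioC (XML)	PubTator
File type	XML / JSON	Plain text / TSV
Hierarchy	Multi-level (collection → document → passage → sentence → annotation)	Flat (PMID + entity list)
Relation annotation	Yes	Not directly (can be extended)
Ease of parsing	Requires an XML parser	Simple text parsing
Typical use	NER / RE / BioNLP	Entity recognition / dataset construction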
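
The structural difference in the table can be illustrated by converting one representation into the other. The sketch below turns PubTator-style rows (start, end, mention, type) into BioC-style annotation elements, where the exclusive end becomes length = end - start. The function name rows_to_bioc_passage and the sequential ids are assumptions for illustration, and everything is placed in a single passage at offset 0.

```python
import xml.etree.ElementTree as ET


def rows_to_bioc_passage(text, rows):
    """Build a BioC-style <passage> element from PubTator-style rows."""
    passage = ET.Element("passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for i, (start, end, mention, etype) in enumerate(rows, start=1):
        # Sequential ids are an illustrative choice, not a requirement.
        ann = ET.SubElement(passage, "annotation", {"id": str(i)})
        infon = ET.SubElement(ann, "infon", {"key": "type"})
        infon.text = etype
        # PubTator stores start/end; BioC stores offset/length.
        ET.SubElement(
            ann, "location", {"offset": str(start), "length": str(end - start)}
        )
        ET.SubElement(ann, "text").text = mention
    return passage


passage = rows_to_bioc_passage(
    "Alzheimer Disease is a neurodegenerative disorder.",
    [(0, 17, "Alzheimer Disease", "Disease")],
)
xml_out = ET.tostring(passage, encoding="unicode")
print(xml_out)
```

Going the other way (BioC to PubTator) loses the passage and sentence hierarchy, which is the trade-off the table summarizes.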

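4️⃣ Parsing Examples (Python)

Parsing PubTator

# Collect entity annotations from the tab-separated annotation lines.
entities = []
with open("pubtator.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        # Skip blank lines, title/abstract lines, and any header row.
        if len(parts) < 5 or not parts[1].isdigit():
            continue
        # Ignore optional trailing columns such as a concept identifier.
        pmid, start, end, mention, etype = parts[:5]
        entities.append({
            "pmid": pmid,
            "start": int(start),
            "end": int(end),
            "text": mention,
            "type": etype
        })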

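Parsing BioC XML

import bioc

# Assumes the third-party bioc package is installed. Depending on the installed
# version, the XML loader may be exposed as bioc.biocxml.load instead.
with open("bioc.xml", encoding="utf-8") as f:
    collection = bioc.load(f)

for doc in collection.documents:
    for passage in doc.passages:
        text = passage.text
        for ann in passage.annotations:
            # Print the entity text, its type, and the offset of its first location.
            print(ann.text, ann.infons['type'], ann.locations[0].offset)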

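💡 Summary

BioC: standardized, hierarchical, supports relations; suited to complex NLP tasks
PubTator: flat and quick to parse; suited to entity extraction and large-scale dataset construction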