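BioC is a data-exchange standard for biomedical literature proposed by the National Center for Biotechnology Information (NCBI).
It is mainly used for:
It supports two storage forms, XML and JSON (XML is the more common).
Core structure of BioC XML: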
<collection>
  <source>PubMed</source>
  <date>2025-09-29</date>
  <document>
    <id>PMID123456</id>
    <passage>
      <offset>0</offset>
      <text>Alzheimer Disease is a neurodegenerative disorder.</text>
      <annotation>
        <id>1</id>
        <infon key="type">Disease</infon>
        <location offset="0" length="17"/>
        <text>Alzheimer Disease</text>
      </annotation>
    </passage>
  </document>
</collection>
collection: the set of documents
document: a single article
passage: a paragraph or sentence of the article
annotation: an entity annotation
type
location
text
relation
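To make the element hierarchy concrete, here is a minimal sketch that walks a sample collection with Python's standard xml.etree.ElementTree instead of a BioC-specific library. The helper name iter_annotations and the inline SAMPLE string are illustrative assumptions, not part of any BioC API. The sketch also assumes that a location offset is counted from the start of the document, so the passage offset is subtracted before slicing the passage text.

```python
import xml.etree.ElementTree as ET

# Illustrative sample; "Alzheimer Disease" spans 17 characters.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>2025-09-29</date>
  <document>
    <id>PMID123456</id>
    <passage>
      <offset>0</offset>
      <text>Alzheimer Disease is a neurodegenerative disorder.</text>
      <annotation>
        <id>1</id>
        <infon key="type">Disease</infon>
        <location offset="0" length="17"/>
        <text>Alzheimer Disease</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def iter_annotations(xml_text):
    """Yield (doc_id, type, mention, span_ok) for every annotation.

    Hypothetical helper: span_ok is True when slicing the passage text
    with the location reproduces the annotation text.
    """
    root = ET.fromstring(xml_text)
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            # findtext() only looks at direct children, so this is the
            # passage's own offset/text, not an annotation's.
            p_offset = int(passage.findtext("offset"))
            p_text = passage.findtext("text")
            for ann in passage.iter("annotation"):
                ann_type = ann.find("infon[@key='type']").text
                loc = ann.find("location")
                # Assumption: location offsets are document-level.
                start = int(loc.get("offset")) - p_offset
                end = start + int(loc.get("length"))
                mention = ann.findtext("text")
                yield doc_id, ann_type, mention, p_text[start:end] == mention


for row in iter_annotations(SAMPLE):
    print(row)
```

Checking that the sliced span equals the annotation text is a cheap way to catch off-by-one lengths before the data is used downstream.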
PubTator is an automatic annotation tool for biomedical literature provided by NCBI.
Its output format is mainly used for:
A common tab-delimited PubTator format:
PMID|Title|Abstract
123456|Alzheimer Disease is a neurodegenerative disorder.|...
PMID	Start	End	Mention	Type
123456	0	17	Alzheimer Disease	Disease
123456	55	58	APP	Gene
Part 1: PMID plus text (title + abstract)
Part 2: entity annotations
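# Parse the tab-delimited annotation rows of a PubTator file.
entities = []
with open("pubtator.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        # Skip blank lines, the pipe-delimited text rows, and the column header;
        # only rows with exactly five tab-separated fields are annotations.
        if len(parts) != 5 or parts[0] == "PMID":
            continue
        pmid, start, end, mention, etype = parts
        entities.append({
            "pmid": pmid,
            "start": int(start),
            "end": int(end),
            "text": mention,
            "type": etype,
        })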
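A quick sanity check on parsed spans is to slice the text with each start/end pair and compare the result with the mention. The sketch below does this on an in-memory sample in the simplified layout described in these notes. The function name check_spans and the abstract sentence in SAMPLE are hypothetical. It assumes offsets are 0-based, end-exclusive, and counted over the title and abstract joined by a single separator character.

```python
# Hypothetical sample: the abstract sentence is invented for illustration.
SAMPLE = (
    "123456|Alzheimer Disease is a neurodegenerative disorder.|The APP gene is implicated.\n"
    "PMID\tStart\tEnd\tMention\tType\n"
    "123456\t0\t17\tAlzheimer Disease\tDisease\n"
    "123456\t55\t58\tAPP\tGene\n"
)


def check_spans(raw):
    """Return (mention, type, span_ok) for each annotation row in raw."""
    texts = {}    # pmid -> title + " " + abstract
    results = []
    for line in raw.splitlines():
        if "\t" not in line:
            # Text row: PMID|Title|Abstract (skip the schematic header).
            fields = line.split("|", 2)
            if len(fields) == 3 and fields[0] != "PMID":
                pmid, title, abstract = fields
                # Assumption: one separator character between title and abstract.
                texts[pmid] = title + " " + abstract
            continue
        parts = line.split("\t")
        if len(parts) != 5 or parts[0] == "PMID":
            continue  # column header or malformed row
        pmid, start, end, mention, etype = parts
        span = texts.get(pmid, "")[int(start):int(end)]
        results.append((mention, etype, span == mention))
    return results


for mention, etype, ok in check_spans(SAMPLE):
    print(mention, etype, "OK" if ok else "MISMATCH")
```

Rows that report a mismatch usually point to an off-by-one end offset or to a different convention for joining title and abstract.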
import bioc

# Load a BioC XML collection, then print each annotation's text, type and offset.
with open("bioc.xml") as f:
    collection = bioc.load(f)

for doc in collection.documents:
    for passage in doc.passages:
        text = passage.text
        for ann in passage.annotations:
            print(ann.text, ann.infons['type'], ann.locations[0].offset)
💡 Summary