TDC: input data format
Overview
TDC (Tabular Document-Concept) is a format specificailly designed to represent the biomedical literature as a collection of documents represented by their concepts. In particular it facilitates the extraction of a knowledge graph of concepts and can be used as a support for Literature-Based Discovery (LBD).
The format is meant as a form of standard which disconnects the stage of data extraction from the the stage of high level exploitation:
- Upstream, different datasets and extraction methods can be represented as TDC.
- Downtream, different applications and methods (e.g. LBD methods) can exploit any dataset represented as TDC.
Currently two main options are proposed to obtain a TDC representation of the biomedical literature:
- PubTatorCentral conveniently provides Medline/PMC data after annotation and disambiguation.
- The raw Medline/PMC data can be converted to TDC and disambiguated using the following tools:
- This fork of the Knowedge Discovery (KD) repository (The original code was made by Jake Lever)
- The in-house disambiguation code “KD Data Tools”
The TDC format
- Note: description from this KnowledgeDiscovery fork. The format was originally designed for the KD data, hence the use UMLS CUIs as concept identifiers (CUIs can be replaced with any concept id).
For each input document, three output files are generated: .raw, .tok, .cuis. The format of these files is described below.
.raw file
Abstracts:
<pmid> <year> <title> <abstract content>
Full article:
<pmid> <year> <title+subtitle> <abstract content> <paper content>
Where <paper content> includes the xml elements article, back and floating.
.tok file
Full content of the sentences with ids
<pmid> <year> <partId> <elemId> <sentNo> <sentence words>
.cuis file
Extracted CUIs by position, i.e. for every position and length where at least one CUI is found the list of candidate CUIs (synonyms).
The CUIs are provided as integer ids, as used internally by the original KD system.
<pmid> <partId> <elemId> <sentNo> <cuis list> <position> <length>
Collecting PubTator Central (PTC) data
The PubTator Central (PTC) data can be downloaded in bulk in the BioC format as follows:
wget ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/PubTatorCentral_BioCXML/*
Estimated duration: 12 hours.
Despite the .gz extension the files are simple tar archives. To extract them:
for f in BioCXML.*gz; do tar xf $f; done
- '’Caution:’’ the uncompressed data requires 438G (January 2021 version).
To save space the resulting directory can be compressed:
mksquashfs BioCXML/ BioCXML.sqsh -comp xz
- Estimated duration: 30 hours.
- The compressed size is 69G.
Details about the TDC output format for PTC and differences with KD output
- In the TDC fomat the
.tokfile is meant to primarily represent sentence-level tokenization. Word-level tokenization is optional.- In the KD output in TDC format, word-level tokenization is also provided and the position of a concept is represented as a a token index
- By contrast the PTC output doesn’t provide word-level tokenization and represents the position as a char offset.
- This is due to the fact that the PTC BioC format doesn’t provide word-level tokenization, and applying SciSpacy tokenization turned out not to be compatible with the PTC annotations.
- The format of the concept column is also different between KD and PTC output:
- In KD an occurrence can match multiple UMLS concepts, so the concept column is made of a comma-separated list of concepts.
- Incidentally this is why the KD output requires a disambiguation step.
- In PTC the concepts are already disambiguated and it’s very rare that multiple concepts are provided for the same occurrence. Therefore the concept column represents a single concept. The rare cases of multiple concepts are stored on different lines.
- In PTC the concept column is made of an identifier and a type separated by
@:id@type - In PTC a concept identifier can contain pretty much any character, including the comma
,and even the separator@(at least one case).- Important: when parsing the concept string
id@typeonly the last@should be interpreted as separator in the string (thetypecannot contain the separator but theidcan).
- Important: when parsing the concept string
- In KD an occurrence can match multiple UMLS concepts, so the concept column is made of a comma-separated list of concepts.