
tdc-tools

Tools for manipulating Tabular Document-Concept format

TDC: input data format

Overview

TDC (Tabular Document-Concept) is a format specifically designed to represent the biomedical literature as a collection of documents, each described by the concepts it contains. In particular, it facilitates the extraction of a knowledge graph of concepts and can be used to support Literature-Based Discovery (LBD).

The format is meant as a form of standard that decouples the data extraction stage from the high-level exploitation stage.

Currently two main options are proposed to obtain a TDC representation of the biomedical literature:

The TDC format

For each input document, three output files are generated: .raw, .tok, .cuis. The format of these files is described below.

.raw file

Abstracts:

<pmid> <year> <title> <abstract content>

Full article:

<pmid> <year> <title+subtitle> <abstract content> <paper content>

Here <paper content> includes the XML elements article, back and floating.
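As an illustration of the .raw layout, here is a minimal Python sketch that reads one line into named fields. It assumes the fields are tab-separated, which the description above does not state; the names RawRecord and parse_raw_line are hypothetical and not part of tdc-tools.

```python
from typing import NamedTuple, Optional


class RawRecord(NamedTuple):
    pmid: str
    year: str
    title: str
    abstract: str
    paper: Optional[str]  # only present for full articles


def parse_raw_line(line: str, sep: str = "\t") -> RawRecord:
    """Split one .raw line into its fields.

    Assumption: fields are separated by `sep` (tab by default).
    Abstract lines have 4 fields, full-article lines have 5.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}")
    paper = fields[4] if len(fields) == 5 else None
    return RawRecord(fields[0], fields[1], fields[2], fields[3], paper)
```

A record with four fields is treated as an abstract and one with five as a full article, so the paper field tells the two cases apart.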

.tok file

Full content of each sentence, together with its identifiers:

<pmid> <year> <partId> <elemId> <sentNo> <sentence words>
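A line of the .tok file could be read along the following lines. This is only a sketch: it assumes tab-separated fields and space-separated words within the sentence, neither of which is stated above, and the names TokSentence and parse_tok_line are hypothetical.

```python
from typing import List, NamedTuple


class TokSentence(NamedTuple):
    pmid: str
    year: str
    part_id: str
    elem_id: str
    sent_no: str
    words: List[str]


def parse_tok_line(line: str, sep: str = "\t") -> TokSentence:
    """Split one .tok line into five identifiers and the sentence words.

    Assumptions: fields are separated by `sep` (tab by default) and the
    words of the sentence are separated by whitespace.
    """
    # Limit the split so that the sentence stays in one piece.
    fields = line.rstrip("\n").split(sep, 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    *ids, sentence = fields
    return TokSentence(*ids, sentence.split())
```

The identifiers are kept as strings because their exact value types are not specified here.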

.cuis file

CUIs extracted by position: for every position and length at which at least one CUI is found, the line gives the list of candidate CUIs (synonyms).

The CUIs are provided as integer ids, as used internally by the original KD system.

<pmid> <partId> <elemId> <sentNo> <cuis list> <position> <length>
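Since the .cuis and .tok files share the identifiers pmid, partId, elemId and sentNo, annotations can be grouped per sentence. The sketch below shows one way to do this; it assumes tab-separated fields and a comma-separated CUI list, both of which are guesses rather than documented facts, and all names in it are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class CuiAnnotation(NamedTuple):
    cuis: List[int]   # candidate CUIs as integer ids
    position: int
    length: int


# (pmid, partId, elemId, sentNo), the identifiers shared with .tok
SentenceKey = Tuple[str, str, str, str]


def parse_cuis_line(
    line: str, sep: str = "\t", cui_sep: str = ","
) -> Tuple[SentenceKey, CuiAnnotation]:
    """Split one .cuis line (assumed separators: tab, and comma for CUIs)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    pmid, part_id, elem_id, sent_no, cuis, position, length = fields
    annotation = CuiAnnotation(
        [int(c) for c in cuis.split(cui_sep)], int(position), int(length)
    )
    return (pmid, part_id, elem_id, sent_no), annotation


def index_by_sentence(
    lines: Iterable[str],
) -> Dict[SentenceKey, List[CuiAnnotation]]:
    """Group all annotations by the sentence they belong to."""
    index: Dict[SentenceKey, List[CuiAnnotation]] = defaultdict(list)
    for line in lines:
        key, annotation = parse_cuis_line(line)
        index[key].append(annotation)
    return dict(index)
```

The resulting dictionary can be looked up with the identifiers of a .tok sentence to retrieve every annotated span in that sentence.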

Collecting PubTator Central (PTC) data

The PubTator Central (PTC) data can be downloaded in bulk in the BioC format as follows:

wget ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/PubTatorCentral_BioCXML/*

Estimated duration: 12 hours.

Despite the .gz extension, the files are plain tar archives. To extract them:

for f in BioCXML.*gz; do tar xf "$f"; done
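The same extraction can be done from Python, which also makes it easy to skip any file that is not a valid tar archive. This is a sketch using the standard tarfile module; the function name extract_archives and the directory arguments are hypothetical, and the archives should only be extracted if their source is trusted.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_archives(
    src_dir: str, dest_dir: str, pattern: str = "BioCXML.*gz"
) -> List[str]:
    """Extract every tar archive in src_dir matching pattern into dest_dir.

    Returns the names of the archives that were extracted.
    """
    extracted = []
    for archive in sorted(Path(src_dir).glob(pattern)):
        # The extension is misleading, so check the content instead.
        if not tarfile.is_tarfile(archive):
            continue
        # "r:*" lets tarfile detect the compression (or its absence).
        with tarfile.open(archive, mode="r:*") as tar:
            tar.extractall(dest_dir)
        extracted.append(archive.name)
    return extracted
```

Opening with mode "r:*" works whether or not the archive turns out to be compressed, so the sketch does not depend on the file extension.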

To save space, the resulting directory can be compressed into a SquashFS image:

mksquashfs BioCXML/ BioCXML.sqsh -comp xz

Details about the TDC output format for PTC and differences with KD output