tdc-tools

Tools for manipulating Tabular Document-Concept format

tdc-tools documentation

This is the documentation of the TDC Tools repository. TDC stands for Tabular Document-Concept.

Overview

This repository contains Python and Bash scripts to generate and manipulate data in the Tabular Document-Concept (TDC) format. TDC is a format specifically designed to represent the biomedical literature as a collection of documents, each represented by the concepts it contains. In particular, it facilitates the extraction of a knowledge graph of concepts and can be used to support Literature-Based Discovery (LBD).

Most of the biomedical literature is available for download from Medline and PubMedCentral (PMC). PubTatorCentral (PTC) offers an alternative to the raw data format with the BioC format. While the PTC data is much richer and BioC is more convenient than the raw XML format, these formats are all fairly low level: very detailed, quite complex to parse, and not very convenient for capturing high-level relations between articles or concepts. By contrast, the TDC format is a high-level representation of the literature in which each document is considered as a collection of concepts and the documents are grouped by year of publication. The format is meant to facilitate the extraction of the individual and joint frequencies of concepts.
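To make the idea of individual and joint frequencies concrete, here is a minimal Python sketch. It does not read actual TDC files and does not use the scripts of this repository: the in-memory layout (a dictionary mapping each year to a list of documents, each document being a set of concept identifiers), the concept ids and the function name concept_frequencies are all assumptions made for illustration. Frequency is taken here to mean the number of documents in which a concept (or a pair of concepts) appears.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: year -> list of documents, each a set of concept ids.
# This layout is an assumption for illustration, not the TDC file layout.
docs_by_year = {
    2019: [{"C1", "C2"}, {"C1", "C3"}],
    2020: [{"C1", "C2", "C3"}],
}


def concept_frequencies(docs_by_year):
    """Count, per year, the documents containing each concept and each concept pair."""
    individual = Counter()  # key: (year, concept)
    joint = Counter()       # key: (year, concept_a, concept_b) with concept_a < concept_b
    for year, docs in docs_by_year.items():
        for concepts in docs:
            # Deduplicate and sort so each pair is counted once per document,
            # always in the same order.
            unique = sorted(set(concepts))
            for concept in unique:
                individual[(year, concept)] += 1
            for a, b in combinations(unique, 2):
                joint[(year, a, b)] += 1
    return individual, joint


individual, joint = concept_frequencies(docs_by_year)
print(individual[(2019, "C1")])     # documents from 2019 containing C1
print(joint[(2020, "C2", "C3")])    # documents from 2020 containing both C2 and C3
```

Pairs of concepts that occur together in at least one document are the kind of information from which the edges of a concept graph could be derived.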

Software requirements

Most scripts require only Python 3 and a few standard Python libraries.

Data requirements

The scripts can be used with any dataset in the TDC format. See TDC: input data format.

Setup

In this documentation we assume that the scripts can be found through the $PATH environment variable. To set this up, run the following command from the tdc-tools directory:

export PATH=$PATH:$(pwd)/code

Note that the scripts can also be called with their path, e.g. code/build-doc-concept-matrix-all-variants.sh.