
tdc-tools

Tools for manipulating Tabular Document-Concept format

Converting PTC data to TDC format

Output structure

The output data is stored in the TDC format in three subdirectories:

The output is structured this way in order to:

Inside every subdirectory, the files are organized by year, which forms the first part of the filename. Each file holds at most a fixed maximum number of documents in order to facilitate batch processing, so a single year can span several files. Documents whose year is undefined are a special case: they are treated as year 0000 (see below).
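As an illustration of this layout, the sketch below groups output files by year. It assumes that the year is exactly the first four characters of the filename; the file names used here and the helper `group_by_year` are hypothetical, not part of tdc-tools.

```python
from collections import defaultdict

# Year under which documents with an undefined year are stored.
UNDEFINED_YEAR = "0000"


def group_by_year(filenames):
    """Group file names by their year prefix.

    Assumption: the year is the first four characters of the name.
    """
    groups = defaultdict(list)
    for name in filenames:
        groups[name[:4]].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
example_files = ["1999.0", "1999.1", "2020.0", "0000.0"]
by_year = group_by_year(example_files)

# Files holding documents with an undefined year, if any.
undefined_year_files = by_year.get(UNDEFINED_YEAR, [])
```

Grouping this way makes it easy to process one year at a time, or to set aside the year 0000 files for separate handling.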

Important: the conversion requires a lot of computing time (a few months if using a single process), and the resulting data takes up 317 GB (for the January 2021 PTC data).

Requirements

The conversion script PTC-to-TDC.py requires the following Python modules: bioc, spacy, scispacy.
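Before starting a long run, it can be worth checking that these modules are installed. The following is a minimal sketch using only the standard library; the helper `missing_modules` is hypothetical and not part of tdc-tools.

```python
import importlib.util

# Modules required by PTC-to-TDC.py, as listed above.
REQUIRED_MODULES = ("bioc", "spacy", "scispacy")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are available.")
```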

Optional: udocker environment

In a non-root environment a udocker container can be created as follows:

udocker create --name=pybioc ubuntu
udocker run -v ~ pybioc
apt update
apt upgrade
apt install python3 python3-pip
pip3 install bioc spacy scispacy

Usage

The script PTC-to-TDC.py can be used in two ways, described below.

Single full process

PTC-to-TDC.py <PTC input dir> <output dir>

This command reads the full PTC data, processes every document and writes the corresponding data in the output TDC format.

Multiple processes by range of years

Currently the only way to decompose the task into parallel batches is to process a range of years:

PTC-to-TDC.py <PTC input dir> <output dir> <start year> <end year>
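A minimal sketch of how such a decomposition could be scripted, with one command per year. It assumes that passing the same value as start year and end year selects exactly that year, which is not confirmed here; the helper `commands_by_year` and the directory names are hypothetical.

```python
import shlex


def commands_by_year(input_dir, output_dir, first_year, last_year):
    """Build one PTC-to-TDC.py command line per year in [first_year, last_year].

    Assumption: start year == end year processes that single year.
    """
    commands = []
    for year in range(first_year, last_year + 1):
        args = ["PTC-to-TDC.py", input_dir, output_dir, str(year), str(year)]
        # Quote each argument so paths containing spaces stay intact.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


if __name__ == "__main__":
    # Hypothetical directories, for illustration only.
    for command in commands_by_year("ptc-input", "tdc-output", 2019, 2021):
        print(command)
```

The printed commands can then be run in parallel by whatever scheduler is available.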

The script generate-PTC-to-TDC-jobs-by-year.sh can be used to generate slurm jobs by year as follows:

rm -rf ptc-jobs; mkdir ptc-jobs; tdc-tools/code/generate-PTC-to-TDC-jobs-by-year.sh data/PubTatorCentral.sqsh 1950 ptc-jobs/job PTC.TDC compute
for f in ptc-jobs/*sh; do sbatch "$f"; done

Note that some jobs (recent years) will require around 10 days of computation.

Handling of errors/oddities in the PTC data

See also Details about the TDC output format for PTC and differences with KD output

Wrapping up (optional)

Global log summary across batches

If the script was run independently by range of years, the "Info:" counts logged by the individual runs can be summed into a global summary as follows:

ls *out | grep -v slurm | while read f; do grep "Info:" "$f"; done \
  | cut -f 1 -d '=' | sort -u \
  | while read line; do
      echo -n "$line = "
      ls *out | grep -v slurm | while read f; do grep "$line =" "$f"; done \
        | cut -d '=' -f 2 | awk '{s+=$1} END {print s}'
    done

Archiving the logs

mkdir PTC.TDC/PTC-to-TDC.logs
rm -f slurm*out
ls *err *out
mv *err *out  PTC.TDC/PTC-to-TDC.logs
create-symlinks-data-views.sh PTC.TDC

Compressing

mksquashfs PTC2021-TDC PTC2021-TDC.sqsh -comp xz