medline-discoveries: data collection

Introduction

This directory contains the scripts and explanations (below) related to the data collection process, i.e. how the input data was generated.

This part is not required to reproduce the experiments in the paper. The input data can also be obtained from other sources or with other methods, as long as the input format is respected (see details in the main documentation).

Extracting the MeSH descriptors by paper from Medline

This part is somewhat complex and relies on several other repositories.

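Please follow the instructions provided in the TDC Tools documentation. The final output should be two directories; they are passed to prepare-dataset.sh in the next section as <my-path-to>/concept-freq/ and <my-path-to>/joint-freq/.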

Preparing the input data

Requirements

export PATH=$PATH:<my-path-to>/tdc-tools/code/
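As a quick sanity check, the sketch below tests whether a given directory is already listed in PATH. The function name path_contains is hypothetical and not part of this repository or of TDC Tools; trailing slashes are ignored when comparing entries.

```shell
# Hypothetical helper (not part of this repository): succeed if the
# directory given as $1 is one of the entries of $PATH.
path_contains() {
  local want
  local entry
  local IFS=:
  want="${1%/}"                 # ignore a trailing slash in the argument
  for entry in $PATH; do        # PATH is split on ':' because of IFS
    if [ "${entry%/}" = "$want" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, after the export above, `path_contains <my-path-to>/tdc-tools/code && echo "tdc-tools is in PATH"` should print the message.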

Generating the “full” variant of the input data

Replace <my-path-to> with the appropriate path in the command below:

data-collection/scripts/prepare-dataset.sh <my-path-to>/concept-freq/ <my-path-to>/joint-freq/ data/input

This process may take 15 to 30 minutes. It should generate the files described in the main documentation in data/input: indiv.full.min100, indiv.full.total and joint.full.min100, as well as the static directory.
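Once the script has finished, the expected outputs listed above can be checked with a small helper. This is a minimal sketch: the function name check_full_outputs is hypothetical, and it only tests that the files and the static directory exist, not that their content is valid.

```shell
# Hypothetical helper (not part of this repository): check that the
# outputs of the "full" variant exist in the given directory.
check_full_outputs() {
  local dir
  local missing
  local f
  dir="${1:-data/input}"        # default to the directory used above
  missing=0
  for f in indiv.full.min100 indiv.full.total joint.full.min100; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  if [ ! -d "$dir/static" ]; then
    echo "missing directory: $dir/static" >&2
    missing=1
  fi
  return "$missing"             # 0 if everything was found, 1 otherwise
}
```

It can be called as `check_full_outputs data/input && echo "all outputs present"`.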

Generating the ND subset of the input data

Replace <path to umls> with the path where the UMLS data is located (the one containing the META directory) in the command below:

data-collection/scripts/prepare-dataset-ND.sh -m data/input/ data/ND.mesh.targets <path to umls> data/SemGroups.txt

This process takes a few minutes. It creates the files indiv.ND.min100 and joint.ND.min100 in data/input, as well as indiv.min100.ND and joint.min100.ND in data/input/static.
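Since the UMLS path must point to the directory that contains META, a pre-flight check before running the script can catch a wrong path early. The sketch below is an illustration only; check_umls_dir is a hypothetical name, and it verifies nothing beyond the presence of the META subdirectory.

```shell
# Hypothetical helper (not part of this repository): succeed only if
# the directory given as $1 contains a META subdirectory.
check_umls_dir() {
  local umls
  umls="$1"
  if [ -z "$umls" ] || [ ! -d "$umls/META" ]; then
    echo "error: '$umls' does not contain a META directory" >&2
    return 1
  fi
  return 0
}
```

For example, `check_umls_dir <path to umls> && data-collection/scripts/prepare-dataset-ND.sh -m data/input/ data/ND.mesh.targets <path to umls> data/SemGroups.txt` runs the script only when the check passes.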