Convert the Medline MeSH descriptors by PMID to DCM (Doc-Concept Matrix) format
Overview
Instructions for converting the list of Medline MeSH descriptors by PMID to the DCM format, usable by other parts of tdc-tools.
Requirements
The input data is the list of MeSH descriptors by PMID.
- The file mesh-descriptors-by-pmid.tsv can be obtained with the KD fork, as described in the documentation here.
- The duplicates should be removed using the KD Tools repository. This process is described here and summarized below:
extract-non-latest-pmid-versions.pl mesh-descriptors-by-pmid.tsv non-latest-pmid-versions.tsv
echo mesh-descriptors-by-pmid.tsv | ../kd-data-tools/bin/discard-non-latest-pmid-versions.pl -c 3 non-latest-pmid-versions.tsv output
mv output/mesh-descriptors-by-pmid.tsv mesh-descriptors-by-pmid.deduplicated.tsv
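As an optional sanity check after deduplication, the following minimal sketch reports any PMID that still appears on more than one line. It assumes the file is tab-separated with the PMID in the first column, as in the format described in this document. The function name `find_duplicate_pmids` is hypothetical and not part of tdc-tools or the KD Tools repository.

```python
from collections import Counter


def find_duplicate_pmids(lines, pmid_column=0):
    """Return the sorted PMIDs that occur on more than one line.

    Assumption: each line is tab-separated and the PMID is in the
    first column (index 0).
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line:  # skip blank lines
            counts[line.split("\t")[pmid_column]] += 1
    return sorted(pmid for pmid, n in counts.items() if n > 1)


# Small made-up sample: PMID 111 appears with two versions.
sample = [
    "111\t2001\t1\tSome journal\tSome title\tD005845|N\n",
    "111\t2001\t2\tSome journal\tSome title\tD005845|N\n",
    "222\t2003\t1\tOther journal\tOther title\tD006268|Y\n",
]
duplicates = find_duplicate_pmids(sample)
print(duplicates)

# On the real data (file name taken from the commands above):
# with open("mesh-descriptors-by-pmid.deduplicated.tsv", encoding="utf-8") as f:
#     print(find_duplicate_pmids(f))
```

An empty list on the deduplicated file would suggest that only one version of each PMID remains.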
Format of the file mesh-descriptors-by-pmid.deduplicated.tsv
(copied from here)
<pmid> <year> <pmid version> <journal> <title> <mesh list>
Where <mesh list> is a comma-separated list of MeSH descriptors, each followed by its ‘MajorTopicYN’ value (separated by |). Example:
D005845|N,D006268|Y,D006273|Y,D006739|Y,D006786|N,D014481|N
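To illustrate the format, here is a minimal sketch that splits a <mesh list> value into (descriptor, major topic) pairs. The function name `parse_mesh_list` is hypothetical and not part of tdc-tools. The sketch assumes that every item has exactly the form `descriptor|Y` or `descriptor|N`.

```python
def parse_mesh_list(mesh_list):
    """Split a <mesh list> field into (descriptor, is_major_topic) pairs."""
    pairs = []
    for item in mesh_list.strip().split(","):
        if not item:
            continue  # tolerate an empty field (no descriptors)
        descriptor, major = item.split("|")
        pairs.append((descriptor, major == "Y"))
    return pairs


example = "D005845|N,D006268|Y,D006273|Y,D006739|Y,D006786|N,D014481|N"
parsed = parse_mesh_list(example)

# Keep only the descriptors flagged as major topics.
major_topics = [descriptor for descriptor, is_major in parsed if is_major]
print(major_topics)  # ['D006268', 'D006273', 'D006739']
```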
Generating the DCM format
mkdir medline-mesh-descriptors
build-dcm-from-mesh-descriptors-by-pmid.py mesh-descriptors-by-pmid.deduplicated.tsv medline-mesh-descriptors/dcm
Temporary fix for the KD bug “season instead of year”
At the time of writing, a bug in the KD fork causes some entries to have a season name instead of a valid year.
If this bug has not been fixed yet, the following commands remove the erroneous entries from the data:
cd medline-mesh-descriptors/dcm
rm -f Autu fall Fall spri Spri summ Summ Wint
cd ../..
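Before deleting anything, it may help to list which entries of the dcm directory do not look like a year. The sketch below does that under the assumption that valid entries are named with a four-digit year. This assumption comes from the list of names above, not from the tdc-tools documentation. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a valid entry name is exactly four digits, e.g. "1998".
YEAR_PATTERN = re.compile(r"\d{4}")


def invalid_year_entries(names):
    """Return the sorted names that are not a four-digit year."""
    return sorted(name for name in names if not YEAR_PATTERN.fullmatch(name))


def invalid_year_entries_in(dcm_dir):
    """Apply the same check to the entries of a directory."""
    return invalid_year_entries(p.name for p in Path(dcm_dir).iterdir())


# Made-up directory listing for illustration.
listing = ["1999", "2001", "Autu", "fall", "Spri"]
suspicious = invalid_year_entries(listing)
print(suspicious)  # ['Autu', 'Spri', 'fall']

# On the real data (directory name taken from the commands above):
# print(invalid_year_entries_in("medline-mesh-descriptors/dcm"))
```

If the result contains names other than the ones in the rm command above, they are worth inspecting by hand before removal.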
Collecting the individual and joint frequency by concept
The process is the same as the one described here:
mkdir medline-mesh-descriptors/concept-freq
mkdir medline-mesh-descriptors/joint-freq
cd medline-mesh-descriptors/
for f in dcm/*; do echo "$f"; ls "$f" | get-frequency-from-doc-concept-matrix.py "concept-freq/$(basename "$f")"; done
for f in dcm/*; do echo "$f"; ls "$f" | get-frequency-from-doc-concept-matrix.py -j "joint-freq/$(basename "$f")"; done