tdc-tools

Tools for manipulating Tabular Document-Concept format

Collecting individual and joint frequency by concept

I. Individual frequency by concept

Introduction

Format

“Data views” (directory structure)

General: structure <level>/<view>/ where:

In every <level>/<view>/ directory, for every year there are two files:

File <year>

<year> <concept id> <doc frequency> <total frequency>

where:

Note: the number of lines is the number of unique concepts present in the year.

File <year>.total

<year> <nb unique concepts> <nb docs> <nb concept occurrences>
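
As a quick sanity check, a <year> file can be summarised and compared by eye with the matching <year>.total file. The sketch below is only an illustration, not part of tdc-tools: the function name `summarize_year_file` is hypothetical, the columns are assumed to be whitespace-separated, and the sample concept ids and counts are made up. The line count should equal the number of unique concepts (see the note above). That the sum of the total frequency column matches the number of concept occurrences is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: summarise a <year> file so it can be compared with <year>.total.
# Assumption: whitespace-separated columns in the order
#   <year> <concept id> <doc frequency> <total frequency>

summarize_year_file() {
    # Prints: <year> <nb lines> <sum of the total frequency column>
    awk '{ year = $1; n++; total += $4 }
         END { if (n > 0) print year, n, total }' "$1"
}

# Demo on made-up data (hypothetical concept ids and counts).
demo_dir=$(mktemp -d)
printf '2021\tC001\t3\t7\n2021\tC002\t1\t1\n2021\tC003\t2\t5\n' > "$demo_dir/2021"
summarize_year_file "$demo_dir/2021"
```

On real data the call would look like `summarize_year_file <level>/<view>/<year>`, with the placeholders replaced by an actual path inside the output directory.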

Usage

PTC data

get-frequency-from-doc-concept-matrix-all-variants.sh  PTC.dcm/ PTC.concept-freq

KD data

get-frequency-from-doc-concept-matrix-all-variants.sh  KD.dcm/ KD.concept-freq

Collecting data stats (optional)

For a single data dir:

collect-data-stats.sh PTC.concept-freq/ >17-PTC-2021/data-stats.tsv
../tdc-tools/code/collect-data-stats.sh KD.concept-freq >16-KD-2021/data-stats.tsv

For several data dirs:

collect-data-stats.sh PTC.concept-freq/ KD.concept-freq >18-contrast-method-experiments/global-stats.tsv

II. Joint frequency by pairs of concepts

Format

The output follows exactly the same directory structure as the individual frequency output (see above).

File <year>

<year> <concept1> <concept2> <joint freq>

where joint freq is the number of documents containing the pair of concepts.
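
The joint frequency files can be queried with standard text tools. The following sketch lists the pairs involving a given concept, most frequent first. It is an illustration under stated assumptions: the helper `top_pairs_for_concept` is hypothetical, the columns are assumed to be whitespace-separated, and the sample rows are made up. Because the order of the two concepts within a pair is not specified here, the filter matches the concept in either column.

```shell
#!/usr/bin/env bash
# Sketch: list the pairs involving one concept, sorted by joint frequency.
# Assumption: whitespace-separated columns in the order
#   <year> <concept1> <concept2> <joint freq>

top_pairs_for_concept() {
    # usage: top_pairs_for_concept <file> <concept id> [<max rows>]
    local file=$1 concept=$2 max=${3:-10}
    # Keep rows where the concept appears in either position,
    # print the joint frequency first so it can be used as the sort key.
    awk -v c="$concept" '$2 == c || $3 == c { print $4, $2, $3 }' "$file" |
        sort -k1,1nr | head -n "$max"
}

# Demo on made-up data (hypothetical concept ids and counts).
demo_dir=$(mktemp -d)
printf '2021\tC001\tC002\t4\n2021\tC001\tC003\t9\n2021\tC002\tC003\t2\n' > "$demo_dir/2021"
top_pairs_for_concept "$demo_dir/2021" C001
```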

Usage

In the examples below a list of 10 target concepts is used.

PTC

Running all the tasks at once:

get-frequency-from-doc-concept-matrix-all-variants.sh -p -j PTC.dcm/ PTC.joint-targets PTC.targets

real    365m33.960s
user    359m29.992s
sys     1m18.763s

KD

Printing the tasks for parallel execution:

rm -f KD.joint-targets
get-frequency-from-doc-concept-matrix-all-variants.sh -w -j KD.dcm/ KD.joint-targets KD.targets >tasks
mkdir -p jobs
split -a 4 -d -l 1 tasks jobs/job.
for f in jobs/job.[0-9][0-9][0-9][0-9]; do printf '#!/bin/bash\n#SBATCH -p compute\n\n' > "$f.sh"; cat "$f" >> "$f.sh"; done
for f in jobs/*.sh; do sbatch "$f"; done
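
Without a SLURM cluster, the same task list could be run locally in small batches. The sketch below is not part of tdc-tools. It assumes that every line of the task file is one complete, independent shell command, which is what the one-line-per-job split above suggests. The function name `run_tasks_locally` and the demo tasks are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run the commands of a task file locally, a few at a time.
# Assumption: each non-empty line is one self-contained shell command.

run_tasks_locally() {
    # usage: run_tasks_locally <tasks file> [<max parallel jobs>]
    local tasks_file=$1 max_jobs=${2:-4} running=0 task
    while IFS= read -r task; do
        [ -n "$task" ] || continue            # skip empty lines
        bash -c "$task" </dev/null &          # start the task in the background
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait                              # let the whole batch finish
            running=0
        fi
    done < "$tasks_file"
    wait                                      # wait for the last, partial batch
}

# Demo with made-up tasks that each write a small file.
demo_dir=$(mktemp -d)
for i in 1 2 3; do
    echo "echo task $i > '$demo_dir/out.$i'"
done > "$demo_dir/tasks"
run_tasks_locally "$demo_dir/tasks" 2
ls "$demo_dir"
```

Batching keeps the sketch portable across bash versions, at the cost of waiting for the slowest task of each batch. With the task file produced above, the call would be `run_tasks_locally tasks 4`.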