medline-discoveries: detecting surges (main documentation)

Overview

This is the main documentation for the Medline Discoveries repository, the companion code for the paper “Mining impactful discoveries from the biomedical literature” (TODO: link).

There are three parts in the documentation:

  • The data collection process describes how the input data was created.
  • The “Detecting surges” main documentation (this document) describes how to generate the output data (i.e. detecting surges) using the provided R implementation.
  • The “Analysis” part presents additional analyses of the results, including how to generate the graphs and tables found in the paper.

Rmd Options

This document was generated from an R Markdown source file. The source file is provided in the repository, allowing full reproducibility of the experiments presented in the paper. It can be executed through the RStudio interface (“knit” button) or as follows:

rmarkdown::render('1-detecting-surges.Rmd')

The .Rmd source document can be configured by modifying the following lines:

# input directory
dataPath <- 'data/input'
# output directory
outputDir <- 'data/output'

# by default using the Medline ND subset:
data.suffix <- '.min100.ND'
# uncomment the following line in order to use the full Medline data instead:
# data.suffix <- '.min100'

# set to TRUE for quick execution with a random subset of the data
shortVersion <- FALSE
  • Note: using the full Medline data with data.suffix <- '.min100' requires at least 128GB RAM.

Requirements

Software

This code requires R and a few R libraries:

  • data.table
  • ggplot2
  • plyr
  • cowplot
  • scales
  • knitr and rmarkdown (only needed for executing the Rmd source files)
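
If needed, the dependencies can be installed from CRAN:

install.packages(c('data.table','ggplot2','plyr','cowplot','scales','knitr','rmarkdown'))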

Data

The input data used for the experiment described in the paper can be obtained from https://zenodo.org/record/5888572. The process used to collect this data is described in the data collection part.

By default the input data is expected in the subdirectory data/input.
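
For example, the data can be downloaded and unpacked from within R (a hypothetical sketch; the actual archive name must be taken from the Zenodo record):

dir.create('data/input', recursive=TRUE, showWarnings=FALSE)
# replace <archive> with the actual file name listed on the Zenodo record:
# download.file('https://zenodo.org/record/5888572/files/<archive>', destfile='<archive>')
# untar('<archive>', exdir='data')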

Input data format

The input data format is purposefully simple in order to permit the use of the software with different sources of data. There are two “variants” of the data:

  • The dynamic data contains the frequency value for every year.
  • The static data contains the cumulative frequency across all years.
    • By default located in a subdirectory static.

Both are made of two files, one for the individual frequency by concept and the other for the joint frequency by pair of concepts. The format for each file is as follows:

  • indiv.<suffix>: <year> <concept> <frequency> <multi frequency>
    • The last column <multi frequency> is ignored.
  • joint.<suffix>: <year> <concept1> <concept2> <frequency>
  • static/indiv.<suffix>: <concept> <frequency> [multi frequency] <term> <group>
    • The last two columns are required for displaying the name of the concept/relation or filtering by semantic group.
  • static/joint.<suffix>: <concept1> <concept2> <frequency>

Additionally the two corresponding “totals” files are needed:

  • indiv.<suffix>.total: <year> <total unique concepts> <total documents> <total concepts occurrences>
  • static/indiv.<suffix>.total: <total documents> <total concepts occurrences>

In the joint files, every pair of concepts (A,B) is expected to appear at most once per year (i.e. not both as “A B” and “B A”). This condition can be satisfied by ordering the concepts of every pair alphabetically (A < B), as illustrated in the sketch below.
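
A minimal sketch of this canonicalization with data.table (a toy example with made-up identifiers; this preprocessing is assumed to have been applied when producing the input files):

library(data.table)
# toy joint data with the same pair appearing in both orders
joint <- data.table(year=c(1990,1990),
                    concept1=c('C0002736','C0001418'),
                    concept2=c('C0001418','C0002736'),
                    frequency=c(3,5))
# reorder each pair alphabetically, then merge the resulting duplicates
joint[, `:=`(a=pmin(concept1,concept2), b=pmax(concept1,concept2))]
joint <- joint[, .(frequency=sum(frequency)), by=.(year, concept1=a, concept2=b)]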

Main process

Note: the full process is described in detail below. Readers who are only interested in the simplest way to run the process can go directly to the last part “Processing multiple parameters and saving surges data”.

Initialization

The code can be loaded in the R interpreter with:

source('discoveries.R')

Loading the input data

dynamic_joint <- loadDynamicData(dir=dataPath,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dir=dataPath,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dir=dataPath)

The following can be used to select a random sample of relations:

dynamic_joint<-pickRandomDynamic(dynamic_joint,n=1000)

Statistics

  • Number of concepts:
nrow(unique(dynamic_indiv,by=key(dynamic_indiv)))
## [1] 1803
  • Number of relations:
nrow(unique(dynamic_joint,by=key(dynamic_joint)))
## [1] 290797
  • Number of rows, i.e. pairs (relation,year):
nrow(dynamic_joint)
## [1] 10711862

Preprocessing

This step fills the gap years (years in which a concept or relation does not occur) with 0 frequency values and calculates the moving average if needed.

  • This step is necessary even if the moving average is not used (default window of size 1).

Example with a moving average over a window of size 5:

relations.ma <- computeMovingAverage(dynamic_joint,dynamic_total, window=5)
indiv.ma <- computeMovingAverage(dynamic_indiv,dynamic_total, window=5)

New size:

nrow(relations.ma)
## [1] 13709214
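
As an illustration of the principle, a toy sketch of gap filling followed by a moving average (assuming a centered window; the repository function may differ in details):

# toy yearly frequencies, gap years already filled with zeros
years <- 1990:1999
freq  <- c(2, 0, 0, 1, 4, 0, 7, 0, 0, 3)
# centered moving average over a window of size 5 (NA at the edges)
ma <- as.numeric(stats::filter(freq, rep(1/5, 5), sides=2))
data.frame(year=years, freq=freq, ma=ma)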

Calculating a set of measures by year

This step calculates the measures used as a basis for calculating the trend (next step). One or several measures can be calculated.

Available measures:

  • prob.joint is the simple joint probability (it is already calculated from the previous step but this step is still recommended for consistency).
  • pmi and npmi: Pointwise Mutual Information and its normalized variant.
  • mi and nmi: “binary” Mutual Information and its normalized variant. The events considered are simply based on whether each concept is present or not (hence the word “binary”).
  • scp: Symmetric Conditional Probability.
  • pmi2 and pmi3: the PMI² and PMI³ variants of PMI.
Example:

rel.measures <- addDynamicAssociationToRelations(relations.ma, indiv.ma, measures=c('prob.joint','pmi','nmi'))
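
For reference, the standard definitions of some of these measures as small R functions (the log base and any smoothing may differ from the repository implementation):

pmi  <- function(p.xy, p.x, p.y) log(p.xy / (p.x * p.y))
npmi <- function(p.xy, p.x, p.y) pmi(p.xy, p.x, p.y) / (-log(p.xy))
scp  <- function(p.xy, p.x, p.y) p.xy^2 / (p.x * p.y)
pmi(0.001, 0.02, 0.03)   # > 0: the pair cooccurs more often than expected by chance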

Calculating the trend for every year and every relation

This step calculates the trend for every year and every relation.

  • The input data table is modified in place for the sake of efficiency. It contains additional columns after executing the function, but the number of rows is not modified.
  • The trend is calculated using the column provided with the argument measure and stored in a new column trend.
  • The indicator argument determines how the trend value is calculated:
    • rate (default) is the relative rate: \(\frac{p_{y}-p_{y-1}}{p_{y-1}}\). Experimental results show that this indicator performs very poorly.
    • diff is the simple difference: \(p_{y}-p_{y-1}\)

Example:

computeTrend(rel.measures, indicator='diff', measure='nmi')
  • Note: there is no need to store the output data table since the data table is modified by reference.
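
Conceptually, the diff indicator amounts to the following (a toy sketch with hypothetical column names, not the repository code):

library(data.table)
dt <- data.table(rel=rep('C0001418-C0002736',3), year=1998:2000, nmi=c(0.10,0.12,0.20))
setkey(dt, rel, year)
# trend in year y = measure in year y minus measure in year y-1
dt[, trend := nmi - shift(nmi), by=rel]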

Detecting surges

For every relation, this step marks as a surge every year in which the trend value is higher than some threshold \(t\).

\(t\) can be a custom threshold or defined with one of the two proposed methods:

  • method 1 uses the standard outlier threshold calculated with the inter-quartile range: \(t = Q_3 + 3 \times IQR\) (a sketch is given after this list).
  • method 2 (recommended) uses the inflection point in the quantile graph (see details in the paper).
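
A sketch of the method 1 threshold (equivalent in spirit to calculateThresholdTopOutliers, which may differ in details):

# t = Q3 + 3 * IQR, computed over all trend values
topOutlierThreshold <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm=TRUE)
  unname(q[2] + 3 * (q[2] - q[1]))
}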

The input data table is modified in place for the sake of efficiency.

# threshold <- calculateThresholdTopOutliers(rel.measures$trend)
threshold <- calculateThresholdInflectionPoint(rel.measures$trend)
print(threshold)
## [1] 0.003574462
detectSurges(rel.measures, globalThreshold=threshold)

Calculate ‘first year’ information (optional)

  • Given a relation between two concepts c1 and c2 which appear for the first time in years \(y_1\) and \(y_2\) respectively, the earliest possible year for a cooccurrence (and consequently for a surge) is \(max(y_1, y_2)\). This value can be calculated and added to the data table as column year.first.both, as shown below; the difference between the surge year and this year is added as column duration.
  • The year of the first cooccurrence between the two concepts can also be added, as column year.first.joint; the difference between the surge year and this year is added as column duration.joint.
rel.measures<-calculateDiffYears(rel.measures,dynamic_indiv,dynamic_joint)
  • Note: if the last argument dynamic_joint is not provided, only year.first.both is added.
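
Conceptually, the year.first.both column can be derived as follows (a sketch assuming hypothetical column names concept and year, not the repository code):

library(data.table)
# first year of occurrence of each concept
firsts <- dynamic_indiv[, .(year.first = min(year)), by=concept]
# for a pair (c1,c2), the earliest possible cooccurrence year is
# max(year.first of c1, year.first of c2)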

Adjusting the surge year in the sliding window (optional)

Using a moving average window (size greater than 1) can cause the surge year to be detected too early. This can lead to meaningless results, in particular when the detected surge year has no cooccurrence at all. This step calculates, for every year and every relation, the next year with a non-zero frequency. This “adjusted year” can be used to replace the surge year in some applications.

This step is optional.

rel.measures <- addNextNonZeroYear(rel.measures)
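
A toy illustration of the “next non-zero year” logic (not the repository code, which may differ e.g. on whether the current year is included):

years <- 1990:1995
freq  <- c(0, 0, 3, 0, 2, 0)
# for each year, the next year (itself included) with a non-zero frequency
next.nonzero <- sapply(seq_along(freq), function(i) {
  j <- which(freq[i:length(freq)] > 0)
  if (length(j) == 0) NA else years[i + j[1] - 1]
})
data.frame(year=years, freq=freq, next.nonzero)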

Surge statistics

surges_stats <- countSurgesByRelation(rel.measures)
kable(surges_stats[n.surges<=10,])
| n.surges|      n|      prop|
|--------:|------:|---------:|
|        0| 280078| 0.9631392|
|        1|   4595| 0.0158014|
|        2|   1941| 0.0066748|
|        3|   1154| 0.0039684|
|        4|    773| 0.0026582|
|        5|    622| 0.0021389|
|        6|    421| 0.0014477|
|        7|    296| 0.0010179|
|        8|    213| 0.0007325|
|        9|    185| 0.0006362|
|       10|    115| 0.0003955|
ggplot(surges_stats,aes(n.surges,prop))+geom_col()
ggplot(surges_stats,aes(n.surges,prop))+geom_col()

Processing multiple parameters and saving surges data

A convenience function is provided which computes the surges based on multiple parameters and saves the resulting data to a file. This function encapsulates all the steps presented above.

system.time(computeAndSaveSurgesData(dataPath, outputDir=outputDir, suffix=data.suffix,
                                     ma_windows=c(1,3,5),
                                     measures=c('prob.joint','pmi','npmi','mi','nmi','scp'),
                                     indicators=c('rate','diff')))
## [1] "processing and saving to data/output/prob.joint.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.diff.5.min100.ND.tsv"
##     user   system  elapsed 
## 2667.165  167.954 2836.279
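
The resulting .tsv files can then be loaded for further analysis, e.g. (file name taken from the log above):

surges <- data.table::fread('data/output/nmi.diff.5.min100.ND.tsv')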