medline-discoveries: analysis

Overview

This document is part of the documentation for the Medline Discoveries repository, which is the companion code for the paper “Mining impactful discoveries from the biomedical literature” TODO link.

  • The main documentation describes how to generate the output data (i.e. detecting surges) using the provided R implementation.
  • This part of the documentation covers additional analysis of the results, including how the graphs and tables presented in the paper are obtained.

Rmd Options

This document was generated from an R Markdown source file. The source file is provided in the repository, allowing full reproducibility of the experiments presented in the paper. It can be executed through the RStudio interface (“knit” button) or as follows:

rmarkdown::render('2-analysis.Rmd')

The .Rmd source document can be configured by modifying the following lines:

# directory for the input data (Medline frequency data)
dataPathInput <- 'data/input'
dataPathInputStatic <- paste(dataPathInput,'static',sep='/')
# directory for the output data (extracted surges from the main process in step 1)
dataPathOutput <- 'data/output'

# by default using the Medline ND subset:
# the static suffix is different from the dynamic one for some reason (sorry for that)
data.suffix <- '.min100.ND'
# uncomment the following line in order to use the full Medline data instead (not tested with this Rmd!):
# data.suffix <- '.min100'

# set to TRUE for quick execution with a random subset of the data
shortVersion <- FALSE
# set to TRUE in order to save the graphs used in the paper
savePaperGraphs <- TRUE

Notes:

  • If the option shortVersion is set to FALSE, the process requires at least 128GB RAM.
  • Note: the option to use the full Medline data (data.suffix <- '.min100') has not been tested with this Rmd document.
source('discoveries.R')

Requirements

The software requirements are the same as for the first part

This part requires the same input data as in the first part, but also the output generated as a result of applying the process. Both datasets can be downloaded from https://zenodo.org/record/5888572.

By default the input data is expected in the subdirectory data/input and the output data (unsurprisingly) in data/output.

Loading

dynamic_joint <- loadDynamicData(dataPathInput,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dataPathInput,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dataPathInput)
static_data <- loadStaticData(dir=dataPathInputStatic,suffix=data.suffix)

Selecting some of the specific concepts presented in the paper

als <- filterConcepts(dynamic_joint, 'D000690')
selectedALS <- filterConcepts(als,c('D051379','D000073885','D000870','D019782','D008875'))
parkinson <- filterConcepts(dynamic_joint, 'D010300')
caseParkinson <- filterConcepts(parkinson, 'D013378')

Selecting a random subset

Note: in some of the graphs (trend distribution), the representation of all the points is very computationally heavy, causing extremely long duration even to just load the pdf file. This is why a random subset of relations is used even in the case where shortVersion is set to FALSE.

if (shortVersion) {
  dynamic_joint<-pickRandomDynamic(dynamic_joint,n=1000)
  dynamic_joint.rndsubset<-dynamic_joint
} else {
  dynamic_joint.rndsubset<-pickRandomDynamic(dynamic_joint,n=20000)
}
## [1] "selecting among n pairs =  290797"

In order to avoid cases with insufficient data, it is possible to filter out rows with less than some minimum frequency \(m\) (the relation must have at least one year with a frequency higher than \(m\)). It is also possible to specify the number of cases to pick:

three_relations<-pickRandomDynamic(dynamic_joint,n=3, minFreqYear = 200)
## [1] "selecting among n pairs =  25266"

Displaying relations frequency across time with their concepts names

Using the relations selected randomly above:

displaySeveralPairsData(three_relations,static_data,dynamic_total)
## Warning: Removed 51 rows containing missing values (position_stack).
## Warning: Removed 5 rows containing missing values (geom_col).
## Warning: Removed 2 rows containing missing values (geom_path).

Displaying the ALS cases

g<-displaySeveralPairsData(selectedALS,static_data,dynamic_total,excludeConceptsFromName = 'D000690',ncol=5)
g
## Warning: Removed 94 rows containing missing values (position_stack).
## Warning: Removed 8 rows containing missing values (geom_col).
## Warning: Removed 2 rows containing missing values (geom_path).

ggsave('expl1.pdf',g)
## Saving 10 x 4 in image
## Warning: Removed 94 rows containing missing values (position_stack).
## Warning: Removed 8 rows containing missing values (geom_col).
## Warning: Removed 2 rows containing missing values (geom_path).

Displaying measures

g <- displayMultiMeasure(selectedALS,dynamic_indiv,dynamic_total,static_data,windows=c(1,3,5),measures=c('prob.joint','pmi','npmi','mi','nmi','scp'),excludeConceptsFromName = 'D000690')
g
## Warning: Removed 6 rows containing missing values (geom_path).

ggsave('expl-measures.pdf',g+ theme(text=element_text(size=14),legend.position="none",axis.title.x=element_blank(),axis.title.y=element_blank()),width=20,height=16,units = "cm")
## Warning: Removed 6 rows containing missing values (geom_path).

Study of the trend parameters

Using simple frequency, indicator ‘rate’

displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='rate',measure = 'prob.joint',excludeConceptsFromName = 'D000690')
## Warning: Removed 182 rows containing missing values (position_stack).
## Warning: Removed 16 rows containing missing values (geom_col).

Using NMI, indicator ‘rate’

displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='rate',measure = 'nmi',excludeConceptsFromName = 'D000690')
## Warning: Removed 182 rows containing missing values (position_stack).
## Warning: Removed 16 rows containing missing values (geom_col).

Using NMI, indicator ‘diff’

displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='diff',measure = 'nmi',excludeConceptsFromName = 'D000690')
## Warning: Removed 182 rows containing missing values (position_stack).
## Warning: Removed 16 rows containing missing values (geom_col).

Examples for indicators and surges

This graph shows the surge year(s) in the plot of the probability across years (this is not very useful actually).

Using simple frequency

g <- displaySurges(caseParkinson,dynamic_indiv,dynamic_total,static_data,windows=c(1,3), indicators=c('rate','diff'),withTitle = TRUE)
g
## Warning: Removed 4 rows containing missing values (position_stack).

Using NMI

g <- displaySurges(caseParkinson,dynamic_indiv,dynamic_total,static_data,valueCol='nmi',windows=c(1,3,5), indicators=c('rate','diff'),withTitle = TRUE)
g
## Warning: Removed 12 rows containing missing values (position_stack).

# saving the graph as pdf:
# ggsave('expl1.pdf',g)

Displaying the trend distribution

‘Standard’ version

relations.ma <- computeMovingAverage(dynamic_joint.rndsubset,dynamic_total, window=5)
indiv.ma <- computeMovingAverage(dynamic_indiv,dynamic_total, window=5)

The following graphs show different versions of the histogram of the trend values: with/without logarithmic scale on the X and/or Y axis.

The vertical lines show the outlier threshold calculated as \(Q_3+k \times IQR\), where \(k\) is either 1.5 or 3 (“far outliers”).

l<-displayTrendDistribution1(relations.ma,indiv.ma)
l$x0.y0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1290146 rows containing non-finite values (stat_bin).

l$x0.y1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1290146 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 137 rows containing missing values (geom_bar).

l$x1.y0
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5301922 rows containing non-finite values (stat_bin).

l$x1.y1
## Warning in self$trans$transform(x): NaNs produced

## Warning in self$trans$transform(x): Transformation introduced infinite values in
## continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5301922 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 14 rows containing missing values (geom_bar).

Quantile plot (“flat” distribution)

These graphs show the “flat” distribution (probably not a standard name):

  • The points are sorted by trend and their relative rank (rank divided by number of points) is plotted as X. In other words, the X axis represents all the quantile positions with respect to trend.
  • The trend value shown on the Y axis is normalized between [0,1] for every measure.

Note that the random selection can affect the graph significantly since the extreme points are rare.

#displayTrendDistribution2(relations.ma,indiv.ma,withRect = FALSE)
g<-displayTwoPlotsDistribTrend(relations.ma,indiv.ma)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 36 rows containing missing values (geom_bar).
g

Note: the pdf format makes the graph take a very long time to load, this is why png is recommended for this graph.

ggsave('expl-trend-distrib.pdf',g,width=20,height=8,units = "cm")
ggsave('expl-trend-distrib.png',g,width=20,height=8,units = "cm")

Evaluation

Gold standard data

Below the gold-standard dataset presented in the paper is loaded and displayed:

gold <- loadGoldDiscoveries(dir='data',file='ND-discoveries-year.tsv')
kable(gold)
c1 c2 term.c1 group.c1 term.c2 group.c2 year
D002894 D006816 Chromosomes, Human, Pair 4 NA Huntington Disease DISO 1983
D000071617 D010300 Protein Deglycase DJ-1 CHEM Parkinson Disease DISO 2003
D000690 D057180 Amyotrophic Lateral Sclerosis DISO Frontotemporal dementia DISO 2009
D000073885 D057180 C9orf72 Protein CHEM Frontotemporal dementia DISO 2009
D000073885 D000690 C9orf72 Protein CHEM Amyotrophic Lateral Sclerosis DISO 2009
D000072105 D000690 superoxide dismutase 1 CHEM Amyotrophic Lateral Sclerosis DISO 1993
D000068836 D000544 rivastigmine CHEM Alzheimer’s Disease DISO 1997
D007980 D010300 levodopa CHEM Parkinson Disease DISO 1961
D007980 D020734 levodopa CHEM Parkinsonian Disorders DISO 1961
D000544 D001714 Alzheimer’s Disease DISO Bipolar Disorder DISO 2019
D004298 D010300 dopamine CHEM Parkinson Disease DISO 1971
D004298 D012559 dopamine CHEM Schizophrenia DISO 1977

Parameters

if (shortVersion) {
  my.measures <- c('prob.joint','pmi','nmi','scp')
  my.ma <- c(1,5)
  my.indicators <- 'diff'
} else {
  my.measures <- c('prob.joint','pmi','npmi','mi','nmi','scp')
  my.ma <- c(1,3,5)
  my.indicators <- c('diff',"rate")
}
data.eval <- collectEvalDataSurgesAgainstGold(gold, dir=dataPathOutput,suffix=data.suffix,evalAt=NA,ma_windows=my.ma,measures=my.measures,indicators = my.indicators)
## [1] "data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.1.min100.ND.tsv"
## [1] "data/output/pmi.diff.1.min100.ND.tsv"
## [1] "data/output/pmi.rate.1.min100.ND.tsv"
## [1] "data/output/npmi.diff.1.min100.ND.tsv"
## [1] "data/output/npmi.rate.1.min100.ND.tsv"
## [1] "data/output/mi.diff.1.min100.ND.tsv"
## [1] "data/output/mi.rate.1.min100.ND.tsv"
## [1] "data/output/nmi.diff.1.min100.ND.tsv"
## [1] "data/output/nmi.rate.1.min100.ND.tsv"
## [1] "data/output/scp.diff.1.min100.ND.tsv"
## [1] "data/output/scp.rate.1.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.3.min100.ND.tsv"
## [1] "data/output/pmi.diff.3.min100.ND.tsv"
## [1] "data/output/pmi.rate.3.min100.ND.tsv"
## [1] "data/output/npmi.diff.3.min100.ND.tsv"
## [1] "data/output/npmi.rate.3.min100.ND.tsv"
## [1] "data/output/mi.diff.3.min100.ND.tsv"
## [1] "data/output/mi.rate.3.min100.ND.tsv"
## [1] "data/output/nmi.diff.3.min100.ND.tsv"
## [1] "data/output/nmi.rate.3.min100.ND.tsv"
## [1] "data/output/scp.diff.3.min100.ND.tsv"
## [1] "data/output/scp.rate.3.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.5.min100.ND.tsv"
## [1] "data/output/pmi.diff.5.min100.ND.tsv"
## [1] "data/output/pmi.rate.5.min100.ND.tsv"
## [1] "data/output/npmi.diff.5.min100.ND.tsv"
## [1] "data/output/npmi.rate.5.min100.ND.tsv"
## [1] "data/output/mi.diff.5.min100.ND.tsv"
## [1] "data/output/mi.rate.5.min100.ND.tsv"
## [1] "data/output/nmi.diff.5.min100.ND.tsv"
## [1] "data/output/nmi.rate.5.min100.ND.tsv"
## [1] "data/output/scp.diff.5.min100.ND.tsv"
## [1] "data/output/scp.rate.5.min100.ND.tsv"
res.eval<-evalSurgessAgainstGold(data.eval,eval_windows = c(1,3,5))

Top individual configurations

eval3 <- res.eval[eval.window==3 & mode=='first.year',]
kable(head(eval3[order(-perf),],12))
ma.window measure indicator mode eval.at perf eval.window
5 scp diff first.year NA 0.6666667 3
1 nmi diff first.year NA 0.5833333 3
1 scp diff first.year NA 0.5833333 3
3 nmi diff first.year NA 0.5833333 3
3 scp diff first.year NA 0.5833333 3
5 nmi diff first.year NA 0.5833333 3
5 npmi diff first.year NA 0.5833333 3
3 npmi diff first.year NA 0.5000000 3
5 prob.joint rate first.year NA 0.4166667 3
1 npmi diff first.year NA 0.2500000 3
3 prob.joint rate first.year NA 0.2500000 3
1 prob.joint rate first.year NA 0.1666667 3

Comparison by parameter

g<-displayAvgPerfByParam(res.eval,evalWindow = 3)
g

ggsave('plot-params.pdf',g,width = 20,height=8,unit='cm')

Overlap by pair of configurations

The overlap coefficient is calculated based on the number of relations in common between the two configurations, where “in common” means that the relation appears in both lists with their surge year within 5 years of each other.

  • Note: in the code below the window of years (5) can be set to NA. In this case the overlap coefficient is only based on relations in common (no year comparison).
surges<-readMultipleSurgesFiles(measures=my.measures,indicator='diff',ma_windows = my.ma)
## [1] "data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "data/output/pmi.diff.1.min100.ND.tsv"
## [1] "data/output/npmi.diff.1.min100.ND.tsv"
## [1] "data/output/mi.diff.1.min100.ND.tsv"
## [1] "data/output/nmi.diff.1.min100.ND.tsv"
## [1] "data/output/scp.diff.1.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "data/output/pmi.diff.3.min100.ND.tsv"
## [1] "data/output/npmi.diff.3.min100.ND.tsv"
## [1] "data/output/mi.diff.3.min100.ND.tsv"
## [1] "data/output/nmi.diff.3.min100.ND.tsv"
## [1] "data/output/scp.diff.3.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "data/output/pmi.diff.5.min100.ND.tsv"
## [1] "data/output/npmi.diff.5.min100.ND.tsv"
## [1] "data/output/mi.diff.5.min100.ND.tsv"
## [1] "data/output/nmi.diff.5.min100.ND.tsv"
## [1] "data/output/scp.diff.5.min100.ND.tsv"
mat<-correlationMatrixMultiParam(surges, 5)
## 1: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 2: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 3: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 4: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 5: 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 6: 7 8 9 10 11 12 13 14 15 16 17 18 
## 7: 8 9 10 11 12 13 14 15 16 17 18 
## 8: 9 10 11 12 13 14 15 16 17 18 
## 9: 10 11 12 13 14 15 16 17 18 
## 10: 11 12 13 14 15 16 17 18 
## 11: 12 13 14 15 16 17 18 
## 12: 13 14 15 16 17 18 
## 13: 14 15 16 17 18 
## 14: 15 16 17 18 
## 15: 16 17 18 
## 16: 17 18 
## 17: 18
g<-heatMapCommonMatrix(mat)
g

ggsave('heatmap-overlap.pdf',g,width=20,height=15,unit='cm')

Distribution of surges across time

Duration between the year of the first cooccurrence and the surge year

surges.scp <-loadSurgesData(dir=dataPathOutput,suffix = data.suffix,ma_window = 5,measure='scp',indicator = 'diff')
## [1] "data/output/scp.diff.5.min100.ND.tsv"
surges.scp[,first.surge:=(year==min(year)),by=key(surges.scp)]
ggplot(surges.scp,aes(duration.joint,fill=first.surge))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Trend of surges by year

ggplot(surges.scp,aes(as.factor(year),trend))+geom_boxplot()+ylim(c(0,.02))+theme(axis.text.x=element_text(angle = -90, hjust = 0))
## Warning: Removed 2602 rows containing non-finite values (stat_boxplot).

First cooccurrences and surges

doublePlotAcrossTime(dynamic_joint, surges.scp,bins=68,fontsize=16)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 4 rows containing missing values (geom_bar).

Duration until cooccurrences and until first surge

first.surges.scp <- surges.scp[first.surge==TRUE,]
nrow(first.surges.scp)
## [1] 7514
g3 <- plotFirstOccIndivJoint(first.surges.scp)
g3

Statistics related to time:

print('average duration between concepts exist and first surge:')
## [1] "average duration between concepts exist and first surge:"
first.surges.scp[,mean(year-year.first.both)]
## [1] 17.88635
print('average duration between concepts exist and first cooccurrence:')
## [1] "average duration between concepts exist and first cooccurrence:"
first.surges.scp[,mean(year.first.joint-year.first.both)]
## [1] 10.30516
print('average duration between first cooccurrence and first surge:')
## [1] "average duration between first cooccurrence and first surge:"
first.surges.scp[,mean(year-year.first.joint)]
## [1] 7.581182
print('distribution (quantiles) for time before cooccurrence:')
## [1] "distribution (quantiles) for time before cooccurrence:"
quantile(first.surges.scp[,year.first.joint-year.first.both],probs=seq(0,1,.1))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    0    0    0    2    4    7   11   14   18   26   61
print('distribution (quantiles) for time before first surge:')
## [1] "distribution (quantiles) for time before first surge:"
quantile(first.surges.scp[,year-year.first.both],probs=seq(0,1,.1))
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##   -1    2    5    9   12   15   18   22   29   40   67
print('proportion of surges happening the first year the concepts both exist:')
## [1] "proportion of surges happening the first year the concepts both exist:"
nrow(first.surges.scp[year <= year.first.both,])/nrow(first.surges.scp)
## [1] 0.05270162
print('proportion of cooccurrences happening the first year the concepts both exist:')
## [1] "proportion of cooccurrences happening the first year the concepts both exist:"
nrow(first.surges.scp[year.first.joint == year.first.both,])/nrow(first.surges.scp)
## [1] 0.2102742
print('proportion of cooccurrences happening at the same time as first cooccurrence:')
## [1] "proportion of cooccurrences happening at the same time as first cooccurrence:"
nrow(first.surges.scp[year == year.first.joint,])/nrow(first.surges.scp)
## [1] 0.07040192
ggsave('durations.pdf',g3,width=20,height=12,unit='cm')

Combination 2

g <- doublePlotAcrossTime(dynamic_joint, surges.scp,addThird = g3)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 4 rows containing missing values (geom_bar).
g

ggsave('surges-by-year2.pdf',g,width=20,height=20,unit='cm')

Relation between surges and frequency

Here we observe how the joint frequency of the relation relates to the status as surge or not.

  • The joint frequency is taken globally across all years (static data)
  • The data is observed by unique relation (pair of concepts), i.e. a relation either has at least one surge or doesn’t (multiple surges not taken into account).

The first graph below shows the distribution of surges by frequency, i.e. how many relations with surge have a particular frequency \(x\).

The number of surges decreases when the frequency increases, but we can see below that this an effect of the higher number of relations with low frequency.

d<-surges.scp
d[,first.surge:=(year==min(year)),by=key(d)]
d<-d[first.surge==TRUE,]
x<-merge(d, static_data,suffixes=c('.dynamic','.static'))
ggplot(x,aes(freq.joint.static))+geom_histogram()+xlim(c(0,100000))+scale_y_log10(labels = scales::comma_format(accuracy=1,big.mark = ",", decimal.mark = "."))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 268 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

The following graph shows the number of relations with a surge among all the relations. Note that the Y axis is logarithmic, this makes the proportion of surges look higher. For instance the first bin contains a few thousands surges among 100 millions relations.

On this graph it can be observed that the proportion of surges is roughly stable with respect to frequency.

g1 <- surgesAmongRelationsByFreq(surges.scp,static_data)
g1
## Warning: Removed 498 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 10 rows containing missing values (geom_bar).

The following graph shows the proportion of relations with surges vs. non surges for every bin. It should be noted that by definition, this graph doesn’t represent the proportion of a bin among the whole data: the first bins (low frequency) contain a lot more relations than the others, as seen above.

The X axis is cut at 10k in order to show the increase of the proportion of surges as frequency increases.

g2 <- surgesAmongRelationsByFreq(surges.scp,static_data,asProportion = TRUE,freqmax = 10000,logY = FALSE)
g2
## Warning: Removed 8040 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).

g<-plot_grid(g1,g2,labels=NULL,ncol=2)
## Warning: Removed 498 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 10 rows containing missing values (geom_bar).
## Warning: Removed 8040 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
ggsave('surges-by-freq.pdf',g,width=20,height=8,unit='cm')

Qualitative results

Here we simply observe the top and bottom surges under various conditions, as described in the paper.

surges.scp[,first.surge:=(year==min(year)),by=key(surges.scp)]
surges.scp.terms <- addRelationName(surges.scp,static_data)

These are the top 12 relations by trend without any filtering:

kable(head(surges.scp.terms[order(-trend),.(year,c1,c2,group.c1,group.c2,relation)],12))
year c1 c2 group.c1 group.c2 relation
1967 D007980 D015259 CHEM CHEM PHYS levodopa - Dopaminergic Agents
1968 D017433 D017434 CONC CONC Protein Structure, Secondary - Tertiary Protein Structure
1984 D017510 D057927 PHYS PHEN CONC PHYS Protein Folding, Globular - Hydrophilicity
1971 D008279 D014057 PROC CONC PROC Magnetic Resonance Imaging - Cine-CT
1986 D020033 D054875 CHEM PHYS Protein Isoforms - Ubiquitination
1976 D011994 D049109 CHEM PHYS Recombinant Proteins - Cell Proliferation
1970 D015999 D016000 CONC PROC Multivariate Analysis - statistical cluster
1986 D017354 D017931 DISO CHEM Point Mutation - Oligonucleotide Primers
1983 D017510 D018360 PHYS PROC Protein Folding, Globular - Crystallography, X-Ray
1981 D017774 D056547 PHYS ANAT Long-Term Potentiation - CA1 field
1982 D017774 D056547 PHYS ANAT Long-Term Potentiation - CA1 field
1981 D018836 D042783 CHEM ANAT Inflammation Mediators - Endothelial Cells

Filtering by semantic group

The flitering by semantic group can be done as follows. The UMLS semantic groups are used (the abbreviated name, first column).

filtered <- selectRelationsGroups(surges.scp.terms, groups1=c('CONC'), groups2=c('DISO','LIVB','ANAT')) 

By default both selected groups are ‘DISO’,‘CHEM’,‘GENE’,‘ANAT’:

filtered <- selectRelationsGroups(surges.scp.terms)

Top 12 after filtering:

kable(head(filtered[order(-trend),.(year,c1,c2,group.c1,group.c2,relation)],12))
year c1 c2 group.c1 group.c2 relation
1986 D017354 D017931 DISO CHEM Point Mutation - Oligonucleotide Primers
1981 D018836 D042783 CHEM ANAT Inflammation Mediators - Endothelial Cells
1968 D001402 D013601 ANAT ANAT B-Lymphocytes - T-Lymphocyte
1962 D009414 D020932 CHEM CHEM Nerve Growth Factors - nerve growth factor
1968 D007155 D015551 CHEM DISO Biological Response Modifiers - Autoimmune Response
1974 D012097 D029721 CHEM CHEM Repressor Proteins - Drosophila Proteins
1959 D018797 D048868 CHEM CHEM Cell Cycle Proteins - Adaptor Proteins, Signal Transducing
1996 D051843 D051844 CHEM CHEM Synucleins - alpha-Synuclein
1955 D003342 D017072 ANAT ANAT Corpus striatum structure - Neostriatum
1976 D012097 D014157 CHEM CHEM Repressor Proteins - TRANSCRIPTION FACTOR
1954 D018797 D029721 CHEM CHEM Cell Cycle Proteins - Drosophila Proteins
1994 D020169 D053148 CHEM CHEM caspase - caspase-3

Filtering by conditional probability

Same relations as above with their conditional probabilities:

kable(head(filtered[order(-trend),.(year,c1,relation,prob.C1GivenC2,prob.C2GivenC1)],12))
year c1 relation prob.C1GivenC2 prob.C2GivenC1
1986 D017354 Point Mutation - Oligonucleotide Primers 0.5000000 1.0000000
1981 D018836 Inflammation Mediators - Endothelial Cells 1.0000000 1.0000000
1968 D001402 B-Lymphocytes - T-Lymphocyte 0.4000000 1.0000000
1962 D009414 Nerve Growth Factors - nerve growth factor 0.8461538 0.5789474
1968 D007155 Biological Response Modifiers - Autoimmune Response 0.3333333 1.0000000
1974 D012097 Repressor Proteins - Drosophila Proteins 0.5000000 1.0000000
1959 D018797 Cell Cycle Proteins - Adaptor Proteins, Signal Transducing 1.0000000 0.3333333
1996 D051843 Synucleins - alpha-Synuclein 1.0000000 0.6666667
1955 D003342 Corpus striatum structure - Neostriatum 1.0000000 0.3684211
1976 D012097 Repressor Proteins - TRANSCRIPTION FACTOR 0.3611111 0.7428571
1954 D018797 Cell Cycle Proteins - Drosophila Proteins 0.6666667 1.0000000
1994 D020169 caspase - caspase-3 1.0000000 0.6000000

Relations with a least one High conditional probability are often trivial, since one of the concepts almost always appears with the other one.

The following function filters out relations which have at least one conditional probability higher than the threshold. It also has an option to keep only the first surge for any relation.

filtered2<-filterCondProb(filtered,conditional.threshold = 0.6,onlyFirstSurge = TRUE)
kable(head(filtered2[order(-trend),.(year,c1,relation,trend)],12))
year c1 relation trend
1971 D016501 Neurites - nerve growth factor 0.2500000
1953 D000959 Antihypertensive Agents - reserpine 0.2051585
1969 D002795 Choline O-Acetyltransferase - Cholinergic Fibers 0.1666667
1961 D008562 Membrane Glycoproteins - Actin-Binding Protein 0.1666667
1978 D018836 Inflammation Mediators - Endothelial Cells 0.1666667
1981 D053318 Apolipoprotein E3 - Apolipoprotein E4 0.1666667
1962 D001379 azathioprine - Immunosuppressive Agents 0.1322751
1973 D003924 Diabetes Mellitus, Non-Insulin-Dependent - treatment failure 0.1250000
1972 D004268 DNA Helix Destabilizing Proteins - Nuclear Proteins 0.1111111
1972 D004268 DNA Helix Destabilizing Proteins - Repressor Proteins 0.1111111
1972 D009687 Nuclear Proteins - Repressor Proteins 0.1111111
1954 D011956 Receptors, Cell Surface - Drosophila Proteins 0.1111111

Bottom surges

kable(head(filtered2[order(-trend),.(year,c1,relation,trend)],12))
year c1 relation trend
1971 D016501 Neurites - nerve growth factor 0.2500000
1953 D000959 Antihypertensive Agents - reserpine 0.2051585
1969 D002795 Choline O-Acetyltransferase - Cholinergic Fibers 0.1666667
1961 D008562 Membrane Glycoproteins - Actin-Binding Protein 0.1666667
1978 D018836 Inflammation Mediators - Endothelial Cells 0.1666667
1981 D053318 Apolipoprotein E3 - Apolipoprotein E4 0.1666667
1962 D001379 azathioprine - Immunosuppressive Agents 0.1322751
1973 D003924 Diabetes Mellitus, Non-Insulin-Dependent - treatment failure 0.1250000
1972 D004268 DNA Helix Destabilizing Proteins - Nuclear Proteins 0.1111111
1972 D004268 DNA Helix Destabilizing Proteins - Repressor Proteins 0.1111111
1972 D009687 Nuclear Proteins - Repressor Proteins 0.1111111
1954 D011956 Receptors, Cell Surface - Drosophila Proteins 0.1111111

Comparison against the baseline

This part requires the full Medline data (not only the ND subset):

data.suffix <- '.min100'

surges.scp <- loadSurgesData(dir=dataPathOutput,suffix=data.suffix, measure='scp',indicator='diff',ma_window=5)
## [1] "data/output/scp.diff.5.min100.tsv"
dynamic_joint <- loadDynamicData(dir=dataPathInput,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dir=dataPathInput,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dir=dataPathInput)
static_data <- loadStaticData(dir=dataPathInputStatic,suffix=data.suffix)

Generating output relations to annotate

Important: the data used in the code was generated from the full input set of relations, i.e. with option data.suffix set to '.min100'.

The following code computes the full list of relations, similarly to the original time sliced evaluation method. This will be used as baseline.

relationsDT <- computeMovingAverage(dynamic_joint,dynamic_total, window=1)
indivDT <- computeMovingAverage(dynamic_indiv,dynamic_total, window=1)
allrels<-addDynamicAssociationToRelations(relationsDT,indivDT,measures = c('prob.joint'))
allrels<-allrels[freq.joint>0,]

The code below applies the same preprocessing steps to both the SCP relations and the baseline relations:

set.seed(2022) 
scp<-prepareCmpBaseline(surges.scp, static_data)
## [1] "size relsDT: 315418"
## [1] "size static: 1809993"
## [1] "after filter condi + FIRSTsurge + years:  18343"
## [1] "after filter groups:  9092"
scp <- scp[order(-trend),]
baseline<-prepareCmpBaseline(allrels,static_data)
## [1] "size relsDT: 65370033"
## [1] "size static: 1809993"
## [1] "after filter condi + FIRSTsurge + years:  497070"
## [1] "after filter groups:  108794"

We now create the dataset to annotate, containing both some SCP relations and some baseline relations:

  • x1/y1 contain a random subset of 100 relations for both methods
  • x2/y2 contain the top 100 relations for both methods

The 4 subsets are then concatenated and shuffled.

baseline<-head(baseline[order(-freq.joint),],nrow(scp))
x1<-scp[sample(.N,100),.(c1,c2,term.c1,term.c2,group.c1,group.c2,year)]
y1<-baseline[sample(.N,100), .(c1,c2,term.c1,term.c2,group.c1,group.c2,year)]
x2<-head(scp[,.(c1,c2,term.c1,term.c2,group.c1,group.c2,year)],100)
y2<-head(baseline[, .(c1,c2,term.c1,term.c2,group.c1,group.c2,year)],100)
res <- rbind(x1,y1,x2,y2)
res <- res[sample(.N),]
fwrite(res,'to-annotate.tsv',sep='\t')

Evaluation results

res <- eval.baseline( x1, y1, x2, y2, annotFile='data/annotated-relations.tsv') 
##     method precision
## 1 baseline 0.2985075
## 2      scr 0.4430380
##           
##            no yes
##   baseline 47  20
##   scr      44  35
## [1] "chisq.test p-val= 0.104261308387589"
##     method precision
## 1 baseline 0.1785714
## 2      scr 0.3827160
##           
##            no yes
##   baseline 69  15
##   scr      50  31
## [1] "chisq.test p-val= 0.00596135278801524"
kable(res)
method subset answer number
scr rnd yes 35
scr top yes 31
baseline rnd yes 20
baseline top yes 15
scr rnd no 44
scr top no 50
baseline rnd no 47
baseline top no 69
scr rnd maybe 21
scr top maybe 19
baseline rnd maybe 33
baseline top maybe 16