medline-discoveries: analysis

Overview

This document is part of the documentation for the Medline Discoveries repository, which is the companion code for the paper “Mining impactful discoveries from the biomedical literature” TODO link.

The main documentation describes how to generate the output data (i.e. detecting surges) using the provided R implementation.
This part of the documentation covers additional analysis of the results, including how the graphs and tables presented in the paper are obtained.

Rmd Options

This document was generated from an R Markdown source file. The source file is provided in the repository, allowing full reproducibility of the experiments presented in the paper. It can be executed through the RStudio interface (“knit” button) or as follows:

rmarkdown::render('2-analysis.Rmd')

The .Rmd source document can be configured by modifying the following lines:

# directory for the input data (Medline frequency data)
dataPathInput <- 'data/input'
dataPathInputStatic <- paste(dataPathInput,'static',sep='/')
# directory for the output data (extracted surges from the main process in step 1)
dataPathOutput <- 'data/output'

# by default using the Medline ND subset:
# the static suffix is different from the dynamic one for some reason (sorry for that)
data.suffix <- '.min100.ND'
# uncomment the following line in order to use the full Medline data instead (not tested with this Rmd!):
# data.suffix <- '.min100'

# set to TRUE for quick execution with a random subset of the data
shortVersion <- FALSE
# set to TRUE in order to save the graphs used in the paper
savePaperGraphs <- TRUE

Notes:

If the option shortVersion is set to FALSE, the process requires at least 128GB RAM.
Note: the option to use the full Medline data (data.suffix <- '.min100') has not been tested with this Rmd document.

source('discoveries.R')

Requirements

The software requirements are the same as for the first part

This part requires the same input data as in the first part, but also the output generated as a result of applying the process. Both datasets can be downloaded from https://zenodo.org/record/5888572.

By default the input data is expected in the subdirectory data/input and the output data (unsurprisingly) in data/output.

Loading

dynamic_joint <- loadDynamicData(dataPathInput,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dataPathInput,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dataPathInput)
static_data <- loadStaticData(dir=dataPathInputStatic,suffix=data.suffix)

Selecting some of the specific concepts presented in the paper

als <- filterConcepts(dynamic_joint, 'D000690')
selectedALS <- filterConcepts(als,c('D051379','D000073885','D000870','D019782','D008875'))
parkinson <- filterConcepts(dynamic_joint, 'D010300')
caseParkinson <- filterConcepts(parkinson, 'D013378')

Selecting a random subset

Note: in some of the graphs (trend distribution), the representation of all the points is very computationally heavy, causing extremely long duration even to just load the pdf file. This is why a random subset of relations is used even in the case where shortVersion is set to FALSE.

if (shortVersion) {
  dynamic_joint<-pickRandomDynamic(dynamic_joint,n=1000)
  dynamic_joint.rndsubset<-dynamic_joint
} else {
  dynamic_joint.rndsubset<-pickRandomDynamic(dynamic_joint,n=20000)
}

## [1] "selecting among n pairs =  290797"

In order to avoid cases with insufficient data, it is possible to filter out rows with less than some minimum frequency \(m\) (the relation must have at least one year with a frequency higher than \(m\)). It is also possible to specify the number of cases to pick:

three_relations<-pickRandomDynamic(dynamic_joint,n=3, minFreqYear = 200)

## [1] "selecting among n pairs =  25266"

Displaying relations frequency across time with their concepts names

Using the relations selected randomly above:

displaySeveralPairsData(three_relations,static_data,dynamic_total)

## Warning: Removed 51 rows containing missing values (position_stack).

## Warning: Removed 5 rows containing missing values (geom_col).

## Warning: Removed 2 rows containing missing values (geom_path).

Displaying the ALS cases

g<-displaySeveralPairsData(selectedALS,static_data,dynamic_total,excludeConceptsFromName = 'D000690',ncol=5)
g

## Warning: Removed 94 rows containing missing values (position_stack).

## Warning: Removed 8 rows containing missing values (geom_col).

## Warning: Removed 2 rows containing missing values (geom_path).

ggsave('expl1.pdf',g)

## Saving 10 x 4 in image

## Warning: Removed 94 rows containing missing values (position_stack).

## Warning: Removed 8 rows containing missing values (geom_col).

## Warning: Removed 2 rows containing missing values (geom_path).

Displaying measures

g <- displayMultiMeasure(selectedALS,dynamic_indiv,dynamic_total,static_data,windows=c(1,3,5),measures=c('prob.joint','pmi','npmi','mi','nmi','scp'),excludeConceptsFromName = 'D000690')
g

## Warning: Removed 6 rows containing missing values (geom_path).

ggsave('expl-measures.pdf',g+ theme(text=element_text(size=14),legend.position="none",axis.title.x=element_blank(),axis.title.y=element_blank()),width=20,height=16,units = "cm")

## Warning: Removed 6 rows containing missing values (geom_path).

Study of the trend parameters

Using simple frequency, indicator ‘rate’

displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='rate',measure = 'prob.joint',excludeConceptsFromName = 'D000690')

## Warning: Removed 182 rows containing missing values (position_stack).

## Warning: Removed 16 rows containing missing values (geom_col).

Using NMI, indicator ‘rate’

displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='rate',measure = 'nmi',excludeConceptsFromName = 'D000690')

## Warning: Removed 182 rows containing missing values (position_stack).

## Warning: Removed 16 rows containing missing values (geom_col).

Using NMI, indicator ‘diff’

displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='diff',measure = 'nmi',excludeConceptsFromName = 'D000690')

## Warning: Removed 182 rows containing missing values (position_stack).

## Warning: Removed 16 rows containing missing values (geom_col).

Examples for indicators and surges

This graph shows the surge year(s) in the plot of the probability across years (this is not very useful actually).

Using simple frequency

g <- displaySurges(caseParkinson,dynamic_indiv,dynamic_total,static_data,windows=c(1,3), indicators=c('rate','diff'),withTitle = TRUE)
g

## Warning: Removed 4 rows containing missing values (position_stack).

Using NMI

g <- displaySurges(caseParkinson,dynamic_indiv,dynamic_total,static_data,valueCol='nmi',windows=c(1,3,5), indicators=c('rate','diff'),withTitle = TRUE)
g

## Warning: Removed 12 rows containing missing values (position_stack).

# saving the graph as pdf:
# ggsave('expl1.pdf',g)

Displaying the trend distribution

‘Standard’ version

relations.ma <- computeMovingAverage(dynamic_joint.rndsubset,dynamic_total, window=5)
indiv.ma <- computeMovingAverage(dynamic_indiv,dynamic_total, window=5)

The following graphs show different versions of the histogram of the trend values: with/without logarithmic scale on the X and/or Y axis.

The vertical lines show the outlier threshold calculated as \(Q_3+k \times IQR\), where \(k\) is either 1.5 or 3 (“far outliers”).

l<-displayTrendDistribution1(relations.ma,indiv.ma)
l$x0.y0

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1290146 rows containing non-finite values (stat_bin).

l$x0.y1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1290146 rows containing non-finite values (stat_bin).

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 137 rows containing missing values (geom_bar).

l$x1.y0

## Warning in self$trans$transform(x): NaNs produced

## Warning: Transformation introduced infinite values in continuous x-axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5301922 rows containing non-finite values (stat_bin).

l$x1.y1

## Warning in self$trans$transform(x): NaNs produced

## Warning in self$trans$transform(x): Transformation introduced infinite values in
## continuous x-axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5301922 rows containing non-finite values (stat_bin).

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 14 rows containing missing values (geom_bar).

Quantile plot (“flat” distribution)

These graphs show the “flat” distribution (probably not a standard name):

The points are sorted by trend and their relative rank (rank divided by number of points) is plotted as X. In other words, the X axis represents all the quantile positions with respect to trend.
The trend value shown on the Y axis is normalized between [0,1] for every measure.

Note that the random selection can affect the graph significantly since the extreme points are rare.

#displayTrendDistribution2(relations.ma,indiv.ma,withRect = FALSE)
g<-displayTwoPlotsDistribTrend(relations.ma,indiv.ma)

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 36 rows containing missing values (geom_bar).

Note: the pdf format makes the graph take a very long time to load, this is why png is recommended for this graph.

ggsave('expl-trend-distrib.pdf',g,width=20,height=8,units = "cm")
ggsave('expl-trend-distrib.png',g,width=20,height=8,units = "cm")

Evaluation

Gold standard data

Below the gold-standard dataset presented in the paper is loaded and displayed:

gold <- loadGoldDiscoveries(dir='data',file='ND-discoveries-year.tsv')
kable(gold)

c1	c2	term.c1	group.c1	term.c2	group.c2	year
D002894	D006816	Chromosomes, Human, Pair 4	NA	Huntington Disease	DISO	1983
D000071617	D010300	Protein Deglycase DJ-1	CHEM	Parkinson Disease	DISO	2003
D000690	D057180	Amyotrophic Lateral Sclerosis	DISO	Frontotemporal dementia	DISO	2009
D000073885	D057180	C9orf72 Protein	CHEM	Frontotemporal dementia	DISO	2009
D000073885	D000690	C9orf72 Protein	CHEM	Amyotrophic Lateral Sclerosis	DISO	2009
D000072105	D000690	superoxide dismutase 1	CHEM	Amyotrophic Lateral Sclerosis	DISO	1993
D000068836	D000544	rivastigmine	CHEM	Alzheimer’s Disease	DISO	1997
D007980	D010300	levodopa	CHEM	Parkinson Disease	DISO	1961
D007980	D020734	levodopa	CHEM	Parkinsonian Disorders	DISO	1961
D000544	D001714	Alzheimer’s Disease	DISO	Bipolar Disorder	DISO	2019
D004298	D010300	dopamine	CHEM	Parkinson Disease	DISO	1971
D004298	D012559	dopamine	CHEM	Schizophrenia	DISO	1977

Parameters

if (shortVersion) {
  my.measures <- c('prob.joint','pmi','nmi','scp')
  my.ma <- c(1,5)
  my.indicators <- 'diff'
} else {
  my.measures <- c('prob.joint','pmi','npmi','mi','nmi','scp')
  my.ma <- c(1,3,5)
  my.indicators <- c('diff',"rate")
}
data.eval <- collectEvalDataSurgesAgainstGold(gold, dir=dataPathOutput,suffix=data.suffix,evalAt=NA,ma_windows=my.ma,measures=my.measures,indicators = my.indicators)

## [1] "data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.1.min100.ND.tsv"
## [1] "data/output/pmi.diff.1.min100.ND.tsv"
## [1] "data/output/pmi.rate.1.min100.ND.tsv"
## [1] "data/output/npmi.diff.1.min100.ND.tsv"
## [1] "data/output/npmi.rate.1.min100.ND.tsv"
## [1] "data/output/mi.diff.1.min100.ND.tsv"
## [1] "data/output/mi.rate.1.min100.ND.tsv"
## [1] "data/output/nmi.diff.1.min100.ND.tsv"
## [1] "data/output/nmi.rate.1.min100.ND.tsv"
## [1] "data/output/scp.diff.1.min100.ND.tsv"
## [1] "data/output/scp.rate.1.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.3.min100.ND.tsv"
## [1] "data/output/pmi.diff.3.min100.ND.tsv"
## [1] "data/output/pmi.rate.3.min100.ND.tsv"
## [1] "data/output/npmi.diff.3.min100.ND.tsv"
## [1] "data/output/npmi.rate.3.min100.ND.tsv"
## [1] "data/output/mi.diff.3.min100.ND.tsv"
## [1] "data/output/mi.rate.3.min100.ND.tsv"
## [1] "data/output/nmi.diff.3.min100.ND.tsv"
## [1] "data/output/nmi.rate.3.min100.ND.tsv"
## [1] "data/output/scp.diff.3.min100.ND.tsv"
## [1] "data/output/scp.rate.3.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.5.min100.ND.tsv"
## [1] "data/output/pmi.diff.5.min100.ND.tsv"
## [1] "data/output/pmi.rate.5.min100.ND.tsv"
## [1] "data/output/npmi.diff.5.min100.ND.tsv"
## [1] "data/output/npmi.rate.5.min100.ND.tsv"
## [1] "data/output/mi.diff.5.min100.ND.tsv"
## [1] "data/output/mi.rate.5.min100.ND.tsv"
## [1] "data/output/nmi.diff.5.min100.ND.tsv"
## [1] "data/output/nmi.rate.5.min100.ND.tsv"
## [1] "data/output/scp.diff.5.min100.ND.tsv"
## [1] "data/output/scp.rate.5.min100.ND.tsv"

res.eval<-evalSurgessAgainstGold(data.eval,eval_windows = c(1,3,5))

Top individual configurations

eval3 <- res.eval[eval.window==3 & mode=='first.year',]
kable(head(eval3[order(-perf),],12))

ma.window	measure	indicator	mode	eval.at	perf	eval.window
5	scp	diff	first.year	NA	0.6666667	3
1	nmi	diff	first.year	NA	0.5833333	3
1	scp	diff	first.year	NA	0.5833333	3
3	nmi	diff	first.year	NA	0.5833333	3
3	scp	diff	first.year	NA	0.5833333	3
5	nmi	diff	first.year	NA	0.5833333	3
5	npmi	diff	first.year	NA	0.5833333	3
3	npmi	diff	first.year	NA	0.5000000	3
5	prob.joint	rate	first.year	NA	0.4166667	3
1	npmi	diff	first.year	NA	0.2500000	3
3	prob.joint	rate	first.year	NA	0.2500000	3
1	prob.joint	rate	first.year	NA	0.1666667	3

Comparison by parameter

g<-displayAvgPerfByParam(res.eval,evalWindow = 3)
g

ggsave('plot-params.pdf',g,width = 20,height=8,unit='cm')

Overlap by pair of configurations

The overlap coefficient is calculated based on the number of relations in common between the two configurations, where “in common” means that the relation appears in both lists with their surge year within 5 years of each other.

Note: in the code below the window of years (5) can be set to NA. In this case the overlap coefficient is only based on relations in common (no year comparison).

surges<-readMultipleSurgesFiles(measures=my.measures,indicator='diff',ma_windows = my.ma)

## [1] "data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "data/output/pmi.diff.1.min100.ND.tsv"
## [1] "data/output/npmi.diff.1.min100.ND.tsv"
## [1] "data/output/mi.diff.1.min100.ND.tsv"
## [1] "data/output/nmi.diff.1.min100.ND.tsv"
## [1] "data/output/scp.diff.1.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "data/output/pmi.diff.3.min100.ND.tsv"
## [1] "data/output/npmi.diff.3.min100.ND.tsv"
## [1] "data/output/mi.diff.3.min100.ND.tsv"
## [1] "data/output/nmi.diff.3.min100.ND.tsv"
## [1] "data/output/scp.diff.3.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "data/output/pmi.diff.5.min100.ND.tsv"
## [1] "data/output/npmi.diff.5.min100.ND.tsv"
## [1] "data/output/mi.diff.5.min100.ND.tsv"
## [1] "data/output/nmi.diff.5.min100.ND.tsv"
## [1] "data/output/scp.diff.5.min100.ND.tsv"

mat<-correlationMatrixMultiParam(surges, 5)

## 1: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 2: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 3: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 4: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 5: 6 7 8 9 10 11 12 13 14 15 16 17 18 
## 6: 7 8 9 10 11 12 13 14 15 16 17 18 
## 7: 8 9 10 11 12 13 14 15 16 17 18 
## 8: 9 10 11 12 13 14 15 16 17 18 
## 9: 10 11 12 13 14 15 16 17 18 
## 10: 11 12 13 14 15 16 17 18 
## 11: 12 13 14 15 16 17 18 
## 12: 13 14 15 16 17 18 
## 13: 14 15 16 17 18 
## 14: 15 16 17 18 
## 15: 16 17 18 
## 16: 17 18 
## 17: 18

g<-heatMapCommonMatrix(mat)
g

ggsave('heatmap-overlap.pdf',g,width=20,height=15,unit='cm')

Distribution of surges across time

Duration between the year of the first cooccurrence and the surge year

surges.scp <-loadSurgesData(dir=dataPathOutput,suffix = data.suffix,ma_window = 5,measure='scp',indicator = 'diff')

## [1] "data/output/scp.diff.5.min100.ND.tsv"

surges.scp[,first.surge:=(year==min(year)),by=key(surges.scp)]

ggplot(surges.scp,aes(duration.joint,fill=first.surge))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Trend of surges by year

ggplot(surges.scp,aes(as.factor(year),trend))+geom_boxplot()+ylim(c(0,.02))+theme(axis.text.x=element_text(angle = -90, hjust = 0))

## Warning: Removed 2602 rows containing non-finite values (stat_boxplot).

First cooccurrences and surges

doublePlotAcrossTime(dynamic_joint, surges.scp,bins=68,fontsize=16)

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 4 rows containing missing values (geom_bar).

Duration until cooccurrences and until first surge

first.surges.scp <- surges.scp[first.surge==TRUE,]
nrow(first.surges.scp)

## [1] 7514

g3 <- plotFirstOccIndivJoint(first.surges.scp)
g3

Statistics related to time:

print('average duration between concepts exist and first surge:')

## [1] "average duration between concepts exist and first surge:"

first.surges.scp[,mean(year-year.first.both)]

## [1] 17.88635

print('average duration between concepts exist and first cooccurrence:')

## [1] "average duration between concepts exist and first cooccurrence:"

first.surges.scp[,mean(year.first.joint-year.first.both)]

## [1] 10.30516

print('average duration between first cooccurrence and first surge:')

## [1] "average duration between first cooccurrence and first surge:"

first.surges.scp[,mean(year-year.first.joint)]

## [1] 7.581182

print('distribution (quantiles) for time before cooccurrence:')

## [1] "distribution (quantiles) for time before cooccurrence:"

quantile(first.surges.scp[,year.first.joint-year.first.both],probs=seq(0,1,.1))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    0    0    0    2    4    7   11   14   18   26   61

print('distribution (quantiles) for time before first surge:')

## [1] "distribution (quantiles) for time before first surge:"

quantile(first.surges.scp[,year-year.first.both],probs=seq(0,1,.1))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##   -1    2    5    9   12   15   18   22   29   40   67

print('proportion of surges happening the first year the concepts both exist:')

## [1] "proportion of surges happening the first year the concepts both exist:"

nrow(first.surges.scp[year <= year.first.both,])/nrow(first.surges.scp)

## [1] 0.05270162

print('proportion of cooccurrences happening the first year the concepts both exist:')

## [1] "proportion of cooccurrences happening the first year the concepts both exist:"

nrow(first.surges.scp[year.first.joint == year.first.both,])/nrow(first.surges.scp)

## [1] 0.2102742

print('proportion of cooccurrences happening at the same time as first cooccurrence:')

## [1] "proportion of cooccurrences happening at the same time as first cooccurrence:"

nrow(first.surges.scp[year == year.first.joint,])/nrow(first.surges.scp)

## [1] 0.07040192

ggsave('durations.pdf',g3,width=20,height=12,unit='cm')

Combination 2

g <- doublePlotAcrossTime(dynamic_joint, surges.scp,addThird = g3)

## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 4 rows containing missing values (geom_bar).

ggsave('surges-by-year2.pdf',g,width=20,height=20,unit='cm')

Relation between surges and frequency

Here we observe how the joint frequency of the relation relates to the status as surge or not.

The joint frequency is taken globally across all years (static data)
The data is observed by unique relation (pair of concepts), i.e. a relation either has at least one surge or doesn’t (multiple surges not taken into account).

The first graph below shows the distribution of surges by frequency, i.e. how many relations with surge have a particular frequency \(x\).

The number of surges decreases when the frequency increases, but we can see below that this an effect of the higher number of relations with low frequency.

d<-surges.scp
d[,first.surge:=(year==min(year)),by=key(d)]
d<-d[first.surge==TRUE,]
x<-merge(d, static_data,suffixes=c('.dynamic','.static'))
ggplot(x,aes(freq.joint.static))+geom_histogram()+xlim(c(0,100000))+scale_y_log10(labels = scales::comma_format(accuracy=1,big.mark = ",", decimal.mark = "."))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 268 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

The following graph shows the number of relations with a surge among all the relations. Note that the Y axis is logarithmic, this makes the proportion of surges look higher. For instance the first bin contains a few thousands surges among 100 millions relations.

On this graph it can be observed that the proportion of surges is roughly stable with respect to frequency.

g1 <- surgesAmongRelationsByFreq(surges.scp,static_data)
g1

## Warning: Removed 498 rows containing non-finite values (stat_bin).

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 10 rows containing missing values (geom_bar).

The following graph shows the proportion of relations with surges vs. non surges for every bin. It should be noted that by definition, this graph doesn’t represent the proportion of a bin among the whole data: the first bins (low frequency) contain a lot more relations than the others, as seen above.

The X axis is cut at 10k in order to show the increase of the proportion of surges as frequency increases.

g2 <- surgesAmongRelationsByFreq(surges.scp,static_data,asProportion = TRUE,freqmax = 10000,logY = FALSE)
g2

## Warning: Removed 8040 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_bar).

g<-plot_grid(g1,g2,labels=NULL,ncol=2)

## Warning: Removed 498 rows containing non-finite values (stat_bin).

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 10 rows containing missing values (geom_bar).

## Warning: Removed 8040 rows containing non-finite values (stat_bin).

## Warning: Removed 4 rows containing missing values (geom_bar).

ggsave('surges-by-freq.pdf',g,width=20,height=8,unit='cm')

Qualitative results

Here we simply observe the top and bottom surges under various conditions, as described in the paper.

surges.scp[,first.surge:=(year==min(year)),by=key(surges.scp)]
surges.scp.terms <- addRelationName(surges.scp,static_data)

These are the top 12 relations by trend without any filtering:

kable(head(surges.scp.terms[order(-trend),.(year,c1,c2,group.c1,group.c2,relation)],12))

year	c1	c2	group.c1	group.c2	relation
1967	D007980	D015259	CHEM	CHEM PHYS	levodopa - Dopaminergic Agents
1968	D017433	D017434	CONC	CONC	Protein Structure, Secondary - Tertiary Protein Structure
1984	D017510	D057927	PHYS	PHEN CONC PHYS	Protein Folding, Globular - Hydrophilicity
1971	D008279	D014057	PROC CONC	PROC	Magnetic Resonance Imaging - Cine-CT
1986	D020033	D054875	CHEM	PHYS	Protein Isoforms - Ubiquitination
1976	D011994	D049109	CHEM	PHYS	Recombinant Proteins - Cell Proliferation
1970	D015999	D016000	CONC	PROC	Multivariate Analysis - statistical cluster
1986	D017354	D017931	DISO	CHEM	Point Mutation - Oligonucleotide Primers
1983	D017510	D018360	PHYS	PROC	Protein Folding, Globular - Crystallography, X-Ray
1981	D017774	D056547	PHYS	ANAT	Long-Term Potentiation - CA1 field
1982	D017774	D056547	PHYS	ANAT	Long-Term Potentiation - CA1 field
1981	D018836	D042783	CHEM	ANAT	Inflammation Mediators - Endothelial Cells

Filtering by semantic group

The flitering by semantic group can be done as follows. The UMLS semantic groups are used (the abbreviated name, first column).

filtered <- selectRelationsGroups(surges.scp.terms, groups1=c('CONC'), groups2=c('DISO','LIVB','ANAT'))

By default both selected groups are ‘DISO’,‘CHEM’,‘GENE’,‘ANAT’:

filtered <- selectRelationsGroups(surges.scp.terms)

Top 12 after filtering:

kable(head(filtered[order(-trend),.(year,c1,c2,group.c1,group.c2,relation)],12))

year	c1	c2	group.c1	group.c2	relation
1986	D017354	D017931	DISO	CHEM	Point Mutation - Oligonucleotide Primers
1981	D018836	D042783	CHEM	ANAT	Inflammation Mediators - Endothelial Cells
1968	D001402	D013601	ANAT	ANAT	B-Lymphocytes - T-Lymphocyte
1962	D009414	D020932	CHEM	CHEM	Nerve Growth Factors - nerve growth factor
1968	D007155	D015551	CHEM	DISO	Biological Response Modifiers - Autoimmune Response
1974	D012097	D029721	CHEM	CHEM	Repressor Proteins - Drosophila Proteins
1959	D018797	D048868	CHEM	CHEM	Cell Cycle Proteins - Adaptor Proteins, Signal Transducing
1996	D051843	D051844	CHEM	CHEM	Synucleins - alpha-Synuclein
1955	D003342	D017072	ANAT	ANAT	Corpus striatum structure - Neostriatum
1976	D012097	D014157	CHEM	CHEM	Repressor Proteins - TRANSCRIPTION FACTOR
1954	D018797	D029721	CHEM	CHEM	Cell Cycle Proteins - Drosophila Proteins
1994	D020169	D053148	CHEM	CHEM	caspase - caspase-3

Filtering by conditional probability

Same relations as above with their conditional probabilities:

kable(head(filtered[order(-trend),.(year,c1,relation,prob.C1GivenC2,prob.C2GivenC1)],12))

year	c1	relation	prob.C1GivenC2	prob.C2GivenC1
1986	D017354	Point Mutation - Oligonucleotide Primers	0.5000000	1.0000000
1981	D018836	Inflammation Mediators - Endothelial Cells	1.0000000	1.0000000
1968	D001402	B-Lymphocytes - T-Lymphocyte	0.4000000	1.0000000
1962	D009414	Nerve Growth Factors - nerve growth factor	0.8461538	0.5789474
1968	D007155	Biological Response Modifiers - Autoimmune Response	0.3333333	1.0000000
1974	D012097	Repressor Proteins - Drosophila Proteins	0.5000000	1.0000000
1959	D018797	Cell Cycle Proteins - Adaptor Proteins, Signal Transducing	1.0000000	0.3333333
1996	D051843	Synucleins - alpha-Synuclein	1.0000000	0.6666667
1955	D003342	Corpus striatum structure - Neostriatum	1.0000000	0.3684211
1976	D012097	Repressor Proteins - TRANSCRIPTION FACTOR	0.3611111	0.7428571
1954	D018797	Cell Cycle Proteins - Drosophila Proteins	0.6666667	1.0000000
1994	D020169	caspase - caspase-3	1.0000000	0.6000000

Relations with a least one High conditional probability are often trivial, since one of the concepts almost always appears with the other one.

The following function filters out relations which have at least one conditional probability higher than the threshold. It also has an option to keep only the first surge for any relation.

filtered2<-filterCondProb(filtered,conditional.threshold = 0.6,onlyFirstSurge = TRUE)
kable(head(filtered2[order(-trend),.(year,c1,relation,trend)],12))

year	c1	relation	trend
1971	D016501	Neurites - nerve growth factor	0.2500000
1953	D000959	Antihypertensive Agents - reserpine	0.2051585
1969	D002795	Choline O-Acetyltransferase - Cholinergic Fibers	0.1666667
1961	D008562	Membrane Glycoproteins - Actin-Binding Protein	0.1666667
1978	D018836	Inflammation Mediators - Endothelial Cells	0.1666667
1981	D053318	Apolipoprotein E3 - Apolipoprotein E4	0.1666667
1962	D001379	azathioprine - Immunosuppressive Agents	0.1322751
1973	D003924	Diabetes Mellitus, Non-Insulin-Dependent - treatment failure	0.1250000
1972	D004268	DNA Helix Destabilizing Proteins - Nuclear Proteins	0.1111111
1972	D004268	DNA Helix Destabilizing Proteins - Repressor Proteins	0.1111111
1972	D009687	Nuclear Proteins - Repressor Proteins	0.1111111
1954	D011956	Receptors, Cell Surface - Drosophila Proteins	0.1111111

Bottom surges

kable(head(filtered2[order(-trend),.(year,c1,relation,trend)],12))

year	c1	relation	trend
1971	D016501	Neurites - nerve growth factor	0.2500000
1953	D000959	Antihypertensive Agents - reserpine	0.2051585
1969	D002795	Choline O-Acetyltransferase - Cholinergic Fibers	0.1666667
1961	D008562	Membrane Glycoproteins - Actin-Binding Protein	0.1666667
1978	D018836	Inflammation Mediators - Endothelial Cells	0.1666667
1981	D053318	Apolipoprotein E3 - Apolipoprotein E4	0.1666667
1962	D001379	azathioprine - Immunosuppressive Agents	0.1322751
1973	D003924	Diabetes Mellitus, Non-Insulin-Dependent - treatment failure	0.1250000
1972	D004268	DNA Helix Destabilizing Proteins - Nuclear Proteins	0.1111111
1972	D004268	DNA Helix Destabilizing Proteins - Repressor Proteins	0.1111111
1972	D009687	Nuclear Proteins - Repressor Proteins	0.1111111
1954	D011956	Receptors, Cell Surface - Drosophila Proteins	0.1111111

Comparison against the baseline

This part requires the full Medline data (not only the ND subset):

data.suffix <- '.min100'

surges.scp <- loadSurgesData(dir=dataPathOutput,suffix=data.suffix, measure='scp',indicator='diff',ma_window=5)

## [1] "data/output/scp.diff.5.min100.tsv"

dynamic_joint <- loadDynamicData(dir=dataPathInput,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dir=dataPathInput,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dir=dataPathInput)
static_data <- loadStaticData(dir=dataPathInputStatic,suffix=data.suffix)

Generating output relations to annotate

Important: the data used in the code was generated from the full input set of relations, i.e. with option data.suffix set to '.min100'.

The following code computes the full list of relations, similarly to the original time sliced evaluation method. This will be used as baseline.

relationsDT <- computeMovingAverage(dynamic_joint,dynamic_total, window=1)
indivDT <- computeMovingAverage(dynamic_indiv,dynamic_total, window=1)
allrels<-addDynamicAssociationToRelations(relationsDT,indivDT,measures = c('prob.joint'))
allrels<-allrels[freq.joint>0,]

The code below applies the same preprocessing steps to both the SCP relations and the baseline relations:

set.seed(2022) 
scp<-prepareCmpBaseline(surges.scp, static_data)

## [1] "size relsDT: 315418"
## [1] "size static: 1809993"
## [1] "after filter condi + FIRSTsurge + years:  18343"
## [1] "after filter groups:  9092"

scp <- scp[order(-trend),]
baseline<-prepareCmpBaseline(allrels,static_data)

## [1] "size relsDT: 65370033"
## [1] "size static: 1809993"
## [1] "after filter condi + FIRSTsurge + years:  497070"
## [1] "after filter groups:  108794"

We now create the dataset to annotate, containing both some SCP relations and some baseline relations:

x1/y1 contain a random subset of 100 relations for both methods
x2/y2 contain the top 100 relations for both methods

The 4 subsets are then concatenated and shuffled.

baseline<-head(baseline[order(-freq.joint),],nrow(scp))
x1<-scp[sample(.N,100),.(c1,c2,term.c1,term.c2,group.c1,group.c2,year)]
y1<-baseline[sample(.N,100), .(c1,c2,term.c1,term.c2,group.c1,group.c2,year)]
x2<-head(scp[,.(c1,c2,term.c1,term.c2,group.c1,group.c2,year)],100)
y2<-head(baseline[, .(c1,c2,term.c1,term.c2,group.c1,group.c2,year)],100)
res <- rbind(x1,y1,x2,y2)
res <- res[sample(.N),]
fwrite(res,'to-annotate.tsv',sep='\t')

Evaluation results

res <- eval.baseline( x1, y1, x2, y2, annotFile='data/annotated-relations.tsv')

##     method precision
## 1 baseline 0.2985075
## 2      scr 0.4430380
##           
##            no yes
##   baseline 47  20
##   scr      44  35
## [1] "chisq.test p-val= 0.104261308387589"
##     method precision
## 1 baseline 0.1785714
## 2      scr 0.3827160
##           
##            no yes
##   baseline 69  15
##   scr      50  31
## [1] "chisq.test p-val= 0.00596135278801524"

kable(res)

method	subset	answer	number
scr	rnd	yes	35
scr	top	yes	31
baseline	rnd	yes	20
baseline	top	yes	15
scr	rnd	no	44
scr	top	no	50
baseline	rnd	no	47
baseline	top	no	69
scr	rnd	maybe	21
scr	top	maybe	19
baseline	rnd	maybe	33
baseline	top	maybe	16

Mining Impactful Discoveries 2: Analysis

Erwan Moreau

November 2021