This document is part of the documentation for the Medline Discoveries repository, which is the companion code for the paper “Mining impactful discoveries from the biomedical literature” TODO link.
This document was generated from an R Markdown source file. The source file is provided in the repository, allowing full reproducibility of the experiments presented in the paper. It can be executed through the RStudio interface (“knit” button) or as follows:
rmarkdown::render('2-analysis.Rmd')
The .Rmd
source document can be configured by modifying
the following lines:
# directory for the input data (Medline frequency data)
dataPathInput <- 'data/input'
dataPathInputStatic <- paste(dataPathInput,'static',sep='/')
# directory for the output data (extracted surges from the main process in step 1)
dataPathOutput <- 'data/output'
# by default using the Medline ND subset:
# the static suffix is different from the dynamic one for some reason (sorry for that)
data.suffix <- '.min100.ND'
# uncomment the following line in order to use the full Medline data instead (not tested with this Rmd!):
# data.suffix <- '.min100'
# set to TRUE for quick execution with a random subset of the data
shortVersion <- FALSE
# set to TRUE in order to save the graphs used in the paper
savePaperGraphs <- TRUE
Notes:
shortVersion
is set to
FALSE
, the process requires at least 128GB RAM.data.suffix <- '.min100'
) has not been tested with this
Rmd document.source('discoveries.R')
The software requirements are the same as for the first part
This part requires the same input data as in the first part, but also the output generated as a result of applying the process. Both datasets can be downloaded from https://zenodo.org/record/5888572.
By default the input data is expected in the subdirectory
data/input
and the output data (unsurprisingly) in
data/output
.
dynamic_joint <- loadDynamicData(dataPathInput,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dataPathInput,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dataPathInput)
static_data <- loadStaticData(dir=dataPathInputStatic,suffix=data.suffix)
als <- filterConcepts(dynamic_joint, 'D000690')
selectedALS <- filterConcepts(als,c('D051379','D000073885','D000870','D019782','D008875'))
parkinson <- filterConcepts(dynamic_joint, 'D010300')
caseParkinson <- filterConcepts(parkinson, 'D013378')
Note: in some of the graphs (trend distribution), the representation
of all the points is very computationally heavy, causing extremely long
duration even to just load the pdf file. This is why a random subset of
relations is used even in the case where shortVersion
is
set to FALSE
.
if (shortVersion) {
dynamic_joint<-pickRandomDynamic(dynamic_joint,n=1000)
dynamic_joint.rndsubset<-dynamic_joint
} else {
dynamic_joint.rndsubset<-pickRandomDynamic(dynamic_joint,n=20000)
}
## [1] "selecting among n pairs = 290797"
In order to avoid cases with insufficient data, it is possible to filter out rows with less than some minimum frequency \(m\) (the relation must have at least one year with a frequency higher than \(m\)). It is also possible to specify the number of cases to pick:
three_relations<-pickRandomDynamic(dynamic_joint,n=3, minFreqYear = 200)
## [1] "selecting among n pairs = 25266"
Using the relations selected randomly above:
displaySeveralPairsData(three_relations,static_data,dynamic_total)
## Warning: Removed 51 rows containing missing values (position_stack).
## Warning: Removed 5 rows containing missing values (geom_col).
## Warning: Removed 2 rows containing missing values (geom_path).
g<-displaySeveralPairsData(selectedALS,static_data,dynamic_total,excludeConceptsFromName = 'D000690',ncol=5)
g
## Warning: Removed 94 rows containing missing values (position_stack).
## Warning: Removed 8 rows containing missing values (geom_col).
## Warning: Removed 2 rows containing missing values (geom_path).
ggsave('expl1.pdf',g)
## Saving 10 x 4 in image
## Warning: Removed 94 rows containing missing values (position_stack).
## Warning: Removed 8 rows containing missing values (geom_col).
## Warning: Removed 2 rows containing missing values (geom_path).
g <- displayMultiMeasure(selectedALS,dynamic_indiv,dynamic_total,static_data,windows=c(1,3,5),measures=c('prob.joint','pmi','npmi','mi','nmi','scp'),excludeConceptsFromName = 'D000690')
g
## Warning: Removed 6 rows containing missing values (geom_path).
ggsave('expl-measures.pdf',g+ theme(text=element_text(size=14),legend.position="none",axis.title.x=element_blank(),axis.title.y=element_blank()),width=20,height=16,units = "cm")
## Warning: Removed 6 rows containing missing values (geom_path).
displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='rate',measure = 'prob.joint',excludeConceptsFromName = 'D000690')
## Warning: Removed 182 rows containing missing values (position_stack).
## Warning: Removed 16 rows containing missing values (geom_col).
displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='rate',measure = 'nmi',excludeConceptsFromName = 'D000690')
## Warning: Removed 182 rows containing missing values (position_stack).
## Warning: Removed 16 rows containing missing values (geom_col).
displayMultiTrend(selectedALS,dynamic_indiv,dynamic_total,static_data,indicator='diff',measure = 'nmi',excludeConceptsFromName = 'D000690')
## Warning: Removed 182 rows containing missing values (position_stack).
## Warning: Removed 16 rows containing missing values (geom_col).
This graph shows the surge year(s) in the plot of the probability across years (this is not very useful actually).
g <- displaySurges(caseParkinson,dynamic_indiv,dynamic_total,static_data,windows=c(1,3), indicators=c('rate','diff'),withTitle = TRUE)
g
## Warning: Removed 4 rows containing missing values (position_stack).
g <- displaySurges(caseParkinson,dynamic_indiv,dynamic_total,static_data,valueCol='nmi',windows=c(1,3,5), indicators=c('rate','diff'),withTitle = TRUE)
g
## Warning: Removed 12 rows containing missing values (position_stack).
# saving the graph as pdf:
# ggsave('expl1.pdf',g)
relations.ma <- computeMovingAverage(dynamic_joint.rndsubset,dynamic_total, window=5)
indiv.ma <- computeMovingAverage(dynamic_indiv,dynamic_total, window=5)
The following graphs show different versions of the histogram of the
trend
values: with/without logarithmic scale on the X
and/or Y axis.
The vertical lines show the outlier threshold calculated as \(Q_3+k \times IQR\), where \(k\) is either 1.5 or 3 (“far outliers”).
l<-displayTrendDistribution1(relations.ma,indiv.ma)
l$x0.y0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1290146 rows containing non-finite values (stat_bin).
l$x0.y1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1290146 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 137 rows containing missing values (geom_bar).
l$x1.y0
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5301922 rows containing non-finite values (stat_bin).
l$x1.y1
## Warning in self$trans$transform(x): NaNs produced
## Warning in self$trans$transform(x): Transformation introduced infinite values in
## continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5301922 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 14 rows containing missing values (geom_bar).
These graphs show the “flat” distribution (probably not a standard name):
trend
and their relative rank
(rank divided by number of points) is plotted as X. In other words, the
X axis represents all the quantile positions with respect to
trend
.trend
value shown on the Y axis is normalized
between [0,1] for every measure.Note that the random selection can affect the graph significantly since the extreme points are rare.
#displayTrendDistribution2(relations.ma,indiv.ma,withRect = FALSE)
g<-displayTwoPlotsDistribTrend(relations.ma,indiv.ma)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 36 rows containing missing values (geom_bar).
g
Note: the pdf
format makes the graph take a very long
time to load, this is why png
is recommended for this
graph.
ggsave('expl-trend-distrib.pdf',g,width=20,height=8,units = "cm")
ggsave('expl-trend-distrib.png',g,width=20,height=8,units = "cm")
Below the gold-standard dataset presented in the paper is loaded and displayed:
gold <- loadGoldDiscoveries(dir='data',file='ND-discoveries-year.tsv')
kable(gold)
c1 | c2 | term.c1 | group.c1 | term.c2 | group.c2 | year |
---|---|---|---|---|---|---|
D002894 | D006816 | Chromosomes, Human, Pair 4 | NA | Huntington Disease | DISO | 1983 |
D000071617 | D010300 | Protein Deglycase DJ-1 | CHEM | Parkinson Disease | DISO | 2003 |
D000690 | D057180 | Amyotrophic Lateral Sclerosis | DISO | Frontotemporal dementia | DISO | 2009 |
D000073885 | D057180 | C9orf72 Protein | CHEM | Frontotemporal dementia | DISO | 2009 |
D000073885 | D000690 | C9orf72 Protein | CHEM | Amyotrophic Lateral Sclerosis | DISO | 2009 |
D000072105 | D000690 | superoxide dismutase 1 | CHEM | Amyotrophic Lateral Sclerosis | DISO | 1993 |
D000068836 | D000544 | rivastigmine | CHEM | Alzheimer’s Disease | DISO | 1997 |
D007980 | D010300 | levodopa | CHEM | Parkinson Disease | DISO | 1961 |
D007980 | D020734 | levodopa | CHEM | Parkinsonian Disorders | DISO | 1961 |
D000544 | D001714 | Alzheimer’s Disease | DISO | Bipolar Disorder | DISO | 2019 |
D004298 | D010300 | dopamine | CHEM | Parkinson Disease | DISO | 1971 |
D004298 | D012559 | dopamine | CHEM | Schizophrenia | DISO | 1977 |
if (shortVersion) {
my.measures <- c('prob.joint','pmi','nmi','scp')
my.ma <- c(1,5)
my.indicators <- 'diff'
} else {
my.measures <- c('prob.joint','pmi','npmi','mi','nmi','scp')
my.ma <- c(1,3,5)
my.indicators <- c('diff',"rate")
}
data.eval <- collectEvalDataSurgesAgainstGold(gold, dir=dataPathOutput,suffix=data.suffix,evalAt=NA,ma_windows=my.ma,measures=my.measures,indicators = my.indicators)
## [1] "data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.1.min100.ND.tsv"
## [1] "data/output/pmi.diff.1.min100.ND.tsv"
## [1] "data/output/pmi.rate.1.min100.ND.tsv"
## [1] "data/output/npmi.diff.1.min100.ND.tsv"
## [1] "data/output/npmi.rate.1.min100.ND.tsv"
## [1] "data/output/mi.diff.1.min100.ND.tsv"
## [1] "data/output/mi.rate.1.min100.ND.tsv"
## [1] "data/output/nmi.diff.1.min100.ND.tsv"
## [1] "data/output/nmi.rate.1.min100.ND.tsv"
## [1] "data/output/scp.diff.1.min100.ND.tsv"
## [1] "data/output/scp.rate.1.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.3.min100.ND.tsv"
## [1] "data/output/pmi.diff.3.min100.ND.tsv"
## [1] "data/output/pmi.rate.3.min100.ND.tsv"
## [1] "data/output/npmi.diff.3.min100.ND.tsv"
## [1] "data/output/npmi.rate.3.min100.ND.tsv"
## [1] "data/output/mi.diff.3.min100.ND.tsv"
## [1] "data/output/mi.rate.3.min100.ND.tsv"
## [1] "data/output/nmi.diff.3.min100.ND.tsv"
## [1] "data/output/nmi.rate.3.min100.ND.tsv"
## [1] "data/output/scp.diff.3.min100.ND.tsv"
## [1] "data/output/scp.rate.3.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "data/output/prob.joint.rate.5.min100.ND.tsv"
## [1] "data/output/pmi.diff.5.min100.ND.tsv"
## [1] "data/output/pmi.rate.5.min100.ND.tsv"
## [1] "data/output/npmi.diff.5.min100.ND.tsv"
## [1] "data/output/npmi.rate.5.min100.ND.tsv"
## [1] "data/output/mi.diff.5.min100.ND.tsv"
## [1] "data/output/mi.rate.5.min100.ND.tsv"
## [1] "data/output/nmi.diff.5.min100.ND.tsv"
## [1] "data/output/nmi.rate.5.min100.ND.tsv"
## [1] "data/output/scp.diff.5.min100.ND.tsv"
## [1] "data/output/scp.rate.5.min100.ND.tsv"
res.eval<-evalSurgessAgainstGold(data.eval,eval_windows = c(1,3,5))
eval3 <- res.eval[eval.window==3 & mode=='first.year',]
kable(head(eval3[order(-perf),],12))
ma.window | measure | indicator | mode | eval.at | perf | eval.window |
---|---|---|---|---|---|---|
5 | scp | diff | first.year | NA | 0.6666667 | 3 |
1 | nmi | diff | first.year | NA | 0.5833333 | 3 |
1 | scp | diff | first.year | NA | 0.5833333 | 3 |
3 | nmi | diff | first.year | NA | 0.5833333 | 3 |
3 | scp | diff | first.year | NA | 0.5833333 | 3 |
5 | nmi | diff | first.year | NA | 0.5833333 | 3 |
5 | npmi | diff | first.year | NA | 0.5833333 | 3 |
3 | npmi | diff | first.year | NA | 0.5000000 | 3 |
5 | prob.joint | rate | first.year | NA | 0.4166667 | 3 |
1 | npmi | diff | first.year | NA | 0.2500000 | 3 |
3 | prob.joint | rate | first.year | NA | 0.2500000 | 3 |
1 | prob.joint | rate | first.year | NA | 0.1666667 | 3 |
g<-displayAvgPerfByParam(res.eval,evalWindow = 3)
g
ggsave('plot-params.pdf',g,width = 20,height=8,unit='cm')
The overlap coefficient is calculated based on the number of relations in common between the two configurations, where “in common” means that the relation appears in both lists with their surge year within 5 years of each other.
surges<-readMultipleSurgesFiles(measures=my.measures,indicator='diff',ma_windows = my.ma)
## [1] "data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "data/output/pmi.diff.1.min100.ND.tsv"
## [1] "data/output/npmi.diff.1.min100.ND.tsv"
## [1] "data/output/mi.diff.1.min100.ND.tsv"
## [1] "data/output/nmi.diff.1.min100.ND.tsv"
## [1] "data/output/scp.diff.1.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "data/output/pmi.diff.3.min100.ND.tsv"
## [1] "data/output/npmi.diff.3.min100.ND.tsv"
## [1] "data/output/mi.diff.3.min100.ND.tsv"
## [1] "data/output/nmi.diff.3.min100.ND.tsv"
## [1] "data/output/scp.diff.3.min100.ND.tsv"
## [1] "data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "data/output/pmi.diff.5.min100.ND.tsv"
## [1] "data/output/npmi.diff.5.min100.ND.tsv"
## [1] "data/output/mi.diff.5.min100.ND.tsv"
## [1] "data/output/nmi.diff.5.min100.ND.tsv"
## [1] "data/output/scp.diff.5.min100.ND.tsv"
mat<-correlationMatrixMultiParam(surges, 5)
## 1: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 2: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 3: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 4: 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 5: 6 7 8 9 10 11 12 13 14 15 16 17 18
## 6: 7 8 9 10 11 12 13 14 15 16 17 18
## 7: 8 9 10 11 12 13 14 15 16 17 18
## 8: 9 10 11 12 13 14 15 16 17 18
## 9: 10 11 12 13 14 15 16 17 18
## 10: 11 12 13 14 15 16 17 18
## 11: 12 13 14 15 16 17 18
## 12: 13 14 15 16 17 18
## 13: 14 15 16 17 18
## 14: 15 16 17 18
## 15: 16 17 18
## 16: 17 18
## 17: 18
g<-heatMapCommonMatrix(mat)
g
ggsave('heatmap-overlap.pdf',g,width=20,height=15,unit='cm')
surges.scp <-loadSurgesData(dir=dataPathOutput,suffix = data.suffix,ma_window = 5,measure='scp',indicator = 'diff')
## [1] "data/output/scp.diff.5.min100.ND.tsv"
surges.scp[,first.surge:=(year==min(year)),by=key(surges.scp)]
ggplot(surges.scp,aes(duration.joint,fill=first.surge))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(surges.scp,aes(as.factor(year),trend))+geom_boxplot()+ylim(c(0,.02))+theme(axis.text.x=element_text(angle = -90, hjust = 0))
## Warning: Removed 2602 rows containing non-finite values (stat_boxplot).
doublePlotAcrossTime(dynamic_joint, surges.scp,bins=68,fontsize=16)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 4 rows containing missing values (geom_bar).
first.surges.scp <- surges.scp[first.surge==TRUE,]
nrow(first.surges.scp)
## [1] 7514
g3 <- plotFirstOccIndivJoint(first.surges.scp)
g3
Statistics related to time:
print('average duration between concepts exist and first surge:')
## [1] "average duration between concepts exist and first surge:"
first.surges.scp[,mean(year-year.first.both)]
## [1] 17.88635
print('average duration between concepts exist and first cooccurrence:')
## [1] "average duration between concepts exist and first cooccurrence:"
first.surges.scp[,mean(year.first.joint-year.first.both)]
## [1] 10.30516
print('average duration between first cooccurrence and first surge:')
## [1] "average duration between first cooccurrence and first surge:"
first.surges.scp[,mean(year-year.first.joint)]
## [1] 7.581182
print('distribution (quantiles) for time before cooccurrence:')
## [1] "distribution (quantiles) for time before cooccurrence:"
quantile(first.surges.scp[,year.first.joint-year.first.both],probs=seq(0,1,.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0 0 0 2 4 7 11 14 18 26 61
print('distribution (quantiles) for time before first surge:')
## [1] "distribution (quantiles) for time before first surge:"
quantile(first.surges.scp[,year-year.first.both],probs=seq(0,1,.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## -1 2 5 9 12 15 18 22 29 40 67
print('proportion of surges happening the first year the concepts both exist:')
## [1] "proportion of surges happening the first year the concepts both exist:"
nrow(first.surges.scp[year <= year.first.both,])/nrow(first.surges.scp)
## [1] 0.05270162
print('proportion of cooccurrences happening the first year the concepts both exist:')
## [1] "proportion of cooccurrences happening the first year the concepts both exist:"
nrow(first.surges.scp[year.first.joint == year.first.both,])/nrow(first.surges.scp)
## [1] 0.2102742
print('proportion of cooccurrences happening at the same time as first cooccurrence:')
## [1] "proportion of cooccurrences happening at the same time as first cooccurrence:"
nrow(first.surges.scp[year == year.first.joint,])/nrow(first.surges.scp)
## [1] 0.07040192
ggsave('durations.pdf',g3,width=20,height=12,unit='cm')
g <- doublePlotAcrossTime(dynamic_joint, surges.scp,addThird = g3)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 4 rows containing missing values (geom_bar).
g
ggsave('surges-by-year2.pdf',g,width=20,height=20,unit='cm')
Here we observe how the joint frequency of the relation relates to the status as surge or not.
The first graph below shows the distribution of surges by frequency, i.e. how many relations with surge have a particular frequency \(x\).
The number of surges decreases when the frequency increases, but we can see below that this an effect of the higher number of relations with low frequency.
d<-surges.scp
d[,first.surge:=(year==min(year)),by=key(d)]
d<-d[first.surge==TRUE,]
x<-merge(d, static_data,suffixes=c('.dynamic','.static'))
ggplot(x,aes(freq.joint.static))+geom_histogram()+xlim(c(0,100000))+scale_y_log10(labels = scales::comma_format(accuracy=1,big.mark = ",", decimal.mark = "."))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 268 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
The following graph shows the number of relations with a surge among all the relations. Note that the Y axis is logarithmic, this makes the proportion of surges look higher. For instance the first bin contains a few thousands surges among 100 millions relations.
On this graph it can be observed that the proportion of surges is roughly stable with respect to frequency.
g1 <- surgesAmongRelationsByFreq(surges.scp,static_data)
g1
## Warning: Removed 498 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 10 rows containing missing values (geom_bar).
The following graph shows the proportion of relations with surges vs. non surges for every bin. It should be noted that by definition, this graph doesn’t represent the proportion of a bin among the whole data: the first bins (low frequency) contain a lot more relations than the others, as seen above.
The X axis is cut at 10k in order to show the increase of the proportion of surges as frequency increases.
g2 <- surgesAmongRelationsByFreq(surges.scp,static_data,asProportion = TRUE,freqmax = 10000,logY = FALSE)
g2
## Warning: Removed 8040 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
g<-plot_grid(g1,g2,labels=NULL,ncol=2)
## Warning: Removed 498 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 10 rows containing missing values (geom_bar).
## Warning: Removed 8040 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
ggsave('surges-by-freq.pdf',g,width=20,height=8,unit='cm')
Here we simply observe the top and bottom surges under various conditions, as described in the paper.
surges.scp[,first.surge:=(year==min(year)),by=key(surges.scp)]
surges.scp.terms <- addRelationName(surges.scp,static_data)
These are the top 12 relations by trend without any filtering:
kable(head(surges.scp.terms[order(-trend),.(year,c1,c2,group.c1,group.c2,relation)],12))
year | c1 | c2 | group.c1 | group.c2 | relation |
---|---|---|---|---|---|
1967 | D007980 | D015259 | CHEM | CHEM PHYS | levodopa - Dopaminergic Agents |
1968 | D017433 | D017434 | CONC | CONC | Protein Structure, Secondary - Tertiary Protein Structure |
1984 | D017510 | D057927 | PHYS | PHEN CONC PHYS | Protein Folding, Globular - Hydrophilicity |
1971 | D008279 | D014057 | PROC CONC | PROC | Magnetic Resonance Imaging - Cine-CT |
1986 | D020033 | D054875 | CHEM | PHYS | Protein Isoforms - Ubiquitination |
1976 | D011994 | D049109 | CHEM | PHYS | Recombinant Proteins - Cell Proliferation |
1970 | D015999 | D016000 | CONC | PROC | Multivariate Analysis - statistical cluster |
1986 | D017354 | D017931 | DISO | CHEM | Point Mutation - Oligonucleotide Primers |
1983 | D017510 | D018360 | PHYS | PROC | Protein Folding, Globular - Crystallography, X-Ray |
1981 | D017774 | D056547 | PHYS | ANAT | Long-Term Potentiation - CA1 field |
1982 | D017774 | D056547 | PHYS | ANAT | Long-Term Potentiation - CA1 field |
1981 | D018836 | D042783 | CHEM | ANAT | Inflammation Mediators - Endothelial Cells |
The flitering by semantic group can be done as follows. The UMLS semantic groups are used (the abbreviated name, first column).
filtered <- selectRelationsGroups(surges.scp.terms, groups1=c('CONC'), groups2=c('DISO','LIVB','ANAT'))
By default both selected groups are ‘DISO’,‘CHEM’,‘GENE’,‘ANAT’:
filtered <- selectRelationsGroups(surges.scp.terms)
Top 12 after filtering:
kable(head(filtered[order(-trend),.(year,c1,c2,group.c1,group.c2,relation)],12))
year | c1 | c2 | group.c1 | group.c2 | relation |
---|---|---|---|---|---|
1986 | D017354 | D017931 | DISO | CHEM | Point Mutation - Oligonucleotide Primers |
1981 | D018836 | D042783 | CHEM | ANAT | Inflammation Mediators - Endothelial Cells |
1968 | D001402 | D013601 | ANAT | ANAT | B-Lymphocytes - T-Lymphocyte |
1962 | D009414 | D020932 | CHEM | CHEM | Nerve Growth Factors - nerve growth factor |
1968 | D007155 | D015551 | CHEM | DISO | Biological Response Modifiers - Autoimmune Response |
1974 | D012097 | D029721 | CHEM | CHEM | Repressor Proteins - Drosophila Proteins |
1959 | D018797 | D048868 | CHEM | CHEM | Cell Cycle Proteins - Adaptor Proteins, Signal Transducing |
1996 | D051843 | D051844 | CHEM | CHEM | Synucleins - alpha-Synuclein |
1955 | D003342 | D017072 | ANAT | ANAT | Corpus striatum structure - Neostriatum |
1976 | D012097 | D014157 | CHEM | CHEM | Repressor Proteins - TRANSCRIPTION FACTOR |
1954 | D018797 | D029721 | CHEM | CHEM | Cell Cycle Proteins - Drosophila Proteins |
1994 | D020169 | D053148 | CHEM | CHEM | caspase - caspase-3 |
Same relations as above with their conditional probabilities:
kable(head(filtered[order(-trend),.(year,c1,relation,prob.C1GivenC2,prob.C2GivenC1)],12))
year | c1 | relation | prob.C1GivenC2 | prob.C2GivenC1 |
---|---|---|---|---|
1986 | D017354 | Point Mutation - Oligonucleotide Primers | 0.5000000 | 1.0000000 |
1981 | D018836 | Inflammation Mediators - Endothelial Cells | 1.0000000 | 1.0000000 |
1968 | D001402 | B-Lymphocytes - T-Lymphocyte | 0.4000000 | 1.0000000 |
1962 | D009414 | Nerve Growth Factors - nerve growth factor | 0.8461538 | 0.5789474 |
1968 | D007155 | Biological Response Modifiers - Autoimmune Response | 0.3333333 | 1.0000000 |
1974 | D012097 | Repressor Proteins - Drosophila Proteins | 0.5000000 | 1.0000000 |
1959 | D018797 | Cell Cycle Proteins - Adaptor Proteins, Signal Transducing | 1.0000000 | 0.3333333 |
1996 | D051843 | Synucleins - alpha-Synuclein | 1.0000000 | 0.6666667 |
1955 | D003342 | Corpus striatum structure - Neostriatum | 1.0000000 | 0.3684211 |
1976 | D012097 | Repressor Proteins - TRANSCRIPTION FACTOR | 0.3611111 | 0.7428571 |
1954 | D018797 | Cell Cycle Proteins - Drosophila Proteins | 0.6666667 | 1.0000000 |
1994 | D020169 | caspase - caspase-3 | 1.0000000 | 0.6000000 |
Relations with a least one High conditional probability are often trivial, since one of the concepts almost always appears with the other one.
The following function filters out relations which have at least one conditional probability higher than the threshold. It also has an option to keep only the first surge for any relation.
filtered2<-filterCondProb(filtered,conditional.threshold = 0.6,onlyFirstSurge = TRUE)
kable(head(filtered2[order(-trend),.(year,c1,relation,trend)],12))
year | c1 | relation | trend |
---|---|---|---|
1971 | D016501 | Neurites - nerve growth factor | 0.2500000 |
1953 | D000959 | Antihypertensive Agents - reserpine | 0.2051585 |
1969 | D002795 | Choline O-Acetyltransferase - Cholinergic Fibers | 0.1666667 |
1961 | D008562 | Membrane Glycoproteins - Actin-Binding Protein | 0.1666667 |
1978 | D018836 | Inflammation Mediators - Endothelial Cells | 0.1666667 |
1981 | D053318 | Apolipoprotein E3 - Apolipoprotein E4 | 0.1666667 |
1962 | D001379 | azathioprine - Immunosuppressive Agents | 0.1322751 |
1973 | D003924 | Diabetes Mellitus, Non-Insulin-Dependent - treatment failure | 0.1250000 |
1972 | D004268 | DNA Helix Destabilizing Proteins - Nuclear Proteins | 0.1111111 |
1972 | D004268 | DNA Helix Destabilizing Proteins - Repressor Proteins | 0.1111111 |
1972 | D009687 | Nuclear Proteins - Repressor Proteins | 0.1111111 |
1954 | D011956 | Receptors, Cell Surface - Drosophila Proteins | 0.1111111 |
kable(head(filtered2[order(-trend),.(year,c1,relation,trend)],12))
year | c1 | relation | trend |
---|---|---|---|
1971 | D016501 | Neurites - nerve growth factor | 0.2500000 |
1953 | D000959 | Antihypertensive Agents - reserpine | 0.2051585 |
1969 | D002795 | Choline O-Acetyltransferase - Cholinergic Fibers | 0.1666667 |
1961 | D008562 | Membrane Glycoproteins - Actin-Binding Protein | 0.1666667 |
1978 | D018836 | Inflammation Mediators - Endothelial Cells | 0.1666667 |
1981 | D053318 | Apolipoprotein E3 - Apolipoprotein E4 | 0.1666667 |
1962 | D001379 | azathioprine - Immunosuppressive Agents | 0.1322751 |
1973 | D003924 | Diabetes Mellitus, Non-Insulin-Dependent - treatment failure | 0.1250000 |
1972 | D004268 | DNA Helix Destabilizing Proteins - Nuclear Proteins | 0.1111111 |
1972 | D004268 | DNA Helix Destabilizing Proteins - Repressor Proteins | 0.1111111 |
1972 | D009687 | Nuclear Proteins - Repressor Proteins | 0.1111111 |
1954 | D011956 | Receptors, Cell Surface - Drosophila Proteins | 0.1111111 |
This part requires the full Medline data (not only the ND subset):
data.suffix <- '.min100'
surges.scp <- loadSurgesData(dir=dataPathOutput,suffix=data.suffix, measure='scp',indicator='diff',ma_window=5)
## [1] "data/output/scp.diff.5.min100.tsv"
dynamic_joint <- loadDynamicData(dir=dataPathInput,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dir=dataPathInput,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dir=dataPathInput)
static_data <- loadStaticData(dir=dataPathInputStatic,suffix=data.suffix)
Important: the data used in the code was generated from the
full input set of relations, i.e. with option data.suffix
set to '.min100'
.
The following code computes the full list of relations, similarly to the original time sliced evaluation method. This will be used as baseline.
relationsDT <- computeMovingAverage(dynamic_joint,dynamic_total, window=1)
indivDT <- computeMovingAverage(dynamic_indiv,dynamic_total, window=1)
allrels<-addDynamicAssociationToRelations(relationsDT,indivDT,measures = c('prob.joint'))
allrels<-allrels[freq.joint>0,]
The code below applies the same preprocessing steps to both the SCP relations and the baseline relations:
set.seed(2022)
scp<-prepareCmpBaseline(surges.scp, static_data)
## [1] "size relsDT: 315418"
## [1] "size static: 1809993"
## [1] "after filter condi + FIRSTsurge + years: 18343"
## [1] "after filter groups: 9092"
scp <- scp[order(-trend),]
baseline<-prepareCmpBaseline(allrels,static_data)
## [1] "size relsDT: 65370033"
## [1] "size static: 1809993"
## [1] "after filter condi + FIRSTsurge + years: 497070"
## [1] "after filter groups: 108794"
We now create the dataset to annotate, containing both some SCP relations and some baseline relations:
The 4 subsets are then concatenated and shuffled.
baseline<-head(baseline[order(-freq.joint),],nrow(scp))
x1<-scp[sample(.N,100),.(c1,c2,term.c1,term.c2,group.c1,group.c2,year)]
y1<-baseline[sample(.N,100), .(c1,c2,term.c1,term.c2,group.c1,group.c2,year)]
x2<-head(scp[,.(c1,c2,term.c1,term.c2,group.c1,group.c2,year)],100)
y2<-head(baseline[, .(c1,c2,term.c1,term.c2,group.c1,group.c2,year)],100)
res <- rbind(x1,y1,x2,y2)
res <- res[sample(.N),]
fwrite(res,'to-annotate.tsv',sep='\t')
res <- eval.baseline( x1, y1, x2, y2, annotFile='data/annotated-relations.tsv')
## method precision
## 1 baseline 0.2985075
## 2 scr 0.4430380
##
## no yes
## baseline 47 20
## scr 44 35
## [1] "chisq.test p-val= 0.104261308387589"
## method precision
## 1 baseline 0.1785714
## 2 scr 0.3827160
##
## no yes
## baseline 69 15
## scr 50 31
## [1] "chisq.test p-val= 0.00596135278801524"
kable(res)
method | subset | answer | number |
---|---|---|---|
scr | rnd | yes | 35 |
scr | top | yes | 31 |
baseline | rnd | yes | 20 |
baseline | top | yes | 15 |
scr | rnd | no | 44 |
scr | top | no | 50 |
baseline | rnd | no | 47 |
baseline | top | no | 69 |
scr | rnd | maybe | 21 |
scr | top | maybe | 19 |
baseline | rnd | maybe | 33 |
baseline | top | maybe | 16 |