Main process
Note: the full process is described in detail below. Readers interested only in the simplest way to run it can skip to the last section, “Processing multiple parameters and saving surges data”.
Initialization
The code can be loaded in the R interpreter with:
source('discoveries.R')
Loading the input data
dynamic_joint <- loadDynamicData(dir=dataPath,suffix=data.suffix, indivOrJoint = 'joint')
dynamic_indiv <- loadDynamicData(dir=dataPath,suffix=data.suffix, indivOrJoint = 'indiv')
dynamic_total <- loadDynamicTotalFile(dir=dataPath)
The following can be used to select a random sample of relations:
dynamic_joint<-pickRandomDynamic(dynamic_joint,n=1000)
Statistics
- Number of concepts:
nrow(unique(dynamic_indiv,by=key(dynamic_indiv)))
## [1] 1803
- Number of relations:
nrow(unique(dynamic_joint,by=key(dynamic_joint)))
## [1] 290797
- Number of rows, i.e. pairs (relation,year):
nrow(dynamic_joint)
## [1] 10711862
Preprocessing
This step fills the gap years with 0 frequency values and calculates the moving average if needed.
- This step is necessary even if the moving average is not used (default window of size 1).
Example with a moving average over a window of size 5:
relations.ma <- computeMovingAverage(dynamic_joint,dynamic_total, window=5)
indiv.ma <- computeMovingAverage(dynamic_indiv,dynamic_total, window=5)
New size:
nrow(relations.ma)
## [1] 13709214
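The gap filling and smoothing can be illustrated on a single relation with a minimal base R sketch. The helper name fillAndSmooth is hypothetical, and the centered window is an assumption; the actual computeMovingAverage function also uses the yearly totals, which this sketch ignores.

```r
# Hypothetical helper (not part of discoveries.R): fill missing years with a
# frequency of 0, then smooth with a centered moving average (assumed layout).
fillAndSmooth <- function(years, freq, window = 1) {
  stopifnot(window %% 2 == 1, length(years) == length(freq))
  allYears <- seq(min(years), max(years))
  filled <- numeric(length(allYears))        # gap years default to 0
  filled[match(years, allYears)] <- freq
  # Positions without a full window on both sides are left as NA
  ma <- as.numeric(stats::filter(filled, rep(1 / window, window), sides = 2))
  data.frame(year = allYears, freq = filled, ma = ma)
}

# Years 2002 and 2003 are missing from the input and get frequency 0
res <- fillAndSmooth(years = c(2000, 2001, 2004), freq = c(2, 4, 6), window = 3)
print(res)
```

The output has one row per year in the range, which is why the number of rows grows after this step.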
Calculating a set of measures by year
This step calculates the measures used as a basis for calculating the trend (next step). One or several measures can be calculated.
Available measures:
- prob.joint is the simple joint probability (it is already calculated from the previous step but this step is still recommended for consistency).
- pmi and npmi: Pointwise Mutual Information and its normalized variant.
- mi and nmi: “binary” Mutual Information and its normalized variant. The events considered are simply based on whether each concept is present or not (hence the word “binary”).
- scp
- pmi2 and pmi3
rel.measures<-addDynamicAssociationToRelations(relations.ma,indiv.ma,measures = c('prob.joint','pmi','nmi'))
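For reference, the textbook definitions of some of these measures can be sketched as follows. The function names below are hypothetical, and the package may differ in details such as the logarithm base or the normalization used for nmi, which is not shown here.

```r
# Textbook definitions, written as hypothetical helpers (not the package's code).
# pJoint = P(A and B), pA = P(A), pB = P(B) for a given year.
pmi <- function(pJoint, pA, pB) log2(pJoint / (pA * pB))

# Normalized PMI: PMI divided by -log P(A and B), so the maximum is 1
npmi <- function(pJoint, pA, pB) pmi(pJoint, pA, pB) / -log2(pJoint)

# "Binary" mutual information over the four present/absent combinations
binaryMI <- function(pJoint, pA, pB) {
  cells <- c(pJoint, pA - pJoint, pB - pJoint, 1 - pA - pB + pJoint)
  indep <- c(pA * pB, pA * (1 - pB), (1 - pA) * pB, (1 - pA) * (1 - pB))
  keep <- cells > 0                          # 0 * log(0) is taken as 0
  sum(cells[keep] * log2(cells[keep] / indep[keep]))
}

pmi(0.25, 0.5, 0.5)       # independent concepts: 0
npmi(0.5, 0.5, 0.5)       # concepts that always cooccur: 1
binaryMI(0.25, 0.5, 0.5)  # independent concepts: 0
```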
Calculating the trend for every year and every relation
This step calculates the trend for every year and every relation.
- The input data table is modified in place for the sake of efficiency. It contains additional columns after executing the function, but the number of rows is not modified.
- The trend is calculated using the column provided with the argument measure and stored in a new column trend.
- The indicator argument determines how the trend value is calculated:
  - rate (default) is the relative rate: \(\frac{p_{y}-p_{y-1}}{p_{y-1}}\). Experimental results show that this indicator gives very poor results.
  - diff is the simple difference: \(p_{y}-p_{y-1}\)
Example:
computeTrend(rel.measures, indicator='diff', measure='nmi')
- Note: there is no need to store the output data table since the data table is modified by reference.
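The two indicators can be illustrated on the yearly values of a single relation. This is a minimal sketch of the formulas only; trendOf is a hypothetical name, and the sketch returns a new vector instead of modifying a data table by reference.

```r
# Hypothetical helper: trend of a yearly series p under both indicators.
trendOf <- function(p, indicator = c("rate", "diff")) {
  indicator <- match.arg(indicator)
  prev <- c(NA, head(p, -1))   # p_{y-1}; undefined for the first year
  d <- p - prev                # simple difference p_y - p_{y-1}
  if (indicator == "diff") d else d / prev
}

p <- c(0.125, 0.25, 0.25, 0.5)
trendOf(p, "diff")  # NA 0.125 0.000 0.250
trendOf(p, "rate")  # NA 1 0 1
```

Note that the relative rate is infinite or undefined whenever \(p_{y-1}=0\), whereas the difference is always defined.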
Detecting surges
For every relation, this step marks as surges the years where the trend value is higher than some threshold \(t\).
\(t\) can be a custom threshold or defined with one of the two proposed methods:
- method 1 uses the standard outlier threshold calculated with the inter-quartile range: \(t=Q_3 + 3 IQR\).
- method 2 (recommended) uses the inflection point in the quantile graph (see details in the paper).
The input data table is modified in place for the sake of efficiency.
# threshold <- calculateThresholdTopOutliers(rel.measures$trend)
threshold <- calculateThresholdInflectionPoint(rel.measures$trend)
print(threshold)
## [1] 0.003574462
detectSurges(rel.measures, globalThreshold=threshold)
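Method 1 can be sketched with base R quantile functions. The helper name outlierThreshold and the toy trend values are hypothetical, and the packaged calculateThresholdTopOutliers may handle details (quantile type, missing values) differently. The inflection point method is described in the paper and is not reproduced here.

```r
# Hypothetical helper: t = Q3 + k * IQR, with k = 3 as in method 1.
outlierThreshold <- function(x, k = 3) {
  q3 <- unname(quantile(x, 0.75, na.rm = TRUE))
  q3 + k * IQR(x, na.rm = TRUE)
}

# Toy trend values; the NA stands for a first year with no previous value
trend <- c(0, 0, 0.001, 0.001, 0.002, 0.002, 0.003, 0.05, NA)
thr <- outlierThreshold(trend)
surgeIdx <- which(trend > thr)   # positions marked as surges (NA is skipped)
print(thr)
print(surgeIdx)
```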
Calculate ‘first year’ information (optional)
- Given a relation between two concepts c1 and c2 which appear for the first time at years \(y_1\) and \(y_2\) respectively, the earliest possible year for a cooccurrence (and consequently for a surge) is \(\max(y_1 , y_2)\). This can be calculated and added to the data table (column year.first.both) as follows. The difference between the surge year and this year is added as column duration.
- The year of the first cooccurrence between the two concepts can be added as well as column year.first.joint. The difference between the surge year and this year is added as column duration.joint.
rel.measures<-calculateDiffYears(rel.measures,dynamic_indiv,dynamic_joint)
- Note: if the last argument dynamic_joint is not provided, only year.first.both is added.
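The arithmetic behind these columns is simple enough to show on one relation. The helper firstYearInfo and the example years are hypothetical; only the column names come from the description above.

```r
# Hypothetical helper: first-year columns for one or more surges.
# y1, y2: first year of each concept; yJoint: first cooccurrence year (optional).
firstYearInfo <- function(y1, y2, surgeYear, yJoint = NA) {
  yearFirstBoth <- pmax(y1, y2)   # earliest year where both concepts exist
  data.frame(
    year.first.both  = yearFirstBoth,
    duration         = surgeYear - yearFirstBoth,
    year.first.joint = yJoint,
    duration.joint   = surgeYear - yJoint
  )
}

info <- firstYearInfo(y1 = 1985, y2 = 1992, surgeYear = 1998, yJoint = 1995)
print(info)
```

In this example the surge occurs 6 years after both concepts exist and 3 years after their first cooccurrence.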
Adjusting the surge year in the sliding window (optional)
Using a moving average window (size higher than 1) can cause the surge year to be detected too early. This can lead to a meaningless result, in particular if the surge year has no cooccurrence at all. This step calculates the next non-zero year for every year and every relation. This “adjusted year” can be used to replace the surge year in some applications.
This step is optional.
rel.measures <- addNextNonZeroYear(rel.measures)
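The idea of the adjusted year can be sketched for a single relation as a backward scan over the years. The helper nextNonZeroYear is hypothetical, and it assumes that a year with a non-zero frequency maps to itself and that the years are sorted in increasing order; the packaged function may use a different convention.

```r
# Hypothetical helper: for each year, the first year >= it with non-zero frequency.
nextNonZeroYear <- function(years, freq) {
  out <- rep(NA_integer_, length(years))
  nxt <- NA_integer_
  for (i in rev(seq_along(years))) {   # scan from the last year backwards
    if (freq[i] > 0) nxt <- years[i]
    out[i] <- nxt
  }
  out
}

adjusted <- nextNonZeroYear(2000:2004, c(0, 0, 3, 0, 1))
print(adjusted)  # 2002 2002 2002 2004 2004
```

A surge detected in 2000 by the smoothed series would thus be adjusted to 2002, the first year with an actual cooccurrence.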
Surge statistics
surges_stats <- countSurgesByRelation(rel.measures)
kable(surges_stats[n.surges<=10,])
n.surges | n | prop |
---|---|---|
0 | 280078 | 0.9631392 |
1 | 4595 | 0.0158014 |
2 | 1941 | 0.0066748 |
3 | 1154 | 0.0039684 |
4 | 773 | 0.0026582 |
5 | 622 | 0.0021389 |
6 | 421 | 0.0014477 |
7 | 296 | 0.0010179 |
8 | 213 | 0.0007325 |
9 | 185 | 0.0006362 |
10 | 115 | 0.0003955 |
ggplot(surges_stats,aes(n.surges,prop))+geom_col()
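The structure of this table (number of relations n for each number of surges, and the proportion prop among all relations) can be reproduced on toy data with base R. The column name surge and the toy values are assumptions made for illustration; they are not taken from the package.

```r
# Toy long-format data: one row per (relation, year), with a surge flag
toy <- data.frame(
  relation = c("a", "a", "a", "b", "b", "c", "c", "c"),
  surge    = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Number of surges per relation, then number of relations per surge count
perRelation <- tapply(toy$surge, toy$relation, sum)
counts <- table(perRelation)
stats <- data.frame(
  n.surges = as.integer(names(counts)),
  n        = as.integer(counts),
  prop     = as.integer(counts) / length(perRelation)
)
print(stats)
```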
Processing multiple parameters and saving surges data
A convenience function is provided which computes the surges based on multiple parameters and saves the resulting data to a file. This function encapsulates all the steps presented above.
system.time(computeAndSaveSurgesData(
  dataPath,
  outputDir = outputDir,
  suffix = data.suffix,
  ma_windows = c(1, 3, 5),
  measures = c('prob.joint', 'pmi', 'npmi', 'mi', 'nmi', 'scp'),
  indicators = c('rate', 'diff')
))
## [1] "processing and saving to data/output/prob.joint.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.rate.1.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.diff.1.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.rate.3.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.diff.3.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/prob.joint.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/pmi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/npmi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/mi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/nmi.diff.5.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.rate.5.min100.ND.tsv"
## [1] "processing and saving to data/output/scp.diff.5.min100.ND.tsv"
## user system elapsed
## 2667.165 167.954 2836.279
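The call above produces one file per combination of window, measure and indicator, i.e. 3 x 6 x 2 = 36 files. The following sketch enumerates the expected file names in the same order as the log; the naming pattern and the suffix value 'min100.ND' are inferred from the output shown above and are not taken from the function's documentation.

```r
# Enumerate the parameter grid; expand.grid varies the first column fastest,
# which matches the order of the log (indicator, then measure, then window).
grid <- expand.grid(
  indicator = c("rate", "diff"),
  measure   = c("prob.joint", "pmi", "npmi", "mi", "nmi", "scp"),
  window    = c(1, 3, 5),
  stringsAsFactors = FALSE
)

# Assumed pattern: <measure>.<indicator>.<window>.<suffix>.tsv
files <- paste(grid$measure, grid$indicator, grid$window, "min100.ND", "tsv", sep = ".")
length(files)  # 36
head(files, 2)
```

Such a list can be used to check that all expected output files are present after the run.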