This document is part of the “CLG Authorship Experiments” repository.
The .Rmd source document can be configured by modifying the following lines:
analyze.results <- TRUE
packages.path <- 'packages'
Sys.setenv(CLG_AUTHORSHIP_PACKAGES_PATH = packages.path)
work.dir <- 'experiments/2-doc-groups-by-case'
variable.name <- 'Documents by group'
Sys.setenv(EXPE_WORK_DIR = work.dir)
Sys.setenv(COPY_EXPE1_DATA_DIR = 'experiments/1-doc-size/100')
group.sizes <- '1 2 3 4 5'
Sys.setenv(GROUP_SIZES = group.sizes)
any.group.size <- strsplit(group.sizes, ' ',fixed=TRUE)[[1]][1]
set.seed(2022)
To reproduce the experiments below manually, first initialize the environment variables, for instance:
export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export EXPE_WORK_DIR=experiments/2-doc-groups-by-case
export COPY_EXPE1_DATA_DIR=experiments/1-doc-size/100
export GROUP_SIZES='1 2 3 4 5'
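Since every later step reads these variables, it may help to fail early when one of them is missing. The following is a minimal sketch; `require_env` is a hypothetical helper that is not part of the repository.

```shell
# Sketch: report which of the given environment variables are unset or empty.
# require_env is a hypothetical helper, not part of the repository.
require_env() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        # Only pass plain variable names here, since the name goes through eval.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" 1>&2
            status=1
        fi
    done
    return "$status"
}
```

For instance, `require_env CLG_AUTHORSHIP_PACKAGES_PATH EXPE_WORK_DIR COPY_EXPE1_DATA_DIR GROUP_SIZES || exit 1` could be run before the first script call.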
We use the same dataset as in the first experiment, which must have been computed beforehand.
source session-setup.sh
if [ ! -d "$COPY_EXPE1_DATA_DIR" ]; then
    echo "Dir '$COPY_EXPE1_DATA_DIR' not found" 1>&2
    exit 1
fi
if [ ! -d "$EXPE_WORK_DIR" ]; then
    mkdir "$EXPE_WORK_DIR"
    for SIZE in $GROUP_SIZES; do
        mkdir "$EXPE_WORK_DIR/$SIZE"
        cp -R "$COPY_EXPE1_DATA_DIR"/process "$COPY_EXPE1_DATA_DIR"/data "$COPY_EXPE1_DATA_DIR"/impostors "$EXPE_WORK_DIR/$SIZE"
        rm -f "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
    done
fi
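The copy step above should leave each size directory with the three copied subdirectories and without the old truth file. A minimal sketch of a check for this state follows; `check_size_dir` is a hypothetical helper, not part of the repository, and the check only makes sense right after the copy, before the truth file is regenerated.

```shell
# Sketch: check that one size directory has the layout produced by the copy step.
# check_size_dir is a hypothetical helper, not part of the repository.
check_size_dir() {
    dir=$1
    for sub in process data impostors; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" 1>&2
            return 1
        fi
    done
    # The truth file copied from the first experiment is removed at this stage.
    if [ -e "$dir/data/truth.txt" ]; then
        echo "unexpected file: $dir/data/truth.txt" 1>&2
        return 1
    fi
    return 0
}
```

It could be called as `check_size_dir "$EXPE_WORK_DIR/$SIZE"` inside a loop over `$GROUP_SIZES`.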
d <- readDataDir(paste(work.dir,any.group.size,'data',sep='/'))
dataSplitByAuthor <- splitAuthors(d)
## [1] "24 authors ( 22 with at least 2 books )"
## [1] "authors in train-only: mh,wdh,sw,ga,amd,hm,tsa"
## [1] "authors in test-only: fmc,haj,hj,espw,hbs,us,ab,ewaoc,mh+"
## [1] "authors in shared: ew,cfw,cdw,lma,es,mt,rwc,wta"
## [1] "books from shared authors in the training set: 90"
## [1] "all books in the training set: 285"
## [1] "books from shared authors in the test set: 89"
## [1] "books NOT from shared authors in the test set: 180"
## [1] "all books in the test set: 269"
for (size.str in strsplit(group.sizes, ' ', fixed=TRUE)[[1]]) {
  size <- as.numeric(size.str)
  full <- buildFullDatasetWithGroups(dataSplitByAuthor, 100, 100, group1.sizes=size, group2.sizes=1)
  fwrite(full, paste(work.dir, size.str, 'full-dataset.tsv', sep='/'), sep='\t')
  saveDatasetInCasesFormat(full, dir=paste(work.dir, as.character(size), sep='/'))
}
## [1] "*** TRAIN SET"
## [1] 1
## [1] "*** TEST SET"
## [1] 1
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.61"
## [1] "*** TRAIN SET"
## [1] 2
## [1] "*** TEST SET"
## [1] 2
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.69"
## [1] "*** TRAIN SET"
## [1] 3
## [1] "*** TEST SET"
## [1] 3
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.68"
## [1] "*** TRAIN SET"
## [1] 4
## [1] "*** TEST SET"
## [1] 4
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.65"
## [1] "*** TRAIN SET"
## [1] 5
## [1] "*** TEST SET"
## [1] 5
## [1] "14 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.51"
source session-setup.sh
for SIZE in $GROUP_SIZES; do
    echo "$SIZE: truth file"
    cat "$EXPE_WORK_DIR/$SIZE/train.tsv" > "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
done
## 1: truth file
## 2: truth file
## 3: truth file
## 4: truth file
## 5: truth file
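After this step, each data/truth.txt should be a byte-for-byte copy of the corresponding train.tsv. As a quick sanity check, this can be verified with `cmp`; the sketch below uses a hypothetical helper, `truth_matches_train`, which is not part of the repository.

```shell
# Sketch: succeed only if data/truth.txt is identical to train.tsv in the given directory.
# truth_matches_train is a hypothetical helper, not part of the repository.
truth_matches_train() {
    dir=$1
    # cmp -s prints nothing; it returns 0 when the files are identical,
    # and a non-zero status when they differ or one of them is missing.
    cmp -s "$dir/train.tsv" "$dir/data/truth.txt"
}
```

For example, `truth_matches_train "$EXPE_WORK_DIR/$SIZE" || echo "$SIZE: truth file differs" 1>&2` could be run for each size.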
The script ./run.sh performs the full training process for a single “size” (variable value). It is a simple script which prepares the data and then starts the training process, as described in the user guide (part 2). It is used as follows:
./run.sh $EXPE_WORK_DIR $SIZE $TASKS_DIR $NCORES
$SIZE
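Because ./run.sh handles one size at a time, covering all the sizes in `$GROUP_SIZES` requires one call per size. The sketch below runs them sequentially, assuming that is acceptable for the available resources. `run_all_sizes` and the `RUN_SCRIPT` override are hypothetical names introduced for illustration; the tasks directory and the number of cores are left to the user.

```shell
# Sketch: call the training script once per group size, stopping at the first failure.
# run_all_sizes and RUN_SCRIPT are hypothetical; RUN_SCRIPT defaults to ./run.sh.
run_all_sizes() {
    tasks_dir=$1   # directory passed as $TASKS_DIR
    ncores=$2      # number of cores passed as $NCORES
    # Unquoted on purpose: GROUP_SIZES is a space-separated list.
    for size in $GROUP_SIZES; do
        echo "training for group size $size"
        "${RUN_SCRIPT:-./run.sh}" "$EXPE_WORK_DIR" "$size" "$tasks_dir" "$ncores" || return 1
    done
}
```

It would be called as `run_all_sizes "$TASKS_DIR" "$NCORES"` once the environment variables above are set.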
The script ./evaluate-all.sh evaluates:
It is used as follows:
./evaluate-all.sh $EXPE_WORK_DIR $NCORES $TASKS_DIR
$EXPE_WORK_DIR/results which contains the detailed output for every evaluated model.
$EXPE_WORK_DIR/results/results.tsv.
d <- readExperimentResults(work.dir)
g1 <- perfByModelType(d,x.label=variable.name)
g1
g2 <- comparePerfsByEvalOn(d,diff.seen=FALSE,x.label=variable.name)
g2
## `geom_smooth()` using formula 'y ~ x'
g3 <- comparePerfsByEvalOn(d,diff.seen=TRUE,x.label=variable.name)
g3
## `geom_smooth()` using formula 'y ~ x'
g<-plot_grid(g1,g2,g3,labels=NULL,ncol=3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
ggsave('graphs-expe2.pdf', g, width=30, height=8, units='cm')