Options

The .Rmd source document can be configured by modifying the following lines:

analyze.results <- TRUE
packages.path <- 'packages'
Sys.setenv(CLG_AUTHORSHIP_PACKAGES_PATH = packages.path)
work.dir <- 'experiments/2-doc-groups-by-case'
variable.name <- 'Documents by group'
Sys.setenv(EXPE_WORK_DIR = work.dir)
Sys.setenv(COPY_EXPE1_DATA_DIR = 'experiments/1-doc-size/100')
group.sizes <- '1 2 3 4 5'
Sys.setenv(GROUP_SIZES = group.sizes)
any.group.size <- strsplit(group.sizes, ' ',fixed=TRUE)[[1]][1]
set.seed(2022)

To reproduce the experiments below manually, first initialize the environment variables, for instance:

export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export EXPE_WORK_DIR=experiments/2-doc-groups-by-case
export COPY_EXPE1_DATA_DIR=experiments/1-doc-size/100
export GROUP_SIZES='1 2 3 4 5'
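Since every later step relies on these variables, it can help to check them before starting. The following is a minimal sketch, not part of the original scripts: the function name check_env is hypothetical, and it only tests that each variable is set and non-empty.

```shell
# Hypothetical helper (not part of the original scripts): fail early if a
# required environment variable is unset or empty.
check_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup through eval keeps the helper POSIX-compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" 1>&2
      missing=1
    fi
  done
  return "$missing"
}

check_env CLG_AUTHORSHIP_PACKAGES_PATH EXPE_WORK_DIR COPY_EXPE1_DATA_DIR GROUP_SIZES \
  || echo "Export the variables listed above before continuing" 1>&2
```

The function returns a non-zero status when at least one variable is missing, so it can also guard a longer script with `check_env ... || exit 1`.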

Data generation

Dataset

We use the same dataset as in the first experiment, which must have been run beforehand so that its data directory exists.

source session-setup.sh
if [ ! -d "$COPY_EXPE1_DATA_DIR" ]; then
  echo "Dir '$COPY_EXPE1_DATA_DIR' not found" 1>&2
  exit 1
fi
if [ ! -d "$EXPE_WORK_DIR" ]; then 
  mkdir "$EXPE_WORK_DIR"
  for SIZE in $GROUP_SIZES; do
     mkdir "$EXPE_WORK_DIR/$SIZE"
     cp -R "$COPY_EXPE1_DATA_DIR"/process "$COPY_EXPE1_DATA_DIR"/data "$COPY_EXPE1_DATA_DIR"/impostors "$EXPE_WORK_DIR/$SIZE"
     rm -f "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
  done
fi
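After the copy, each size directory is expected to contain the process, data and impostors subdirectories. The check below is a sketch under that assumption only; check_layout is a hypothetical name and is not provided by the original scripts.

```shell
# Hypothetical check (not in the original scripts): report any per-size
# directory that lacks one of the three subdirectories copied above.
check_layout() {
  work_dir=$1
  shift
  status=0
  for size in "$@"; do
    for sub in process data impostors; do
      if [ ! -d "$work_dir/$size/$sub" ]; then
        echo "Missing: $work_dir/$size/$sub" 1>&2
        status=1
      fi
    done
  done
  return "$status"
}

# Run the check only when the experiment variables are available.
if [ -n "${EXPE_WORK_DIR:-}" ] && [ -n "${GROUP_SIZES:-}" ]; then
  # GROUP_SIZES is left unquoted on purpose so that it splits into one word per size.
  check_layout "$EXPE_WORK_DIR" $GROUP_SIZES || echo "Incomplete copy of the first experiment data" 1>&2
fi
```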

Training and test cases

d <- readDataDir(paste(work.dir,any.group.size,'data',sep='/'))
dataSplitByAuthor <- splitAuthors(d)
## [1] "24  authors ( 22 with at least 2 books )"
## [1] "authors in train-only:  mh,wdh,sw,ga,amd,hm,tsa"
## [1] "authors in test-only:  fmc,haj,hj,espw,hbs,us,ab,ewaoc,mh+"
## [1] "authors in shared:  ew,cfw,cdw,lma,es,mt,rwc,wta"
## [1] "books from shared authors in the training set:  90"
## [1] "all books in the training set:  285"
## [1] "books from shared authors in the test set:  89"
## [1] "books NOT from shared authors in the test set:  180"
## [1] "all books in the test set:  269"
for (size.str in strsplit(group.sizes, ' ',fixed=TRUE)[[1]]) {
  size <- as.numeric(size.str)
  full <- buildFullDatasetWithGroups(dataSplitByAuthor, 100, 100,group1.sizes=size, group2.sizes=1)
  fwrite(full, paste(work.dir,size.str,'full-dataset.tsv',sep='/'), sep='\t')
  saveDatasetInCasesFormat(full,dir=paste(work.dir,as.character(size),sep='/'))
}
## [1] "*** TRAIN SET"
## [1] 1
## [1] "*** TEST SET"
## [1] 1
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.61"
## [1] "*** TRAIN SET"
## [1] 2
## [1] "*** TEST SET"
## [1] 2
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.69"
## [1] "*** TRAIN SET"
## [1] 3
## [1] "*** TEST SET"
## [1] 3
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.68"
## [1] "*** TRAIN SET"
## [1] 4
## [1] "*** TEST SET"
## [1] 4
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.65"
## [1] "*** TRAIN SET"
## [1] 5
## [1] "*** TEST SET"
## [1] 5
## [1] "14 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.51"

Adding truth file

source session-setup.sh
for SIZE in $GROUP_SIZES; do
  echo "$SIZE: truth file"
  cat "$EXPE_WORK_DIR/$SIZE/train.tsv" > "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
done
## 1: truth file
## 2: truth file
## 3: truth file
## 4: truth file
## 5: truth file

Running the training processes

The script ./run.sh performs the full training process for a single “size” (one value of the variable). It is a simple script that prepares the data and then starts the training process, as described in the user guide (part 2). It is used as follows:

./run.sh $EXPE_WORK_DIR $SIZE $TASKS_DIR $NCORES
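Because ./run.sh handles one size at a time, covering all the sizes means calling it once per value in GROUP_SIZES. The sketch below only prints the commands (a dry run) so they can be reviewed before launching anything. The default values given to TASKS_DIR and NCORES are assumptions for illustration, since this document does not define them.

```shell
# Dry-run sketch: print the ./run.sh command for every group size.
# The TASKS_DIR and NCORES defaults below are illustrative assumptions.
GROUP_SIZES=${GROUP_SIZES:-'1 2 3 4 5'}
EXPE_WORK_DIR=${EXPE_WORK_DIR:-experiments/2-doc-groups-by-case}
TASKS_DIR=${TASKS_DIR:-/tmp/tasks}
NCORES=${NCORES:-4}

print_run_commands() {
  # GROUP_SIZES is left unquoted on purpose so that it splits on spaces.
  for SIZE in $GROUP_SIZES; do
    echo "./run.sh $EXPE_WORK_DIR $SIZE $TASKS_DIR $NCORES"
  done
}

print_run_commands
```

Once the printed commands look right, they can be executed one after the other, or by piping the output to `sh`.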

Evaluating

The script ./evaluate-all.sh evaluates:

It is used as follows:

./evaluate-all.sh $EXPE_WORK_DIR $NCORES $TASKS_DIR

Analysis

d <- readExperimentResults(work.dir)
g1 <- perfByModelType(d,x.label=variable.name)
g1

g2 <- comparePerfsByEvalOn(d,diff.seen=FALSE,x.label=variable.name)
g2
## `geom_smooth()` using formula 'y ~ x'

g3 <- comparePerfsByEvalOn(d,diff.seen=TRUE,x.label=variable.name)
g3
## `geom_smooth()` using formula 'y ~ x'

g<-plot_grid(g1,g2,g3,labels=NULL,ncol=3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
ggsave('graphs-expe2.pdf',g,width=30,height=8,units='cm')