Links

This document is part of the “CLG Authorship Experiments” repository.

Options

The .Rmd source document can be configured by modifying the following lines:

analyze.results <- TRUE
packages.path <- 'packages'
Sys.setenv(CLG_AUTHORSHIP_PACKAGES_PATH = packages.path)
work.dir <- 'experiments/2-doc-groups-by-case'
variable.name <- 'Documents by group'
Sys.setenv(EXPE_WORK_DIR = work.dir)
Sys.setenv(COPY_EXPE1_DATA_DIR = 'experiments/1-doc-size/100')
group.sizes <- '1 2 3 4 5'
Sys.setenv(GROUP_SIZES = group.sizes)
any.group.size <- strsplit(group.sizes, ' ',fixed=TRUE)[[1]][1]
set.seed(2022)

When reproducing the experiments below manually, first initialize the environment variables, for instance:

export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export EXPE_WORK_DIR=experiments/2-doc-groups-by-case
export COPY_EXPE1_DATA_DIR=experiments/1-doc-size/100
export GROUP_SIZES='1 2 3 4 5'
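
Since every later step depends on these four variables, it can help to fail early when one of them is missing. The following is a minimal sketch, not part of the repository: the function name `check_expe_env` is hypothetical, and only the variable names are taken from the lines above.

```shell
# Hypothetical helper (not part of the repository): report every required
# variable that is unset or empty, and return non-zero if any is missing.
check_expe_env() {
  missing=0
  for name in CLG_AUTHORSHIP_PACKAGES_PATH EXPE_WORK_DIR COPY_EXPE1_DATA_DIR GROUP_SIZES; do
    # Indirect lookup in POSIX sh: read the value of the variable named by $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" 1>&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `check_expe_env || exit 1` at the top of a manual session, before any of the steps below.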

Data generation

Dataset

We use the same dataset as in the first experiment, which must have been generated beforehand.

source session-setup.sh
# The data of the first experiment must already exist.
if [ ! -d "$COPY_EXPE1_DATA_DIR" ]; then
  echo "Dir '$COPY_EXPE1_DATA_DIR' not found" 1>&2
  exit 1
fi
# Create one working directory per group size, each with its own copy of the data.
if [ ! -d "$EXPE_WORK_DIR" ]; then
  mkdir "$EXPE_WORK_DIR"
  for SIZE in $GROUP_SIZES; do
    mkdir "$EXPE_WORK_DIR/$SIZE"
    cp -R "$COPY_EXPE1_DATA_DIR"/process "$COPY_EXPE1_DATA_DIR"/data "$COPY_EXPE1_DATA_DIR"/impostors "$EXPE_WORK_DIR/$SIZE"
    # Remove the truth file copied from the first experiment.
    rm -f "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
  done
fi
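
Because the copy is skipped entirely when `$EXPE_WORK_DIR` already exists, a quick layout check can confirm that every size directory is complete. This is a sketch under assumptions: the function name `check_size_dirs` is hypothetical, and the check is only meaningful right after this step, since a new `truth.txt` is written later.

```shell
# Hypothetical helper (not part of the repository).
# Usage: check_size_dirs WORK_DIR "SIZES"
# Verifies that each size directory holds the three copied subdirectories
# and that the truth file of the first experiment has been removed.
check_size_dirs() {
  work_dir=$1
  sizes=$2
  status=0
  for size in $sizes; do
    for sub in process data impostors; do
      if [ ! -d "$work_dir/$size/$sub" ]; then
        echo "Missing directory: $work_dir/$size/$sub" 1>&2
        status=1
      fi
    done
    if [ -e "$work_dir/$size/data/truth.txt" ]; then
      echo "Truth file still present: $work_dir/$size/data/truth.txt" 1>&2
      status=1
    fi
  done
  return "$status"
}
```

For instance: `check_size_dirs "$EXPE_WORK_DIR" "$GROUP_SIZES"`.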

Training and test cases

d <- readDataDir(paste(work.dir,any.group.size,'data',sep='/'))
dataSplitByAuthor <- splitAuthors(d)

## [1] "24  authors ( 22 with at least 2 books )"
## [1] "authors in train-only:  mh,wdh,sw,ga,amd,hm,tsa"
## [1] "authors in test-only:  fmc,haj,hj,espw,hbs,us,ab,ewaoc,mh+"
## [1] "authors in shared:  ew,cfw,cdw,lma,es,mt,rwc,wta"
## [1] "books from shared authors in the training set:  90"
## [1] "all books in the training set:  285"
## [1] "books from shared authors in the test set:  89"
## [1] "books NOT from shared authors in the test set:  180"
## [1] "all books in the test set:  269"

for (size.str in strsplit(group.sizes, ' ',fixed=TRUE)[[1]]) {
  size <- as.numeric(size.str)
  full <- buildFullDatasetWithGroups(dataSplitByAuthor, 100, 100,group1.sizes=size, group2.sizes=1)
  fwrite(full, paste(work.dir,size.str,'full-dataset.tsv',sep='/'), sep='\t')
  saveDatasetInCasesFormat(full,dir=paste(work.dir,as.character(size),sep='/'))
}

## [1] "*** TRAIN SET"
## [1] 1
## [1] "*** TEST SET"
## [1] 1
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.61"
## [1] "*** TRAIN SET"
## [1] 2
## [1] "*** TEST SET"
## [1] 2
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.69"
## [1] "*** TRAIN SET"
## [1] 3
## [1] "*** TEST SET"
## [1] 3
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.68"
## [1] "*** TRAIN SET"
## [1] 4
## [1] "*** TEST SET"
## [1] 4
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.65"
## [1] "*** TRAIN SET"
## [1] 5
## [1] "*** TEST SET"
## [1] 5
## [1] "14 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.51"

Adding truth file

source session-setup.sh
for SIZE in $GROUP_SIZES; do
  echo "$SIZE: truth file"
  cat "$EXPE_WORK_DIR/$SIZE/train.tsv" > "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
done

## 1: truth file
## 2: truth file
## 3: truth file
## 4: truth file
## 5: truth file
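
Before starting a training run that lasts several days, one may want to confirm that each truth file really is a copy of the corresponding `train.tsv`. The sketch below assumes only the file layout used in the loop above; the function name `check_truth_files` is hypothetical.

```shell
# Hypothetical helper (not part of the repository).
# Usage: check_truth_files WORK_DIR "SIZES"
# Compares train.tsv with data/truth.txt byte by byte for every size.
check_truth_files() {
  work_dir=$1
  sizes=$2
  status=0
  for size in $sizes; do
    # cmp -s prints nothing; a non-zero status means "different" or "missing".
    if ! cmp -s "$work_dir/$size/train.tsv" "$work_dir/$size/data/truth.txt" 2>/dev/null; then
      echo "$size: truth file missing or different from train.tsv" 1>&2
      status=1
    fi
  done
  return "$status"
}
```

For instance: `check_truth_files "$EXPE_WORK_DIR" "$GROUP_SIZES"`.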

Running the training processes

The script ./run.sh performs the full training process for a single “size” (one value of the variable). It is a simple script that prepares the data and then starts the training process, as described in the user guide (part 2). It is used as follows:

./run.sh $EXPE_WORK_DIR $SIZE $TASKS_DIR $NCORES

The training process must be run for every value of $SIZE.
Caution: A single process takes between 1 and 3 days using 40 cores.
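
Running the script for every size can be sketched as a loop over `$GROUP_SIZES`. This is only an illustration: the function name `run_all_sizes` and the `DRY_RUN` switch are hypothetical, and only the `./run.sh` argument order is taken from the usage line above. By default the sketch prints the commands instead of running them, since a sequential run over five sizes would take several days.

```shell
# Hypothetical helper (not part of the repository).
# Usage: run_all_sizes WORK_DIR "SIZES" TASKS_DIR NCORES
# With DRY_RUN unset or set to 1, the commands are printed, not executed.
run_all_sizes() {
  work_dir=$1
  sizes=$2
  tasks_dir=$3
  ncores=$4
  for size in $sizes; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "./run.sh $work_dir $size $tasks_dir $ncores"
    else
      # Stop at the first failing size rather than continuing silently.
      ./run.sh "$work_dir" "$size" "$tasks_dir" "$ncores" || return 1
    fi
  done
}
```

For instance, `run_all_sizes "$EXPE_WORK_DIR" "$GROUP_SIZES" "$TASKS_DIR" "$NCORES"` lists the commands, and prefixing the call with `DRY_RUN=0` runs them one after the other. Since the sizes are independent of each other, the printed commands could also be started separately, for example on different machines.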

Evaluating

The script ./evaluate-all.sh evaluates:

- for every “size” (value of the variable),
- the top model (according to the training) for each of the four “model types” (basic, GI, univ, meta),
- on both the training and test sets,
- and calculates the “author seen/unseen” performance values.

It is used as follows:

./evaluate-all.sh $EXPE_WORK_DIR $NCORES $TASKS_DIR

The evaluation process is also resource-intensive: it takes up to 3 hours with 40 cores, depending on the number of values.
The process creates the directory $EXPE_WORK_DIR/results, which contains the detailed output for every evaluated model.
- The main output is contained in $EXPE_WORK_DIR/results/results.tsv.
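
A quick way to check that the evaluation produced a usable results file is to count its columns and rows before loading it in R. The sketch below rests on an assumption that the source does not confirm, namely that `results.tsv` is tab-separated with a header line; the function name `summarize_results` is hypothetical.

```shell
# Hypothetical helper (not part of the repository).
# Usage: summarize_results FILE
# Assumes a tab-separated file whose first line is a header (unverified).
summarize_results() {
  file=$1
  if [ ! -f "$file" ]; then
    echo "File '$file' not found" 1>&2
    return 1
  fi
  # Column count comes from the first line; rows exclude that line.
  awk -F '\t' '
    NR == 1 { ncols = NF }
    END { printf "columns: %d\nrows: %d\n", ncols, (NR > 0 ? NR - 1 : 0) }
  ' "$file"
}
```

For instance: `summarize_results "$EXPE_WORK_DIR/results/results.tsv"`.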

Analysis

d<-readExperimentResults(work.dir)

g1 <- perfByModelType(d,x.label=variable.name)
g1

g2 <- comparePerfsByEvalOn(d,diff.seen=FALSE,x.label=variable.name)
g2

## `geom_smooth()` using formula 'y ~ x'

g3 <- comparePerfsByEvalOn(d,diff.seen=TRUE,x.label=variable.name)
g3

## `geom_smooth()` using formula 'y ~ x'

g<-plot_grid(g1,g2,g3,labels=NULL,ncol=3)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggsave('graphs-expe2.pdf',g,width=30,height=8,units='cm')

Experiment 2: size of a group of documents by the same author

Erwan Moreau

2/9/2022