Links

This document is part of the “CLG Authorship Experiments” repository:

Options

The .Rmd source document can be configured by modifying the following lines:

analyze.results <- TRUE
packages.path <- 'packages'
Sys.setenv(CLG_AUTHORSHIP_PACKAGES_PATH = packages.path)
work.dir <- 'experiments/3-training-data-size'
Sys.setenv(EXPE_WORK_DIR = work.dir)
variable.name <- 'Training data size'
Sys.setenv(COPY_EXPE1_DATA_DIR = 'experiments/1-doc-size/100')
training.cases <- '100 200 300 400 500 600 700 800 900 1000'
Sys.setenv(TRAINING_CASES = training.cases)
any.training.cases <- strsplit(training.cases, ' ',fixed=TRUE)[[1]][1]
set.seed(2022)

export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export EXPE_WORK_DIR=experiments/3-training-data-size
export COPY_EXPE1_DATA_DIR=experiments/1-doc-size/100
export TRAINING_CASES='100 200 300 400 500 600 700 800 900 1000'
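
Since the later shell steps depend on these four variables, it can help to check that they are all set before starting. The following is a minimal sketch, not part of the repository: `check_settings` is a hypothetical helper name, and the values are simply copied from the export lines above.

```shell
# Settings copied from the export lines of this document.
export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export EXPE_WORK_DIR=experiments/3-training-data-size
export COPY_EXPE1_DATA_DIR=experiments/1-doc-size/100
export TRAINING_CASES='100 200 300 400 500 600 700 800 900 1000'

# check_settings is a hypothetical helper (not provided by the repository).
# It reports every required variable that is unset or empty on stderr and
# returns a non-zero status if at least one is missing.
check_settings() {
  missing=0
  for name in CLG_AUTHORSHIP_PACKAGES_PATH EXPE_WORK_DIR COPY_EXPE1_DATA_DIR TRAINING_CASES; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing setting: $name" 1>&2
      missing=1
    fi
  done
  return "$missing"
}

check_settings && echo "settings OK"
```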

Data generation

Dataset

We reuse the dataset from the first experiment, which must have been generated beforehand.

source session-setup.sh
if [ ! -d "$COPY_EXPE1_DATA_DIR" ]; then
  echo "Dir '$COPY_EXPE1_DATA_DIR' not found" 1>&2
  exit 1
fi
if [ ! -d "$EXPE_WORK_DIR" ]; then 
  mkdir "$EXPE_WORK_DIR"
  for SIZE in $TRAINING_CASES; do
     mkdir "$EXPE_WORK_DIR/$SIZE"
     cp -R "$COPY_EXPE1_DATA_DIR"/process "$COPY_EXPE1_DATA_DIR"/data "$COPY_EXPE1_DATA_DIR"/impostors "$EXPE_WORK_DIR/$SIZE"
     rm -f "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
  done
fi

Training and test cases

d <- readDataDir(paste(work.dir,any.training.cases,'data',sep='/'))
dataSplitByAuthor <- splitAuthors(d)

## [1] "24  authors ( 22 with at least 2 books )"
## [1] "authors in train-only:  mh,wdh,sw,ga,amd,hm,tsa"
## [1] "authors in test-only:  fmc,haj,hj,espw,hbs,us,ab,ewaoc,mh+"
## [1] "authors in shared:  ew,cfw,cdw,lma,es,mt,rwc,wta"
## [1] "books from shared authors in the training set:  90"
## [1] "all books in the training set:  285"
## [1] "books from shared authors in the test set:  89"
## [1] "books NOT from shared authors in the test set:  180"
## [1] "all books in the test set:  269"

# Build and save one dataset per training size.
for (size.str in strsplit(training.cases, ' ', fixed=TRUE)[[1]]) {
  size <- as.numeric(size.str)
  size.dir <- paste(work.dir, size.str, sep='/')
  full <- buildFullDataset(dataSplitByAuthor, size, 100, withReplacement=TRUE)
  fwrite(full, paste(size.dir, 'full-dataset.tsv', sep='/'), sep='\t')
  saveDatasetInCasesFormat(full, dir=size.dir)
}

## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "14 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.34"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.33"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.36"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.33"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.51"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.35"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.39"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.48"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.42"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.42"
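
The pair counts in this log follow from the book counts reported earlier (285 books in the training set, 269 in the test set). The Cartesian product is the square of the number of books, and removing the pairs made of the same book twice subtracts one pair per book. The sketch below only rechecks this arithmetic; `pair_counts` is a hypothetical helper name.

```shell
# pair_counts is a hypothetical helper for rechecking the figures in the log.
# Given a number of books N, it prints "N*N N*N-N": the size of the Cartesian
# product and the number of pairs left once (book, same book) pairs are dropped.
pair_counts() {
  n=$1
  product=$((n * n))
  echo "$product $((product - n))"
}

pair_counts 285   # training set: 81225 80940
pair_counts 269   # test set:     72361 72092

# Same-author and different-author pairs add up to the remaining pairs.
echo $((7608 + 73332))   # 80940
echo $((6354 + 65738))   # 72092
```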

Adding truth file

source session-setup.sh
for SIZE in $TRAINING_CASES; do
  echo "$SIZE: truth file"
  cat "$EXPE_WORK_DIR/$SIZE/train.tsv" > "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
done

## 100: truth file
## 200: truth file
## 300: truth file
## 400: truth file
## 500: truth file
## 600: truth file
## 700: truth file
## 800: truth file
## 900: truth file
## 1000: truth file

Running the training processes

The script ./run.sh performs the full training process for a single “size” (one value of the variable). It prepares the data and then starts the training process, as described in the user guide (part 2). It is used as follows:

./run.sh $EXPE_WORK_DIR $SIZE $TASKS_DIR $NCORES

The training process must be run once for every value of $SIZE.
Caution: a single run takes between 1 and 3 days using 40 cores.
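
Because ./run.sh handles one size at a time, a loop over $TRAINING_CASES is one way to cover all the values. The sketch below is a dry run that only prints the commands. The default values given to TASKS_DIR and NCORES are placeholders chosen for illustration, not values taken from the repository.

```shell
# EXPE_WORK_DIR and TRAINING_CASES default to the settings of this document.
# TASKS_DIR and NCORES are placeholders and must be adapted to the local setup.
EXPE_WORK_DIR=${EXPE_WORK_DIR:-experiments/3-training-data-size}
TRAINING_CASES=${TRAINING_CASES:-100 200 300 400 500 600 700 800 900 1000}
TASKS_DIR=${TASKS_DIR:-/tmp/clg-tasks}
NCORES=${NCORES:-40}

# Print one ./run.sh command per training size without executing it.
# Remove the leading "echo" to actually start the (multi-day) runs.
print_training_commands() {
  for SIZE in $TRAINING_CASES; do
    echo ./run.sh "$EXPE_WORK_DIR" "$SIZE" "$TASKS_DIR" "$NCORES"
  done
}

print_training_commands
```

At 1 to 3 days per run, ten sequential runs on one machine would take roughly 10 to 30 days, so distributing the sizes over several machines may be preferable where possible.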

Evaluating

The script ./evaluate-all.sh evaluates:

- for every “size” (value of the variable),
- the top model (according to the training) for each of the four “model types” (basic, GI, univ, meta),
- on both the training and test sets,
- and calculates the “author seen/unseen” performance values.

It is used as follows:

./evaluate-all.sh $EXPE_WORK_DIR $NCORES $TASKS_DIR

The evaluation process is also resource-intensive: it takes up to 3 hours with 40 cores, depending on the number of values.
The process creates the directory $EXPE_WORK_DIR/results which contains the detailed output for every evaluated model.
- The main output is contained in $EXPE_WORK_DIR/results/results.tsv.
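
The column layout of results.tsv is not described here, so a first look at the file can be limited to its header. The sketch below assumes that the file is tab-separated with a header row; `list_tsv_columns` is a hypothetical helper name.

```shell
# list_tsv_columns is a hypothetical helper: it prints the column names found
# in the first line of a tab-separated file, one per line.
list_tsv_columns() {
  head -n 1 "$1" | tr '\t' '\n'
}

RESULTS="${EXPE_WORK_DIR:-experiments/3-training-data-size}/results/results.tsv"
if [ -f "$RESULTS" ]; then
  list_tsv_columns "$RESULTS"
  # Number of result rows, excluding the header line (assumes a header row).
  echo "rows: $(($(wc -l < "$RESULTS") - 1))"
else
  echo "No results yet: $RESULTS not found" 1>&2
fi
```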

Analysis

d<-readExperimentResults(work.dir)

g1 <- perfByModelType(d,x.label=variable.name)
g1

## Warning: Removed 8 rows containing missing values (geom_point).

## Warning: Removed 3 row(s) containing missing values (geom_path).

g2 <- comparePerfsByEvalOn(d,diff.seen=FALSE,x.label=variable.name)
g2

## `geom_smooth()` using formula 'y ~ x'

g3 <- comparePerfsByEvalOn(d,diff.seen=TRUE,x.label=variable.name)
g3

## `geom_smooth()` using formula 'y ~ x'

g<-plot_grid(g1,g2,g3,labels=NULL,ncol=3)

## Warning: Removed 8 rows containing missing values (geom_point).

## Warning: Removed 3 row(s) containing missing values (geom_path).

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggsave('graphs-expe3.pdf', g, width=30, height=8, units='cm')

Experiment 3: number of training cases

Erwan Moreau

2/9/2022