Options

The .Rmd source document can be configured by modifying the following lines:

analyze.results <- TRUE
packages.path <- 'packages'
Sys.setenv(CLG_AUTHORSHIP_PACKAGES_PATH = packages.path)
work.dir <- 'experiments/3-training-data-size'
Sys.setenv(EXPE_WORK_DIR = work.dir)
variable.name <- 'Training data size'
Sys.setenv(COPY_EXPE1_DATA_DIR = 'experiments/1-doc-size/100')
training.cases <- '100 200 300 400 500 600 700 800 900 1000'
Sys.setenv(TRAINING_CASES = training.cases)
any.training.cases <- strsplit(training.cases, ' ',fixed=TRUE)[[1]][1]
set.seed(2022)

export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export EXPE_WORK_DIR=experiments/3-training-data-size
export COPY_EXPE1_DATA_DIR=experiments/1-doc-size/100
export TRAINING_CASES='100 200 300 400 500 600 700 800 900 1000'
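
The bash chunks below read the list of sizes from `TRAINING_CASES` as a single space-separated string. As a minimal sketch of how that string is consumed on the shell side (the variable names `count` and `first_size` are illustrative and not part of the original scripts):

```shell
# Same value as exported in the configuration above.
TRAINING_CASES='100 200 300 400 500 600 700 800 900 1000'

# The unquoted expansion relies on word splitting to iterate over the sizes.
count=0
for SIZE in $TRAINING_CASES; do
  count=$((count + 1))
done

# First size in the list: the shell counterpart of `any.training.cases`
# in the R configuration (everything before the first space).
first_size=${TRAINING_CASES%% *}

echo "number of sizes: $count"
echo "first size: $first_size"
```

Because the loop depends on word splitting, `$TRAINING_CASES` has to stay unquoted in the `for` statement, whereas paths such as `"$EXPE_WORK_DIR/$SIZE"` are quoted.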

Data generation

Dataset

We reuse the dataset from the first experiment, which must have been generated beforehand.

source session-setup.sh
if [ ! -d "$COPY_EXPE1_DATA_DIR" ]; then
  echo "Dir '$COPY_EXPE1_DATA_DIR' not found" 1>&2
  exit 1
fi
if [ ! -d "$EXPE_WORK_DIR" ]; then 
  mkdir "$EXPE_WORK_DIR"
  for SIZE in $TRAINING_CASES; do
     mkdir "$EXPE_WORK_DIR/$SIZE"
     cp -R "$COPY_EXPE1_DATA_DIR"/process "$COPY_EXPE1_DATA_DIR"/data "$COPY_EXPE1_DATA_DIR"/impostors "$EXPE_WORK_DIR/$SIZE"
     rm -f "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
  done
fi
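
After this step, each size directory should contain the three copied subdirectories and no truth file. The following is a small sanity check one could run at this point; `check_size_dir` is a hypothetical helper written for illustration and is not part of the original scripts.

```shell
# Hypothetical helper: returns 0 when a size directory looks freshly prepared,
# i.e. it contains process/, data/ and impostors/ but no data/truth.txt yet.
check_size_dir() {
  dir=$1
  for sub in process data impostors; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing '$dir/$sub'" 1>&2
      return 1
    fi
  done
  # The copy step removes the truth file, so it should not exist at this stage.
  if [ -e "$dir/data/truth.txt" ]; then
    echo "unexpected '$dir/data/truth.txt'" 1>&2
    return 1
  fi
  return 0
}
```

It could be called as `for SIZE in $TRAINING_CASES; do check_size_dir "$EXPE_WORK_DIR/$SIZE" || exit 1; done`. The check is only meaningful before the truth file is added in a later step.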

Training and test cases

d <- readDataDir(paste(work.dir,any.training.cases,'data',sep='/'))
dataSplitByAuthor <- splitAuthors(d)
## [1] "24  authors ( 22 with at least 2 books )"
## [1] "authors in train-only:  mh,wdh,sw,ga,amd,hm,tsa"
## [1] "authors in test-only:  fmc,haj,hj,espw,hbs,us,ab,ewaoc,mh+"
## [1] "authors in shared:  ew,cfw,cdw,lma,es,mt,rwc,wta"
## [1] "books from shared authors in the training set:  90"
## [1] "all books in the training set:  285"
## [1] "books from shared authors in the test set:  89"
## [1] "books NOT from shared authors in the test set:  180"
## [1] "all books in the test set:  269"
for (size.str in strsplit(training.cases, ' ',fixed=TRUE)[[1]]) {
  size <- as.numeric(size.str)
  full <- buildFullDataset(dataSplitByAuthor, size, 100,withReplacement=TRUE)
  fwrite(full, paste(work.dir,size.str,'full-dataset.tsv',sep='/'), sep='\t')
  saveDatasetInCasesFormat(full,dir=paste(work.dir,as.character(size),sep='/'))
}
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "14 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.34"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.33"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.36"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.33"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.51"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.35"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.39"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.48"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.42"
## [1] "*** TRAIN SET"
## [1] "Cartesian product:  81225"
## [1] "Removing pairs with same book:  80940"
## [1] "pairs same author: 7608"
## [1] "pairs diff author: 73332"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "*** TEST SET"
## [1] "Cartesian product:  72361"
## [1] "Removing pairs with same book:  72092"
## [1] "pairs same author: 6354"
## [1] "pairs diff author: 65738"
## [1] "picking with replacement (same book allowed several times)"
## [1] "shuffling rows"
## [1] "15 authors in the training set"
## [1] "proportion of 'author seen in training' in the test set: 0.42"
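
The pair counts in the log above can be checked by hand: with n books, the Cartesian product contains n × n ordered pairs, and removing the n pairs that match a book with itself leaves n × n − n. The sketch below redoes this arithmetic with the book counts printed by `splitAuthors`; the function name `pairs_without_self` is illustrative only.

```shell
# Ordered pairs of distinct books among n books: n*n minus the n self-pairs.
pairs_without_self() {
  n=$1
  echo $((n * n - n))
}

train_books=285   # "all books in the training set" in the log above
test_books=269    # "all books in the test set" in the log above

echo "train: $((train_books * train_books)) -> $(pairs_without_self "$train_books")"
echo "test:  $((test_books * test_books)) -> $(pairs_without_self "$test_books")"

# Same-author and different-author pairs add up to the remaining pairs.
echo "train check: $((7608 + 73332))"
echo "test check:  $((6354 + 65738))"
```

The results agree with the log: 81225 and 80940 for the training set, 72361 and 72092 for the test set. These counts are identical for every size because the pool of candidate pairs is built from the same author split each time.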

Adding truth file

source session-setup.sh
for SIZE in $TRAINING_CASES; do
  echo "$SIZE: truth file"
  cat "$EXPE_WORK_DIR/$SIZE/train.tsv" > "$EXPE_WORK_DIR/$SIZE/data/truth.txt"
done
## 100: truth file
## 200: truth file
## 300: truth file
## 400: truth file
## 500: truth file
## 600: truth file
## 700: truth file
## 800: truth file
## 900: truth file
## 1000: truth file

Running the training processes

The script ./run.sh performs the full training process for a single “size” (one value of the variable). It is a simple script that prepares the data and then starts the training process, as described in the user guide (part 2). It is used as follows:

./run.sh $EXPE_WORK_DIR $SIZE $TASKS_DIR $NCORES
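
Since the script handles one size at a time, it has to be called once per value in `TRAINING_CASES`. The sketch below prints those calls as a dry run. The values given to `TASKS_DIR` and `NCORES` are placeholders chosen for illustration, and the function `run_all_sizes` is hypothetical. The source does not state whether the runs are meant to be sequential or parallel.

```shell
# Values from the configuration; used only if not already set.
EXPE_WORK_DIR="${EXPE_WORK_DIR:-experiments/3-training-data-size}"
TRAINING_CASES="${TRAINING_CASES:-100 200 300 400 500 600 700 800 900 1000}"

# Placeholder values (assumptions, not from the document): adjust as needed.
TASKS_DIR="${TASKS_DIR:-/tmp/tasks}"
NCORES="${NCORES:-4}"

# Dry run: print one ./run.sh command per size instead of executing it.
run_all_sizes() {
  for SIZE in $TRAINING_CASES; do
    echo ./run.sh "$EXPE_WORK_DIR" "$SIZE" "$TASKS_DIR" "$NCORES"
  done
}

run_all_sizes
```

Removing the `echo` inside the loop would launch the training runs one after the other.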

Evaluating

The script ./evaluate-all.sh runs the evaluation for the whole experiment directory. It is used as follows:

./evaluate-all.sh $EXPE_WORK_DIR $NCORES $TASKS_DIR

Analysis

d<-readExperimentResults(work.dir)
g1 <- perfByModelType(d,x.label=variable.name)
g1
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 3 row(s) containing missing values (geom_path).

g2 <- comparePerfsByEvalOn(d,diff.seen=FALSE,x.label=variable.name)
g2
## `geom_smooth()` using formula 'y ~ x'

g3 <- comparePerfsByEvalOn(d,diff.seen=TRUE,x.label=variable.name)
g3
## `geom_smooth()` using formula 'y ~ x'

g<-plot_grid(g1,g2,g3,labels=NULL,ncol=3)
## Warning: Removed 8 rows containing missing values (geom_point).
## Warning: Removed 3 row(s) containing missing values (geom_path).
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
ggsave('graphs-expe3.pdf',g,width=30,height=8,units='cm')