Overview

TODO

An Author Verification problem consists in determining, for two text documents (or two groups of documents), whether they were written by the same person or, more generally, whether they exhibit similar stylistic features. The verif-author.pl script produces a set of features for every verification problem given as input, using various possible strategies and parameters.


Rmd

This document was generated from an R Markdown source file. The source file is provided in the repository and can be used to reproduce these experiments. It can be executed through the RStudio interface (“knit” button) or as follows:

rmarkdown::render('user-guide-part1.Rmd')
  • Naturally, the document can also be read as regular documentation.
  • Important: this Rmd document includes bash chunks. In order for these to work, the environment must have been configured as explained in Requirements below.
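From a shell, the rendering can also be launched non-interactively with Rscript (assuming R and the rmarkdown package are installed):

Rscript -e "rmarkdown::render('user-guide-part1.Rmd')"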

Options

The .Rmd source document can be configured by modifying the following lines:

packages.path <- 'packages'
Sys.setenv(CLG_AUTHORSHIP_PACKAGES_PATH = packages.path)
output.dir <- '/tmp/user-guide-part1.output'
Sys.setenv(OUTPUT_DIR = output.dir)
delete.previous.output <- TRUE
snippets.size <- 25
Sys.setenv(SNIPPETS_SIZE = snippets.size)
  • The packages.path variable indicates the location of the dependencies (see software requirements below). For the sake of simplicity it is assumed that all the packages are in the same directory (as recommended in the installation instructions, see below).
  • In Rmd, every bash chunk is executed in an independent session; this is why the packages path must be initialized here. It is used in turn to initialize the environment with the script session-setup.sh, which must be present in the same directory when executing the Rmd source file. This is not needed when executing commands manually, as long as the environment has been configured once (see below).

In order to manually execute some of the commands below, it is recommended to assign the appropriate value to the above environment variables for the whole session, for example:

export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export OUTPUT_DIR=/tmp/user-guide-part1.output

One can of course also replace the variables manually with the appropriate values in every command (less convenient).
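For example, with the default values above, the initialization step below could equivalently be written with literal paths:

rm -rf /tmp/user-guide-part1.output
mkdir /tmp/user-guide-part1.output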

Initialization

rm -rf "$OUTPUT_DIR"
mkdir "$OUTPUT_DIR"

Requirements

Software components

The experiments below require the clg-authorship-analytics software to be installed as well as all its dependencies. A detailed installation guide can be found in the documentation.

The following is a quick test to check that the software is properly installed and configured. It should show the first lines of the inline help message for verif-author.pl.

source session-setup.sh
verif-author.pl -h | head -n 3
## 
## Usage: verif-author.pl [options] <config file> [<fileA1:..:fileAn> <fileB1:..:fileBm>]

Data: Diachronic Corpus for Literary Style Analysis (DCLSA)

These experiments require the DCLSA corpus, which can be downloaded as shown below. The code chunks below assume that the dataset has been extracted in the data directory, for example as follows:

cd data
wget https://scss.tcd.ie/clg/DCLSA/DCLSA.tar.gz
tar xfz DCLSA.tar.gz
echo "If the DCLSA data is available, a list of 10 files (e.g. 'data/gb/1851-tsa-lilfawwrt') should be listed below:"
ls data/*/* | head
## If the DCLSA data is available, a list of 10 files (e.g. 'data/gb/1851-tsa-lilfawwrt') should be listed below:
## data/gb/1851-tsa-lilfawwrt
## data/gb/1851-tsa-tlasorl
## data/gb/1851-tsa-ttw
## data/gb/1851-tsa-wftw
## data/gb/1851-tsa-wt
## data/gb/1852-sw-q
## data/gb/1852-sw-q-v2
## data/gb/1852-tsa-hhalp
## data/gb/1852-tsa-ml
## data/gb/1852-tsa-trowww
  • This document assumes that the DCLSA data is found in the directory ./data. If this is not the case, it is advised to create a symbolic link for convenience.
  • The data directory must be writable, because the program caches intermediate files whenever possible for efficiency reasons.
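A quick way to check both conditions from the shell:

if [ -d data ] && [ -w data ]; then echo "data directory found and writable"; else echo "problem with the data directory"; fi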

For these experiments we create a version of the dataset where each document is replaced with a snippet of $SNIPPETS_SIZE lines (25 by default, as set in the Options above):

source session-setup.sh
find data/gb/* data/ia/* -maxdepth 0 -type f | grep -v '\.' | ./create-snippets-dataset.sh $SNIPPETS_SIZE $OUTPUT_DIR/data

Simple examples

Example using only command line arguments

The simplest way to use the system is to specify the strategy and its parameters directly on the command line. In the following we compare Mark Twain’s “The Adventures of Tom Sawyer” and “The Adventures of Huckleberry Finn” (the last two arguments).

source session-setup.sh
verif-author.pl -s "strategy=basic;obsType.CHAR.CCC.lc1.sl0.mf3=1" $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf 2>/dev/null
## 0.154320690290313
  • The obsType parameter represents character (CHAR) trigrams (CCC), lowercased (lc1), with no sentence limit (sl0) and minimum frequency 3 (mf3). See the CLGTextTools documentation for details about observation types.
  • The default similarity measure minmax is used (see the formula below).
  • The 2>/dev/null is used temporarily to mask the numerous warnings (due to not providing values for various parameters).
  • The output is a single similarity value: the minmax similarity computed on the character trigram frequencies of the two input texts.
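For reference, and assuming the usual definition of this measure (sometimes called the Ružička similarity) over the observation frequency profiles $f_A$ and $f_B$ of the two documents:

$$
\mathrm{minmax}(A,B) = \frac{\sum_{o} \min\big(f_A(o), f_B(o)\big)}{\sum_{o} \max\big(f_A(o), f_B(o)\big)}
$$

where $o$ ranges over the observations (here character trigrams): 1 means identical frequency profiles, 0 means no observation in common.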

Example using a configuration file

A better and more convenient way is to provide the parameters in a config file.

The content of the config file is:

cat conf/basic.2.conf
## strategy=basic
## 
## obsType.CHAR.CCC.lc1.sl0.mf3=1
## 
## basic.simMeasure=minmax
## 
## # general options 
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0

The simMeasure option for the basic strategy indicates which similarity measure to use. In this particular case the other three options at the end are not really relevant and are provided only to avoid the warning messages:

  • The formatting parameter is used to interpret sentences or paragraphs as separate units. This may be useful in conjunction with the sl (sentence limit) part of the observation types if some formatting is present in the input files. Possible values:
    • 0 (or undef or empty string): no formatting at all
    • singleLineBreak: line breaks as separators for meaningful units (e.g. sentences)
    • doubleLineBreak: empty lines (i.e. at least two consecutive line breaks) as separators for meaningful units (e.g. paragraphs)
  • The wordTokenization parameter indicates whether the input text should be tokenized (value 1) or not (0). This is relevant only for word observation types.
  • The multipleProbeAggregate parameter specifies which method should be used to aggregate the similarity scores if there is more than one probe document on either side (or both): random, median, or the arithm, geom, harmo means.
    • If random (the default), a document is picked at random from the list (disadvantage: the same input can give different results).
    • Otherwise the similarity is computed between all pairs (cartesian product N×M) and the values are aggregated according to the parameter (disadvantage: N×M times longer).

The same result as above can be obtained as follows:

source session-setup.sh
verif-author.pl -c conf/basic.2.conf $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf
## 0.154320690290313
  • The -c option is used to cache the count files and reuse them when possible: for every document, a count file is created for every observation type, containing the frequency of every observation.

Using multiple observation types

Several observation types can be specified. In this case the output shows the corresponding types as columns.

Config file:

cat conf/basic.3.conf
## strategy=basic
## 
## obsType.CHAR.CC.lc1.sl0.mf3=1
## obsType.CHAR.CCC.lc1.sl0.mf3=1
## obsType.WORD.T.lc1.sl0.mf2=1
## obsType.WORD.TT.lc1.sl0.mf2=1
## obsType.VOCABCLASS.MORPHO.mf3=1
## 
## basic.simMeasure=minmax
## 
## # general options 
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0

In this example we use various observation type “families”.

source session-setup.sh
verif-author.pl -H conf/basic.3.conf $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf
## CHAR.CC.lc1.sl0.mf3  CHAR.CCC.lc1.sl0.mf3    VOCABCLASS.MORPHO.mf3   WORD.T.lc1.sl0.mf2  WORD.TT.lc1.sl0.mf2
## 0.376597519675562    0.154320690290313   0.608910170769454   0.218033824460178   0
  • Option -H is used to print the column names as the first line (header).

Multiple documents by group

The program can also receive two groups of documents instead of two single documents as input. Documents in the same group are assumed to have been written by the same author. Multiple documents by the same author can potentially provide crucial insight by allowing the verification method to distinguish constant stylistic features from document-specific ones.

In this example we use the same config as above but compare a group of books by Mark Twain vs. a group of books by Henry James:

source session-setup.sh
verif-author.pl -H conf/basic.3.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## CHAR.CC.lc1.sl0.mf3  CHAR.CCC.lc1.sl0.mf3    VOCABCLASS.MORPHO.mf3   WORD.T.lc1.sl0.mf2  WORD.TT.lc1.sl0.mf2
## 0.529523647552977    0.161697151385542   0.850543708117557   0.258581235697941   0.0623519026158064
  • There can be any number of documents in each of the two groups.
    • This is more general than the PAN format, which consists of a single questioned document on one side.
  • The multipleProbeAggregate parameter presented above determines how the features are combined across the documents of one group to produce the final feature value (a variant is sketched below).
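For example, a hypothetical variant of the configuration above which averages over all N×M document pairs instead of picking one probe document at random (arithm is one of the aggregation values listed earlier; the derived config file name is only an example):

source session-setup.sh
# derive a config identical to conf/basic.3.conf except for the aggregation method
sed 's/^multipleProbeAggregate=random/multipleProbeAggregate=arithm/' conf/basic.3.conf >"$OUTPUT_DIR/basic.3.arithm.conf"
verif-author.pl -H "$OUTPUT_DIR/basic.3.arithm.conf" $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm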

Multiple verification problems

Finally the program can receive multiple verification problems, applying the same strategy and parameters to all of them.

  • The different problems are read from STDIN, one problem per line. It is of course more convenient to first write all the problems in a file, as shown below.
  • The output features are also written on several lines, each line corresponding to one problem, in the same order as the input.
  • Every input problem is independent: different problems may or may not involve the same documents, and there is no constraint on the groups of documents either.
cat data/user-guide.expl1/cases.txt
## data/gb/1876-mt-taots:data/gb/1884-mt-taohf data/gb/1888-hj-tap:data/gb/1896-hj-tsop:data/gb/1898-hj-ttm 0
## data/gb/1876-mt-taots data/gb/1884-mt-taohf  1
## data/gb/1888-hj-tap:data/gb/1896-hj-tsop data/gb/1898-hj-ttm 1
## data/gb/1882-haj-ffbts data/gb/1895-cfw-mcac:data/gb/1895-cfw-tfyaois:data/gb/1895-cfw-daois 0
## data/gb/1905-haj-fftf data/gb/1905-ga-ttt    0
## data/gb/1918-rwc-tlg:data/gb/1918-rwc-tmw data/gb/1919-rwc-tct:data/gb/1919-rwc-tsos 1
## data/gb/1882-haj-ffbts data/gb/1895-cfw-daois    0
## data/gb/1895-cfw-mcac data/gb/1895-cfw-tfyaois   1
## data/gb/1895-cfw-mcac data/gb/1895-cfw-daois 1
## data/gb/1905-haj-fftf data/gb/1884-mt-taohf  0

The same config as above is applied to these ten problems:

source session-setup.sh
cat data/user-guide.expl1/cases.txt | sed "s:data/gb:$OUTPUT_DIR/data:g" | cut -f 1 | verif-author.pl -H conf/basic.3.conf
## CHAR.CC.lc1.sl0.mf3  CHAR.CCC.lc1.sl0.mf3    VOCABCLASS.MORPHO.mf3   WORD.T.lc1.sl0.mf2  WORD.TT.lc1.sl0.mf2
## 0.432680054266918    0.155799341777201   0.645388065128682   0.261227981490843   0
## 0.376597519675562    0.154320690290313   0.608910170769455   0.218033824460178   0
## 0.567570958614491    0.259950774037392   0.899399759903962   0.340503716630032   0.052733757731726
## 0.578053060954744    0.218375675034626   0.907122120743362   0.332324485867694   0.0529377803437403
## 0.493152261741496    0.268698787837746   0.809671046981019   0.231056406912292   0.0524242641266203
## 0.545848564293855    0.24939001174395    0.81539335416075    0.242304388645852   0
## 0.502219966747142    0.218375675034626   0.903816882588741   0.307334155756297   0.0529377803437403
## 0.556095247988579    0.288583032193941   0.875215534937187   0.29113400053571    0.135197349952277
## 0.449405984666321    0.150204339730904   0.907839644682968   0.264233539487414   0
## 0.489771216782876    0.226890401197041   0.739570164348925   0.230498238550579   0.069018791777021
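A problems file can also be written by hand and fed directly on STDIN: each line contains the two groups separated by whitespace, with the documents of a group separated by ':'. For example (my-cases.txt is just a hypothetical file name):

source session-setup.sh
printf '%s\n' \
  "$OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf" \
  "$OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop $OUTPUT_DIR/data/1898-hj-ttm" \
  >"$OUTPUT_DIR/my-cases.txt"
verif-author.pl -H conf/basic.3.conf <"$OUTPUT_DIR/my-cases.txt"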

Technical notes about observation types (preprocessing)

The observation types belong to different families: CHAR for character n-grams, WORD for word n-grams, POS for Part-Of-Speech and VOCABCLASS for custom mappings of words to other categories (this can be used, for instance, to count words based on their capitalization). See the explanation about the n-gram patterns for details.

Two special cases presented below require additional preparation steps.

POS observations

These require the POS tags to have been precomputed and stored in .POS files, otherwise the program will fail with an error. The simplest way to precompute these POS files is as follows:

source session-setup.sh
ls $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf | count-obs-dataset.sh english POS.P.sl0.mf1 2>/dev/null
## count-obs-dataset.sh: tokenization and POS tagging
## count-obs-dataset.sh: generating count files

Note: the input documents are provided on STDIN.

Stop words for word observations

Word-based observations accept an option which specifies which words to take into account; all the other words are then replaced by a placeholder symbol _. It can be used to count only patterns involving frequent words (often called stop words), in order to avoid content words, which are less likely to be good indicators of an author's style. Note that this is different from the mf (minimum frequency) option, which discards full observations, as opposed to only some of the words within an observation.

A “stop words list” based on frequency can be computed from a collection of documents as follows. In this example we use all the books by Mark Twain as the corpus and extract the 100 most frequent words.

source session-setup.sh
find $OUTPUT_DIR/data/*-mt-* -maxdepth 0 -type f  >"$OUTPUT_DIR/mark-twain.list"
count-obs-dataset.sh -i "$OUTPUT_DIR/mark-twain.list" -o '-g' english WORD.T.lc1.sl0.mf2 2>/dev/null
sort -r -n -k 2,2 "$OUTPUT_DIR/global.observations/WORD.T.lc1.sl0.mf2.count" | cut -f 1 | head -n 100 >"$OUTPUT_DIR/mt-100.stop-list"
## count-obs-dataset.sh: no TreeTagger tokenization/POS tagging needed
## count-obs-dataset.sh: generating count files
  • The find command is only used to list all the regular files corresponding to Mark Twain (mt).
  • The count-obs-dataset.sh command creates the ‘global’ output directories doc-freq.observations and global.observations in the directory where the input list is located.
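A quick way to inspect the resulting list (the exact content depends on the corpus, but very frequent English function words are expected near the top):

source session-setup.sh
head -n 5 "$OUTPUT_DIR/mt-100.stop-list"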

Example

The “stop words” option is used by adding the name of the resource containing the list of stop words to the desired word observation types, for example:

cat conf/basic.4.conf
## strategy=basic
## 
## obsType.WORD.T.lc1.sl0.mf2=1
## obsType.WORD.TT.lc1.sl0.mf2=1
## obsType.WORD.TT.lc1.sl0.mt-100.mf2=1
## obsType.WORD.TTT.lc1.sl0.mt-100.mf2=1
## obsType.WORD.TTTT.lc1.sl0.mt-100.mf2=1
## 
## basic.simMeasure=minmax
## 
## # general options 
## multipleProbeAggregate=median
## wordTokenization=1
## formatting=0
  • The last three observation types in this config use the mt-100 stop words list.
  • The mapping from a resource name to a file can be provided with the -v option of the verif-author.pl command, as shown in this example:
source session-setup.sh
verif-author.pl -H -v mt-100:$OUTPUT_DIR/mt-100.stop-list conf/basic.4.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## WORD.T.lc1.sl0.mf2   WORD.TT.lc1.sl0.mf2 WORD.TT.lc1.sl0.mt-100.mf2  WORD.TTT.lc1.sl0.mt-100.mf2 WORD.TTTT.lc1.sl0.mt-100.mf2
## 0.259904608594392    0.0102997694081476  0.227367139858328   0.0476417076372006  0

Strategies

Currently the program offers three verification strategies. The simplest one, called the basic strategy, has been used in the previous examples. The two more advanced strategies are presented below.

Universum (univ)

Given two documents A and B, repeatedly mix different portions of A and B together, then compare the similarities obtained between A and A, B and B, A and B, mixed-AB and A, mixed-AB and B, and mixed-AB and mixed-AB. If A and B have the same author, the resulting similarity scores should all be similar.

Config file:

cat conf/univ.1.conf
## 
## strategy=univ
## 
## obsType.WORD.T.lc1.sl0.mf3=1
## obsType.WORD.TT.lc1.sl0.mf3=1
## obsType.WORD.TTT.lc1.sl0.mf3=1
## 
## univ.simMeasure=cosine
## univ.nbRounds=50
## univ.propObsSubset=0.4
## univ.withReplacement=0
## univ.countMostSimByRound=all
## univ.aggregSimByRoundAggregType=median
## univ.finalScoresMethod=both
## univ.aggregSimByRound=all
## univ.splitWithoutReplacementMaxNbAttempts=3
## 
## # general options 
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0

This strategy has many parameters (from the documentation):

  • nbRounds: number of rounds (higher number -> more randomization, hence less variance in the result). Default: 100.
  • propObsSubset: (0<=p<1) the proportion of observations/occurrences used to mix 2 documents together at each round (p and 1-p); if zero, the proportion is picked randomly at every round. Default: 0.5.
  • withReplacement: 0 or 1. Default: 0.
  • splitWithoutReplacementMaxNbAttempts: maximum number of attempts to split a doc without replacement if at least one of the subsets is empty. Default: 5.
  • finalScoresMethod: aggregSimByRound, countMostSimByRound or both. Overall method(s) used to obtain the features: by aggregating the similarities for each category, or by counting the most similar category among rounds. Default: countMostSimByRound.
  • aggregSimByRound: all, homogeneity, sameCat or mergedOrNot. all means use all individual categories as features; with homogeneity four final features are considered: AA+BB, AM+BM, AB, MM; with sameCat there are only two final features: AA+BB+MM, AB+AM+BM; with mergedOrNot there are two categories: AA+BB+AB, AM+BM+MM. Default: sameCat.
  • countMostSimByRound: all, homogeneity, sameCat or mergedOrNot. See above. Default: sameCat.
  • aggregSimByRoundAggregType: median, arithm, geom, harmo. Default: arithm.

This is an ensemble method: the time it requires is proportional to nbRounds (it also depends on the size and number of input documents, of course). With only 50 rounds the following run takes around 20 seconds:

source session-setup.sh
verif-author.pl -H conf/univ.1.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## doc(s) too small => impossible to find enough partitions => used empty doc(s) 21 times (out of 50 rounds). at /home/erwan/now/22-authorship/CLGTextTools/lib/CLGTextTools/Logging.pm line 118.
## simByRound_all_AA    simByRound_all_BA   simByRound_all_BB   simByRound_all_MA   simByRound_all_MB   simByRound_all_MM   countMostSimByRound_all_AA  countMostSimByRound_all_BA  countMostSimByRound_all_BB  countMostSimByRound_all_MA  countMostSimByRound_all_MB  countMostSimByRound_all_MM
## 0.195028469322358    0   0   0.416753736220937   0   0.75318181890011    0.22    0   0.04    0.08    0.02    0.64

General Impostors (GI)

The “Impostors” verification strategy (see “Determining If Two Documents Are Written by the Same Author” by Koppel and Winter, 2014): portions of the tested documents are repeatedly compared to each other and to (portions of) external documents (impostors). If the similarity between the tested documents is significantly higher than the similarity obtained between a tested document and an impostor, then the tested documents are likely to be by the same author.

Preparing the impostors

The method relies on a set of external impostor documents, which must be provided with option -d to verif-author.pl (see below).

Below we create a directory with a random subset of 100 books from the dataset. Note that the system is unpleasantly strict about the naming of the path and files:

  • The files must be located in a directory <resource path>/<id>/impostors/ (note the last impostors dir), where:
    • <resource path> is the path later provided to -d
    • id is the resource id provided in the config file (see below; note that there can be multiple resources)
  • Every impostor document filename must end with .txt (the system reads only these files).
source session-setup.sh
mkdir $OUTPUT_DIR/impostors
mkdir $OUTPUT_DIR/impostors/GI.1.impostors
n=$(find $OUTPUT_DIR/data/* -maxdepth 0 -type f | wc -l)
find $OUTPUT_DIR/data/* -maxdepth 0 -type f | random-lines.pl 100 1 $n | while read f; do cp $f $OUTPUT_DIR/impostors/GI.1.impostors/$(basename "$f").txt; done
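As a quick sanity check, the impostors directory created above should now contain 100 files, all with names ending in .txt:

source session-setup.sh
ls "$OUTPUT_DIR/impostors/GI.1.impostors" | head -n 3
# the following count is expected to be 100
find "$OUTPUT_DIR/impostors/GI.1.impostors" -type f -name '*.txt' | wc -l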

“Simple” example

This is a “simple” example using the impostors created above.

Config file:

cat conf/GI.1.conf
## 
## 
## strategy=GI
## 
## obsType.WORD.T.lc1.sl0.mf3=1
## obsType.CHAR.CCC.lc1.sl0.mf3=1
## 
## 
## GI.impostors=GI.1.impostors
## GI.nbImpostorsUsed=10
## GI.nbRounds=50
## GI.selectNTimesMostSimilarFirst=0
## GI.propObsSubset=0.45
## GI.docSubsetMethod=byObservation
## GI.minDocFreq=4
## GI.simMeasure=minmax
## GI.useCountMostSimFeature=original
## GI.kNearestNeighbors=5
## GI.mostSimilarFirst=doc
## GI.aggregRelRank=arithm
## GI.useAggregateSim=ratio
## GI.aggregateSimStat=median
## 
## 
## # general options 
## wordTokenization=1
## formatting=0

Parameters (see details in the documentation):

  • impostors is a list of datasets (resources ids) from which impostors will be picked randomly with equal probability (i.e. independently from the number of docs in each dataset).
  • minDocFreq: minimum doc frequency for observations (optional, default 1);
  • selectNTimesMostSimilarFirst is described below.
  • nbImpostorsUsed: number of impostors documents to select from the impostors dataset (done only once for all rounds) (default 25)
  • nbRounds: number of rounds (higher number -> more randomization, hence less variance in the result) (default 100)
  • propObsSubset: (0<=p<1) the proportion of observations/occurrences to keep in every document at each round; if zero, the proportion is picked randomly at every round (default 0.5)
  • docSubsetMethod (default byObservation):
    • byOccurrence -> the proportion is applied to the set of all occurrences
    • byObservation -> applied only to distinct observations
  • useCountMostSimFeature: the method to calculate the “count most similar” feature (default: original):
    • 0 -> unused,
    • original -> original method by Koppel and Winter (2014)
    • ASGALF -> method used in “A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF)” by Khonji (2014)
    • ASGALFavg -> variant of the above
  • kNearestNeighbors: uses only the K (value) most similar impostors when calculating result features. default: 0 (use all impostors).
  • mostSimilarFirst: doc or run, specifies whether the K most similar impostors are selected globally (doc) or for each run (run); unused if GI_kNearestNeighbors=0. Default: doc.
  • aggregRelRank: 0, median, arithm, geom, harmo. If not 0, computes the relative rank of the similarity between A and B among the similarities against all impostors in each round; the specified value is used to aggregate all the relative ranks (i.e. the values by round). Default: 0.
  • useAggregateSim: 0, diff, ratio. If not 0, computes X = the aggregated similarity between A and B across all rounds and Y = the aggregated similarity between any probe document and any impostor across all rounds, then returns X-Y (diff) or X/Y (ratio). Default: 0.
  • aggregateSimStat: median, arithm, geom, harmo. Aggregation method used when useAggregateSim is not 0 (ignored otherwise). Default: arithm.
source session-setup.sh
verif-author.pl -H -d $OUTPUT_DIR conf/GI.1.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## countMostSim_original    relativeRank    aggregatedSim_ratio
## 0.3  0.403   0.935757753736536

Example with pre-computed similarities

The parameter selectNTimesMostSimilarFirst, if not zero, triggers an initial filtering stage which retrieves the N most similar impostor documents for every probe document, instead of picking impostor documents randomly, with N = selectNTimesMostSimilarFirst * nbImpostors. This ensures that the most dissimilar impostors are not used, while maintaining a degree of randomness that depends on the value of selectNTimesMostSimilarFirst.
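For instance, with the values used in conf/GI.2.conf below, and assuming nbImpostors above refers to the nbImpostorsUsed parameter, every probe document gets

$$
N = \texttt{selectNTimesMostSimilarFirst} \times \texttt{nbImpostorsUsed} = 5 \times 10 = 50
$$

pre-selected candidate impostors.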

However, this option requires the initial similarity values to be precomputed, in order to avoid repeating the process every time the method is called, which could be prohibitive in computing time. The pre-computed similarity values are loaded from <probeFile>.simdir/<impDataset>.similarities.

Below we precompute similarities with the same impostors data and same probe files as in the previous example:

source session-setup.sh
find $OUTPUT_DIR/impostors/GI.1.impostors/* -maxdepth 0 -type f  >$OUTPUT_DIR/impostors/GI.1.impostors.list
ls $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap $OUTPUT_DIR/data/1896-hj-tsop $OUTPUT_DIR/data/1898-hj-ttm >$OUTPUT_DIR/probe-files.list
sim-collections-doc-by-doc.pl -o WORD.T.lc1.sl0.mf3 -R BASENAME WORD.T.lc1.sl0.mf3:CHAR.CCC.lc1.sl0.mf3 $OUTPUT_DIR/probe-files.list GI.1.impostors:$OUTPUT_DIR/impostors/GI.1.impostors.list 2>/dev/null
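A quick way to check that the pre-computed similarity files were produced (they follow the <probeFile>.simdir/<impDataset>.similarities layout described above):

source session-setup.sh
find $OUTPUT_DIR/data -name '*.similarities' | head -n 3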

The config file is:

cat conf/GI.2.conf
## 
## 
## strategy=GI
## 
## obsType.WORD.T.lc1.sl0.mf3=1
## obsType.CHAR.CCC.lc1.sl0.mf3=1
## 
## GI.selectNTimesMostSimilarFirst=5
## GI.preSimObsType=WORD.T.lc1.sl0.mf3
## 
## GI.impostors=GI.1.impostors
## GI.nbImpostorsUsed=10
## GI.nbRounds=20
## GI.propObsSubset=0.45
## GI.docSubsetMethod=byObservation
## GI.minDocFreq=4
## GI.simMeasure=minmax
## GI.useCountMostSimFeature=original
## GI.kNearestNeighbors=5
## GI.mostSimilarFirst=doc
## GI.aggregRelRank=arithm
## GI.useAggregateSim=ratio
## GI.aggregateSimStat=median
## 
## 
## # general options 
## wordTokenization=1
## formatting=0
source session-setup.sh
verif-author.pl -H -d $OUTPUT_DIR conf/GI.2.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## countMostSim_original    relativeRank    aggregatedSim_ratio
## 0.16 0.369   0.850218981657574

Supervised learning

verif-author.pl returns a set of features for every problem. A verification problem, called a case in the system, is a pair of groups of documents, the documents within a group being assumed to be by the same author. A case can be labelled, i.e. provided with the gold-standard answer stating whether the two groups of documents are actually by the same author or not. Naturally, a set of labelled cases can be used to train a supervised model.

The simplest option would be to treat the task as binary classification. However, since we are also interested in quantifying the confidence level of the system, the supervised setting is implemented as a regression task.

Simple regression model

The system expects a standard directory structure. It can be created automatically by prepare-input.sh (see below), but for the sake of simplicity we start with a predefined directory structure stored in the data directory. The input subdirectory is populated with a few documents.

mkdir "$OUTPUT_DIR"/sup1
mkdir "$OUTPUT_DIR"/sup1/input "$OUTPUT_DIR"/sup1/resources "$OUTPUT_DIR"/sup1/output
target="$OUTPUT_DIR/sup1/resources-options.conf"
echo "vocabResources=" >"$target"
echo "useCountFiles=1" >>"$target"
echo "datasetResourcesPath=$OUTPUT_DIR/impostors/GI.1.impostors" >>"$target"
cat data/user-guide.expl1/cases.txt data/user-guide.expl1/test-cases.txt | cut -f 1 | tr ' :' '\n\n'  | sort -u | while read f; do cp "$f" "$OUTPUT_DIR/sup1/input"; done
cat data/user-guide.expl1/cases.txt | sed 's:data/[^/]*/::g' >"$OUTPUT_DIR/sup1/cases.txt"
cat data/user-guide.expl1/test-cases.txt | sed 's:data/[^/]*/::g' >"$OUTPUT_DIR/sup1/test-cases.gold.txt"
cat "$OUTPUT_DIR/sup1/test-cases.gold.txt" | cut -f 1 >"$OUTPUT_DIR/sup1/test-cases.txt"

The config file used below contains:

cat conf/basic.supervised.1.conf
## strategy=basic
## 
## obsType.CHAR.CCC.lc1.sl0.mf3=1
## obsType.WORD.T.lc1.sl0.mf5=1
## obsType.WORD.TT.lc1.sl0.mf5=1
## obsType.VOCABCLASS.MORPHO.mf5=1
## 
## basic.simMeasure=minmax
## 
## # general options 
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0
## 
## # supervised learning parameters
## confidenceTrainProp=0
## confidenceLearnMethod=simpleOptimC1
## learnMethod=M5P-M4
  • learnMethod=M5P-M4: this parameter indicates which regression algorithm and parameters to use in Weka (here the M5P regression tree).

Training

source session-setup.sh
train-test.sh -l "$OUTPUT_DIR/sup1/cases.txt" -m "$OUTPUT_DIR/sup1/model" "$OUTPUT_DIR/sup1" conf/basic.supervised.1.conf "$OUTPUT_DIR/sup1/output"
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
  • The Weka model is stored in $OUTPUT_DIR/sup1/model.
  • The model is automatically applied to the training data; the resulting predictions can be found in $OUTPUT_DIR/sup1/output/train/self-predict.arff.

Applying the model

We apply the previously trained model to some new cases:

source session-setup.sh
train-test.sh -a "$OUTPUT_DIR/sup1/test-cases.txt" -m "$OUTPUT_DIR/sup1/model" "$OUTPUT_DIR/sup1" conf/basic.supervised.1.conf "$OUTPUT_DIR/sup1/output"
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'

The predicted values are written in $OUTPUT_DIR/sup1/output/test/predict.tsv as a single column, in the same order as the input cases.

Evaluating

The predictions can be evaluated against the gold-standard answers either with AUC or accuracy. Both the predictions and gold file must be provided as two columns: <case id> <value>.

source session-setup.sh
paste "$OUTPUT_DIR/sup1/test-cases.txt" "$OUTPUT_DIR/sup1/output/test/predict.tsv" >"$OUTPUT_DIR/sup1/predicted.tsv"
auc.pl "$OUTPUT_DIR/sup1/test-cases.gold.txt" "$OUTPUT_DIR/sup1/predicted.tsv"
## 0.90625
source session-setup.sh
accuracy.pl "$OUTPUT_DIR/sup1/test-cases.gold.txt" "$OUTPUT_DIR/sup1/predicted.tsv" | cut -f 1
## 0.5

k-fold Cross-validation

Splitting for CV

The script called below requires the dataset to have been split beforehand, with the fold indexes stored in a folds subdirectory:

source session-setup.sh
mkdir "$OUTPUT_DIR/sup1/output/folds"
n=$(cat $OUTPUT_DIR/sup1/cases.txt  | wc -l)
generate-random-cross-fold-ids.pl 5 $n "$OUTPUT_DIR/sup1/output/folds/fold"
for f in "$OUTPUT_DIR"/sup1/output/folds/fold*.indexes; do
  cat  "$OUTPUT_DIR/sup1/cases.txt" | select-lines-nos.pl "$f" 1 >${f%.indexes}.cases
done

Running CV

source session-setup.sh
cp "$OUTPUT_DIR/sup1/cases.txt" "$OUTPUT_DIR/sup1/output/truth"
train-cv.sh conf/basic.supervised.1.conf "$OUTPUT_DIR/sup1" "$OUTPUT_DIR/sup1/output"
## train-cv.sh: fail safe mode is OFF
## train-cv.sh: cleanup mode is OFF
##  fold.1;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
##  fold.2;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
##  fold.3;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
##  fold.4;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
##  fold.5;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'

The results can be found in a .perf file named after the configuration filename:

cat "$OUTPUT_DIR/sup1/output/basic.supervised.1.perf"
## 0.588000 0.840000    0.700000

The three values are:

  • the “final score”, i.e. the product of the following two values: the AUC and the accuracy modified according to the PAN15 evaluation method
  • the AUC score
  • the accuracy score modified according to PAN15 evaluation method
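For the run above this is consistent with the reported values:

$$
0.840000 \times 0.700000 = 0.588000
$$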

Next: genetic learning

The genetic learning process is described in user guide - part 2. Note that the second part requires much more computing power than the first one (multiple cores and large RAM recommended).