This document is part of the “CLG Authorship Experiments” repository:
TODO
An Author Verification problem consists of determining, for any two text documents (or any two groups of documents), whether they were written by the same person or, more generally, whether they exhibit similar stylistic features. The verif-author.pl script produces a set of features for every verification problem given as input, using various possible strategies and parameters.
See also:
This document was generated from an R Markdown source file. The source file is provided in the repository and can be used to reproduce these experiments. It can be executed through the RStudio interface (“knit” button) or as follows:
rmarkdown::render('user-guide-part1.Rmd')
The .Rmd source document can be configured by modifying the following lines:
packages.path <- 'packages'
Sys.setenv(CLG_AUTHORSHIP_PACKAGES_PATH = packages.path)
output.dir <- '/tmp/user-guide-part1.output'
Sys.setenv(OUTPUT_DIR = output.dir)
delete.previous.output <- TRUE
snippets.size <- 25
Sys.setenv(SNIPPETS_SIZE = snippets.size)
The packages.path variable indicates the location of the dependencies (see software requirements below). For the sake of simplicity it is assumed that all the packages are in the same directory (as recommended in the installation instructions, see below). The environment is also configured by session-setup.sh, which must be present in the same directory when executing the Rmd source file. This is not needed when executing commands manually, as long as the environment has been configured once (see below).

In order to manually execute some of the commands below, it is recommended to assign the appropriate values to the above environment variables for the whole session, for example:
export CLG_AUTHORSHIP_PACKAGES_PATH=packages
export OUTPUT_DIR=/tmp/user-guide-part1.output
Of course, one can also manually replace the variables with the appropriate values in every command, but this is less convenient.
rm -rf "$OUTPUT_DIR"
mkdir "$OUTPUT_DIR"
The experiments below require the clg-authorship-analytics software to be installed as well as all its dependencies. A detailed installation guide can be found in the documentation.
The following is a quick test to check that the software is properly installed and configured. It should show the first lines of the inline help message for verif-author.pl.
source session-setup.sh
verif-author.pl -h | head -n 3
##
## Usage: verif-author.pl [options] <config file> [<fileA1:..:fileAn> <fileB1:..:fileBm>]
These experiments require the DCLSA corpus, which can be found here. The code chunks below assume that the dataset has been extracted in the data directory, for example as follows:
cd data
wget https://scss.tcd.ie/clg/DCLSA/DCLSA.tar.gz
tar xfz DCLSA.tar.gz
echo "If the DCLSA data is available, a list of 10 files (e.g. 'data/gb/1851-tsa-lilfawwrt') should be listed below:"
ls data/*/* | head
## If the DCLSA data is available, a list of 10 files (e.g. 'data/gb/1851-tsa-lilfawwrt') should be listed below:
## data/gb/1851-tsa-lilfawwrt
## data/gb/1851-tsa-tlasorl
## data/gb/1851-tsa-ttw
## data/gb/1851-tsa-wftw
## data/gb/1851-tsa-wt
## data/gb/1852-sw-q
## data/gb/1852-sw-q-v2
## data/gb/1852-tsa-hhalp
## data/gb/1852-tsa-ml
## data/gb/1852-tsa-trowww
The code chunks below assume that the dataset is available in ./data. If this is not the case, it is advised to create a symbolic link for convenience. The data directory must be writable, because the program caches intermediate files whenever possible for efficiency reasons.

For these experiments we create a version of the dataset where each document is replaced with a snippet of $SNIPPETS_SIZE lines:
source session-setup.sh
find data/gb/* data/ia/* -maxdepth 0 -type f | grep -v '\.' | ./create-snippets-dataset.sh $SNIPPETS_SIZE $OUTPUT_DIR/data
The simplest way to use the system is to specify the strategy and its parameters directly on the command line. In the following we compare Mark Twain’s “The Adventures of Tom Sawyer” and “The Adventures of Huckleberry Finn” (the last two arguments).
source session-setup.sh
verif-author.pl -s "strategy=basic;obsType.CHAR.CCC.lc1.sl0.mf3=1" $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf 2>/dev/null
## 0.154320690290313
The obsType parameter represents character (CHAR) trigrams (CCC), lowercased, with no sentence limit and a minimum frequency of 3. See the CLGTextTools documentation for details about observation types. 2>/dev/null is used temporarily to mask the numerous warnings (due to not providing values for various parameters).

A better and more convenient way is to provide the parameters in a config file.
The content of the config file is:
cat conf/basic.2.conf
## strategy=basic
##
## obsType.CHAR.CCC.lc1.sl0.mf3=1
##
## basic.simMeasure=minmax
##
## # general options
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0
The simMeasure option for the basic strategy indicates which similarity measure to use. In this particular case the other three options at the end are not really relevant and are provided only to avoid the warning messages:

The formatting parameter is used to interpret sentences or paragraphs as separate units. This may be useful in conjunction with the sl (sentence limit) part of the observation types if some formatting is present in the input files: ** 0 (or undef or empty string): no formatting at all ** singleLineBreak: line breaks as separators for meaningful units (e.g. sentences) ** doubleLineBreak: empty lines (i.e. at least two consecutive line breaks) as separators for meaningful units (e.g. paragraphs).

The wordTokenization parameter indicates whether the input text should be tokenized (value 1) or not (0). This is relevant only for word observation types.

The multipleProbeAggregate parameter specifies which method should be used to aggregate the similarity scores if there is more than one probe doc on either side (or both): random, median, or arithm, geom, harmo mean. ** If random (default), a single doc is picked from the list (disadvantage: the same input can give different results). ** Otherwise the similarity is computed between all pairs (cartesian product NxM) and the values are aggregated according to the parameter (disadvantage: NxM times longer).

The same result as above can be obtained as follows:
source session-setup.sh
verif-author.pl -c conf/basic.2.conf $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf
## 0.154320690290313
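For intuition, the min-max similarity (basic.simMeasure=minmax) can be sketched in shell: sim(A,B) = sum of min(freqA, freqB) over all observations, divided by the sum of max(freqA, freqB). This is an illustration only, not the program's implementation, and the two-column "observation TAB frequency" file format below is an assumption made for the example.

```shell
# Illustrative min-max similarity between two count files, where each
# file has one "observation<TAB>frequency" pair per line (assumed format).
minmax_sim() {
  # join on the observation key, treating missing observations as 0,
  # then accumulate the sums of per-observation minima and maxima
  join -t $'\t' -a1 -a2 -e 0 -o 0,1.2,2.2 <(sort "$1") <(sort "$2") |
  awk -F'\t' '{
      min += ($2 < $3) ? $2 : $3
      max += ($2 > $3) ? $2 : $3
    }
    END { printf "%.4f\n", min / max }'
}
```

Identical documents score 1, documents with no shared observations score 0.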
The -c option is used to cache and, if possible, reuse the count files: for every document, a count file is created for every observation type, containing the frequency of every observation.

Several observation types can be specified. In this case the output shows the corresponding types as columns.
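As an illustration of what such a count file contains, a CHAR.CCC-style count (lowercased character trigrams) could be produced roughly as follows. This is a sketch of the idea, not the actual cache format or code used by the program:

```shell
# Sketch: count lowercased character trigrams in a document and print
# one "observation<TAB>frequency" pair per line (illustrative format).
char_trigram_counts() {
  tr 'A-Z' 'a-z' < "$1" |
  awk '{
    # emit every trigram of the line, one per output line
    for (i = 1; i <= length($0) - 2; i++) print substr($0, i, 3)
  }' |
  sort | uniq -c |
  sed -E 's/^ *([0-9]+) (.*)/\2\t\1/'
}
```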
Config file:
cat conf/basic.3.conf
## strategy=basic
##
## obsType.CHAR.CC.lc1.sl0.mf3=1
## obsType.CHAR.CCC.lc1.sl0.mf3=1
## obsType.WORD.T.lc1.sl0.mf2=1
## obsType.WORD.TT.lc1.sl0.mf2=1
## obsType.VOCABCLASS.MORPHO.mf3=1
##
## basic.simMeasure=minmax
##
## # general options
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0
In this example we use various observation types “families”.
source session-setup.sh
verif-author.pl -H conf/basic.3.conf $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf
## CHAR.CC.lc1.sl0.mf3 CHAR.CCC.lc1.sl0.mf3 VOCABCLASS.MORPHO.mf3 WORD.T.lc1.sl0.mf2 WORD.TT.lc1.sl0.mf2
## 0.376597519675562 0.154320690290313 0.608910170769454 0.218033824460178 0
The -H option prints the column names as the first line (header).

The program can also receive two groups of documents instead of two documents as input. Documents in the same group are assumed to have been written by the same author. Multiple documents by the same author can potentially provide crucial insight, by allowing the verification method to distinguish constant stylistic features from document-specific ones.
In this example we use the same config as above but compare a group of books by Mark Twain vs. a group of books by Henry James:
source session-setup.sh
verif-author.pl -H conf/basic.3.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## CHAR.CC.lc1.sl0.mf3 CHAR.CCC.lc1.sl0.mf3 VOCABCLASS.MORPHO.mf3 WORD.T.lc1.sl0.mf2 WORD.TT.lc1.sl0.mf2
## 0.529523647552977 0.161697151385542 0.850543708117557 0.258581235697941 0.0623519026158064
The multipleProbeAggregate parameter presented above determines how the features are combined across the documents in one group to produce the final feature value.

Finally, the program can receive multiple verification problems, applying the same strategy and parameters to all of them.
cat data/user-guide.expl1/cases.txt
## data/gb/1876-mt-taots:data/gb/1884-mt-taohf data/gb/1888-hj-tap:data/gb/1896-hj-tsop:data/gb/1898-hj-ttm 0
## data/gb/1876-mt-taots data/gb/1884-mt-taohf 1
## data/gb/1888-hj-tap:data/gb/1896-hj-tsop data/gb/1898-hj-ttm 1
## data/gb/1882-haj-ffbts data/gb/1895-cfw-mcac:data/gb/1895-cfw-tfyaois:data/gb/1895-cfw-daois 0
## data/gb/1905-haj-fftf data/gb/1905-ga-ttt 0
## data/gb/1918-rwc-tlg:data/gb/1918-rwc-tmw data/gb/1919-rwc-tct:data/gb/1919-rwc-tsos 1
## data/gb/1882-haj-ffbts data/gb/1895-cfw-daois 0
## data/gb/1895-cfw-mcac data/gb/1895-cfw-tfyaois 1
## data/gb/1895-cfw-mcac data/gb/1895-cfw-daois 1
## data/gb/1905-haj-fftf data/gb/1884-mt-taohf 0
The same config as above is applied to these ten problems:
source session-setup.sh
cat data/user-guide.expl1/cases.txt | sed "s:data/gb:$OUTPUT_DIR/data:g" | cut -f 1 | verif-author.pl -H conf/basic.3.conf
## CHAR.CC.lc1.sl0.mf3 CHAR.CCC.lc1.sl0.mf3 VOCABCLASS.MORPHO.mf3 WORD.T.lc1.sl0.mf2 WORD.TT.lc1.sl0.mf2
## 0.432680054266918 0.155799341777201 0.645388065128682 0.261227981490843 0
## 0.376597519675562 0.154320690290313 0.608910170769455 0.218033824460178 0
## 0.567570958614491 0.259950774037392 0.899399759903962 0.340503716630032 0.052733757731726
## 0.578053060954744 0.218375675034626 0.907122120743362 0.332324485867694 0.0529377803437403
## 0.493152261741496 0.268698787837746 0.809671046981019 0.231056406912292 0.0524242641266203
## 0.545848564293855 0.24939001174395 0.81539335416075 0.242304388645852 0
## 0.502219966747142 0.218375675034626 0.903816882588741 0.307334155756297 0.0529377803437403
## 0.556095247988579 0.288583032193941 0.875215534937187 0.29113400053571 0.135197349952277
## 0.449405984666321 0.150204339730904 0.907839644682968 0.264233539487414 0
## 0.489771216782876 0.226890401197041 0.739570164348925 0.230498238550579 0.069018791777021
The observation types belong to different families: CHAR for character n-grams, WORD for word n-grams, POS for Part-Of-Speech tags and VOCABCLASS for custom mappings of words to other categories (this can be used, for instance, to count words based on their capitalization). See the explanation about n-gram patterns for details.
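To illustrate the idea behind VOCABCLASS (this is a hypothetical mapping, not the actual MORPHO mapping, whose classes are not detailed here), each word could be replaced by a coarse class before counting:

```shell
# Hypothetical VOCABCLASS-style mapping: replace each word by a coarse
# class based on capitalization/digits; the classes are what gets counted.
vocab_class() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^[0-9]+$/)           c = "NUM"    # all digits
      else if ($i ~ /^[A-Z][a-z]*$/) c = "CAP"    # capitalized word
      else if ($i ~ /^[a-z]+$/)      c = "LOW"    # lowercase word
      else                           c = "OTHER"  # anything else
      printf "%s%s", c, (i < NF) ? " " : "\n"
    }
  }'
}
```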
Two special cases presented below require additional preparation steps.
These require the POS tags to have been precomputed and stored in .POS files; otherwise the program will fail with an error. The simplest way to precompute these POS files is as follows:
source session-setup.sh
ls $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf | count-obs-dataset.sh english POS.P.sl0.mf1 2>/dev/null
## count-obs-dataset.sh: tokenization and POS tagging
## count-obs-dataset.sh: generating count files
Note: the input documents are provided on STDIN.
Word-based observations accept an option which specifies which words to take into account; all other words are replaced with a placeholder symbol _. It can be used to count only patterns involving frequent words (often called stop words), in order to avoid content words, which are less likely to be good indicators of an author's style. Note that this is different from the mf (minimum frequency) option, which discards full observations rather than some of the words within an observation.
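The replacement idea can be sketched as follows (an illustration of the principle, not the actual implementation):

```shell
# Sketch: replace every word that is not in the stop-word list (one word
# per line in the list file) with the placeholder '_'.
mask_words() {
  awk -v listfile="$1" '
    BEGIN { while ((getline w < listfile) > 0) stop[w] = 1 }
    {
      for (i = 1; i <= NF; i++) $i = ($i in stop) ? $i : "_"
      print
    }'
}
```

Word n-grams counted on the masked text then only distinguish the stop words and the positions of the remaining words.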
A “stop words list” based on frequency can be computed from a collection of documents as follows. In this example we use all the books by Mark Twain as the corpus and extract the 100 most frequent words.
source session-setup.sh
find $OUTPUT_DIR/data/*-mt-* -maxdepth 0 -type f >"$OUTPUT_DIR/mark-twain.list"
count-obs-dataset.sh -i "$OUTPUT_DIR/mark-twain.list" -o '-g' english WORD.T.lc1.sl0.mf2 2>/dev/null
sort -r -n -k 2,2 "$OUTPUT_DIR/global.observations/WORD.T.lc1.sl0.mf2.count" | cut -f 1 | head -n 100 >"$OUTPUT_DIR/mt-100.stop-list"
## count-obs-dataset.sh: no TreeTagger tokenization/POS tagging needed
## count-obs-dataset.sh: generating count files
The find command is only used to list all the regular files corresponding to Mark Twain (mt).

The count-obs-dataset.sh command creates the ‘global’ output directories doc-freq.observations and global.observations in the directory where the input list is located.

The “stop words” option is used by adding the name of the resource containing the list of stop words to the desired word observation types, for example:
cat conf/basic.4.conf
## strategy=basic
##
## obsType.WORD.T.lc1.sl0.mf2=1
## obsType.WORD.TT.lc1.sl0.mf2=1
## obsType.WORD.TT.lc1.sl0.mt-100.mf2=1
## obsType.WORD.TTT.lc1.sl0.mt-100.mf2=1
## obsType.WORD.TTTT.lc1.sl0.mt-100.mf2=1
##
## basic.simMeasure=minmax
##
## # general options
## multipleProbeAggregate=median
## wordTokenization=1
## formatting=0
The mt-100 part of the observation types above refers to the mt-100 stop words list. The resource is provided with option -v to the verif-author.pl command, as shown in this example:

source session-setup.sh
verif-author.pl -H -v mt-100:$OUTPUT_DIR/mt-100.stop-list conf/basic.4.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## WORD.T.lc1.sl0.mf2 WORD.TT.lc1.sl0.mf2 WORD.TT.lc1.sl0.mt-100.mf2 WORD.TTT.lc1.sl0.mt-100.mf2 WORD.TTTT.lc1.sl0.mt-100.mf2
## 0.259904608594392 0.0102997694081476 0.227367139858328 0.0476417076372006 0
Currently the program offers three verification strategies. The simplest one, called the basic
strategy, has been used in the previous examples. The two more advanced strategies are presented below.
The univ strategy

Given two documents A and B, the strategy repeatedly mixes different portions of A and B together, then compares the similarities obtained between A and A, B and B, A and B, mixed-AB and A, mixed-AB and B, mixed-AB and mixed-AB. If A and B have the same author, the resulting similarity scores should all be similar.
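The mixing step of one round can be sketched as follows. This is illustrative only: the real system mixes observation occurrences according to propObsSubset, which is approximated here by sampling lines from each document.

```shell
# Sketch of one univ mixing round: draw a random proportion p of the
# lines of each document and concatenate them into a mixed document M.
mix_round() {
  # $1, $2: documents A and B (one occurrence per line); $3: proportion p
  local f n
  for f in "$1" "$2"; do
    n=$(wc -l < "$f")
    shuf "$f" | head -n "$(awk -v n="$n" -v p="$3" 'BEGIN { print int(n * p) }')"
  done
}
```

The round then compares M against A, B and itself, alongside the within- and between-document similarities.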
Config file:
cat conf/univ.1.conf
##
## strategy=univ
##
## obsType.WORD.T.lc1.sl0.mf3=1
## obsType.WORD.TT.lc1.sl0.mf3=1
## obsType.WORD.TTT.lc1.sl0.mf3=1
##
## univ.simMeasure=cosine
## univ.nbRounds=50
## univ.propObsSubset=0.4
## univ.withReplacement=0
## univ.countMostSimByRound=all
## univ.aggregSimByRoundAggregType=median
## univ.finalScoresMethod=both
## univ.aggregSimByRound=all
## univ.splitWithoutReplacementMaxNbAttempts=3
##
## # general options
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0
This strategy has many parameters (from the documentation):

- nbRounds: number of rounds (a higher number means more randomization, hence less variance in the result). Default: 100.
- propObsSubset: (0<=p<1) the proportion of observations/occurrences used to mix the 2 documents together at each round (p and 1-p); if zero, the proportion is picked randomly at every round. Default: 0.5.
- withReplacement: 0 or 1. Default: 0.
- splitWithoutReplacementMaxNbAttempts: maximum number of attempts to split a doc without replacement if at least one of the subsets is empty. Default: 5.
- finalScoresMethod: aggregSimByRound, countMostSimByRound or both. Overall method(s) to obtain the features: by aggregating the similarities for each category, or by counting the most similar category among rounds. Default: countMostSimByRound.
- aggregSimByRound: all, homogeneity, sameCat or mergedOrNot. all means use all individual categories as features; with homogeneity four final features are considered: AA+BB, AM+BM, AB, MM; with sameCat there are only two final features: AA+BB+MM, AB+AM+BM; with mergedOrNot there are two categories: AA+BB+AB, AM+BM+MM. Default: sameCat.
- countMostSimByRound: all, homogeneity, sameCat or mergedOrNot. See above. Default: sameCat.
- aggregSimByRoundAggregType: median, arithm, geom or harmo. Default: arithm.
This is an ensemble method: the time it requires is proportional to nbRounds (and also depends, of course, on the size and number of input documents). With only 50 rounds, the following run takes around 20 seconds:
source session-setup.sh
verif-author.pl -H conf/univ.1.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## doc(s) too small => impossible to find enough partitions => used empty doc(s) 21 times (out of 50 rounds). at /home/erwan/now/22-authorship/CLGTextTools/lib/CLGTextTools/Logging.pm line 118.
## simByRound_all_AA simByRound_all_BA simByRound_all_BB simByRound_all_MA simByRound_all_MB simByRound_all_MM countMostSimByRound_all_AA countMostSimByRound_all_BA countMostSimByRound_all_BB countMostSimByRound_all_MA countMostSimByRound_all_MB countMostSimByRound_all_MM
## 0.195028469322358 0 0 0.416753736220937 0 0.75318181890011 0.22 0 0.04 0.08 0.02 0.64
The GI strategy

The “Impostors” verification strategy (see “Determining if Two Documents are by the Same Author” by Koppel and Winter, 2014): portions of the tested documents are repeatedly compared to each other and to (portions of) external documents (the impostors). If the similarity between the tested documents is significantly higher than the similarity between a tested document and an impostor, then the tested documents are likely to be by the same author.

The method relies on a set of external impostor documents, which must be provided with option -d to verif-author.pl (see below).
Below we create a directory containing a random subset of 100 books from the dataset. Note that the system is unpleasantly strict about the naming of the path and files: <resource path>/<id>/impostors/ (note the last impostors dir), where:

- <resource path> is the path later provided to -d;
- <id> is the resource id provided in the config file (see below; note that there can be multiple resources);
- every impostor file must have the extension .txt (the system reads only these files).

source session-setup.sh
mkdir $OUTPUT_DIR/impostors
mkdir $OUTPUT_DIR/impostors/GI.1.impostors
n=$(find $OUTPUT_DIR/data/* -maxdepth 0 -type f | wc -l)
find $OUTPUT_DIR/data/* -maxdepth 0 -type f | random-lines.pl 100 1 $n | while read f; do cp $f $OUTPUT_DIR/impostors/GI.1.impostors/$(basename "$f").txt; done
This is a “simple” example using the impostors created above.
Config file:
cat conf/GI.1.conf
##
##
## strategy=GI
##
## obsType.WORD.T.lc1.sl0.mf3=1
## obsType.CHAR.CCC.lc1.sl0.mf3=1
##
##
## GI.impostors=GI.1.impostors
## GI.nbImpostorsUsed=10
## GI.nbRounds=50
## GI.selectNTimesMostSimilarFirst=0
## GI.propObsSubset=0.45
## GI.docSubsetMethod=byObservation
## GI.minDocFreq=4
## GI.simMeasure=minmax
## GI.useCountMostSimFeature=original
## GI.kNearestNeighbors=5
## GI.mostSimilarFirst=doc
## GI.aggregRelRank=arithm
## GI.useAggregateSim=ratio
## GI.aggregateSimStat=median
##
##
## # general options
## wordTokenization=1
## formatting=0
Parameters (see details in the documentation):

- impostors: a list of datasets (resource ids) from which impostors will be picked randomly with equal probability (i.e. independently of the number of docs in each dataset).
- minDocFreq: minimum doc frequency for observations (optional). Default: 1.
- selectNTimesMostSimilarFirst: described below.
- nbImpostorsUsed: number of impostor documents to select from the impostors dataset (done only once for all rounds). Default: 25.
- nbRounds: number of rounds (a higher number means more randomization, hence less variance in the result). Default: 100.
- propObsSubset: (0<=p<1) the proportion of observations/occurrences to keep in every document at each round; if zero, the proportion is picked randomly at every round. Default: 0.5.
- docSubsetMethod (default byObservation): with byOccurrence the proportion is applied to the set of all occurrences; with byObservation it is applied only to distinct observations.
- useCountMostSimFeature: the method used to calculate the “count most similar” feature (default: original): 0 means unused; original is the original method by Koppel and Winter (2014); ASGALF is the method used in “A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF)” by Khonji (2014); ASGALFavg is a variant of the above.
- kNearestNeighbors: use only the K (value) most similar impostors when calculating the result features. Default: 0 (use all impostors).
- mostSimilarFirst: doc or run; specifies whether the K most similar impostors are selected globally (doc) or for each run (run); unused if kNearestNeighbors=0. Default: doc.
- aggregRelRank: 0, median, arithm, geom or harmo. If not 0, computes the relative rank of the similarity between A and B among the similarities against all impostors by round; the value is used to aggregate all relative ranks (i.e. the values by round). Default: 0.
- useAggregateSim: 0, diff or ratio. If not 0, computes X = the aggregate similarity value between A and B across all rounds and Y = the aggregate similarity value between any probe and any impostor across all rounds; returns X-Y (diff) or X/Y (ratio). Default: 0.
- aggregateSimStat: median, arithm, geom or harmo. Aggregation method to use if useAggregateSim is not 0 (ignored otherwise). Default: arithm.

source session-setup.sh
verif-author.pl -H -d $OUTPUT_DIR conf/GI.1.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## countMostSim_original relativeRank aggregatedSim_ratio
## 0.3 0.403 0.935757753736536
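The relativeRank feature can be illustrated with a small sketch: in one round, it is the proportion of impostor similarities that fall below the similarity between the two probe documents. This is a simplified reading of the parameter description above, not the program's code:

```shell
# Sketch: relative rank of the probe-pair similarity ($1) among the
# impostor similarities given on stdin, one value per line.
# 1.00 means the pair is more similar than every impostor comparison.
rel_rank() {
  awk -v s="$1" '$1 < s { below++ } END { printf "%.2f\n", below / NR }'
}
```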
The parameter selectNTimesMostSimilarFirst, if not zero, triggers an initial filtering stage which retrieves the N most similar documents to the probe documents, instead of picking impostor documents randomly, with N = selectNTimesMostSimilarFirst * nbImpostorsUsed (N for every probe doc). This ensures that the most dissimilar impostors are not used, while maintaining a degree of randomness that depends on the value of selectNTimesMostSimilarFirst.
However, this option requires the initial similarity values to be precomputed, in order to avoid repeating the process every time the method is called, which could be prohibitive in computing time. The precomputed similarity values are loaded from <probeFile>.simdir/<impDataset>.similarities.
Below we precompute similarities with the same impostors data and same probe files as in the previous example:
source session-setup.sh
find $OUTPUT_DIR/impostors/GI.1.impostors/* -maxdepth 0 -type f >$OUTPUT_DIR/impostors/GI.1.impostors.list
ls $OUTPUT_DIR/data/1876-mt-taots $OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap $OUTPUT_DIR/data/1896-hj-tsop $OUTPUT_DIR/data/1898-hj-ttm >$OUTPUT_DIR/probe-files.list
sim-collections-doc-by-doc.pl -o WORD.T.lc1.sl0.mf3 -R BASENAME WORD.T.lc1.sl0.mf3:CHAR.CCC.lc1.sl0.mf3 $OUTPUT_DIR/probe-files.list GI.1.impostors:$OUTPUT_DIR/impostors/GI.1.impostors.list 2>/dev/null
The config file is:
cat conf/GI.2.conf
##
##
## strategy=GI
##
## obsType.WORD.T.lc1.sl0.mf3=1
## obsType.CHAR.CCC.lc1.sl0.mf3=1
##
## GI.selectNTimesMostSimilarFirst=5
## GI.preSimObsType=WORD.T.lc1.sl0.mf3
##
## GI.impostors=GI.1.impostors
## GI.nbImpostorsUsed=10
## GI.nbRounds=20
## GI.propObsSubset=0.45
## GI.docSubsetMethod=byObservation
## GI.minDocFreq=4
## GI.simMeasure=minmax
## GI.useCountMostSimFeature=original
## GI.kNearestNeighbors=5
## GI.mostSimilarFirst=doc
## GI.aggregRelRank=arithm
## GI.useAggregateSim=ratio
## GI.aggregateSimStat=median
##
##
## # general options
## wordTokenization=1
## formatting=0
source session-setup.sh
verif-author.pl -H -d $OUTPUT_DIR conf/GI.2.conf $OUTPUT_DIR/data/1876-mt-taots:$OUTPUT_DIR/data/1884-mt-taohf $OUTPUT_DIR/data/1888-hj-tap:$OUTPUT_DIR/data/1896-hj-tsop:$OUTPUT_DIR/data/1898-hj-ttm
## countMostSim_original relativeRank aggregatedSim_ratio
## 0.16 0.369 0.850218981657574
verif-author.pl returns a set of features for every problem. A verification problem, called a case in the system, is a pair of sets of documents, the documents within a group being assumed to be from the same author. A case can be labelled, i.e. provided with the gold-standard answer stating whether the two groups of documents are actually from the same author or not. Naturally, a set of labelled cases can be used to train a supervised model.

The simplest option would be to treat the task as binary classification. Since we are also interested in quantifying the confidence level of the system, the supervised setting is implemented as a regression task, using the features returned by verif-author.pl for multiple such cases.

The system expects a standard directory structure. This can be created automatically by prepare-input.sh (see below), but for the sake of simplicity we start with a predefined directory structure stored in the data dir. The input subdirectory is populated with a few documents.
mkdir "$OUTPUT_DIR"/sup1
mkdir "$OUTPUT_DIR"/sup1/input "$OUTPUT_DIR"/sup1/resources "$OUTPUT_DIR"/sup1/output
target="$OUTPUT_DIR/sup1/resources-options.conf"
echo "vocabResources=" >"$target"
echo "useCountFiles=1" >>"$target"
echo "datasetResourcesPath=$OUTPUT_DIR/impostors/GI.1.impostors" >>"$target"
cat data/user-guide.expl1/cases.txt data/user-guide.expl1/test-cases.txt | cut -f 1 | tr ' :' '\n\n' | sort -u | while read f; do cp "$f" "$OUTPUT_DIR/sup1/input"; done
cat data/user-guide.expl1/cases.txt | sed 's:data/[^/]*/::g' >"$OUTPUT_DIR/sup1/cases.txt"
cat data/user-guide.expl1/test-cases.txt | sed 's:data/[^/]*/::g' >"$OUTPUT_DIR/sup1/test-cases.gold.txt"
cat "$OUTPUT_DIR/sup1/test-cases.gold.txt" | cut -f 1 >"$OUTPUT_DIR/sup1/test-cases.txt"
The config file used below contains:
cat conf/basic.supervised.1.conf
## strategy=basic
##
## obsType.CHAR.CCC.lc1.sl0.mf3=1
## obsType.WORD.T.lc1.sl0.mf5=1
## obsType.WORD.TT.lc1.sl0.mf5=1
## obsType.VOCABCLASS.MORPHO.mf5=1
##
## basic.simMeasure=minmax
##
## # general options
## multipleProbeAggregate=random
## wordTokenization=1
## formatting=0
##
## # supervised learning parameters
## confidenceTrainProp=0
## confidenceLearnMethod=simpleOptimC1
## learnMethod=M5P-M4
learnMethod=M5P-M4: this parameter indicates which regression algorithm and parameters to use in Weka (here decision tree regression with M5P).

source session-setup.sh
train-test.sh -l "$OUTPUT_DIR/sup1/cases.txt" -m "$OUTPUT_DIR/sup1/model" "$OUTPUT_DIR/sup1" conf/basic.supervised.1.conf "$OUTPUT_DIR/sup1/output"
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
The trained model is stored in $OUTPUT_DIR/sup1/model, and the predictions on the training data in $OUTPUT_DIR/sup1/output/train/self-predict.arff.
We apply the previously trained model to some new cases:
source session-setup.sh
train-test.sh -a "$OUTPUT_DIR/sup1/test-cases.txt" -m "$OUTPUT_DIR/sup1/model" "$OUTPUT_DIR/sup1" conf/basic.supervised.1.conf "$OUTPUT_DIR/sup1/output"
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
The predicted values are written in $OUTPUT_DIR/sup1/output/test/predict.tsv
as a single column, in the same order as the input cases.
The predictions can be evaluated against the gold-standard answers with either AUC or accuracy. Both the predictions and the gold file must be provided as two columns: <case id> <value>.
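As an illustration of the accuracy evaluation, the computation could be sketched as follows. The exact thresholding used by accuracy.pl is not shown in this guide; a 0.5 threshold on the predicted value is assumed for this sketch:

```shell
# Sketch: proportion of cases whose predicted value, thresholded at 0.5,
# matches the binary gold label. $1: gold file "<case id> <0 or 1>",
# $2: predictions "<case id> <score>", joined on the case id.
accuracy_05() {
  join <(sort "$1") <(sort "$2") |
  awk '{ n++; if (($3 >= 0.5) == ($2 == 1)) ok++ }
       END { printf "%.2f\n", ok / n }'
}
```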
source session-setup.sh
paste "$OUTPUT_DIR/sup1/test-cases.txt" "$OUTPUT_DIR/sup1/output/test/predict.tsv" >"$OUTPUT_DIR/sup1/predicted.tsv"
auc.pl "$OUTPUT_DIR/sup1/test-cases.gold.txt" "$OUTPUT_DIR/sup1/predicted.tsv"
## 0.90625
source session-setup.sh
accuracy.pl "$OUTPUT_DIR/sup1/test-cases.gold.txt" "$OUTPUT_DIR/sup1/predicted.tsv" | cut -f 1
## 0.5
The script called below requires the dataset to have been split beforehand, with the fold indexes stored in a subdirectory folds:
source session-setup.sh
mkdir "$OUTPUT_DIR/sup1/output/folds"
n=$(cat $OUTPUT_DIR/sup1/cases.txt | wc -l)
generate-random-cross-fold-ids.pl 5 $n "$OUTPUT_DIR/sup1/output/folds/fold"
for f in "$OUTPUT_DIR"/sup1/output/folds/fold*.indexes; do
cat "$OUTPUT_DIR/sup1/cases.txt" | select-lines-nos.pl "$f" 1 >${f%.indexes}.cases
done
source session-setup.sh
cp "$OUTPUT_DIR/sup1/cases.txt" "$OUTPUT_DIR/sup1/output/truth"
train-cv.sh conf/basic.supervised.1.conf "$OUTPUT_DIR/sup1" "$OUTPUT_DIR/sup1/output"
## train-cv.sh: fail safe mode is OFF
## train-cv.sh: cleanup mode is OFF
## fold.1;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## fold.2;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## fold.3;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## fold.4;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## fold.5;obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
## obtain-strategy-features.sh,59: Warning: parameter 'vocabResources' is defined but empty in parameter file '/tmp/user-guide-part1.output/sup1/resources-options.conf'
The results can be found in a .perf
file named after the configuration filename:
cat "$OUTPUT_DIR/sup1/output/basic.supervised.1.perf"
## 0.588000 0.840000 0.700000
The three values are:
The genetic learning process is described in user guide - part 2. Note that the second part requires much more computing power than the first one (multiple cores and large RAM recommended).