tdc-tools

Tools for manipulating the Tabular Document-Concept (TDC) format

Generating the doc-concept matrix data

Introduction

Each record of the doc-concept matrix data has the form:

<year> <doc id> <list of doc-concepts>

where <list of doc-concepts> is a space-separated list of <concept id>:<freq> pairs.
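
To illustrate the record layout, here is a minimal sketch that reads records in this form and prints the total concept frequency per document. It is not part of tdc-tools: the function name dcm_totals and the sample record are made up for illustration, and it assumes one record per line, whitespace-separated fields, and integer frequencies.

```shell
# Hypothetical helper (not part of tdc-tools): for each record on stdin,
# print "<year> <doc id> <sum of freq>".
# Assumes one record per line and whitespace-separated fields.
dcm_totals() {
  awk '{
    total = 0
    # Fields 3..NF are the <concept id>:<freq> pairs.
    for (i = 3; i <= NF; i++) {
      n = split($i, pair, ":")
      # Skip anything that is not exactly <concept id>:<freq>.
      if (n == 2) total += pair[2]
    }
    print $1, $2, total
  }'
}

# Made-up sample record: year 2001, document doc42, two concepts.
printf '%s\n' '2001 doc42 c7:3 c19:1' | dcm_totals
```

For the sample record the pairs c7:3 and c19:1 sum to 4, so the function prints 2001 doc42 4.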

Usage

PTC data

build-doc-concept-matrix-all-variants.sh <PTC TDC input dir> PTC.dcm

KD data

build-doc-concept-matrix-all-variants.sh -k <KD TDC input dir> KD.dcm

Compressing (optional)

mksquashfs PTC.dcm PTC.dcm.sqsh
mksquashfs KD.dcm KD.dcm.sqsh