CollateX is well known in the digital humanities (DH), but less so in non-digital textual scholarship …

Development of CollateX started in 2010 as a project within the EU-funded initiative Interedition, with the aim of creating a successor to Peter Robinson’s Collate; the project leaders are Ronald Dekker (Huygens ING) and Gregor Middell. The recent first release on PyPI (2 June 2014) also makes it easier to install for personal use.

The major CollateX tasks are:

Tokenization. The current tokenizer works either on plain text, splitting it at whitespace boundaries, or on marked-up text, transforming it into a sequence of tokens in which each token refers to its markup context.

Normalization and regularization. CollateX performs case normalization and removes punctuation and/or whitespace characters.

Alignment. CollateX offers a choice between several alignment algorithms.
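To make the three tasks above concrete, here is a minimal sketch in plain Python. It is not CollateX's own code: the function names `tokenize`, `normalize` and `align` are made up for this illustration, and the alignment step uses the standard library's `difflib.SequenceMatcher` as a stand-in for CollateX's alignment algorithms, for two witnesses only.

```python
import string
from difflib import SequenceMatcher


def tokenize(text):
    """Split plain text into tokens at whitespace boundaries."""
    return text.split()


def normalize(token):
    """Lower-case a token and strip punctuation from both ends."""
    return token.lower().strip(string.punctuation)


def align(witness_a, witness_b):
    """Align two witnesses; return rows of (token_a, token_b), None marks a gap.

    Tokens are compared in their normalized form, but the original
    readings are kept in the resulting table.
    """
    tokens_a, tokens_b = tokenize(witness_a), tokenize(witness_b)
    norm_a = [normalize(t) for t in tokens_a]
    norm_b = [normalize(t) for t in tokens_b]

    # Stand-in aligner from the standard library (not a CollateX algorithm).
    matcher = SequenceMatcher(a=norm_a, b=norm_b, autojunk=False)

    rows = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        span_a, span_b = tokens_a[i1:i2], tokens_b[j1:j2]
        # Pad the shorter span with None so every row has two cells.
        for k in range(max(len(span_a), len(span_b))):
            rows.append((
                span_a[k] if k < len(span_a) else None,
                span_b[k] if k < len(span_b) else None,
            ))
    return rows


table = align("The black cat.", "the BLACK and white cat")
for reading_a, reading_b in table:
    print(f"{reading_a or '-':<8}{reading_b or '-'}")
```

Because the comparison runs on normalized tokens, "The" and "the" or "cat." and "cat" end up in the same row, while "and white" appears as an addition in the second witness.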

The input can be plain text, JSON or XML. The Python version outputs an alignment table or a variant graph; the Java version additionally outputs JSON, TEI P5 and XML.
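As an illustration of the JSON input, the sketch below builds a small document from two witnesses. It assumes the layout described in the CollateX documentation (a `witnesses` list whose entries carry an `id` and a `content` field); check this against the version you use. The helper `witnesses_to_json` is hypothetical and only relies on the standard `json` module.

```python
import json


def witnesses_to_json(witnesses):
    """Serialize {sigil: text} pairs into a JSON witness document.

    The "witnesses" / "id" / "content" keys are an assumption based on
    the CollateX documentation, not verified against a specific release.
    """
    payload = {
        "witnesses": [
            {"id": sigil, "content": text}
            for sigil, text in witnesses.items()
        ]
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


doc = witnesses_to_json({"A": "The black cat", "B": "The white cat"})
print(doc)
```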

Non-progressive multiple sequence alignment is an improvement over the earlier progressive alignment (see note 1).

For a demonstration of the software →

1 Progressive alignment algorithms (1) start by comparing two versions, (2) transform the result into a variant graph, (3) progressively compare another version against that graph and (4) merge the result of that comparison into the graph. They have one disadvantage: the result depends on the order in which the versions are merged.

