TXM & automatic lemmatization and morphological annotation

At the beginning of July I attended the workshop “Édition analytique” in Lyon, sponsored by Consortium cahier and Labex Aslan and organized by Alexei Lavrentiev and others from the équipe Lincobato. The workshop focused on TXM, but there were also sessions presenting other tools and projects under way in France (Algone, TEI Critical Edition Toolbox, SynopsX). I will say something about it from my own perspective: at the moment I’m more interested in editing tools than in analysis tools. But, as Serge Heiden clearly pointed out, the question is: are editors engaged in producing editions for further analysis?

More info about the workshop can be found here.

TXM is a text/corpus analysis environment in the tradition of lexicometry and statistical text analysis, based on CQP and R. Its working units are lexical patterns (words and word class information), internal structures (paragraphs, titles, footnotes) and the text with its metadata. A large corpus of medieval French texts (BFM), encoded in TEI, is available online; it is also possible to import one’s own corpus (in a wide variety of formats) and to annotate it using an automatic lemmatizer and a morphosyntactic tagger. One can also publish it on the TXM portal: different layouts (diplomatic, normalized, translation, etc.) and images can be displayed in a multi-panel window, and audio and video documents can easily be linked to the texts, for instance in the case of a corpus of interviews.
While creating an edition, analysis may help to develop a deeper comprehension of the text; therefore TXM can be an important tool for editors, even if they don’t publish on this platform.
For TEI compliance have a look here.

Before running texts in TXM you may use a tool providing automatic lemmatization and morphological annotation.
Not everybody is interested in lemmatization and morphosyntactic mark-up; but since one of the greatest strengths of XML documents is that they can hold a lot of (shown or hidden) information in one file, it is worth considering. On the other hand, this kind of automatic annotation always needs to be reviewed, and this can take more or less time.
The training corpus for such tools is important: a child who only listens to the conversation of ... the diplomatic corps will probably understand after a while what an ambassador is; this is why Kestemont and his co-authors (see below) underline how “an innovative feature of our system is that it draws on all available training data sets for pre-modern Dutch, amongst which the Corpus-Gysseling (literary and legal texts), the CRM Charter corpus, the Repertory for Proper Nouns in Middle Dutch Literary texts, …”
Medieval languages are particularly problematic because of the great variation in spelling and handwriting, due to diatopic variation and to the absence of official or unofficial regulation.
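
The idea of keeping the manuscript spelling together with "hidden" lemma and part-of-speech information in one XML file can be sketched as follows. This is only a minimal illustration, not the way TXM or any of the tools mentioned here actually store their annotation: the sample fragment, the word forms, the tag values and the helper names `read_tokens` and `forms_by_lemma` are all invented for the example, and the use of `lemma` and `pos` attributes on TEI `<w>` elements is an assumption about the encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment (invented for illustration): each <w> shows the
# manuscript spelling as text and hides the automatically assigned lemma and
# part-of-speech tag in attributes, which an editor would still need to review.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="roi" pos="NOMcom">rois</w>
  <w lemma="roi" pos="NOMcom">reis</w>
  <w lemma="estre" pos="VERcjg">est</w>
</s>"""

TEI = "{http://www.tei-c.org/ns/1.0}"


def read_tokens(xml_text):
    """Return (form, lemma, pos) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("pos")) for w in root.iter(TEI + "w")]


def forms_by_lemma(tokens):
    """Group the attested spellings under their lemma."""
    grouped = {}
    for form, lemma, _pos in tokens:
        grouped.setdefault(lemma, []).append(form)
    return grouped


tokens = read_tokens(SAMPLE)
print(forms_by_lemma(tokens))
```

Grouping forms by lemma in this way shows why lemmatization matters for medieval texts: different spellings of the same word end up under one entry, so they can be searched and counted together.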

I will only mention the one used in TXM and two other tools, recently presented at DH Benelux and DH 2014.

Lemmatization and/or morphological annotation for medieval languages:


