KOPTE – Corpus Project on Translation Evaluation

Presentation of the KOPTE Corpus and Research Project

The KOPTE corpus has been compiled since 2009, and I am now analyzing the various aspects it offers to Translation Studies, especially to research on translation evaluation and translation competence (acquisition). You can download a PDF file with the Presentation of the KOPTE Corpus and Research Project. It contains the detailed information and visualizations of this web site, plus further material on corpus design, compilation, text types, annotations (in version 2 of the corpus, April 2016) and the analyses carried out so far, mostly in unpublished student theses but also in some publications. References are included in that short paper.

KOPTE is a corpus and a research project that covers a wide variety of thematic approaches and allows translation scholars to ask completely new questions. Long-established strands of research can be reconsidered using empirical data that were previously unavailable. Students are keen on preparing their theses within KOPTE because, as they tell me, they can immediately see the relevance of their research contributions. Future translators therefore do not only deliver their translations for research purposes; they also get to work with them themselves while taking their first steps in the scholarly world.

To sum up, KOPTE can foster a multitude of work in the areas of translation evaluation, translation competence (acquisition) and contrastive studies. It can be a node in the growing network of Translation Learner Corpora and contribute pieces to the puzzle of results gained from them. This will help us to better understand what translation is and how it can be taught.

 

Corpus

The KOPTE corpus is a multiple Translation Learner Corpus, i.e. it includes many translations of the same source text produced by trainee translators. Our team began to compile the corpus in summer 2009 by collecting the corrected translations of fourth-year students in the so-called Klausurenkurs, a French-to-German (FR-DE) translation class preparing for the final exams (“Diplom-Übersetzer” at the time). Collection continued in the following semesters until winter semester 2011/12. After the diploma course was discontinued in 2012, there were no more students and the class was given up. Students participated in the class for two or more semesters, so the corpus reflects not only the students’ examination level but also their development in the period before the final exams. At the moment, texts from BA and MA translation classes, which constitute a longitudinal corpus over the students’ whole academic cycle, are being integrated into KOPTE.

Most of the students consented to research based on their translations; in the Klausurenkurs alone, 58 different translators were involved. For the first 38 source texts, the researcher and student assistants had to type up the handwritten translations to convert them into a machine-processable format. The corpus texts are UTF-8 plain text files with LF-only line breaks. File names are composed of “AT” (Ausgangstext = source text) plus a three-digit text code, followed by “UE” (Übersetzer = translator) plus a three-digit translator code; e.g. the version by translator 34 of source text 23 gets the file name “AT023UE034”. The source text itself carries the translator code 000, in this case “AT023UE000”.
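The naming convention described above can be sketched in Python. This is only a minimal illustration and not part of the KOPTE tooling: the names KopteName, parse_name and build_name are hypothetical, and the tolerance for a file extension is an assumption, since the description does not mention one.

```python
import re
from typing import NamedTuple

# "AT" + three-digit source text code, "UE" + three-digit translator code,
# e.g. "AT023UE034".
_NAME_RE = re.compile(r"AT(\d{3})UE(\d{3})")


class KopteName(NamedTuple):
    source_text: int  # e.g. 23 for AT023
    translator: int   # e.g. 34 for UE034; 0 marks the source text itself

    @property
    def is_source(self) -> bool:
        return self.translator == 0


def parse_name(name: str) -> KopteName:
    """Split a KOPTE file name into source text code and translator code."""
    # Assumption: a trailing extension such as ".txt" may be present.
    stem = name.rsplit(".", 1)[0]
    match = _NAME_RE.fullmatch(stem)
    if match is None:
        raise ValueError(f"not a KOPTE file name: {name!r}")
    return KopteName(int(match.group(1)), int(match.group(2)))


def build_name(source_text: int, translator: int = 0) -> str:
    """Build a file name; translator 0 yields the name of the source text."""
    return f"AT{source_text:03d}UE{translator:03d}"
```

For instance, parse_name("AT023UE034") gives source text 23 and translator 34, while build_name(23) gives the name of the corresponding source text file. Keeping both codes as integers makes it straightforward to group all versions of one source text or all texts by one translator.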

Many of the trainee translators filled in a form providing translator metadata such as languages studied, time and mode of language learning, media use, parents’ origin, etc.

The KOPTE core corpus (KOPTE-KK) consists of Klausurenkurs texts (AT001-AT077), mainly German translations of articles from French newspapers, with a slight emphasis on opinion: commentaries and other richly structured texts were chosen deliberately because of the translation problems they present. The concept of the Klausurenkurs was to translate as much of the text (up to 2,400 characters) as possible in 45 minutes while still achieving a solid translation. A monolingual dictionary was allowed. After the class moved to a room equipped with a computer for every student (winter semester 2010/11, starting with AT039), the students had free access to the Internet. From then on, the texts no longer needed teacher notes supplying information that could not be looked up in a dictionary.

At the moment, the core corpus is being complemented by translations from BA and MA translation students. Apart from the analysis of translation problems and students’ translation solutions, these open up the possibility of long-term studies of translation competence acquisition. KOPTE-BA and KOPTE-MA consist of different text types. The length of the texts varies greatly, from a few lines (product packages) to complete 50-page reports on air quality in the Paris region in 2013. Most of these translations are not evaluated by means of an evaluation scheme and a grade (except exam translations), but carry corrections from course sessions in the form of insertions and deletions that result from discussion between the students and the teacher. These corrections will be integrated into the translation texts as XML annotations. Furthermore, lemmatization, POS-tagging and the annotation of translation problems will enrich this subcorpus, similar to KOPTE-KK (cf. 3 – Corpus Encoding).

Corpus processing – encoding and annotations

Automatic tokenization, lemmatization and POS-tagging are performed with TreeTagger (Schmid 1994), using the STTS tagset for German and Achim Stein’s tagset for French (Stein 2003). Both the teacher’s evaluations and passages that caused translation problems for several students are annotated manually. These translation problems (UE-Pros) are annotated according to their nature in the source text and to the solutions found by the students. The teacher’s evaluation forms the annotation layer AWEv (Andrea Wurm Evaluation) and consists of positive and negative evaluations, i.e. posev and negev, with numbers indicating the weight of the evaluated items (for the annotation scheme, see the appendix). Manual annotation is done with UAM Corpus Tool (O’Donnell 2008), which allows free design of annotation schemes and comes with an easy-to-use GUI.

Translator metadata are collected in an Excel spreadsheet. Alignment is done with InterText (Vondřička 2010, based on Hunalign) and will be incorporated in KOPTE_v3. The KOPTE corpus is processed with the IMS Corpus Workbench (Evert/Hardie 2011) and queried with its built-in Corpus Query Processor (CQP). KOPTE_v2_KK is now available in CQPweb (Hardie 2012; restricted access) with AT001 through AT077 and UE001 through UE058 (i.e. KOPTE-KK) and the following annotations:

  • lemmatization (TreeTagger DE and FR)
  • POS (TreeTagger DE and FR)
  • AWEv (where available, otherwise only negev/posev without differentiation)
  • UE-Pros IDs (up001-01, up001-02, ..., up038-05, etc.; not for participles/gerunds and the subjunctive)
  • UE-Pros classification and translation solutions for
    • participles and gerunds (AT001-AT012; annotation layer “Partizipialkonstruktionen”, no IDs)
    • past participles (AT001-AT012; annotation layer “Partizip_Perfekt”, UE-Pros ID in this layer)
    • grammatical number (AT001-AT077; annotation layer “Numerus”, UE-Pros ID in this layer)
    • realia/proper names (AT001-AT004; annotation layer “ER”, ID in layer UE-Pros)
    • subjunctive (AT001-AT077; annotation layer “Subjonctif”, no IDs)
    • translation problems noticed during correction of target texts (AT001-AT066; annotation layer “UE-Pros”, ID in this layer) with a one-layer categorization (e.g. punctuation, realia/proper names, content transfer, etc.)
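The UE-Pros IDs listed above (up001-01, up001-02, ..., up038-05) can be handled programmatically, for example to collect all translation problems belonging to one source text. The following Python sketch rests on an assumption that the description does not state explicitly: the three digits after “up” are taken to be the source text code and the two digits after the hyphen a running number within that text. The function names parse_up_id and group_by_source_text are hypothetical.

```python
import re
from collections import defaultdict

# Assumed ID structure: "up" + three-digit source text code + "-" +
# two-digit running number, e.g. "up038-05".
_UP_RE = re.compile(r"up(\d{3})-(\d{2})")


def parse_up_id(up_id: str) -> tuple[int, int]:
    """Return (source text code, running number) for a UE-Pros ID."""
    match = _UP_RE.fullmatch(up_id)
    if match is None:
        raise ValueError(f"not a UE-Pros ID: {up_id!r}")
    return int(match.group(1)), int(match.group(2))


def group_by_source_text(up_ids: list[str]) -> dict[int, list[str]]:
    """Group UE-Pros IDs by the source text they are assumed to belong to."""
    groups: dict[int, list[str]] = defaultdict(list)
    for up_id in up_ids:
        source_text, _ = parse_up_id(up_id)
        groups[source_text].append(up_id)
    return dict(groups)
```

If the assumption about the ID structure holds, group_by_source_text(["up001-01", "up001-02", "up038-05"]) returns the first two IDs under source text 1 and the third under source text 38.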

Publications and current research

In several articles to date, I have used data from KOPTE. Wurm (2012) is a rather qualitative study of two cohesion phenomena, one examined with data from the corpus alone and the other with additional translation process data from the Translog experiment (keystroke logging files, recorded and transcribed retrospective protocols). A quantitative study of proper names and culture-specific items as a translation problem was presented at the GAL conference 2011 in Bayreuth and is now published in trans-kom.eu (Wurm 2013a). Another paper deals with a German unique item, “Pronominaladverbien”, and their cohesive use without prompting in the source text (Wurm 2014). Together with colleagues, I have been working on translation evaluation in human and MT settings, using KOPTE texts and my evaluations for comparison with MT evaluation metrics (Vela, Schumann, Wurm 2014a, 2014b). Finally, this description of KOPTE_v2 is available online (Wurm 2015), based on the description already existing for Version 1 (Wurm 2013b). Later and still ongoing research (Wurm 2020) covers questions such as the influence of translators’ biographies on translation quality, the statistical analysis of evaluated items (can we observe clustering of evaluation criteria?), the establishment of translator and text profiles based on the metadata and the (evaluated) translations, and possible correlations between text features and translation quality.

A complete list of project-related works (up to 2016) is included in the PDF file Presentation of the KOPTE Corpus and Research Project.

Dr. Andrea Wurm

translator's degree (Diplom-Übersetzerin)
academic research associate
member of UniGR Center for Border Studies | http://cbs.uni-gr.eu
partner in the MUST network | https://uclouvain.be/en/research-institutes/ilc/cecl/must.html

 

contact

telephone (office)
+49-(0)681-302-2509
e-mail
a.wurm(at)mx.uni-saarland.de
consultation hours
Please contact me via e-mail to make an appointment.