Bachelor/Master Thesis

Topics

You can sometimes find open topics posted here. These are topics for which we are actively looking for students with matching interests. Other topics are also possible, as long as they fit the research topics of the group. In either case, you can get in touch with Professor Demberg or one of the postdocs or PhD students to ask about current topics. If you have an idea of your own, please feel free to suggest it to Prof. Demberg.

Analysis of English-French translation of discourse relations using automatic word alignments

Discourse relations are logical relations between segments of a text that make the text coherent. They are often marked by discourse connectives. For example, the connective "because" marks a "reason" relation. Each language has its own inventory of connectives, and these often lack one-to-one correspondences across languages. For instance, the French connective "en effet" can be translated as "indeed" or "in fact" in English. When it is used to mark a "cause" relation, it is often omitted (implicitated) in the English translation (Zufferey 2016). Previous work has used automatic word alignment to induce French connective lexicons and discourse annotations by projecting annotations from English (Laali and Kosseim 2014, Laali 2017). In this project, we would like to use this technique to study how discourse relations are marked in English-French translations.
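To make the idea of annotation projection more concrete, here is a minimal sketch in Python. It is not the method of the cited papers: the sentences, alignments and relation labels are hand-made toy examples, and the names project_connectives and build_lexicon are hypothetical. In a real project the alignments would come from an automatic word aligner and the English annotations from a discourse parser or an annotated corpus.

```python
from collections import Counter, defaultdict

# Toy, hand-made examples (not real corpus data). "align" holds word
# alignments as (english_index, french_index) pairs; "en_connectives"
# maps the index of an English connective to an illustrative relation label.
parallel = [
    {
        "en": ["he", "left", "because", "it", "rained"],
        "fr": ["il", "est", "parti", "car", "il", "pleuvait"],
        "align": [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5)],
        "en_connectives": {2: "reason"},
    },
    {
        "en": ["although", "Marie", "was", "tired", "she", "stayed"],
        "fr": ["bien", "que", "Marie", "soit", "fatiguée", "elle", "est", "restée"],
        "align": [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7)],
        "en_connectives": {0: "concession"},
    },
    {
        # The English connective "so" has no aligned French token here.
        "en": ["so", "we", "stayed"],
        "fr": ["nous", "sommes", "restés"],
        "align": [(1, 0), (2, 1), (2, 2)],
        "en_connectives": {0: "result"},
    },
]


def project_connectives(item):
    """Project English connective labels onto the aligned French tokens.

    Returns (french_phrase, relation) pairs; french_phrase is None when the
    English connective is not aligned to any French token.
    """
    projected = []
    for en_idx, relation in sorted(item["en_connectives"].items()):
        fr_idxs = sorted(f for e, f in item["align"] if e == en_idx)
        phrase = " ".join(item["fr"][i] for i in fr_idxs) if fr_idxs else None
        projected.append((phrase, relation))
    return projected


def build_lexicon(corpus):
    """Count how often each projected French phrase carries each relation."""
    lexicon = defaultdict(Counter)
    for item in corpus:
        for phrase, relation in project_connectives(item):
            lexicon[phrase][relation] += 1
    return lexicon


lexicon = build_lexicon(parallel)
for phrase, relations in lexicon.items():
    print(phrase, dict(relations))
```

Entries stored under None are the interesting cases for a translation study: they are candidates for relations that are explicit in one language and left implicit in the other, although in practice they can also result from alignment errors.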

References:
Laali, Majid, and Leila Kosseim. "Inducing discourse connectives from parallel texts." Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2014.
Zufferey, Sandrine. "Discourse connectives across languages: Factors influencing their explicit or implicit translation." Languages in Contrast. International Journal for Contrastive Linguistics 16.2 (2016): 264-279.
Laali, Majid. Inducing discourse resources using annotation projection. Diss. Concordia University, 2017.

Controllable and faithful generation of language

Natural language generation (NLG) holds immense potential to revolutionize information access and communication. However, a critical challenge lies in ensuring the reliability of generated text. Fabricating information, commonly referred to as "hallucination," undermines trust and limits the practical applications of NLG.

This research focuses on establishing robust mechanisms for reliable generation, particularly within data-to-text tasks. We utilize the drone technology domain as a compelling example, where accurate communication with human pilots is vital. Here, reliable NLG can generate informative messages about the drone's environment and mission details, enhancing situational awareness and decision-making.

You can build on our existing research. Possible contributions include, but are not limited to, developing tailored evaluation metrics that assess both factual accuracy and domain-specific suitability, and incorporating controlled generation techniques to explore different modeling approaches while improving reliability.

Machine learning approaches in the prediction of psychometric properties based on eye-tracking data

A large body of psychological research on reading shows that the way individuals process texts is affected by their psychometric profile. Characteristics like working memory capacity and reading fluency influence the amount of time a reader needs to process particular information and the order in which information is processed. In turn, it has been demonstrated that certain eye movement measures can be used to identify the psychometric characteristics of individuals, e.g., to infer their level of reading fluency. Various methodologies have been explored for this task, using eye-tracking measures of different granularity, data from tasks such as sentence or paragraph reading, and different analysis techniques. The choice of approach can significantly influence the accuracy and the amount of eye-tracking data necessary for successful prediction.

In this thesis project, the objective is to develop a workflow for predicting readers’ psychometric characteristics based on their eye movements collected during a paragraph reading task. Given that eye-tracking yields numerous measures, the task also involves determining which measures are predictive of which psychometric characteristics. The provided dataset consists of eye movement measures from 41 individuals, each of whom read 8 texts. Additionally, we provide a dataset of the readers’ psychometric characteristics that includes measures of working memory capacity and reading fluency. While building upon the methods previously employed for this task in the field, the aim is to explore a variety of machine learning techniques and to identify the approach that achieves the highest prediction accuracy.
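With only 41 participants, how prediction accuracy is estimated matters as much as the choice of model. The following Python sketch shows one possible evaluation scheme, leave-one-participant-out cross-validation, compared against a mean-only baseline. Everything in it is an assumption for illustration: the data are randomly generated stand-ins (not the provided dataset), the single feature (mean fixation duration) and the fluency score are hypothetical, and a one-predictor least-squares fit stands in for whatever machine learning technique is actually explored.

```python
import random
from statistics import mean

random.seed(0)
N_PARTICIPANTS, N_TEXTS = 41, 8  # sizes taken from the dataset description

# Synthetic stand-in data: a hypothetical reading fluency score per
# participant and a hypothetical mean fixation duration (ms) per text.
fluency = [random.gauss(100, 15) for _ in range(N_PARTICIPANTS)]
fixation = [
    [random.gauss(260 - 0.5 * f, 10) for _ in range(N_TEXTS)] for f in fluency
]

# Aggregate the eight texts into one feature value per participant.
features = [mean(per_text) for per_text in fixation]


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


def leave_one_out_mae(xs, ys):
    """Mean absolute error when each participant is held out once."""
    model_errors, baseline_errors = [], []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        model_errors.append(abs(a + b * xs[i] - ys[i]))
        # Baseline: always predict the mean score of the training participants.
        baseline_errors.append(abs(mean(train_y) - ys[i]))
    return mean(model_errors), mean(baseline_errors)


model_mae, baseline_mae = leave_one_out_mae(features, fluency)
print(f"model MAE: {model_mae:.2f}, mean-only baseline MAE: {baseline_mae:.2f}")
```

Holding out whole participants, rather than individual texts, keeps data from the same reader out of both the training and the test set, which would otherwise make the accuracy estimate too optimistic.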

Testing a new psycholinguistic paradigm

A popular paradigm to study language processing in the psycholinguistic field is self-paced reading: a task for which participants read a passage word-by-word or phrase-by-phrase, pressing a button to get the next word or phrase displayed. The time taken to press the button gives an indication of the processing difficulty at each stage.
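As a small illustration of how the measure is derived, the following Python sketch turns the button-press timestamps of one trial into per-word reading times. The function name reading_times, the example sentence and the timestamps are made up for illustration; real experiment software logs this information in its own format.

```python
def reading_times(words, press_times_ms, onset_ms=0):
    """Return (word, reading_time_ms) pairs for one self-paced reading trial.

    press_times_ms[i] is the time of the button press that dismisses words[i];
    the first word is assumed to appear at onset_ms.
    """
    if len(words) != len(press_times_ms):
        raise ValueError("need exactly one button press per word")
    times = []
    previous = onset_ms
    for word, press in zip(words, press_times_ms):
        # Reading time = time between this word appearing and the next press.
        times.append((word, press - previous))
        previous = press
    return times


# Hypothetical trial: timestamps in milliseconds since the first word appeared.
trial = reading_times(
    ["The", "dog", "chased", "the", "cat"], [310, 640, 1120, 1400, 1790]
)
for word, rt in trial:
    print(f"{word}: {rt} ms")
```

In this made-up trial, the longest reading time (480 ms) falls on "chased", which would be read as the point of greatest processing difficulty.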

To gain a deeper understanding of language processing, we need to know not only how people process written text, but also how they process spoken text. In this project, we will therefore test whether the results of previous reading research can be replicated in a spoken setting, using a novel self-paced listening paradigm: instead of reading the texts, people listen to them and press a button to hear the next word or phrase.

You will build on existing research that has used self-paced reading, and create spoken versions of the materials used in these studies. The data will be collected via a crowdsourcing platform. You will analyze the data to gain insight into whether the effects found in the earlier research can be replicated in this new paradigm.

Web-based vs lab-based eye-tracking

In research on human language processing, we are often interested in visual attention, i.e., which objects or words study participants fixate. Traditionally, such research has used expensive eye-tracking equipment. However, many laptops today have good cameras, which opens up new possibilities for running experiments online using crowd-sourcing. In this thesis, you will compare data quality and experimental results using the traditional vs. crowd-sourced setup for a study on pragmatic processing.

While researchers working on pragmatic communication games have typically focused only on the outcomes of these communication games, we’re now interested in the moment-to-moment behavior of participants in these games, specifically, how and when they examine various parts of the game’s visual display.

You will help us implement and analyze web-hosted webcam eye-tracking experiments to collect this data from a crowd-sourcing platform, and you will help explore the data to extract useful generalizations and decide on useful metrics for classifying trials and participants. Depending on your interests, the experiment can also be run in the eye-tracking lab for comparative results.

Bachelor / Master Seminar and Thesis

  • You need to do the Bachelor/Master Seminar before you can register for the thesis.
  • The seminar is used to further specify the topic of the thesis, perform a literature review, identify suitable methods and formulate the hypotheses you want to test as part of your thesis. It consists of two deliverables: 1) a talk (ca 30 min, followed by questions); 2) a seminar paper including the introduction to your topic, a literature review and the specification of methods and hypotheses (10-20 pages).
  • The BSc / MSc seminar should be completed within 1 semester (max 6 months) of starting to work on your thesis topic.
  • Additionally, you should participate in the presentations (Thesis seminars and thesis colloquia) of other MSc and BSc students who are doing their thesis at our chair. 
  • After having done both parts of the seminar, you (!) need to write an email to sek-vd(at)lst.uni-saarland.de. In this email, please put Prof. Demberg and your advisor in CC and provide the following information: your full name, your matriculation number, the date of the seminar talk, and the title of the talk that you gave. Only then will the processing of the LSF data start.
  • You need to register your thesis. This is only possible once you have finished the Bachelor / Master seminar and the data has been entered in the LSF.
  • You need to defend your thesis (this can be shortly before or shortly after handing in your final thesis).
  • You need to write up your thesis (German and English templates for this can be found here).

 

Thesis colloquium

  • You should finish your thesis seminar within 6 months (including handing in the seminar paper and the report). If you think you will not be able to meet this deadline, please select a different group for doing your thesis.
  • Once you have completed your thesis (shortly before or after handing in the written version), we will schedule the thesis colloquium.
  • You should announce it yourself to the group and also invite other students if you wish.
  • The presentation should take max 30 minutes, followed by ca 15 min of discussion. Please make sure not to overrun the 30 min presentation time!

Grading Scheme

The following questions are considered when grading a thesis, where applicable to the specific thesis topic; further questions may be considered if they are relevant to it. The list is intended to give you an overview of the aspects that are important to a thesis. If you have any further questions, ask your advisor for more information.

  • General
    • Is the thesis topic (as agreed upon initially) properly addressed?
    • Does the thesis show that the student applied appropriate scientific methods (i.e., decisions were made in an informed manner and documented properly, etc.)?
  • Related work
    • Is the selection of related work applicable and comprehensive?
    • Was the feedback on the related work during the bachelor / master seminar properly integrated in the thesis?
    • Is the related work appropriately presented (i.e., the related work is described in a focused way, it is clearly shown why it is relevant for one's own work and which aspects have flowed into one's own work, etc.)?
    • Are citations used correctly and wherever needed?
    • Is the bibliography complete with consistent formatting?
  • Execution of the written part
    • Does the abstract properly describe the thesis?
    • Is the thesis structured correctly and comprehensibly?
    • Is the motivation of the thesis clearly elaborated on?
    • Does the thesis contain a clear summary of the results achieved?
    • Is there a critical discussion on the performance and the limitations of the work to reflect on the choices made?
    • Is future work thoroughly described and are connections to one's own work well presented?
    • Is the language used appropriate without spelling mistakes?
    • Does the thesis follow an internal consistency (e.g., special terms are always written in the same form)?
    • Is the thesis consistent and free of incorrect descriptions (i.e., there are no contradictions within the thesis, etc.)?
    • Is the thesis presented clearly and are the means of presentation appropriate (e.g., short sentences, images are used where reasonable, images are easy to understand, etc.)?
    • Is the layout of the thesis appropriate (i.e., all images are referenced, no widows and orphans, tables are properly formatted, etc.)?
  • Concept
    • Is the concept (in relation to the thesis topic) presented thoroughly in the thesis?
    • Are the hypotheses formulated clearly?
    • Is the chosen and described solution novel?
    • Is the concept appropriately presented with a motivation why this solution is the correct one to target the goal of the work?
  • For theses addressing an NLP task:
    • Has the task been addressed comprehensively?
    • Has the dataset been chosen appropriately?
    • Is the chosen method / algorithm suitable for the task?
    • Were training and testing conducted correctly (e.g., were hyperparameters chosen based on the dev set and have different random initializations been tried, if applicable)?
    • Were evaluation measures chosen appropriately?
    • Is an error analysis provided?
    • Were statistical tests conducted to test whether the obtained differences are statistically significant?
    • Is the code base made available (GitHub or similar) and is it documented following ACL guidelines / best practices?
    • Are the descriptions in the thesis sufficient to allow for replicability?
    • Have the results of the evaluation been discussed with respect to the hypotheses of the thesis?
  • For theses in experimental psycholinguistics:
    • Are the experimental materials of high quality and well documented (well designed, no confounds), if applicable?
    • Is the methodological approach appropriate (addresses the hypotheses, suitable experimental design)?
    • Was the experiment implemented correctly (wrt. randomization, counter-balancing, choice of fillers, task instructions, practice trials etc.)?
    • Was the number of participants in the study chosen in a well-motivated way?
    • Were the participants selected appropriately?
    • Has the study been pre-registered?
    • Did the study follow ethical guidelines? 
    • Was the data handled in a way that is in line with data protection (pseudonymization or anonymization; appropriate storage etc)?
    • Was the data analysed correctly (statistics)?
    • Is the study described well enough so that it could be replicated?
    • Are the results presented clearly, and discussed with respect to the hypotheses?
  • For theses that contain data set collection:
    • Was the pre-processing and/or post-processing of the data performed appropriately and correctly?
    • Were the instructions given to annotators clear (annotation scheme / instructions to crowd-workers)?
    • Was the data source chosen in a well-motivated way?
    • Is the quality of the data good and have data quality checks been performed?
    • Is the dataset described using descriptive statistics?