Area C - Human-Machine Interaction

Thematic Area C: Crossmodal learning in human-machine interaction

While Thematic Area A focuses on the dynamics of crossmodal learning and Area B studies crossmodal prediction and generalization, the projects in Area C investigate crossmodal learning from the perspective of human-machine interaction, addressing issues that relate specifically to the shared multimodal signals perceived by both human and machine.

The projects in this thematic area study how crossmodal signals are integrated and learned for speech perception and language understanding (C1, C4, C7); how multiple sensory modalities, such as audio and visual perception, proprioception, and vocal utterances, are combined for embodied language acquisition (C4); how motor control (such as eye movements and speech articulation) can provide the information needed to disambiguate rich multisensory information (such as vision and audition) to support a clearer understanding of both spoken and written language (C1, C7); and how crossmodal plasticity during visual-haptic interaction can enable new forms of therapy (C8). Finally, crossmodal aspects of theory of mind and sense of ownership will be studied (C9).

We deliberately interleave projects from computer science and brain science in this area to create a comprehensive set of methodologies that covers the aspects required for studying crossmodal learning in the context of interaction. What unites these research issues is their potential for improving human-machine interaction: by transferring the knowledge we gain from these projects to artificial agents, we will provide the artificial systems with a greater common ground upon which interaction with humans can become more natural.

Project C1 (D. Zhang, Hong, Nolte) investigated the crossmodal neural representation of human speech in the motor modality and its dynamical interaction with the auditory modality, mainly focusing on the phoneme level. Specifically, using direct recordings of the human brain, categorical neural responses to lexical tones were found over a distributed cooperative network that included strong causal links from the auditory areas in the temporal cortex to the motor areas in the frontal cortex. These findings provide new evidence for a top-down influence and a crossmodal (from auditory to motor) representation of perceived human speech. During the second funding phase, the project will be extended to higher linguistic levels relating to speech prosody and semantics. It aims to provide a complete overview of the crossmodal electrophysiological signatures of the active perception of human speech. In particular, the motor modality is hypothesized to represent a top-down active interpretation of the speech information, while the auditory modality is considered to reflect mainly bottom-up processing. Naturalistic speech materials will be used to construct ecologically valid paradigms in order to maximally activate speech-related neural processing as in real-life communication situations. To track the neural dynamics following the fast-changing speech information, electrophysiological recording techniques with a high temporal resolution, including EEG, MEG, and intracranial EEG, will be employed.

Project C4 (Weber, Wermter, Z. Liu) built and studied cognitive and knowledge-based models of vision- and action-embodied language learning during the first phase. It explored mechanisms by which temporal dynamics facilitate language decomposition and abstraction, mechanisms of neural attention in image-to-text transduction, and mechanisms of representation formation in crossmodal integration. It also developed fundamental aspects of visual scene understanding, human-like language teaching using knowledge-based models, and human-robot interaction for collecting crossmodal data in real-world object interaction. In the second phase, C4 will focus on the research question of how to build a neural cognitive model that integrates embodied and knowledge-based crossmodal information in language learning. Furthermore, the project will investigate neurocognitive constraints on computational models for language learning that may be utilized in large-scale technical systems. In particular, the project will examine a variety of language functions inspired by processing in the human brain, study scene-understanding and representation-learning architectures for language learning from large-scale data and knowledge bases, and transfer as well as integrate them into prototypes for human-robot interaction.

Project C7 (Li, Qu, Biemann) succeeds project C7 (Li, Menzel, Qu) and plans to investigate the integration of spoken language and visual information in a more specific scenario, i.e., learning to read. Written forms of words are represented through the visual system, and their corresponding sound forms through the auditory and spoken system. Learning to read requires mapping between visual and auditory information, and thus crossmodal association and integration. The overall goal of this project is to understand how multimodal learning approaches (i.e., the interface of visual input and spoken input; passive exposure vs. active production) boost learning to read over mono-modal learning approaches, in particular for vocabulary acquisition, and to understand the plausible mechanisms underlying the potential benefits elicited by multimodal approaches, both from the perspective of behavioural psychology and from the perspective of computational modelling. The project will also investigate how linguistic knowledge and language processing skills learned from other modalities transfer to reading. Finally, C7 will create a unified large-vocabulary joint language model for multiple modalities, capable of encoding the signals from different modalities in a unified dense vector space representation, such that either the meaning is encoded independently of the modality or an alignment exists between modality-specific sub-spaces.
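The idea of an alignment between modality-specific sub-spaces can be illustrated with a minimal sketch. It is not the project's model: it assumes that each word already has one embedding per modality, that the two spaces differ only by a rotation plus noise, and it uses orthogonal Procrustes (via NumPy) as a stand-in alignment technique. The names `visual`, `auditory`, and `fit_orthogonal_alignment` are hypothetical, and the data are synthetic.

```python
import numpy as np


def fit_orthogonal_alignment(source, target):
    """Return the orthogonal matrix W minimising ||source @ W - target||_F."""
    # Orthogonal Procrustes: take the SVD of the cross-covariance matrix
    # and drop the singular values.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)
dim, n_words = 8, 50

# Synthetic stand-in for embeddings of written word forms (assumption).
visual = rng.standard_normal((n_words, dim))

# Synthetic stand-in for embeddings of spoken word forms: the same points
# in a rotated coordinate system, plus a little noise (assumption).
true_rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
auditory = visual @ true_rotation + 0.01 * rng.standard_normal((n_words, dim))

# Learn the map from the visual sub-space into the auditory sub-space.
w = fit_orthogonal_alignment(visual, auditory)
aligned = visual @ w

# Mean distance between paired word embeddings before and after alignment.
baseline_error = float(np.mean(np.linalg.norm(visual - auditory, axis=1)))
mean_error = float(np.mean(np.linalg.norm(aligned - auditory, axis=1)))
print(f"before alignment: {baseline_error:.3f}, after: {mean_error:.3f}")
```

An orthogonal map preserves distances and angles within each sub-space, which is why it is a common first choice when the two spaces are believed to share the same geometry. A modality-independent encoding, the other option named above, would instead train a single encoder so that no such map is needed.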

Project C8 (Chen, Kühn, Steinicke, Wei) succeeds project C6 (Chen, Steinicke) and extends this work by focusing on interactions and intentionally induced visuo-haptic conflicts (so-called visuo-haptic retargeting) on the upper limbs, including at the finger level. The project will first develop a faithful reconstruction and real-time tracking of the user's arms, hands, and fingers while preserving the possibility to manipulate the appearance, spatial configuration, and temporal coordination behaviour of the visually perceived articular structures. The basic notion is to equip people with a sense of body ownership (SoO) and sense of agency (SoA) by using a virtual body representation (avatar), in particular of the upper limbs. In its second phase, the project will study pain and touch processing for virtual reality (VR) therapeutic applications in three experimental setups. The project will then extend these approaches to patients with phantom limb pain and Complex Regional Pain Syndrome (CRPS), to examine the generalization and neural plasticity of the intervention across different tasks and syndromes.

Project C9 (Gläscher, X. Fu) investigates the cognitive requisites for efficient human-human and human-robot communication. For communication to be successful, the sender has to design the message in a way that the receiver is able to understand it. Conversely, the receiver also has to reason about the sender’s intention during the communication. Thus, both partners need Theory of Mind (ToM) capacities for successful communication, i.e., the ability to think about another person’s beliefs and intentions. In addition, sense of agency (SoA) is a powerful modulator of the efficiency and perception of human communication. In this project, the role of ToM and SoA will be investigated in novel, multisensory communicative acts using the Tacit Communication Game (TCG) as an experimental platform. Using behavioural and neural data collected from fMRI and MEG in combination with computational modelling, the project will uncover the social reasoning processes that the brain engages in when designing and receiving novel communicative messages.