Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction

Yue Li, Koen Hindriks, Florian Kunneman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

In this paper, we study how well human speech can be automatically filtered when it overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario in which the microphone can remain open while the robot is speaking, enabling a more natural turn-taking scheme in which the human can interrupt the robot. To respond appropriately, the robot would need to understand what the interlocutor said in the overlapping part of the speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE can be accomplished in the context of the popular social robot Pepper, we constructed a dataset that mixes recorded speech of Pepper itself, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in rooms with low and high reverberation. Comparing a signal processing approach, with and without post-filtering, and a convolutional recurrent neural network (CRNN) approach against a state-of-the-art speaker identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best Word Error Rate on overlapping speech signals with low reverberation, while the CRNN approach is more robust to reverberation. However, even the best performance is not sufficient for consistent comprehension after filtering, and performance varies widely across our dataset. We conclude that, first, the volume and pitch of the human speech strongly affect the performance of the proposed methods; second, the signal processing method based on speech masking and spectral subtraction is sensitive to reverberation, while the neural network method is robust to it; third, the batch normalization layer in TSE models is not useful for filtering out the interfering speech when it is significantly more powerful than the target speech. These results show that estimating the human voice in speech overlapping with a robot's is feasible in real-life applications, provided that the room reverberation is low and the human speech has high volume or high pitch.
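As context for the signal processing baseline described in the abstract, the sketch below illustrates plain magnitude-domain spectral subtraction: since the robot generates its own speech, a reference of its ego-speech plus fan noise can be subtracted from the microphone mixture in the STFT domain, keeping the mixture phase. This is a minimal, generic sketch and not the authors' implementation; the function name, the spectral floor, and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(mixture, robot_ref, fs=16000, n_fft=512, floor=0.05):
    """Estimate the human speech in `mixture` by subtracting the magnitude
    spectrum of `robot_ref` (robot ego-speech plus fan noise) frame by frame.

    `floor` keeps a fraction of the mixture magnitude to limit the musical
    noise caused by over-subtraction. All values here are assumptions.
    """
    _, _, Y = stft(mixture, fs=fs, nperseg=n_fft)    # mixture spectrogram
    _, _, R = stft(robot_ref, fs=fs, nperseg=n_fft)  # robot reference spectrogram
    frames = min(Y.shape[1], R.shape[1])             # align frame counts
    Y, R = Y[:, :frames], R[:, :frames]
    # Subtract magnitudes; reuse the mixture phase, as is standard in
    # spectral subtraction.
    mag = np.maximum(np.abs(Y) - np.abs(R), floor * np.abs(Y))
    _, estimate = istft(mag * np.exp(1j * np.angle(Y)), fs=fs, nperseg=n_fft)
    return estimate
```

In practice the reference must be time-aligned with the actual playback, and the paper additionally applies speech masking and optional post-filtering before evaluating the estimate by Word Error Rate.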

Original language: English
Title of host publication: Proceedings of the 2024 International Symposium on Technological Advances in Human-Robot Interaction, TAHRI 2024
Publisher: Association for Computing Machinery
Pages: 20-28
Number of pages: 9
ISBN (Electronic): 9798400716614
ISBN (Print): 9798400716614
DOIs
Publication status: Published - 9 May 2024
Event: 2024 International Symposium on Technological Advances in Human-Robot Interaction, TAHRI 2024 - Boulder, United States
Duration: 9 Mar 2024 – 10 Mar 2024

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2024 International Symposium on Technological Advances in Human-Robot Interaction, TAHRI 2024
Country/Territory: United States
City: Boulder
Period: 9/03/24 – 10/03/24

Bibliographical note

Publisher Copyright:
© 2024 Owner/Author.

Keywords

  • Human-robot interaction
  • spectrogram masking
  • speech recognition
  • target speech extraction
