A comparison of online automatic speech recognition systems and the nonverbal responses to unintelligible speech
File(s)
1904.12403.pdf (506.79 KB)
OA Location
Author(s)
Type
Working Paper
Abstract
Automatic Speech Recognition (ASR) systems have proliferated in recent
years to the point that free platforms such as YouTube now provide speech
recognition services. Given the wide selection of ASR systems, we contribute to
the field of automatic speech recognition by comparing the relative performance
of two sets of manual transcriptions and five sets of automatic transcriptions
(Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube) to help
researchers select accurate transcription services. In addition, we identify
nonverbal behaviors that are associated with unintelligible speech, as
indicated by high word error rates. We show that manual transcriptions remain
superior to current automatic transcriptions. Among the automatic
transcription services, YouTube is the most accurate.
For nonverbal behavioral involvement, we provide evidence that the variability
of the listener's smile intensity is high when the speaker is clear and low
when the speaker is unintelligible. These findings are derived from videoconferencing
interactions between student doctors and simulated patients; therefore, we
contribute towards both the ASR literature and the healthcare communication
skills teaching community.
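
The abstract uses word error rate (WER) as its measure of transcription accuracy and as a proxy for unintelligible speech. As a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), the function below may help readers interpret the comparison. The function name, the lowercasing, and the whitespace tokenization are assumptions for illustration; the record does not state how the paper normalized its transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical helper: WER = (substitutions + deletions + insertions) / reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (Levenshtein distance, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    manual = "the patient has a fever"
    automatic = "the patient has fever"
    print(word_error_rate(manual, automatic))
```

Because insertions count as errors, WER can exceed 1.0 when the automatic transcript is much longer than the reference.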
Date Issued
2019-04-29
Citation
2019
Publisher
arXiv
Copyright Statement
© 2019 The Author(s)
Identifier
http://arxiv.org/abs/1904.12403v1
Subjects
cs.SD
eess.AS
H.5.5
Notes
10 pages, 2 figures
Publication Status
Published