Automatic detection of vocalized hesitations in Russian speech

Publication type: Article
Status: Published
Affiliation: St. Petersburg Institute for Informatics and Automation, Russian Academy of Sciences
Address: Russian Federation, St. Petersburg
Journal name: Voprosy Jazykoznanija
Edition: Issue 6

The article focuses on the automatic detection of hesitations, the most frequent disfluencies in Russian speech. The authors describe the acoustic features of Russian hesitations and analyze different methods of hesitation detection. The acoustic analysis shows that hesitations in Russian speech tend to be centralized and that, depending on the speech genre, they influence the context differently. Experiments on the automatic detection of hesitations in Russian speech confirmed the efficiency and adequacy of approaches based on acoustic information alone. The support vector machine method yielded the best results, with the weighted harmonic mean of precision and recall (the F-measure) reaching 56%.
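
The weighted harmonic mean of precision and recall mentioned above is commonly called the F-measure. The following is a minimal sketch of how such a score can be computed. The abstract does not state which weight was used, so the default `beta = 1.0` (equal weighting) is an assumption, as are the function name `f_beta` and the input values, which are purely illustrative and are not taken from the article.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta score).

    beta > 1 weights recall more heavily, beta < 1 weights precision
    more heavily. beta = 1.0 (an assumption here, not stated in the
    abstract) gives the plain harmonic mean.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if beta <= 0.0:
        raise ValueError("beta must be positive")

    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# Hypothetical values chosen only to illustrate the calculation.
example_precision = 0.6
example_recall = 0.4
example_score = f_beta(example_precision, example_recall)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot reach a high score by maximizing precision or recall alone.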

Keywords: automatic speech processing, hesitations, machine learning, paralinguistic speech analysis, Russian speech
Acknowledgment: The research is supported by RFBR (projects No. 15-06-04465 and 18-07-01407), the Council for Grants of the President of the Russian Federation (projects No. MK-1000.2017.8 and MD-254.2017.8), and the budget theme No. 0073-2018-0002.
Publication date: 26.11.2018

