Methods of misspelling detection and correction: A historical overview

 
PII: S0373658X0001024-5-1
DOI: 10.31857/S0373658X0001024-5
Publication type: Article
Status: Published
Authors
Affiliation: National Research University Higher School of Economics
Address: Russian Federation, Moscow, 101000
Journal name: Voprosy Jazykoznanija
Edition: Issue 4
Pages: 115-134
Abstract

This paper discusses the history of methods of automatic spelling correction and the requirements faced by systems implementing such methods at different historical stages. Although the quality of correction has been steadily increasing since the 1960s, two basic technological problems remain: detecting a misspelled word and selecting the optimal correction candidate. A detailed analysis of contextual features (such as symbolic context and morphological and syntactic characteristics) for NLP applications that use automatic spelling correction can help improve performance in both problem areas.

Keywords: automatic spelling correction, historical overview, real-word errors, Russian spellchecking, spelling correction, text normalization
Received: 14.08.2017
Publication date: 14.08.2017
Number of characters: 632
1 ….


