Multilingual Parallel Corpora: Alternative Source of Language Data for Typological Studies, Applying Perspectives and Problems

 
PII S0373658X0004308-7-1
DOI 10.31857/S0373658X0004308-7
Publication type Article
Status Published
Authors
Affiliation: National Research University Higher School of Economics
Journal name Voprosy Jazykoznanija
Edition Issue 2
Pages 111-125
Abstract

In this paper, we discuss the prospects for using multilingual parallel corpora as a source of language data for cross-linguistic studies. Such corpora make it possible to apply quantitative methods to cross-linguistic data, yet they have not become popular among researchers. The reasons are the lack of multilingual parallel corpora suitable for linguistic studies and the absence of unified guidelines for developing them. We analyse the factors that make multilingual parallel corpora difficult to use in linguistic experiments and present some ideas about the features one should take into account when building such corpora for typological studies.

Keywords corpus linguistics, parallel corpora, surveys
Received 29.04.2019
Publication date 05.05.2019
Number of characters 45344

