Collect and extract data from media websites

Yatskov, A.; Varlamov, M.; Turdakov, D.

doi:10.31857/S013234740001216-2

PII

S013234740001216-2-1

DOI

10.31857/S013234740001216-2

Publication type

Article

Status

Published

Authors

A. Yatskov

Affiliation: Ivannikov Institute for System Programming of the Russian Academy of Sciences
Address: Russian Federation, Moscow

M. Varlamov

Affiliation: Ivannikov Institute for System Programming of the Russian Academy of Sciences
Address: Russian Federation, Moscow

D. Turdakov

Affiliation: Lomonosov Moscow State University
Address: Russian Federation, Moscow

Journal name

Programmirovanie

Edition

Issue 5

Pages

68-80

Abstract

Keywords

Acknowledgment

This work was supported by the Russian Foundation for Basic Research in the framework of scientific projects No. 18-07-01211 and No. 18-07-01059.

Received

26.10.2018

Publication date

28.10.2018

Number of characters

821

GOST	Yatskov A., Varlamov M., Turdakov D. Collect and extract data from media websites // Programmirovanie. – 2018. – Issue 5. – pp. 68-80 [Electronic resource]. URL: http://ras.jes.su/progr/s207987840000186-3-2-en (accessed: 23.07.2024). DOI: 10.31857/S013234740001216-2
MLA	Yatskov, A., Varlamov, M., Turdakov, D. "Collect and extract data from media websites." Programmirovanie 5 (2018): 68-80. DOI: 10.31857/S013234740001216-2
APA	Yatskov A., Varlamov M., Turdakov D. (2018). Collect and extract data from media websites. Programmirovanie (5), pp. 68-80. DOI: 10.31857/S013234740001216-2

The text posted below is a preview version and may differ from the printed version.

1. Roses: A continuous content-based query engine for RSS feeds / J.C. Tomàs, B. Amann, N. Travers, D. Vodislav, International Conference on Database and Expert Systems Applications, Springer, 2011, pp. 203-218.

2. Vouzoukidou N., Amann B., and Christophides V. Continuous top-k queries over real-time web streams, arXiv preprint arXiv:1610.06500, 2016.

3. Gogar T., Hubacek O., and Sedivy J. Deep neural networks for web page information extraction, IFIP International Conference on Artificial Intelligence Applications and Innovations, Springer, 2016, pp. 154-163.

4. Kohlschütter C., Fankhauser P., and Nejdl W. Boilerplate detection using shallow text features, Proceedings of the third ACM international conference on Web search and data mining, ACM, 2010, pp. 441-450.

5. Web information extraction using markov logic networks / S. Satpal, S. Bhadra, S. Sellamanickam et al., Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011, pp. 1406-1414.

6. Diadem: domain-centric, intelligent, automated data extraction methodology / T. Furche, G. Gottlob, G. Grasso et al., Proceedings of the 21st International Conference on World Wide Web, ACM, 2012, pp. 267-270.

7. Subercaze J., Gravier C., and Laforest F. Mining user-generated comments, Web Intelligence and Intelligent Agent Technology (WI-IAT), 2015 IEEE/WIC/ACM International Conference on, IEEE, 2015, vol. 1, pp. 45-52.

8. Incorporating site-level knowledge to extract structured data from web forums / J.-M. Yang, R. Cai, Y. Wang et al., Proceedings of the 18th international conference on World wide web, ACM, 2009, pp. 181-190.

9. Automatic extraction of web data records containing user-generated content / X. Song, J. Liu, Y. Cao et al., Proceedings of the 19th ACM international conference on Information and knowledge management, ACM, 2010, pp. 39-48.

10. Schulz A., Lässig J., and Gaedke M. Practical web data extraction: Are we there yet? - A short survey, Web Intelligence (WI), 2016 IEEE/WIC/ACM International Conference on, 2016, pp. 562-567.

11. Varlamov M.I. and Turdakov D. A survey of methods for the extraction of information from web resources, Programming and Computer Software, 2016, vol. 42, no. 5, pp. 279-291.

12. Automatic web news extraction using tree edit distance / D.d.C. Reis, P.B. Golgher, A.S. Silva, A. Laender, Proceedings of the 13th international conference on World Wide Web, ACM, 2004, pp. 502-511.

13. Vogels T., Ganea O.-E., and Eickhoff C. Web2text: Deep structured boilerplate removal, arXiv preprint arXiv:1801.02607, 2018.

14. Cleaneval: a competition for cleaning web pages / M. Baroni, F. Chantree, A. Kilgarriff et al., LREC, 2008.

15. Vips: a vision-based page segmentation algorithm / D. Cai, S. Yu, J.-R. Wen, W.-Y. Ma, 2003.

16. Zheng S., Song R., and Wen J.-R. Template-independent news extraction based on visual consistency, AAAI, vol. 7, 2007, pp. 1507-1513.

17. News article extraction with template-independent wrapper / J. Wang, X. He, C. Wang et al., Proceedings of the 18th international conference on World wide web, ACM, 2009, pp. 1085-1086.

18. Focus: learning to crawl web forums / J. Jiang, X. Song, N. Yu, C.-Y. Lin, IEEE Transactions on Knowledge and Data Engineering, 2013, vol. 25, no. 6, pp. 1293-1306.

19. Pretzsch S., Muthmann K., and Schill A. Fodex - towards generic data extraction from web forums, Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on, IEEE, 2012, pp. 821-826.

20. Barbosa L. Harvesting forum pages from seed sites, International Conference on Web Engineering, Springer, 2017, pp. 457-468.

21. Scikit-learn: Machine learning in Python / F. Pedregosa, G. Varoquaux, A. Gramfort et al., Journal of Machine Learning Research, 2011, vol. 12, pp. 2825-2830.

22. Web data extraction, applications and techniques: A survey / E. Ferrara, P. De Meo, G. Fiumara, R. Baumgartner, Knowledge-Based Systems, 2014, vol. 70, pp. 301-323.

23. Barbosa L. and Ferreira G. Extracting records and posts from forum pages with limited supervision, International Conference on Web Information Systems Engineering, Springer, 2015, pp. 233-240.