Collect and extract data from media websites

Publication type: Article
Status: Published
Affiliation: Ivannikov Institute for System Programming of the RAS
Address: Russian Federation, Moscow
Affiliation: Ivannikov Institute for System Programming of the RAS
Address: Russian Federation, Moscow
Affiliation: Lomonosov Moscow State University
Address: Russian Federation, Moscow
Journal name: Programmirovanie
Edition: Issue 5


Acknowledgment: This work was supported by the Russian Foundation for Basic Research within scientific projects No. 18-07-01211 and No. 18-07-01059.
Publication date: 28.10.2018
Number of characters: 821

1. RoSeS: A continuous content-based query engine for RSS feeds / J.C. Tomàs, B. Amann, N. Travers, D. Vodislav, International Conference on Database and Expert Systems Applications, Springer, 2011, pp. 203–218.

2. Vouzoukidou N., Amann B., and Christophides V. Continuous top-k queries over real-time web streams, arXiv preprint arXiv:1610.06500, 2016.

3. Gogar T., Hubacek O., and Sedivy J. Deep neural networks for web page information extraction, IFIP International Conference on Artificial Intelligence Applications and Innovations, Springer, 2016, pp. 154–163.

4. Kohlschütter C., Fankhauser P., and Nejdl W. Boilerplate detection using shallow text features, Proceedings of the third ACM international conference on Web search and data mining, ACM, 2010, pp. 441–450.

5. Web information extraction using Markov logic networks / S. Satpal, S. Bhadra, S. Sellamanickam et al., Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011, pp. 1406–1414.

6. Diadem: domain-centric, intelligent, automated data extraction methodology / T. Furche, G. Gottlob, G. Grasso et al., Proceedings of the 21st International Conference on World Wide Web, ACM, 2012, pp. 267–270.

7. Subercaze J., Gravier C., and Laforest F. Mining user-generated comments, Web Intelligence and Intelligent Agent Technology (WI-IAT), 2015 IEEE/WIC/ACM International Conference on, IEEE, 2015, vol. 1, pp. 45–52.

8. Incorporating site-level knowledge to extract structured data from web forums / J.-M. Yang, R. Cai, Y. Wang et al., Proceedings of the 18th international conference on World wide web, ACM, 2009, pp. 181–190.

9. Automatic extraction of web data records containing user-generated content / X. Song, J. Liu, Y. Cao et al., Proceedings of the 19th ACM international conference on Information and knowledge management, ACM, 2010, pp. 39–48.

10. Schulz A., Lässig J., and Gaedke M. Practical web data extraction: Are we there yet? – A short survey, Web Intelligence (WI), 2016 IEEE/WIC/ACM International Conference on, 2016, pp. 562–567.

11. Varlamov M.I. and Turdakov D. A survey of methods for the extraction of information from web resources, Programming and Computer Software, 2016, vol. 42, no. 5, pp. 279–291.

12. Automatic web news extraction using tree edit distance / D.d.C. Reis, P.B. Golgher, A.S. Silva, A. Laender, Proceedings of the 13th international conference on World Wide Web, ACM, 2004, pp. 502–511.

13. Vogels T., Ganea O.-E., and Eickhoff C. Web2text: Deep structured boilerplate removal, arXiv preprint arXiv:1801.02607, 2018.

14. Cleaneval: a competition for cleaning web pages / M. Baroni, F. Chantree, A. Kilgarriff et al., LREC, 2008.

15. VIPS: a vision-based page segmentation algorithm / D. Cai, S. Yu, J.-R. Wen, W.-Y. Ma, 2003.

16. Zheng S., Song R., and Wen J.-R. Template-independent news extraction based on visual consistency, AAAI, vol. 7, 2007, pp. 1507–1513.

17. News article extraction with template-independent wrapper / J. Wang, X. He, C. Wang et al., Proceedings of the 18th international conference on World wide web, ACM, 2009, pp. 1085–1086.

18. Focus: learning to crawl web forums / J. Jiang, X. Song, N. Yu, C.-Y. Lin, IEEE Transactions on Knowledge and Data Engineering, 2013, vol. 25, no. 6, pp. 1293–1306.

19. Pretzsch S., Muthmann K., and Schill A. Fodex – towards generic data extraction from web forums, Advanced Information Networking and Applications Workshops (WAINA), 2012 26th International Conference on, IEEE, 2012, pp. 821–826.

20. Barbosa L. Harvesting forum pages from seed sites, International Conference on Web Engineering, Springer, 2017, pp. 457–468.

21. Scikit-learn: Machine learning in Python / F. Pedregosa, G. Varoquaux, A. Gramfort et al., Journal of Machine Learning Research, 2011, vol. 12, pp. 2825–2830.

22. Web data extraction, applications and techniques: A survey / E. Ferrara, P. De Meo, G. Fiumara, R. Baumgartner, Knowledge-based systems, 2014, vol. 70, pp. 301–323.

23. Barbosa L. and Ferreira G. Extracting records and posts from forum pages with limited supervision, International Conference on Web Information Systems Engineering, Springer, 2015, pp. 233–240.
