Review of metadata storing and searching methods in DLT and distributed files systems

 
Article code: S278229070018060-2-1
DOI: 10.18254/S278229070018060-2
Publication type: Review
Publication status: Published
Authors
Affiliation:
Department of Engineering Cybernetics, National University of Science and Technology “MISiS”, Moscow, Russia
Department of Information Technology, Tartous University, Tartus, Syria
Address: Russian Federation, Moscow
Journal: Law & Digital Technologies
Issue: Volume 2, No. 1
Pages: 35-42
Abstract

In this review, the author considers currently available methods for storing files and looking them up by metadata. Several existing solutions are examined and their shortcomings highlighted. The purpose of this research is to evaluate current metadata processing methods in order to provide a robust basis for building a metadata indexing system for a distributed document exchange system. The following projects and studies were considered: Lustre, WekaFS, Ceph, Gluster, IPFS, HDFS, Spyglass, SmartStore, and the Distributed Metadata Search for the Cloud study. It was concluded that none of the considered solutions meets the requirements for processing document metadata. Finally, the characteristics of an ideal system for this task are described.

Keywords: DLT, metadata search, e-document, B2B exchange
Received: 11.04.2022
Publication date: 30.06.2022
Character count: 18654


Introduction

E-documents form an essential part of daily B2B data exchanges on the Internet. They include contracts, service notes, official correspondence, financial documents and other types of e-documents. These documents are structured differently from arbitrary files [1]. They may contain specific metadata entries, such as an author’s name, account number, registration date, etc. Various standards can be used to represent these metadata entries, such as MARC 21, Dublin Core, ANSI ASC X12 and others. Because these documents have their own structure, B2B services must be able to process them based on various attributes.
Large corporations build their own services to manage e-documents using centralized solutions. Although centralized solutions are easy to manage and deploy, they are more vulnerable to attacks and threats, which can have critical consequences for the entire system [2]. A distributed architecture can help to avoid these threats. A recent trend in B2B systems is to exploit the advantages of distributed architectures, such as data immutability and transparency of data processing and logs. For this reason, well-known distributed technologies such as DLT and IPFS are considered. The architecture should be designed to fit the requirements of processing e-documents, such as searching for e-documents by several attributes.
The aim of this review is to study the most well-known projects that handle files in distributed systems, or metadata in general. The paper discusses to what extent these projects can be integrated into a distributed structure for storing e-documents. For this purpose, the author highlights the flaws and advantages of each project.

Review methodology

Different technologies and solutions were considered in this study. These solutions were classified into three main categories:
  1. Projects and research articles about storing data in distributed architectures, such as Lustre, IPFS, etc.
  2. Projects and articles about methods for storing metadata for large numbers of files.
  3. Projects and articles about storing metadata in a blockchain.
This study drew on open resources: research articles (Google Scholar [3], Scopus [4], IEEE [5]) and white papers available on the Internet. The sources examined included actively developed projects that have been mentioned and cited over the last 3-4 years, whose technical documentation had been updated within the last year, and around which a developer community was forming.
A system for processing e-document metadata can be described by the following requirements:
  1. Support for several attributes, with various data types like string, number, dictionary, etc.
  2. The ability to work on several nodes. When there are several nodes, relying on a single node to handle all searches can create a bottleneck. The system should also be flexible when a new node is added, so that the workload can be shared with new nodes without disruption.
  3. Support for searching data entries based on multiple attributes. This includes searching by a single attribute, several attributes, or even all attributes. The solution should be efficient in terms of speed and memory (for example, storing many trees is not memory-efficient).
  4. Support for complex queries. There are a number of situations where complex queries can be useful, such as finding data quickly or producing reports. Complex queries, such as intersections and aggregations, may combine multiple attributes.
  5. Support for metadata schema changes. Eventually, there may be a need to upgrade the data schema, and the process should be feasible.
  6. A method or protocol for agreeing on data must be implemented so that nodes can resolve conflicts automatically. Data must also be synchronized between nodes so that changes are propagated quickly.
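Requirements 1, 3 and 4 can be illustrated with a minimal single-node sketch. It is not taken from any of the reviewed systems: the `MetadataIndex` class, the attribute names and the sample records are all hypothetical, and the sketch assumes one inverted index per attribute. A real system would additionally have to partition and synchronize such an index across nodes (requirements 2 and 6).

```python
from collections import defaultdict


class MetadataIndex:
    """Toy in-memory index: one inverted index per attribute (hypothetical design)."""

    def __init__(self):
        self.docs = {}  # doc_id -> metadata dict
        self.index = defaultdict(lambda: defaultdict(set))  # attr -> value -> doc_ids

    def add(self, doc_id, metadata):
        self.docs[doc_id] = metadata
        for attr, value in metadata.items():
            # Only scalar values (strings, numbers) are indexed in this sketch;
            # nested values such as dictionaries are stored but not searchable.
            if isinstance(value, (str, int, float)):
                self.index[attr][value].add(doc_id)

    def search(self, **criteria):
        """Return ids of documents matching ALL given attribute values (intersection)."""
        result = None
        for attr, value in criteria.items():
            matches = self.index[attr].get(value, set())
            result = set(matches) if result is None else result & matches
        return set(self.docs) if result is None else result

    def count_by(self, attr):
        """Simple aggregation: number of documents per value of one attribute."""
        return {value: len(ids) for value, ids in self.index[attr].items()}


idx = MetadataIndex()
idx.add("d1", {"author": "Ivanov", "type": "contract", "year": 2021})
idx.add("d2", {"author": "Ivanov", "type": "invoice", "year": 2022})
idx.add("d3", {"author": "Petrov", "type": "contract", "year": 2022, "extra": {"k": "v"}})

print(sorted(idx.search(author="Ivanov")))            # ['d1', 'd2']
print(sorted(idx.search(type="contract", year=2022)))  # ['d3']
print(idx.count_by("type"))                            # {'contract': 2, 'invoice': 1}
```

Because each record is a free-form dictionary, adding a new attribute to later documents does not require rebuilding existing entries, which is one simple way to tolerate schema changes (requirement 5). The memory cost of keeping one index per attribute is exactly the kind of trade-off requirement 3 warns about.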

Distributed filesystems

Lustre

Lustre [6] is a cluster-based open-source file system that works in distributed environments. Most high-performance file systems (HPFS) use the Lustre file system [7]. A Lustre file system mainly consists of three types of nodes (Fig. 1):
  1. The client which uses the filesystem
  2. Metadata server (MDS)
  3. Object storage server (OSS)
A Lustre file system can contain more than one metadata server; these servers store metadata on devices called MDTs (Metadata Targets), and a file system can contain one or more MDTs. MDTs hold all required metadata, including filenames, permissions, directories, etc.
Object storage servers (OSSs) manage storage devices called OSTs (Object Storage Targets). A Lustre file system can comprise one or several OSSs, and each OSS can have one or more OSTs. The total storage capacity of the system is the sum of the storage capacities of all OSTs.
The MDS uses either round-robin or a weighted random algorithm to allocate objects on OSTs, as follows:
  • When the free space variation among all the OSTs is below a certain threshold (17%), the MDS uses round-robin to choose the next OST to which a stripe of data is written. This case is called the balanced state.
  • If the free space variation among all the OSTs exceeds that threshold (the imbalanced state), the MDS uses the weighted random algorithm to allocate the next object. This algorithm lets the file system return to the balanced state by writing more objects to the OSTs that have more free space. I/O performance decreases while the system is imbalanced.
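The switch between the two allocation modes can be sketched as follows. This is only an illustration of the rule described above, not Lustre's actual implementation: the function names are hypothetical, and "free space variation" is assumed here to mean the relative difference between the fullest and the emptiest OST.

```python
import random


def free_space_variation(free_space):
    """Relative spread of free space across OSTs (assumed definition)."""
    largest = max(free_space)
    if largest == 0:
        return 0.0
    return (largest - min(free_space)) / largest


def choose_ost(free_space, last_index, threshold=0.17, rng=random):
    """Pick the index of the OST that receives the next object.

    free_space: free space per OST (any consistent unit)
    last_index: index chosen by the previous allocation
    """
    if free_space_variation(free_space) < threshold:
        # Balanced state: plain round-robin over all OSTs.
        return (last_index + 1) % len(free_space)
    # Imbalanced state: weighted random, favouring OSTs with more free space.
    return rng.choices(range(len(free_space)), weights=free_space, k=1)[0]
```

For example, with free space `[100, 95, 98]` the variation is 5%, so `choose_ost([100, 95, 98], 0)` simply returns the next index, 1. With `[100, 20, 30]` the variation is 80%, and the first OST is chosen with probability 100/150, which gradually evens out the free space.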
Since there may be more than one MDT, Lustre uses DNE (Distributed Namespace) remote directories to assign each MDT a part of the overall metadata. This mechanism is called DNE phase 1. MDTs can be nested, so that one MDT holds a part of the namespace of another MDT; this can cause problems, because if an MDT is damaged, all MDTs nested under it become unusable.
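The dependency between nested MDTs can be made concrete with a small model. It is a simplified sketch rather than Lustre code: the `reachable` function and the directory-to-MDT mapping are hypothetical, and the model assumes that a directory can be resolved only if the MDTs holding all of its ancestor directories are healthy.

```python
def reachable(path, dir_to_mdt, failed_mdts):
    """Return True if `path` and all of its ancestors sit on healthy MDTs."""
    parts = [p for p in path.split("/") if p]
    # Root first, then every ancestor down to the directory itself.
    prefixes = ["/"] + ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]
    return all(dir_to_mdt[prefix] not in failed_mdts for prefix in prefixes)


# Hypothetical layout: /projects is a remote directory on MDT 1,
# and /projects/a is a remote directory on MDT 2 nested under it.
layout = {"/": 0, "/home": 0, "/projects": 1, "/projects/a": 2}

print(reachable("/projects/a", layout, failed_mdts=set()))  # True
print(reachable("/projects/a", layout, failed_mdts={1}))    # False
print(reachable("/home", layout, failed_mdts={1}))          # True
```

In this model, losing MDT 1 makes `/projects/a` unreachable even though MDT 2, which stores it, is still healthy.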


1. Sawadogo, Pegdwendé, Tokio Kibata, and Jérôme Darmont. 2019. Metadata Management for Textual Documents in Data Lakes. Proceedings of the 21st International Conference on Enterprise Information Systems. May, 2019, Greece, 72–83. https://doi.org/10.5220/0007706300720083.

2. Golosova, Julija, and Andrejs Romanovs. 2018. The Advantages and Disadvantages of the Blockchain Technology. 2018 IEEE 6th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), 2018, Lithuania. https://doi.org/10.1109/aieee.2018.8592253.

3. Google Scholar. Accessed January 17, 2022. https://scholar.google.com/.

4. Scopus. Accessed January 17, 2022. https://www.scopus.com/.

5. IEEE Xplore. Accessed January 17, 2022. https://ieeexplore.ieee.org/.

6. Cluster File Systems, Inc. 2003. Lustre: A Scalable, High-Performance File System. https://cse.buffalo.edu/faculty/tkosar/cse710/papers/lustre-whitepaper.pdf

7. Salunkhe, Rushikesh, Aniket D Kadam, Naveenkumar Jayakumar, and Shashank Joshi. 2016. Luster a Scalable Architecture File System: A Research Implementation on Active Storage Array Framework with Luster File System. 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). March, 2016, 1073–1081. https://doi.org/10.1109/iceeot.2016.7754852.

8. Lustre* systems and network administration. 2017. Introduction to Lustre* Architecture. https://wiki.lustre.org/images/6/64/LustreArchitecture-v4.pdf

9. González-Domínguez, Jorge, Verónica Bolón-Canedo, Borja Freire, and Juan Touriño. 2019. Parallel Feature Selection for Distributed-Memory Clusters. Information Sciences, vol. 496. 399–409. https://doi.org/10.1016/j.ins.2019.01.050.

10. Benet, Juan. 2014. "IPFS - Content Addressed, Versioned, P2P File System." https://arxiv.org/abs/1407.3561.

11. Lustre Metadata Service (MDS). Accessed January 17, 2022. https://wiki.lustre.org/Lustre_Metadata_Service_(MDS)

12. Das, Dipanjan, Priyanka Bose, Nicola Ruaro, Christopher Kruegel, and Giovanni Vigna. 2021. Understanding Security Issues in the NFT Ecosystem. https://arxiv.org/abs/2111.08893.

13. Wang, Qin, Rujia Li, Qi Wang, and Shiping Chen. 2021. Non-Fungible Token (NFT): Overview, Evaluation, Opportunities and Challenges. https://arxiv.org/abs/2105.07447.

14. IPSE TEAM. 2019. IPSE: A Search Engine Based on IPFS. https://ipfssearch.io/IPSE-whitepaper-en.pdf.

15. Hilmi, Muhammad, Eueung Mulyana, Hendrawan Hendrawan, and Adrie Taniwidjaja. 2019. Analysis of Network Capacity Effect on Ceph Based Cloud Storage Performance. 2019 IEEE 13th International Conference on Telecommunication Systems, Services, and Applications (TSSA). October, 2019, Bali. 22–24. https://doi.org/10.1109/tssa48701.2019.8985455.

16. Selvaganesan, Manikandan, and Mohamed Ashiq Liazudeen. 2016. An Insight about Glusterfs and Its Enforcement Techniques. 2016 International Conference on Cloud Computing Research and Innovations (ICCCRI). April, 2016, Singapore. 120–126. https://doi.org/10.1109/ICCCRI.2016.26

17. Wang, Xin, and Jianhua Su. 2013. Research of Distributed Data Store Based on HDFS. 2013 International Conference on Computational and Information Sciences, June, 2013, China. 1457–1459. https://doi.org/10.1109/iccis.2013.384.

18. Leung, Andrew W., Minglong Shao, Timothy Bisson, Shankar Pasupathy and E. L. Miller. 2009. Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems. FAST (2009). USA, 153–166.

19. Hua, Yu, Hong Jiang, Yifeng Zhu, Dan Feng and Lei Tian. 2009. SmartStore: a new metadata organization paradigm with semantic-awareness for next-generation file systems. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. November, 2009, USA, 1–12. https://doi.org/10.1145/1654059.1654070

20. Yu, Yang, Yongqing Zhu and Juniarto Samsudin. 2016. Distributed Metadata Search for the Cloud. Journal of Communications 11(1): 100–107.

21. Barriocanal, Elena García, Salvador Sánchez-Alonso and Miguel-Ángel Sicilia. 2017. Deploying Metadata on Blockchain Technologies. Communications in Computer and Information Science. MTSR 2017. Estonia, 38–49. https://doi.org/10.1007/978-3-319-70863-8_4

22. BigchainDB GmbH. 2018. BigchainDB 2.0 the Blockchain Database. https://www.bigchaindb.com/whitepaper/bigchaindb-whitepaper.pdf.
