Dataspace Management with ETL and RDF Support

Marko Niinimaki, Tapio Niemi, Peter Thanisch

Abstract

        Dataspaces have become popular in data modeling and business intelligence. In this paper, we introduce a dataspace management system with Extract-Transform-Load capabilities and RDF (Resource Description Framework) data export. Moreover, we demonstrate how distributed processing based on the MapReduce framework can be used in processing the exported RDF data.


        Specifically, our system helps the user (i) discover potential problems with data integration and then (ii) carry out the actual integration. In the first case, the software generates a file of comma-separated values that users can load into their statistics software. In the second case, the user can transform the files into the RDF format and analyze the data using Python tools, or export the final data set to a visualization package or business intelligence software. Our method can therefore be seen as constructive: it builds a tool for data professionals.


        We demonstrate the viability of both the dataspace management system and RDF processing using a Hadoop cluster. On that cluster, distributing the processing with Hadoop improved the processing speed by 85 to 86%.
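
        The CSV-to-RDF step and the MapReduce-style processing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the namespace http://example.org/, the function names, and the sample data are all hypothetical, and the map and reduce phases run in a single Python process rather than on a Hadoop cluster. The sketch also assumes that row keys and column names contain no characters that would need escaping in an IRI.

```python
import csv
import io
from collections import defaultdict

# Hypothetical namespace for illustration only; not taken from the paper.
BASE = "http://example.org/"


def csv_to_ntriples(csv_text, key_column):
    """Emit one N-Triples line per non-key cell: subject = row, predicate = column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}row/{row[key_column]}>"
        for column, value in row.items():
            if column == key_column:
                continue
            # Escape backslashes and double quotes inside the literal.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{BASE}col/{column}> "{escaped}" .')
    return triples


def map_phase(triple_line):
    """Mapper: emit (predicate, 1) for a single triple line."""
    predicate = triple_line.split(" ", 2)[1]
    return predicate, 1


def reduce_phase(pairs):
    """Reducer: sum the emitted counts per predicate."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up sample data with two rows and two non-key columns.
sample = "id,country,exports\n1,FI,10\n2,TH,20\n"
triples = csv_to_ntriples(sample, "id")
counts = reduce_phase(map(map_phase, triples))

for line in triples:
    print(line)
print(counts)
```

        On a real cluster, the mapper and reducer would run as separate distributed tasks over partitions of the exported RDF file; here they are chained in memory only to show the shape of the computation.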


Keywords: dataspace, MapReduce, RDF, cloud, Hadoop.

Section
Research Articles

How to Cite
NIINIMAKI, Marko; NIEMI, Tapio; THANISCH, Peter. Dataspace Management with ETL and RDF Support. Naresuan University Journal: Science and Technology (NUJST), [S.l.], v. 28, n. 4, p. 36-49, June 2020. ISSN 2539-553X. Available at: <http://www.journal.nu.ac.th/NUJST/article/view/Vol-28-No-4-2020-36-49>. Date accessed: 26 Feb. 2021. doi: https://doi.org/10.14456/nujst.2020.35.