Title: Processing Spatial Big Data with Spark and Spark SQL
Hicham Hajji is an associate professor at IAV Institute. He received a PhD (2005) and an MS (2001) in Computer Science from INSA Lyon. In 1999 he received an engineering degree in Surveying from IAV Institute. Since 2001, he has held various positions as a lecturer, IT consultant, and research and development engineer.
He has been involved in more than fifteen technical and research projects, ranging from financial data warehousing, GIS, and web mapping to Big Data, with international and national institutions such as the United Nations' UNIDO, USAID, Natixis bank, and IXIS CIB.
His major research interests lie in Big Data management using scalable approaches and in Spatial Data Management. He was recently awarded the Water Innovation Fellowship from USAID and the Azure for Research Award for Machine Learning (ML) from Microsoft. He leads a research group working on applications of Spatial Big Data Management to transportation, telecommunications, and forest management.
Processing Spatial Big Data with Spark and Spark SQL
Processing and computing over spatial big data has often been considered a tedious task. It faces many challenges, owing both to the complex nature of spatial databases and to the difficulty of developing scalable algorithms that can efficiently answer spatial problems and queries in a cluster environment.
The talk will first review the constraints on processing spatial big data in a parallel environment, mainly Spatial Partitioning, Shuffling, and Spatial Indexing, which must be addressed carefully when building a spatial big data platform. We will then give an overview of the Spark computing engine, an in-memory framework for cluster computing that can tackle emerging data processing workloads while coping with ever larger scales. We will show that, for spatial data, building a Spatial Big Data platform on Spark SQL, using abstractions such as Datasets, is more efficient than using Resilient Distributed Datasets (RDDs). Because Spark SQL applies many optimizations under the hood and offers a declarative query language (SQL) over spatial big data, we will show that it is better suited to spatial big data. Several research and industrial prototypes based on Spark SQL will be reviewed during the talk.
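As a rough illustration of why Spatial Partitioning matters, the sketch below groups points into a uniform grid so that a range query only scans the cells overlapping the query box. It is a minimal single-machine sketch in plain Python, not Spark code and not the method presented in the talk. The function names (`grid_cell`, `partition_points`, `range_query`), the uniform grid, and the sample points are all assumptions made for illustration.

```python
from collections import defaultdict


def grid_cell(x, y, cell_size):
    """Map a point to the integer (column, row) of the grid cell containing it."""
    return (int(x // cell_size), int(y // cell_size))


def partition_points(points, cell_size):
    """Group points by grid cell; each cell stands in for one partition."""
    partitions = defaultdict(list)
    for x, y in points:
        partitions[grid_cell(x, y, cell_size)].append((x, y))
    return dict(partitions)


def range_query(partitions, cell_size, xmin, ymin, xmax, ymax):
    """Return the points inside the query box, scanning only overlapping cells."""
    c0, r0 = grid_cell(xmin, ymin, cell_size)
    c1, r1 = grid_cell(xmax, ymax, cell_size)
    hits = []
    for col in range(c0, c1 + 1):
        for row in range(r0, r1 + 1):
            # Cells with no points are simply absent from the dictionary.
            for x, y in partitions.get((col, row), []):
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    hits.append((x, y))
    return hits


# Hypothetical sample data: three points near the origin, two far away.
points = [(0.5, 0.5), (1.2, 0.3), (9.7, 9.1), (5.0, 5.0), (0.9, 0.1)]
parts = partition_points(points, cell_size=1.0)
result = range_query(parts, 1.0, 0.0, 0.0, 1.0, 1.0)
```

In a cluster setting, each cell (or group of cells) would live on a different worker, so a query that touches few cells also moves little data between workers. Real spatial data is usually skewed, which is one reason adaptive schemes such as quadtrees or R-trees are often preferred over a uniform grid.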