Keywords: Text mining, NLP, information retrieval, machine learning, data mining, knowledge discovery
Objective: The availability of data is increasing rapidly, yet data-driven research still suffers from the problem that observations are not collected in centrally accessible databases. Mass accumulation rates (MAR) at the seafloor are a good example. This fundamental parameter has been measured for more than one hundred years at numerous locations around the globe. However, MAR data have never been compiled in a database; instead, they are reported in a very large number of individual publications. Hence, current global MAR maps are based not on a comprehensive MAR data set but on simple empirical correlations with, e.g., water depth (Burwicz et al., 2011). This PhD project addresses this problem by investigating a general text-mining approach for pooling information and discovering knowledge from large collections of text documents (publications). The challenges of this project go beyond the traditional challenges of text mining, i.e. clustering and classification of unstructured data. In this interdisciplinary project, a domain-knowledge-driven approach will be followed that enables the incorporation of expert knowledge into the text mining process. Rather than using data mining and machine learning as black boxes, interactive, potentially semi-supervised, data mining and machine learning approaches have to be developed and investigated. In addition, methods have to be developed and investigated for merging the information and knowledge extracted from the documents into a central knowledge base that provides the basis for further domain-specific analysis. The overarching aim is to find solutions for answering advanced domain-specific research questions on the basis of an existing domain-specific literature base.
Aim and Tasks: Our study, which will mainly address MAR data to answer marine geological questions, serves as a blueprint and proof of concept that would be useful for other research fields facing the same problem. In this project, novel marine-science-domain-driven text mining methods designed for interactive, semi-supervised operation need to be developed and investigated to i) automatically identify scientific publications that present or refer to relevant MAR data, ii) extract MAR and accompanying metadata from these publications guided by expert knowledge, and iii) integrate this information into a semantically rich MAR knowledge base to answer advanced marine science questions. The resulting approach will be implemented as a software tool that will be trained and tested using GEOMAR’s OceanRep, an open-access digital collection of scientific publications. Subsequently, the new software tool will be applied to larger sets of publications to extract as many MAR data as possible and to compile a comprehensive MAR data set and knowledge base following data fusion concepts (Patel et al., 2018a,b).
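As a purely illustrative sketch of task ii), the snippet below shows one simple way candidate MAR values could be pulled from publication text so that a domain expert can confirm or reject them. It is not the project's method: the regular expression, the unit notations it accepts (g cm-2 kyr-1 and g/cm2/yr style), and the names `MAR_PATTERN`, `MarCandidate`, and `extract_mar_candidates` are all assumptions made for this example. Real publications use many more notations, which is one reason an interactive, expert-guided approach is proposed.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: a number followed by a mass-per-area-per-time unit,
# e.g. "2.5 g cm-2 kyr-1" or "0.8 g/cm2/yr". Assumed notations only.
MAR_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>g\s*(?:/\s*cm2|cm-2)\s*(?:/\s*k?yr|k?yr-1))"
)


@dataclass
class MarCandidate:
    value: float   # numeric value as written in the text
    unit: str      # unit string with whitespace normalized
    context: str   # surrounding text, kept for expert review


def extract_mar_candidates(text: str, window: int = 40) -> list:
    """Return candidate MAR mentions found in `text` with some context."""
    candidates = []
    for match in MAR_PATTERN.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        candidates.append(
            MarCandidate(
                value=float(match.group("value")),
                unit=" ".join(match.group("unit").split()),
                context=text[start:end],
            )
        )
    return candidates


if __name__ == "__main__":
    sample = "Rates averaged 2.5 g cm-2 kyr-1 at site A and 0.8 g/cm2/yr at site B."
    for candidate in extract_mar_candidates(sample):
        print(candidate.value, candidate.unit)
```

Keeping the surrounding context with each candidate is a deliberate choice in this sketch: it gives a reviewer enough text to judge whether a match really is a mass accumulation rate, and such confirmed or rejected matches could in principle serve as labels for a semi-supervised model.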
Burwicz, E. B., Rüpke, L. H., & Wallmann, K. (2011). Estimation of the global amount of submarine gas hydrates formed via microbial methane formation based on numerical reaction-transport modeling and a novel parameterization of Holocene sedimentation. Geochimica et Cosmochimica Acta, 75, 4562-4576.
Patel, H., Paraskevopoulos, P., & Renz, M. (2018a). Data fusion of diverse data sources: enrich spatial data knowledge using HINs. In Proceedings of the Fifth International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data (pp. 13-18). ACM.
Patel, H., Paraskevopoulos, P., & Renz, M. (2018b). GeoTeGra: A System for the Creation of Knowledge Graph Based on Social Network Data with Geographical and Temporal Information. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 617-620). IEEE.