GRAPHIA’s SSH Federated Knowledge Graph: The Role of GoTriple
Authors: Luca de Santis, Julien Homo, Ursula Rabar
While GoTriple is the largest SSH-only (Social Sciences and Humanities) repository of scientific content, it is not a graph-based system. Its “triple” naming refers to a formal ontology description of its textual indexes, not to semantic graph structures. Nevertheless, GoTriple is central to the GRAPHIA SSH Knowledge Graph federation, as it provides the primary data source (the federation concept is explained in more detail in our previous blog post, “GRAPHIA’s SSH Federated Knowledge Graph: Vision and Goal”).
Within GRAPHIA, the goal for GoTriple is to build a companion service based on knowledge graph technologies: the GoTriple Knowledge Graph.

This serves two purposes. First, it allows more sophisticated navigation and reuse of the vast amount of information that GoTriple stores: relationships among entities (for example, between publications and authors) can be exploited for analyses such as co-authorship analysis. Second, it makes available new kinds of data that are not stored in GoTriple, in particular entities extracted from the full text of documents by Artificial Intelligence and Natural Language Processing-powered services developed in GRAPHIA in the context of AI Solutions.
Importantly, the GoTriple Knowledge Graph will not modify GoTriple itself; it is a separate service that enhances the platform with:
- Relationships among objects (e.g. a direct connection between publications and authors, which supports co-authorship analysis) that can be exploited for sophisticated analysis of GoTriple data
- Semantic interoperability with other knowledge graphs, achieved through alignment with the Scientific Knowledge Graphs Interoperability Framework (SKG-IF) and integration with the federation services
- Concepts and entities extracted from the full text of documents by using Artificial Intelligence and Natural Language Processing
The GoTriple Knowledge Graph will act as the central SSH graph in the federation, bridging existing SSH data layers and AI-enriched metadata.
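To make the co-authorship example concrete, here is a minimal sketch in Python of how publication-to-author relationships can be turned into co-author pairs. The identifiers and the `AUTHORSHIP` data are invented for illustration, and the function is not part of GoTriple or GRAPHIA; it only shows the kind of analysis that explicit relationships make straightforward.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication -> authors edges; all identifiers are invented.
AUTHORSHIP = {
    "pub:1": ["author:a", "author:b"],
    "pub:2": ["author:a", "author:b", "author:c"],
    "pub:3": ["author:c"],
}


def coauthor_counts(authorship):
    """Count how many publications each pair of authors shares."""
    counts = Counter()
    for authors in authorship.values():
        # set() drops duplicate listings; sorted() makes each pair order-independent.
        for pair in combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return counts


counts = coauthor_counts(AUTHORSHIP)
for (first, second), shared in counts.most_common():
    print(first, second, shared)
```

In a graph store the same result would typically come from a query over the publication-author edges rather than from an in-memory dictionary.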
The GoTriple Data Source
GoTriple aggregates metadata related to SSH resources from multiple sources. At present, it contains about 22 million documents harvested from over 1,500 sources provided by around 35 providers, where a provider may be a large aggregator such as DOAJ or BASE, or a repository managed by a research institution, such as OpenEdition or Complutense University of Madrid.
GoTriple manages only metadata describing specific research products, including textual documents (articles, publications, books, book chapters, etc.), datasets, media objects, projects, and author/researcher profiles. In the near future, it will also manage semantic artefacts. It is important to emphasise that GoTriple is an index of resources and stores only their metadata, not the resources themselves. For example, within the platform, the full text of a document is represented simply as an attribute containing the URL pointing to the actual PDF. The GoTriple Knowledge Graph will import all the information currently managed by GoTriple in graph form and, as mentioned above, will additionally include the entities extracted from texts, which will be available only within the Knowledge Graph.
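As a rough illustration of what "importing metadata in graph form" can mean, the sketch below flattens a metadata record into subject-predicate-object statements. The record layout, the `ex:` predicates, and the sample values are assumptions made for this example; the actual GoTriple schema and the TRIPLE Ontology terms may differ. Only `rdf:type`, `dcterms:title` and `dcterms:creator` are borrowed from widely used vocabularies.

```python
def record_to_triples(record):
    """Turn one metadata record into a list of (subject, predicate, object) tuples."""
    subject = record["id"]
    triples = [
        (subject, "rdf:type", record["type"]),
        (subject, "dcterms:title", record["title"]),
        # The full text is not stored: only a URL pointing to it (hypothetical predicate).
        (subject, "ex:fullTextUrl", record["full_text_url"]),
    ]
    for author in record["authors"]:
        triples.append((subject, "dcterms:creator", author))
    # Entities extracted from the text would exist only in the graph (hypothetical predicate).
    for entity in record.get("extracted_entities", []):
        triples.append((subject, "ex:mentions", entity))
    return triples


# Invented sample record, for illustration only.
sample = {
    "id": "doc:123",
    "type": "ex:Document",
    "title": "An example article",
    "authors": ["author:a", "author:b"],
    "full_text_url": "https://example.org/article.pdf",
    "extracted_entities": ["entity:Madrid"],
}

triples = record_to_triples(sample)
for triple in triples:
    print(triple)
```

Each author and each extracted entity becomes its own statement, which is what allows them to be traversed and joined with other records.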
The indexing of the GoTriple Knowledge Graph is performed through an extension of the GoTriple processing pipeline. During the course of GRAPHIA, this pipeline is undergoing a significant refactoring aimed at creating a more efficient and scalable platform called the Data Acquisition Pipeline.
Current Development Status and Next Steps
The GRAPHIA technical team is currently working on the release of a first alpha version of the GoTriple Knowledge Graph. This initial release will serve as an experimental prototype for testing three things: the technologies selected for the graph infrastructure, the extensions of the Data Acquisition Pipeline required to populate the graph via dedicated services, and the semantic model adopted to represent the graph data, which is based on the TRIPLE Ontology. The alpha version will be made available for testing purposes and will be accessible through a standard SPARQL endpoint.
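For readers unfamiliar with SPARQL endpoints, the following sketch shows how a co-authorship query could be prepared as an HTTP GET request, following the SPARQL 1.1 Protocol convention of passing the query in a `query` parameter. The endpoint URL is a placeholder, and the use of `dcterms:creator` is an assumption: the alpha release may expose different properties from the TRIPLE Ontology. The code only builds the request URL and does not contact any server.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholder endpoint; the real address of the alpha service is not known here.
ENDPOINT = "https://example.org/sparql"

# Hypothetical query: pairs of creators ranked by number of shared publications.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?a ?b (COUNT(DISTINCT ?pub) AS ?shared)
WHERE {
  ?pub dcterms:creator ?a, ?b .
  FILTER (STR(?a) < STR(?b))
}
GROUP BY ?a ?b
ORDER BY DESC(?shared)
LIMIT 10
"""


def build_query_url(endpoint, query):
    """Encode a SPARQL query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(ENDPOINT, QUERY)
# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(url)
```

The resulting URL could then be fetched with any HTTP client, typically with an `Accept` header such as `application/sparql-results+json` to request JSON results.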
