Copyright Cameron Neylon and Bianca Kramer 2021. This slide deck is licensed under a
Creative Commons Attribution 4.0 International License.
Code is available under an Apache v2.0 license on GitHub.
Microsoft made the data available under an open (ODC-By) license, which offers opportunities to develop new tools and resources, particularly based on machine learning approaches.
Much of the technology is open source and can be adapted or rebuilt to provide a replacement. Several groups are working on this.
Some aspects of what made MAG so useful were dependent on licenses to content that will be difficult or impossible to renegotiate. Some elements of the workflows are dependent on the Bing infrastructure.
A replacement will need to work smarter in some ways, especially if we want to reach beyond the core of the academic record (e.g. beyond DOIs).
Efforts to replace MAG largely focus on inferring or scraping metadata and are therefore dependent on access to semi-structured metadata or full text.
Structured metadata resources are also advancing rapidly. Increasing the upstream provision of structured metadata (i.e. by publishers) is a complementary route towards a rich open metadata environment.
We believe services seeking to replace MAG should demonstrate commitment to the Principles of Open Scholarly Infrastructure (POSI) to ensure transparency, community governance and an insurance plan for long-term availability.
"Our goal is to change the stories that universities tell about themselves, placing open knowledge at the heart of that narrative"
Data is derived from the following sources
Data is integrated and processed via Observatory Platform, an open source workflow system developed within COKI to integrate data related to scholarly communications. The code is available on GitHub.
For this analysis we largely use the "DOI Table", which is an aggregation of multiple data sources that provide information on the outputs identified by Crossref DOIs. To supplement this we use a de-normalised version of the MAG database.
Code for the analysis, including the queries and processing, and a local copy of the derived data are available at the presentation GitHub repository.
Additional metadata for Crossref DOIs / metadata in MAG for non-DOIs
Split by publication type and year of publication
For all time and for Crossref "current" (2019-21)
This analysis uses data snapshots from 18 July 2021
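The split described above (metadata coverage by publication type, for all time and for the 2019-21 "current" window) can be illustrated with a minimal sketch. This is not the analysis code from the presentation repository: the records, the field names (`type`, `year`, `has_abstract`) and the `coverage` helper are all hypothetical, chosen only to show the shape of the calculation.

```python
from collections import defaultdict

# Hypothetical records: each dict stands in for one row of a DOI-level table.
# Field names are assumptions for illustration, not the real DOI Table schema.
records = [
    {"type": "journal-article", "year": 2020, "has_abstract": True},
    {"type": "journal-article", "year": 2020, "has_abstract": False},
    {"type": "book-chapter", "year": 2015, "has_abstract": False},
    {"type": "book-chapter", "year": 2021, "has_abstract": True},
]


def coverage(rows, field, years=None):
    """Return the share of rows, per publication type, where `field` is present.

    `years` is an optional inclusive (start, end) filter on publication year.
    """
    totals = defaultdict(int)
    present = defaultdict(int)
    for row in rows:
        if years is not None and not (years[0] <= row["year"] <= years[1]):
            continue
        totals[row["type"]] += 1
        if row[field]:
            present[row["type"]] += 1
    return {pub_type: present[pub_type] / totals[pub_type] for pub_type in totals}


# Coverage over all years, and over the "current" window only.
all_time = coverage(records, "has_abstract")
current = coverage(records, "has_abstract", years=(2019, 2021))
```

Comparing `all_time` with `current` for the same field shows whether coverage for a publication type differs between the full record and recent years.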
What biases exist in metadata coverage across, e.g., languages and (types of) publishers?
What is the role of metadata provision vs extraction of full-text?
Are publishers the authoritative source for all publication metadata?
What is needed to safeguard transparent data collection, provenance and sustainability?
How can we (as a community) not only replace MAG, but do better?
MAG was a great resource, and its openness will enable us to build back better
There are challenges in systematically gaining access to underlying information to replicate some aspects of MAG (e.g. subject classifications)
Improvements to the provision of structured metadata by publishers (ROR, ORCID, I4OC, I4OA etc.) have great potential to improve metadata coverage
But gaps will still need to be filled (and back-filled). Access to content, including abstracts and ideally full text, is critical to make this happen beyond the low-hanging fruit
There are big gaps and under-represented areas, including content beyond journal articles and smaller (and often non-APC open access) journals
Centre for Culture and Technology
Funding from
Curtin Institute for Computation
Educopia Foundation