What do we lose when MAG goes away?

What we have, what we need,
and how to maintain it?

Colophon

Creative Commons License
Copyright Cameron Neylon and Bianca Kramer 2021. This slide deck is licensed under a Creative Commons Attribution 4.0 International License. Code is available under an Apache v2.0 license on GitHub.

What is the issue here?

[Microsoft Academic Graph]...will be supported until the end of calendar year 2021, upon which time MAS will be retired

Good News and Bad News

  • Microsoft made the data available under an open (ODC-By) license which offers opportunities to develop new tools and resources, particularly based on machine learning approaches.

  • Much of the technology is open source and can be adapted or rebuilt to provide a replacement. Several groups are working on this.

  • Some aspects of what made MAG so useful were dependent on licenses to content that will be difficult or impossible to renegotiate. Some elements of the workflows are dependent on the Bing infrastructure.

  • A replacement will need to work smarter in some ways, especially if we want to reach beyond the core of the academic record (e.g. beyond DOIs)

Good News and Bad News

  • Efforts to replace MAG largely focus on inferring or scraping metadata and are therefore dependent on access to semi-structured metadata or full text

  • Structured metadata resources are also moving forward in leaps and bounds. Increasing the upstream provision of structured metadata (i.e. by publishers) is a complementary route towards a rich open metadata environment

  • We believe services seeking to replace MAG should demonstrate commitment to the Principles of Open Scholarly Infrastructure (POSI) to ensure transparency, community governance and an insurance plan for long-term availability

Can we rebuild it? Yes...but...

...how would we know?

...and what content do we have (or not)?

A quick segue on the data

The Curtin Open Knowledge Initiative

"Our goal is to change the stories that universities tell about themselves, placing open knowledge at the heart of that narrative"

The Data

Information

...but because we focus on provenance and transparency, it's also good for...

Comparing MAG with Crossref Metadata

Who has what, and what do we lose?

The Analysis

Data is derived from the following sources

  • Crossref - weekly dump via Metadata Plus program
  • Unpaywall - Open Access Status data via open data dump (October 2020)
  • Microsoft Academic - Affiliation and authorship data via biweekly dump
  • GRID - Information on organisations via regular data dump

Data is integrated and processed via Observatory Platform, an open source workflow system developed within COKI to integrate data related to scholarly communications. The code is available on GitHub.

For this analysis we largely use the "DOI Table" which is an aggregation of multiple data sources that provide information on the outputs identified by Crossref DOIs. To supplement this we use a de-normalised version of the MAG database.

Code for the analysis, including the queries and processing, and a local copy of the derived data are available at the presentation GitHub repository.
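
As a minimal sketch of the matching step this comparison relies on, the snippet below splits MAG records into those whose DOI is known to Crossref, those with a DOI Crossref does not hold, and those with no DOI at all. The record layout (`doi`, `paper_id` keys) and the helper names are hypothetical and chosen only for illustration; the actual analysis runs as queries over the DOI Table, not this code.

```python
# Illustrative only: field names and helpers are assumptions, not the COKI schema.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi):
    """Lower-case a DOI and strip a leading resolver prefix so sources compare equal."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def split_mag_by_doi(crossref_records, mag_records):
    """Return (matched, doi_not_in_crossref, no_doi) lists of MAG records."""
    crossref_dois = {normalise_doi(rec["doi"]) for rec in crossref_records}
    matched, doi_not_in_crossref, no_doi = [], [], []
    for rec in mag_records:
        doi = rec.get("doi")
        if not doi:
            no_doi.append(rec)  # candidate "non-DOI" content held only in MAG
        elif normalise_doi(doi) in crossref_dois:
            matched.append(rec)  # Crossref record that MAG may enrich
        else:
            doi_not_in_crossref.append(rec)  # DOI absent from the Crossref set
    return matched, doi_not_in_crossref, no_doi
```

DOIs are case-insensitive, so normalising before the comparison avoids undercounting matches when the two sources record the same identifier in different forms.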

The Analysis

  • Additional metadata for Crossref DOIs / metadata in MAG for non-DOIs

  • Metadata on:
    • Affiliations
    • Abstracts
    • Citations
    • (Open) References
    • Subjects
  • Split by publication type, year of publication

  • For all time and for the Crossref "current" period (2019-21)

  • This analysis uses data snapshots from 18 July 2021
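
The breakdown listed above can be sketched as a per-type tally. This is an illustration under stated assumptions, not the queries used for the slides: each record is assumed to carry its Crossref `type` and two sets naming the metadata fields present in each source (`crossref_fields`, `mag_fields`, both hypothetical names), and "added value" is taken here to mean a field that MAG holds for a Crossref DOI where Crossref itself does not.

```python
from collections import defaultdict


def coverage_by_type(records, field):
    """For one metadata field, return per-type shares of records covered.

    'crossref'  : share of records where Crossref holds the field
    'mag_added' : share where only MAG holds it (assumed meaning of added value)
    """
    totals = defaultdict(int)
    in_crossref = defaultdict(int)
    mag_only = defaultdict(int)
    for rec in records:
        pub_type = rec["type"]
        totals[pub_type] += 1
        if field in rec["crossref_fields"]:
            in_crossref[pub_type] += 1
        elif field in rec["mag_fields"]:
            mag_only[pub_type] += 1
    return {
        pub_type: {
            "crossref": in_crossref[pub_type] / totals[pub_type],
            "mag_added": mag_only[pub_type] / totals[pub_type],
        }
        for pub_type in totals
    }


# Toy records, invented for illustration only.
sample = [
    {"type": "journal-article", "crossref_fields": {"affiliations"}, "mag_fields": {"affiliations"}},
    {"type": "journal-article", "crossref_fields": set(), "mag_fields": {"affiliations"}},
    {"type": "journal-article", "crossref_fields": set(), "mag_fields": {"affiliations", "abstracts"}},
    {"type": "journal-article", "crossref_fields": set(), "mag_fields": set()},
    {"type": "book-chapter", "crossref_fields": set(), "mag_fields": set()},
]
result = coverage_by_type(sample, "affiliations")
```

Splitting by year of publication would follow the same pattern with a `(type, year)` key in place of the type alone.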

MAG vs Crossref Coverage

MAG vs Crossref Coverage (Current)

Crossref records in MAG

Crossref records in MAG - by Crossref type

Crossref records: publication types in Crossref and MAG

Crossref Records: MAG Added Value

Crossref Records: MAG Added Value (Current)

Crossref Records: MAG Added Value (Subjects)

MAG Added Value by Crossref Type - Affiliations

MAG Added Value by Crossref Type - Abstracts

MAG Added Value by Crossref Type - Citations to

MAG Added Value by Crossref Type - Citations from

MAG Added Value by Crossref Type - Subjects

Case studies

References in Crossref - added value MAG (by publisher)

Coverage of affiliations by journal category

What is in MAG but not in Crossref?

MAG vs Crossref Coverage

Coverage of non-DOIs in MAG

Coverage of non-DOIs in MAG

Open questions

  • What biases exist in metadata coverage, e.g. across languages and (types of) publishers?

  • What is the role of metadata provision vs extraction from full text?

  • Are publishers the authoritative source for all publication metadata?

  • What is needed to safeguard transparent data collection, provenance and sustainability?

  • How can we (as a community) not only replace MAG, but do better?

Conclusions

  • MAG was a great resource, and its openness will enable us to build back better

  • There are challenges in systematically gaining access to the underlying information needed to replicate some aspects of MAG (e.g. subject classifications)

  • Improvements to the provision of structured metadata by publishers (ROR, ORCID, I4OC, I4OA etc.) have great potential to improve metadata coverage

  • But gaps will still need to be filled (and back-filled). Access to content, including abstracts and ideally full text, is critical to make this happen beyond the low-hanging fruit

  • There are big gaps and under-represented areas, including content beyond journal articles and smaller (and often non-APC open access) journals

Tackle the problem from both ends...

...meet in the middle for a rich metadata future

COKI Team

Centre for Culture and Technology

  • Cameron Neylon
  • Lucy Montgomery
  • Katie Wilson
  • Chun-Kai (Karl) Huang
  • Niam Quigley
  • Chloe Brookes-Kenworthy
  • Tim Winkler

Funding from

  • Research Office at Curtin
  • Faculty of Humanities
  • School of Media, Creative Arts and Social Enquiry
  • Andrew W. Mellon Foundation
  • Arcadia

Curtin Institute for Computation

  • Kathryn Napier
  • Rebecca Handcock
  • Rebecca Lange
  • Aniek Roelofs
  • Richard Hosking
  • Jamie Diprose
  • Tuan Chien

Educopia Foundation

  • Katherine Skinner
  • Rebecca Meyerson

@COKIProject - @cameronneylon

http://openknowledge.community
  • Subscribe to the COKI Newsletter
  • View the public dashboards
