160 Million Papers and Counting: The World’s Information Deluge

Academic output has exploded over the last 100 years, but how can the most relevant research be found?

— by Melissa Cochrane

In 2009, an estimated 50 million or more research publications were floating around the coves of the internet. If you printed all of them out and put them side by side, you could go all the way around the earth. Based on recent data, however, the number of publications appears to be at least three times larger than previously thought, at around 160 million, and the growth rate has increased to 0.8% per month, which means the total doubles in just over 7 years. It’s clear that the scientific world is booming with information, but how do researchers find out who, what and where is relevant to their specific fields? How on earth can we navigate all this?
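For readers who want to check the doubling figure, here is a minimal sketch of the arithmetic. It assumes the 0.8% monthly growth compounds steadily, which is a simplification, and the function name is purely illustrative.

```python
import math


def doubling_time_months(monthly_growth_rate: float) -> float:
    """Months needed for a quantity to double under steady compound monthly growth."""
    # Solve (1 + r) ** n == 2 for n.
    return math.log(2) / math.log(1 + monthly_growth_rate)


# Assumption: growth stays at a constant 0.8% per month.
months = doubling_time_months(0.008)
years = months / 12
print(f"{months:.1f} months, or about {years:.2f} years")
```

Under that assumption the result lands a little above seven years, consistent with the estimate quoted above.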

Kicked off two years ago, Microsoft Academic is a research project inside Microsoft Research. At its core is an artificial intelligence agent that reads all academic publications on the web to learn and automatically create a massive knowledge base, going far beyond a simple keyword-matching search to provide an overall benchmark and the context of what you’re looking for. One goal of the project is to explore how far advances in machine intelligence can be harnessed to help scientists discover and disseminate new knowledge.

“The search is based on intelligent behavior,” says Dr. Kuansan Wang, Managing Director at MSR Outreach Innovation. “Microsoft Academic can serve the reader with the most influential paper, for example, in the field of Computer Vision even though this paper does not have the words ‘computer vision’ in its contents.”

“This is possible because the technology has gone beyond a lexical representation and into the semantics of the search intent, therefore matching intent to knowledge,” explains Wang.

What that means is that Microsoft Academic helps readers reach the information they want in a specific field by making search smart and efficient. If the query is clear, the reader gets exactly what they asked for. If the query isn’t clear, the reader is asked clarifying questions or given suggestions to disambiguate it. If a specific topic is queried, for example, the reader is presented with the most influential papers within that topic, along with its key authors, key institutions and key associated conferences.

This exceeds the usefulness and capability of a standard search engine, or even a scholarly search engine like Google Scholar, because it goes far beyond simple keyword matching to give you an overall benchmark and the context of what you’re looking for. Basically, the 160 million plus entities currently indexed by Microsoft Academic are organized into semantic knowledge networks connected by billions of relationships, which can be tapped into and are easy to navigate to find the most important, relevant research, people and conferences in a field or topic.

Finding what the reader needs is only the first part of the story. Accessing it is the second.

“Science that is not open, is not real science,” Dr. Wang believes. “We stand on the shoulders of giants. We build on the findings of other scientists which means that we need to be able to find and access their research, and to be able to reproduce it. What drives science forward is that the community works together to validate and verify. Based on that foundation, we discover new things. Open science is critical to scientific advancement. And reproducibility is more easily achieved with open science.”

Microsoft Academic points to research articles and, where possible, provides a link to the full HTML or PDF document to facilitate open science. “For every publication online, there are around seven links discussing the same paper,” Dr. Wang explains. Microsoft Academic can use these other links to help find the full text of the article and make it easier to access.

Additionally, Dr. Wang says, “we have adopted an open approach in developing this service, and we want the community to get involved. We like to think what we have developed is community property so we have opened up our academic knowledge as a downloadable dataset and make key building components as cloud based services from Microsoft Cognitive Services. Everyone is welcome to test out the Academic Knowledge API instead of downloading the massive dataset over the internet.”

It seems that the Open Science movement is gaining momentum and, working together, we can all ensure that it becomes the future.

Dr. Wang was a keynote speaker at the Frontiers Open Minted Workshop at EPFL last February.
