Four ways to mine the academic literature for insights

Reviewing the academic literature around a topic is typically a qualitative process, where a small number of highly cited academic articles are studied in depth. But useful findings may also exist in the “long tail” of less cited articles. Depending on the research topic of interest, this “long tail” may consist of thousands or even tens of thousands of articles. To map the entire landscape of academic research around a topic, we therefore need a different approach, one which combines quantitative mapping with qualitative analysis.

This article outlines the use of AI-powered methods for literature review. More precisely, it describes step-by-step how to use Dcipher Analytics to:

  1. Spot key research topics
  2. Find relevant information on topics of interest
  3. Identify key researchers and institutions
  4. Map networks of research collaboration

Getting and prepping the data

First we need to download the data we want to study. Web of Science is a popular place to do this. We can use the Web of Science search box to define the criteria for what we are looking for:

In this case, we are interested in all articles in English about sustainability published between 2015 and 2018. In total, the search gives us 43,645 articles, which we download using the download option of the Web of Science platform. Since the number of articles that can be downloaded at once is limited, we need to make multiple downloads and then merge the files.
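The merge step can be done in many ways. As a minimal sketch, assuming each download was saved as a tab-delimited text file and that all files share the same header row, the files could be concatenated with Python's standard library. The function name and file layout are illustrative assumptions, not part of Web of Science or Dcipher Analytics.

```python
import csv


def merge_exports(paths, out_path, delimiter="\t"):
    """Concatenate delimited export files that share one header row.

    Returns the number of data rows written (header excluded).
    """
    header = None
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=delimiter)
        for path in paths:
            # utf-8-sig strips a byte-order mark if the export has one
            with open(path, newline="", encoding="utf-8-sig") as src:
                reader = csv.reader(src, delimiter=delimiter)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling merge_exports(["part1.txt", "part2.txt"], "merged.txt") with hypothetical file names would produce one file with a single header row. Raising an error on a header mismatch is a deliberate choice here: silently merging files with different columns is one way misaligned rows end up in a dataset.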

Once the data is downloaded and merged into one file, we upload it to Dcipher Analytics. The platform automatically detects the column separator and lets us import the file. We first name the dataset and inspect the data.

One way to check whether misalignments have been introduced into the dataset is to inspect columns that we are familiar with. In this case, we drag-and-drop the column containing publication year, which was used as a search criterion when the articles were downloaded. Discovering that values outside the expected interval are present, we remove all rows containing these values.
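The same sanity check can be sketched outside the platform. The example below assumes that each row is a dictionary and that the publication year sits in a field called "PY"; both the field name and the sample rows are assumptions made for illustration.

```python
def filter_by_year(rows, year_field="PY", first=2015, last=2018):
    """Split rows into those inside and outside the expected year range."""
    kept, dropped = [], []
    for row in rows:
        value = str(row.get(year_field, "")).strip()
        # Non-numeric values usually mean the columns have shifted
        if value.isdigit() and first <= int(value) <= last:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped


# Illustrative rows, not real data
rows = [
    {"TI": "Paper A", "PY": "2016"},
    {"TI": "Paper B", "PY": "2019"},      # outside the search interval
    {"TI": "Paper C", "PY": "Elsevier"},  # shifted column, not a year
]
kept, dropped = filter_by_year(rows)
print(len(kept), "kept,", len(dropped), "dropped")
```

Returning the dropped rows as well as the kept ones makes it possible to inspect what was removed before discarding it.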

1. Spotting key research topics

To get an overview of the sustainability research, we use the Keywords Plus field, which contains keywords describing the content of each article. Since the keywords are stored as a semicolon-separated text string, we first need to use the Split by pattern operation to split the string into a collection of keywords for each article. We can now work with the keywords individually.

Drag-and-dropping the new field to the Bubble view counts all the keywords and displays them as bubbles, where the bubble radius indicates the number of articles a keyword is used in. The bigger the bubble, the more prominent the topic is in sustainability research. We can use word mode to view the keywords as words instead of bubbles.
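For readers who want to see what splitting and counting amount to, here is a minimal sketch in plain Python. It is not how Dcipher Analytics implements these operations; the function names are hypothetical, and lowercasing the keywords is an added assumption so that differently cased variants are counted together.

```python
from collections import Counter


def split_keywords(raw, separator=";"):
    """Turn 'A; B; C' into ['a', 'b', 'c'], dropping empty pieces."""
    return [part.strip().lower() for part in raw.split(separator) if part.strip()]


def keyword_article_counts(keyword_strings):
    """Count the number of articles each keyword is used in."""
    counts = Counter()
    for raw in keyword_strings:
        # set(): a keyword counts once per article, even if repeated
        counts.update(set(split_keywords(raw)))
    return counts


# Illustrative keyword strings, one per article
sample = [
    "BIOFUELS; PYROLYSIS; GASIFICATION",
    "BIOFUELS; ENERGY CROPS",
    "PYROLYSIS; BIOFUELS",
]
counts = keyword_article_counts(sample)
print(counts.most_common(2))
```

The counts correspond to the bubble sizes described above: the more articles a keyword is used in, the larger its bubble.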

However, we are not only interested in the frequency of individual keywords, but in how the different keywords form topics. For this, we can use the Display as network option card in the Bubble view, which connects keywords that are often used together or in a similar context. Dropping the keywords field there results in a keyword network:

Navigating around in this network shows not only what topics are present and how large they are, but also how the different topics are related. For example, “biofuels” is connected to “gasification”, “pyrolysis”, and “energy crops”. The result is a picture of the overall structure of the sustainability research.

If we want to find the articles associated with each topic, we simply select the words that form the topic we are interested in and drag-and-drop them to the Document summary view. This triggers a scoring and sorting of all the articles based on the dropped keywords. The articles at the top are those that are most relevant for the given keywords.
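The platform's scoring method is not described here, so the following is only a simple stand-in: it scores each article by the number of selected keywords it carries and sorts by that score. The function name, the record layout, and the sample articles are all assumptions.

```python
def rank_articles(articles, selected_keywords):
    """Sort articles by how many of the selected keywords they carry."""
    selected = {kw.lower() for kw in selected_keywords}
    scored = []
    for article in articles:
        keywords = {kw.lower() for kw in article["keywords"]}
        score = len(keywords & selected)  # size of the overlap
        if score > 0:
            scored.append((score, article["title"]))
    # Highest score first; ties broken alphabetically by title
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Illustrative records, not real articles
articles = [
    {"title": "Paper A", "keywords": ["biofuels", "pyrolysis"]},
    {"title": "Paper B", "keywords": ["biofuels", "gasification", "pyrolysis"]},
    {"title": "Paper C", "keywords": ["water"]},
]
ranking = rank_articles(articles, ["biofuels", "gasification", "pyrolysis"])
print(ranking)
```

A real relevance score would typically also weight rare keywords more heavily than common ones; plain overlap is used here only to keep the idea visible.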

2. Finding information related to specific topics

While the keyword network approach is useful for surfacing interesting patterns in the data and exploring topics bottom-up, we may also be interested in searching for information about specific topics. Let’s say, for example, that we would like to know what research has been conducted in relation to the topic life-cycle assessment. If we were certain that all relevant articles use this particular phrase, we could simply eliminate all articles that do not contain the phrase. But life-cycle assessment could potentially be discussed in many different ways, without using the particular phrase, and manually figuring out all the different ways would take a long time. This is where deep learning-powered similarity search comes in handy.

First, though, we need to segment our article texts, in this case the abstracts, into pieces that can be analyzed. We do this by applying the Tokenization operation with lemmatization. This not only splits the abstracts into words, but also reduces inflected forms to a single base form. “Researching” and “researched”, for example, are reduced to the base form “research”. We do this to avoid having the same meaning spread out across multiple words.

With close to 44,000 abstracts, this takes a few minutes. Once it is done, we again drag-and-drop the tokens field to the Bubble view, where the tokens are counted and visualized. To get an overview of the tokens and their parts of speech, we can drag-and-drop the two fields to the Group by field option card in the Document summary view:

To search for the topic we are interested in, we use the search field at the top of the Bubble view to search for one or several keywords of interest. Clicking the filter icon gives us all tokens matching our search. We can now select the tokens we are interested in, right-click, and use them as a grouped seed.

The result is that all tokens that are related (in the sense that they are used in similar contexts in the abstracts) are found and presented to us. We can now make a selection of tokens that describe the topic we are interested in.
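To make the idea of “used in similar contexts” concrete, here is a deliberately simple, count-based sketch. It is not the deep learning model used by the platform: each token is represented by counts of the tokens that appear near it, and two tokens are compared with cosine similarity. All function names, the window size, and the toy corpus are assumptions.

```python
import math
from collections import Counter, defaultdict


def context_vectors(documents, window=2):
    """Map each token to counts of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in documents:
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(seed, vectors, top_n=5):
    """Rank all other tokens by their similarity to the seed token."""
    if seed not in vectors:
        return []
    scores = [(cosine(vectors[seed], vec), tok)
              for tok, vec in vectors.items() if tok != seed]
    scores = [(s, t) for s, t in scores if s > 0]
    return sorted(scores, key=lambda item: (-item[0], item[1]))[:top_n]


# Toy tokenized "abstracts", far too small for meaningful neighbours
docs = [
    ["life", "cycle", "assessment", "of", "biofuel"],
    ["lca", "of", "biofuel"],
    ["life", "cycle", "assessment", "of", "packaging"],
    ["lca", "of", "packaging"],
]
vectors = context_vectors(docs)
print(most_similar("lca", vectors, top_n=3))
```

In this toy corpus “lca” never appears in the same abstract as “assessment”, yet the two still receive a positive similarity because they share surrounding words. That is the property that lets a seed word surface related terms that a literal phrase search would miss.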

To view the articles most strongly associated with the topic, we drag-and-drop the selected keywords to the Document summary view to score and sort the articles.

By applying similarity search on multiple keywords (“seeds”), we can see which words are more strongly associated with which seed. Here is a comparison between China and the United States:

3. Identifying key stakeholders

Having gained an overview of the research topics, and of the research related to our specific topics of interest, we now want to know which institutions and researchers are active in the area we care about.

The first step is to filter the data so that only articles related to the specific topic of interest are included. We do that by applying a filter.

We now turn our attention to the two fields Institutions and Authors, containing the names of the institutions and authors involved in each paper. Like the keyword field, the names in these two fields are semicolon-separated, so we first use the operation Split by pattern to get individual names.

Drag-and-dropping the Institutions field to the Document summary view groups the articles by institution and shows the number of articles published by each institution.

Dropping more fields gives additional information. We can, for example, drop the field Citations twice to get the total and average number of citations for each institution. We can sort the table in ascending or descending order by any of the columns to rank the institutions by different metrics.
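The resulting table is a grouped aggregation. As a rough sketch of the same computation in plain Python, assuming each article record has a semicolon-separated "institutions" string and a numeric "citations" value (both field names and the sample records are hypothetical):

```python
from collections import defaultdict


def institution_stats(articles):
    """Return (institution, n_articles, total_citations, mean_citations) rows."""
    citations = defaultdict(list)
    for article in articles:
        # A set, so an institution listed twice on one paper counts once
        names = {n.strip() for n in article["institutions"].split(";") if n.strip()}
        for name in names:
            citations[name].append(article["citations"])
    table = [(name, len(c), sum(c), sum(c) / len(c)) for name, c in citations.items()]
    # Rank by total citations, highest first; ties broken by name
    return sorted(table, key=lambda row: (-row[2], row[0]))


# Illustrative records, not real data
articles = [
    {"institutions": "Univ A; Univ B", "citations": 10},
    {"institutions": "Univ A", "citations": 4},
    {"institutions": "Univ C", "citations": 12},
]
for row in institution_stats(articles):
    print(row)
```

Note that a co-authored article is credited in full to every institution involved, so the totals across institutions add up to more than the total number of citations in the dataset.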

The same steps can be followed for the Authors field, to see which individual researchers are most prominent.

4. Mapping networks of research collaboration

Looking at the list of top authors does not give the whole picture, though. What if the top authors on the list are all part of the same research group? In that case we may be more interested in the work of less prominent researchers to avoid redundancy.

The solution is to map the collaboration network of the authors. To do this, we follow similar steps as when we created the topic network. But instead of a similarity measure for the link strengths, we use the co-occurrence measure, which can be set in the settings bar. Co-occurrence gives the number of times two authors have co-authored an article.
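The co-occurrence measure itself is easy to state in code. The sketch below, with a hypothetical function name and made-up author names, counts for every pair of authors the number of articles on which both appear.

```python
from collections import Counter
from itertools import combinations


def coauthor_links(author_lists):
    """Count, for every pair of authors, the articles they wrote together."""
    links = Counter()
    for authors in author_lists:
        # Sorted so that (A, B) and (B, A) end up under the same key
        unique = sorted(set(authors))
        links.update(combinations(unique, 2))
    return links


# Illustrative author lists, one per article
papers = [
    ["Author A", "Author B", "Author C"],
    ["Author B", "Author A"],
    ["Author D"],
]
links = coauthor_links(papers)
for (first, second), weight in links.most_common():
    print(first, "-", second, weight)
```

Each pair is an edge in the collaboration network, and its count is the link strength. Single-author articles produce no edges, so their authors would appear as isolated nodes.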

The network shows who is collaborating with whom and how many relevant articles each author has published. We can change the settings so that the bubble radius instead indicates each author’s total or average number of citations in relation to the topic.

Dragging-and-dropping authors’ names to the Document summary view will display the articles they have authored.

The same steps can be followed to map the collaboration networks between institutions.

To try out for yourself, sign up for a free trial of Dcipher Analytics.
