Metaknowledge Perspective on Russian Studies

Data

To create the main dataset, we employed diverse bibliometric methods to identify papers with a focus on Russia. The data collection process included seeding a pilot dataset for keyword extraction, selecting keywords, retrieving the primary dataset, expert selection of relevant papers, and cleaning affiliation information. The final dataset consists of 25,851 articles retrieved from the Web of Science for the period 1990 – 2020.
Web of Science
We rely on the Web of Science (WoS) database, which includes several citation indices. The papers were retrieved from the Social Sciences Citation Index and the Arts & Humanities Citation Index. Our project is based on WoS data because WoS provides better coverage of the international scientific literature for the period 1990 – 2000 than the Scopus database. The Emerging Sources Citation Index was excluded in order to obtain results that are stable and comparable over time. We were not able to consider book publications, which is an important limitation in the social sciences and humanities.
Selecting keywords
First, we constructed a list of relevant journals in Russian, Eurasian, and East European Studies to form the initial set of keywords. We queried WoS to retrieve the articles published in those journals. These papers did not constitute the dataset itself; rather, they served as a source of keywords, extracted from titles, that were subsequently used to form the final corpus. The total number of keywords was 39,675, excluding stop words; only nouns and proper nouns were retained as potentially useful keywords. As expected, Russian and Russia were the most frequent words in the list. Because of numerous false positives, the list had to be filtered manually: four coders read the list and kept only relevant words. A keyword was considered suitable if at least three of the four experts marked it as relevant, which yielded a list of 1,125 keywords.
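To illustrate this selection step, the following Python sketch shows how keyword candidates might be counted from titles and how the three-of-four agreement rule works. The example titles, the stop-word list, and the vote format are our own assumptions; the text does not specify which tools were used, and the part-of-speech filtering applied in the actual pipeline is omitted here.

```python
from collections import Counter

# Illustrative titles standing in for articles from the seed journals.
titles = [
    "Economic transition and reform in post-Soviet Russia",
    "Russian identity and the politics of memory",
]
STOP_WORDS = {"and", "the", "in", "of", "a", "an", "on", "for"}

# Count candidate keywords; the actual pipeline additionally kept only
# nouns and proper nouns, which would require a POS tagger.
candidates = Counter(
    w for t in titles for w in t.lower().split() if w not in STOP_WORDS
)

# Hypothetical coder votes: 1 = relevant, 0 = not relevant.
votes = {"russia": [1, 1, 1, 1], "soviet": [1, 1, 1, 0], "memory": [0, 1, 0, 0]}

# A keyword survives if at least three of the four coders mark it relevant.
selected = [kw for kw, v in votes.items() if sum(v) >= 3]
print(selected)  # ['russia', 'soviet']
```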

As an additional step, we inspected the keywords the experts had rejected for possible associations with Russia when embedded in certain phrases. We constructed 2- and 3-grams from the excluded words and their nearest meaningful neighbors in the respective titles. In most cases, these phrases represented the names of famous individuals, literary characters, and geographical objects; notable examples include Vladimir Dal, Dead Souls, and the Bronze Horseman. In the end, these phrases, together with the previously selected keywords and the two “necessary” words (Russia and Russian), made up a set of 1,271 words and phrases used for the subsequent article query.
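A minimal sketch of the n-gram construction follows. Note one simplification flagged in the comments: the actual procedure paired excluded words with their nearest *meaningful* neighbors (i.e., it skipped stop words), which this version does not do.

```python
def ngrams_around(tokens, target):
    """Collect the 2- and 3-grams from a tokenized title that contain
    the target token.  Simplification: stop words between the target
    and its neighbours are not skipped here, unlike in the study."""
    grams = set()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for n in (2, 3):
            lo, hi = max(0, i - n + 1), min(i, len(tokens) - n)
            for start in range(lo, hi + 1):
                grams.add(" ".join(tokens[start:start + n]))
    return grams

title = "the bronze horseman in petersburg texts"
print(ngrams_around(title.split(), "horseman"))
# includes 'bronze horseman' and 'the bronze horseman'
```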
Forming a primary dataset
We used the list of keywords to retrieve all academic papers written in English during the period 1990 – 2020. To be retrieved, a paper had to contain at least one word from the list in its title, abstract, or keywords. Our initial WoS query, conducted in January 2022, yielded 55,709 records; only the document types article and review were taken into account. Since this list was likely to contain irrelevant papers, additional steps were needed to produce a corpus of articles suitable for further analysis. Our next step was to resort to expert assessment once again to narrow down the dataset, leaving only relevant publications. This step was necessary because querying articles by keywords may return partially or completely unrelated documents. Since articles containing Russia in their title can be treated as relevant with a substantial degree of certainty, such papers were not subject to expert assessment and were immediately marked relevant. Thus, four experts received a reduced dataset of 40,647 papers to be checked for topical relevance; they read the titles and examined the keywords and abstracts. For the whole coded dataset, agreement and partial agreement constituted approximately 68.5% and 96.8%, respectively. Overall, an article was accepted if it contained the substring Russia in its title or if at least three out of four experts marked it as related. 29,826 papers (roughly 54% of the whole corpus) met this criterion.
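The acceptance rule can be summarized in a few lines of Python. The sketch below also shows one plausible reading of “agreement” (all four coders concur) and “partial agreement” (at least three concur); the text does not define these terms precisely, so that part is an assumption.

```python
def is_relevant(title, votes):
    """Accept a paper if 'Russia' occurs as a substring of the title
    (this also matches 'Russian'), or if at least three of the four
    experts marked it as related (1)."""
    return "russia" in title.lower() or sum(votes) >= 3

def agreement_stats(all_votes):
    """Assumed definitions: full agreement = all four votes identical;
    partial agreement = at least three identical votes."""
    full = sum(1 for v in all_votes if len(set(v)) == 1)
    partial = sum(1 for v in all_votes if max(v.count(0), v.count(1)) >= 3)
    return full / len(all_votes), partial / len(all_votes)

print(is_relevant("Media framing in Russia", [0, 0, 0, 0]))        # True
print(agreement_stats([[1, 1, 1, 1], [1, 0, 1, 1], [1, 0, 0, 1]]))  # (0.33..., 0.66...)
```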
Cleaning author affiliation data
Since this study aims at constructing institutional rankings, it was vital to analyze author affiliations meticulously, based on the author address field provided by WoS. In general, this field is a list of strings that correspond to the authors and contain information on their respective institutions. After inspecting the author addresses, it became evident that the available information had to be enriched because of the following issues: an empty field; missing data for one or more authors (affiliations were partly or completely missing for 12.2% of articles in the dataset, and there was no information for 8.6% of scientists); several authors with similar or identical names; and unstructured or poor-quality affiliation data.

To address the first two problems, we manually collected the missing data from other sources (such as journal pages and the research profiles of scholars) with external assistance. We also addressed the problem of unstructured affiliation data by using the refsplitr R package, which helps break affiliations down into key components (country, city, institution) and disambiguate entities where possible. With the above steps taken, countries and institutions were identified in roughly 87% and 74% of cases, respectively.
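For readers unfamiliar with the raw WoS address format, the following Python sketch mimics the kind of decomposition that refsplitr performs. It is a simplified illustration, not the package itself (refsplitr is an R package, and real addresses, such as US ones ending in state and ZIP codes, require far more careful handling than this).

```python
import re

def parse_wos_address(raw):
    """Roughly split a WoS C1 address string into (institution, city,
    country).  Simplified: irregular formats, e.g. US-style endings
    such as 'Cambridge, MA 02138 USA', are not handled here."""
    # Drop the bracketed author list, e.g. "[Ivanov, I.] ...".
    addr = re.sub(r"^\[.*?\]\s*", "", raw).rstrip(". ")
    parts = [p.strip() for p in addr.split(",")]
    institution = parts[0] if parts else None
    city = parts[-2] if len(parts) >= 3 else None
    country = parts[-1] if len(parts) >= 2 else None
    return institution, city, country

print(parse_wos_address("[Ivanov, I.] Higher Sch Econ, Dept Sociol, Moscow, Russia."))
# ('Higher Sch Econ', 'Moscow', 'Russia')
```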

As an additional step, we consolidated institution names to avoid counting the same institution multiple times under different names. This problem was addressed by writing additional code as well as by manually inspecting particularly difficult cases.
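The consolidation logic can be approximated as an alias table combined with fuzzy matching, along the following lines; the alias entries and the similarity cutoff are hypothetical, and the actual code is not described in detail in the text.

```python
import difflib

# Hypothetical alias table assembled during manual inspection.
CANONICAL = {
    "higher sch econ": "HSE University",
    "natl res univ higher sch econ": "HSE University",
    "moscow mv lomonosov state univ": "Lomonosov Moscow State University",
    "lomonosov moscow state univ": "Lomonosov Moscow State University",
}

def consolidate(name, cutoff=0.85):
    """Map a raw institution string to its canonical name: exact lookup
    first, then fuzzy matching against the known aliases.  Cases below
    the cutoff correspond to the manually inspected difficult cases."""
    key = name.lower().strip().rstrip(".")
    if key in CANONICAL:
        return CANONICAL[key]
    close = difflib.get_close_matches(key, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[close[0]] if close else name

print(consolidate("Natl Res Univ Higher Sch Econ."))  # 'HSE University'
```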
Enriching the classification of research areas
Even though our primary emphasis was on Russian studies as an interdisciplinary domain, the ability to analyze articles within more specific research areas is still desirable. For this purpose, we inspected the WoS categories and took several steps to modify them. First, we merged several related areas into larger ones (for example, Business, Economics, and Management were merged into one category) to shorten the list of research areas. Next, for papers assigned multiple categories and/or the “multidisciplinary” category, we attempted to refine the research field by exploring the Cited References field and identifying the most frequently occurring research areas among the journals to which the article in question referred. Overall, we were able to reduce the share of “unclarified” categories from approximately 21% to 10%. The identified areas coincided with those initially contained in the WoS Category field in almost 55% of cases.
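The reassignment step amounts to a majority vote over the research areas of the cited journals. A minimal sketch follows, assuming that ties and empty reference lists fall back to “unclarified”; the text does not state the actual tie-breaking rule.

```python
from collections import Counter

def infer_area(own_areas, cited_journal_areas):
    """Keep an unambiguous WoS category as-is; otherwise pick the most
    frequent research area among the journals cited by the paper."""
    if len(own_areas) == 1 and own_areas[0].lower() != "multidisciplinary":
        return own_areas[0]
    counts = Counter(
        a for a in cited_journal_areas if a.lower() != "multidisciplinary"
    )
    return counts.most_common(1)[0][0] if counts else "unclarified"

print(infer_area(["Multidisciplinary"],
                 ["Political Science", "Political Science", "History"]))
# 'Political Science'
```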

Indicators

For the Global Ranking of Expertise in Russia, we let users select the type of bibliometric indicator instead of providing a single aggregated indicator. We provide both size-dependent and size-independent indicators, so performance can be presented in absolute or relative terms, such as the number of highly cited publications or the share of those publications in the overall output. Size-dependent indicators tend to privilege organizations with a larger number of publications; it is therefore important to also provide size-independent indicators.
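Concretely, the distinction looks as follows. P(top 10%) and PP(top 10%) are the Leiden-style names for the absolute and relative variants of the highly-cited-publications indicator; they are used here purely for illustration, as the text does not list the specific indicators.

```python
def impact_indicators(n_pubs, n_top10):
    """Size-dependent P(top 10%): the number of an organization's
    publications among the 10% most cited in their field and year.
    Size-independent PP(top 10%): the share of such publications."""
    p_top10 = n_top10
    pp_top10 = n_top10 / n_pubs if n_pubs else 0.0
    return p_top10, pp_top10

# A large and a small organization: the large one wins on the absolute
# indicator, the small one on the relative indicator.
print(impact_indicators(2000, 180))  # (180, 0.09)
print(impact_indicators(100, 15))    # (15, 0.15)
```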

Based on the methodology of the Leiden Ranking, we included indicators of scientific impact. These indicators can be calculated using either a full counting or a fractional counting procedure. The full counting method assigns full weight to a publication for each organization, even if the publication was written by authors affiliated with several organizations. The fractional counting method gives less weight to collaborative publications by dividing the credit equally between the organizations involved. Fractional counting is the preferable method for constructing the scientific impact indicators.
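The difference between the two counting procedures is easiest to see in code; a minimal sketch:

```python
def credit_per_publication(orgs, scheme="fractional"):
    """Distribute credit for one publication among the distinct
    organizations on its byline.  Full counting: each organization
    receives a weight of 1.  Fractional counting: the single unit of
    credit is divided equally between the organizations."""
    weight = 1.0 if scheme == "full" else 1.0 / len(orgs)
    return {org: weight for org in orgs}

# A paper co-authored across three organizations:
print(credit_per_publication(["HSE", "Oxford", "Helsinki"], "full"))
# {'HSE': 1.0, 'Oxford': 1.0, 'Helsinki': 1.0}
print(credit_per_publication(["HSE", "Oxford", "Helsinki"]))
# {'HSE': 0.333..., 'Oxford': 0.333..., 'Helsinki': 0.333...}
```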

An important caveat is that, compared with the most common bibliometric rankings, a specialized ranking inevitably includes organizations that in some periods may have only a small number of publications. We acknowledge that small numbers add uncertainty to the results of academic rankings.
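One common way to quantify this uncertainty is a bootstrap stability interval of the kind reported by the Leiden Ranking; the sketch below is our illustration only, since the text does not say whether such intervals were computed here.

```python
import random

def stability_interval(top10_flags, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) stability interval for the share of
    highly cited papers: resample the organization's publication list
    with replacement and record the share in each replicate."""
    rng = random.Random(seed)
    n = len(top10_flags)
    shares = sorted(
        sum(rng.choices(top10_flags, k=n)) / n for _ in range(n_boot)
    )
    return (shares[int(alpha / 2 * n_boot)],
            shares[int((1 - alpha / 2) * n_boot) - 1])

# An organization with only 12 papers, 3 of them highly cited,
# yields a wide interval around the point estimate of 0.25:
print(stability_interval([1, 1, 1] + [0] * 9))
```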