Digital Spotlight: Text Mining ‘TCB’
November 02, 2022
With the publication of the latest issue of TCB: Technical Services in Religion & Theology, editor-in-chief Christa Strickler’s colleague Eric Lease Morgan decided to explore this Atla serial through text mining. Eric is the Digital Initiatives Librarian at the Navari Family Center for Digital Scholarship at the University of Notre Dame. Among his research areas are machine learning and text mining. Below are the results of his explorations.
Where in the World is TCB?
A colleague (Christa Strickler) announced on a mailing list (ACQNET) the existence of a new issue of TCB (Technical Services in Religion & Theology). It was touted as an open access journal, and I wondered whether or not there was an application programming interface (API) for downloading the content. After a bit of rooting around, I discovered that TCB is published using a system called Open Journal Systems (OJS), and OJS rigorously supports a protocol called OAI-PMH. So, to answer my question, “Yes, TCB does support an API.”
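For readers unfamiliar with OAI-PMH, a harvest amounts to sending a ListRecords request to a repository's endpoint and reading the Dublin Core metadata out of the XML that comes back. The sketch below is not the OJS Toolbox Redux code. It is a minimal Python illustration in which the function names, the sample identifier, and the resumption token are all made up, and the base URL passed to list_records_url is whatever endpoint a given journal happens to expose.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, token=None):
    """Build a ListRecords request URL (base_url is an assumption, not TCB's)."""
    if token:
        # Follow-up requests carry only the verb and the resumption token
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)

def parse_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "creator": record.findtext(".//dc:creator", default="", namespaces=NS),
            "date": record.findtext(".//dc:date", default="", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing token means there are no more pages to fetch
    return records, (token.strip() or None)

# A hand-written sample response; the identifier and token are invented
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:article/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>RDA Toolkit &amp; Examples</dc:title>
          <dc:creator>Lynn Berg</dc:creator>
          <dc:date>2010-08-01</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE)
```

A real harvest would fetch each URL built by list_records_url, parse the response, and repeat with the returned token until no token remains.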
I then wondered how easy it would be to actually harvest/cache/acquire the content of TCB and then do some analysis against the result. Coincidentally, I played with this same idea a few years ago. More specifically, I wrote a system of software to harvest the whole of another journal, Information Technology And Libraries. Looking through my archives, I found the desired code, updated it to point to TCB, and downloaded the whole of TCB in less than two minutes. The code to do this work is called OJS Toolbox Redux, and the resulting content ought to be here in the cache. Want some of the metadata describing each content item? Take a gander at ./metadata.csv.
I then wanted to get my mind around the whole of TCB. What sorts of things were discussed? How have those things ebbed & flowed over time? To answer these questions, I used a thing called the Distant Reader Toolbox to create a data set from the cache and then do some analysis (“reading”) against it. Here are a few of the rudimentary things I discovered:
- The cache includes 287 items.
- Items are dated from 2010 to the present; see the rudimentary bibliography.
- The sum total of words in the cache is just over 295,000. By comparison, the sum total of words in Melville’s Moby Dick is about 250,000, and the Bible is about 800,000 words long.
The following word clouds illustrate the frequency of unigrams, bigrams, and statistically significant keywords found in the corpus. The content of the corpus lives up to its name, obviously.
(If you want to take a gander at some additional characteristics of this data set, then check out the rudimentary index page.)
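The unigram and bigram counts behind word clouds like these can be approximated in a few lines. This is only a sketch using the Python standard library, not the Distant Reader Toolbox's own method; the stop list is deliberately tiny, the sample sentence is invented, and it does not attempt the statistically significant keywords.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real analyses use a much longer one
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def ngram_counts(tokens, n=1):
    """Count n-grams (n=1 for unigrams, n=2 for bigrams) in a token list."""
    # zip over n shifted copies of the list yields each run of n adjacent tokens
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# An invented sentence standing in for the corpus
text = "Library cataloging and the library catalog: cataloging rules for library cataloging."
tokens = tokenize(text)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Because stop words are removed before pairing, a bigram can bridge a dropped word or a punctuation mark; that is acceptable for a quick overview but worth remembering when reading the counts.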
I then applied topic modeling to the corpus, and since the title has been in existence for twelve years, I topic modeled for twelve topics. This resulted in the following enumeration of themes, and the pie chart illustrates the dominance of the themes across the whole. Again, the result echoes the name of the journal.
labels      weights  features
library     0.07629  library course data cataloging digital metadat...
rda         0.06975  rda atla cataloging library funnel naco conser...
cataloging  0.06966  library cataloging quarterly classification se...
terms       0.06435  rda terms library cataloging religion form atl...
records     0.05667  records data record library cataloging use inf...
class       0.03728  class topics individual theology cataloging li...
heading     0.02412  heading music field headings add terms genre/f...
cancel      0.02192  cancel church heading religious theology chris...
india       0.02151  class india history literature information chu...
collection  0.01955  collection openathens library oer resources ht...
tcb         0.01453  library tcb maps information cataloging san ma...
field       0.01361  heading field add literature former bible chan...
To illustrate how these themes ebbed & flowed over time, I augmented the underlying topic model with a year column, pivoted the model, and created the following stacked area chart. From the result, we can see that the topic of “rda” was predominant between 2010 and 2012. We can see that “terms” rose to prominence just after that, but upon closer inspection, “terms” was still a lot about “rda.” We can also see that the themes of “cataloging” and “library” are pretty consistent throughout time.
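The pivot step can be sketched as follows. This is an assumption about the general shape of the computation, not the code actually used: it takes rows of (year, topic, weight), sums the weights per topic within each year, and normalizes each year to shares, which is the form a stacked area chart needs. The function name and the sample weights are invented for illustration.

```python
from collections import defaultdict

def pivot_by_year(rows):
    """Turn (year, topic, weight) rows into {year: {topic: share of that year}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for year, topic, weight in rows:
        totals[year][topic] += weight
    pivoted = {}
    for year in sorted(totals):
        # Fall back to 1.0 so a year whose weights are all zero does not divide by zero
        year_sum = sum(totals[year].values()) or 1.0
        pivoted[year] = {
            topic: weight / year_sum
            for topic, weight in sorted(totals[year].items())
        }
    return pivoted

# Made-up document-level topic weights, for illustration only
rows = [
    (2010, "rda", 0.75), (2010, "library", 0.25),
    (2011, "rda", 0.5), (2011, "terms", 0.5),
    (2011, "rda", 0.25), (2011, "terms", 0.75),
]
table = pivot_by_year(rows)
for year, shares in table.items():
    print(year, shares)
```

Each inner dictionary becomes one vertical slice of the chart, with years along the horizontal axis and topic shares stacked on top of one another.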
You might ask, “Given this analysis, can you recommend some salient articles elaborating on the themes?” And my answer is, “Sure!” For example, a theme seems to be “rda.” Searching the underlying data set’s full-text database, the following three articles are specifically about RDA and have RDA in the title:
- RDA Toolkit & Examples by Lynn Berg (2010-08-01)
- My Experience with RDA, Part 1: Overview by Armin Siedlecki (2011-02-01)
- My Experience with RDA, Part 2: Examples by Armin Siedlecki (2011-05-01)
One article specifically about TCB is an editorial by Cynthia Snell (2021-08-23). From the computed summary:
Therefore, in order to appeal to a broader audience—including persons acquiring and cataloging materials at museums and archives—as well as to provide opportunities for interdisciplinary engagement with other library technical services professionals, we will roll out an expanded TCB beginning in 2022. TCB will remain a publication that focuses on the needs of technical services professionals, transforming from a publication for catalogers of materials in religion and theology to one that addresses the interests of all technical services staff who may be working with materials in religion and theology. The Editors
They say that if you have a hammer, then everything begins to look like a nail. Well, my current hammer is the Distant Reader Toolbox, and I enjoy using the Toolbox to practice librarianship. With the Toolbox, I create and curate collections. I then provide services against them. This missive outlined one of my explorations.
If you want to play with this collection, then begin by downloading the data set, which is temporarily available at http://distantreader.org/tmp/tcb/etc/reader.zip. The compressed zip file is made up of mostly plain text files, a relational database, and a few images. You can then use the Toolbox or any number of other tools to do your own analysis. Other tools include Wordle, OpenRefine, AntConc, any spreadsheet or database program, or even your text editor. Enjoy!
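A quick way to get your bearings in a downloaded archive is to count its files by extension. The sketch below uses only the Python standard library; since the internal layout of reader.zip is not described here, it runs against a tiny stand-in archive built in memory with invented file names. Pass the path of the real download to summarize_zip to see what it actually contains.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath

def summarize_zip(source):
    """Count the files in a zip archive by extension; source is a path or file object."""
    with zipfile.ZipFile(source) as archive:
        # Skip directory entries, whose names end with a slash
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return Counter(PurePosixPath(n).suffix.lower() or "(none)" for n in names)

# Build a stand-in archive in memory; these file names are hypothetical
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("reader/txt/a.txt", "plain text")
    archive.writestr("reader/txt/b.txt", "more plain text")
    archive.writestr("reader/etc/reader.db", "")

summary = summarize_zip(buffer)
for extension, count in summary.most_common():
    print(extension, count)
```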