As the director of a high-density materials storage facility operated on behalf of the University of California system, I am often engaged in conversations with campus libraries about holdings, duplication, and other collections-analysis issues. The UC regional library facilities (RLFs) have policies that prevent duplication and policies that encourage persistence, both great things, but we do not have the tools we need to help libraries make well-informed, efficient, data-grounded decisions about their collections (e.g. what can I send to a storage facility, which volumes in a serial does the facility already have, what other shared print programs hold this item?).
These issues are pressing enough that many of them are surfacing in a research project, led collectively by the UC Libraries, to study the workflows and data streams around the RLFs. In addition, a number of other projects, including the UC Libraries Federal Document Archive and a large-scale collection review at UC Berkeley, require us to ask questions grounded in metadata about individual holdings at multiple institutions.
There are already some great tools available to help with this work. UC Berkeley hosts a duplicate-detection tool that checks the OCLC database for holdings at the northern (NRLF) and southern (SRLF) regional library facilities. This tool works well for monographs, and for serials where there are no holdings, but because OCLC does not hold detailed holdings information, and because our API cannot reach the information it does have, the tool has limited effectiveness for serials. In addition, the tool works by having a library submit a list of OCLC numbers. That is great if you have a known list you want to check for duplication, but it does not work so well if you are asking the much more open question: what does my library have that is not at either RLF?
Motivated by the need for a different system to answer these questions, I took on the creation of a set of tools to help libraries ask them at the detailed item level. In doing so I found that I was driving towards a rudimentary data warehouse application: not because the data I had to analyze was overly complex, but because its scale started exceeding the capacity of my local computers.
To gather some of this work together, I decided to capture a few of the design decisions and activities here.
Through my design process I decided my application(s) needed to do a few things. I needed to:
- Be able to transform data from my various sources into a normalized set of metadata that was specific and appropriate to the questions I wanted to ask, but not overly complicated
- Design solutions that work at the 20-million-record level
- Create a platform that would support asking many types of questions with relative ease
- Create a system that allowed me to answer very specific volume/issue holdings questions
- Not require any manual manipulation of data

I’ll talk about these five needs in the context of data extraction, data analysis, and overall findings below.
Creating programs to transform the data turned out to be fairly easy, if not entirely straightforward. I had a good data source from the SRLF with item-level information, and I used that as my base, modeling data from NRLF and other sources on it. An overview (metadata_schema) shows which fields I was able to gather for all datasets.
I got data dumps from UC Berkeley and UCLA that contained the SRLF, NRLF and UC Berkeley data. To extract the data in a normalized format I wrote two separate programs:
- UCB/NRLF extraction (nrlf_ucb_marc_parser code): Using the PyMARC library, this program extracts data from the MARC records and outputs a tab-separated file that conforms to the metadata schema. The program features two functions designed to extract and normalize data. The first, fingerprint_keyer(), performs string normalization and tokenization; it is used to normalize various pieces of data from the MARC records, including the title, oclcnumber, and enumchron fields. The second, enum_keyer(), uses regular expressions to parse detailed enumeration information out of collapsed enumeration statements.
- SRLF extraction (srlf_processing_script code): This program is far simpler: it opens the input CSV file for SRLF and performs the normalization and data extraction steps through the same fingerprint_keyer() and enum_keyer() functions. Like the UCB/NRLF program, it outputs data in the shared schema.
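To make the two functions concrete, here is a minimal sketch of what they might look like. The fingerprint keyer follows the familiar OpenRefine-style fingerprint algorithm (lowercase, strip punctuation, sort and deduplicate tokens); the regexes in enum_keyer() are my simplified assumptions, since the real parsing rules are more involved.

```python
import re
import string


def fingerprint_keyer(value):
    """Normalize a string to a sorted, deduplicated token key
    (OpenRefine-style fingerprint): lowercase, strip punctuation,
    split on whitespace, dedupe, sort, rejoin."""
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)


def enum_keyer(enumchron):
    """Pull volume, number, and year out of a collapsed enumeration
    statement such as 'v.1:2 (1999)'. These patterns are a simplified
    guess at the real rules, for illustration only."""
    result = {"volume": None, "number": None, "date": None}
    m = re.search(r"v\.?\s*(\d+)", enumchron)      # e.g. 'v.1' or 'v1'
    if m:
        result["volume"] = m.group(1)
    m = re.search(r"(?:no\.?\s*|:)(\d+)", enumchron)  # 'no.2' or ':2'
    if m:
        result["number"] = m.group(1)
    m = re.search(r"\b(\d{4})\b", enumchron)       # four-digit year
    if m:
        result["date"] = m.group(1)
    return result
```

With these two keys, "The Journal, of Things!" and "journal of the things" collapse to the same fingerprint, which is what makes tokenized title matching possible later in the pipeline.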
In previous iterations of this work I relied largely on specialized Python scripts to find matches and overlapping holdings. This approach worked well, but the overhead of designing and error-checking a program to perform what are basically database or SQL-style comparisons meant that it was hard to ask a new question. Attempts to use more common data analysis tools (e.g. Microsoft Access and MySQL running on a small server) failed due to the large amount of data being analyzed.
Seeking a SQL-like tool that would work at larger scale, I experimented with the Google cloud tools, particularly Google BigQuery. BigQuery uses the Google Cloud Platform to support queries over very large tables. The figure below shows a sample query, run on a table from the UC Berkeley dataset, counting the languages of items. As you can see on the left-hand side of the screen, the UCDW schema contains a number of tables, including UCB, HathiTrust, and XRLF, each of which follows the standard schema. The other tables in the image were generated through queries; storing query results as tables is one method this platform uses to ensure fast results on large datasets.
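The language-count query itself is plain SQL. As an illustration, here is the shape of that query run against a small in-memory SQLite table standing in for the BigQuery UCB table; the table and column names (ucb, language) are my assumptions about the shared schema.

```python
import sqlite3

# SQLite stands in for BigQuery here; the table and column names
# are assumptions about the shared schema, not the real UCDW names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ucb (oclcnumber TEXT, title TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO ucb VALUES (?, ?, ?)",
    [("100", "journal a", "eng"),
     ("200", "zeitschrift b", "ger"),
     ("300", "journal c", "eng")],
)

# Count items per language, most common first. The same GROUP BY
# query runs essentially unchanged on the BigQuery platform.
rows = conn.execute(
    "SELECT language, COUNT(*) AS n FROM ucb "
    "GROUP BY language ORDER BY n DESC"
).fetchall()
print(rows)  # [('eng', 2), ('ger', 1)]
```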
Performance and cost
The most complex query run on this dataset to date looked for serial titles located at UCB that have no holdings at either NRLF or SRLF. This query compares almost 7 million items at UCB against 14 million items at NRLF/SRLF; comparison points include the OCLC number and a tokenized version of the title. While this query would take approximately 10 minutes to run as a Python program, and would not run at all as a SQL query in MySQL, the results return in about 3 minutes on the Google BigQuery platform.
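The core of that comparison is an anti-join: keep the UCB serials whose OCLC number and tokenized title both fail to match anything at the RLFs. A minimal in-memory sketch of the logic follows; the field names are assumptions for illustration, and on BigQuery the same result comes from a LEFT JOIN over the full tables rather than Python sets.

```python
def not_held_at_rlf(ucb_items, rlf_items):
    """Return UCB items with no RLF match on OCLC number or tokenized
    title. Each item is a dict with 'oclcnumber' and 'title_key'
    fields (names assumed for illustration)."""
    rlf_oclc = {r["oclcnumber"] for r in rlf_items}
    rlf_titles = {r["title_key"] for r in rlf_items}
    return [
        u for u in ucb_items
        if u["oclcnumber"] not in rlf_oclc
        and u["title_key"] not in rlf_titles
    ]


ucb = [
    {"oclcnumber": "1", "title_key": "a journal"},
    {"oclcnumber": "2", "title_key": "b review"},
]
rlf = [{"oclcnumber": "1", "title_key": "a journal"}]
print(not_held_at_rlf(ucb, rlf))  # only the 'b review' item remains
```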
During an active month of development the cost to work with this service was $0.70.
The Google BigQuery platform is driven by tables created from gzipped TSV files uploaded to the Google Storage platform. To create a table, a user generates a TSV file following a schema, gzips it, uploads it to the storage platform, and then loads the gzipped file into the SQL platform through a table wizard. When query results are downloaded, they are first exported by the system to the same storage space, from which they can be retrieved.
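Preparing a load file for this workflow is mechanical: write the normalized rows as TSV and gzip the result before uploading it to the storage bucket. A sketch of the local half of that process, with column names that are my assumptions about the shared schema:

```python
import csv
import gzip


def write_load_file(rows, path):
    """Write rows (dicts keyed by schema fields) as a gzipped TSV
    ready to load into a BigQuery table. The field names here are
    assumptions about the shared schema."""
    fields = ["oclcnumber", "title", "enumchron", "language"]
    # BigQuery loads headerless delimited files directly, so only
    # data rows are written.
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields,
                                delimiter="\t", lineterminator="\n")
        for row in rows:
            writer.writerow(row)


write_load_file(
    [{"oclcnumber": "100", "title": "journal a",
      "enumchron": "v.1 1999", "language": "eng"}],
    "ucb_load.tsv.gz",
)
```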
This process means that, even without further user-interface development, this platform could provide libraries with a manageable way to upload data into a shared or individualized query space and run pre-defined queries against a central data store.
Findings and next steps
The use of a set of Python programs for data extraction and analysis provides a sustainable way to process data. All data extraction was completed on a desktop computer with a quad-core i5 processor, 32 GB of RAM, and an SSD. Memory, CPU, and disk speed were all factors in processing time, but when the work is executed in parallel to optimize CPU usage it takes approximately 15-20 minutes on this platform to process 14 million MARC records. Processing CSV files takes considerably less time.
For more in-depth comparison of data, I found the Google platform far superior to the other methods tested. It seems like an easy way to scale analysis up, and potentially to engage other users, without considerable development time.
It was unfortunate that, just as I was undertaking the comparison of data, OCLC deprecated its xID service. This API-based service would have allowed me to extract considerably more data around OCLC number version history, as well as ISSN and ISBN versions; that information would have greatly improved the accuracy of OCLC number-based matching. Without this service I found tokenized title matching to be an accurate solution. At the moment UC Berkeley is working through a set of 42,000 serial titles that, according to this process, do not have holdings at NRLF or SRLF. A quality-control check shows that the process is largely accurate, with some false positives (i.e. titles returned in the data that do have holdings at NRLF or SRLF). I have not tested the data for false negatives (i.e. titles left out of the data that in fact lack holdings at NRLF/SRLF).
I have consulted with OCLC staff about an alternative technique for getting the historic ISSN data associated with serials, as this seems to be an area where we need better bibliographic data. I have outlined a technique using the OCLC Linked Data service but have not yet had a chance to operationalize it.
An example of a document generated through this process for an actual collections decision is this list of serial titles held in the Main Stacks at UCB that have no holdings at either RLF (ucb_main_serials_notin_xrlf).
The current enumeration-extraction function uses rudimentary regular expressions to extract volume, number, and date information based on common punctuation (e.g. ':', '-', '()'). I also tried some more experimental methods focused on finding patterns and automatically or manually writing functions to handle the enumeration statements that fit those patterns (e.g. 'v1:2 (1999)' is broken out into volume 1, number 2, date 1999). In broad strokes, I found around 142,751 distinct patterns in the 21 million records I analyzed. With further generalization I brought the number of unique patterns down to 13,678. I believe that, with further consideration of programming techniques to extract data based on these identified patterns, we may be able to create a highly reliable enumeration-extraction process.
The data below shows one such list of extracted enumeration patterns with a count of the occurrences of each pattern. Normalizing by punctuation alone produced 142,751 rows; when I adjusted the program (enum_pattern_generator code) to translate the numbers down as well, I was able to distill those 142,751 rows into 13,678 unique patterns (see the sample at the end of this section).
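The number-translation step can be sketched as a pair of regex substitutions: replace every four-digit run (a year) with 'dddd' and every remaining digit run with 'z', so that 'v.1 1999' and 'v.23 2001' collapse to the same pattern 'v.z dddd'. This is my reconstruction of the idea, not the actual enum_pattern_generator code, and it omits some distinctions visible in the real output (such as 'dd' for two-digit years).

```python
import re


def generalize_enum(enumchron):
    """Collapse an enumeration statement into a pattern by replacing
    four-digit runs (years) with 'dddd' and any other digit run with
    'z', keeping punctuation intact. A simplified reconstruction of
    the pattern-generalization step, not the original code."""
    pattern = enumchron.strip().lower()
    pattern = re.sub(r"\d{4}", "dddd", pattern)  # years first
    pattern = re.sub(r"\d+", "z", pattern)       # remaining numbers
    return pattern


print(generalize_enum("v.1 1999"))      # v.z dddd
print(generalize_enum("v.23 2001"))     # v.z dddd
print(generalize_enum("no.3 yr.2001"))  # no.z yr.dddd
```

Counting the distinct outputs of a function like this over all 21 million enumchron strings is what yields the pattern frequency list shown below.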
I believe that the proof of concept here can inform next-steps thinking about collection analysis work that would be of value in helping campuses analyze their holdings in relation to RLF holdings as well as other shared print/digital platforms. While the data contained here is a snapshot (i.e. it would need to be updated regularly to be valuable long term), simply refreshing the data monthly would provide campuses with a cycle on which they could perform comparisons.
I also believe that the work completed on identifying enumeration patterns may hint at a feasible strategy for working through the problem of holdings comparison for serials. Such a process would be of immediate value to the UC libraries but also could help libraries more broadly undertake collections analysis work in their own collections.
Enumeration pattern example
# enumcount, enumpatterns
487389, v.z dddd
451532, v.z yr.dddd
141676, v.z no.z yr.dddd mo.mmm.
138499, v.z:z-z mmm-mmm dddd
108013, v.z-z dddd-dd
107466, v.z:z-z dddd
103756, no.z yr.dddd
81453, v-.z dddd
71833, v.z no.z yr.dddd