Collections analysis techniques using library data

As the director of a high-density materials storage facility operated on behalf of the University of California system, I am often engaged in conversations with campus libraries about holdings, duplication, and other collections analysis issues.  The UC regional library facilities (RLFs) have policies to prevent duplication and policies that encourage persistence, both great things, but we do not have the tools we need to help libraries make well-informed, efficient, data-grounded decisions about their collections (e.g. what can I send to a storage facility, which volumes of a serial does the facility already have, which other shared print programs hold this item).

These issues are pressing enough that many of them are surfacing in a research project led collectively by the UC Libraries to study the workflows and data streams around the RLFs. In addition, a number of other projects, including the UC Libraries Federal Document Archive and a large-scale collection review at UC Berkeley, require us to ask questions grounded in metadata about individual holdings at multiple institutions.

There are already some great tools out there to help with this work.  UC Berkeley hosts a duplicate detection tool that checks the OCLC database for holdings at the northern (NRLF) and southern (SRLF) regional library facilities.  This tool works well for monographs, and for serials where there are no holdings, but because OCLC does not have detailed holdings information, and because our API interface cannot reach the information it does have, the tool has limited effectiveness for serials.  In addition, the tool works by having a library submit a list of OCLC numbers.  This is great if you have a known list you want to check for duplication, but it does not work so well if you are asking the much more open question – what does my library have that is not at either RLF?

With this motivation in mind, I took on the creation of a set of tools to help libraries ask these questions at the detailed item level.  In doing so I found that I was driving towards a rudimentary data warehouse application – not because the data I had to analyze was overly complex but because the scale of the data exceeded the capacity of my local computers.

To gather some of this work together, I decided to capture a few of the design decisions and activities here.

Needs assessment

Through my design process I decided my application(s) needed to do a few things.  I needed to:

  1. Be able to transform data from my various sources into a normalized set of metadata that was specific and appropriate to the questions I wanted to ask but not overly complicated.
  2. Design solutions that work at the 20M-record level.
  3. Create a platform that would support asking many types of questions with relative ease.
  4. Create a system that allowed me to answer very specific volume/issue holdings questions.
  5. Avoid any manual manipulation of data.

I'll talk about these five needs in the context of data extraction, data analysis and overall findings below:

Data extraction

Creating programs to transform the data turned out to be fairly easy, if not entirely straightforward.  I had a good data source from the SRLF with item-level information and used it as my base, modeling data from NRLF and other sources on it.  An overview (metadata_schema) shows which fields I was able to gather for all datasets.

I got data dumps from UC Berkeley and UCLA that contained the SRLF, NRLF and UC Berkeley data.  In order to extract the data in a normalized format I wrote two separate programs.

  1. UCB/NRLF extraction (nrlf_ucb_marc_parser code):  Using the PyMARC library, this program extracts data from the MARC records and outputs a tab-separated file that conforms to the metadata schema.  The program features two functions designed to extract and normalize data.  The first function, fingerprint_keyer(), performs string normalization and tokenization.  It is used to normalize various pieces of data from the MARC records including the title, oclcnumber and enumchron fields.  The second important function, enum_keyer(), uses regular expressions to parse detailed enumeration information out of collapsed enumeration statements.  A rough sketch of both functions appears after this list.
  2. SRLF extraction (srlf_processing_script code):  This program is far simpler: it opens the input CSV file for SRLF and performs the same normalization and data extraction steps through the fingerprint_keyer() and enum_keyer() functions.  Like the UCB/NRLF program it outputs data in the shared schema.
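The production code is linked above; what follows is only a rough sketch of the two normalization helpers under my own assumptions about the tokenization rules and regular expressions, not the exact logic in the nrlf_ucb_marc_parser or srlf_processing_script code.

```python
# Approximate sketches of the two normalization helpers described above.
# The tokenization rules and regexes are assumptions, not the production code.
import re
import unicodedata


def fingerprint_keyer(value):
    """Normalize a string into a sorted, de-duplicated token fingerprint."""
    if value is None:
        return ""
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    value = value.lower().strip()
    value = re.sub(r"[^\w\s]", "", value)   # drop punctuation
    tokens = sorted(set(value.split()))      # unique, ordered tokens
    return " ".join(tokens)


def enum_keyer(enumchron):
    """Pull volume, number and year out of a collapsed enumeration
    statement such as 'v.1:no.2 (1999)'.  Field layout is an assumption."""
    if enumchron is None:
        return {"volume": None, "number": None, "year": None}
    volume = re.search(r"v\.?\s*(\d+)", enumchron, re.IGNORECASE)
    number = re.search(r"no\.?\s*(\d+)", enumchron, re.IGNORECASE)
    year = re.search(r"\b(1[6-9]\d{2}|20\d{2})\b", enumchron)
    return {
        "volume": volume.group(1) if volume else None,
        "number": number.group(1) if number else None,
        "year": year.group(1) if year else None,
    }


if __name__ == "__main__":
    print(fingerprint_keyer("The Journal of Library Metadata"))  # journal library metadata of the
    print(enum_keyer("v.1:no.2 (1999)"))  # {'volume': '1', 'number': '2', 'year': '1999'}
```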

Data analysis

In previous iterations of this work I relied largely on specialized Python scripts to find matches and overlapping holdings.  This approach worked, but the overhead associated with designing and error-checking a program to complete what were basically database or SQL-style comparisons meant that it was hard to ask a new question.  Attempts to use more common data analysis tools (e.g. Microsoft Access and MySQL running on a small server) failed due to the large amount of data being analyzed.

Seeking an SQL-like tool that would work at a larger scale, I experimented with the Google cloud tools, particularly Google BigQuery.  BigQuery uses the Google cloud platform to support queries over very large tables.  The figure below shows a sample query run on a table from the UC Berkeley dataset counting the languages of items.  As you can see on the left-hand side of the screen, the UCDW schema contains a number of tables including UCB, HathiTrust, and XRLF, each of which follows the standard schema.  The other tables in the image were generated through queries.  Storing query results in tables is one method this platform uses to ensure fast results on large datasets.

[Figure: sample BigQuery query counting item languages in the UC Berkeley table]
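For reference, a count along these lines could also be run programmatically with the google-cloud-bigquery client rather than through the web console. This is only a sketch: the dataset and table names (UCDW, UCB) mirror the schema described above, and the language column name is my assumption.

```python
# Minimal sketch: count item languages in the UCB table with the
# google-cloud-bigquery client.  Dataset, table and column names are
# assumptions based on the schema described in this post.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT language, COUNT(*) AS item_count
    FROM `UCDW.UCB`
    GROUP BY language
    ORDER BY item_count DESC
"""

for row in client.query(sql).result():
    print(row["language"], row["item_count"])
```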

Performance and cost

The most complex query run on this dataset to date looked for serial titles held at UCB that have no holdings at either NRLF or SRLF.  This query compares almost 7 million items at UCB against 14 million items at NRLF/SRLF.  Comparison points include the OCLC number and a tokenized version of the title.  While this comparison took approximately 10 minutes to run as a Python program and would not run at all as a SQL query in MySQL, the results return in about 3 minutes on the Google BigQuery platform.
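Structurally this is an anti-join. The sketch below shows one way such a query could be written; the table names (UCDW.UCB, UCDW.XRLF), the column names (oclc_number, title_key, material_type) and the serial filter are assumptions on my part rather than the exact query I ran.

```python
# Rough sketch of the UCB-vs-RLF anti-join described above.  Table,
# column and filter names are assumptions, not the production query.
from google.cloud import bigquery

sql = """
    SELECT ucb.oclc_number, ucb.title_key
    FROM `UCDW.UCB` AS ucb
    LEFT JOIN `UCDW.XRLF` AS rlf_oclc
      ON ucb.oclc_number = rlf_oclc.oclc_number
    LEFT JOIN `UCDW.XRLF` AS rlf_title
      ON ucb.title_key = rlf_title.title_key
    WHERE ucb.material_type = 'serial'
      AND rlf_oclc.oclc_number IS NULL
      AND rlf_title.title_key IS NULL
"""

client = bigquery.Client()
rows = client.query(sql).result()
print("serial titles at UCB with no RLF holdings:", rows.total_rows)
```

Because each LEFT JOIN carries a single equality condition, the same query shape should run on most SQL platforms; the difference here is simply that BigQuery can complete it at this scale.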

During an active month of development the cost to work with this service was $0.70.


Data storage

The Google BigQuery platform is driven by tables created from gzipped TSV files uploaded to the Google Storage platform.  To create a table in the BigQuery platform, a user generates a TSV file following a schema, gzips it and uploads it to the storage platform.  When query results are retrieved they are first exported by the system to this storage space, from which they can be downloaded.


Gzipped files are then loaded into the BigQuery platform through a table wizard.


This process means that even without further user interface development this platform could provide libraries with a manageable way to upload data into a shared or individualized query space and run pre-defined queries from a central data store.

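As a side note, the same upload-and-load cycle could in principle be scripted rather than run through the table wizard. The sketch below uses the google-cloud-bigquery client; the bucket, file, dataset and table names are placeholders of my own, and the schema is abbreviated to a few of the fields discussed above.

```python
# Sketch of a scripted load of a gzipped TSV from Google Storage into
# BigQuery.  Bucket, file, dataset and table names are placeholders;
# the schema is abbreviated to a few illustrative fields.
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset("UCDW").table("UCB")

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV  # TSV is CSV with a tab delimiter
job_config.field_delimiter = "\t"
job_config.schema = [
    bigquery.SchemaField("oclc_number", "STRING"),
    bigquery.SchemaField("title_key", "STRING"),
    bigquery.SchemaField("enumchron", "STRING"),
    bigquery.SchemaField("language", "STRING"),
]

load_job = client.load_table_from_uri(
    "gs://example-bucket/ucb_items.tsv.gz",  # gzipped TSV already uploaded to Google Storage
    table_ref,
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
print("loaded rows:", client.get_table(table_ref).num_rows)
```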

Findings and next steps

Analysis platform

The use of a set of Python programs for data extraction and analysis provides a sustainable way to process data.  All data extraction was completed on a desktop computer with a quad-core i5 processor, 32 GB of RAM and an SSD.  Memory, CPU and disk speed were all factors in processing time, but when executed in parallel to optimize CPU usage it takes approximately 15-20 minutes on this platform to process 14 million MARC records.  Processing CSV files takes considerably less time.
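For context, the parallel MARC processing boils down to splitting the record files across worker processes, roughly as sketched below. The file layout, field choices and pool size here are my own assumptions; the real extraction logic lives in the nrlf_ucb_marc_parser code.

```python
# Rough sketch of parallel MARC extraction with pymarc and multiprocessing.
# File layout, field choices and pool size are assumptions; the production
# extraction logic lives in the nrlf_ucb_marc_parser code.
import csv
import glob
from multiprocessing import Pool

from pymarc import MARCReader


def extract_file(path):
    """Pull a few illustrative fields from every record in one MARC file."""
    rows = []
    with open(path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:
                continue
            field_001 = record["001"]
            control_number = field_001.data if field_001 else ""
            field_245 = record["245"]
            title = (field_245["a"] or "") if field_245 else ""
            rows.append((control_number, title))
    return rows


if __name__ == "__main__":
    paths = sorted(glob.glob("marc_dumps/*.mrc"))
    with Pool(processes=4) as pool:  # one worker per core
        results = pool.map(extract_file, paths)

    with open("extracted_items.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for rows in results:
            writer.writerows(rows)
```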

For more in-depth comparison of the data, I found the Google platform to be far superior to the other methods I tested.  It seems like an easy way to scale analysis up and potentially engage other users without considerable development time.

Title comparison

It was unfortunate that, just as I was undertaking the comparison of data, OCLC deprecated its xID service.  This API-based service would have allowed me to extract considerably more data about OCLC number version history as well as ISSN and ISBN versions.  That information would have greatly improved the accuracy of OCLC number-based matching.  Without this service I found tokenized title matching to be an accurate solution.  At the moment UC Berkeley is working through a set of 42,000 serial titles that, according to this process, do not have holdings at NRLF or SRLF. A quality control check shows that the process is largely accurate with some false positives (e.g. titles returned in the data that do have holdings at NRLF or SRLF).  I have not tested the data for false negatives (e.g. titles not included in the data that do not have holdings at NRLF/SRLF).

I have consulted with OCLC staff about an alternative technique for getting historic ISSN data associated with serials, as this seems to be an area where we need better bibliographic data.  I have outlined a technique using the OCLC Linked Data service but have not had a chance to operationalize it yet.

An example of a document generated through this process for an actual collections decision is this list of serial titles held in the Main Stacks at UCB which have no holdings at either RLF (ucb_main_serials_notin_xrlf).

Enumeration extraction

The current enumeration extraction function uses rudimentary regular expressions to extract volume, number and date information based on common punctuation (e.g. :, -, ()).  I also tried some more experimental methods focused on finding patterns and automatically or manually writing functions to handle enumeration statements that fit those patterns (e.g. v1:2 (1999) is broken out to volume 1, number 2, date 1999).  In broad strokes I found around 142,751 patterns in the 21 million records I analyzed.  With further generalization I brought the number of unique patterns down to 13,678.  I believe that with further consideration of programming techniques to extract data based on these identified patterns we may be able to create a highly reliable enumeration extraction process.

The data below shows the most frequent entries from one such list of extracted enumeration patterns, with a count of the occurrences of each pattern.  Normalized by punctuation alone, the list contained 142,751 rows.  When I adjusted the program (enum_pattern_generator code) to also translate numbers down, I was able to distill the 142,751 rows to 13,678 unique patterns (see the list at the end of this section).
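The core of that generalization step is a substitution pass over each enumeration string. The sketch below is my reconstruction of the idea rather than the enum_pattern_generator code itself: the token choices (z for a number, dddd for a four-digit year, mmm for a month abbreviation) follow the pattern list at the end of this section, but the exact rules are assumptions.

```python
# Reconstruction of the pattern-generalization idea: collapse each
# enumeration string into a token pattern and count how often each
# pattern occurs.  Token choices mirror the list below, but the exact
# rules in enum_pattern_generator are assumptions.
import re
from collections import Counter

MONTHS = r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"


def enum_pattern(enumchron):
    pattern = enumchron.lower().strip()
    pattern = re.sub(MONTHS + r"[a-z]*\.?", "mmm", pattern)           # month names -> mmm
    pattern = re.sub(r"\b(1[6-9]\d{2}|20\d{2})\b", "dddd", pattern)   # 4-digit years -> dddd
    pattern = re.sub(r"\d+", "z", pattern)                            # remaining numbers -> z
    return pattern


if __name__ == "__main__":
    sample = ["v.1 1999", "v.12 1984", "no.3", "v.7:2-4 Jan-Mar 2001"]
    counts = Counter(enum_pattern(s) for s in sample)
    for pattern, count in counts.most_common():
        print(count, pattern)   # e.g. 2 v.z dddd / 1 no.z / 1 v.z:z-z mmm-mmm dddd
```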

Next steps

I believe that the proof of concept here can inform some next steps in thinking about collection analysis work that would be of value in helping campuses analyze their holdings in relation to RLF holdings as well as other shared print/digital platforms.  While the data contained here is a snapshot (i.e. it would need to be updated regularly to be valuable long term), simply refreshing the data monthly would provide campuses with a cycle on which they could perform comparisons.

I also believe that the work completed on identifying enumeration patterns may hint at a feasible strategy for working through the problem of holdings comparison for serials.  Such a process would be of immediate value to the UC libraries but could also help libraries more broadly undertake collections analysis work on their own collections.

Enumeration pattern example

# enumcount, enumpatterns
1482958, v.z
487389, v.z dddd
451532, v.z yr.dddd
355999, dddd
259450, v-.z
205518, yr.dddd
195151, no.z
141676, v.z no.z yr.dddd mo.mmm.
138499, v.z:z-z mmm-mmm dddd
122749, v.z-z
111722, dddd-dd
111094, no.z-z
108013, v.z-z dddd-dd
107466, v.z:z-z dddd
103756, no.z yr.dddd
90029, v.z:z
81453, v-.z dddd
81036, dddd:zk
78454, dddd:z
71833, v.z no.z yr.dddd


Metadata Standards and Web Services in Libraries, Archives and Museums

The worksheets and sample course packet linked to on this page are intended to serve as companion materials to the book Metadata Standards and Web Services in Libraries, Archives and Museums.

The text, the active learning worksheets and the course design grew from many iterations of information organization and information technology courses. Many thanks to the colleagues and students at the University of Maryland for your input and feedback.

The following suggested course packet and worksheets are structured around a semester-based weekly course program. Each worksheet contains suggested readings and hands-on activities intended to provide practical experience to reinforce technical and theoretical content in the book and other suggested course readings. Each worksheet includes an answer key.


Class 1:  Information infrastructures and institutions

Introduce the structure for the semester grounded in a broad orientation to how information institutions work. Explore definitions and examples of information institutions including libraries, archives, schools and museums (LASM). Explore the roles that these institutions play in society (e.g. memory, community, education, commerce).

Class 2:  Information systems as boundary objects

Expand on the organizational orientation from class 1 and discuss the social and cultural roles of LASM institutions. Explore concrete examples of information, cultural heritage and memory institutions and define concepts and ideas to give students a holistic understanding of the “information infrastructure” field. Introduce the course model (e.g. Metadata >> System >> User) and explore connections with other core courses. Explore the theoretical foundation of the process of representation.

Class 3:  Acquiring and managing resources

Explore resource acquisition and management work in LASM institutions. Introduce technical service disciplines and illustrate connections with other functional areas in information institutions by reinforcing the role of core courses. For each LASM institution type, explore the notion of resource operations in light of changing information institution models. At the end of the class students will understand the role of each of the following activities in LASM institutions: 1) publication models (formal, informal), 2) acquisition of materials (published, manuscripts, grey literature), 3) management of formats (physical and digital), 4) materials processing and management, 5) appraisal, access and preservation, 6) alternative acquisition, management and dissemination strategies.

This optional worksheet, designed to be included in class 3, covers advanced JavaScript topics.

Class 4:  Introduction to metadata

Introduce metadata model (cataloging model, metadata schema, data representation model, data encoding/serialization). Discuss different types of metadata (e.g. descriptive, administrative, technical) and situate metadata within the broader context of information system design.

Class 5:  Methods of description, representation and classification

Discuss cataloging methods and different forms of metadata in information institutions. Introduce concept of metadata schemas and role that metadata standards play in enabling creation of digital documents and representations. Reinforce specific cataloging standards/approaches (e.g. RDA, DACS, ISAD/G) and introduce metadata schema (e.g. MARC, DC, EAD). Reinforce context of these standards in broader metadata and information system design models. Draw connections to other types of information systems. Explore and apply classification structures. Explore information seeking processes and the connection between categorization and cognition.

Class 6:  Metadata schema, vocabularies and encoding

Expand on concepts in metadata schema including the notion of application profiles, abstract models (e.g. Dublin Core Abstract Model) and Resource Description Framework. Broaden student understanding of vocabularies by introducing new serialization standards (e.g. XML, JSON).

Class 7:  Database design

Introduce relational database design concepts and techniques. Reframe student understanding of information systems by introducing web-based information system design (e.g. Model – View – Controller). Topics covered include entity relationship modeling, database creation, database querying and information filtering.

Class 8:  Selected topic deep dive

Designed to be a ‘catch-up’ week, class 8 includes an optional deep-dive activity into MARC for students who want to become more acquainted with that standard.

Class 9:  Search and retrieval in information systems

Explore methods for automatic indexing and ranking of information resources. Introduce foundation of web search techniques, full text searching of scanned books and image searching.

Class 10:  Creation of metadata rich web services

Explore services that support access to physical and digital objects. Introduce broad types of information services including user-focused services (library catalog) and system-focused web services (interoperability, harvesting, transformation) (ONIX, OAI-PMH).

Class 11:  Metadata rich web services

Continue exploration of web services by exploring Open Refine and text manipulation and analysis techniques.

Class 12:  Building blocks of the web

Revisit web-publishing document standards (e.g. HTML, CSS, JavaScript). Acquaint students at a high level with web publishing approaches and reinforce concepts around web-based scripting and programming languages. For classes especially focused on metadata issues this introduction to the eXtensible Stylesheet Language could be an appropriate overview for programming concepts.

Class 13:  Emerging topics – Exploration of data management

In this class we are exploring the broad area of Research Data Management in order to better understand how issues of organization and information technology have an impact in an emerging area of interest in libraries, archives, schools and museums. Students will explore a real-world data management guide and try their hand at data management tools.

Class 14:  Next steps in information infrastructures
Review course content and bridge student knowledge of information infrastructures, systems and services to other parts of the curriculum. Discuss professional paths for different areas of interest. Connect learning by re-visiting institutional, data life-cycle and information system models.


Introduction to data management principles and tools

On November 19th I had the pleasure of joining Jeffery Loo in giving a workshop on data management principles and tools at the D-Lab at UC Berkeley.

The workshop included a handout that includes a link to the UC Berkeley Library guide for data management. The workshop also included a series of hands-on data management activities.


Some testing notes on Google’s power searching course platform

In the summer of 2012 Google held an online instruction course called Google Power Searching. The course was developed using an interactive platform that featured text, video and skill-check questions. It also featured cumulative tests and time-released activities. Google was the most recent player to enter the Massive Open Online Course space and enrolled over 150,000 people in its course.

This past week Google released the software behind the power searching course. While interesting just because it is an alternative for teachers interested in creating interactive online learning environments, the Google application is unique in that it was designed to work on the Google App Engine platform. This platform features an integrated development and testing environment and is designed around a cloud platform that scales automatically.

A quick deployment of the instructional platform on GAE revealed some interesting features. First, the course instructional elements are contained in comma-separated files and can be easily loaded into the course. This allows course designers to write text and record multimedia resources as needed while also providing a visually appealing and easy-to-use interface.

One potential downside is the fact that the interactive skill checks are implemented in JavaScript code files in the application itself. This could prove daunting for course designers seeking to efficiently manage their course elements.

The tight integration between Google App Engine and the software, however, underscores the effectiveness of an open source platform that can be deployed without the overhead of IT environment customization and management. In addition, the reliance on the Django development framework in Google App Engine enables automated administrative interfaces that would otherwise be far down on the development list in an application like this.

In attempting to map some of my existing course activities over to the platform I found that I had often designed activities that were complex and larger in scale. Decomposing these activities into discrete step-by-step processes proved difficult in the Google platform. For example, in a simple test with one part of a class I found that some tasks were difficult to break down into manageable sizes.

A second challenge I ran into was the somewhat limited set of assessment tools. The platform features multiple choice, true/false and auto-assessed short answer questions but does not have functions to support the creation of tables or matrices based on student exploration. In addition, only the quizzes gathered data on student activities and preserved it.


ASIS&T in Hurricane Sandy

This past week I spent a few days in Baltimore at the ASIS&T 2012 conference during Hurricane Sandy. Other than some driving rain, a curfew during the second day of the conference and a few sandbags here and there, the conference came off without any issues.

Both of the panels I was on had about half of the panelists show up. This made time for extra questions and audience interaction! I enjoyed presenting my work and appreciate my co-presenter Kanti standing by our poster when I had to get back to my hotel before curfew during the poster sessions. Here's hoping that November in Montreal will bring better weather!
