How much overlap do we have with the HathiTrust?

Since ZSR Library started looking at the HathiTrust as a potential source of out-of-copyright electronic texts, people have been asking, ‘How does our collection compare to HT?’ The short answer, based on a comparison of valid OCLC numbers, is that 366,800 of our titles match records among the 8.5 million titles in the HathiTrust. That is about 29% of our collection, slightly less than the average coverage recently reported for research libraries in general. Granted, this is a simple first pass using OCLC numbers alone, and it most likely leaves out a number of titles. It is interesting to note that of the 1.5M 035A entries in our database, nearly a third contain numbers that are not OCLC numbers.
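For anyone curious what a first-pass match like this might look like, here is a minimal sketch in Python. It is not the process actually used for the numbers above: the function names, the prefix pattern, and the sample values are all assumptions made for illustration. It treats an 035A value as an OCLC number only if it carries a recognizable OCLC prefix, which is one plausible reading of "valid OCLC numbers".

```python
import re

# Assumed pattern: an OCLC number is a digit string carrying an "(OCoLC)"
# label and/or an "ocm"/"ocn"/"on" prefix. Leading zeros are dropped.
OCLC_PATTERN = re.compile(
    r"^\s*(?:\(OCoLC\)\s*(?:ocm|ocn|on)?|ocm|ocn|on)0*(\d+)\s*$",
    re.IGNORECASE,
)


def extract_oclc(value):
    """Return the normalized OCLC number in an 035A value, or None."""
    match = OCLC_PATTERN.match(value)
    return match.group(1) if match else None


def match_holdings(local_035a, hathi_oclc):
    """Return (matched, local) sets of normalized OCLC numbers.

    local_035a: raw 035A strings from the local catalog (hypothetical input).
    hathi_oclc: digit strings taken from a HathiTrust export (hypothetical input).
    """
    local = {n for n in map(extract_oclc, local_035a) if n is not None}
    hathi = {str(int(n)) for n in hathi_oclc}  # strip leading zeros
    return local & hathi, local


# Made-up sample values, for illustration only.
sample_local = [
    "(OCoLC)ocm00012345",
    "ocn123456789",
    "(Sirsi) a998877",  # not an OCLC number, so it is skipped
    "12345-local",      # not an OCLC number, so it is skipped
]
sample_hathi = ["12345", "555"]

matched, local = match_holdings(sample_local, sample_hathi)
coverage = 100 * len(matched) / len(local)
```

Values without a recognizable prefix are simply dropped, which is why a pass like this undercounts: any record whose 035A holds a vendor or local system number never gets a chance to match.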

A second question that has been raised is “What is the copyright status of these matched titles?” Unfortunately, over 87% of our matched titles are in copyright. This means that while they are digitized, we can’t use them. Only about 11% of our matches are public domain titles (drawn from a database of 2.2M public domain records in the HathiTrust). This indicates that there are some good opportunities to expand the representation of public domain digitized resources in our catalog. As you might expect, however, our MARC records for public domain resources are also more likely to lack OCLC numbers. For instance, our Eighteenth Century Collections Online MARC records do not have OCLC numbers. For definitions of the rights codes listed in the table below, you can visit the HathiTrust site.

This comparison is just our first guess at matching with the HathiTrust. HT includes some sophisticated data APIs and a wide range of identifiers that we can work with to see how our holdings compare. What we do about this is an open topic, but I thought it would be interesting to see how our collection compares as a starting point.

Want to see the data? I can export the matches for you so you can run your own reports. Curious about the process? Visit my office and I can show you the database that was used to run the comparisons.

Table of copyright policy for matched titles

rights          Number of records    % of matched records
cc-by                           2                 0.0005%
cc-by-nc                       16                 0.0044%
cc-by-nc-n                      1                 0.0003%
cc-by-nc-s                     24                 0.0065%
ic                         320274                87.1446%
nobody                         31                 0.0084%
opb                           137                 0.0373%
pd                          32100                 8.7342%
pdus                         9489                 2.5819%
und                          4809                 1.3085%
world                         637                 0.1733%
Total matches              367520               100.0000%
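
If you want to check the percentages quoted above against the table, a short Python sketch like the following will do it. The counts are copied from the table. Treating pd and pdus together as "public domain" is my assumption about how the 11% figure was reached, and the names RIGHTS_COUNTS and share are just illustrative.

```python
# Counts copied from the table above, keyed by rights code as listed there.
RIGHTS_COUNTS = {
    "cc-by": 2,
    "cc-by-nc": 16,
    "cc-by-nc-n": 1,
    "cc-by-nc-s": 24,
    "ic": 320274,
    "nobody": 31,
    "opb": 137,
    "pd": 32100,
    "pdus": 9489,
    "und": 4809,
    "world": 637,
}

total_matches = sum(RIGHTS_COUNTS.values())


def share(codes):
    """Percentage of all matched records that carry any of the given codes."""
    return 100 * sum(RIGHTS_COUNTS[code] for code in codes) / total_matches


in_copyright = share(["ic"])
# Assumption: "public domain" here means pd plus pdus (public domain in the US).
public_domain = share(["pd", "pdus"])

print(f"Total matches: {total_matches}")
print(f"In copyright:  {in_copyright:.2f}%")
print(f"Public domain: {public_domain:.2f}%")
```

Grouping the codes in a list makes it easy to try other combinations, for example adding the Creative Commons codes to see how little they change the picture.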

One Response to How much overlap do we have with the HathiTrust?

  1. Lynn says:

    These are data I’ve been wanting to see. I will want to continue the conversation when I get back!
