Dirty data can cause trouble in the new Tripod

by Barbara Weir

One of the advantages of a catalog discovery layer such as the new Tripod is the ability to narrow search results using filters or “facets,” which draw on the characteristics describing a resource, such as its format, subject headings, and publication date. However, the effectiveness of faceted browsing relies on descriptive data that is clean and consistent. “Dirty data,” such as typos, inconsistent author headings, or incorrect format coding, went virtually unnoticed in Tripod Classic but becomes glaringly obvious when displayed in a filter. These mistakes, while seemingly minor, can affect the retrieval of catalog records.

Besides human error, why are these inconsistencies in the catalog? One reason is that cataloging practices change over time. Until recently the Library of Congress did not consistently add an author’s death date to the author heading. Therefore, an author’s name could appear twice: as Smith, Robert, 1925- and as Smith, Robert, 1925-2011. Narrowing a search result to just one or the other will cause a patron to lose some of the author’s works. Another change affected the three-letter language coding of multilingual works. These are currently coded so that each of the languages can be parsed and displayed. However, older cataloging practice was to concatenate all of the codes. As a result, a language code for a title in English, French, and German could have displayed as “engfreger.” In fact, one library implementing a catalog discovery layer found over ten thousand records with bad language codes displaying as a top result in their language filter.
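
For readers curious about what cleaning up these two cases might involve, here is a minimal sketch in Python. It is an illustration only, not the process the Tripod team uses. The function names are hypothetical, and it assumes that a concatenated language value is made up entirely of three-letter codes and that an author heading ends with a birth year, a hyphen, and an optional death year.

```python
import re


def split_language_codes(value: str) -> list[str]:
    """Split a run of concatenated three-letter codes such as 'engfreger'.

    Assumption: every code is exactly three letters long. Anything else
    is rejected so a person can review it rather than guessing.
    """
    cleaned = value.strip().lower()
    if not cleaned.isalpha() or len(cleaned) % 3 != 0:
        raise ValueError(f"cannot split language value: {value!r}")
    return [cleaned[i:i + 3] for i in range(0, len(cleaned), 3)]


# Assumed heading shape: "Name, birth year-" with an optional death year.
_DATED_HEADING = re.compile(
    r"^(?P<name>.+?),\s*(?P<birth>\d{4})-(?P<death>\d{4})?\s*$"
)


def author_grouping_key(heading: str) -> str:
    """Return a key that ignores the death year, so open and closed
    forms of the same heading fall into the same group."""
    stripped = heading.strip()
    match = _DATED_HEADING.match(stripped)
    if match is None:
        # Headings without dates are left unchanged.
        return stripped
    return f"{match.group('name')}, {match.group('birth')}-"
```

With these definitions, split_language_codes("engfreger") yields the three separate codes eng, fre, and ger, and both forms of the Robert Smith heading above produce the same grouping key. Note that a key like this would also merge two different people who share a name and birth year, so it could only suggest candidates for a cataloger to review.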

The “date of publication” filter was particularly complicated to construct because the date can be found in a number of places in a bibliographic record and because more than one date can be associated with an item. For example, a piece of music could have been performed in one year, but published in another. Digitized items may include a date of publication of the original print as well as a date when the item was digitized. If a date is unknown, it may have been coded in different ways depending on the cataloging practice at the time.

The largest source of inconsistencies, however, is the vendor-supplied records we load into Tripod for many of our e-resources. Because e-resource packages such as Ebrary, Naxos Music Library and Early English Books Online often contain thousands of titles, it wouldn’t be feasible to manually create individual bibliographic records. Therefore, we take data supplied by the vendor, re-format it to create records with links to the e-resources, and load them into Tripod. While the records provide a way for students and faculty to discover an e-resource, they are a bit of a compromise and often don’t contain the level of detail we would include in a normal bibliographic record. For example, records for films in the Films on Demand package do not include the name of the director. Records for books in the EBL ebook collection contain subject headings, but if the subject includes a geographic location, it is not identified as such. A search filtered to a geographic region could miss these titles.

In addition to full record loads, the libraries contract with a vendor to add table of contents data to new bibliographic records. While this data proves to be very useful in keyword searching and item display, there can be a problem with the author names. Since the names are often indexed as they appear on the title page, they are usually inconsistent with the standard form of the name, leading to the author’s name appearing in multiple forms in the author filter.

The new Tripod implementation team, mindful of these potential problems, combed through hundreds of bibliographic record field codes in order to determine which data to parse and display in filters. The current filters are likely to be tweaked as we work with the system and gather feedback from patrons. We will likely add a separate filter for group or corporate authors, as we found these headings sometimes overwhelmed the results in the author filter. For example, a search for the title Orientalism by Edward Said resulted in the author name displaying well “below the fold,” following a number of corporate authors such as Ebrary, the United States Department of Justice, the London School of Oriental and African Studies and the University of Chicago Oriental Institute.

The librarians are discussing strategies for finding and correcting inconsistent and incorrect data now exposed in the discovery layer. While no library catalog with over two million records will ever be 100% correct, we want to provide users the most complete and accurate data possible. We invite you to report any “dirty data” you find to librarian@swarthmore.edu. We are standing by with virtual scrubbies at hand.

