Preprint material

This page provides a single entry point for technical material in the submitted manuscript on Bibliographic Data Science (Lahti, Marjanen, Roivainen, Tolonen).

Source code for bibliographic metadata harmonization

Brief summary of the data harmonization

For full algorithmic details, see the respective project pages above.

Selected fields In the present work, we have focused on a few selected fields: the document publication year and place, language, and physical dimensions, including gatherings and page count.

Publication years We identified and interpreted varying notation formats, such as free text, Arabic numerals, and Roman numerals, and removed spelling mistakes and erroneous entries.
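
As a rough illustration of this kind of year harmonization, the following Python sketch extracts a plausible year from a raw field written in Arabic or Roman numerals. It is not the implementation used in the project; the function names and the accepted year range (1400 to 1900) are assumptions made for the example.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert an uppercase Roman numeral to an integer (no validity check)."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        nxt = ROMAN_VALUES[numeral[i + 1]] if i + 1 < len(numeral) else 0
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if value < nxt else value
    return total


def parse_year(raw, min_year=1400, max_year=1900):
    """Return the first plausible year found in a raw field, or None.

    The year range is a hypothetical plausibility filter, not a project setting.
    """
    # Arabic numerals: any standalone four-digit number, e.g. "[1776?]".
    candidates = [int(y) for y in re.findall(r"\b\d{4}\b", raw)]
    # Roman numerals: drop periods used as separators ("M.DCC.LXXVI."),
    # then collect all-uppercase numeral tokens.
    no_dots = raw.replace(".", "")
    candidates += [roman_to_int(t) for t in re.findall(r"\b[MDCLXVI]+\b", no_dots)]
    for year in candidates:
        if min_year <= year <= max_year:
            return year
    return None
```

Entries that yield no candidate within the accepted range are returned as None, so that erroneous values can be reviewed separately instead of being passed on silently.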

Publication places We harmonized and disambiguated city names by a combination of string clustering and mappings to open geographic databases, in particular GeoNames. In addition, we complemented the automated searches with manually curated lists to merge synonymous place names (arising from spelling errors, varying writing conventions, and language versions) and to fill in missing city-country mappings. We have paid particular attention to the unified treatment of geographical names across catalogues, in order to allow comparisons and integration of geographical information across independently maintained catalogues.
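
The combination of curated synonym lists and approximate string matching can be sketched as follows. This is a minimal Python illustration under stated assumptions: the synonym table, the list of canonical names, and the similarity cutoff are all made up for the example, and the standard-library `difflib` matcher stands in for the string clustering and GeoNames lookups used in the actual work.

```python
import difflib
import re

# Hypothetical, hand-made tables for illustration only.
SYNONYMS = {
    "londini": "London",
    "londres": "London",
    "lugduni batavorum": "Leiden",
    "edinburgi": "Edinburgh",
}
CANONICAL = ["London", "Leiden", "Edinburgh", "Dublin"]
COUNTRY = {"London": "England", "Leiden": "Netherlands",
           "Edinburgh": "Scotland", "Dublin": "Ireland"}


def normalize(raw):
    """Lowercase, replace punctuation and brackets by spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    return " ".join(cleaned.split())


def harmonize_place(raw, cutoff=0.8):
    """Map a raw place string to a canonical name, or None if unresolved."""
    key = normalize(raw)
    # 1. Manually curated synonyms take precedence.
    if key in SYNONYMS:
        return SYNONYMS[key]
    # 2. Exact match against the canonical names.
    by_key = {name.lower(): name for name in CANONICAL}
    if key in by_key:
        return by_key[key]
    # 3. Approximate match to catch spelling variants.
    close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None
```

Once a name is harmonized, the country follows from a lookup such as `COUNTRY.get(harmonize_place("Londini:"))`. Unresolved names are returned as None, which makes them easy to collect for manual curation.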

Language information We mapped the language identifiers in the MARC format to the corresponding full language names, and in this process also identified and corrected occasional spelling errors. Where available, we separately listed the primary and other languages in multilingual documents.
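
A simplified version of this mapping might look like the Python sketch below. The code table is a small excerpt of the MARC language codes. The corrections table, the separator handling, and the rule that the first listed language is the primary one are assumptions made for the example.

```python
import re

# Small excerpt of MARC language codes and their full names.
MARC_LANGUAGES = {
    "eng": "English",
    "fre": "French",
    "ger": "German",
    "lat": "Latin",
    "swe": "Swedish",
    "fin": "Finnish",
}
# Hypothetical corrections for misspelled or non-standard codes.
CODE_CORRECTIONS = {"engl": "eng", "fra": "fre"}


def parse_languages(raw):
    """Return (primary_language, other_languages) for a raw language field.

    Assumes codes are separated by semicolons, commas, or whitespace,
    and that the first recognized code denotes the primary language.
    """
    codes = [c for c in re.split(r"[;,\s]+", raw.strip().lower()) if c]
    names = []
    for code in codes:
        code = CODE_CORRECTIONS.get(code, code)
        name = MARC_LANGUAGES.get(code)
        # Skip unknown codes and duplicates.
        if name and name not in names:
            names.append(name)
    primary = names[0] if names else None
    return primary, names[1:]
```

For a multilingual entry such as `"lat; fra"`, the sketch returns Latin as the primary language and French as the only other language.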

Gatherings This reflects document width and height. The MARC entries do not directly provide standardized height, width, and page count estimates. The notations vary from standard gatherings (such as folio or octavo) to inches and centimeters, and sometimes only partial information is available. Where possible, we have estimated and filled in missing values based on the available information; for instance, the gatherings of a document can often be reliably estimated when only its height or width is available.
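
To illustrate how missing dimension information could be filled in, the Python sketch below parses a gatherings abbreviation or a height in centimeters or inches, and estimates whichever of the two is missing. The typical heights are placeholder values chosen for the example, not the reference table used in the project, and all function names are hypothetical.

```python
import re

GATHERING_ALIASES = {
    "fol": "folio", "folio": "folio",
    "4to": "quarto", "quarto": "quarto",
    "8vo": "octavo", "octavo": "octavo",
    "12mo": "duodecimo", "duodecimo": "duodecimo",
}
# Placeholder typical heights in cm, for illustration only.
TYPICAL_HEIGHT_CM = {"folio": 35.0, "quarto": 25.0, "octavo": 20.0, "duodecimo": 17.0}


def parse_dimension(raw):
    """Return (gatherings, height_cm); either may be None."""
    text = raw.lower()
    gatherings = None
    for token in re.findall(r"[0-9a-z]+", text):
        if token in GATHERING_ALIASES:
            gatherings = GATHERING_ALIASES[token]
            break
    height = None
    match = re.search(r"(\d+(?:\.\d+)?)\s*(cm|in(?:ch(?:es)?)?)\b", text)
    if match:
        value = float(match.group(1))
        # Convert inches to centimeters (1 in = 2.54 cm).
        height = value * 2.54 if match.group(2).startswith("in") else value
    return gatherings, height


def fill_missing(raw):
    """Estimate gatherings from height, or height from gatherings."""
    gatherings, height = parse_dimension(raw)
    if gatherings is None and height is not None:
        # Choose the gatherings whose typical height is closest.
        gatherings = min(TYPICAL_HEIGHT_CM,
                         key=lambda g: abs(TYPICAL_HEIGHT_CM[g] - height))
    elif height is None and gatherings is not None:
        height = TYPICAL_HEIGHT_CM[gatherings]
    return gatherings, height
```

With these placeholder values, an entry that records only "21 cm" is assigned to octavo, and an entry that records only "8vo" receives the typical octavo height.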

Page count The MARC notation for page counts follows a standard convention that separates cover sheets, figures, and other document details. We have constructed custom algorithms to convert the raw page count information into numeric values, as implemented in the publicly available bibliographica R package.
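
The following Python sketch gives a flavor of such a conversion for a simple extent statement. It is a strongly simplified stand-in and does not reproduce the rules of the bibliographica R package: it assumes comma-separated page sequences, sums Arabic, Roman, and bracketed counts, and tallies plates separately.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
# Strict pattern so that abbreviations such as "ill" are not read as numerals.
ROMAN_PATTERN = re.compile(r"m{0,4}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")


def roman_to_int(numeral):
    """Convert a lowercase Roman numeral to an integer, or None if invalid."""
    if not numeral or not ROMAN_PATTERN.fullmatch(numeral):
        return None
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        nxt = ROMAN_VALUES[numeral[i + 1]] if i + 1 < len(numeral) else 0
        total += -value if value < nxt else value
    return total


def estimate_pages(extent):
    """Return (pages, plates) from a simplified extent statement."""
    pages, plates = 0, 0
    for segment in extent.lower().split(","):
        # Prefer Arabic numbers; fall back to a Roman numeral token.
        match = (re.search(r"\b\d+\b", segment)
                 or re.search(r"\b[ivxlcdm]+\b", segment))
        if not match:
            continue
        token = match.group(0)
        count = int(token) if token.isdigit() else roman_to_int(token)
        if count is None:
            continue
        if "plate" in segment:
            plates += count
        else:
            pages += count
    return pages, plates
```

For example, `estimate_pages("[2], xii, 345, [3] p., 4 leaves of plates")` returns `(362, 4)`: 2 + 12 + 345 + 3 pages, plus 4 plates counted separately.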

Print area We have added derivative fields, in particular the print area, which refers to the number of sheets used to print different documents in a given period, ignoring print run estimates. This helps to quantify the overall breadth of printing activity, reflecting the overall volume of printed material in a way that is complementary to mere title count.
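
One way to compute such a quantity is sketched below in Python. The sketch assumes that the sheets needed for one copy of a document equal its page count divided by the number of pages printed on one sheet in the given gatherings (4 for folio, 8 for quarto, 16 for octavo, 24 for duodecimo), and it aggregates by decade. The record layout and function names are hypothetical, and the exact formula used in the manuscript may differ.

```python
from collections import defaultdict

# Pages printed on one sheet (both sides) for common gatherings.
PAGES_PER_SHEET = {"folio": 4, "quarto": 8, "octavo": 16, "duodecimo": 24}


def sheets_for_document(pages, gatherings):
    """Sheets needed for a single copy of a document."""
    return pages / PAGES_PER_SHEET[gatherings]


def print_area_by_decade(records):
    """Sum the sheets per decade over records with known pages and gatherings.

    Each record is assumed to be a dict with 'year', 'pages', 'gatherings'.
    Print runs are ignored: every title counts as one copy.
    """
    totals = defaultdict(float)
    for record in records:
        if record["pages"] is None or record["gatherings"] not in PAGES_PER_SHEET:
            continue  # skip records with missing information
        decade = (record["year"] // 10) * 10
        totals[decade] += sheets_for_document(record["pages"], record["gatherings"])
    return dict(totals)
```

Under these assumptions, a 160-page octavo and an 80-page quarto each contribute 10 sheets, although they count equally as one title each.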

Quality control Our automated harmonization efforts are coupled with systematic monitoring and verification of the quality and coverage of the harmonized entries. This is facilitated by automatically generated summaries of the data conversions and mappings between the raw entries and the final data, as summarized on the respective project pages (see the links in the source code section above).
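
A conversion summary of this kind could be generated along the following lines. The Python sketch is illustrative only: the function name and the report structure are assumptions, and the actual summaries on the project pages are more extensive.

```python
from collections import Counter


def conversion_summary(raw_values, harmonized_values):
    """Summarize how raw entries map to harmonized entries.

    A harmonized value of None marks an entry that could not be converted.
    """
    pairs = list(zip(raw_values, harmonized_values))
    accepted = Counter((raw, final) for raw, final in pairs if final is not None)
    discarded = Counter(raw for raw, final in pairs if final is None)
    total = len(pairs)
    return {
        # Share of entries with a harmonized value.
        "coverage": sum(accepted.values()) / total if total else 0.0,
        # Raw-to-final mappings, most frequent first.
        "mappings": accepted.most_common(),
        # Raw entries that were discarded, most frequent first.
        "discarded": discarded.most_common(),
    }
```

Listing the most frequent mappings and discarded entries first makes it easier to spot systematic conversion errors when checking the output by hand.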