On Digitization


Original art by blog author

I enjoy coding, because a sense of freedom and control over the situation accompanies the writing of a program; the versatility of code gives programmers the ability to retrieve any* output from any* input they’re working with, depending on what they want and need at the moment.

(*Not really “any”. Code has multiple limitations.)

When my professor told us, as we were learning how to use the Optical Character Recognition (OCR) software ABBYY FineReader, that this program allows users to “manually” fix errors that result from text digitization, the very young coder in me jumped up and down in excitement. We were introduced to an incredibly useful tool that allows the worldwide transmission, analysis, and preservation of texts that would otherwise be confined to wherever their physical copy is stored. Even though we were immediately made aware of the limitations of the program, we also learned about some possible solutions for them. This is why the part of me that likes coding got so very excited: OCR software opened up a world of possibilities regarding text analysis, possibilities that are not restricted to what the software itself can do.

Our professor explained to us that when a physical copy of a text is scanned and read by the software, some characters might be (almost inevitably) incorrectly identified in the text’s digital version. These errors are not uncommon, despite the fact that ABBYY FineReader recognizes multiple languages and alphabets, and is aided by dictionaries (for some languages).

For instance, Figure 1 shows the misrecognition of a specific character from a Romanian translation of the Bible. The name Domnul Dumnezeu, which addresses God, was originally written with a cursive D at the start. The special character was identified by the software as the copyright symbol instead:



Figure 1. The copyright symbol © is not part of the original text.

We were told that we could go about fixing this problem by instructing (“training”) the program to write x (the desired character) every time it identifies y (the misrecognized character). In this particular case, x would be the cursive D (which I couldn’t find online to copy here… and which I, foolishly, didn’t think to copy from the original document), and y would be the copyright symbol ©. This can be done with code!
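Here is a minimal sketch of what such a substitution could look like in Python. It works on the text after it has been exported from the OCR software, so it is not ABBYY FineReader's own training feature. The table CORRECTIONS and the function fix_ocr_text are names I made up for illustration, and a plain "D" stands in for the cursive D that I could not reproduce here.

```python
# Hypothetical correction table: misrecognized character -> intended character.
# A plain "D" is a stand-in for the cursive D of the original page.
CORRECTIONS = {"©": "D"}


def fix_ocr_text(text, corrections=CORRECTIONS):
    """Replace every misrecognized character in text with its intended one."""
    # str.maketrans accepts a dict whose keys are single characters.
    return text.translate(str.maketrans(corrections))


sample = "©omnul ©umnezeu"
print(fix_ocr_text(sample))  # Domnul Dumnezeu
```

A blanket replacement like this is only safe when the misread symbol never legitimately appears in the text, which seems a reasonable assumption for a copyright sign in the body of a Bible translation.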

The second text I worked with was written in Arabic, and the software’s reading of it ran into a problem similar in nature to the previous one. Figure 2 shows that the program recognized some blocks of text as Arabic (green) but couldn’t identify others (pink):

Figure 2. ABBYY FineReader recognized the green blocks only.

It was explained to me that this page contains text in stylized script, making it impossible for ABBYY FineReader to recognize the language in which parts of it are written.

These examples are why my encounter with OCR software taught me that errors in digitization often arise when characters are stylized in non-standard ways. Therefore, the question arises: does digitization affect calligraphy and typography as forms of art? Will the technology evolve to the extent that it can recognize the smallest nuances in characters, or will stylized characters be deemed unreadable and not suited for OCR? Would the pros of standardizing text format outweigh its cons? I don’t think they would, given the value of art forms relating to text. But then, would not standardizing imply sacrificing important content analysis?

Kenneth M. Price states in A New Companion to Digital Humanities (2016) that digitizing texts allows “the analysis of patterns that can be detected and then explored in more detail”, and given the malleable nature of electronic content, “the goal… is the early release of strong rather than perfected content” (p. 139). I have already mentioned examples of the flaws of OCR, but also of its strengths. The analysis of patterns, similar to what could be done regarding the Romanian Bible, is fascinating and can be extremely useful.

Last semester, I worked on a translation project for one of my courses, which consisted of translating a Costa Rican short story from the 1920s into English. For this endeavour, I required the help of an 1890s Costa Rican dictionary of provincialisms, which I was lucky enough (and extremely grateful) to find online, on Internet Archive. The short story was full of words and phrases I didn’t understand too well, because they are no longer used colloquially. I’m thinking that a project with the potential of yielding interesting results could include reading the dictionary with OCR software, storing the terms defined in it that I’m not familiar with, and then trying to trace them in certain Costa Rican texts from the 1890s onwards. This process could show the frequency of these words in texts across time, and how their use has declined. I might even be able to mark the periods in which words became antiquated, to trace which generations stopped using them.
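The counting step of that idea could be sketched in Python roughly as follows. This is only an outline under several assumptions: the texts have already been digitized and paired with a publication year, the terms are single words, and the function name term_frequency_by_decade is my own invention. The sample documents and terms are made-up placeholders, not real dictionary entries.

```python
import re
from collections import Counter, defaultdict


def term_frequency_by_decade(documents, terms):
    """Return {decade: {term: occurrences per 1,000 words}}.

    documents: iterable of (year, text) pairs.
    terms: single words to trace (matched case-insensitively).
    """
    wanted = {t.lower() for t in terms}
    counts = defaultdict(Counter)  # decade -> how often each term appears
    totals = Counter()             # decade -> total number of words seen
    for year, text in documents:
        decade = year // 10 * 10
        words = re.findall(r"\w+", text.lower())
        totals[decade] += len(words)
        counts[decade].update(w for w in words if w in wanted)
    return {
        decade: {t: 1000 * counts[decade][t] / totals[decade] for t in sorted(wanted)}
        for decade in sorted(totals)
        if totals[decade]
    }


# Placeholder data, invented purely to show the shape of the input.
docs = [
    (1895, "alfa beta alfa gamma"),
    (1921, "beta gamma gamma delta"),
    (1958, "gamma delta delta delta"),
]
for decade, freqs in term_frequency_by_decade(docs, {"alfa", "beta"}).items():
    print(decade, freqs)
```

Normalizing by the number of words in each decade keeps a decade with more surviving texts from looking as if it used a term more often. Multi-word phrases from the dictionary would need a different matching step than this word-by-word count.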
