Corpus

The texts I’ve chosen for my corpus are both titled Diccionario de costarriqueñismos, which can be translated as “dictionary of Costa Rican slang”.

The first one was written by “narrator, playwright, poet, and philologist” Carlos Gagini, and published in 1918. Gagini is considered one of the greatest representatives of Costa Rican literature, given his efforts to establish literary models that reflected Costa Rica’s national identity. He was also interested in the country’s native cultures, thus some of his work focused on aboriginal languages (“Carlos Gagini”, n.d.). Diccionario de costarriqueñismos is regarded as his masterpiece, “one of the most important lexicographic works and inescapable reference in the study of Costa Rica’s Spanish” (“Diccionarios”, n.d.).

In 1996, almost eighty years after Gagini’s dictionary was first released, philologist and humanist Arturo Agüero Chaves published a new slang dictionary. He was an internationally acclaimed author, well-known for his research on the origin and development of linguistics (“Arturo Agüero Chaves”, n.d.). In the prologue of his text, Agüero explains that his studies on Costa Rica’s Spanish began as early as 1953, and that close examination of vocabulary was carried out in response to “the insistent demand of many people who wished for an updated Costa Rican slang dictionary, given that Carlos Gagini’s had been published in 1918…” (Agüero Chaves, 1996, p. VII).

(It is worth noting that a slang dictionary was published in 1991 by Miguel Ángel Quesada Pacheco, titled Nuevo diccionario de costarriqueñismos. Therefore, Agüero’s text was not the first one to be written after Gagini’s. The corpus and its analysis would have been enriched with the addition of this text, but it isn’t available online.)

Given the significant amount of time that separates the two works, I intended to analyze the texts in search of differences and/or patterns that I could then relate to the evolution of Costa Rican slang across the years. However, I realized throughout the process that the lexicographers’ interests and points of view were what most likely determined the differences between the dictionaries and the patterns within them (and because each lexicographer was influenced by his context, the time period certainly affects the content of each work as well).

I also realized that this was an opportunity to compare Costa Rican Spanish to “general” Spanish; a possible way of doing this was to conduct frequency analyses of the letters in each text, and compare them to a general frequency analysis of letters in Spanish.

 

Part I. Digitization

Tool: ABBYY FineReader

The first step towards text analysis was the digitization of both texts, as they were available online as images of their physical versions. In the case of Gagini’s work, a .txt file of the dictionary can be found online, but it contains some character recognition errors that would have required close examination of the entire text. Instead, I chose to use OCR software to get my own .txt version of the text, in the hope that those same errors wouldn’t appear in my version. The same was done for Agüero’s work. Both dictionaries contain introductions, prologues, etc. I instructed ABBYY FineReader, the OCR software I used, not to include these sections in its “reading” of the books, so that any analysis I carried out afterwards would take into account only the “dictionary section” (vocabulary words and their definitions).

Figure 1. Portion of the first page of Agüero’s dictionary (scanned image of p. 1).

 

Figure 2. Marks in the text may have affected its digitization process, as they could have “confused” the OCR software (scan from Gagini’s dictionary, p. 49).

Figure 3. Folds in the pages, such as this one, also affect the digitized text; entire words were lost in this scan (from Gagini’s dictionary, p. 242).

 

Figure 4. Text files created by ABBYY FineReader, after “reading” both dictionaries.

 

As figures 2 and 3 show, the OCR process is prone to errors, some of which may result from the quality of the scanned material. Other errors may arise from the nature of the text itself. For instance, Agüero’s dictionary contains an errata sheet, where errors in the text are listed alongside their page, column in the page, line in the column, and their correction. Ideally, I could have substituted the errors with their corrections in the text, and this would have given me a “perfected” version of the dictionary. However, an automatic way of carrying out this task didn’t occur to me, and going through it “manually” would have required considerable time (and the process itself would have been prone to errors as well). Therefore, the analysis of Agüero’s text was influenced by the same errors that his errata sheet corrects, which I left uncorrected. Originally, I included the sheet in the OCR process, but soon realized that this would lead to double and triple counting of words and letters in the results, because the text would include the errors in the sheet, their corrections, and the errors in the actual dictionary. Text analysis might even treat some of these errors and their corrections as the same, such as those corrected by changing a lowercase letter to an uppercase one.
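A minimal sketch of how part of the errata substitution could be automated; it was not part of this analysis. It assumes the OCR .txt output does not keep page, column, and line positions, so instead of locating each error by position it only applies a correction when the erroneous string occurs exactly once in the text, and sets the rest aside for manual review. The function name and the sample sentences are hypothetical and are not taken from Agüero’s errata sheet.

```python
def apply_errata(text, errata):
    """Apply (error, correction) pairs that match exactly once in text.

    Returns the corrected text and the pairs left for manual review.
    This is an illustrative heuristic, not the method used in the project.
    """
    unresolved = []
    for error, correction in errata:
        if text.count(error) == 1:
            # Unambiguous: the error string appears in a single place.
            text = text.replace(error, correction)
        else:
            # Zero or several matches: position data would be needed to decide.
            unresolved.append((error, correction))
    return text, unresolved


# Invented sample text and errata pairs, for illustration only.
SAMPLE_TEXT = "El perro ladra. El gato maula. El gato duerme."
SAMPLE_ERRATA = [("maula", "maúlla"), ("gato", "Gato"), ("xyz", "abc")]
corrected, pending = apply_errata(SAMPLE_TEXT, SAMPLE_ERRATA)
```

Corrections that differ from the error only in capitalization, like the second pair in the sample, tend to end up in the manual-review list because the lowercase form usually appears many times.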

 

Part II. Letters’ Relative Frequencies

Tools: Sublime Text, Microsoft Excel

The following image is a screenshot from Wikipedia’s entry on the appearance frequency of letters in Spanish (“Frecuencia de aparición de letras”, 2016):

Figure 5. Appearance percentages of letters in Spanish.

My purpose was to compare the relative frequencies of each letter in each dictionary to the data offered by Wikipedia. To do this, I had to consider not only the 26 letters of the English alphabet, uppercase and lowercase, but also the special characters used in Spanish, such as the accented vowels, the “ü” (“u” with diaeresis), and the “ñ” (all lowercase and uppercase). To account for all these letters, I wrote a script in Python (a programming language), using Sublime Text (a text editor), that essentially reads all the letters in the text (Spanish special characters included), keeps a tally of how many there are of each, and then calculates each letter’s relative frequency in the text as a percentage. These results are then saved in a comma-separated values (.csv) file, which can be opened in Microsoft Excel.

Figure 6. Python code to obtain relative frequencies of letters, from both Gagini’s and Agüero’s dictionaries.
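Since Figure 6 is an image, here is a minimal sketch of a script that follows the same steps (read the letters, tally them, compute percentages, save a .csv). It is a reconstruction under stated assumptions, not the code in the figure: the set of letters, the function names, and the choice to lowercase everything and to keep accented vowels as separate rows are all assumptions.

```python
import csv
from collections import Counter

# Assumed letter set: a-z plus the Spanish special characters.
SPANISH_LETTERS = set("abcdefghijklmnopqrstuvwxyzñáéíóúü")


def letter_frequencies(text):
    """Return {letter: percentage} for the Spanish letters found in text."""
    # Lowercase first so that, for example, "Ñ" and "ñ" are tallied together.
    counts = Counter(ch for ch in text.lower() if ch in SPANISH_LETTERS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: 100 * n / total for letter, n in sorted(counts.items())}


def write_frequencies_csv(frequencies, csv_path):
    """Save one row per letter so the file can be opened in Excel."""
    # utf-8-sig writes a byte order mark, which helps Excel detect the encoding.
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(["letter", "percentage"])
        for letter, percentage in frequencies.items():
            writer.writerow([letter, f"{percentage:.2f}"])


# Hypothetical usage (file names are placeholders):
#   with open("gagini.txt", encoding="utf-8") as handle:
#       write_frequencies_csv(letter_frequencies(handle.read()), "gagini.csv")
```

Digits, punctuation, and whitespace are ignored, so the percentages are relative to the total number of letters rather than to the total number of characters.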

I then gathered the data from both .csv files (Gagini’s and Agüero’s) in an Excel spreadsheet, together with the percentages from Wikipedia, to set up a table that I then used to create a bar graph. The graph allows a visual comparison of the data from all three sources. The results interest me because the dictionaries do, indeed, follow the pattern of appearance frequency of letters in Spanish: they differ by at most around 2 percentage points (compare Agüero’s “S” with Wikipedia’s), and some of the numbers are equal (such as Agüero’s and Wikipedia’s “W”). It makes sense to me that some of the letters would vary more than others. The “S”, for instance, is used frequently in Spanish at the end of words, to create plural nouns. And the “E” is common in prepositions (“de”, “en”), pronouns (“te”, “se”, “le”), articles (“el”), and other function words such as “que”. Therefore, their frequency depends on the style of the author and the nature of the text. Ideally, I would have isolated the vocabulary words, or entries, from both dictionaries to carry out the frequency analysis; this would have allowed me to study the appearance of letters in Costa Rican “slang words” only, as opposed to including their definitions and explanations as well, which are influenced by the lexicographers’ styles and choices. However, this task would have demanded a lot of time, as I’m not aware of any way in which it can be done automatically.

(After presenting my corpus to the class and mentioning this limitation, it was brought up that XML could provide a solution, even though it is also time-consuming. If I used XML to mark up the text, then every entry in the dictionaries could be an “entry” element; I would then be able to perform the analysis on these “entry” elements only.)
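The XML idea could be sketched as follows with Python’s standard xml.etree.ElementTree module. The markup is hypothetical: only the “entry” element comes from the suggestion above, while the “dictionary”, “headword”, and “definition” tag names are assumptions, and the two sample entries are invented for illustration rather than quoted from either dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; tag names other than "entry" are assumptions, and the
# sample entries are invented, not quoted from Gagini or Agüero.
SAMPLE_XML = """
<dictionary>
  <entry><headword>chunche</headword><definition>Cosa, objeto.</definition></entry>
  <entry><headword>güila</headword><definition>Niño o niña.</definition></entry>
</dictionary>
"""


def headwords(xml_text):
    """Return the headword of every <entry>, skipping the definitions."""
    root = ET.fromstring(xml_text.strip())
    words = []
    for entry in root.iter("entry"):
        headword = entry.find("headword")
        if headword is not None and headword.text:
            words.append(headword.text.strip())
    return words


# Joining the headwords gives a text made of vocabulary words only, which
# could then go through the same letter tally as the full dictionaries.
entries_only_text = " ".join(headwords(SAMPLE_XML))
```

The time-consuming part remains the markup itself; once it exists, restricting the analysis to the entries takes only a few lines.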

Table I. Relative frequencies of letters in Gagini’s and Agüero’s dictionaries, as well as Wikipedia’s data regarding frequencies of letters in Spanish.

 

Figure 7. Graph displaying data from Table I (the relative frequencies of each letter for Costa Rican slang dictionaries and for Spanish in general).

 

It is interesting to observe that the letters “K” and “W”, which have a relative frequency of approximately 0 in the texts, appear mostly (if not only) in words from other languages (such as English and native tongues like Bribri or Térraba), in botanical (Bot.) and other scientific names and terms, etc. The following image shows a search done in AntConc, a text analysis program, for the letter “K” in both dictionaries, and the contexts in which it appears. The highlighted words are an abbreviation for “inglés” (“English”), an abbreviation for “Botánica” (“Botany”), the names of two native Costa Rican tongues (Bribri and Térraba), and “inglés” (the Spanish word for “English”).

Figure 8. Appearance of “K” in both Gagini’s and Agüero’s dictionaries.

 

Part III. Abbreviations and Terms

Tools: AntConc, Microsoft Excel

To carry out a search for differences and patterns between both texts, I looked at the abbreviations offered by both. As figures 9 and 10 illustrate, the most efficient way to do this was to go through Gagini’s abbreviations list and select those abbreviations that also appear in Agüero’s list (given that Gagini’s is much shorter than Agüero’s).

Figure 9. Abbreviations from Gagini’s dictionary (p. 44).

 

Figure 10. Abbreviations from Agüero’s dictionary (pp. XXIII-XXIV).

I set up each of these abbreviations and the terms they refer to in an Excel table (Table II). As can be seen, I altered some of the terms and abbreviations (adding variations to some); this seemed necessary after searching for the terms and abbreviations in the text using AntConc, given that not all instances of a concept appeared in the dictionaries as indicated by their abbreviations lists. I also divided the terms into three categories; I thought this would be useful when comparing the appearances of each term/abbreviation in each text, as it might help me understand which categories were more important for each lexicographer. Based on the results, which are visually represented in figures 11 and 12, I believe I was right to introduce categories. Figure 11 shows that Gagini put greater emphasis on the geographical origin/use of words, while Figure 12 shows that Agüero refers more often to other languages, especially English. This last result is interesting considering that Gagini was well-known for his anti-imperialistic ideology against the United States (“Gagini, Carlos”, n.d.).

Table II. Terms, their abbreviations, and concordance hits for both Gagini’s and Agüero’s dictionaries.

 

Figure 11. Graph created from the data in Table II, for the “Geography” category only.

 

Figure 12. Graph created from the data in Table II, for the “Languages” category only.

 

The following images have been included to show certain results from AntConc searches through the texts that I believe are relevant and worth noting. Some of them affected the way I carried out this part of the texts’ analysis, while others just show interesting aspects of the texts.

Figure 13. References to Gagini in Agüero’s text (the phrase in blue reads “According to Gagini”).

 

Figure 14. Search for the abbreviation “fr.” (for “francés”, or “French”) in Gagini’s text. The highlighted phrases refer to people, most likely friars, for whom the abbreviation “fr.” was also used. These three instances of “fr.” were subtracted from the concordance hits of this particular abbreviation. The term enclosed in red shows an error that most likely occurred during the OCR “reading” of the original dictionary: the words “inglés” (“English”) and “plush” are not separated by a blank space. Therefore, this instance of the word “inglés” did not appear in the search for that term.

 

Figure 15. The abbreviation “fr.” was also problematic in Agüero’s text. As indicated in his abbreviations list, “fr.” stands not only for “francés” (“French”), but also for “frase” (“phrase”). Therefore, many of the concordance hits refer to “phrase” rather than “French”. The two highlighted terms show that “fr.” refers to “French” when it is preceded by the preposition “Del”, which means “from”.

Figure 16. Corrected search for “fr.” in Agüero’s dictionary. This time, the search looked for “del fr.”, and all the concordance hits show instances where the lexicographer refers to words in French.
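The difference between the two searches in figures 15 and 16 can be illustrated with a small Python sketch. It is not a reproduction of AntConc’s search behaviour; it only counts case-insensitive matches of a literal phrase, and it assumes the phrase should not be counted when it is glued to the end of a longer word. The function name and the sample text are invented for illustration and are not quoted from Agüero’s dictionary.

```python
import re


def count_hits(text, phrase):
    """Count case-insensitive matches of a literal phrase in text."""
    # (?<!\w) rejects matches that start in the middle of a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(phrase), re.IGNORECASE)
    return sum(1 for _ in pattern.finditer(text))


# Invented sample in the style of a dictionary entry, for illustration only.
SAMPLE = "BUFÉ. Del fr. buffet. || fr. fig. Estar de bufé. CHALÉ. (Del fr. chalet)."

all_fr_hits = count_hits(SAMPLE, "fr.")         # "francés" and "frase" mixed together
french_only_hits = count_hits(SAMPLE, "del fr.")  # narrowed to "Del fr."
```

Narrowing the phrase trades recall for precision: any reference to French that is not introduced by “Del” would be missed by the second search.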

 

Why did I write code to find the relative frequencies of letters if I could have used AntConc for the same purpose? Figure 17 shows the concordance hits for “ñ” in Gagini’s dictionary, which match the number of appearances of the letter that I obtained in Gagini’s .csv file (given that AntConc searches can be case-insensitive, they count “ñ” and “Ñ” as the same character). Even though the result is the same, it seemed that using AntConc for every letter in both texts would be more time-consuming than setting up the code, which analyzes all the letters in a text in a single pass. Also, figures 18 and 19 demonstrate that the analysis with AntConc would have required searching for accented vowels separately, something that the code does alongside the same vowels without the accents.

Figure 17. Search for “ñ” in Gagini’s text, using AntConc.

 

Figure 18. Search for “a” in Gagini’s text, using AntConc.

 

Figure 19. Search for “á” in Gagini’s text, using AntConc.
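As figures 18 and 19 suggest, “a” and “á” are distinct characters to a search tool. One way a script can treat them together is to fold accented vowels into their base letters before tallying, while leaving “ñ” untouched because it is a separate letter in Spanish. The sketch below uses Python’s standard unicodedata module; whether the original script merged accented vowels or reported them as separate rows is not stated here, so the folding step is an assumption, and the function name is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Remove accents and diaereses from vowels, keeping ñ and Ñ intact."""
    # Compose first so that "n" + combining tilde is seen as a single "ñ".
    text = unicodedata.normalize("NFC", text)
    folded = []
    for ch in text:
        if ch in "ñÑ":
            folded.append(ch)
            continue
        # Decompose the character and drop the combining marks (accents).
        decomposed = unicodedata.normalize("NFD", ch)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(folded)


folded_sample = fold_accents("Térraba, pingüino, año, Ñandú")
```

Applying this function to a text before counting would yield a single tally per vowel, which is closer to a table that lists only base letters plus “ñ”.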

 

Sources:

(1) Agüero Chaves, A. (1996). Diccionario de costarriqueñismos. San José: Asamblea Legislativa.

(2) Arturo Agüero Chaves. (n.d.). Retrieved from http://www.editorialcostarica.com/escritores.cfm?detalle=1344

(3) Carlos Gagini. (n.d.). Retrieved from http://www.editorialcostarica.com/escritores.cfm?detalle=990

(4) Diccionarios. (n.d.). Retrieved from http://www.acl.ac.cr/x.php 

(5) Frecuencia de aparición de letras. (February 18, 2016). Retrieved from https://es.wikipedia.org/wiki/Frecuencia_de_aparici%C3%B3n_de_letras

(6) Gagini, C. (1918). Diccionario de costarriqueñismos. San José: Imprenta Nacional.

(7) Gagini, Carlos. (n.d.). Retrieved from http://www.sinabi.go.cr/DiccionarioBiograficoDetail/biografia/146