Chapter 4: Corpus-Based Translation Studies
Since 1993, researchers have used corpora in translation studies and have developed corpora specifically for this purpose. The term corpus refers to a body of natural-language material (written and spoken data) held in machine-readable form and analysable automatically or semi-automatically in a variety of ways. A corpus may include a large number of texts that originate from different sources and cover a wide range of topics. Texts can be full, i.e. complete, but they can also be part of a larger text, e.g. a chapter from a book.
There are three main types of corpora. In the majority of cases they are bilingual or multilingual, but monolingual corpora are also valuable for a number of applications. The types discussed in the literature include monolingual single and comparable corpora and bilingual and multilingual comparable corpora (for a full discussion of types of corpora, see Baker 1993 and Olohan 2004).
The type of corpus used for this session is a parallel corpus. Parallel corpora consist of original source-language texts in language A and their translated versions in language B; the results of a query to such a corpus are normally given as parallel sentences. The parallel corpus under investigation is composed of one English novel, Virginia Woolf’s The Waves (1931), and its two French translations: Les Vagues (1937), translated by Marguerite Yourcenar, and Les Vagues (1993), translated by Cécile Wajsbrot. In order to work with the texts, they have to be acquired, converted to electronic form and saved in ASCII text format for subsequent processing with the corpus tools (for more information on this process, see for instance Olohan 2004).
Tools and Software for Corpus Investigation
The corpus-based paradigm can be used to study various phenomena thanks to software packages which allow researchers to process texts in machine-readable form. These tools are used to collect information from the data stored in a corpus. The most widely used tools are the concordance program and the frequency list, both of which are available with WordSmith Tools, the first piece of software discussed below.
WordSmith Tools is an integrated suite of programs designed to facilitate the analysis of the behaviour of words in texts. It was developed by Mike Scott of the University of Liverpool and consists of three main tools: WordList, KeyWords and Concord.
The WordList facility lists all the word forms in a corpus in alphabetical order or in descending order of frequency. Items are classified according to a given scheme, and the program counts the number of items, or tokens, occurring in the text that belong to each classification, or type. The result is a list of all the words occurring in the corpus together with their frequency, which can be expressed both in raw form and as a percentage of the total number of words. WordList produces three lists: the first, ‘new wordlist F’, is a frequency list; the second, ‘new wordlist A’, is an alphabetical list; and the third, ‘new wordlist S’, presents statistics such as the type/token ratio and the mean sentence length.
For instance, the following table shows the first twenty types in Woolf’s The Waves (1931), according to their frequency:
The most common types in this novel, and indeed in most pieces of writing, are ‘function’ or ‘grammatical’ words, i.e. conjunctions, determiners or personal pronouns. The second column presents the type’s absolute frequency in the corpus, and the third one its relative frequency. For instance, there are 2,452 instances of the type I and it accounts for 3.14% of all the tokens in the corpus.
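The counting behind such a frequency list can be sketched in a few lines of Python. This is only an illustration of the principle, not a description of how WordSmith Tools works internally: the function names, the simple tokenisation rule (lowercased alphabetic strings, with an optional internal apostrophe) and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(text):
    """Return (type, absolute frequency, relative frequency in %) rows,
    sorted by descending frequency and then alphabetically."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, n, 100 * n / total) for word, n in rows]


# Invented sample sentence, not taken from the corpus.
sample = "The waves broke on the shore. The waves fell."
for word, n, pct in frequency_list(sample):
    print(f"{word:<8}{n:>4}{pct:>8.2f}%")
```

The relative frequency is simply the absolute frequency divided by the total number of tokens. By the same arithmetic, 2,452 occurrences amounting to 3.14% of the corpus imply a total of roughly 78,000 tokens (2,452 / 0.0314), allowing for rounding of the percentage.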
The following table shows the ‘WordList S’ for The Waves:
This list gives statistics about the text under investigation, such as the number of running words (tokens), the number of different words (types) and the type/token ratio, which gives a measure of the range of vocabulary in the text.
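As a rough sketch, the statistics mentioned above can be computed as follows. The function name, the tokenisation rule and the naive sentence splitter are assumptions made for illustration, and the type/token ratio is expressed here as a percentage; figures produced by a dedicated tool may differ because its tokenisation and sentence rules differ.

```python
import re


def text_statistics(text):
    """Compute token count, type count, type/token ratio and mean sentence length."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        raise ValueError("the text contains no word tokens")
    types = set(tokens)
    # Naive sentence split on '.', '!' or '?'; real tools use more careful rules.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "types": len(types),
        "type_token_ratio": 100 * len(types) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }


# Invented sample sentence, not taken from the corpus.
stats = text_statistics("The waves broke on the shore. The waves fell.")
for name, value in stats.items():
    print(f"{name}: {value}")
```

Because function words recur constantly, the type/token ratio tends to fall as a text gets longer, so it is best compared across texts of similar length.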
The concordance program allows the user to search the corpus for a selected item or ‘node’. Hence, if the user is interested in the usage of a particular word, he or she can search for it with the concordance program, which will retrieve all the occurrences of that word from the corpus and return them as a list. At this stage, the computer returns a KWIC concordance, KWIC being an acronym for Key Word In Context. This is a list of all the occurrences of a specified keyword in the corpus, with each instance of the word set in the middle of a line and surrounded by its immediate context.
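The principle of a KWIC display can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm used by Concord: the function name kwic, the width parameter (number of context characters on each side) and the sample text are hypothetical.

```python
import re


def kwic(text, node, width=30):
    """Return one line per occurrence of `node`, with the node centred
    between `width` characters of left and right context."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(rf"\b{re.escape(node)}\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that every node starts in the same column.
        lines.append(f"{left:>{width}} {match.group()} {right:<{width}}")
    return lines


# Invented sample text, not taken from the corpus.
sample = "The wave rose. Another wave fell on the shore, and a third wave followed."
for line in kwic(sample, "wave", width=15):
    print(line)
```

Only the exact word form is matched here, so a search for wave does not return waves; this is the limitation that wildcards are meant to address.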
An example of the word wave in the corpus is shown below:
KWIC concordances can be sorted in a variety of ways, for example alphabetically according to the words appearing to the left or right of the keyword. Concord also allows the user to view extended co-text on screen and to look at the whole sentences or paragraphs in which the word has been located. As with type/token ratios, one has to be aware of homographs and lemmas. One way to overcome these difficulties is to use a wildcard. Wildcards are characters that stand in for others. The Kleene star ‘*’ replaces any number of characters in a search term, while ‘?’ replaces a single character. For instance, ‘wav*’ retrieves forms such as wave, waves, waved, waving, waver, wavers, wavered and wavering. Wildcards can also be combined, as in ‘nationali?*’, which will give: nationalise, nationalize, nationalizing, nationalising, nationalized, nationalised, nationalizes, nationalises, etc. Finally, they can also be used in ‘phrase’ searches like ‘on * wave’. WordSmith Tools thus allows the user to identify the word-forms under investigation, which can then be used as search terms in the multilingual concordancer Multiconcord, which is discussed in the next section.
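The wildcard behaviour described above can be approximated by translating a search term into a regular expression. The sketch below is an assumption-laden illustration rather than a description of WordSmith's own matching: the function name wildcard_to_regex is hypothetical, ‘*’ is taken to mean any number of word characters within a single word, and ‘?’ exactly one word character.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard search term into a compiled regular expression.
    Assumption: '*' = zero or more word characters, '?' = exactly one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


# Invented sample texts, not taken from the corpus.
text = "The wave wavered while waves kept waving; a waiver was signed."
print(wildcard_to_regex("wav*").findall(text))

print(wildcard_to_regex("nationali?*").findall(
    "They nationalise and nationalized; national pride"))

phrase = wildcard_to_regex("on * wave").search("riding on the wave of change")
print(phrase.group() if phrase else None)
```

Note that a pattern such as ‘wav*’ retrieves every form beginning with wav, including unrelated words, so the output of a wildcard search still has to be checked by hand.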