Introduction
Collocations are word pairs and phrases in some defined relationship that occur together more frequently than chance would predict (as opposed to free word combinations) and are not governed by general syntactic or semantic rules. Due to their extensive use, choosing the correct collocation is necessary to achieve language fluency. They differ from idioms in that individual words in the collocation can contribute to the semantics of the n-gram. In many collocations, one word is used figuratively while the rest of the words in the n-gram are used in their normal sense (Dirk, Kerry et al. 1994).
Collocations can be either rigid or flexible. Rigid collocations are those n-grams that always occur adjacent to one another and appear in the same order. Flexible collocations are n-grams that can have intervening words placed between them, or can occur in either order, and may even allow some inflected forms.
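The distinction can be illustrated with a minimal sketch. The function names, the max_gap parameter and the sample sentences are hypothetical, and inflected forms are not handled here:

```python
def find_rigid(tokens, first, second):
    """Positions where `first` is immediately followed by `second`."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i] == first and tokens[i + 1] == second]


def find_flexible(tokens, word_a, word_b, max_gap=3):
    """(i, j) index pairs where word_a (at i) and word_b (at j) co-occur,
    in either order, with at most `max_gap` intervening tokens."""
    hits = []
    for i, token in enumerate(tokens):
        if token != word_a:
            continue
        start = max(0, i - max_gap - 1)
        stop = min(len(tokens), i + max_gap + 2)
        hits.extend((i, j) for j in range(start, stop)
                    if j != i and tokens[j] == word_b)
    return hits


# Illustrative sentences only.
tokens = "she set the kitchen table before the table was set".split()
rigid_hits = find_rigid("the white house said".split(), "white", "house")
flexible_hits = find_flexible(tokens, "set", "table", max_gap=2)
```

A rigid search for set followed directly by table finds nothing in the second sentence, while the flexible search finds both occurrences, one in each order.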
Collocations can be used to resolve lexical ambiguities. As has been shown by Yarowsky (1993), an ambiguous word within a specific collocation has just one sense with a very high degree of probability. Ambiguous nouns were disambiguated using adjacent adjectives and nouns with 99% precision. Adjectives were disambiguated with 98% precision, while verbs were disambiguated with 95% precision when collocated with an object and 90% precision when collocated with a subject.
Collocations are common, conventional expressions whose choice among similar words can be considered arbitrary. The fact that in English we speak of a Heavy Smoker as opposed to a Fat Smoker is an arbitrary choice among similar words. However, choosing the correct one is critical to communicating fluently in the language. Substituting a synonym for a word in a collocation will most likely result in an awkward construction (McKeown and Radev 1999).
It should be noted that collocations are dialect specific (for example Set the table in American English and Lay the table in British English) as well as language specific (for example Heavy Smoker translates into French as Grand Fumeur as opposed to Gros Fumeur even though both Grand and Gros are valid translations of the English word Heavy). Collocations also recur in similar contexts. For example, the collocation White House tends to occur in the context of US national politics. Lastly, collocations are “interesting” n-grams, which excludes the most common n-grams, such as "it is" and "in the".
Translation Strategies
In translating from one language to another when several translation equivalents are available, there is an issue of selection (Bian and Chen 2000). There are several strategies that can be used to help in this selection.
1) Select Highest Frequency – Choose the sense in the target language that has the highest frequency. This will provide a translation only as good as the target language corpus and is quite problematic.
2) Select All – Use all of the equivalents when translating and concatenate the results. This is somewhat useful when translating IR queries.
3) Word Co-occurrence – Use the content surrounding the translation equivalents to determine the best selection. Mutual Information is the most common method used to find the strongest relationship:
$$MI(X, Y) = \log_2 \frac{P(X, Y)}{P(X)\,P(Y)}$$
where X and Y are the two terms, P(X) and P(Y) are the probabilities of X and Y, and P(X, Y) is the probability that X and Y co-occur.
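As a minimal sketch of this score, assuming the probabilities are estimated from raw counts over a fixed number of observation windows and that the logarithm is taken base 2 (the function name and all counts below are hypothetical):

```python
import math


def mutual_information(count_xy, count_x, count_y, total):
    """Mutual information of two terms: log2(P(X,Y) / (P(X) * P(Y))).

    Probabilities are estimated as count / total (an assumption of this
    sketch, not something the text specifies).
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Made-up counts for two candidate translation equivalents of one term.
score_a = mutual_information(count_xy=30, count_x=100, count_y=60, total=10000)
score_b = mutual_information(count_xy=2, count_x=100, count_y=400, total=10000)
preferred = "a" if score_a > score_b else "b"
```

A positive score means the two terms co-occur more often than independence would predict; a negative score means less often. Under the co-occurrence strategy, the candidate with the higher score is preferred.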
Smadja et al. (1996) show that the Dice coefficient may be a better choice for selecting correct collocation translations. In their study, there were no instances where the mutual information (referred to as the specific mutual information, SI) produced a correct translation while the Dice coefficient produced an incorrect one. In addition, the Dice coefficient correctly translated more than half of the erroneous results given by the SI.
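For reference, the Dice coefficient in its usual frequency form, followed by a worked example on made-up counts. The symbols f_X, f_Y and f_XY (the number of observations containing X, Y, and both) are notation assumed here rather than taken from the text:

```latex
\mathrm{Dice}(X, Y) = \frac{2\, f_{XY}}{f_X + f_Y}

% Worked example with hypothetical counts f_X = 100, f_Y = 60, f_{XY} = 30:
\mathrm{Dice}(X, Y) = \frac{2 \cdot 30}{100 + 60} = 0.375
```

Unlike the mutual information score, this form depends only on observations in which at least one of the two terms appears, not on the total size of the corpus.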
Lexical Functions
The concept proposed by Igor Mel’čuk (as described in D. Heylen and K. Maxwell 1996) is that there is a semantic relationship in collocations between the base and the collocate, typically with nouns as bases and verbs and adjectives as collocates. While there are various other words that are synonymous with the collocate, those synonyms are not used with a given base. However, a synonym may be the appropriate collocate for another base. For example, while heavy and weighty are synonyms, and we would speak of a heavy smoker and a weighty matter, we would not speak of a weighty smoker or a heavy matter.
What the two terms heavy and weighty do have in common is that they magnify or intensify the base term they are collocated with. This then brings us to the idea of a lexical function that can be incorporated into a dictionary. For example we could have a magnify function, Magn:
magn(smoker) = heavy
magn(matter) = weighty
This can then be extended to translation by translating the base directly and then using the lexical function to determine the collocate. Translating heavy smoker into French would then produce magn(fumeur) = grand, giving us grand fumeur, while for German magn(Raucher) = starker would give us starker Raucher.
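One way to picture such a dictionary is as a lookup table keyed by lexical function, language and base. The sketch below is illustrative only: the table layout, the names LEXICAL_FUNCTIONS, BASE_TRANSLATIONS and translate_collocation, and the collocate-before-base word order are all assumptions, while the entries themselves are the examples from the text.

```python
# Hypothetical mini-lexicon holding only the examples discussed above.
LEXICAL_FUNCTIONS = {
    "Magn": {
        "en": {"smoker": "heavy", "matter": "weighty"},
        "fr": {"fumeur": "grand"},
        "de": {"Raucher": "starker"},
    }
}

# Hypothetical direct translations of the base word.
BASE_TRANSLATIONS = {
    ("smoker", "fr"): "fumeur",
    ("smoker", "de"): "Raucher",
}


def translate_collocation(base, function, target_lang):
    """Translate the base directly, then apply the lexical function in the
    target language to choose the collocate. Returns None on any gap."""
    target_base = BASE_TRANSLATIONS.get((base, target_lang))
    if target_base is None:
        return None
    collocate = (LEXICAL_FUNCTIONS.get(function, {})
                 .get(target_lang, {})
                 .get(target_base))
    if collocate is None:
        return None
    # Assumption: the collocate precedes the base, as in the examples.
    return f"{collocate} {target_base}"


french = translate_collocation("smoker", "Magn", "fr")
german = translate_collocation("smoker", "Magn", "de")
```

The None results for missing entries hint at the coverage problem: the approach only works where both the base translation and the function value have been recorded.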
There are several issues with relying upon lexical functions. While over 50 lexical functions have been defined (for example, Degrad expresses making things worse or bad, while Bon expresses approval), there are questions about the comprehensiveness of the functions, as well as issues of overgenerality and syntactic divergence.
Aligned Corpora
The use of aligned corpora is one approach that addresses the shortcomings of lexical functions. Much like the Rosetta Stone was used as a foundation for creating the first hieroglyphics dictionary, aligned corpora can be used to find appropriate translations of collocations in one language into collocations (or individual words, if appropriate) in a second language. Many concepts are translated by collocations in which no individual word is a direct translation of a word in the source collocation. For example, the English phrase demonstrate support corresponds to the French phrase prouver son adhésion (Smadja, McKeown et al. 1996).
Starting with a large, aligned corpus in two languages, all of the interesting collocations are extracted from one language. By using chaining within a fixed window, and then choosing the translated chain with the highest Dice coefficient, a correct collocation translation can be determined.
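The selection step can be sketched in a much simplified form: count, over aligned sentence pairs, how often the source collocation and each candidate translation appear, and keep the candidate with the highest Dice coefficient (twice the joint count divided by the sum of the individual counts). This is not the chaining procedure itself, only the final scoring. The function names, the substring matching and the toy corpus fragments are all assumptions made for illustration.

```python
def dice_from_aligned(pairs, source_phrase, target_phrase):
    """Dice coefficient of two phrases over aligned (source, target) pairs."""
    src = sum(1 for s, _ in pairs if source_phrase in s)
    tgt = sum(1 for _, t in pairs if target_phrase in t)
    both = sum(1 for s, t in pairs
               if source_phrase in s and target_phrase in t)
    return 2 * both / (src + tgt) if src + tgt else 0.0


def pick_translation(pairs, source_phrase, candidates):
    """Return the candidate with the highest Dice score."""
    return max(candidates,
               key=lambda c: dice_from_aligned(pairs, source_phrase, c))


# Toy aligned fragments, invented for this sketch.
pairs = [
    ("we demonstrate support", "prouver son adhésion"),
    ("they demonstrate support today", "prouver son adhésion aujourd'hui"),
    ("we support the team", "soutenir l'équipe"),
    ("they demonstrate the method", "démontrer la méthode"),
]
candidates = ["prouver son adhésion", "démontrer", "soutenir"]
best = pick_translation(pairs, "demonstrate support", candidates)
```

On this toy data the word-for-word candidates score zero, because they never appear in a pair aligned with the source collocation.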
Monolingual Corpora
Unfortunately, there are very few languages for which large aligned corpora exist. Other approaches need to be used when all that exists are limited bilingual dictionaries and monolingual corpora. One approach looks at the head, a dependent, and a dependency relationship to produce a dependency triple (Lü and Zhou 2004) represented as (w1, r, w2), where w1 and w2 are words and r is the dependency relation. The three primary types utilized for machine translation are verb-object, noun-adjective, and verb-adverb. A best candidate can then be determined using Bayes’s Theorem.
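A minimal sketch of the Bayesian selection, under several assumptions that are not taken from the text: the best target triple maximizes P(target triple) times P(source triple | target triple), the dependency relation carries over unchanged, and the two words translate independently. The function name and every probability below are made up for illustration.

```python
from itertools import product


def best_target_triple(source_triple, word_translations, triple_prob, word_prob):
    """Return (triple, score) maximizing P(T) * P(S | T) over candidates.

    word_translations: source word -> list of candidate target words
    triple_prob: target triple -> P(T), e.g. from a monolingual corpus
    word_prob: (source word, target word) -> P(source word | target word)
    """
    w1, rel, w2 = source_triple
    best, best_score = None, 0.0
    candidates = product(word_translations.get(w1, []),
                         word_translations.get(w2, []))
    for t1, t2 in candidates:
        target = (t1, rel, t2)  # assumption: relation is preserved
        prior = triple_prob.get(target, 0.0)
        likelihood = word_prob.get((w1, t1), 0.0) * word_prob.get((w2, t2), 0.0)
        score = prior * likelihood
        if score > best_score:
            best, best_score = target, score
    return best, best_score


# Hypothetical dictionary entries and probabilities.
word_translations = {"smoker": ["fumeur"], "heavy": ["grand", "gros", "lourd"]}
triple_prob = {
    ("fumeur", "noun-adjective", "grand"): 0.002,
    ("fumeur", "noun-adjective", "gros"): 0.0005,
}
word_prob = {
    ("smoker", "fumeur"): 1.0,
    ("heavy", "grand"): 0.2,
    ("heavy", "gros"): 0.3,
    ("heavy", "lourd"): 0.5,
}
best, score = best_target_triple(
    ("smoker", "noun-adjective", "heavy"),
    word_translations, triple_prob, word_prob,
)
```

With these numbers the target-language prior outweighs the dictionary preference, so the less likely word-level translation wins because it forms the more frequent triple.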
Using the dependency triple, a triple translation approach checks that a collocation has been translated correctly using the following steps:
1) Extract a collocation in the source language, giving Scol
2) Acquire the best dependency triple in the destination language, giving Ttri
3) Retranslate the translated dependency triple back to the source language, giving Stri
4) If Scol = Stri, we can assume that Ttri is a good translation for Scol
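The round-trip check in these steps can be sketched as follows, assuming the forward and backward translation steps are already available as lookup tables. The function name, both tables and their entries are hypothetical.

```python
def confirm_translation(source_collocation, to_target, to_source):
    """Return the target triple if it translates back to the source
    collocation, otherwise None."""
    target_triple = to_target.get(source_collocation)  # Ttri
    if target_triple is None:
        return None
    back_translated = to_source.get(target_triple)     # Stri
    if back_translated == source_collocation:          # Scol = Stri
        return target_triple
    return None


heavy_smoker = ("smoker", "noun-adjective", "heavy")
fat_smoker = ("smoker", "noun-adjective", "fat")
grand_fumeur = ("fumeur", "noun-adjective", "grand")

# Hypothetical results of the forward and backward translation steps.
to_target = {heavy_smoker: grand_fumeur, fat_smoker: grand_fumeur}
to_source = {grand_fumeur: heavy_smoker}

confirmed = confirm_translation(heavy_smoker, to_target, to_source)
rejected = confirm_translation(fat_smoker, to_target, to_source)
```

Here the first triple survives the round trip, while the second is rejected because its translation comes back as a different source triple.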
Conclusion
While many of the papers referred to the methods employed in prior papers, none of them suggested combining these various methods. It may be interesting to see the results of translating collocations if one used lexical functions in a first pass, and then processed the remaining text using dependency triples.
References
Bian, G.-W. and H.-H. Chen (2000). "Cross-language information access to multilingual collections on the internet." J. Am. Soc. Inf. Sci. 51(3): 281-296.
Dirk, H., G. M. Kerry, et al. (1994). Lexical functions and machine translation. Proceedings of the 15th conference on Computational linguistics - Volume 2. Kyoto, Japan, Association for Computational Linguistics.
Lü, Y. and M. Zhou (2004). Collocation translation acquisition using monolingual corpora. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Barcelona, Spain, Association for Computational Linguistics.
McKeown, K. R. and D. R. Radev (1999). Collocations. A Handbook of Natural Language Processing. R. Dale, H. Moisl and H. Somers. New York, Marcel Dekker.
Smadja, F., K. R. McKeown, et al. (1996). "Translating collocations for bilingual lexicons: a statistical approach." Comput. Linguist. 22(1): 1-38.
Yarowsky, D. (1993). One sense per collocation. Proceedings of the workshop on Human Language Technology. Princeton, New Jersey, Association for Computational Linguistics.





