DUCKS in a Row: Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon

Date:

Proceedings available from here.

This paper introduces DUCKS, Data Unified Conceptual Knowledge Sets, a tool for aligning lexical data across any number of languages. A common starting point in producing a multilingual dictionary is to merge bilingual datasets through the overlapping words in a shared pivot language. The essential problem for maintaining accuracy across languages is determining which senses of a polysemous pivot term actually match: a term in Language-A meaning “spicy” might well be paired with a term in Language-C meaning “sweltering” because both are connected to English “hot”. DUCKS addresses this problem through a game-like interface that invites experts and interested members of the public to participate in the sense disambiguation of linguistic datasets. DUCKS starts with the 100,000 concepts that the Princeton WordNet defines for 200,000 English lemmas; the English inventory will be expanded through a version of the game that matches senses from Wiktionary. In the basic case, we start with a dataset between Language-A and English. When a user selects a term in Language-A, we show all the contextual information about that item in a graphic block on the left of the screen, and all the senses of the designated English term on the right. The user slides the block to the definition that best matches the meaning in Language-A. If two or more English senses apply, duplicate blocks are available. The user may also select “no definition applies” when relevant. If English is absent from the dataset, the user must first type an equivalent term in English or in another language that has already been aligned. We then fetch the possible senses, and play proceeds as above. A match is considered valid once a threshold number of players have made the same selection. DUCKS does not address semantic drift, which is resolved in other games the project has developed.
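The validation rule described above (a match counts once enough players agree) can be illustrated with a minimal Python sketch. This is not the DUCKS implementation: the function name, the sense labels such as `hot#spicy`, the `NO_MATCH` marker, and the threshold value of 3 are all assumptions made for illustration, since the abstract does not state the actual threshold or data format.

```python
from collections import Counter

# Hypothetical marker for the "no definition applies" choice.
NO_MATCH = "no definition applies"


def validated_senses(selections, threshold=3):
    """Return the senses chosen by at least `threshold` distinct players.

    `selections` is an iterable of (player_id, sense_label) pairs for one
    Language-A term. A player may pick several senses (duplicate blocks),
    but repeating the same pick does not count twice.
    The default threshold of 3 is an assumed value, not taken from the paper.
    """
    distinct_picks = set(selections)  # drop repeated (player, sense) pairs
    votes = Counter(sense for _player, sense in distinct_picks)
    return {sense for sense, count in votes.items() if count >= threshold}


# Made-up selections for one Language-A term paired with English "hot".
example = [
    ("p1", "hot#spicy"),
    ("p2", "hot#spicy"),
    ("p3", "hot#spicy"),
    ("p1", "hot#spicy"),        # repeat by the same player, ignored
    ("p4", "hot#sweltering"),   # a single dissenting pick
    ("p5", NO_MATCH),
]
result = validated_senses(example)
```

In this made-up example only `hot#spicy` reaches the assumed threshold, so it is the only sense treated as a valid match; the single `hot#sweltering` pick and the single "no definition applies" pick remain unconfirmed.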
Sense alignment integrates Language-A with every other language in the system that shares similar concepts, enabling accurate multilingual exchange. It can also surface concepts that have no English equivalent and may be unique to that language. Data has been aligned among several dozen languages to date, beginning with languages whose open data was previously linked to WordNet. A major challenge now is that many existing datasets for less-resourced languages are closed data; it is hoped that DUCKS will inspire their proprietors to join the multilingual lexicon.
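To illustrate why aligning on concepts rather than on pivot words matters, the following Python sketch joins two toy datasets both ways. It is a sketch under stated assumptions, not the project's code: `join_on_key`, the placeholder terms (`a_spicy`, `c_sweltering`, `c_spicy`), and the concept labels are invented for the example.

```python
from collections import defaultdict


def join_on_key(left, right):
    """Pair terms from two datasets that share the same key.

    `left` and `right` are lists of (term, key) pairs. The key is either an
    English pivot word (naive merge) or a concept label (sense-aligned merge).
    """
    index = defaultdict(set)
    for term, key in right:
        index[key].add(term)
    return {
        (l_term, r_term)
        for l_term, key in left
        for r_term in index.get(key, ())
    }


# Placeholder data: every term is linked to English "hot" in its source.
a_to_english = [("a_spicy", "hot")]
c_to_english = [("c_sweltering", "hot"), ("c_spicy", "hot")]

# The same terms after sense disambiguation (hypothetical concept labels).
a_to_concept = [("a_spicy", "hot#spicy")]
c_to_concept = [("c_sweltering", "hot#sweltering"), ("c_spicy", "hot#spicy")]

naive_pairs = join_on_key(a_to_english, c_to_english)
aligned_pairs = join_on_key(a_to_concept, c_to_concept)
```

Joining on the pivot word pairs `a_spicy` with both Language-C terms, including the spurious `c_sweltering`; joining on concept labels keeps only the `a_spicy` and `c_spicy` pair.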