Enhancing a spellcheck dictionary by Wikidata lexemes
2021-09-24, 12:30–13:00, Room 2

The talk summarizes recent changes in the development of Czech spellcheck dictionaries, with a focus on using lexemes from Wikidata as a source of words for a new experimental dictionary. It also covers an update of the default dictionary and the use of a new public data set of words.

The world of free Czech spellchecking dictionaries has made progress in recent years. The most widely used Hunspell dictionary, shipped with LibreOffice and many other FOSS applications, underwent a significant update a few months ago. Moreover, a new experimental dictionary has been created by combining a data set released by a university with lexicographical data available at Wikidata. The individual data sources will be described, including a summary of their advantages and weaknesses. The main focus will be on the Wikidata lexemes project, a database of words and their properties for any language. As a universal platform, it can be used to create various language tools, and it provides a convenient user interface that makes adding and editing data reasonably simple for any volunteer. Wikidata can therefore be considered a great place to store and maintain the data behind spellcheck dictionaries.
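To make the idea of using lexemes as a word source more concrete, here is a minimal sketch in Python. It is not the workflow presented in the talk. It assumes lexeme records shaped like Wikidata's lexeme JSON (a `lemmas` map and a list of `forms`, each with per-language `representations`), and it writes a flag-less Hunspell `.dic` body, where the first line is the word count and each following line is one word. The names `SAMPLE_LEXEMES`, `collect_forms`, and `to_dic` are hypothetical, and the sample data is hand-written for illustration rather than taken from Wikidata.

```python
# Hand-written sample records that mimic the shape of Wikidata lexeme JSON.
# They are illustrative only and were not downloaded from Wikidata.
SAMPLE_LEXEMES = [
    {
        "lemmas": {"cs": {"language": "cs", "value": "pes"}},
        "forms": [
            {"representations": {"cs": {"language": "cs", "value": "pes"}}},
            {"representations": {"cs": {"language": "cs", "value": "psa"}}},
            {"representations": {"cs": {"language": "cs", "value": "psovi"}}},
        ],
    },
    {
        "lemmas": {"cs": {"language": "cs", "value": "kočka"}},
        "forms": [
            {"representations": {"cs": {"language": "cs", "value": "kočka"}}},
            {"representations": {"cs": {"language": "cs", "value": "kočky"}}},
        ],
    },
]


def collect_forms(lexemes, lang="cs"):
    """Return the set of lemma and form spellings recorded for one language."""
    words = set()
    for lexeme in lexemes:
        lemma = lexeme.get("lemmas", {}).get(lang)
        if lemma:
            words.add(lemma["value"])
        for form in lexeme.get("forms", []):
            representation = form.get("representations", {}).get(lang)
            if representation:
                words.add(representation["value"])
    return words


def to_dic(words):
    """Render words as a flag-less Hunspell .dic body: count line, then one word per line."""
    ordered = sorted(words)  # plain code-point order, not Czech collation
    return "\n".join([str(len(ordered)), *ordered]) + "\n"


dic_text = to_dic(collect_forms(SAMPLE_LEXEMES))
print(dic_text, end="")
```

Listing every inflected form explicitly keeps the sketch simple but ignores Hunspell's affix rules, which a production dictionary would normally use to keep the word list compact.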

See also: slides of the presentation

Stanislav Horáček is a programmer by profession, a long-term LibreOffice contributor, and a member of TDF since 2014. He works mainly on the Czech localization of LibreOffice and, more recently, on the maintenance of spellcheck dictionaries. He also enjoys promoting the office suite at events in the Czech Republic.