Creation of corpora and corpus linguistics
A corpus is generally defined as a large, structured and comprehensive collection of texts in a given language, professionally processed and stored in electronic form. Software used to work with corpora is known as a corpus manager. A properly compiled corpus makes it easy to search for terms and to track linguistic phenomena, especially words and phrases (collocations), including how frequently they occur and how they are used. Individual expressions or phenomena can be examined in their natural context, which enables data-driven linguistic research on a scale that would not be possible without digital technologies.
Language corpora are useful both in the study of language itself and in the content analysis of literary and other works, such as song lyrics. They are also valuable in translation, where so-called parallel corpora provide a basis for the work.
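The kinds of queries described above (frequency counts, words in context, and collocations) can be illustrated with a minimal sketch in Python. It is not how any particular corpus manager works internally. The three-sentence toy corpus and the function names (`tokenize`, `word_frequencies`, `concordance`, `bigram_counts`) are assumptions made for illustration only, and counting adjacent word pairs is a crude stand-in for the statistical collocation measures that real tools apply to much larger, annotated corpora.

```python
import re
from collections import Counter

# Toy corpus invented for illustration; a real corpus would be far larger.
TEXTS = [
    "The strong tea was served with strong coffee on the side.",
    "She prefers strong tea in the morning.",
    "Heavy rain and strong wind closed the road.",
]


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def concordance(texts, term, width=2):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == term:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{token}] {right}".strip())
    return lines


def bigram_counts(texts):
    """Count adjacent word pairs within each text (a rough collocation proxy)."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


if __name__ == "__main__":
    print(word_frequencies(TEXTS).most_common(3))
    for line in concordance(TEXTS, "tea"):
        print(line)
    print(bigram_counts(TEXTS).most_common(1))
```

Even on this tiny sample, the pair "strong tea" recurs more often than any other adjacent pair. On a real corpus the same kind of observation is what makes a collocation visible.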
Tools for corpus management
- Sketch Engine: Sketch Engine (SkE) is software that retrieves word profiles (word sketches), groups them by grammatical relation and creates thesauri from the corpus.
- KonText: The KonText interface is a web application used to access and work with the corpora of the Czech National Corpus (CNK).
- TEITOK: TEITOK is an online platform for working with corpora that serves as an alternative to KonText.
- NameTag: NameTag is an open-source named entity recognition (NER) tool. It identifies proper names in text and classifies them into predefined categories such as names of people, places and organizations.
- MorphoDiTa: MorphoDiTa (Morphological Dictionary and Tagger) is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization, and is distributed as a standalone tool or library along with trained linguistic models.
- Voyant Tools: Voyant Tools is a web-based tool for viewing and analyzing digital texts.