As you can see from the examples, text analytics takes many forms, and there are countless ways to use it for searching, content analysis, and archiving the resulting data.
Where to find the texts?
In the context of text analysis, a text need not be a classic book: it can also be text extracted from websites, from social media (a set of tweets, Facebook statuses, or Instagram photo captions), from email communications, or from song lyrics. Thanks to projects like Google Books or Project Gutenberg, you can work with a plethora of texts in digital form.
Another source is corpora: electronic collections of texts in which you can search for words or phrases and see the contexts they appear in. The best-known types are corpora of spoken and written language, which linguists use to study how a particular language is used and how it evolves (for example, the Czech National Corpus). But there are also corpora of literature from specific periods or language areas, corpora of law, and corpora of song lyrics from different genres, TV series, soap operas, films, or magazines.
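The core corpus operation described above, finding a word together with its surrounding context, is usually called a keyword-in-context (KWIC) concordance. The following is a minimal sketch of the idea in plain Python (the function name `kwic` and the sample sentence are illustrative, not part of any corpus platform):

```python
import re

def kwic(text, keyword, width=30):
    """Keyword-in-context: list every match of `keyword` as a whole word,
    padded with up to `width` characters of context on each side."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines

sample = "Corpora let linguists study language. A corpus is a collection of texts."
for line in kwic(sample, "corpus"):
    print(line)
```

Real corpus tools add tokenization, lemmatization, and indexing on top of this, so that a query for "corpus" can also find "corpora", but the output format is the same concordance view.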
And where can you find text corpora? One option is the KonText platform, a website offering text corpora on a wide range of topics and in many languages. There is, for example, a corpus of air traffic controllers' communications, Facebook data prepared for sentiment analysis, and also web corpora. You can also find datasets for working with text in the so-called Virtual Language Observatory on the Clarin project website, which lists thousands of texts and text corpora. In Czech alone there is, for example, a news corpus, a Czech-Slovak parallel corpus, and a corpus of parliamentary proceedings.
And how do we work with text analysis?
Computer text analysis and work with text corpora have a long history at Masaryk University, where a number of interesting projects covering different areas of text processing can serve as inspiration.
Natural language processing
A major research area is natural language processing, the focus of the Centre for Natural Language Processing at MU. The Centre concentrates mainly on corpus linguistics, lexical databases, and the use of machine learning methods for automatic text processing. In addition to research on machine-human communication, it works on web-based technologies for text analysis, mining, and generation, and develops a range of tools. These include Sketch Engine, a tool for working with corpora and analysing text, as well as dictionaries, text prediction tools, random text generators, and, for example, a tool for adding commas to sentences. Other applications the Centre works on can detect topics in a text or recognize whether a text was written by a human or generated by a computer.
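To give a flavour of what "automatic text processing" means at its simplest, here is a naive sketch of topic-keyword detection: count word frequencies after filtering out common function words. This is an illustrative toy, not how the Centre's tools work; the stopword list and function name are assumptions for the example:

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real tools use much larger ones.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "it", "for"}

def top_keywords(text, n=3):
    """Naive keyword extraction: the n most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

text = "corpus analysis helps linguists, and corpus tools make analysis faster"
print(top_keywords(text, 2))
```

Production systems replace raw frequency with statistical measures (such as TF-IDF) and linguistic preprocessing, but the pipeline shape, tokenize, filter, score, rank, stays the same.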
What you can find out from the corpus of French rap...
One great example is RapCor, a corpus of French rap created at the Institute of Romance Languages and Literatures at the Faculty of Arts of Masaryk University. The corpus, gradually expanded since 2009, contains more than 4500 songs in French.