Text analysis

Text analysis using computational tools is one of the oldest and most fundamental methods in the Digital Humanities. It is a process in which the computer helps us search, analyze, and visualize large amounts of textual data that would otherwise be difficult (and sometimes impossible) to process with traditional methods.

4 Jun 2021 Veronika Wölfelová

The method of computer-assisted text analysis (albeit in a very simplified form) was first used by Roberto Busa, whose Index Thomisticus is widely regarded as the founding project of the Digital Humanities. Today, text analysis has its place in a wide range of disciplines such as literary studies, linguistics, and natural language processing, but also in the social sciences and even marketing.

How the first Digital Humanities project was created

The story of the Italian Jesuit Roberto Busa has become, with some exaggeration, a legend in the history of the Digital Humanities. Roberto Busa, a Catholic priest and theologian interested in the work of Thomas Aquinas, sought a way to study Aquinas' writings better and more efficiently, to be able to search them more quickly and to work with them as a single corpus of texts. In the 1940s, when Busa learned of early computer technologies that could help his research, he approached Thomas J. Watson, the head of IBM, and began several years of work to lemmatise Aquinas' writings and convert them into digital form. The result of more than 30 years of work is the Index Thomisticus, now a fully digital, lemmatized corpus of Thomas Aquinas' works.

What is the purpose of text analysis? Or: how to read thousands of books...

Imagine that, as a scholar, you are studying, say, literary works from the Romantic period. Following traditional methods, you would pick a few representative works from this period, read them, and interpret each one. Or you are a political scientist who wants to explore how people on social media have reacted to a political event.

You might read a few comments and posts on Twitter or Facebook, and work up an analysis of them. These are classic examples of "close reading". You can read and process dozens or even hundreds of texts very well using this method. But what if there are thousands of such texts?

In 2000, Italian professor Franco Moretti came up with the concept of "Distant reading". He argued that it is not humanly possible to read, for example, all the literary works from the Romantic period or all the articles and papers describing a single event, and thus, he said, we are depriving ourselves of a large amount of data.

Computer text analysis tools let you examine a larger number of texts for patterns or prevailing sentiment without reading them one by one. You can process entire corpora of literary works, song lyrics, electronic communications, and social media posts. Text analysis reveals patterns you would miss in ordinary reading.

Close reading

  • focuses on in-depth processing of smaller amounts of text
  • is more subjective
  • its undeniable advantage is interpretation
  • humans understand the meaning of words and text better than computers

Distant reading

  • focuses on large amounts of text that would be beyond human power to read and process
  • is more objective, working with statistical methods

Different forms of textual analysis

What people most often associate with text analysis is counting the frequency of words and phrases, but text analysis offers much more. An interesting example is sentiment analysis, a method that estimates the emotional tone of a text from the occurrence of emotionally colored expressions. If death, sadness, worry or pain appear in the text, the sentiment is evaluated as negative; conversely, expressions such as joy, happiness, love or pleasure indicate a positive 'mood'. The authors of the Hedonometer project used this method to trace the alternation of sad and happy moments in the Harry Potter books. So if you noticed while reading the books or watching the movies that the story grows much more "depressing" towards the end, sentiment analysis will prove you right:

A chart showing the evolution of emotions in the Harry Potter series of books
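As a rough illustration, the lexicon-based idea behind this kind of sentiment scoring can be sketched in a few lines of Python. The word lists and scoring below are invented for illustration; real projects such as the Hedonometer rely on large, crowd-rated lexicons:

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; real sentiment lexicons contain
# thousands of words, often with graded happiness scores.
NEGATIVE = {"death", "sadness", "worry", "pain"}
POSITIVE = {"joy", "happiness", "love", "pleasure"}

def sentiment_score(text: str) -> int:
    """Count positive minus negative words; > 0 suggests a positive mood."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos - neg

print(sentiment_score("Joy and love filled the room."))        # 2
print(sentiment_score("Death and pain shadowed every page."))  # -2
```

Sliding such a score over consecutive chapters is, in essence, how a happiness curve for a whole book series can be drawn.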

In the same way, you can see how specific events have affected the mood of Twitter posts, or the mood of reviews of specific films. One interesting example is trying to determine the "happiest day of the year" from people's Twitter posts.

An excerpt from the Hedonometer project, which measures the emotional charge of Twitter posts.

Sentiment analysis is also widely used in finance. Analysts try to predict people's behaviour in the financial market based on their mood.

Another way of working with text is called topic modelling. The computer infers the theme or motif of a text from words that occur together in clusters. And where can topic modelling be used in humanities research? Researchers at UC Berkeley were looking for a way to simplify searches in the archives of the Pennsylvania Gazette by topics such as slavery or war. When they tried keyword searches, they too often retrieved texts unrelated to the topic. So they identified words that appeared together in articles about, for example, runaway slaves or civil unrest. This allowed them to "model" which clusters of words typically make up a particular topic, and thus get more accurate results. Other scholars have used topic modelling to analyse thousands of works from the French Enlightenment and identify the prevalent themes in those works.

Example of working with topic modelling
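The intuition behind topic modelling, that words which keep appearing together probably belong to the same theme, can be illustrated with a minimal co-occurrence count. The toy "articles" below are invented for illustration; real topic modelling uses statistical models such as LDA rather than raw pair counts:

```python
from collections import Counter
from itertools import combinations

# Invented toy documents standing in for newspaper articles.
docs = [
    "runaway slave reward advertisement",
    "slave runaway reward notice",
    "war militia soldiers battle",
    "battle war soldiers wounded",
]

def cooccurrence(documents):
    """Count how often each pair of words appears in the same document."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(doc.split()))
        pairs.update(combinations(words, 2))
    return pairs

# The most frequent pairs cluster into two rough "topics":
# (runaway, slave, reward, ...) and (war, battle, soldiers, ...).
for pair, count in cooccurrence(docs).most_common(4):
    print(pair, count)
```

A real topic model generalises this idea probabilistically, assigning each word and each document a mixture of topics.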

An interesting example of the use of text analysis is the so-called literature mapping. By automatically recognising geolocations in a text, you can find out where a particular work is set, or whether certain literary movements and genres have favourite locations. Combined with sentiment analysis, this is how researchers at Stanford University created a map of London as described by authors of detective novels and fiction in their books. A similar project is a map of Portuguese literature, created by researchers and students at the University of Lisbon.
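A minimal sketch of the gazetteer-lookup idea behind such literature mapping might look like the following. The place list and coordinates are hand-picked for illustration; real projects match texts against large gazetteers such as GeoNames and use full named-entity recognition:

```python
import re

# A tiny hand-made gazetteer (place -> latitude, longitude).
GAZETTEER = {
    "London": (51.5074, -0.1278),
    "Lisbon": (38.7223, -9.1393),
    "Baker Street": (51.5237, -0.1585),
}

def find_places(text):
    """Return each gazetteer place mentioned in the text with its coordinates."""
    hits = []
    for place, coords in GAZETTEER.items():
        if re.search(r"\b" + re.escape(place) + r"\b", text):
            hits.append((place, coords))
    return hits

print(find_places("Holmes left Baker Street and crossed London by cab."))
```

Plotting the returned coordinates, possibly colored by a sentiment score for the surrounding passage, is the basic recipe behind maps of "literary" cities.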

As you can see from the examples, there are countless forms of text analytics, and thousands of ways to use it in the process of searching, content analysis, and the resulting data archiving.

Where to find the texts?

In the context of text analysis, a text need not be a classic book; it can also be text extracted from websites or social media (a set of tweets, Facebook statuses or Instagram photo captions), email correspondence, or song lyrics. Thanks to projects like Google Books or Project Gutenberg, you can work with a wealth of texts in digital form.

Corpora are electronic text collections in which you can search for words or phrases and see the contexts in which they appear. The best-known types are corpora of spoken and written language, which linguists use to study how a particular language is used and evolves (for example, the Czech National Corpus). But there are also corpora of literature from specific periods or language areas, corpora of law, and corpora of lyrics from different music genres, TV series, soap operas, films or magazines.
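The basic corpus operation of finding a word together with its surrounding context is the keyword-in-context (KWIC) concordance, which can be sketched as follows. This is a simplified toy version; real corpus managers handle tokenization, lemmatization and far larger texts:

```python
def kwic(text, keyword, width=3):
    """Keyword-in-context: show each hit with `width` words on either side."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower().strip(".,") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

sample = "The king is dead. Long live the king. The king rules."
for line in kwic(sample, "king"):
    print(line)
```

Lined up one hit per row, such concordances let a linguist scan at a glance how a word is actually used.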

And where can you find text corpora? One option is the KonText platform, a website offering text corpora in a wide range of topics and languages: a corpus of air traffic controllers' communications, Facebook data prepared for sentiment analysis, and web corpora, among others. You can also find datasets for working with text in the Virtual Language Observatory on the CLARIN project website, which lists thousands of texts and text corpora; in Czech alone there is, for example, a news corpus, a Czech-Slovak parallel corpus, and a corpus of parliamentary proceedings.

And how do we work with textual analysis?

The method of computer text analysis and working with text corpora has a long history at Masaryk University, where you can be inspired by a number of interesting projects. Here you will find several projects in different areas of text handling.

Natural language processing

A major research area is natural language processing, pursued at MU by the Centre for Natural Language Processing. It focuses mainly on corpus linguistics, lexical databases and the use of machine learning methods for automatic text processing. Alongside research on machine-human communication, the Centre builds web-based technologies for text analysis, mining and generation, and develops a range of tools. These include Sketch Engine, a tool for working with corpora and text analysis, as well as dictionaries, text prediction tools, random text generators, and, for example, a tool that adds commas to sentences. Other applications the Centre works on can detect the themes of a text or recognize whether a text was written by a human or generated by a computer.

What you can find out from the corpus of French rap...

One great example is RapCor, or the corpus of French rap, which was created at the Institute of Romance Languages and Literatures at the Faculty of Arts of Masaryk University. The corpus, which has been gradually updated since 2009, contains more than 4500 songs in French.

For example, you will find out how Arabic expressions (Arabisms) entered the French language, how often vulgarisms appear in these songs, which slang expressions have entered the spoken language, how the political situation is reflected in the songs, or how language and themes evolve in the careers of particular artists.

"Long live the king" in three different translations? Kapradí - or how to translate an English drama

An interesting project, and very valuable material not only for the study of drama, is Kapradí, an electronic database of approximately 400 dramatic texts (original English plays and Czech translations written before the middle of the 20th century). Its interface lets the user view the original play and selected translations side by side, which perfectly illustrates how differently individual translators approach the same text.

Does Distant Reading mean the end of traditional methods of text study? And will it replace the work of literary scholars or linguists in the future? Certainly not. While computers can process vast amounts of text, count the occurrence of words, visualize the relationships between characters in literary works, or determine the sentiment of thousands of Twitter posts, it is ultimately scientists who interpret the results and put them into context. Either way, the Distant Reading method will make you look at texts from a whole new perspective.


