Introduction
The period of new grant calls is approaching, and agencies reflecting the principles of open and data-intensive science are changing their requirements for completing grant applications. If you have encountered a requirement to make your research data available and develop a data management plan (DMP – a standard requirement of the Horizon Europe program and the GAČR and TAČR agencies), you may be asking yourself some of the following questions:
- What data do they want from me if I don't measure or calculate anything?
- What do they mean by data? Does all data have to be quantifiable? How does data relate to understanding and interpretation?
- What is the nature of data in the humanities? Is it different from the data processed in the natural sciences or computer science?
In this article, I will attempt to answer these questions briefly and to contribute to an understanding of the requirements for making research data accessible. I will first address the concept of data, then the characteristics of data produced by humanities research, and finally the need for, and benefits of, reflection on data and data science from the perspective of the humanities. I will not deal with the preparation of a data management plan; those interested in this issue can find information on the website of the Digitalia MUNI ARTS research infrastructure (1).
Data
Data (from Latin "given" or "self-evident") is a translation of the Greek term dedomena from Euclid's book of the same name. According to Euclid, data are given properties or quantities from which new properties can be deduced. They are axioms that precede the knowledge and interpretation of the subject. Ontologically, they can be conceived as differences in the structure of reality which, like Kant's noumena, are not accessible to direct experience but are derived indirectly from experience. This interpretation forms the basis of the definition of data by the contemporary Oxford philosopher of information Luciano Floridi, who conceives of data as a "lack of uniformity" (2010, p. 23). However, Floridi's philosophical definition is too general; it does not specify the nature of the difference and its relationship to evidence, and thus "too many things can be considered data" (Lyon, 2016, p. 743).
Although the concept of data has become a common part of scientific discourse in recent years, it has not been the subject of many studies. With growing interest in a clear definition of data, it is becoming apparent that the concept is rather complicated. It is therefore not surprising that everyday scientific discourse burdens it with a number of confusions and myths: that data are the same as facts or documents; that data are natural rather than cultural in nature (i.e., that they exist objectively, independently of the human subject); that data can exist in an unprocessed (so-called raw) or unstructured form; and that data have truth value in themselves and are independent of knowledge and theories (2). Professional sources often note that the term data is not well chosen and that capta would be more accurate, since it emphasizes that capturing data is always active, selective, and theory-laden (Kitchin, 2014; Drucker, 2011). Data and their structured descriptions, metadata, are relational in nature. What is metadata for one researcher may be data for another researcher with a different research question; what is noise for one researcher may be essential evidence for another (compare, for example, printing errors in digitized 16th-century books, which complicate language analysis but are useful in researching the development of printing technology) (Borgman, 2015; Flanders – Muñoz, 2012).
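The relational character of data can be illustrated with the printing-error example. The following Python sketch is a hypothetical illustration, not a method taken from the cited sources: the token records, the `printing_error` flag, and both helper functions are invented for this purpose. The same records are read in two ways, once with printing errors treated as noise and once with them treated as evidence.

```python
# Hypothetical tokens from a digitized 16th-century page (invented values).
tokens = [
    {"form": "vnd", "page": 12, "printing_error": False},
    {"form": "dcr", "page": 12, "printing_error": True},   # assumed misprint
    {"form": "herr", "page": 13, "printing_error": False},
    {"form": "gott", "page": 13, "printing_error": True},  # assumed misprint
]


def for_language_analysis(records):
    """A linguist treats printing errors as noise and drops them."""
    return [r["form"] for r in records if not r["printing_error"]]


def for_printing_history(records):
    """A historian of printing treats the same errors as the evidence."""
    return [(r["page"], r["form"]) for r in records if r["printing_error"]]


print(for_language_analysis(tokens))
print(for_printing_history(tokens))
```

Neither view is the "correct" one; which fields count as data and which as noise depends on the research question.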
Since the aim of this article is not to analyze the concept of data in detail, I will quote Max Kaase's definition, which successfully avoids conceptual difficulties: "Data are information about the properties of units of analysis" (Kaase, 2001, p. 3251). Of course, the unit of analysis does not have to be a measured numerical value. The units can also be words in a text, statements in a discourse, themes in literary works, visual elements in images, or buildings from a historical period. The properties are then, in the corresponding order, for example, word collocations, the prosody of statements, motifs in works, contrasts in images, or the functions of buildings in the period under study. Scientists systematically collect this data because it represents the phenomenon under investigation and its interpretation allows for a better understanding of the phenomenon. Since the term "unit of analysis" evokes measurement and counting, it may be more familiar for humanities scholars to conceive of data as a representation of the subject of investigation. This is how Lev Manovich, a prominent new media theorist, conceives of data, adding that it can "include numbers, categories, digitized texts, images, sounds, and other types of media, records of human activities, spatial locations, and connections between elements (i.e., network relationships)" (2019, p. 61). When creating a representation, according to him, we must always decide on the boundaries of the phenomenon under investigation (what to include, what to exclude, and why), on the objects we will represent (i.e., the unit of analysis that is appropriate for the purpose of our investigation), and on their salient features (i.e., the properties we want to observe for our purpose). Our representations can be further modified so that they can be processed by a computer. Computers allow not only numerical processing of measured values, but also symbolic manipulations, which can also be useful in humanities research.
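The idea of units of analysis and their properties can be made concrete with a small sketch. The Python example below is only an illustration under assumed names: the class `UnitOfAnalysis`, its fields, the helper `build_dataset`, and the sample values are hypothetical and do not come from Kaase or Manovich. It mirrors the three decisions described above: the boundary of the phenomenon (which units are included), the objects represented (here, words in a text), and the salient features (here, collocations).

```python
from dataclasses import dataclass, field


@dataclass
class UnitOfAnalysis:
    """One represented object, e.g. a word, a statement, or a building."""
    kind: str    # what type of object is represented
    label: str   # how we refer to this particular unit
    properties: dict = field(default_factory=dict)  # features we chose to observe


def build_dataset(units, include):
    """Apply the boundary decision: keep only units the researcher includes."""
    return [u for u in units if include(u)]


# Hypothetical sample values, invented purely for illustration.
candidates = [
    UnitOfAnalysis("word", "data", {"collocations": ["raw", "open"]}),
    UnitOfAnalysis("word", "capta", {"collocations": ["selective"]}),
    UnitOfAnalysis("building", "town hall", {"function": "administration"}),
]

# Boundary: this study looks only at words, so buildings are excluded.
dataset = build_dataset(candidates, include=lambda u: u.kind == "word")

for unit in dataset:
    print(unit.label, unit.properties["collocations"])
```

The properties are stored as symbols (strings and lists), not measurements, which is the kind of symbolic processing the paragraph above refers to.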
Representations in the form of data objects not only serve their creators but can also facilitate research for other scientists. Since data collection and processing are often funded by grant agencies, i.e., from taxpayers' money, it is in the public interest for the data to be accessible. The data made available can be used by other researchers, thereby increasing the efficiency of the funds spent. Where appropriate, and where the data is not protected or sensitive, it can also be made accessible to the general public.
Data in the humanities
Compared to data from other fields of science, data from the humanities have a number of specific characteristics:
- Data heterogeneity – just as there is a high degree of diversity among cultural objects, there is also a high degree of diversity among data in the humanities. The humanities work with data that can be textual (literary works, manuscripts), visual (paintings, photographs, maps), audio (recordings of music, spoken word), audiovisual (films, videos), spatial, or material (artifacts, architectural monuments, human remains). In addition, each type of data can be stored in a number of formats.
- Multimodality of data – data on the phenomenon under study can be captured in different ways and can address different senses, often simultaneously, as in audiovisual documents. Multimodal data may require interpretation by experts from different fields working in interdisciplinary collaboration. Openly shared data can be used in ways far beyond the imagination of the scientist who originally collected it.
- Complexity of data – humanities data carry multiple layers of meaning, can be examined from different perspectives, and their interpretation may vary in different contexts, e.g., in different cultures or historical periods. For example, a manuscript of a biblical text is today not only a theological source but also an object of philological, codicological, paleographic, art-historical, or digital analysis. The number of layers in which data can be examined is growing. The complexity of data is also increased by its heterogeneity and multimodality.
- Data ownership – research data in other scientific fields is generated directly by the researchers themselves, who therefore become the owners of the data. In the humanities, direct data generation is rarer. Experts usually work with representations of cultural objects that belong to institutions responsible for their management and preservation, including digitization. Researchers generate research data by interpreting these representations: categorizing, tagging, annotating, etc. This situation may limit the use of digital methods.
Trevor Owens (2012) identifies four approaches to data in the humanities. Data can be treated as artifacts, where the researcher decides what objects to collect and how to represent them; as texts that are subject to interpretation just like their textual sources; as information that can be processed quantitatively by a computer; and as a multipurpose object whose potential value is revealed when it is used as evidence in argumentation. The most common data objects that humanities scholars work with are tagged and annotated texts, digital scholarly editions including critical editions, text corpora, digital objects supplemented with analysis or notes, and auxiliary resources such as digital bibliographies (Flanders – Muñoz, 2012).
Data from the perspective of the humanities
Many researchers in the humanities may consider the new requirements for data disclosure to be an intrusion of a foreign element into their research practice. Creating representations and organizing recorded evidence for the purpose of argumentation is, of course, nothing new for researchers; some may simply not be accustomed to thinking of these practices as working with data. The problematic status of data stems, on the one hand, from the requirement to formalize the collection, storage, classification, and management of recorded evidence in the form of data and, on the other hand, from the requirement to share data transparently with other researchers and the general public. Those interested in the issue will find a number of existing surveys and discussions (see, for example, Ruediger – MacDougall, 2023; Borgman, 2015; Anderson – Blanke, 2012; Huvila, 2012). What deserves attention is the need for the expertise of humanities scholars in data studies and data research. The contribution of the humanities can be seen in three areas: data can be a subject of study in itself, it can be studied as a cultural object, and it can be approached as a form of culture open to interpretation and criticism.
Data as a subject of research in the humanities is examined conceptually; at the same time, its role in our knowledge is discussed. Conceptual and philosophical analyses are conducted in epistemology, science and technology studies (STS), the philosophy of information, and the newly emerging philosophy of data (Furner, 2017). Concepts associated with data, such as phenomena, evidence, facts, representation, and argumentation, are well known, systematically analyzed, and critically reflected upon in the humanities.
Data as a cultural object is examined in terms of its role in shaping cultural phenomena in ethical and social dimensions. On an ethical level, this includes, for example, the phenomenon of data philanthropy, the ethics of algorithms, and the ethics of data practices; on a social level, it includes the phenomenon of data culture and data slavery. At the same time, data is also becoming the subject of humanities research, for example in the digital humanities.
Data does not speak for itself, nor is it neutral. As a product of human knowledge, it arises in a cultural and social context, implicitly carries prejudices and expectations, and is theory-laden. This opens up a fundamental task for experts with a background in the humanities, whose involvement enables us to go beyond purely technical and quantitative perspectives on working with data through data phenomenology and data criticism. Data interpretation always takes place in a certain context, within the horizon of the subject's understanding, and a subjective element is an integral part of it. Data phenomenology helps to identify and describe how different forms of data influence their understanding and interpretation (as in the case of data visualization), how data is perceived and associated with ethical values in different domains and communities, and how it is used in an effort to gain greater influence or power. It is also important to draw attention to the simplifications and limitations of automated decision-making (e.g., in the automatic categorization of data). Just as important as data phenomenology is data criticism (Beaton, 2016). It allows us to place the data being made available and the related technical procedures in a historical and cultural context, to analyze the motives and effects of making them available, and to reveal fashionable trends, cultural traditions, and forms of knowledge production, as well as prejudices hidden in the organization, classification, and structuring of data. At the same time, it provides a framework for interpreting forms of data presentation in a cultural context or in comparison with other data sets, and allows us to identify undesirable social consequences of their publication, such as colonial or racist discourses in language models trained on inappropriate data.
Summary
Data does not have to be exclusively numerical; it can also take the form of various representations of the phenomenon under investigation, such as categorizations or models. Digitization converts these representations into digital code, which enables computational operations not only with numbers but also with symbols, thus opening up space for the use of digital research methods. Humanities data are highly heterogeneous, multimodal, and complex, and much of this data is not held by researchers, as is the case in other fields, but belongs to memory institutions, which makes data-based and digital research more difficult. Working with data requires humanities expertise, which complements technical and statistical skills with interpretive skills that draw on phenomenology and criticism to take into account the context, meaning, and social impact of data.
Author's notes
(1): A set of decisions that need to be made before creating a dataset or database can be found on the page Seven Initial Considerations for Platform Creators and Five Steps to the Goal. M. Růžička's presentation for humanities researchers at the Faculty of Arts, Masaryk University, will help you create a data management plan. At the bottom of the page, you will also find a form requesting institutional support if you need expert assistance or consultation during the preparation process.
(2): Those interested in this topic are referred to the collection of works by Gitelman (2013) and the article on data (Hjørland, 2018).
Literature
Anderson, S.; Blanke, T. (2012). Taking the Long View: From e-Science Humanities to Humanities Digital Ecosystems. Historical Social Research, 37(3), 147–164. https://doi.org/10.12759/hsr.37.2012.3.147-164
Beaton, B. (2016). How to Respond to Data Science: Early Data Criticism by Lionel Trilling. Information & Culture, 51(3), 352–372. https://doi.org/10.7560/IC51303
Borgman, Ch. L. (2015). Big Data, Little Data, No Data: Scholarship in the Networked World. Cambridge, MA: MIT Press.
Clough, P. D.; Hill, T.; Paramita, M.; Goodale, P. (2017). Europeana: What Users Search For and Why. In: Tsakonas, G.; Kalliopi, S.; Inge, C. (Eds.). Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Cham: Springer, 207–219. https://doi.org/10.1007/978-3-319-67008-9_17
Drucker, J. (2011). Humanities Approaches to Graphical Display. Digital Humanities Quarterly, 5(1).
Flanders, J.; Muñoz, T. (2012). An Introduction to Humanities Data Curation. DH Curation Guide: a community resource guide to data curation in the digital humanities. Accessed July 2, 2024. https://archive.mith.umd.edu/dhcuration-guide/guide.dhcuration.org/index.html%3Fp=91.html
Floridi, L. (2010). Information: A Very Short Introduction. New York: Oxford University Press.
Furner, J. (2017). Philosophy of Data: Why? Education for Information, 33(1), 55–70. https://doi.org/10.3233/EFI-170986
Gitelman, L. (Ed.) (2013). "Raw Data" Is an Oxymoron. Cambridge, MA: MIT Press.
Hjørland, B. (2018). Data (with big data and database semantics). Knowledge Organization, 45(8), 685–708.
Huvila, I. (2012). Information Services and Digital Literacy: In Search of the Boundaries of Knowing. Oxford: Chandos Publishing.
Kaase, M. (2001). Databases, Core: Political Science and Political Behavior. In Smelser, Neil J.; Baltes, Paul B. (Eds.). International Encyclopedia of the Social and Behavioral Sciences. Amsterdam: Elsevier.
Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. London: SAGE.
Lyon, A. (2016). Data. In Humphreys, Paul (Ed.). The Oxford Handbook of Philosophy of Science. New York: Oxford University Press.
Manovich, L. (2019). Data: Representing Phenomena as Data. In Paul, Heike (Ed.). Critical Terms in Futures Studies. Palgrave.
Owens, T. (2012). Defining Data for Humanists: Text, Artifact, Information or Evidence? Journal of Digital Humanities, 1(1).
Posner, M. (2015). Humanities Data: A Necessary Contradiction. Miriam Posner's Blog, June 25, 2015. Accessed July 2, 2024. http://miriamposner.com/blog/humanities-data-a-necessary-contradiction/
Ruediger, D.; MacDougall, R. (2023). Are the Humanities Ready for Data Sharing? Ithaka S+R. Accessed October 29, 2024. http://www.jstor.org/stable/resrep49500.
Schöch, Ch. (2013). Big? Smart? Clean? Messy? Data in the Humanities. Journal of Digital Humanities, 2(3).