The future: reflections on emerging machine-learning methods for digital heritage
Article DOI: https://dx.doi.org/10.15180/221818
Keywords
Artificial Intelligence, Deep Learning, few-shot learning, fine-tuning, large language models, Machine Learning, observability
Introduction
Over the past decade we have seen a dramatic increase in the capabilities of machine learning (ML), leading to a real and growing interest in its application to, and potential for, projects such as Congruence Engine.
Here I have identified the trends that are most likely to impact projects such as ours in the future. Most of these gains in machine learning techniques have been related to a sub-area of machine learning – deep learning (DL). The techniques of DL have been around since the 1990s, but their application really started in earnest with the AlexNet image classification system in 2012. We are now at the ‘Cambrian Explosion’ stage of (largely DL-enabled) AI – more than 334,000 research papers were published in 2021 alone.[1] Unless you are immersed in the day-to-day of machine-learning research this is undoubtedly overwhelming, with new applications for AI (artificial intelligence) appearing in the news every day. There are, however, broadly two ways in which we can consider how and where artificial intelligence (read: deep learning) might be applied within cultural heritage projects.
Firstly, there are already several tasks which even pre-deep learning ML has been useful in addressing: clustering and classifying images and records into specific categories; extracting text from images of documents and newspapers or making transcripts from video and audio; parsing text to identify ‘parts of text’ such as names of people, places and businesses, and so on. Given the steady rate of improvement against standard benchmarks for these kinds of tasks, there are several categories of research or historical investigation that might therefore be worth revisiting, either because the quality of the computer-readable data at the time of build was insufficient to provide reasonably reliable answers, or perhaps just to dial up the accuracy and/or re-confirm previous conclusions. In either case (to introduce a metaphor), ML improvements provide us with a more focused historical lens through which to revisit prior research.
The second category is the novel application of new or emergent techniques. For those less technically inclined, there are a handful of strategies for navigating current research and exploring ‘the art of the possible’ from a heritage perspective. A good place to start is with those publications that cover technical advances in ML but which are aimed at a more general readership. Good examples of these would be journalist Jack Clark’s weekly ‘Import AI’ newsletter[2], which covers new techniques and advances with additional explanation and commentary, or the variously authored ‘Last Week in AI’ newsletter, available via Substack.[3]
Alternatively, a quick scan of the front page of ‘Papers with Code’[4] – which aggregates recent research papers and highlights those which are currently drawing a lot of interest – or of the well-organised and clearly explained ‘tasks’ section of Hugging Face’s AI community hub[5] is often useful to spark ideas or start conversations. Even as someone who maintains an interest in the machine-learning world, I find that paper titles frequently refer to unfamiliar or emergent techniques. However, this unfamiliarity is often not an impediment to understanding what the authors are trying to achieve, and the results of their application often speak for themselves.
Emergent trends
https://dx.doi.org/10.15180/221818/001
Reading around what is currently happening is also a good way of identifying key trends, and areas where a lot of progress is being made. There are a few areas of growing research which it might be interesting to consider in the context of Congruence Engine. In no particular order these are: cross-modal AI, fine-tuning and few-shot learning, large language models, and observability. (There is also a fifth category which I’ll come to later in this article.)
Cross-modal AI refers to machine-learning models which are trained to understand two different kinds of media or data – such as text, audio, images – independently from one another, but that can additionally build a relationship between the two. An example of this would be image captioning: here the ML algorithm needs to both construct proper sentences and recognise what is happening in the image, then link these two concepts together to produce a concise and accurate description. Another example – from a Congruence Engine perspective – might be an algorithm which produces a succinct podcast-style summary of an archived audio interview.
The concept of fine-tuning an existing ML model towards a more specific use case (for example, re-training an image classification system to identify bees versus wasps, for use with a hive webcam) has been around for a while but has become increasingly relevant in recent years as companies like Google and Microsoft release ML systems which have been trained on huge datasets using an incredible amount of computing power. Few-shot learning relates to fine-tuning, in that it’s about finding ways to retrain an existing algorithm using as little new data as possible, often just a handful of new samples. This might be interesting as an angle to pursue from a digital humanities perspective, as a means of unlocking the value of these mainstream, pre-trained algorithms for use on historical media such as photographs, interviews and document scans.
There are several examples of where this might be useful for us, but a specific one would be the application of Deep Layout Parsing.[6] This technique for determining the elements of a scanned document (such as adverts, section headings, tabular and columnar data, or paragraphs) and assessing how they relate to one another on the page could be applied to extracting both the structure and content from pages of (traditionally hard to parse) trade directories[7] by fine-tuning an existing layout model.[8]
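To make the few-shot idea more concrete, here is a minimal sketch of one simple approach: nearest-prototype classification. It assumes that a pre-trained model has already turned each image into a short list of numbers (its ‘features’); the three-number vectors and the helper names below are hypothetical and chosen purely for illustration. Rather than retraining the model itself, this variant averages the handful of labelled examples for each category into a ‘prototype’ and assigns a new item to the closest one. It is not a description of any specific system mentioned in this article.

```python
import math


def centroid(vectors):
    """Average a list of equal-length feature vectors into one prototype."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def build_prototypes(support_set):
    """Map each label to the mean of its few labelled example vectors."""
    return {label: centroid(examples) for label, examples in support_set.items()}


def classify(vector, prototypes):
    """Return the label whose prototype is closest by Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(vector, prototypes[label]))


# Hypothetical features: in practice these would come from a pre-trained
# image model, not be typed in by hand. Three examples per class only.
support_set = {
    "bee": [[0.9, 0.2, 0.1], [0.8, 0.3, 0.2], [0.85, 0.25, 0.1]],
    "wasp": [[0.2, 0.9, 0.3], [0.1, 0.8, 0.4], [0.15, 0.85, 0.35]],
}

prototypes = build_prototypes(support_set)

# Features for a new, unlabelled frame from the (hypothetical) hive webcam.
new_image_features = [0.75, 0.3, 0.15]
predicted_label = classify(new_image_features, prototypes)
print(predicted_label)
```

The design choice worth noting is that the expensive part (learning good features) is inherited from the pre-trained model, so only a few labelled samples per category are needed. Whether this works well on historical photographs or document scans would depend on how well the borrowed features transfer to that material.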
In recent years, more and more attention has been paid to techniques which tackle observability challenges for machine learning: that is, interrogating a ‘black box’ system to reveal some of its internal reasoning when carrying out a specific task. There are clear commercial and ethical reasons for doing this – either for auditing purposes, or (often) because a system is displaying unforeseen biases which need to be better understood (and potentially eliminated). These are live problems, and just as relevant from our perspective as they are for commercial applications of AI. But there are other reasons why observability can be useful or interesting from a research perspective. In particular, some researchers have used these techniques to bring to our attention patterns and dynamics in the data which we might otherwise not have discovered.[9]
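One simple way of interrogating a ‘black box’ is occlusion (sometimes called leave-one-out) analysis: remove each part of the input in turn and see how much the system’s output changes. The sketch below illustrates this on a toy example. The scoring function, its hidden weights and the sample sentence are all hypothetical stand-ins for a real model and real archive text, and this is only one of many attribution techniques, not necessarily the one used in the research cited above.

```python
def black_box_score(words):
    """Stand-in for an opaque model that scores text for 'industrial' content.

    The weights are hypothetical and treated as hidden from the caller.
    """
    hidden_weights = {"loom": 2.0, "mill": 1.5, "steam": 1.0, "garden": -0.5}
    return sum(hidden_weights.get(word, 0.0) for word in words)


def occlusion_attribution(words, score_fn):
    """Estimate each word's influence by removing it and re-scoring the rest."""
    baseline = score_fn(words)
    attributions = []
    for index, word in enumerate(words):
        occluded = words[:index] + words[index + 1:]
        # A large drop in score means the removed word mattered a lot.
        attributions.append((word, baseline - score_fn(occluded)))
    return attributions


sentence = "the steam loom in the mill".split()
attributions = occlusion_attribution(sentence, black_box_score)

# Rank words from most to least influential on the model's score.
ranked = sorted(attributions, key=lambda pair: pair[1], reverse=True)
for word, influence in ranked:
    print(f"{word}: {influence}")
```

Because the procedure only needs to call the model and read its output, it can be applied without access to the model’s internals, which is what makes it attractive for auditing systems we did not build ourselves.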
Large Language Models (LLMs) are pre-trained deep learning algorithms which are computationally large (with the number of parameters ‘learnt’ through training typically running into the billions) and trained on vast quantities (read: thousands of gigabytes) of data gathered from across the internet. These models are an example of what is frequently (and somewhat dramatically) referred to as ‘the bitter lesson’ in AI.[10] This is quite simply that ‘size matters’, with simple but supermassive models and investment in training consistently outperforming all other approaches. As you might expect, given the amount of resource involved, the development of LLMs has largely taken place behind closed doors at places like Google and OpenAI, but there are a growing number of open-source models – such as those from research start-up Hugging Face – which are more widely accessible and usable by researchers. The kinds of tasks for which LLMs are useful open a range of creative and research possibilities which will be interesting to explore. We might, for example, apply text summarisation and automated question answering towards a more deconstructed exploration of Humphrey Jennings’s Pandaemonium (see Jennings, 1985; Robins and Webster, 1987), uncovering new themes and connections in its rich and varied (often labyrinthine) anthologisation of the Industrial Revolution.
With research continuing apace, some of the most exciting and creative recent developments have incorporated several of these themes. One example is DeepMind’s collaboration with classics researchers at Oxford University, the University of Venice, and the Athens University of Economics and Business on Ithaca, a deep neural network designed to restore damaged text in images of ancient Greek inscriptions, date them, and attribute their likely original location.[11]
Also relevant here is DeepMind’s recent work on the general-purpose Flamingo visual language model[12], which combines large pre-trained systems with a series of connective layers. With this combination, Flamingo has achieved state-of-the-art results in few-shot learning on visual challenges which combine object identification, knowledge retrieval, and question answering.
The final category of recent development in AI which may turn out to be of interest is ‘creative’ AI. Machine-learning models which produce some kind of creative output – whether this is in the form of images, audio or written text – are currently receiving a great deal of attention. These include applications such as Stable Diffusion[13], which generates sophisticated imagery based on a given prompt, or AI21 Labs’ attempt to ‘re-create the mind of [late US Supreme Court Justice] Ruth Bader Ginsburg’ as an ML-powered chatbot, with somewhat predictably eccentric (and not entirely well received) results.[14] Whilst these sorts of applications fall more into the ‘interesting’ than the ‘useful’ category, creative AI is nevertheless worthy of inclusion in the wider discussion around ML in our context, if only because it can provide an intriguing mirror or lens through which to view our research. One could easily imagine using ML to go beyond exploring Pandaemonium (as in my previous example), towards creatively re-interpreting the life and works of Humphrey Jennings.
Conclusion
https://dx.doi.org/10.15180/221818/002
Deep Learning is a fast-moving field. In this article I have identified some of the key emergent areas which I believe are of greatest relevance for projects such as Congruence Engine, as well as some hopefully useful ‘coping strategies’ for navigating the promise and potential of machine learning as a non-practitioner.
Machine learning – and deep learning in particular – is already transforming many industries, and its commercial application is both widespread and well-documented. For digital humanities practitioners, however, I believe that the benefits of this growing research area have yet to be fully realised. A significant barrier to date has been the lack of sufficiently well-annotated data for historical research purposes, and of the resources needed to create new datasets at the scale required to train new systems from scratch. Of all the trends identified here, the industry-wide shift currently taking place – towards repurposing and fine-tuning large and sophisticated ‘general purpose’ machine learning models – offers the greatest promise for research applications such as ours.
Congruence Engine is supported by AHRC grant AH/W003244/1.