Reading around what is currently happening is also a good way of identifying key trends and the areas where most progress is being made. There are a few growing areas of research which may be worth considering in the context of Congruence Engine. In no particular order, these are: cross-modal AI, fine-tuning and few-shot learning, large language models, and observability. (There is also a fifth category, which I will come to later in this article.)
Cross-modal AI refers to machine-learning models which are trained to understand two or more different kinds of media or data – such as text, audio or images – independently from one another, but which can additionally build relationships between them. An example of this is image captioning: here the ML algorithm needs both to construct proper sentences and to recognise what is happening in the image, then link the two together to produce a concise and accurate description. Another example – from a Congruence Engine perspective – might be an algorithm which produces a succinct podcast-style summary of an archived audio interview.
The concept of fine-tuning an existing ML model for a more specific use case (for example, re-training an image classification system to distinguish bees from wasps, for use with a hive webcam) has been around for a while. It has become increasingly relevant in recent years, however, as companies like Google and Microsoft release ML systems which have been trained on huge datasets using an incredible amount of computing power. Few-shot learning is closely related to fine-tuning: it is about finding ways to adapt an existing algorithm using as little new data as possible, often just a handful of new samples. This might be an interesting angle to pursue from a digital humanities perspective, as a means of unlocking the value of these mainstream, pre-trained algorithms for use on historical media such as photographs, interviews and document scans.
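For readers unfamiliar with the mechanics, the core idea can be sketched in a few lines of plain Python. This is a deliberately simplified toy, not a real system: the 'pre-trained backbone' is a stubbed, fixed transformation, the two synthetic classes stand in for the bees and wasps, and only a tiny classification head is trained – on just five labelled samples per class, the 'few shots'.

```python
import math
import random

random.seed(0)

def frozen_features(x):
    """Stand-in for a frozen pre-trained backbone: a fixed,
    non-trainable transformation of the raw input."""
    return [x[0] + x[1], x[0] - x[1]]

def make_samples(centre, label, n=5):
    """A handful of labelled samples per class -- the 'few shots'."""
    return [([centre[0] + random.gauss(0, 0.1),
              centre[1] + random.gauss(0, 0.1)], label) for _ in range(n)]

# Two synthetic 'species', five examples each.
train = make_samples((1.0, 0.0), 0) + make_samples((0.0, 1.0), 1)

# The only trainable parameters: a tiny logistic-regression head.
w, b = [0.0, 0.0], 0.0

def predict_prob(x):
    z = sum(wi * fi for wi, fi in zip(w, frozen_features(x))) + b
    return 1.0 / (1.0 + math.exp(-z))

# Train the head alone with plain gradient descent;
# the backbone is never updated.
for _ in range(200):
    for x, y in train:
        p = predict_prob(x)
        f = frozen_features(x)
        for i in range(2):
            w[i] -= 0.5 * (p - y) * f[i]
        b -= 0.5 * (p - y)

accuracy = sum((predict_prob(x) > 0.5) == bool(y) for x, y in train) / len(train)
print(f"training accuracy after few-shot head training: {accuracy:.2f}")
```

The point of the sketch is the division of labour: the expensive, general-purpose representation is reused as-is, and only a very small number of parameters need to be learnt from the handful of new examples.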
There are several places where this might be useful for us; one specific example is the application of Deep Layout Parsing.[6] This technique determines the elements of a scanned document (such as adverts, section headings, tabular and columnar data, or paragraphs) and assesses how they relate to one another on the page. It could be applied to extracting both structure and content from pages of (traditionally hard-to-parse) trade directories,[7] by fine-tuning an existing layout model.[8]
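To make this concrete, the snippet below sketches only the structure-recovery step that would follow a deep layout model. In a real pipeline the typed bounding boxes would come from a fine-tuned detector; here the detections are stubbed so the grouping logic can be shown on its own, and all block types, coordinates and the gutter position are illustrative assumptions rather than model output.

```python
# Each stubbed detection: (block type, (x0, y0, x1, y1)) in page coordinates.
detections = [
    ("heading", (50, 40, 550, 80)),    # section heading spanning the page
    ("entry",   (50, 100, 290, 130)),  # left-column directory entries
    ("entry",   (50, 140, 290, 170)),
    ("entry",   (310, 100, 550, 130)), # right-column directory entries
    ("entry",   (310, 140, 550, 170)),
    ("advert",  (50, 200, 550, 300)),  # full-width advert at the foot
]

PAGE_MIDLINE = 300  # assumed gutter position for a two-column page

def region_of(box):
    """Assign a block to the full-width band or one of the two columns."""
    x0, _, x1, _ = box
    if x0 < PAGE_MIDLINE < x1:
        return "full_width"  # spans the gutter
    return "left_column" if x1 <= PAGE_MIDLINE else "right_column"

structure = {"full_width": [], "left_column": [], "right_column": []}
for kind, box in detections:
    structure[region_of(box)].append((kind, box))
for blocks in structure.values():
    blocks.sort(key=lambda d: d[1][1])  # top-to-bottom within each region

print({region: [kind for kind, _ in blocks]
       for region, blocks in structure.items()})
```

Even this crude grouping hints at why layout matters for trade directories: once entries are separated from headings and adverts, and assigned to columns in reading order, the directory's implicit database structure starts to become recoverable.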
In recent years, more and more attention has been paid to techniques which tackle observability challenges for machine learning – that is, interrogating a 'black box' system to reveal some of its internal reasoning when carrying out a specific task. There are clear commercial and ethical reasons for doing this – either for auditing purposes, or (often) because a system is displaying unforeseen biases which need to be better understood (and potentially eliminated). These are live problems, just as relevant from our perspective as they are for commercial applications of AI. But there are other reasons why observability can be useful or interesting from a research perspective: in particular, some researchers have used these techniques to surface patterns and dynamics in the data which we might otherwise not have discovered.[9]
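One of the simplest observability techniques can be sketched in plain Python: permutation feature importance. The model is treated strictly as a black box – we only query its predictions – and we measure how much accuracy drops when each input feature is shuffled, breaking that feature's relationship with the label. Everything here (the synthetic data and the stand-in 'model') is illustrative.

```python
import random

random.seed(42)

# Synthetic data: the label depends only on feature 0; feature 1 is noise.
data = []
for _ in range(200):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    data.append((x, int(x[0] > 0)))

def black_box_model(x):
    """Stand-in for an opaque trained model (it happens to use only x[0],
    but the technique below does not rely on knowing that)."""
    return int(x[0] > 0)

def accuracy(samples):
    return sum(black_box_model(x) == y for x, y in samples) / len(samples)

baseline = accuracy(data)

importances = []
for feature in range(2):
    shuffled_values = [x[feature] for x, _ in data]
    random.shuffle(shuffled_values)
    permuted = []
    for (x, y), v in zip(data, shuffled_values):
        x2 = list(x)
        x2[feature] = v  # break this feature's link to the label
        permuted.append((x2, y))
    importances.append(baseline - accuracy(permuted))

print(f"importance of feature 0: {importances[0]:.2f}")  # large drop
print(f"importance of feature 1: {importances[1]:.2f}")  # no drop: noise
```

The shuffle reveals, without ever opening the model up, that its decisions rest entirely on the first feature – exactly the kind of evidence an audit (or a bias investigation) needs.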
Large Language Models (LLMs) are pre-trained deep learning algorithms which are computationally large (with the number of parameters 'learnt' through training typically running into the billions) and trained on vast quantities (read: thousands of gigabytes) of data gathered from across the internet. These models are an example of what is frequently (and somewhat dramatically) referred to as 'the bitter lesson' in AI.[10] This is quite simply that 'size matters': simple but supermassive models, backed by heavy investment in training, consistently outperform more intricate, hand-engineered approaches. As you might expect, given the amount of resource involved, the development of LLMs has largely taken place behind closed doors at places like Google and OpenAI, but there are a growing number of open-source models – such as those from research start-up Hugging Face – which are more widely accessible and usable by researchers. The kinds of tasks for which LLMs are useful open up a range of creative and research possibilities which will be interesting to explore. We might, for example, apply text summarisation and automated question answering towards a more deconstructed exploration of Humphrey Jennings's Pandaemonium (see Jennings, 1985; Robins and Webster, 1987), uncovering new themes and connections in its rich and varied (often labyrinthine) anthologisation of the Industrial Revolution.
With research continuing apace, some of the most exciting and creative recent developments have combined several of these themes. One example is DeepMind's collaboration with classics researchers at Oxford University, the University of Venice, and the Athens University of Economics and Business on Ithaca, a deep neural network designed to restore damaged text in images of ancient Greek inscriptions, to date them, and to attribute their likely original location.[11]
Also relevant here is DeepMind's recent work on the general-purpose Flamingo visual language model,[12] which combines large pre-trained systems with a series of connective layers. Flamingo has achieved state-of-the-art results in few-shot learning on visual challenges which combine object identification, knowledge retrieval, and question answering.
The final category of recent development in AI which may turn out to be of interest is 'creative' AI. Machine-learning models which produce some kind of creative output – whether in the form of images, audio or written text – are currently receiving a great deal of attention. These include applications such as Stable Diffusion,[13] which generates sophisticated imagery from a given prompt, or AI21 Labs' attempt to 're-create the mind of [late US Supreme Court Justice] Ruth Bader Ginsburg' as an ML-powered chatbot, with somewhat predictably eccentric (and not entirely well received) results.[14] Whilst these sorts of applications fall more into the 'interesting' than the 'useful' category, creative AI is nevertheless worthy of inclusion in the wider discussion around ML in our context, if only because it can provide an intriguing mirror or lens through which to view our research. One could easily imagine using ML to go beyond exploring Pandaemonium (as in my previous example), towards creatively re-interpreting the life and works of Humphrey Jennings.
Deep Learning is a fast-moving field. In this article I have identified some of the key emergent areas which I believe are of greatest relevance for projects such as Congruence Engine, as well as some hopefully useful ‘coping strategies’ for navigating the promise and potential of machine learning as a non-practitioner.
Machine learning – and deep learning in particular – is already transforming many industries, and its commercial application is both widespread and well documented. For digital humanities practitioners, however, I believe that the benefits of this growing research area are yet to be fully realised. A significant barrier to date has been the lack of sufficiently well-annotated data for historical research purposes, and of the resources needed to create new datasets at the scale required to train new systems from scratch. Of all the trends identified here, the industry-wide shift currently taking place – towards repurposing and fine-tuning large and sophisticated 'general purpose' machine learning models – offers the greatest promise for research applications such as ours.
Congruence Engine is supported by AHRC grant AH/W003244/1.