Automatic speech recognition is most useful when users can also estimate how much trust to place in its output. For long witness testimonies, including Holocaust survivor testimonies, this matters because researchers often work with sensitive, historically valuable material. Agentic AI helped us turn a proof-of-concept workflow into a deployable production feature in roughly two weeks.
For many speech-recognition tasks, a transcript alone is only half of the practical answer. A researcher can read it, search it and quote from it, but sooner or later the same question appears: how reliable is this particular transcript?
This is especially important for long witness recordings. In collections built around testimonies, including Holocaust survivor accounts and USC Shoah Foundation data published on YouTube, recognition errors influence search, interpretation, review effort and the level of caution needed before a transcript is used as research evidence.
For several years, we have been running UWebASR as a web platform for automatic speech recognition. For scientific and non-commercial use, the service is freely available, and the newer Zipformer models now include variants adapted to this type of data. The next useful step was to give users something more than words on a screen: an estimate of how accurately a selected model is likely to recognise their material.
From System Signals to Accuracy Estimates
The technical idea is simple to state and less simple to make robust. During recognition, UWebASR also produces internal signals about model certainty. The calibration workflow uses labelled evaluation data to learn a model-specific predictor that estimates expected word recognition accuracy from those signals.
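To make the idea concrete, here is a minimal sketch of such a predictor. It is not the method documented in CALIBRATION.md. It assumes a single hypothetical feature (the mean word confidence of a recording), a plain least-squares line, and invented calibration values; the real workflow may use different signals and a different model.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct feature values")
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def predict_accuracy(a, b, confidence):
    """Map a confidence summary to an accuracy estimate clipped to [0, 1]."""
    return min(1.0, max(0.0, a * confidence + b))

# Hypothetical calibration pairs, invented for illustration:
# (mean word confidence of a recording, measured word accuracy)
calibration = [(0.60, 0.50), (0.70, 0.62), (0.80, 0.74), (0.90, 0.86)]

a, b = fit_line([c for c, _ in calibration], [acc for _, acc in calibration])

# Estimate for a new, unlabelled recording with mean confidence 0.85.
estimate = predict_accuracy(a, b, 0.85)
print(f"estimated word accuracy: {estimate:.2f}")
```

The sketch shows why the predictor is model-specific: the slope and intercept are learned from one model's confidence values and may not transfer to another model.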
The public repository uwebasr-calibrate contains the code and the method description. The key design choice is to protect privacy: users can calibrate the estimate on their own labelled data, while the reference transcripts remain on their machine. Only audio is sent to the chosen UWebASR endpoint for recognition.
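The local side of this split can be sketched as follows. The example assumes that word accuracy is defined as one minus the word error rate and that text is normalised by lowercasing and splitting on whitespace. Both are assumptions made for illustration, not the repository's documented definitions. The function names are hypothetical.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (sub/ins/del cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]

def word_accuracy(reference, hypothesis):
    """Word accuracy as 1 - WER, floored at 0; computed entirely locally."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))

# The reference stays on the user's machine; only the recognised text
# returned by the service is compared against it.
reference = "we left the village in winter"
recognised = "we left a village in winter"
print(f"word accuracy: {word_accuracy(reference, recognised):.3f}")
```

Because the comparison needs only the recognised text and the local reference, the labelled transcripts never have to leave the researcher's machine.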
This matters for digital humanities because collections are often sensitive, heterogeneous and hard to evaluate with one universal benchmark. A calibrated estimate cannot replace manual checking, of course. It can, however, help a researcher decide whether a transcript is good enough for search, whether it needs careful correction, or whether a particular recording should be treated with extra caution.
Agentic AI in the Loop
This work also became a small but useful experiment in agentic AI for research software development. In Codex, I first designed a proof-of-concept experiment and tuned the basic method. I then wrote the complete method into CALIBRATION.md and put it into the repository.
Then another agentic AI workflow took that method description and implemented the end-to-end process from scratch. That included recognition, feature extraction, train/test splitting, model training, metrics and reports. I then prepared the code needed for production deployment and could include results from the deployed UWebASR environment directly in the paper.
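Two of the steps named above, train/test splitting and metrics, can be illustrated with a short sketch. It assumes a split by recording identifier and mean absolute error as the evaluation metric. The function names, the split fraction and the sample values are hypothetical and are not taken from the repository.

```python
import random

def split_recordings(ids, test_fraction=0.25, seed=0):
    """Deterministically split recording ids into (train, test) lists."""
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(predicted, measured):
    """Average absolute gap between predicted and measured accuracy."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)

# Hypothetical recording ids; whole recordings are held out so that no
# recording contributes to both training and evaluation.
train_ids, test_ids = split_recordings([f"rec{i}" for i in range(8)])

# Invented predicted vs. measured accuracies on held-out recordings.
mae = mean_absolute_error([0.80, 0.60], [0.70, 0.70])
print(f"{len(train_ids)} train, {len(test_ids)} test, MAE = {mae:.2f}")
```

Fixing the seed keeps the split reproducible, which helps when results from a deployed environment need to be compared with earlier experiments.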
The most visible practical change is the pace of work. Without agentic AI, this kind of research-software loop would normally take several months: design the method, implement it, debug edge cases, run experiments, prepare deployment code and only then write up the production results. With the agentic workflow, the core path took about two weeks.
What Comes Next
We are preparing a full paper on this topic in connection with the Digital Heritage of European Conflicts Conference (DHECC 2026) in Odense. The digital heritage of European conflicts is a fitting frame for the work: the goal is to give people working with witness recordings a transcript together with a model-specific estimate of how well the system handled their material.
The result is now visible in the current UWebASR ecosystem: specialised Zipformer models can provide an estimate of their own recognition accuracy. If the route from an idea to a production feature and then to a paper sounds interesting, the repository and the calibration method are public. More after the summer break.
Links
- UWebASR: Web-based speech recognition service with Zipformer models, oral-history models and HTTP API access.
- uwebasr-calibrate: Public GitHub repository with scripts and documentation for building calibrated UWebASR accuracy estimates.
- CALIBRATION.md: Method description for training and evaluating a confidence-derived accuracy predictor for UWebASR models.
- MALACH: Search interface over USC Shoah Foundation data published on YouTube.
- Digital Heritage of European Conflicts Conference (DHECC 2026): Conference on digital heritage of European conflicts, connected with the MEMORISE project.
- DHECC 2026 full-paper special issue: Springer Nature collection for full papers from DHECC 2026 in the International Journal for Digital Humanities.