Data Science is History’s Future

Benjamin Peck
Mar 10, 2020

History, as a discipline, has a data problem. Not that there isn’t enough data to work with, though the preponderance of dusty tomes in this resolutely qualitative field may give that impression at first glance. On the contrary, the field of history is so awash in data that up until now, absorbing and understanding it has been the work of lifetimes. While other academics, such as mathematicians, are stereotyped as reaching their “peak” early in their career, historians are traditionally said to do better and better work until they die. The reason is simple: the longer a historian has to study, the more data they manage to see and incorporate into their work.

For centuries, the only way to confidently make a statement like “the British public were concerned about the threat of invasion by Germany in the years leading up to the First World War” was to immerse oneself in the relevant literature. Only after reading and pulling evidence from hundreds of documents (contemporary newspapers, letters, diaries, and novels, to say nothing of secondary literature written by other historians) could a historian make such a statement about the cultural perspective of a given time and place, and even then the judgment would be largely subjective. Evidence would be presented, of course, but usually in the form of supporting examples, a tiny subset of the literature that could easily be cherry-picked. Ultimately, a historian’s credibility rested on the persuasive strength of their argument and on their reputation. There was simply too much to take in, too much to read over even an entire career, and often no objective way to support the findings gleaned.

[Image: the cover of the first edition of “The Invasion of 1910,” a prominent piece of prewar invasion literature. Caption: “Invasion literature: not just a fun hypothetical”]

Data science presents an immense opportunity for historians. Where previously study was limited to a human scope, the advent of accessible, broadly applicable data science tools opens the possibility of augmenting human perspective with the ever-increasing processing power of computers. Instead of requiring close reading of a carefully selected slice of documents, the invasion question above could be approached through distant reading. This essentially means having a computer read an immense corpus of documents for you and return aggregate data, an output small enough to be human-interpretable while still yielding insights on the entire corpus.

You can’t ask a computer if the British public were worried about German invasion in 1910, but you certainly can ask it whether stories featuring invasion and submarine attack increased over that period as a proportion of a corpus of British fiction. You can’t ask a computer how the British public felt about the German Empire, but you can ask it which words most often appeared alongside references to Germany in period newspapers.

Better still, the computational aspect allows for more (but not entirely!) objective measures of confidence. Not only can you say that references to invasion increased in a given period, but also exactly how much of an increase you observed. Want to see if invasion stories can be put into distinct groups based on word choice? Clustering algorithms yield categorizations that are not arbitrary and can be rated objectively with metrics like the silhouette score. Can you guess based solely on the text whether an invasion story was written before or after some pivotal real-life event? You can build a logistic regression model to do just that, and even learn from it how confident it is in each judgment and how much of a role different factors play in making the call.
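To make the first two questions concrete, here is a minimal sketch of distant reading in plain Python. Everything in it is an assumption made for illustration: the five-line toy corpus is invented, the keyword list and stopword list are placeholders, and the function names are hypothetical. A real study would run over thousands of digitized texts and would need far more careful tokenization. The clustering and logistic regression steps are not shown; they would typically lean on a library such as scikit-learn.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (year, text) pairs, invented for illustration.
CORPUS = [
    (1905, "A quiet tale of village life and a country wedding."),
    (1906, "The German fleet landed at dawn and the invasion of Essex began."),
    (1909, "A submarine attack on the harbour preceded the German invasion."),
    (1909, "A romance set among the lakes."),
    (1910, "Rumours of invasion spread as German spies were seen on the coast."),
]

# Assumed keyword and stopword lists; a real project would curate these.
THEME_TERMS = {"invasion", "submarine"}
STOPWORDS = frozenset({"the", "a", "of", "and", "as", "on", "at", "were"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def theme_share_by_year(corpus, terms):
    """Return, per year, the fraction of documents containing any theme term."""
    totals = Counter()
    hits = Counter()
    for year, text in corpus:
        totals[year] += 1
        if terms & set(tokenize(text)):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def cooccurring_words(corpus, target, window=3, stopwords=frozenset()):
    """Count words appearing within `window` tokens of each use of `target`."""
    counts = Counter()
    for _, text in corpus:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts.update(w for w in neighbours if w not in stopwords)
    return counts


shares = theme_share_by_year(CORPUS, THEME_TERMS)
neighbours = cooccurring_words(CORPUS, "german", window=3, stopwords=STOPWORDS)
print(shares)
print(neighbours.most_common(5))
```

The output is exactly the kind of aggregate the paragraph above describes: a handful of numbers per year and a short ranked word list, small enough for a historian to read and argue with, yet computed over every document in the corpus rather than a hand-picked few.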

Data science isn’t magic, and it certainly can’t replace the human aspect of history. It is a tool, one that still requires human judgment to guide where and how to apply it. Data science simply serves to extend and augment human study by allowing us to ask old questions in new ways and gain insights from far more information than could ever be imbibed by a single individual. With so many fields being transformed by data science methods, it’s time history took note. For a subject so perpetually oversaturated with data, the benefits could be remarkable.

Benjamin Peck

Data scientist and history geek committed to answering old questions in new ways.