Americans commonly refer to US regions like the South or the Midwest in conversation, but we disagree on their borders as often as we mention them. Is DC in the South? Is Kansas City in the Midwest? Is the Mid-Atlantic even a region at all? Ask any ten Americans these questions and you’ll get eleven different answers — at least if my own family is anything to go by. So why do we use these groupings at all? And if we are to use them, is there a better way to define them?

Regionalization — the practice of conceptually dividing…


Much of the buzz around data science focuses on its predictive power: the ability to create models that can distinguish between preexisting categories and estimate unknowns. There are other applications of data science, however, that can contribute to predictive modeling or exist independently of it. One such application is clustering, or algorithmically grouping observations into categories which were not previously defined. Simply put, clustering is using math to put things in groups, with the math (rather than you) determining what the groups should be.
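To make "the math determines the groups" a little more concrete, here is a minimal sketch using k-means, which is only one of many clustering algorithms and is my assumption for illustration rather than a method named above. The function name `k_means`, the naive choice of starting centroids, and the toy points are all hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (labels, centroids).

    Assumption for simplicity: the first k points serve as the starting
    centroids. Real implementations usually pick starting points more
    carefully.
    """
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Step 2: move each centroid to the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:  # leave a centroid in place if nothing was assigned
                centroids[i] = tuple(sum(axis) / len(cluster)
                                     for axis in zip(*cluster))
    labels = [nearest_centroid(p, centroids) for p in points]
    return labels, centroids


# Toy data: two loose groups, with no categories supplied in advance.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Note that the only thing supplied by hand is the number of groups, `k`. Which observations end up together is decided entirely by the distances between them.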

In a previous article on distant reading I used the example of grouping books based on…


Why Peter Turchin’s “cliodynamics” can’t quite predict the future

In recent weeks, an old idea has gotten a renewed burst of press. While many considered the tumultuous events of 2020 to have arrived as a bolt from the blue, one academic named Peter Turchin predicted a spike of instability for the United States in 2020 at least as early as 2010. In an article published in Nature that year, Turchin noted various trends — stagnating real wages, rising inequality, a growing youth population — indicating a cycle of instability due to come to a head around the start of the next decade. Naturally, the accuracy of this prediction ten…


Linear regression is the grandfather of all predictive models. Two centuries on, it’s been largely eclipsed in glamor by its progeny, but it remains no less relevant or useful. So what is linear regression, exactly?

[Figure: a scatterplot with a red line of best fit drawn in and residuals indicated for each point. Caption: "It looks kind of like this, for starters."]

Linear regression models relationships between variables by fitting a linear equation to the data. The familiar visualization of this is a “line of best fit” drawn through a scatterplot. …
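As a rough illustration of what "fitting a linear equation to the data" can mean, the sketch below assumes ordinary least squares with a single predictor, the most common way of choosing the line. The function names `fit_line` and `residuals` and the sample numbers are hypothetical, chosen only for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data.

    Uses the closed-form solution for one predictor:
    slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Vertical distance from each observed point to the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up points that lie exactly on y = 2x + 1, so every residual is zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(residuals(xs, ys, slope, intercept))
```

The residuals are the same quantities a scatterplot shows as the gaps between each point and the line. Least squares picks the line that makes the sum of their squares as small as possible.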


One of the most promising applications of data science for history has been the technique known in the digital humanities as “distant reading.” Distant reading is a deliberate inversion of the more familiar term “close reading,” meaning a careful, fine-grained examination of the particulars of a text. In contrast, distant reading involves the use of automation to make generalizations about vast corpora of text. I alluded to this in my previous post using the example of invasion literature, but now want to take the opportunity to elaborate on what exactly distant reading is and how it works.

The current incarnation…


Any discussion of the intersection of history and new technologies — technologies that are new to the field of history, at least — will inevitably run into a morass of terminology. Academic writing isn’t exactly known for being clear and concise in the first place, and with novel terminology being coined to describe advances, anyone foolhardy enough to try writing about it finds themself in a minefield of changing, often-debated usage. I am apparently that foolhardy, and think some clarification of terms is necessary before going any further.

Digital history is a sub-field of digital humanities, the current buzzword in…


In a previous post, I discussed the vast amount of data available to historians and the possibilities presented by new methods of working with it. This naturally raises the question of where that data is, exactly, and how researchers can access and make use of it. The good news is that history data is everywhere. The bad news is that it’s generally really, really gross.

When we hear the word “data,” we typically think of spreadsheets of numbers, experimental measurements from bench sciences, line graphs and stock charts and the like. The truth is that data can be all sorts…


History, as a discipline, has a data problem. Not that there isn’t enough data to work with — though the preponderance of dusty tomes in this resolutely qualitative field may give that impression at first glance. On the contrary, the field of history is so awash in data that up until now, absorbing and understanding it has been the work of lifetimes. While other academics, such as mathematicians, are stereotyped as reaching their “peak” early in their career, historians are traditionally said to do better and better work until they die. The reason is simple: the longer a historian has…

Benjamin Peck

Data scientist and history geek committed to answering old questions in new ways.
