Mapping Genres by Language Usage

by Ryan Micallef

We work with a lot of words here at Hidden Door. Making worlds out of words is... kind of our thing. So naturally, the sort of product we’re building requires an excellent grasp of language, from both the reader's and writer's perspective. From early on, we’ve thought a lot about what makes a textual world feel consistent.

For example: How do we know we’re reading science fiction vs. high fantasy? What does it mean for an author’s work to be their own? As we’ve thought about these questions, we’ve collected and analyzed a lot of text, with an eye toward understanding what makes it semantically coherent, what makes it interesting to a reader, and what makes it important to a writer.

There are a growing number of powerful tools for examining text. One is Word2vec, a staple of natural language processing (NLP). As the name suggests, it transforms a word into a “vector” of numbers that represents the word’s semantic content. Vector is just a fancy word for a point in space, defined by a list of numbers. Our Word2vec vectors are lists of 300 numbers because they live in a 300-dimensional space. So any text, from a single word to a novel, can be “embedded” as a 300-number vector that roughly represents the meaning of its words.
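To make the idea of embedding a whole passage concrete, here is a minimal sketch. It assumes one common approach, averaging the vectors of the words in the text, which is not necessarily how the vectors in this post were produced. The tiny three-number vectors and the names `TOY_VECTORS`, `embed`, and `cosine_similarity` are made up for illustration; real Word2vec vectors would have 300 numbers each.

```python
import math

# Made-up 3-number "word vectors" for illustration only.
# Real Word2vec vectors would be 300 numbers long and learned from text.
TOY_VECTORS = {
    "dragon": [0.9, 0.1, 0.0],
    "sword": [0.8, 0.2, 0.1],
    "castle": [0.7, 0.3, 0.0],
    "starship": [0.1, 0.9, 0.2],
    "laser": [0.0, 0.8, 0.3],
    "planet": [0.2, 0.9, 0.1],
}


def embed(text):
    """Embed a text by averaging the vectors of its known words.

    Averaging is an assumed pooling method, chosen because it is simple.
    """
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


fantasy = embed("The dragon circled the castle.")
more_fantasy = embed("A sword hung in the castle.")
scifi = embed("The starship fired a laser at the planet.")

# The two fantasy sentences should score closer to each other
# than the fantasy sentence does to the sci-fi sentence.
print(cosine_similarity(fantasy, more_fantasy))
print(cosine_similarity(fantasy, scifi))
```

Words missing from the table are simply skipped, so longer texts still reduce to a single vector of the same length.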

We wanted to compare words and sentences from within one genre to others. Word2vec lets us embed text from different genres and make those sorts of quantitative comparisons. And, of course, quantitative comparisons can be visualized, so we took a swing! Here’s what we got:

A visualization of several literary genres, mapped in space.

To generate this plot, we took groups of sentences from books in specific genres and encoded them using Word2vec into a 300-dimensional space. We then used UMAP to reduce the 300 dimensions to two for plotting. Each chunk of text becomes a dot in the scatter plot. We expected that groups of text from one genre, or by one author, would be similar to other groups of text from that same genre or author. If we use the same color dot for each genre or author, that’s exactly what we see: dots of the same color cluster together.
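A two-dimensional projection can only approximate distances in the original space, so the same expectation can also be checked numerically before any plotting. The sketch below is one way to do that, not the analysis behind the plot: it compares the average similarity of chunk pairs within a genre against pairs across genres. The `CHUNKS` data and the function names are hypothetical, and the vectors are made-up three-number stand-ins for real 300-number embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical (genre label, chunk embedding) pairs with made-up values.
CHUNKS = [
    ("fantasy", [0.9, 0.1, 0.0]),
    ("fantasy", [0.8, 0.2, 0.1]),
    ("fantasy", [0.7, 0.3, 0.0]),
    ("sci-fi", [0.1, 0.9, 0.2]),
    ("sci-fi", [0.0, 0.8, 0.3]),
    ("sci-fi", [0.2, 0.9, 0.1]),
]


def mean_similarities(chunks):
    """Return (mean within-genre similarity, mean across-genre similarity)."""
    within, across = [], []
    for (genre_a, vec_a), (genre_b, vec_b) in combinations(chunks, 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if genre_a == genre_b:
            within.append(similarity)
        else:
            across.append(similarity)
    return sum(within) / len(within), sum(across) / len(across)


within_mean, across_mean = mean_similarities(CHUNKS)
print(f"within genre: {within_mean:.3f}")
print(f"across genres: {across_mean:.3f}")
```

If chunks from the same genre really are more alike, the within-genre average comes out higher than the across-genre average, which is the numeric counterpart of same-colored dots sitting together.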

At the time, several months ago, we were preparing to chat with some brilliant science fiction authors (we love sci-fi here), so we wanted to focus on that specific genre. This useful list of fiction genres and subgenres made it easy to round up texts we'd expect to be similar in content. From within that list, we selected ones familiar to us (it’s always useful to work with data you know some things about). This is especially true of text data, where the answers are rarely obvious. Our team happened to have read children’s fantasy as kids, such as The Chronicles of Narnia and A Wrinkle in Time, and some of the Hitchhiker’s Guide books, so these were good points of reference.

The plot shows borders where we might expect, and also some surprises! For example, we can intuit the proximity of nanopunk, biopunk, and the superhero genre by reference to the powers conferred by technology or ‘magic’. Whether it’s because of a mutant virus, nano-implants, or being from the planet Krypton, somehow you can fly and lift improbably heavy objects. We see comedy sci-fi near children’s fantasy, which we suspect has to do with linguistic style, and perhaps a level of humor we wouldn’t expect to see in harder sci-fi, which is more focused on technical accuracy and detail.

Speaking of contrast, when doing this sort of research, it’s useful to have a foil: something you expect will be fairly distinct from your area of interest, as a sanity check. In this case, we chose Dan Brown. Dan Brown is a very popular American author known for his thrillers, primarily set in the contemporary real world. We expected his books would provide a contrast against science fiction, which tends to be set in pretty diverse timelines and locations. And indeed, Dan Brown’s books, such as Angels and Demons and The Da Vinci Code, sit nicely together, and away from the more fantastical genres. The same is also true of hard-boiled mysteries like The Maltese Falcon.

One exciting thing about building quantitative models of genre from existing stories is that it means we can also go the other way. Using this quasi-spatial understanding of text, we can “translate” across genres when creating worlds. This allows us to give our players the ability to express themselves without breaking the texture or feel of their world, in the same way a good author or dungeon master would. The end result is an immersive story that responds intuitively to the player’s actions in the context of the world they’re playing in. For example: You decide to make your great escape from the antagonist. In a Western, you’d gallop away on horseback with six-shooters blazing. But in other universes, like sci-fi or fantasy, you’d be firing blasters from a hovercraft or jumping through a magical portal with a crossbow.
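One simple way to picture this kind of translation is vector offset arithmetic: shift a word’s vector away from the center of its source genre and toward the center of the target genre, then pick the closest word in the target genre. The sketch below illustrates that idea only; it is not a description of how the actual system works. The four-number vectors, `GENRE_WORDS`, and `translate` are all made up for the example.

```python
import math

# Made-up 4-number vectors: (western-ness, sci-fi-ness, vehicle-ness, weapon-ness).
TOY_VECTORS = {
    "horse": [1.0, 0.0, 1.0, 0.0],
    "six-shooter": [1.0, 0.0, 0.0, 1.0],
    "hovercraft": [0.0, 1.0, 1.0, 0.0],
    "blaster": [0.0, 1.0, 0.0, 1.0],
}

# Hypothetical vocabulary for each genre.
GENRE_WORDS = {
    "western": ["horse", "six-shooter"],
    "sci-fi": ["hovercraft", "blaster"],
}


def centroid(words):
    """Average the vectors of the given words."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def translate(word, source_genre, target_genre):
    """Shift a word from one genre's center to another's, then find the nearest target word."""
    source_center = centroid(GENRE_WORDS[source_genre])
    target_center = centroid(GENRE_WORDS[target_genre])
    shifted = [
        value - src + dst
        for value, src, dst in zip(TOY_VECTORS[word], source_center, target_center)
    ]
    return max(
        GENRE_WORDS[target_genre],
        key=lambda candidate: cosine_similarity(shifted, TOY_VECTORS[candidate]),
    )


print(translate("horse", "western", "sci-fi"))        # hovercraft
print(translate("six-shooter", "western", "sci-fi"))  # blaster
```

The toy vectors are built so the arithmetic works out exactly; with real embeddings the nearest word would be a best guess rather than a clean match.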

This research has sparked even more questions for our team about how text might cluster. What might this look like over the course of an author’s life, or within a given fictional world as imagined by different authors? And what’s up with dystopian science fiction? How might titles such as A Clockwork Orange, Fahrenheit 451, and Do Androids Dream of Electric Sheep fit into these genre maps? We’re still thinking about all of these questions and more, and we hope to dive deeper into these topics soon!