Introduction to TokenX on the Cather Archive

TokenX, created by Brian Pytlik Zillig, is a powerful text analysis, visualization, and play tool that has been customized for use on the Willa Cather Archive. Specifically, the complete corpus of Cather's published fiction has been digitized so that the computer can be used as a tool for understanding qualities of the text in new ways: word frequency, patterns of usage, and more. We hope that this unprecedented access to certain kinds of information about Cather's texts will support a variety of research and teaching goals, from ambitious projects that seek to understand Cather's use of language throughout her writing career to more focused work on individual stories and novels.

Texts Used in Digitization of Cather's Fiction Corpus

Cather's works have appeared in multiple editions, from first editions to the 1937 Autograph Edition to modern paperbacks and carefully edited scholarly editions. To ensure consistency, we decided to use digital texts of the first editions of all of Cather's novels and short story collections. For the short fiction published individually, we used the texts from the first periodical printings (when available) and the standard reprintings of the stories. The complete list of texts and their sources can be found here: List of Texts Used in Text Analysis Project.

Because we want the information about the body of Cather's fiction to be as accurate as possible, each text has been proofread multiple times. However, with over 1.2 million words in the texts, mistakes are bound to be discovered. If you discover a mistake, please email Andrew Jewell, editor of the Willa Cather Archive, at ajewell2@unl.edu.

What Does TokenX Do?

TokenX is a tool that allows users to experience text in new ways. Specifically, it uses Extensible Stylesheet Language Transformations (XSLT) and Extensible Markup Language (XML) to automatically mark each word across Cather's corpus individually as a token. Once the texts are tokenized, the computer can be instructed to perform many functions and gather a wide range of information about them. Currently, TokenX allows you to visualize, analyze, and play with texts.
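To make the idea of tokenization concrete, here is a minimal sketch in Python. TokenX itself relies on XSLT and XML, so this is only an illustration of the general idea, not the tool's actual code. The element names `w` (word) and `pc` (punctuation), the `n` position attribute, and the sample sentence are all assumptions invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# A word is a run of letters/digits, optionally joined by apostrophes or hyphens;
# anything else that is not whitespace counts as a single punctuation token.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize_to_xml(text):
    """Wrap every word and punctuation mark of `text` in its own XML element."""
    root = ET.Element("text")
    for position, match in enumerate(TOKEN_PATTERN.finditer(text), start=1):
        token = match.group(0)
        # Hypothetical tag names: <w> for words, <pc> for punctuation.
        tag = "w" if token[0].isalnum() else "pc"
        element = ET.SubElement(root, tag, n=str(position))
        element.text = token
    return root

# Invented sample sentence, not a quotation from Cather.
sample = "The road was wide, and the prairie's grass was red."
print(ET.tostring(tokenize_to_xml(sample), encoding="unicode"))
```

Once every word sits in its own element, later steps such as counting, highlighting, or replacing can address each token individually.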

Visualization: When you first encounter TokenX on the Cather Archive, you may select a text from a drop-down menu that lists all of Cather's fiction not protected by copyright and therefore republishable in its entirety. Then, using the controls on the left of the screen, you can follow onscreen instructions to select a variety of options that allow words or punctuation to be seen differently. This feature can be particularly useful if a researcher or teacher wants to make certain patterns of usage dramatically apparent.

Analysis: TokenX can take the text of your choice and automatically give you statistical information about that text, which in turn can be presented as numbers in tabular form, as a concordance (the "decontextualize and count" option), or in the context of the original textual order.
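As a rough illustration of two of these presentations, the Python sketch below counts words out of context and then shows the same counts alongside each word in its original order. It is an assumption about how such output might be produced, not a description of TokenX's internals. The function names and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def words(text):
    """Return the lowercase word tokens of `text` in their original order."""
    return re.findall(r"\w+(?:['-]\w+)*", text.lower())

def frequency_table(text, top=5):
    """Count each word out of context and return the `top` most frequent."""
    return Counter(words(text)).most_common(top)

def annotate_in_order(text):
    """Pair every word, in original textual order, with its total count."""
    tokens = words(text)
    counts = Counter(tokens)
    return [(token, counts[token]) for token in tokens]

# Invented sample sentence, not a quotation from Cather.
sample = "The road was wide, and the grass was red."

for word, count in frequency_table(sample, top=3):
    print(f"{word:<10}{count}")

print(annotate_in_order(sample))
```

The first loop prints a small table of counts, while the second call keeps the reading order intact so that repeated words can be spotted where they occur.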

Play: Because the creator of TokenX believes useful knowledge can be gleaned by willful deformation of text, users are given the option to replace words with other words or with small images, remaking the text in order to notice it differently.
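A word-for-word substitution of this kind can be sketched in a few lines of Python. This is a simplified illustration under assumed names (`replace_words`, the replacement symbols, and the sample sentence are all hypothetical), and it handles only text replacements, not images.

```python
import re

def replace_words(text, replacements):
    """Swap whole words found in `replacements` (lowercase keys); keep all else."""
    def swap(match):
        word = match.group(0)
        return replacements.get(word.lower(), word)
    return re.sub(r"\w+(?:['-]\w+)*", swap, text)

# Invented sample sentence, not a quotation from Cather.
sample = "The road was wide, and the grass was red."
print(replace_words(sample, {"was": "[=]", "the": "*"}))
```

Replacing common words with a symbol makes the rhythm of the remaining words stand out, which is one way a deformed text can be noticed differently.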

Word Frequency Database: All of Cather's fiction texts, including those published after 1922, can be analyzed in the word frequency database. This database will give you results about word usage across the corpus, but will only take you to the full text if that text is not protected by copyright.

What Do You Think?

Since TokenX on the Willa Cather Archive is an experiment, an attempt to see what kinds of new knowledge can be created with new kinds of access, we would appreciate any responses or suggestions you may have that can help future development. You may email Andrew Jewell, editor of the Willa Cather Archive, at ajewell2@unl.edu; you may also contact Brian Pytlik Zillig, creator of TokenX.

Support for the creation of this tool was provided by the University of Nebraska-Lincoln's Arts and Humanities Enhancement Fund and the Center for Digital Research in the Humanities.