20.11.2019

Data Mining Tools: Techniques and Visualizations

In this chapter, we set up and explore some basic text mining tools, and consider the kinds of things these can tell us. We then move on to more complex tools, including how to set up some of them on your own machine rather than using the web-based versions.

Regular expressions are an important concept to learn that will aid you greatly; you will need to spend some time on that section. Finally, you will learn some of the principles of visualization, in order to make your results and your argument clear and effective.

Now that we have our data, whether through wget, the Outwit Hub, or some other tool, we have to start thinking about what to do with it. Our options can be as simple as a word cloud, with which we begin this chapter, or as complicated as topic modeling (the subject of Chapter Four) or network analysis (Chapters Five and Six). Some tools are as easy to run as clicking a button on your computer, and others require some under-the-hood investigation.

This chapter aims to introduce you to the main contours of the field, providing a range of options, and also to give you the tools to participate more broadly in this exciting field of research. Yet we do need to realize that these tools shape our research.

Basic Text Mining: Word Clouds, their Limitations, and Moving Beyond

Having large datasets does not mean that you need to jump into programming right away to extract meaning from them. There are three approaches that, while each has its limits, can shed light on your research question very quickly and easily.

The simplest data visualization is a word cloud. In brief, word clouds are generated through the following process. First, a computer program takes a text and counts how frequently each word occurs. In many cases, it will normalize the text to some degree, or at least give the user options for doing so. Second, after generating a word frequency list and incorporating these modifications, the program puts the words in order, sizes them by frequency, and begins to print them. The word that appears most frequently is printed largest and is usually placed in the centre.

The second most frequent word is a bit smaller, the third a bit smaller than that, and so on through dozens of words. While, as we shall see, word clouds have strong critics, they are a very useful entryway into the world of basic text mining.
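The counting step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the code behind Wordle or any other tool; the function name, the tiny stopword list, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def word_frequencies(text, lowercase=True, drop_stopwords=True):
    """Count how often each word occurs, with optional normalization."""
    if lowercase:
        text = text.lower()
    # Treat runs of letters and apostrophes as words; punctuation and digits are dropped.
    words = re.findall(r"[A-Za-z']+", text)
    if drop_stopwords:
        words = [w for w in words if w.lower() not in STOPWORDS]
    return Counter(words)

# Invented sample sentence, used only to exercise the function.
sample = "The party and the people: the party speaks to the people of the nation."
freqs = word_frequencies(sample)

# most_common() returns the list ordered by frequency, the order in which
# a word cloud would size and place the words.
for word, count in freqs.most_common(3):
    print(word, count)
```

Switching `lowercase` or `drop_stopwords` off shows how much the normalization choices change the resulting list, which is one reason two word clouds of the same text can look different.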

Try to create one yourself using the web site Wordle. Who are the protagonists? Who are the villains? As adjectives are separated from other concepts, we lose the ability to derive meaning. With these shortcomings in mind, however, historians can find utility in word clouds. If we are concerned with change over time, we can trace how words evolved in historical documents. While this is fraught with issues – words change meaning over time, different terms are used to describe similar concepts, and we still face the issues outlined above – we can arguably still learn something from this.

Take the example of a dramatically evolving political party in Canada. While we understand that most of our readers are not Canadian, this will help with the example. What do you think that this document was about? When you have a few ideas, read on for our interpretation.

At a glance, we argue, you can see the main elements of the political spirit of the movement. You can piece together key components of their platform from this visualization alone.

For historians, though, the important element is change over time. We need to keep in mind that the meanings of words might change. Again, via a word cloud, figure 3. There is more focus on the international, Canada receives more attention than before, and most importantly, words like socialized have disappeared. Indeed, the CCF here was beginning to change its emphasis, backing away from overt calls for socialism. But the limitations of the word cloud also rear their head: is opportunity good, or bad?

Is freedom good, or bad? Without context, we cannot know from the image alone. But the changing words are useful. By , what do party platforms speak of? Figure 3. In three small images, we have seen a political party evolve from an explicitly socialist party in , to a waning of that during the Cold War climate of , to the mainstream political party that it is today.
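The same comparison can be made numerically rather than by eye. The sketch below ranks words by how much their share of the text changes between an earlier and a later document. The function names are hypothetical, and the two sample strings are invented placeholders, not quotations from any party platform.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Return each word's share of all the words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}

def biggest_shifts(earlier, later, top_n=5):
    """Rank words by the change in their relative frequency between two texts."""
    before = relative_frequencies(earlier)
    after = relative_frequencies(later)
    vocabulary = set(before) | set(after)
    # Positive values mean the word became more prominent; negative, less.
    shifts = {w: after.get(w, 0.0) - before.get(w, 0.0) for w in vocabulary}
    ranked = sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_n]

# Invented placeholder texts, NOT taken from any real document.
earlier_text = "planning planning ownership public ownership"
later_text = "opportunity freedom public opportunity security"

for word, change in biggest_shifts(earlier_text, later_text):
    print(f"{word:12s} {change:+.2f}")
```

Relative rather than raw counts are used so that a longer document does not appear to emphasize every word more. Like the word cloud, this says nothing about whether a word is used approvingly or critically.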

On his blog, digital historian Adam Crymble ran a quick study to see if historians would be able to reconstruct the contents of documents from these word clouds: could they look at a word cloud of a trial, for example, and correctly ascertain what the trial was about? With Big Data, it is sometimes important to let the sources speak to you, rather than looking at them with preconceptions of what you might find.

Word clouds need to be used cautiously. Their lack of context is their biggest shortcoming, but we still believe that, as a complement to wider reading and other forms of inquiry, they offer a quick and easy entryway into the world of data visualization. In the pages that follow, we move from this very basic stage to other basic techniques, including AntConc and Voyant Tools, before moving into more sophisticated methods involving text patterns (or regular expressions), spatial techniques, and programs that can detect significant patterns and phrases within your corpus.

AntConc

AntConc is an invaluable way to carry out some forms of textual analysis on data sets. While it does not scale to the largest datasets terribly well, if you have somewhere in the ballpark of or even 1, newspaper-length articles you should be able to crunch data and receive tangible results.

AntConc can be downloaded from Dr. Laurence Anthony’s personal webpage. Installation, on all three operating systems, is a snap. Let’s take a quick tour and explore an example to see what we can do with AntConc. Once AntConc is running, you can import files by going to the File menu and clicking on either Import File(s) or Import Dir, which allows you to import all the files within a directory.

In the screenshot below, we opened up a directory containing plain text files of Toronto heritage plaques. The first visualization panel is ‘Concordance.’

North York a later municipality until , ties to New York state and city, various companies, other boroughs, and so forth. A simple search for the keyword ‘York’ would reveal many plaques that might not fit our specific query. The other possibilities are even more exciting. The Concordance Plot traces where various keywords appear in files, which can be useful for seeing the overall density of a certain term. For example, in the visualization of newspaper articles below, we trace when frequent media references to ‘community’ in the old website GeoCities declined figure 3.
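To make these two panels concrete, here is a minimal sketch of what a concordance and a concordance plot compute. It is not AntConc's implementation; the function names and the sample text are assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context lines: `width` characters either side of each hit."""
    lines = []
    # \b restricts matches to whole words, so 'York' does not match 'Yorkshire'.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

def hit_positions(text, keyword):
    """Return each hit's position as a fraction of the text length (0.0 to 1.0)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return [m.start() / len(text) for m in re.finditer(pattern, text)]

# Invented sample text standing in for a folder of plaque transcriptions.
sample = ("Fort York guarded the harbour. North York grew much later. "
          "Trade with New York continued throughout.")

for line in concordance(sample, "York", width=20):
    print(line)
print(hit_positions(sample, "York"))
```

Reading down the aligned column shows at a glance that the three hits refer to three different things, which is the ambiguity problem described above. The fractions from `hit_positions` are the raw material for a plot of where in a file a term clusters.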

It turns out, upon some close reading, that this is borne out by the archival record. Collocates are an especially fruitful realm of exploration. With several documents, one could trace how collocates change over time. Finally, AntConc also provides options for overall word and phrase frequency, as well as specific n-gram searching.
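Collocates and n-grams are both simple to compute in their most basic form. The sketch below uses raw counts only; AntConc can also rank collocates with statistical measures, which are not reproduced here. The function names, window size, and sample sentence are assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, keyword, window=2):
    """Count the words that appear within `window` tokens of each keyword hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            # Neighbours to the left and right, excluding the keyword itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Invented sample sentence, used only to exercise the functions.
tokens = tokenize("The online community grew and the online community faded")

print(collocates(tokens, "community").most_common(3))
print(ngrams(tokens, n=2).most_common(2))
```

Running `collocates` separately on documents from different years and comparing the results is one way to trace how the company a word keeps changes over time.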

A free, powerful program, AntConc deserves to be the next step beyond Wordle for many undergraduates. It takes textual analysis to the next level. Finally, let’s move to the last of the three tools that we explore in this section: Voyant Tools.

Voyant Tools

With your appetite whetted, you might want a more sophisticated way to explore large quantities of information.

The suite of tools known as Voyant (previously known as Voyeur) provides this. It produces complex output from simple input. Growing out of the Hermeneuti. Getting started is quick. Simply navigate to http: For the former, just upload one file or paste the text in; for the latter, upload multiple files at that initial stage. After uploading, the workbench will appear as demonstrated in figure 3.

The workbench puts an array of basic visualization and text analysis tools at your disposal. With a large corpus, you can do the following things: For example, with multiple documents uploaded in order of their year, you can see which words show significant increases or decreases over time.

For each individual word, see how its frequency varies over the length of the corpus. Clicking on a word in the text box will generate a line chart in the upper right. You can control for case. Track the distribution of a word by clicking on it and seeing where it is located within the document, along the left-hand side of the central text column.
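The line chart of a word's frequency over the length of a corpus rests on a simple calculation: cut the text into slices and measure the word's share of each one. The following is a rough sketch of that idea, not Voyant's own code; the function name, the number of segments, and the sample text are assumptions.

```python
import re

def trend(text, word, segments=5):
    """Relative frequency of `word` in each of (at most) `segments` equal slices."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Ceiling division, so that no more than `segments` slices are produced.
    size = max(1, -(-len(tokens) // segments))
    result = []
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        result.append(chunk.count(word.lower()) / len(chunk))
    return result

# Invented sample text in which one word gives way to another.
sample = "war war peace peace peace peace"

print(trend(sample, "war", segments=3))
print(trend(sample, "peace", segments=3))
```

Plotting the returned list gives the kind of rising or falling line described above. Lowercasing everything is one way of controlling for case; removing the `.lower()` calls would treat capitalized and uncapitalized forms as different words.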

If you press ctrl and click on multiple words, you can compare words in each of these windows. These are all useful ways to interpret documents, and they offer a low barrier to entry for this sort of textual analysis. Voyant is ideal for smaller corpora or for classroom purposes. The default version is hosted on the McGill University servers, which limits its ability to process very large datasets. A home server installation is also offered, but it was under development at the time of writing, so at this point we recommend learning the basics from the Programming Historian 2, which can help you achieve similar things while learning some code along the way.

None of this, however, is to minimize the importance and utility of Voyant Tools, arguably the best research portal in existence. Even the most seasoned Big Data humanist can turn to Voyant for quick checks, or when dealing with smaller yet still sizeable repositories. A few megabytes of textual data is no issue for Voyant, and the lack of programming expertise required is a good thing.

