So, in our previous multivariate analysis modules,
we've talked a lot about categorical data and quantitative data.
So you're given some data set like a CSV file or an Excel table.
We keep coming back to our example of country,
GDP, and life expectancy, where you just have rows in your data set.
Like the USA has some GDP in dollars,
and has a life expectancy of 80 years,
and then we have Sweden,
and we have some dollar amount,
and we have some life expectancy, and we plot these.
But another common data set you might have is just
a whole bunch of Word documents.
You might just have a whole folder of these Word documents on your computer.
Somebody may drop you a whole bunch of files on your desk.
You might be an intelligence analyst.
You have a whole bunch of reports that you want to go
through and extract information from,
and look at how report one relates to report two.
Do we keep seeing this guy Ross referred to in all of
these different intelligence reports talking about bad guys,
and what's his connection to all of these different bad guys?
Or what are the different trends and themes if this
is a bunch of different books by the same author?
Can we detect any patterns in the way the author writes?
Are there any common themes among his or her stories?
So, this is a very common sort of data set you might find.
And so we want to start thinking about what are
the attributes of this multivariate data set,
and how might we visualize text?
And in text visualization, basically,
we're trying to think about summarizations of important elements within these documents.
And so, there are lots of different pre-processing steps that we may have to go
through prior to getting information from these text documents.
So again remember, you're given a whole bunch of different documents,
and these can be Tweets,
so 140 characters, all the way up to novels that are thousands of pages long.
And remember, we have a bunch of different words.
Each document has a set of words.
We have punctuation.
Even if it's a Tweet, you've still got spaces,
and characters, and information in those.
We may have people, places, and things.
And so we may want to do what's called named entity recognition to extract people,
places, and things from these different chunks.
We may identify this is a person.
This is a place.
And we can grab all of those words.
And we can even then start thinking about which documents use the same words.
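As a rough sketch of that last idea, here is one way you could check which documents use the same words, using plain Python sets. The three reports are made up for illustration, and the helper names `vocabulary` and `shared_words` are hypothetical, not from any library.

```python
import re

def vocabulary(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))

# Hypothetical mini-corpus standing in for a folder of reports.
docs = {
    "report1": "Ross met the courier in Lisbon.",
    "report2": "The courier was seen in Madrid.",
    "report3": "Ross left Lisbon on Tuesday.",
}

# One vocabulary (set of unique words) per document.
vocab = {name: vocabulary(text) for name, text in docs.items()}

def shared_words(a, b):
    """Words that appear in both document a and document b."""
    return vocab[a] & vocab[b]

print(sorted(shared_words("report1", "report3")))
```

With a real corpus you would probably drop very common words first, since words like "the" and "in" are shared by almost every pair of documents.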
And in text visualization,
oftentimes what we're trying to do is summarize
the most important words and the relationships in the document.
It's not necessarily just about counting the number of
times that XXX appeared in all of these documents,
but it may also be how many times that YYY occurred right next to XXX in a document.
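To make that "right next to" idea concrete, here is a minimal sketch that counts how often two words sit side by side in a text. The sample text and the helper names `tokenize` and `times_adjacent` are assumptions for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def times_adjacent(text, x, y):
    """Count how often x and y appear directly next to each other, in either order."""
    tokens = tokenize(text)
    # Count every ordered pair of neighboring tokens.
    pairs = Counter(zip(tokens, tokens[1:]))
    if x == y:
        return pairs[(x, x)]
    return pairs[(x, y)] + pairs[(y, x)]

# Made-up sample text.
text = "Bad guys met Ross. Ross met bad guys again."
print(times_adjacent(text, "ross", "met"))
```

Note that this simple tokenizer ignores sentence boundaries, so the last word of one sentence counts as a neighbor of the first word of the next. Whether that is what you want depends on the analysis.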
So, there's a lot of analysis we often have to do,
and a lot of processing we have to do for text data prior to visualization.
There may be things like named entity recognition,
or maybe things like just counting the number of unique words to get an overview.
It could be things
like Latent Dirichlet allocation for topic modeling,
so I may want to know what topics are in a particular document.
And one of the most common text visualizations is we just find the most common words,
and we draw the words.
We write the words down, and we create what's called a word cloud.
And so, for example, here's the 2002 State of the Union Address by U.S. President Bush.
And what we did is we counted the frequency of the words in his speech.
So, we're given his speech, and we count how many times he says a particular word.
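The counting step itself can be sketched in a few lines with Python's standard `collections.Counter`. The one-line "speech" below is made up for illustration, and `word_frequencies` is a hypothetical helper name.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how many times each lowercase word appears in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Made-up stand-in for the text of a speech.
speech = "We will work, and we will build, and we will lead."
freq = word_frequencies(speech)

# The most frequent words are the ones a word cloud would draw largest.
print(freq.most_common(3))
```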
And notice we get some weird things here,
like we have both America and American.
So you can do things like stemming,
where you try to combine words that share the
same root but have slightly different endings.
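Here is a deliberately crude sketch of what stemming does to the counts. This is a toy suffix-stripping rule written just for this example, not a real stemming algorithm, and both the suffix list and the counts are assumptions. In practice you would use a stemmer from an NLP library.

```python
from collections import Counter

# Toy suffix list, checked in this order (longest first). An assumption for this demo only.
SUFFIXES = ("ans", "an", "a", "s")

def toy_stem(word, min_stem=4):
    """Strip the first matching suffix, as long as at least min_stem letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

def merge_counts(counts):
    """Add together the counts of all words that share the same toy stem."""
    merged = Counter()
    for word, n in counts.items():
        merged[toy_stem(word)] += n
    return merged

# Made-up word counts, not taken from any real speech.
counts = Counter({"america": 12, "american": 9, "americans": 4, "terror": 7})
merged = merge_counts(counts)
print(merged)
```

The merged key "americ" is not a real word, which is typical of stemmers. For display in a word cloud you would usually map the stem back to its most frequent original form.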
We see the same problem here with President Obama: American and America.
So, we have to think about how we process the data,
but what this does is allow us to compare and contrast the 2002 speech
and the 2011 speech, and we see that words like terrorist and terror pop out.
But here we're seeing things like people,
and new, and work, and dream.
It's not that we're not seeing some of those words here.
We see children and justice.
But we're seeing things like Afghanistan,
so we know that something's going on in Afghanistan during this time.
Here we don't see Afghanistan at all in the word cloud.
And this lets us get an overview about what might be going on.
We can compare and contrast two documents and get
a high-level summary of each.
There are other variations of this, called Wordles, that we can try out,
where Wordles do text layout,
again changing the size of the text based on count.
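As a minimal sketch of sizing text by count, here is a linear mapping from word counts to font sizes. The point-size range, the counts, and the helper name `font_sizes` are assumptions. Real word cloud tools may use other scalings, such as square root or logarithmic.

```python
def font_sizes(counts, min_pt=10, max_pt=48):
    """Linearly map each word's count onto a font size between min_pt and max_pt."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        # t is 0 for the rarest word and 1 for the most frequent one.
        t = (n - lo) / span if span else 1.0
        sizes[word] = round(min_pt + t * (max_pt - min_pt))
    return sizes

# Made-up counts for illustration.
sizes = font_sizes({"america": 20, "people": 11, "work": 2})
print(sizes)
```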
But now let's think about other things we get here as well.
We could actually position the words and the position could have meaning.
We could change the color of the words,
and we could change the size of the words.
Now part of the challenge here is America is a relatively long word.
So, it's going to take up more space because of the length.
So, we get challenges in encoding these different variables because
the word length may not have any relationship to a variable,
but the length then gives it strength on the screen.
It attracts our eyes more,
so we have to be careful about these different things.
We often remove small words like a, an,
and the, and we do things that we call stemming.
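Removing those small words can be sketched as a simple filter before counting. The stop word list here is a tiny illustrative one and the sentence is made up. NLP libraries ship much longer lists.

```python
import re
from collections import Counter

# Tiny illustrative stop word list, an assumption for this example.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}

def content_word_counts(text, stop_words=STOP_WORDS):
    """Count words in a text after dropping the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

counts = content_word_counts("The state of the union is strong, and the union endures.")
print(counts.most_common())
```

Without the filter, "the" would be the most frequent word here and would dominate the word cloud without telling you anything about the content.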
And there are lots of libraries in Python, and the Stanford
natural language processing library, that help us process these text data
sets and then let us do some visualization and analysis on them.
This is just one example of how we could do text visualization,
and as you take more data mining and machine learning courses,
you can learn other techniques like Latent Dirichlet allocation for extracting topics.
So we can think about how to show topic changes over time,
and there's tons of really exciting visualizations.
We're trying to summarize documents,
document changes over time,
document relationships between entities,
and all sorts of information.
So, the word cloud is just sort of one quick summary overview.
I don't think anybody would say this is the best visualization necessarily for text,
but this is one of the most common ones you'll see to give you
a quick summary of what's going on in the data. Thank you.