Text network visualization is a powerful tool for analyzing textual data, identifying the main keywords and topics within a document or a corpus of documents.
It also provides a range of tools from graph theory that enable us to look at a text from a different perspective and, hopefully, get new ideas related to the discourse we study.
For this example, we will use Barack Obama’s second inaugural address, delivered in 2013.
Create a new graph, add a new text, and click Save. Your text will be visualized as a network.
The words in the text (minus stopwords such as is, the, all, etc.) are converted into lemmas (e.g. taken → take, regarded → regard) and represented as the nodes of the graph.
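The preprocessing step described above can be sketched in a few lines of Python. This is only an illustration: the stopword set and the lemma dictionary are tiny hand-made stand-ins (assumptions, not the resources InfraNodus actually uses), and the function name text_to_lemmas is hypothetical.

```python
import re

# Toy resources for illustration only (assumptions): a real pipeline would
# use a full stopword list and a proper lemmatizer.
STOPWORDS = {"is", "the", "all", "a", "an", "and", "of", "to", "as", "was", "it", "by"}
LEMMAS = {"taken": "take", "regarded": "regard"}


def text_to_lemmas(text):
    """Lowercase, tokenize, drop stopwords, and map each word to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOPWORDS]


print(text_to_lemmas("The oath was taken and regarded as a promise."))
# ['oath', 'take', 'regard', 'promise']
```

Each item in the resulting list becomes a node; repeated lemmas map to the same node.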
Co-occurrences of the words are represented as the edges that connect the nodes in the graph (the graph is directed). The more often two words co-occur, the closer they sit to each other in the graph, and groups of words that frequently co-occur share a distinct color.
The size of a node is based on its betweenness centrality, a measure of the node’s influence in the network (nodes are ranked from the highest value down). The bigger the node / word, the more influential it is in the discourse.
Betweenness centrality is a measure from graph theory that captures how often a given node appears on the shortest path between any two other nodes in the network.
This means that the words with the highest betweenness are not only the ones that occur most frequently in a text, but also the ones that connect the different contexts or topics present within your text. So there is a degree of correlation with the frequency and tf–idf measures, but it’s not the same thing. We can say that the betweenness centrality (BC) measure is more context-aware.
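To make the idea concrete, here is a minimal sketch of betweenness centrality using Brandes' algorithm. It assumes an unweighted, undirected graph for simplicity (the InfraNodus graph is directed and weighted), and the toy graph below is invented for illustration, not taken from the speech. It has two tight word clusters joined by a single link between "pet" and "market".

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours. The score of a node is
    the number of shortest paths between other node pairs that pass through
    it (fractional when several shortest paths tie).
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}


# Invented toy graph: two triangles joined by the pet-market edge.
adj = {
    "cat": {"dog", "pet"},
    "dog": {"cat", "pet"},
    "pet": {"cat", "dog", "market"},
    "market": {"pet", "stock", "bond"},
    "stock": {"market", "bond"},
    "bond": {"market", "stock"},
}
scores = betweenness(adj)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

In this toy graph "pet" and "market" each score 6.0 and every other word scores 0.0, even though no word is mentioned more often than the others: the score rewards words that bridge clusters, not words that are merely frequent.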
In our example, the most influential nodes are:
american, require, time, people, believe, citizen, country, journey, america
InfraNodus models topics based on the words’ co-occurrence. Words that appear next to each other in the text (but not next to the other words) define the topics present within that text.
The nodes that have the same color are the nodes that belong to the same community and form a distinct topical cluster. This measure is based on the iterative Louvain community detection algorithm, which detects the words that co-occur more often together than with the other words in the text and assigns a specific community and color to them.
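The Louvain method works by greedily maximizing a score called modularity, which is high when edges fall inside communities more often than chance would predict. The sketch below does not implement Louvain itself; it only computes modularity for a given partition, so you can see what the algorithm is trying to improve. The function name, the toy edges and the community labels are all assumptions made for illustration.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected weighted graph.

    edges maps (u, v) pairs to weights; community maps each node to a label.
    Q = sum over communities of (internal weight / m) - (degree sum / 2m)^2.
    """
    total = sum(edges.values())  # m: total edge weight
    degree = {}
    internal = {}
    for (u, v), weight in edges.items():
        degree[u] = degree.get(u, 0) + weight
        degree[v] = degree.get(v, 0) + weight
        if community[u] == community[v]:
            label = community[u]
            internal[label] = internal.get(label, 0) + weight
    community_degree = {}
    for node, d in degree.items():
        label = community[node]
        community_degree[label] = community_degree.get(label, 0) + d
    return sum(
        internal.get(label, 0) / total - (community_degree[label] / (2 * total)) ** 2
        for label in community_degree
    )


# Invented toy graph: two triangles joined by one edge, all weights 1.
edges = {
    ("cat", "dog"): 1, ("cat", "pet"): 1, ("dog", "pet"): 1,
    ("pet", "market"): 1,
    ("market", "stock"): 1, ("market", "bond"): 1, ("stock", "bond"): 1,
}
by_topic = {"cat": 0, "dog": 0, "pet": 0, "market": 1, "stock": 1, "bond": 1}
single = {node: 0 for node in by_topic}

q_topics = modularity(edges, by_topic)
q_single = modularity(edges, single)
print(q_topics, q_single)
```

Splitting the toy graph into its two triangles gives a clearly higher score than putting every word into one community, which is the kind of partition a modularity-based method would prefer.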
You can see the topical clusters by looking at the graph, or by clicking the Essence menu at the top or the Analytics button at the bottom right.
The nodes’ alignment on the graph is based on the iterative Force-Atlas algorithm, where the most connected hubs are pushed apart, while the nodes that are connected to the hubs are pulled towards them. This correlates with the community detection algorithm above, but is less precise and is better suited to visual analysis.
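The push and pull described above can be illustrated with one iteration of a generic force-directed layout. This is not the Force-Atlas implementation: the force formulas (repulsion scaled by node degree and falling with distance, attraction growing with edge length), the constants, and the function name layout_step are assumptions chosen to keep the sketch short.

```python
import math


def layout_step(pos, edges, repulsion=1.0, attraction=0.1, step=0.1):
    """Run one force-directed iteration and return the new positions.

    pos maps node -> (x, y); edges is a list of (u, v) pairs.
    """
    degree = {node: 0 for node in pos}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    force = {node: [0.0, 0.0] for node in pos}
    nodes = list(pos)

    # Every pair repels; well-connected hubs push harder.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = repulsion * (degree[a] + 1) * (degree[b] + 1) / dist
            force[a][0] += f * dx / dist
            force[a][1] += f * dy / dist
            force[b][0] -= f * dx / dist
            force[b][1] -= f * dy / dist

    # Connected nodes attract, in proportion to the edge length.
    for u, v in edges:
        dx = pos[v][0] - pos[u][0]
        dy = pos[v][1] - pos[u][1]
        force[u][0] += attraction * dx
        force[u][1] += attraction * dy
        force[v][0] -= attraction * dx
        force[v][1] -= attraction * dy

    return {
        node: (pos[node][0] + step * force[node][0],
               pos[node][1] + step * force[node][1])
        for node in pos
    }


# Two linked nodes that start far apart are pulled closer.
linked = layout_step({"a": (0.0, 0.0), "b": (10.0, 0.0)}, [("a", "b")])
# Two unlinked nodes that start close together are pushed apart.
unlinked = layout_step({"a": (0.0, 0.0), "b": (1.0, 0.0)}, [])
print(linked, unlinked)
```

Repeating such steps many times lets clusters of connected words settle near each other while unrelated words drift apart, which is why the layout tends to agree visually with the detected communities.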
By default, InfraNodus uses n-grams to build the co-occurrence matrix, scanning the text using a window of 4 word-lemmas (4-grams).
The words that are located next to each other have the strongest connection (weight = 3), the words that are separated by 1 word have a weaker connection (weight = 2), and the words separated by 2 words have the weakest connection (weight = 1). If the same connection repeats several times, the weights are added, and the corresponding edge will be shown thicker on the graph, just like the edge between the
complete below. This is taken into account both in the community detection (topic modeling) algorithm and in the most influential words identification.
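The weighting scheme above can be sketched as follows. The function name cooccurrence_edges is hypothetical, the edge direction (earlier word to later word) and the decision to skip a word's connection to itself are assumptions, and the lemma list is an invented example rather than text from the speech.

```python
from collections import Counter


def cooccurrence_edges(lemmas, window=4):
    """Build weighted, directed co-occurrence edges from a list of lemmas.

    With window=4, adjacent words get weight 3, words separated by one word
    get weight 2, and words separated by two words get weight 1. Weights of
    repeated pairs are added together.
    """
    weights = Counter()
    for i, source in enumerate(lemmas):
        for gap in range(1, window):
            j = i + gap
            if j >= len(lemmas):
                break
            target = lemmas[j]
            if source != target:  # assumption: no self-loops
                weights[(source, target)] += window - gap
    return weights


lemmas = ["oath", "take", "regard", "oath", "take"]
edge_weights = cooccurrence_edges(lemmas)
for pair, weight in edge_weights.most_common():
    print(pair, weight)
```

Here the pair ("oath", "take") occurs twice as direct neighbours, so its weight is 3 + 3 = 6 and it would be drawn as the thickest edge. Passing a different window value mirrors changing the n in n-grams.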
You can change the n in n-grams in the Settings.