I “needed” to have a backup of my chat conversations, and of course I didn’t trust Facebook. After the disappointment with the available solutions and tools for such a task, I just decided to rely on the good old DIY technique.
Once I got a simple working scraper, I felt the need for at least a basic parser, in order to get a nicely readable format for my precious conversations.
I am sure you will agree with me when I say that having clean and tidy data is a waste if you don’t try to extract all the info and stats you can from it, so here I am.
The content of this article is mostly based on my conversation-analyzer Python project. There you can find specific info about the implementation, how to set it up, and how to actually run it on your conversations (the code can be used for any kind of conversation once the text content is properly parsed). On the other hand, what I will discuss here is a generic overview of various aspects and methods related to the task of conversation analysis.
In this scope, a conversation is simply a textual interaction between two or more participants (or senders).
Here I will include views and considerations belonging to different areas, from natural language processing and text analysis, to data modeling and visualization, as well as sociological interpretations — that is, personal and debatable speculations — of some analytical results. As the title suggests, it is just an introduction: it contains information that is probably obvious to many readers, but over time I will try to expand it with results from more specific areas, covering more complex techniques.
Overall, I hope to provide some insight or inspiration, and I more than welcome all kinds of comments, critiques, suggestions and — obviously — corrections.
Basic Length Stats
The first basic set of statistics for a conversation consists of length measures, like the total number of messages, the total length of all messages, and the average message length. These measurements can refer to the overall conversation or to a specific sender, and can be used as basic building blocks for more complex and interesting stats. For example, grouping by other parameters (like date or time) gives access to additional views, and constitutes the basis for sender-activity comparison.
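As a minimal sketch, these per-sender length stats can be computed with nothing but the standard library. The `(sender, text)` message shape and the field names below are illustrative, not the project’s actual data model:

```python
# Per-sender length stats: message count, total length, average length.
# The (sender, text) tuples are made-up sample data.
from collections import defaultdict

def length_stats(messages):
    """Return per-sender message count, total length, and average length."""
    stats = defaultdict(lambda: {"msg_count": 0, "total_len": 0})
    for sender, text in messages:
        stats[sender]["msg_count"] += 1
        stats[sender]["total_len"] += len(text)
    for s in stats.values():
        s["avg_len"] = s["total_len"] / s["msg_count"]
    return dict(stats)

messages = [
    ("alice", "hey!"),
    ("bob", "hello there"),
    ("alice", "how are you doing?"),
]
print(length_stats(messages))
```

From here, grouping by date or hour instead of (or in addition to) sender is just a change of dictionary key.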
Intuitively an important aspect of a conversation is its duration (or interval).
With start date being the datetime of the first message, and end date the datetime of the last one, the overall conversation duration can be simply defined as end date - start date. An interesting piece of info for this interval is the list of days on which no message from any participant has been sent. The ratio between the length of this list and the total number of conversation days constitutes the “density” of the conversation. From this, other minor and more specific stats can be derived, like the maximum number of consecutive days without messages, or the density distribution across different time frames.
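Given just a list of message datetimes (made up here for illustration), the duration and the day-based ratio described above can be sketched with the `datetime` module alone:

```python
# Conversation duration and the silent-days/total-days ratio.
# Timestamps are invented sample data.
from datetime import datetime

timestamps = [
    datetime(2016, 1, 1, 10, 0),
    datetime(2016, 1, 1, 22, 30),
    datetime(2016, 1, 4, 9, 15),
    datetime(2016, 1, 5, 18, 45),
]

start, end = min(timestamps), max(timestamps)
duration = end - start                                # overall duration

total_days = (end.date() - start.date()).days + 1     # 5 conversation days
active_days = {ts.date() for ts in timestamps}        # days with messages
silent_days = total_days - len(active_days)           # days with none
density_ratio = silent_days / total_days              # ratio defined above
```

The set of active days also makes it easy to derive the longest silent streak, by walking the calendar days between start and end.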
The density of a conversation can provide insight into the relationship between the participants. Higher density should imply a stronger relationship — in the analyzed digital context — especially if we consider a one-to-one conversation. On a second note, the amount of information shared each day should also be considered before jumping to conclusions.
Given the characterization of a conversation, we are dealing one way or another with time series, and as such a really useful and powerful operation is aggregation. By aggregation we refer to the operation of grouping a set of messages by a specific feature, and collapsing the resulting multiple values into a single one by means of a function (e.g. sum, average, count). Multiple values arise when messages share the same value for the feature we are grouping by.
Take for example the hour feature. We can aggregate all messages by it, summing together all multiple values (since an hour will most likely appear again across different days, months, years, etc.). By doing this, we can observe the message-length trend for the conversation and derive each sender’s hourly activity pattern.
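A stdlib-only sketch of this hour aggregation might look as follows (a pandas groupby would express the same in one line); the sample messages are invented:

```python
# Group message lengths by (sender, hour) and collapse with sum,
# yielding each sender's hourly activity pattern.
from collections import defaultdict
from datetime import datetime

msgs = [
    (datetime(2016, 1, 1, 9, 5), "alice", "morning!"),
    (datetime(2016, 1, 2, 9, 40), "alice", "hi again"),
    (datetime(2016, 1, 1, 23, 10), "bob", "good night"),
]

hourly_len = defaultdict(int)
for ts, sender, text in msgs:
    hourly_len[(sender, ts.hour)] += len(text)
```

Swapping `len(text)` for `1` gives message counts per hour instead of total length; swapping `sum` behavior for an average gives length per message.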
You can spot sender-specific routine patterns simply by looking at the resulting diagrams. Apart from the obvious sleeping cycles, you can see when you are more prone to write, when you write more per message, or when you simply write more messages. Moreover, if the graphs are observed over different periods, it is possible to recognize how habits have changed over time.
Lexical Stats
Lexical stats look at the conversation from a more linguistic point of view, considering its words and vocabulary. We consider a word (or token) to be a sequence of characters meaningful to us. The richness (or lexical diversity) of a conversation is then simply defined as the ratio between the distinct word count (the vocabulary) and the total word count.
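A naive sketch of this richness measure, computed as the usual type-token ratio (distinct words over total words), with a deliberately simple whitespace tokenizer — a real tokenizer would also handle punctuation:

```python
# Lexical richness as a type-token ratio: distinct tokens / total tokens.
def richness(text):
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

print(richness("the cat sat on the mat"))  # 5 distinct tokens out of 6
```

Computing this over sliding windows of fixed token count (rather than the whole text) makes the values comparable across periods, since the ratio naturally shrinks as a text grows.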
It is interesting to observe how the richness of a conversation changes over time. I would say that it’s more likely to observe a decrease, for various reasons. One is that we might end up narrowing down the discussed topics; another is that we might start adapting to the other participants, relying on an implicitly agreed-upon vocabulary.
It all starts by simply counting words: how many times each token appears in the conversation. We are not interested in the meaning or context of a word at this point.
Overall stats simply consider the counts for the conversation as a whole, with basic info like the top N words (the N words that occur most often in the conversation). We can then start considering one or a few specific words, grouping by sender, and aggregating word counts by features.
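Counting tokens and extracting the top N is a one-liner with `collections.Counter` (the tokens here are toy data):

```python
# Word counts and top-N extraction over a toy token list.
from collections import Counter

tokens = "a b a c a b".split()
counts = Counter(tokens)
top2 = counts.most_common(2)   # the two most frequent tokens with counts
```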
Instead of raw word counts, the term frequency is generally more common at this point. Simply consider the entire conversation as our corpus; all messages from each sender can then be grouped to form a sender-specific document of the corpus. The final result we are interested in is a word-frequency matrix, which tells how many times each word has been used by each sender.
However, if we want to measure the actual relevance of a specific word for a sender, we have to rely on a smarter technique: tf-idf, a measure often used in information retrieval (we will here consider only the basic variant). With w being our word, let’s define
- TF (term frequency) = (sender w-count) / (total w-count)
- IDF (inverse document frequency) = log(number of participants / number of participants who used w)
TF × IDF will give us a weighted value reflecting the importance of w for the considered sender.
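These two formulas translate directly into code. The tiny per-sender token lists below are made up, and the function name is mine, not the project’s API:

```python
# The article's tf-idf variant: one "document" per sender.
import math

docs = {                       # sender -> list of tokens (toy data)
    "alice": ["hi", "cat", "cat"],
    "bob":   ["hi", "dog"],
}

def tf_idf(word, sender, docs):
    sender_count = docs[sender].count(word)
    total_count = sum(doc.count(word) for doc in docs.values())
    tf = sender_count / total_count                  # sender w-count / total w-count
    used_by = sum(1 for doc in docs.values() if word in doc)
    idf = math.log(len(docs) / used_by)              # log(participants / users of w)
    return tf * idf
```

Note that a word used by every participant gets an IDF of log(1) = 0, so its tf-idf vanishes no matter how often one sender uses it; that is exactly the “relevance” filtering we are after.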
The usage trend of specific words can help to better understand sentimental aspects of a conversation. This understanding is strongly tied to the analyzed conversation, but as a naive hint, just try to speculate on what difference it would make if the word described in the following plot were a positive or a negative one in the conversation context. A proper treatment of formal emotion or sentiment analysis might be discussed in further blog entries.
While we’re here, let’s test how faithfully our conversation follows Zipf’s Law. In the context of language, this law relates the ranking and frequency of words: the most frequent word of a text occurs roughly twice as often as the second most frequent one, three times as often as the third one, and so on. More formally, the frequency of a word is proportional to the inverse of its rank: if you multiply this value by the number of occurrences of the most frequent word, you should get a good approximation of how many times the word occurs in your text.
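A quick sketch of the approximation just described, with made-up observed counts: the predicted count of the rank-r word is the rank-1 count divided by r.

```python
# Zipf prediction: count(rank r) ≈ count(rank 1) / r.
counts = [120, 62, 39, 31]      # invented observed counts, ranks 1..4
predicted = [counts[0] / rank for rank in range(1, len(counts) + 1)]
```

Plotting observed against predicted counts on a log-log scale is the usual way to eyeball how closely a real conversation follows the law.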
Emoticons
Nowadays most conversations are characterized by heavy usage of emoticons; already from the previous image we can observe that two of them (“:D”, “:P”) rank among the top 20 words by count.
If we can provide some sort of regular expression to recognize all the emoticons in the conversation, we can build a new set of stats based on their meaning and frequency.
It might be easier to start a sentiment analysis just from emoticons: no additional tools or techniques required. If we can divide our emoticon set into subsets, each associated with an emotion, then we can easily start deriving sentiment patterns by relating each emoticon’s frequency to the emotion subset it belongs to.
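A sketch of both steps together: a toy emoticon regex plus hand-made emotion subsets. Both the pattern and the subsets are purely illustrative, far from an exhaustive catalogue:

```python
# Count emoticons per emotion subset; pattern and subsets are toy examples.
import re
from collections import Counter

EMOTICON_RE = re.compile(r"[:;][-']?[)(DPp]")   # matches :) :( :D ;) :P etc.
EMOTIONS = {"joy": {":)", ":D", ";)"}, "sadness": {":("}}

def emotion_counts(text):
    counts = Counter()
    for emo in EMOTICON_RE.findall(text):
        for emotion, subset in EMOTIONS.items():
            if emo in subset:
                counts[emotion] += 1
    return counts

print(emotion_counts("great :D see you :) ... oh no :("))
```

Aggregating these per-emotion counts by day or hour, exactly as done earlier for message lengths, then gives rough sentiment patterns over time.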
And this concludes a first introduction to the project and related topics. I just gave the “write something and go public” approach a try. I would be more than glad to hear from you about specific aspects that should be better explained or analyzed in more detail. These can be anything, from data modeling and transformation to more “sociological analysis” of possible results. I will keep working on the project whenever I have time, so feel free to also comment on or contribute to the code and implementation.