Conversation Analyzer - An Introduction

In this scope, a conversation is simply a textual interaction between two or more participants (or senders).

Here I will include views and considerations from different areas, from natural language processing and text analysis to data modeling and visualization, as well as sociological interpretations (that is, debatable personal speculations) of some analytical results. As the title suggests, this is just an introduction: it contains information that is probably obvious to many readers, but over time I will try to extend it with results from more specific areas, covering more complex techniques.

Basic Length Stats

The first basic set of statistics for a conversation consists of length measures, such as the total number of messages, the total length of all messages, and the average message length. These measurements can cover the overall conversation or a specific sender, and can be used as basic building blocks for more complex and interesting stats. For example, grouping by other parameters (like date or time) gives access to additional views, and constitutes the basis for comparing sender activity.
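
As a minimal sketch of these measures, assuming a conversation is available as plain (sender, text) pairs and that length is measured in characters (both are illustrative assumptions, as are the names `messages` and `length_stats`):

```python
from collections import defaultdict

# Hypothetical sample data: (sender, text) pairs standing in for a real conversation.
messages = [
    ("alice", "Hi there"),
    ("bob", "Hello!"),
    ("alice", "How are you doing today?"),
]

def length_stats(msgs):
    """Return message count, total length (in characters) and average length."""
    lengths = [len(text) for _, text in msgs]
    total = sum(lengths)
    count = len(lengths)
    return {
        "num_messages": count,
        "total_length": total,
        "avg_length": total / count if count else 0.0,
    }

# Stats for the overall conversation.
overall = length_stats(messages)

# The same stats for each sender, obtained by grouping the messages first.
by_sender = defaultdict(list)
for sender, text in messages:
    by_sender[sender].append((sender, text))
per_sender = {sender: length_stats(msgs) for sender, msgs in by_sender.items()}
```

The same function can be reused on any other grouping (by day, by hour, and so on), which is what makes these stats convenient building blocks.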

A heatmap can provide an immediate view of a conversation feature through time. Here we can see the total length of messages over a 10-month period.

Interval Stats

Intuitively, an important aspect of a conversation is its duration (or interval).
With the start date defined as the datetime of the first message and the end date as the datetime of the last one, the overall conversation duration is simply end date - start date. An interesting piece of information for this interval is the list of days on which no participant sent a message. The ratio between the length of this list and the total number of conversation days constitutes the “density” of the conversation. From this, other more specific stats can be derived, like the maximum number of consecutive days without messages, or the density distribution across different time frames.
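
The interval stats above can be sketched as follows. The timestamps are made-up sample data, and counting whole calendar days with the first and last day included is an assumption on my side; the ratio follows the definition given in the text.

```python
from datetime import datetime, timedelta

# Hypothetical message timestamps; a real conversation would supply these.
timestamps = [
    datetime(2016, 1, 1, 9, 30),
    datetime(2016, 1, 1, 18, 5),
    datetime(2016, 1, 2, 11, 0),
    datetime(2016, 1, 6, 20, 15),
]

start, end = min(timestamps), max(timestamps)
duration = end - start  # overall duration, as a timedelta

# All calendar days spanned by the conversation, first and last included.
num_days = (end.date() - start.date()).days + 1
all_days = [start.date() + timedelta(days=i) for i in range(num_days)]

# Days on which nobody sent a message.
active_days = {ts.date() for ts in timestamps}
empty_days = [day for day in all_days if day not in active_days]

# Ratio as defined in the text: days without messages over total conversation days.
density = len(empty_days) / num_days

# Longest run of consecutive days without messages.
longest_gap = current = 0
for day in all_days:
    current = 0 if day in active_days else current + 1
    longest_gap = max(longest_gap, current)
```

Note that with this definition the value grows with the number of silent days, so a conversation with messages every day scores 0.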

Aggregation

Given this characterization of a conversation, we are dealing one way or another with time series, and as such a really useful and powerful operation is aggregation. By aggregation we mean grouping a set of messages by a specific feature and collapsing the resulting multiple values into a single one by means of a function (e.g. sum, average, count). Multiple values arise when several messages share the same value for the feature we are grouping by.
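
A small sketch of this group-then-collapse operation, using only the standard library. The `aggregate` helper and the (timestamp, sender, text) record layout are hypothetical, chosen just for illustration:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical (timestamp, sender, text) records.
messages = [
    (datetime(2016, 1, 3, 9, 0), "alice", "Morning"),
    (datetime(2016, 1, 20, 21, 0), "bob", "Good night"),
    (datetime(2016, 2, 5, 9, 30), "alice", "Hey"),
]

def aggregate(msgs, key, value, func):
    """Group messages by key(msg), then collapse value(msg) of each group with func."""
    groups = defaultdict(list)
    for msg in msgs:
        groups[key(msg)].append(value(msg))
    return {k: func(vals) for k, vals in groups.items()}

# Total message length per (year, month).
length_by_month = aggregate(
    messages,
    key=lambda m: (m[0].year, m[0].month),
    value=lambda m: len(m[2]),
    func=sum,
)

# Number of messages per hour of the day.
count_by_hour = aggregate(messages, key=lambda m: m[0].hour, value=lambda m: 1, func=sum)

# Average message length per sender.
avg_by_sender = aggregate(messages, key=lambda m: m[1], value=lambda m: len(m[2]), func=mean)
```

Changing only the `key` function switches the view (month, hour, sender), while `func` decides how the grouped values are collapsed.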

Lexical Stats

Lexical stats lean toward a linguistic view of the conversation, considering its words and vocabulary. We consider a word (or token) to be a sequence of characters that is meaningful to us. The richness (or lexical diversity) of a conversation is then simply defined as the ratio of distinct word count (or vocabulary) to total word count.
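
For illustration, here is a minimal sketch of lexical richness computed as distinct words over total words (the usual type-token ratio convention, which is my assumption about the intended direction of the ratio). The tokenizer is deliberately naive: it lowercases the text and keeps runs of word characters.

```python
import re

def tokenize(text):
    """Lowercase the text and extract runs of word characters as tokens (a simplification)."""
    return re.findall(r"\w+", text.lower())

def lexical_richness(texts):
    """Distinct word count divided by total word count; 0.0 for an empty conversation."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical sample: 7 tokens in total, 5 of them distinct.
sample = ["the cat sat", "The cat ran away"]
richness = lexical_richness(sample)
```

Keep in mind that this ratio tends to decrease as a conversation gets longer, so comparisons are more meaningful between samples of similar size.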

Lexical richness variation by year, aggregated by month

Words Frequency

It all starts with simply counting words: how many times each token appears in the conversation. We are not interested in the meaning or context of a word at this point.
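
Counting can be sketched with `collections.Counter` from the standard library. Splitting on whitespace after lowercasing is a simplifying assumption, and the sample texts are made up:

```python
from collections import Counter

# Hypothetical message texts.
texts = ["so good so far", "good morning"]

# Count every token across all messages, ignoring case.
counts = Counter(word for text in texts for word in text.lower().split())

# The most frequent tokens, as (word, count) pairs.
top = counts.most_common(2)
```

Because `Counter` objects can be added together, per-sender counts can later be merged into conversation-wide counts without recounting.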

Sender-specific frequency of three example words, aggregated by hour
  • TF (term frequency) = sender w-count/total w-count
  • IDF (inverse document frequency) = log(number of participants/number of participants who used w)
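
The two definitions above can be sketched as below. I am reading "total w-count" as the number of occurrences of w in the whole conversation; normalizing by the sender's total number of words instead would be the other common reading. Multiplying TF by IDF is the usual way of combining the two, and the data and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical per-sender word lists (already tokenized).
words_by_sender = {
    "alice": ["pizza", "tonight", "pizza", "ok"],
    "bob": ["ok", "movie", "tonight"],
    "carol": ["ok", "pizza"],
}

counts_by_sender = {s: Counter(ws) for s, ws in words_by_sender.items()}
total_counts = sum(counts_by_sender.values(), Counter())

def tf(word, sender):
    """Sender's count of `word` over its count in the whole conversation (assumed reading)."""
    total = total_counts[word]
    return counts_by_sender[sender][word] / total if total else 0.0

def idf(word):
    """log(number of participants / number of participants who used `word`)."""
    users = sum(1 for c in counts_by_sender.values() if c[word] > 0)
    return math.log(len(counts_by_sender) / users) if users else 0.0

def tf_idf(word, sender):
    """Usual combination: high when a word is typical of one sender and rare among the others."""
    return tf(word, sender) * idf(word)
```

A word used by every participant (like "ok" here) gets an IDF of 0, so it never stands out for any sender regardless of how often it is used.
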
Boxplot showing the usage-by-sender (word count) of a specific word over a 10-month period
(1) Frequency-count plot for the top 20 words (2) and for all words, on a log-log scale

Emoticon Stats

Nowadays most conversations are characterized by heavy use of emoticons; already from the previous image we can observe that two of them (“:D”, “:P”) rank among the top 20 words by count.
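
As a rough sketch, emoticons can be counted with a regular expression. The pattern below is a deliberately small assumption (eyes, optional nose, a handful of mouths) and not a complete emoticon inventory; it can also produce false positives in text such as "3:Pm".

```python
import re
from collections import Counter

# Tiny illustrative pattern: eyes, optional nose, mouth.
EMOTICON_RE = re.compile(r"[:;=]-?[DPp()]")

# Hypothetical message texts.
texts = ["That was great :D :D", "No way :P", "ok :-)"]

# Count every emoticon occurrence across all messages.
emoticon_counts = Counter(e for text in texts for e in EMOTICON_RE.findall(text))

# Share of messages containing at least one emoticon.
with_emoticon = sum(1 for text in texts if EMOTICON_RE.search(text))
emoticon_message_ratio = with_emoticon / len(texts)
```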

Conclusions

This concludes a first introduction to the project and related topics. I just wanted to give the “write something and go public” approach a try. I would be more than glad to hear from you about specific aspects that should be better explained or analyzed in more detail. These can be anything, from data modeling and transformation to a more “sociological analysis” of possible results. I will keep working on the project whenever I have time, so feel free to also comment or contribute to the code and implementation.

Data Scientist @ Zalando Dublin - Machine Learning, Computer Vision and Everything Generative ❤

Alex Martinelli
