One Day in Twitter: Topic Detection Via Joint Complexity

Abstract : In this paper we introduce a novel method for topic detection in Twitter based on the recent technique of Joint Complexity. Instead of relying on words, as most existing methods based on bag-of-words or n-gram techniques do, Joint Complexity relies on String Complexity, defined as the cardinality of the set of all distinct factors (contiguous subsequences of characters) of a given string. Each short sequence of text is decomposed in linear time into a memory-efficient structure called a Suffix Tree; by overlapping two trees, in linear or sublinear average time, we obtain the Joint Complexity, defined as the number of factors common to both trees. The method has been extensively tested on Markov sources of any order over a finite alphabet and gave good approximations for text generation and language discrimination. A key takeaway of this approach is that it is language-agnostic, since it can detect similarities between two texts in any loosely character-based language; there is therefore no need to build a specific dictionary or stemming method. The proposed method can also be used to capture a change of topic within a conversation, as well as the style of a specific writer in a text. In this paper we exploit a dataset collected through the Twitter streaming API over one full day, and we extract a significant number of topics for every timeslot.
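To make the two definitions in the abstract concrete, here is a minimal Python sketch of String Complexity and Joint Complexity. It is an illustration only, not the authors' implementation: it enumerates factors by brute force in quadratic time and memory rather than building suffix trees, it assumes that only non-empty factors are counted, and the function names (`factors`, `string_complexity`, `joint_complexity`) are hypothetical.

```python
def factors(text: str) -> set[str]:
    """Return the set of distinct non-empty factors (contiguous substrings) of text.

    Brute-force enumeration for illustration; the paper describes a
    suffix-tree construction instead. Excluding the empty factor is an
    assumption of this sketch.
    """
    n = len(text)
    return {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def string_complexity(text: str) -> int:
    """Number of distinct factors of a single string."""
    return len(factors(text))


def joint_complexity(a: str, b: str) -> int:
    """Number of distinct factors that occur in both strings."""
    return len(factors(a) & factors(b))


if __name__ == "__main__":
    # "abab" and "baba" share the factors a, b, ab, ba, aba, bab.
    print(string_complexity("abab"))         # 7
    print(joint_complexity("abab", "baba"))  # 6
```

Because the computation works on raw characters, the same functions apply unchanged to text in any character-based language, which is the language-agnostic property the abstract highlights. A higher joint complexity between two tweets suggests more shared content.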
Contributor : Dimitrios Milioris
Submitted on : Sunday, March 30, 2014 - 9:21:56 PM
Last modification on : Tuesday, May 14, 2019 - 10:15:08 AM


  • HAL Id : hal-00967776, version 1


Gérard Burnside, Dimitrios Milioris, Philippe Jacquet. One Day in Twitter: Topic Detection Via Joint Complexity. Snow Challenge, Second Workshop on Social News on the Web, WWW '14, Apr 2014, Seoul, South Korea. ⟨hal-00967776⟩


