🔦 Free topic extraction tool for documents and social media


Topic extraction - automate the analysis of key topics in your texts

This function automatically identifies the key topics in a text, an operation called topic extraction or topic modelling. It analyzes the text line by line and finds groups of words and expressions that tend to appear together, forming topics.
It works on texts written in a wide variety of languages, including those that do not use the Latin alphabet. The function follows the principles of unsupervised learning, a type of machine learning that needs no labelled examples.

If you use this function in an academic context (research or studies), you must reference it in your bibliography:

Benabdelkrim, M., Levallois, C., Savinien, J., & Robardet, C. (2020). Opening Fields: A Methodological Contribution to the Identification of Heterogeneous Actors in Unbounded Relational Orders. M@n@gement, 23(1), 4-18.

Innovative: control how "big" or "micro" the topics will be

The technology includes a "precision" parameter that controls whether a few big ("macro") topics are found, or many smaller ("micro") topics instead:

[Image: a slider for the precision parameter]

Import your data, get the results: an example

A wide variety of options are available to import the texts to analyze:

[Image: list of options to import texts]

[Image: how to import a text and do topic extraction on it]

Model (short description)

The function identifies pairs of terms that appear together on the same line of the text. These pairs are called cooccurrences. By aggregating all pairs of terms, a network of terms is constructed. The network is then cut into subregions, and each subregion corresponds to a topic.
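
As an illustration of the counting step only, here is a minimal Python sketch. It assumes that terms are simply whitespace-separated words; the function name `count_cooccurrences` and the sample lines are made up for the example and are not part of the tool.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(lines):
    """Count how often each unordered pair of terms shares a line."""
    pair_counts = Counter()
    for line in lines:
        # A set keeps each term once per line; sorting makes (a, b) and (b, a) the same key.
        terms = sorted(set(line.lower().split()))
        pair_counts.update(combinations(terms, 2))
    return pair_counts

# Hypothetical sample text, one short paragraph per line.
lines = [
    "solar panels cut energy bills",
    "energy bills keep rising",
    "solar panels need sunlight",
]
pairs = count_cooccurrences(lines)
print(pairs[("panels", "solar")])  # number of lines containing both terms
print(pairs[("bills", "energy")])
```

Each counted pair can then be read as a weighted link between two terms, which is what turns the list of pairs into a network.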

Model - long description: from cooccurrences to network clustering

The principles followed by the tool are described in this academic publication studying how to find communities and topics on Twitter. The technology follows these steps:

  1. cleaning of the text: flattening to ASCII, removal of URLs, removal of punctuation.
  2. lemmatization.
  3. decomposition of the text into n-grams up to four-grams, and removal of the less relevant n-grams. This step is identical to the one followed by the function for sentiment analysis.
  4. count of cooccurrences: which pairs of n-grams tend to appear frequently in the same lines of the text?
  5. the list of cooccurring n-grams is used to create a network made of the most frequent n-grams. Two n-grams are connected if they frequently cooccur.
  6. a community detection algorithm is applied to the network: the Louvain algorithm, which is fast and very effective.
  7. the parameter chosen by the user is applied: a large value will detect a few big communities. A small value will detect many small communities.
  8. each community (or cluster) in the network is a topic. The key terms of the topic are the n-grams contained in the cluster.
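
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: lemmatization (step 2) and the removal of less relevant n-grams are skipped, and connected components are used as a deliberately simpler stand-in for the Louvain algorithm. All function names and the `min_count` threshold are hypothetical.

```python
import re
import unicodedata
from collections import Counter
from itertools import combinations

def clean(line):
    """Step 1 (simplified): drop URLs, flatten to ASCII, drop punctuation."""
    line = re.sub(r"https?://\S+", " ", line)
    # Note: characters with no ASCII equivalent are simply discarded here.
    line = unicodedata.normalize("NFKD", line).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^\w\s]", " ", line).lower()

def ngrams(tokens, max_n=4):
    """Step 3 (simplified): every n-gram from unigrams up to four-grams."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def build_network(lines, min_count=2):
    """Steps 4-5: link two n-grams if they share at least min_count lines."""
    counts = Counter()
    for line in lines:
        grams = sorted(ngrams(clean(line).split()))
        counts.update(combinations(grams, 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}

def connected_components(edges):
    """Stand-in for step 6 (NOT Louvain): group n-grams reachable from one another."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, topics = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        topics.append(sorted(component))  # step 8: one cluster = one topic
    return topics

# Hypothetical sample text, one short paragraph per line.
lines = [
    "Solar panels cut energy bills! https://example.com/a",
    "Energy bills keep rising.",
    "Solar panels need sunlight.",
]
topics = connected_components(build_network(lines, min_count=2))
for topic in topics:
    print(topic)
```

In this stand-in, raising `min_count` keeps only the strongest links and so tends to split the network into more, smaller groups. That is only a crude analogue of the precision parameter, not a description of how the tool applies it.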

Tips and tricks for effective results in topic detection

Structure and format of the text: Topic extraction works by detecting pairs of terms which appear on the same line of text. So you should be careful about how your text is formatted. Ideally, it should be made of relatively short paragraphs, each on one line. If you are using an Excel file, each paragraph or significant block of text should appear on a different row.
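
If your text is hard-wrapped, with paragraphs spread over several lines, a small script can put each paragraph back on one line before import. The sketch below assumes that paragraphs are separated by blank lines; the function name `one_paragraph_per_line` is made up for the example.

```python
import re

def one_paragraph_per_line(text):
    """Join hard-wrapped lines so that each paragraph sits on a single line."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse line breaks and repeated spaces inside each paragraph.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]

raw = """Solar panels cut
energy bills.

Energy bills
keep rising."""

paragraphs = one_paragraph_per_line(raw)
print("\n".join(paragraphs))
```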

Volume of text: topics are found by measuring frequencies: which pairs of terms tend to co-occur? For this to work, the text should be long enough for these counts to be meaningful. The longer the text, the better. Texts of at least 5,000 words seem a good start.

How to define the number of topics to be found? Is it a good thing?

The most classic approach for topic detection is based on a clustering technique called "k-means". With it, the user decides how many topics should be found in the text, and then the algorithm finds these topics.
This approach can make sense when we know in advance how many topics there are in the text. But what is the point of topic detection if we know the topics already? 🤔
In Nocode functions, the number of topics to be found is not predetermined. The analyst will learn a lot by discovering how many topics the algorithm can find in the text, without a preset limit. The analyst remains fully in control thanks to the precision parameter, which helps tune the algorithm to find more or fewer topics, but always with a degree of freedom on the exact number.


© 2022 Nocode functions by Clement Levallois