
How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

In the latest issue of the journal Computational Communication Research, Dr. Gregor Wiedemann and three co-authors examine how topic modeling can be simplified. Topic modeling is an increasingly popular method for analyzing texts in the social sciences, but for very large text collections, computing topic models can quickly become too time-consuming. The article therefore addresses the question of whether and how topic models can be computed reliably, and without loss of quality, on smaller samples that stand in for the full corpus.
 
Download the article (PDF)

Abstract
Topic modeling enables researchers to explore large document corpora. However, accurate model specification requires calculating multiple models, which can become infeasibly costly in terms of time and computing resources. To circumvent this problem, we test and propose a strategy that introduces two easy-to-implement modifications to the modeling process: instead of modeling the full corpus with the whole vocabulary, we (1) use random document samples and (2) an extensively pruned vocabulary. Using three empirical corpora with different origins and characteristics (news articles, websites, and tweets), we investigate how different sample sizes and pruning strategies affect the resulting topic models compared with fully modeled corpora. Our test provides evidence that sampling and pruning are cheap and viable strategies for accelerating model specification. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (usually >10%). Moreover, extensive pruning does not compromise the quality of the resulting topics. Altogether, pruning and sample-based modeling increase performance without impairing model quality.
 
Maier, D.; Niekler, A.; Wiedemann, G.; Stoltenberg, D. (2020): How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models. In: Computational Communication Research, 2(2), pp. 139–152. https://doi.org/10.5117/CCR2020.2.001.MAIE
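The two modifications described in the abstract can be illustrated in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names and the sampling fraction and pruning thresholds (min_df, max_df) are illustrative assumptions, and the pruned vocabulary would subsequently be fed into whatever topic-modeling library is in use.

```python
import random
from collections import Counter

def sample_corpus(docs, fraction=0.1, seed=42):
    """Draw a random document sample; the paper finds that samples
    above roughly 10% of the corpus usually suffice."""
    rng = random.Random(seed)
    k = max(1, int(len(docs) * fraction))
    return rng.sample(docs, k)

def prune_vocabulary(docs, min_df=2, max_df=0.5):
    """Keep only terms that occur in at least min_df documents and in
    at most a max_df fraction of all documents (illustrative thresholds)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    n = len(docs)
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df}

# Toy corpus of pre-tokenized documents
docs = [
    ["topic", "model", "news"],
    ["topic", "model", "web"],
    ["topic", "tweet", "web"],
    ["topic", "tweet", "news"],
]
vocab = prune_vocabulary(docs, min_df=2, max_df=0.5)
# "topic" appears in 4/4 documents, exceeds max_df, and is pruned as an
# overly frequent term; every remaining term appears in exactly 2 documents.
print(sorted(vocab))  # ['model', 'news', 'tweet', 'web']
```

Document-frequency pruning of this kind removes both near-ubiquitous terms, which carry little topical signal, and very rare terms, which inflate the vocabulary without supporting stable topics.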
 


Publication details

Year of publication

2020
