
How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

In the latest issue of the journal Computational Communication Research, Dr. Gregor Wiedemann and three co-authors examine how topic modeling can be simplified. Topic modeling is an increasingly popular method for analyzing texts in the social sciences, but for very large text collections, computing topic models can quickly become too time-consuming. The article therefore addresses the question of whether and how topic models can be computed reliably, and without loss of quality, on smaller samples that stand in for the full corpus.
 
Download the article (PDF)

Abstract
Topic modeling enables researchers to explore large document corpora. However, accurate model specification requires calculating multiple models, which can become infeasibly costly in terms of time and computing resources. To circumvent this problem, we test and propose a strategy that introduces two easy-to-implement modifications to the modeling process: instead of modeling the full corpus with the whole vocabulary, we (1) use random document samples and (2) an extensively pruned vocabulary. Using three empirical corpora with different origins and characteristics (news articles, websites, and tweets), we investigate how different sample sizes and pruning strategies affect the resulting topic models compared with fully modeled corpora. Our test provides evidence that sampling and pruning are cheap and viable strategies for accelerating model specification. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (usually >10%). Moreover, extensive pruning does not compromise the quality of the resulting topics. Altogether, pruning and sample-based modeling increase performance without impairing model quality.
 
Maier, D.; Niekler, A.; Wiedemann, G.; Stoltenberg, D. (2020): How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models. In: Computational Communication Research, 2(2), pp. 139–152. https://doi.org/10.5117/CCR2020.2.001.MAIE
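The two modifications described in the abstract can be illustrated in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names and the sampling fraction and pruning thresholds (min_df, max_df) are illustrative assumptions, and the pruned vocabulary would subsequently be fed into whatever topic-modeling library is in use.

```python
import random
from collections import Counter

def sample_corpus(docs, fraction=0.1, seed=42):
    """Draw a random document sample; the paper finds that samples
    above roughly 10% of the corpus usually suffice."""
    rng = random.Random(seed)
    k = max(1, int(len(docs) * fraction))
    return rng.sample(docs, k)

def prune_vocabulary(docs, min_df=2, max_df=0.5):
    """Keep only terms that occur in at least min_df documents and in
    at most a max_df fraction of all documents (illustrative thresholds)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    n = len(docs)
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df}

# Toy corpus of pre-tokenized documents
docs = [
    ["topic", "model", "news"],
    ["topic", "model", "web"],
    ["topic", "tweet", "web"],
    ["topic", "tweet", "news"],
]
vocab = prune_vocabulary(docs, min_df=2, max_df=0.5)
# "topic" appears in 4/4 documents, exceeds max_df, and is pruned as an
# overly frequent term; every remaining term appears in exactly 2 documents.
print(sorted(vocab))  # ['model', 'news', 'tweet', 'web']
```

Document-frequency pruning of this kind removes both near-ubiquitous terms, which carry little topical signal, and very rare terms, which inflate the vocabulary without supporting stable topics.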
 


Publication details

Year of publication

2020
