Section 5 Data preparation

The texts of the posts were prepared for the analysis following the recommendations of the methodological literature on topic models (Schofield and Mimno 2016; Maier et al. 2018). All words were converted to lower case. German stopwords (frequent words that do not differentiate between texts), punctuation, and links were removed. Relevant bi- and tri-grams (meaningful combinations of two or three words) were added to the term list. Terms that occurred in fewer than 0.5% or in more than 99% of all posts were pruned from the corpus. The final dataset for the analysis consisted of 98,505 documents and 1,016 terms.
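The pruning rule can be illustrated with a small base R sketch. The toy matrix `toy_dtm` and the helper `prune_by_docfreq` are hypothetical names introduced only for illustration; the two thresholds (0.5% and 99%) are the only values taken from the text, and treating both bounds as inclusive is an assumption.

```r
# Toy document-term matrix: rows are documents, columns are terms (counts).
# The data are invented for illustration only.
toy_dtm <- matrix(
  c(1, 0, 2,
    1, 0, 0,
    3, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("common", "rare", "medium"))
)

# Hypothetical helper: keep terms whose share of documents lies
# within [min_prop, max_prop] (inclusive bounds are an assumption).
prune_by_docfreq <- function(dtm, min_prop = 0.005, max_prop = 0.99) {
  doc_share <- colSums(dtm > 0) / nrow(dtm)  # share of documents containing each term
  dtm[, doc_share >= min_prop & doc_share <= max_prop, drop = FALSE]
}

pruned <- prune_by_docfreq(toy_dtm)
colnames(pruned)
```

In this toy example the term "common" occurs in every document and is therefore pruned by the upper threshold. With only four documents the lower threshold of 0.5% cannot remove anything; it only takes effect in corpora with more than 200 documents.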

The final document-term-matrix is available in the OSF repository.

## Document-feature matrix of: 98,588 documents, 1,016 features (98.2% sparse).
## List of 3
##  $ documents:List of 98505
##  $ vocab    : chr [1:1016] "1" "10" "100" "11" ...
##  $ meta     :'data.frame':   98505 obs. of  3 variables:
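
The two printouts above have the shape of a quanteda document-feature matrix and of the list format expected by the stm package (`documents`, `vocab`, `meta`). The following is a minimal sketch of such a conversion on invented data, assuming a recent quanteda version. By default, quanteda's `convert()` omits documents that contain no remaining features. Whether this accounts for the different document counts in the printouts above is not confirmed by the source.

```r
library(quanteda)

# Invented mini-corpus; "doc2" contains only a term that the trimming below removes.
toks <- tokens(c(doc1 = "alpha beta", doc2 = "gamma", doc3 = "alpha beta"))
toy_dfm <- dfm(toks)
toy_dfm <- dfm_trim(toy_dfm, min_docfreq = 2)  # drops "gamma", leaving doc2 empty

# Convert to the list format used by stm; empty documents are omitted
# (quanteda issues a warning about this, suppressed here for brevity).
stm_input <- suppressWarnings(convert(toy_dfm, to = "stm"))

ndoc(toy_dfm)                  # number of documents in the dfm
length(stm_input$documents)    # number of documents after conversion
str(stm_input, max.level = 1)  # overview of the list structure
```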

Custom stopword list:

custom_stopwords = c("ab", "aber", "ach", "all", "alle", "allem", "allen", "aller", 
    "alles", "als", "also", "am", "an", "andere", "anderen", "anderes", "anders", 
    "auch", "auf", "aufs", "aus", "bei", "beim", "bin", "bis", "bist", "bzw", 
    "da", "dabei", "dadurch", "daher", "dahin", "damit", "dann", "das", "dass", 
    "daß", "dazu", "dein", "deine", "deinem", "deinen", "deiner", "dem", "den", 
    "denen", "denn", "dennoch", "der", "deren", "des", "deshalb", "deswegen", 
    "dich", "die", "dies", "diese", "diesem", "diesen", "dieser", "dieses", 
    "dir", "doch", "dort", "dran", "drauf", "drin", "drüber", "du", "durch", 
    "durchaus", "eh", "ein", "eine", "einem", "einen", "einer", "eines", "einige", 
    "einigen", "einiges", "einmal", "er", "es", "etc", "etwas", "euch", "euer", 
    "eure", "euren", "für", "fürs", "gegen", "gehabt", "getan", "gewesen", 
    "geworden", "hab", "habe", "haben", "habt", "halt", "hast", "hat", "hatte", 
    "hätte", "hatten", "hätten", "her", "hier", "hin", "hinter", "ich", "ihm", 
    "ihn", "ihnen", "ihr", "ihre", "ihrem", "ihren", "ihrer", "im", "in", "ins", 
    "is", "ist", "ja", "je", "jede", "jedem", "jeden", "jeder", "jedes", "jetzt", 
    "kann", "kannst", "kein", "keine", "keinem", "keinen", "keiner", "können", 
    "könnt", "konnte", "könnte", "könnten", "mach", "mache", "machen", "machst", 
    "macht", "mal", "man", "manche", "mein", "meine", "meinem", "meinen", "meiner", 
    "meines", "mich", "mir", "mit", "muss", "müssen", "musst", "musste", "müsste", 
    "mussten", "na", "nach", "nachdem", "naja", "ne", "nein", "nem", "nen", 
    "ner", "nicht", "nichts", "nix", "noch", "nun", "nur", "ob", "oder", "ohne", 
    "ok", "okay", "raus", "rein", "rum", "schon", "sehr", "sei", "seid", "sein", 
    "seine", "seinem", "seinen", "seiner", "selber", "selbst", "sich", "sie", 
    "sind", "so", "solche", "solchen", "soll", "sollen", "sollte", "sollten", 
    "solltest", "somit", "sondern", "sonst", "sowas", "soweit", "tun", "tut", 
    "über", "um", "und", "uns", "unser", "unsere", "unserem", "unseren", "unserer", 
    "unter", "usw", "viel", "viele", "vielen", "vieles", "vom", "von", "vor", 
    "war", "wäre", "waren", "wären", "wars", "was", "weder", "weg", "wegen", 
    "weil", "weiter", "weitere", "welche", "welchen", "welcher", "welches", 
    "wenn", "wenns", "wer", "werd", "werde", "werden", "wie", "wieder", "wieso", 
    "will", "willst", "wir", "wird", "wirst", "wo", "wobei", "wollen", "wollte", 
    "wollten", "worden", "wurde", "würde", "wurden", "würden", "z.b", "zb", 
    "zu", "zum", "zur", "zwar", "zwischen")

Relevant bi- and tri-grams:

The document-feature matrix was prepared with quanteda (Benoit et al. 2019). Note that we did not reduce the words to their stems or lemmata (Schofield and Mimno 2016).
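
The preparation steps described in this section could be sketched with quanteda roughly as follows. This is a hedged illustration under stated assumptions, not the authors' script. The example posts, the shortened stopword vector `stopwords_short`, and the multi-word expressions in `ngrams` are all invented. The order of the steps is also an assumption, as is the use of `tokens_compound()`, which replaces the component words with a joined token and may differ from how the n-grams were actually added to the term list.

```r
library(quanteda)

# Invented example posts; the real corpus is not reproduced here.
posts <- c(
  p1 = "Das ist ein Test, mehr Infos unter https://example.org/seite",
  p2 = "Die Deutsche Bahn und der Test.",
  p3 = "Noch ein Test zur Deutschen Bahn"
)

# Shortened stand-in for the custom stopword list shown above.
stopwords_short <- c("das", "ist", "ein", "die", "und", "der",
                     "noch", "zur", "mehr", "unter")

# Hypothetical multi-word expressions (not the authors' list).
ngrams <- c("deutsche bahn", "deutschen bahn")

# Tokenize, dropping punctuation and links, then lower-case.
toks <- tokens(posts, remove_punct = TRUE, remove_url = TRUE)
toks <- tokens_tolower(toks)

# Join the multi-word expressions into single tokens, then remove stopwords.
toks <- tokens_compound(toks, pattern = phrase(ngrams))
toks <- tokens_remove(toks, pattern = stopwords_short)

# Build the document-feature matrix and prune by document share;
# no stemming or lemmatization is applied.
dfm_posts <- dfm(toks)
dfm_posts <- dfm_trim(dfm_posts, min_docfreq = 0.005, max_docfreq = 0.99,
                      docfreq_type = "prop")
featnames(dfm_posts)
```

In this toy corpus the word "test" appears in every post, so the upper threshold removes it, while the joined tokens such as `deutsche_bahn` remain as single features.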

References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, Akitaka Matsuo, et al. 2019. Quanteda: Quantitative Analysis of Textual Data. https://CRAN.R-project.org/package=quanteda.

Maier, Daniel, Annie Waldherr, Peter Miltner, Gregor Wiedemann, Andreas Niekler, Alexa Keinert, Barbara Pfetsch, et al. 2018. “Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology.” Communication Methods and Measures 12 (2-3): 93–118. https://doi.org/10.1080/19312458.2018.1430754.

Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.