Section 5: Data preparation
The texts of the posts were prepared for the analysis following the recommendations of the methodological literature on topic models (Schofield and Mimno 2016; Maier et al. 2018): All words were set to lower case. German stopwords (frequent words which do not differentiate between texts), punctuation, and links were removed. Relevant bi- and tri-grams (meaningful combinations of two or three words) were added to the term list. Terms which occurred in less than 0.5% or in more than 99% of all posts were pruned from the corpus. The final dataset for the analysis consisted of 98,505 documents and 1,016 terms. (The document-feature matrix shown below still lists 98,588 documents; posts left without any remaining terms after pruning are dropped when the matrix is converted to the stm format, which yields the 98,505 documents used in the analysis.)
The final document-feature matrix is available in the OSF repository.
## Document-feature matrix of: 98,588 documents, 1,016 features (98.2% sparse).
## List of 3
## $ documents:List of 98505
## $ vocab : chr [1:1016] "1" "10" "100" "11" ...
## $ meta :'data.frame': 98505 obs. of 3 variables:
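The three elements of this list (documents, vocab, meta) correspond directly to the inputs of stm::stm(). A minimal sketch of how they would be passed to the model (K = 20 is only a placeholder here, not the number of topics used in the analysis):
library(stm)
# Sketch only: the three list elements map onto the arguments of stm::stm().
# K = 20 is a placeholder, not the number of topics chosen in the analysis.
impf_model = stm(documents = impf_stm$documents,  # word indices and counts per post
    vocab = impf_stm$vocab,                       # the 1,016 terms
    data  = impf_stm$meta,                        # document-level metadata (3 variables)
    K     = 20)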
Custom stopword list:
custom_stopwords = c("ab", "aber", "ach", "all", "alle", "allem", "allen", "aller",
"alles", "als", "also", "am", "an", "andere", "anderen", "anderes", "anders",
"auch", "auf", "aufs", "aus", "bei", "beim", "bin", "bis", "bist", "bzw",
"da", "dabei", "dadurch", "daher", "dahin", "damit", "dann", "das", "dass",
"daß", "dazu", "dein", "deine", "deinem", "deinen", "deiner", "dem", "den",
"denen", "denn", "dennoch", "der", "deren", "des", "deshalb", "deswegen",
"dich", "die", "dies", "diese", "diesem", "diesen", "dieser", "dieses",
"dir", "doch", "dort", "dran", "drauf", "drin", "drüber", "du", "durch",
"durchaus", "eh", "ein", "eine", "einem", "einen", "einer", "eines", "einige",
"einigen", "einiges", "einmal", "er", "es", "etc", "etwas", "euch", "euer",
"eure", "euren", "für", "fürs", "gegen", "gehabt", "getan", "gewesen",
"geworden", "hab", "habe", "haben", "habt", "halt", "hast", "hat", "hatte",
"hätte", "hatten", "hätten", "her", "hier", "hin", "hinter", "ich", "ihm",
"ihn", "ihnen", "ihr", "ihre", "ihrem", "ihren", "ihrer", "im", "in", "ins",
"is", "ist", "ja", "je", "jede", "jedem", "jeden", "jeder", "jedes", "jetzt",
"kann", "kannst", "kein", "keine", "keinem", "keinen", "keiner", "können",
"könnt", "konnte", "könnte", "könnten", "mach", "mache", "machen", "machst",
"macht", "mal", "man", "manche", "mein", "meine", "meinem", "meinen", "meiner",
"meines", "mich", "mir", "mit", "muss", "müssen", "musst", "musste", "müsste",
"mussten", "na", "nach", "nachdem", "naja", "ne", "nein", "nem", "nen",
"ner", "nicht", "nichts", "nix", "noch", "nun", "nur", "ob", "oder", "ohne",
"ok", "okay", "raus", "rein", "rum", "schon", "sehr", "sei", "seid", "sein",
"seine", "seinem", "seinen", "seiner", "selber", "selbst", "sich", "sie",
"sind", "so", "solche", "solchen", "soll", "sollen", "sollte", "sollten",
"solltest", "somit", "sondern", "sonst", "sowas", "soweit", "tun", "tut",
"über", "um", "und", "uns", "unser", "unsere", "unserem", "unseren", "unserer",
"unter", "usw", "viel", "viele", "vielen", "vieles", "vom", "von", "vor",
"war", "wäre", "waren", "wären", "wars", "was", "weder", "weg", "wegen",
"weil", "weiter", "weitere", "welche", "welchen", "welcher", "welches",
"wenn", "wenns", "wer", "werd", "werde", "werden", "wie", "wieder", "wieso",
"will", "willst", "wir", "wird", "wirst", "wo", "wobei", "wollen", "wollte",
"wollten", "worden", "wurde", "würde", "wurden", "würden", "z.b", "zb",
"zu", "zum", "zur", "zwar", "zwischen")Relevant bi- and tri-grams:
Relevant bi- and tri-grams:
relevant_ngrams = dictionary(list(trotz_impfung = "trotz impfung", grippe_impfen = "grippe impfen",
mmr_impfung = "mmr impfung", hepatitis_b = "hepatitis b", gut_vertragen = "gut vertragen",
`6fach_impfung` = "6fach impfung", `6_fach` = "6 fach", `6_fach_impfung` = "6 fach impfung",
meningokokken_b = "meningokokken b", gute_besserung = "gute besserung",
`6-fach_impfung` = "6-fach impfung", erhöhte_temperatur = "erhöhte temperatur",
kein_fieber = "kein fieber", kein_problem = "kein problem", keine_ahnung = "keine ahnung",
keine_impfung = "keine impfung", nicht_geimpft = "nicht geimpft", nicht_impfen = "nicht impfen",
nicht_zu_impfen = "nicht zu impfen", selbst_entscheiden = "selbst entscheiden"))Preparation of the the document-feature-matrix with quanteda (Benoit et al. 2019). Note that we did not reduce the words to their stems or lemmata (Schofield and Mimno 2016).
# create document-feature-matrix from corpus
impf_dfm = crps %>% dfm(stem = FALSE, tolower = TRUE, remove_twitter = FALSE,
remove_punct = TRUE, remove = custom_stopwords, remove_url = TRUE, verbose = TRUE,
thesaurus = relevant_ngrams)
# Prune corpus
impf_dfm = impf_dfm %>% dfm_trim(max_docfreq = 0.99, min_docfreq = 0.005, docfreq_type = "prop")
# Convert matrix to stm format
impf_stm = impf_dfm %>% quanteda::convert(to = "stm")References
References
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, Akitaka Matsuo, et al. 2019. Quanteda: Quantitative Analysis of Textual Data. https://CRAN.R-project.org/package=quanteda.
Maier, Daniel, A. Waldherr, P. Miltner, G. Wiedemann, A. Niekler, A. Keinert, B. Pfetsch, et al. 2018. “Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology.” Communication Methods and Measures 12 (2-3): 93–118. https://doi.org/10.1080/19312458.2018.1430754.
Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.