Section 3 Data collection

The data were collected using custom search and scraping scripts for each online community. The general approach was similar across all four communities: a first script used the community's search function to collect the links to all threads containing the term impf (roughly equivalent to the search term “vaccin” in English texts) in at least one post; a second script visited these links, scraped the relevant information, and saved it in a structured format. The following sections present excerpts from these scripts. Please note that the data collection is not perfectly reproducible, because both the content and the structure of the online communities may change over time. We therefore did not aim to provide a fully reproducible script, but document our basic scraping functions.

3.1 Urbia

3.1.2 Collection

Urbia requires a login to access older posts. Username and password have to be set before scraping (inspired by this source). Note that we used the mobile version of the website (m.urbia.de).
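A minimal sketch of such a login step with the rvest API of that era is shown below. The login URL and the form field names `username`/`password` are assumptions for illustration, not Urbia's actual markup.

```r
# Hypothetical sketch of the login step, assuming rvest (pre-1.0 API) and a
# standard HTML login form. URL and field names are assumptions.
library(rvest)

urbia_login = function(username, password) {
    session = html_session("https://m.urbia.de/login")  # assumed login URL
    form = html_form(session)[[1]]                      # assumes the first form is the login form
    form = set_values(form, username = username, password = password)
    submit_form(session, form)                          # returns a logged-in session
}
```

The returned session can then be passed to the subsequent scraping requests so that older posts are accessible.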

Function to collect information on a thread.

Function to collect information on a post (called from within the thread function).
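The post-level function can be sketched as follows. The selectors `.author`, `.date`, and `.text` are invented placeholders; Urbia's actual class names differ.

```r
# Hypothetical sketch of a post-level extractor, assuming rvest and tibble.
# The CSS selectors are placeholders, not Urbia's actual markup.
library(rvest)
library(tibble)

get_post = function(post_node) {
    tibble(
        author   = post_node %>% html_node(".author") %>% html_text(trim = TRUE),
        postdate = post_node %>% html_node(".date")   %>% html_text(trim = TRUE),
        text     = post_node %>% html_node(".text")   %>% html_text(trim = TRUE)
    )
}
```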

3.2 Rund ums Baby

3.2.2 Collection

Function to collect all information given the link to the thread.

get_thread = function(link, wait_time = 0, verbose = FALSE) {
    
    if (verbose) 
        print(link)
    
    # Politeness delay between requests
    Sys.sleep(wait_time)
    
    txt = read_html(paste0("https://www.rund-ums-baby.de", link))
    
    subforum = txt %>% html_nodes(".forum_titel") %>% html_text()
    
    thread_title = txt %>% html_nodes("#content") %>% html_nodes("h1") %>% html_text()
    
    # The first post and the replies use different CSS classes
    first_post = txt %>% html_nodes("#content") %>% html_nodes("p.beitrag") %>% 
        html_text()
    
    later_posts = txt %>% html_nodes("#content") %>% html_nodes("p.antwort") %>% 
        html_text()
    
    posts = c(first_post, later_posts)
    
    first_author = txt %>% html_nodes("#content") %>% html_nodes("td:contains(von)") %>% 
        html_nodes("b") %>% html_text()
    
    later_authors = txt %>% html_nodes(".antwort_von_name") %>% html_text()
    
    author = c(first_author, later_authors)
    
    # The date of the first post is embedded in a "Geschrieben von ... am ..." line
    first_date = txt %>% html_nodes("#content") %>% html_nodes("td:contains(von)") %>% 
        html_text(trim = TRUE) %>% str_remove(fixed(first_author)) %>% str_remove("von") %>% 
        str_remove("am") %>% str_remove("Geschrieben") %>% str_squish()
    
    # Strip the "Antwort von <author> am " prefix from the reply dates
    later_dates = txt %>% html_nodes("p.antwort_von") %>% html_text()
    for (i in seq_along(later_dates)) {
        later_dates[i] = str_remove(later_dates[i], fixed(paste0("Antwort von ", 
            later_authors[i], " am ")))
    }
    
    postdate = c(first_date, later_dates)
    
    # Number of replies; NA if the element is missing (e.g., a thread without replies)
    thread_replies = txt %>% html_nodes(".anzahl_antworten") %>% html_text() %>% 
        str_remove(" Antwort:") %>% str_remove(" Antworten:") %>% as.integer()
    if (length(thread_replies) == 0) 
        thread_replies = NA_integer_
    
    tibble(posts = posts, postdate = postdate, author = author, thread_replies = thread_replies, 
        thread_title = thread_title, subforum = subforum, link = link) %>% mutate(postnumber = 1:n())
}

3.3 Netmoms

3.3.1 Search

The search results on Netmoms are presented with fixed pagination. Note that we accessed the static version of the website (static.netmoms.de).

The function extracts the links to the threads.
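A minimal sketch of such a link-extraction function is given below. The selector `a.thread-link` is an assumption for illustration, not Netmoms' actual markup.

```r
# Hypothetical sketch of extracting thread links from a result page,
# assuming rvest. The CSS selector is a placeholder, not Netmoms' markup.
library(rvest)

get_links = function(page) {
    read_html(page) %>%
        html_nodes("a.thread-link") %>%   # assumed selector for thread links
        html_attr("href")
}
```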

We apply the function to the first page and all subsequent pages and write all links to a text file.
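The paging loop can be sketched as follows. Here `get_links_for_page()` is a hypothetical stand-in for the real scraper (it fabricates links so the sketch runs without network access); the output filename is likewise an assumption.

```r
# Sketch of collecting links across result pages and writing them to a text
# file. get_links_for_page() is a placeholder for the real scraping function.
get_links_for_page = function(page) paste0("/forum/thread-", page, "-", 1:3)

collect_links = function(n_pages, outfile) {
    all_links = unlist(lapply(1:n_pages, get_links_for_page))
    writeLines(unique(all_links), outfile)   # deduplicate before saving
    invisible(all_links)
}

outfile = tempfile(fileext = ".txt")
links = collect_links(2, outfile)
```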

3.3.2 Collection

Function to collect information on a thread.

Function to collect information on a post (called from within the thread function).

3.4 Babycenter

Babycenter was by far the hardest platform for data collection. Its conversation style differed from the other communities, which resembled traditional discussion boards with topically organized threads. Discussions on Babycenter were more similar to a group chat in which multiple topics were discussed within the same thread. This resulted in many more threads matching the search term and in much longer threads. Many posts were irrelevant to our research interest and had to be sorted out later (see Section 5).

3.4.1 Search

We read the first result page, extract the links and the total number of search results from it, and calculate the number of result pages from the latter.
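The page-count calculation amounts to a simple ceiling division. The value of 20 results per page is an assumption for illustration.

```r
# Sketch of deriving the number of result pages from the reported number of
# search results; 20 results per page is an assumed page size.
n_pages_from_results = function(n_results, per_page = 20) {
    ceiling(n_results / per_page)
}
```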

The function extracts the links to the threads. The safe version prevents the search loop from aborting in case of an error.
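A safe wrapper in this spirit can be sketched in base R as below: it returns `NA` instead of throwing, so a single failing page does not abort the whole loop. (`purrr::safely()` achieves the same effect; this sketch avoids the dependency. The `risky` function is a toy example.)

```r
# Base-R sketch of a "safe" wrapper: errors are caught and turned into NA
# so that a loop over many pages survives individual failures.
safely_na = function(f) {
    function(...) tryCatch(f(...), error = function(e) NA)
}

# Toy example standing in for a scraping function that may fail
risky = function(x) if (x < 0) stop("bad page") else x * 2
safe_risky = safely_na(risky)
```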

We loop over all subsequent result pages and write all links to a text file.

3.4.2 Collection

Function to collect information on a thread.

Function to collect information on a post (called from within the thread function).