Google Searches Vary Significantly!

Google Trends API Returns Highly Variable Data

If you are using Google Search data, you should be aware of some basic limitations. Nobody knows the etiology of a search. Searches with the keyword "commit suicide" could be (A) people who are experiencing suicidal ideation, (B) people who want to know if a celebrity committed suicide (e.g., "did Michael Jackson commit suicide"), (C) mental health researchers wondering what comes up when you search "commit suicide", etc.

But that's not all. Google Searches are sampled for each pull, which means that every time they are pulled from the API, they can change. Even when searching the same terms over the same date(s), they can be vastly different. With a simple test, I show the data Google returns is highly variable -- even when you request the same data for the same terms within just a few seconds of each other.

To demonstrate, I pull data for several suicide-related search terms over 10 overlapping time windows.

from gtrendspy import timeline
from datetime import datetime, timedelta
import re
from random import randint
from time import sleep

terms = [
        'commit suicide',
        'suicide statistics',
        'suicide',
        'suicide hotline number',
        'suicide quotes + suicidal quotes',
        'suicide prevention',
        'suicidal thoughts',
        'how to commit suicide',
        'teen suicide',
        'suicide song + suicide songs'
]


## The first date to pull
initial_date = datetime.strptime("2019-03-01", "%Y-%m-%d")

# How different each pull should be
step = 2

# The window for the pull
interval = 360

# Number of pulls
numruns = 10

for index in range(0, numruns):

    # The start date changes with each pull
    start = initial_date + timedelta(days = step * index)
    start_string = start.strftime("%Y-%m-%d")

    # The end date is just the start + interval
    end = start + timedelta(days = interval)
    end_string = end.strftime("%Y-%m-%d")

    # Print the window
    print("start date is {}".format(start_string))
    print("end date is {}".format(end_string))


    # Pull the data
    timeline.theo_timeline(
        terms = terms,
        # I use this to make file names based upon the terms and the run (index)
        names = ["{}_{}".format(re.sub(" \+ | ", "_", x), index) for x in terms],
        start = start_string,
        end = end_string,
        timeframe_list = ['day'],
        geo_country_list = ['US'],
        us_states = False,
        worldwide = False,
        timestep_years = 10,
        batch_size = 5,
        outpath = "~/gtrends-variance/input",
        creds = "~/info_theo.py"
    )

    # I sleep for somewhere between 10 and 20 seconds between each run
    sleep(randint(10,20))

We can then combine this data in R to see whether the different pulls retrieved significantly different values. The first step is to create data frames that merge all the runs for each individual term.


## Arguments
terms <- c(
  'commit suicide',
  'suicide statistics',
  'suicide',
  'suicide hotline number',
  'suicide quotes + suicidal quotes',
  'suicide prevention',
  'suicidal thoughts',
  'how to commit suicide',
  'teen suicide',
  'suicide song + suicide songs'
)

timeframe <- 'day'
ROOTPATH <- "~/gtrends-variance"


## Load packages
if(!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, tidyverse, gsubfn, psych, lubridate, reshape2)


## Build Data

# Get all the file names for that term
setwd(ROOTPATH)
files <- dir("./input", ".csv", full.name = T)
files <- grep(timeframe, files, value = T)
names <- gsub(" [+] | ", "_", terms)


# Create empty lists to fill in through a for loop
summ_data_list <- list()
full_data_list <- list()
ct <- 1

# For each term
for(name in names){

  # We take all the files with that term
  same_files <- grep(name, files, value = T)
  # ... read them in and merge them into a list
  dat <- list()
  for(f in same_files){
    term <- basename(f)
    num <- strapplyc(term, "_([0-9]+)_[A-Za-z]+.csv$") %>% as.numeric() + 1
    df <- read.csv(f, header = T, stringsAsFactor = F)
    names(df) <- c("timestamp", paste0("run", num))
    dat[[num]] <- df
  }

  # Which we then combine
  df <- Reduce(function(x, y) merge(x, y, by="timestamp", all = F), dat)

  # We only use those dates where there was no missing values across pulls
  df <- df %>% filter(complete.cases(df))


  # We can get some summary statistics on the rows

  # `tmp' allows us to use the rows without worrying about timestamp`
  tmp <- df %>% select(-timestamp)
  summ_data <- data.frame(
    timestamp = ymd(df$timestamp),
    # we get the sd by row
    sd = apply(tmp, 1, function(x) sd(x)),
    # the mean by row
    mean = apply(tmp, 1, function(x) mean(x)),
    # the min by row
    min = apply(tmp, 1, function(x) min(x)),
    # the max by row
    max = apply(tmp, 1, function(x) max(x))
    ) %>% mutate(
    # compute a few interesting numbers
    sd_over_mean = sd / mean,
    range = max - min,
    range_over_mean = range / mean
  )

  # We add these to lists of data
  summ_data_list[[ct]] <- summ_data
  names(summ_data_list)[ct] <- name
  full_data_list[[ct]] <- df
  names(full_data_list)[ct] <- name
  ct <- ct + 1

}

Each data frame in full_data_list corresponds to the raw data for a different term by timestamp. I am unable to share this data because it is raw data, but the first column is timestamp, the second is run1, then run2, up to run10. Under each run column is the search volume for the date. If you run this code on your own, you will see that the search values are different among runs -- even though are for the same query on the same date using the same API executed just seconds apart.

Each data frame in summ_data_list corresponds to a summary of a different term by timestamp. The rows are different dates, and there are statistics related to the variance of the rows in the columns. Here is the raw data table for the search term commit suicide.

timestamp	sd	mean	min	max	sd_over_mean	range	range_over_mean
2019-03-19	6.318569	16.39029	9.341083	31.61500	0.3855068	22.27392	1.3589704
2019-03-20	6.097820	20.33431	9.597624	32.24064	0.2998784	22.64302	1.1135375
2019-03-21	4.726079	11.74259	6.313096	18.91806	0.4024734	12.60496	1.0734399
2019-03-22	4.210195	16.64089	12.957275	26.11812	0.2530030	13.16085	0.7908740
2019-03-23	4.667966	10.13054	7.442979	18.78849	0.4607815	11.34551	1.1199317
2019-03-24	5.416404	27.51927	21.182260	38.60976	0.1968222	17.42750	0.6332837

We can summarize these date-level statistics to get a better sense for how much variance we can expect for the average observation. For example, it appears that the range between the highest and lowest observation on any given date is typically a meaningful percentage of the query fraction.

lapply(summ_data_list, function(x) sprintf("%.2f%%", mean(x$range_over_mean) * 100)) %>%
  data.frame() %>% t() %>% kable(format = "markdown")


commit_suicide	101.47%
suicide_statistics	97.15%
suicide	87.14%
suicide_hotline_number	105.60%
suicide_quotes_suicidal_quotes	92.52%
suicide_prevention	50.12%
suicidal_thoughts	65.09%
how_to_commit_suicide	101.47%
teen_suicide	87.14%
suicide_song_suicide_songs	77.63%

This means, for example, the range in search volumes over 10 API runs for the keyword "commit suicide" is, on average, 101% of its mean. That is, if you were to request this data 10 different times for the same date, you would expect that the maximum value you got would be more than twice the minimum value you got. In fact, we can see how large this range could be:

lapply(summ_data_list, function(x) sprintf("%.2f%%", max(x$range_over_mean) * 100)) %>%
  data.frame() %>% t() %>% kable(format = "markdown")


commit_suicide	230.46%
suicide_statistics	205.66%
suicide	236.32%
suicide_hotline_number	213.62%
suicide_quotes_suicidal_quotes	213.75%
suicide_prevention	130.47%
suicidal_thoughts	128.33%
how_to_commit_suicide	230.46%
teen_suicide	236.32%
suicide_song_suicide_songs	190.79%

This means that the range between the minimum and maximum value you got could be as high as 230% of the mean!

You can also see the size of the average range between minimum and maximum values for each time point.

lapply(summ_data_list, function(x) sprintf("%.2f", mean(x$range))) %>%
  data.frame() %>% t() %>% kable(format = "markdown")


commit_suicide	17.02
suicide_statistics	20.35
suicide	24.46
suicide_hotline_number	16.20
suicide_quotes_suicidal_quotes	22.35
suicide_prevention	39.25
suicidal_thoughts	30.95
how_to_commit_suicide	17.02
teen_suicide	24.46
suicide_song_suicide_songs	25.11

This means that, if you chose the run with the highest search volume for any particular date, you'd expect to be able to say "suicide statistics" was 20.35 (searches per 10M) higher than if you were to choose the run with the lowest search volume. We can also see how large this range can be:

lapply(summ_data_list, function(x) sprintf("%.2f", max(x$range))) %>%
  data.frame() %>% t() %>% kable(format = "markdown")


commit_suicide	39.08
suicide_statistics	42.42
suicide	57.37
suicide_hotline_number	39.76
suicide_quotes_suicidal_quotes	46.24
suicide_prevention	117.79
suicidal_thoughts	61.67
how_to_commit_suicide	39.08
teen_suicide	57.37
suicide_song_suicide_songs	61.05

This means that for one date, the difference between the highest and lowest pull for searches with the query "suicide prevention" was 118 searches per 10M. Given that this is daily data, that is a difference of approximately 11,000 searches in a single day!

We can visualize how different individual runs are from the mean with a dot plot. We use searches for "commit suicide" as an example. Each vertical line is a (randomly sampled) date. The black line is the mean for all runs, and the colored dots represent a different run.

set.seed(1234)
df <- full_data_list[[1]]
long_df <- melt(df %>% filter(complete.cases(.)) %>% sample_n(60), id = "timestamp", value.name = "searches", variable.name = "run")
long_df$run <- gsub("run", "", long_df$run)
long_df$timestamp <- ymd(long_df$timestamp)
long_df <- long_df %>% arrange(timestamp)

grouped_df <- long_df %>% group_by(timestamp) %>% summarise(meansearches = mean(searches, na.rm = T)) %>% ungroup()

p <- ggplot(long_df)
p <- p + geom_vline(aes(xintercept = timestamp), linetype = "dotted")
p <- p + geom_point(aes(x = timestamp, y = searches, col = run))
s1 <- seq.Date(min(ymd(long_df$timestamp)), max(ymd(long_df$timestamp)) + 28, by = "1 month")
p <- p + scale_x_date(
    lim = c(min(s1), max(s1)),
    breaks = s1,
    labels = function(x) format(x, format = "%b %Y")
  )
p <- p + geom_line(data = grouped_df, aes(x=timestamp, y = meansearches))
p <- p + theme_classic()
p <- p + theme(axis.text.x = element_text(angle = 55, hjust = 1))
p <- p + labs(
  x = "Date",
  y = "Query Fraction for 'Commit Suicide'",
  col = "API Run"
)

ggsave("./output/commitsuicide_dotplot.png", p, width = 10, height = 4)

commitsuicide-dotplot

In summary, when you request the search volume for a given term on a given date from the Google Trends API, you need to know that this data is sampled. From this basic test, it seems that the data Google returns is, in itself, highly variable -- even when you request the same data for the same terms within just a few seconds of each other. Standard statistical analyses that assume each data point is its true value (rather than a value with randomness of its own) will underestimate standard errors.