Sampling is a statistical tool for estimating the characteristics of a whole population by assessing a subset of individuals as a representative sample. The method is also used in digital music production by sampling analogue sounds at frequent intervals and digitizing the results of each sample.
A similar technique is used in Google Analytics, where a subset of website visits stands in for the website’s metrics as a whole. Sampling can drastically speed up report processing on large datasets, but it does so at the cost of accuracy. Unfortunately, GA’s default sampling threshold is 250,000 visits, so any report covering more visits than this is based on a sample rather than on the full data.
This tutorial by Ryan Praskievicz on RyanPraski.com shows readers how to avoid GA’s sampling limitations, even when pulling 1 million rows of data, using the Google Analytics API and Python. The solution checks for the presence of sampling and breaks the query down into more manageable 10,000-row chunks, each covering a shorter date range. These chunks are then stitched together into a single CSV file that represents the full date range.
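The chunk-and-stitch idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tutorial’s own code: `split_date_range`, `stitch_to_csv`, and the `fetch_rows` callable are hypothetical names, and `fetch_rows` stands in for whatever function actually queries the Google Analytics API for one date range.

```python
import csv
from datetime import date, timedelta


def split_date_range(start, end, days_per_chunk):
    """Split the inclusive range [start, end] into consecutive sub-ranges."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        # Each chunk spans days_per_chunk days, clipped to the overall end date.
        chunk_end = min(chunk_start + timedelta(days=days_per_chunk - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks


def stitch_to_csv(path, header, chunks, fetch_rows):
    """Write the rows of every chunk, in order, into one CSV file.

    fetch_rows is a hypothetical callable (start, end) -> iterable of rows;
    in practice it would wrap the API query for that date range.
    Returns the number of data rows written.
    """
    total = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for chunk_start, chunk_end in chunks:
            for row in fetch_rows(chunk_start, chunk_end):
                writer.writerow(row)
                total += 1
    return total
```

Because the chunks are written in date order under a single header, the resulting file reads as if it came from one query over the full date range.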
Readers first need to set up the Google Analytics API Python client library. The author provides all of the relevant code, with step-by-step instructions for running a sample query that pulls 120,000 total results into a CSV file.
If an error message reports that the query contains sampled data, shorten the date range so that each run pulls fewer rows. The final results arrive in a single CSV file that is independent of GA’s sampling limitations, and this example reaches Excel 2010’s upper limit of 1,048,576 rows. The solution can also pull data from multiple GA profiles within a single Python application, saving time for high-traffic Google Analytics users.
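Shortening the date range whenever sampling is detected can also be automated. The sketch below is one possible approach and is not taken from the tutorial: `fetch_unsampled` and `run_query` are hypothetical names, and `run_query` is assumed to return a dictionary shaped like a Core Reporting API v3 response, with a `containsSampledData` flag and a `rows` list. It also assumes the query includes a date dimension, so that rows from the two halves can simply be concatenated.

```python
from datetime import date, timedelta


def fetch_unsampled(start, end, run_query):
    """Return rows for [start, end], halving the range while it is sampled.

    run_query is a hypothetical callable (start, end) -> dict with the keys
    "containsSampledData" (bool) and "rows" (list), standing in for the
    real API call.
    """
    response = run_query(start, end)
    # Stop when the data is unsampled, or when a single day cannot be
    # split any further (that day's rows are returned as they are).
    if not response.get("containsSampledData") or start == end:
        return response.get("rows", [])
    midpoint = start + (end - start) // 2
    first_half = fetch_unsampled(start, midpoint, run_query)
    second_half = fetch_unsampled(midpoint + timedelta(days=1), end, run_query)
    return first_half + second_half
```

Halving keeps the number of extra queries small when only part of the date range is busy enough to trigger sampling, at the cost of re-running any query that turns out to be sampled.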