How to: Analyze Data from BigQuery using GoodData and CloudConnect

Guest Author
Jan. 10 2013, 03:00PM EST

Guest blog post written by Jiri Tobolka, Jr. Developer Evangelist at GoodData.

I really like playing around with cutting-edge technologies. One of them is GoodData CloudConnect, a tool for ETL and for building GoodData projects. Since we released this new feature, I have had dozens of ideas for how data from various sources can be easily transformed in the cloud. Recently, for example, I found myself wondering: "How about using the Google BigQuery API as a data source? What about visualizing its data in GoodData?"

I know that BigQuery has sample datasets such as weather data, GitHub data, and data from Wikipedia. How about using the weather data to analyze average temperatures in San Francisco over the last 20 years? This sounds interesting to me, so let's check out the BigQuery API a little bit more. The following API request will do the job:

POST:
https://www.googleapis.com/bigquery/v2/projects/${GOOGLE_API_PROJECT}/queries

- Headers:
Authorization: Bearer ${Access_Token}
Content-Type: application/json

- Body:
{
"query": "SELECT year, month, day, mean_temp FROM [publicdata:samples.gsod] WHERE station_number=724940 AND year > 1989 ORDER BY year,month,day"
}
* The query parameter must be a valid BigQuery query; if you are not sure about its validity, you can check it in the BigQuery web console.
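Outside CloudConnect, the same request can be sketched in Python. This is just a minimal illustration of the call above; the project id and access token below are placeholders you would supply from your own Google API project:

```python
import json

# Builds the synchronous BigQuery query request shown above.
# "my-project" and "ya29.token" are placeholder values.
def build_query_request(project_id, access_token, sql):
    url = ("https://www.googleapis.com/bigquery/v2/projects/"
           + project_id + "/queries")
    headers = {
        "Authorization": "Bearer " + access_token,
        "Content-Type": "application/json",
    }
    body = json.dumps({"query": sql})
    return url, headers, body

sql = ("SELECT year, month, day, mean_temp FROM [publicdata:samples.gsod] "
       "WHERE station_number=724940 AND year > 1989 "
       "ORDER BY year,month,day")
url, headers, body = build_query_request("my-project", "ya29.token", sql)

# To actually send it (requires a valid OAuth 2.0 access token):
# import urllib.request
# req = urllib.request.Request(url, body.encode(), headers)
# response = json.load(urllib.request.urlopen(req))
```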

How to do this in CloudConnect? Simply use the REST Connector component with the proper setup. I won't include all the authentication details here, but if you are familiar with Google OAuth 2.0, setting it up in CloudConnect is quite easy. To sum it up, all I need is:

- GoodData Project & Model
- CloudConnect
- Google BigQuery
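For those curious about the authentication step I skipped over: the REST Connector essentially performs Google's standard OAuth 2.0 refresh-token exchange to obtain a short-lived access token. Here is a rough sketch of that exchange; the client id, client secret, and refresh token are placeholders from your Google API Console project:

```python
import urllib.parse

# Google's OAuth 2.0 token endpoint.
TOKEN_URL = "https://accounts.google.com/o/oauth2/token"

# Builds the form-encoded body of the refresh-token exchange.
# All three arguments are placeholders for your own credentials.
def build_token_request(client_id, client_secret, refresh_token):
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }
    return TOKEN_URL, urllib.parse.urlencode(params)

token_url, form_body = build_token_request(
    "my-client-id", "my-client-secret", "my-refresh-token")
# POSTing form_body to token_url returns JSON containing "access_token",
# which goes into the Authorization: Bearer header of the query call.
```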

Creating a new GoodData project in CloudConnect is a one-click matter. Let's create a simple data model for analyzing the weather data. CloudConnect has a very useful component for this - the LDM Modeler. One dataset with one fact and a date dimension should be enough for our example. See how the model looks inside CloudConnect.

Let's move back to my ETL process. The BigQuery API returns data in JSON, so all I need to do is use the CloudConnect JSON Reader and configure it properly to extract the data from the JSON structure to the output port for processing.

The JSON Reader uses XPath expressions for JSON. In the BigQuery response, the description of the values (the schema) appears in the first section, and the values themselves appear row by row in the second. I need to extract the row values so that they fit my metadata on the output. See the part of the JSON with the values below.

...
"rows": [
  {
   "f": [
    {
     "v": "2000"
    },
    {
     "v": "1"
    },
    {
     "v": "1"
    },
    {
     "v": "50.70000076293945"
    }
   ]
  },
...
I can use the f[index] identifiers to extract the data in the order that I need.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Context xpath="/root/object/rows" outPort="0">
  <Mapping cloverField="year" xpath="f[1]"/>
  <Mapping cloverField="month" xpath="f[2]"/>
  <Mapping cloverField="day" xpath="f[3]"/>
  <Mapping cloverField="temperature" xpath="f[4]"/>
</Context>
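If you want to double-check the mapping logic outside CloudConnect, the same extraction can be sketched in a few lines of Python. The sample response below is hard-coded from the JSON snippet above:

```python
import json

# A single row from the BigQuery response, as shown above.
sample = json.loads("""
{
  "rows": [
    {"f": [{"v": "2000"}, {"v": "1"}, {"v": "1"},
           {"v": "50.70000076293945"}]}
  ]
}
""")

# Each row is an object with key "f"; each cell is an object with key "v".
# Unpacking in order mirrors the f[1]..f[4] XPath mapping above.
records = []
for row in sample["rows"]:
    year, month, day, temp = (cell["v"] for cell in row["f"])
    records.append({"year": year, "month": month,
                    "day": day, "temperature": temp})
```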

The last step I need before starting the data upload is a date transformation. I've extracted the date fields for year, month, and day separately from the API. I need to transform these strings into the single date field that I will use in the GoodData project for reporting.

CloudConnect has many components, and Reformat is one of the most commonly used. Check out the function that will help me with the date transformation.

function integer transform() {
	// Pass the temperature fact through and build a single date
	// from the separate year, month, and day strings.
	$out.0.meantemperature = $in.0.temperature;
	$out.0.date = str2date($in.0.year+"/"+$in.0.month+"/"+$in.0.day,'yyyy/M/d');
	return OK;
}
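The same transformation, sketched in Python for comparison - like the 'yyyy/M/d' pattern in CTL, strptime's %m and %d accept non-zero-padded months and days:

```python
from datetime import datetime

# Equivalent of the CTL str2date call above: join the separate
# year/month/day strings and parse them into a single date.
def to_date(year, month, day):
    return datetime.strptime(
        year + "/" + month + "/" + day, "%Y/%m/%d").date()
```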

Now, let's upload the data! It is already transformed, so the GD Writer configuration is very simple: just select the dataset created with the LDM Modeler, choose the field mapping, and run the graph! See the complete data loading process visualized below.

Switch to the GoodData application and create your first report with the following metric:

Select MIN(mean_temperature)

Now, slice the metric by Month and choose your preferred visualization. The result shows the monthly low of the daily mean temperature in San Francisco, based on data extracted over the REST API from Google BigQuery.

What's next? More data! What about mashing up the data with other sources? Do you think that the weather has some influence on your sales?

This is just one high level example of how you can easily leverage various data sources using CloudConnect. Want more details? Get in touch! I'd be happy to share them with you!

If you like it and want to try it yourself, hurry up and download CloudConnect! We just launched it last week for you! We also prepared the Developer Program for technical users like you to enjoy the benefits of harnessing and manipulating data in the cloud.
