Qwiklabs — Baseline: Data, ML, AI

Koo Ping Shung
DataDrivenInvestor


The Badge I went for!

I have been wanting to find out more about Google Cloud Platform (GCP), and a Google Community Manager pointed me to Qwiklabs, a website offering several "quests", as they call them. Each quest lets you learn more about either GCP or AWS. I had some time to work on one and chose the Baseline: Data, ML, AI quest.

Overview

There is no need to explain why I started with this quest: it was a fantastic way to see what Data, ML and AI features are available on GCP. It was a great learning experience coupled with hands-on labs (credits need to be purchased to activate most of them). This particular quest gives you a brief look at BigQuery (SQL) and BigTable (NoSQL), so you can pick up basic SQL along the way. You also get to check out Google's entity recognition and speech-to-text capabilities, along with data transformation/preparation and report creation. And, not forgetting the most essential part, Google's ML Engine.

The hands-on labs are available for a limited period of time, and a practice account is created for you for each lab. Although the labs are timed, you are given ample time to complete them; in my experience so far there is about a 20 to 30% buffer, enough to take a short washroom break.

For some of the sections they also provide "Next Steps/Learn More" links, which I encourage you to try out.

Below is an overview of, and some thoughts on, the hands-on labs, to help readers with their own learning. You may skip this part and move to the conclusion, where I share quick thoughts on the features in GCP that are interesting for data scientists.

Sections

There are 13 sections altogether and it took me about 1.5 calendar days to complete them, so readers may have to commit a weekend to run through the sections.

1-Introduction to SQL for BigQuery and Cloud SQL

You will learn the hierarchy Projects -> Databases -> Tables. Here you pick up some basics of SQL: how to write simple queries using SELECT, FROM and WHERE clauses, how to sort results, and how to create summary statistics. There is also a sub-section on table management, covering clauses such as UNION and INSERT INTO.
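
To give a flavour, here is a minimal sketch of the kind of query the lab covers, run from the command line. The project, dataset and column names are my own illustration, not the lab's actual data:

    # Assumes the Cloud SDK is installed and authenticated.
    # `my-project.demo.trips` and its columns are illustrative names.
    bq query --use_legacy_sql=false \
      'SELECT start_station, COUNT(*) AS num_trips
       FROM `my-project.demo.trips`
       WHERE duration_sec > 300
       GROUP BY start_station
       ORDER BY num_trips DESC
       LIMIT 10'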

My Note: Take advantage of the lab to explore other clauses as well. For more information, you can go to W3Schools and practice in the lab (remember, it uses credits).

2-Big Query: Qwik Start

BigQuery is the enterprise data warehouse (EDW) in GCP. The lab lets you interact with BigQuery through either the web interface or the command line. I tried the command line, as I wanted to get familiar with it again; point-and-click steps can be difficult to replicate.

Here you will learn how to access a table's metadata and list the datasets available in a project. Another part I found interesting is creating a dataset ("babynames") and loading a raw file into a table (names2010). You then run a simple SQL query on the table you have loaded.
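
From memory, the command-line flow goes roughly like this; the file name and schema below are the ones the classic babynames tutorial uses, so treat the exact details as approximate:

    bq ls                                    # list datasets in the current project
    bq show publicdata:samples.shakespeare   # inspect a sample table's metadata
    bq mk babynames                          # create the dataset
    bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer
    bq query "SELECT name, count FROM babynames.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"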

My Note: Try loading other raw files into the dataset as well; a little repetition, such as loading the files for other years, helps to 'internalize' your learning.

3-BigTable: Qwik Start

BigTable is GCP’s NoSQL database. In this lab, you will get to connect to a Cloud BigTable instance, perform basic administrative tasks, and read and write data in a table.

For me, the interesting concept here is "column families", which group related columns together. I am not from a CS background and do not work with databases often beyond pulling data from them, so this was an interesting point that I will look into further.
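
For those curious, the gist of the lab can be retraced with the cbt command-line tool. The table, family and cell names below are illustrative:

    # Assumes a ~/.cbtrc file pointing at your project and Bigtable instance.
    cbt createtable my-table               # create a table
    cbt createfamily my-table cf1          # define a column family, cf1
    cbt set my-table r1 cf1:c1=test-value  # write one cell in row r1
    cbt read my-table                      # scan the rows back
    cbt deletetable my-table               # clean up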

My Note: Unfortunately, the lab is pretty short; I feel it could have allowed participants some hands-on querying of the NoSQL database as well.

4-Cloud Natural Language and Cloud Speech API

Both labs here are very interesting, but especially the Cloud Natural Language API.

In the Cloud Natural Language lab, you pass a sentence in through a "gcloud" command and use the API to recognize the entities in it. Of course, the Natural Language API can do more than entity recognition: sentiment analysis, syntax analysis and so on. What is interesting, to me at least, is that if a recognized entity has a Wikipedia page, the link is provided as well.
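
A minimal sketch of the entity-recognition call; the sentence is just an example I made up, and any text will do:

    gcloud ml language analyze-entities \
      --content="Michelangelo Caravaggio, Italian painter, is known for 'The Calling of Saint Matthew'."

The JSON response lists each recognized entity and, when one exists, a wikipedia_url field in its metadata.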

For the Cloud Speech API, there is a ready-made audio file for which you are given the encoding details. You enter those details into a JSON request file, which you then "curl" to the API along with your key.
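
Roughly, the exchange looks like the following; the audio URI is Google's public sample file, and API_KEY is assumed to hold a valid API key:

    cat > request.json <<'EOF'
    {
      "config": { "encoding": "FLAC", "languageCode": "en-US" },
      "audio":  { "uri": "gs://cloud-samples-tests/speech/brooklyn.flac" }
    }
    EOF
    curl -s -X POST -H "Content-Type: application/json" \
      --data-binary @request.json \
      "https://speech.googleapis.com/v1/speech:recognize?key=${API_KEY}"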

I found this hands-on lacking; I would have preferred Qwiklabs to point to a site for creating more audio files, and to use those files (with encoding details provided) as further exercises.

My Note: I had more fun with the Cloud Natural Language API, passing in different sentences to see how good the API is, and it was good…within the limits of the examples I tried, of course. So come up with other sentences and have fun with it. For the Cloud Speech lab, I was too technically challenged to take it further; moreover, the lab time is pretty short for experimenting.

5-DataProc (GUI or Command Line)

Dataproc is a cloud service for spinning up Apache Spark and Apache Hadoop clusters. It is a straightforward hands-on in which you supply a few configuration details to set up the compute cluster, and you will be asked to change the number of "workers" as well. Being so straightforward, there is not much variation you can introduce to make the hands-on more fun (i.e. more learning points).
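
A sketch of the two main steps, assuming the Cloud SDK is set up; the cluster name is illustrative, and newer SDK versions also ask for a --region flag:

    gcloud dataproc clusters create example-cluster --num-workers 2
    # Changing the number of workers later is a one-line update:
    gcloud dataproc clusters update example-cluster --num-workers 4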

6-DataPrep

DataPrep is a data service for visually exploring, munging and preparing data for analysis. It is third-party software, called "Trifacta", that resides in GCP. Not to give too much away, but you will build a data flow, presented very nicely visually, showing the steps taken to transform the data. What I like is that you can explore the data and record the transformation steps you want; the visualization helps a lot in seeing what the data looks like after each transformation. Compared with a programming language (not saying that is bad, this is just its weakness), it cuts down a lot of the work of figuring out how to clean the data. You can also produce quick summary statistics if needed, all within the same interface.

My Note: The time given for this lab is ample, so after the last task do try to navigate and play around with the data pipeline/flow you have created.

7-Google Cloud Datalab

This section presents GCP's ability to host notebooks in its environment. In addition, you still get git capabilities such as commit and push.
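
Spinning up a notebook VM is a single command, assuming the Cloud SDK's datalab component is installed; the instance name and zone are illustrative:

    datalab create my-datalab-vm --zone us-central1-a
    # Reconnect to the same instance in a later session:
    datalab connect my-datalab-vm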

8-Cloud ML Engine

So this is the fun and critical part of the whole quest… machine learning! Here you train a ready-made TensorFlow model and run it either locally or in the cloud (though with my poor computer science background I still cannot figure out the difference, since both runs seem to happen in the cloud anyway. Can someone explain it to me?).

The lab uses a familiar dataset, the US Census Income data… you know, the one about predicting above or below 50K? Remember? So, if you are familiar with the dataset, we are doing a classification problem. Only one model is used in the lab, TensorFlow's DNNLinearCombinedClassifier. Just to set expectations: you will not learn a lot about machine learning here, but rather about the ML functions and features in GCP. You also get to deploy the model you have trained.

GCP provides TensorBoard (of course!) for you to watch your model's training progress, so you can get an idea of whether the training process needs further tweaking or management.
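
Launching it is a one-liner pointed at the training output directory; the path and port here are my assumptions, not necessarily the lab's:

    tensorboard --logdir=output --port=8080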

I looked through the steps for ways to try other classification models but could not see where to change it. My suspicion is that I would have to nano/vim into trainer.task. So, if you can, try using other machine learning models offered by TensorFlow.

This was the command used to start training the model.
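
From memory, the census sample's local-training invocation looks roughly like this; the environment variables, step count and output path are the tutorial's defaults, so treat the details as approximate:

    gcloud ml-engine local train \
      --module-name trainer.task \
      --package-path trainer/ \
      --job-dir output \
      -- \
      --train-files $TRAIN_DATA \
      --eval-files $EVAL_DATA \
      --train-steps 1000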

9-Data Studio

Data Studio is where you build your reports and dashboards (by the way, it is FREE to use). In the hands-on, you pull a dataset from BigQuery to create a time-series chart, then style the report (well, it is just adding a background color to the visual) and add a textbox at the top of the 'report'. I sense that promoting Data Studio is not a priority in this quest, as only very few of its features are shown. Pretty meh… if you ask me. There is so much opportunity to show how quickly a dashboard can be created and laid out.

Anyway, the interface is pretty similar to Qlik Sense and SAS Visual Analytics. My preferred visualization tools are still R and Python, but point-and-click interfaces will definitely have a market.

10-Google Genomics: Qwik Start

Here you will be asked to work with the Cloud Genomics Pipelines API. I do not work with genomics often, so I was not able to relate much to it, but I went through the motions and completed the lab because I wanted to complete the quest! :)

11-DataFlow

In this section, participants get to set up a Python development environment and implement Python code. Again, the lab is more about going through the motions to completion. I would have preferred that participants download the Python code from, say, a Google Drive, upload it and then execute it; that is closer to a real working environment and builds a more concrete, relatable experience.
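
For a taste of what such a setup boils down to, a local run of Beam's bundled wordcount example looks something like this (package extras and output naming are from memory):

    pip install 'apache-beam[gcp]'
    python -m apache_beam.examples.wordcount --output outputs
    head outputs*   # results land in sharded files such as outputs-00000-of-00001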

12-Cloud Filestore

Personally, I feel this section is more about data engineering than data science. I think it would be great to set the lab in a business scenario so participants understand why certain steps are taken; that understanding would help them see the value of Filestore.

This last section of the quest was pretty much going through the motions of clearing it and noting Filestore's features. Unfortunately, the learning here does not 'stick' for me, given my lack of a data engineering background.
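
For the record, the core of the lab is one provisioning command along the lines of the following; the instance, share, zone and network names are illustrative, and older SDKs may need gcloud beta filestore:

    gcloud filestore instances create nfs-server \
      --zone=us-central1-c \
      --tier=STANDARD \
      --file-share=name="vol1",capacity=1TB \
      --network=name="default"
    # The share is then mounted from a Compute Engine VM over NFS
    # (replace <filestore-ip> with the instance's reported IP address):
    sudo mount <filestore-ip>:/vol1 /mnt/filestore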

Tips for Learning

Below are some tips from me on making this a great learning experience.

  • Have a notepad handy; you will need to record some details during the labs.
  • DO NOT rush through the hands-on; try to understand the reasons behind each step. Google is your learning buddy.
  • Read a few lines ahead in the instructions before working on each lab step.
  • The time given for the labs is ample; take advantage of it by introducing some "variations".

Conclusion

Although some of the sections are more about going through the motions, I still enjoyed the quest and now have a better idea of GCP. In some areas, such as DataPrep, I can see the ease of use: the visualization really enables better and faster data preparation. The quest also makes me want to improve my command-line skills (I started in the DOS era but moved to Windows pretty quickly). I especially enjoyed the sections on DataPrep, ML Engine, and the Natural Language and Speech APIs.

I will definitely be keen to work through other quests when the necessary support and resources are available. As mentioned, I run a tech community (together with a group of friends), and I have seen higher take-up of GCP among tech firms, so for readers interested in the field, do try out Qwiklabs to familiarize yourself with GCP.

Also, I had the pleasure of leading a Cloud Study Jam with members of BigDataX, DataScience SG (the data science community I co-founded) and PyData Singapore participating. It was a great experience and I had a lot of fun doing it. :)

I hope this blog post has been useful to you. I wish all readers a FUN data science learning journey, and do visit my other blog posts and LinkedIn profile.
