Know Datajoins, Jump over Pitfalls!

Setup and requirements

The setup and requirements of this lab are pretty easy. All you need to do is to open the Google Console by following the given instructions. You also need to open Big Query Console and keep it ready for use.

Create a new dataset to store your tables

The first task of the lab is pretty straightforward. You have to create a data set. You will easily be able to do it as you follow the instructions. Once you are done with this task, you will see the ecommerce dataset listed under your project. This is what you should see…

Pin the Lab Project in BigQuery

Next, pin the Lab Project in BigQuery. We will use this are going to require our dataset frequently in this lab. Pinning makes life easier, especially as you apply what you learn to your everyday tasks and work with multiple projects. Now you are ready to do some real work in BigQuery.

Examine the fields

As you will be working on the ecommerce dataset, have a look at the fields of all the tables in the dataset. This will help you in analyzing the queries as you move further in this lab. After you become familiar with the fields, try to identify some key fields in the tables which can be potentially useful for joins.

Pitfall: Non-Unique key

We are going to use the SKU id as the base for joining the tables, so if the SKUs are not unique, it is going to be a bit difficult to use that column to perform joins.

Join pitfall: Unintentional many-to-one SKU relationships:

We can now JOIN the inventory dataset with the Website Analytics dataset using the query.

Join pitfall solution: use distinct SKUs before joining

To eliminate the unintentional many-to-one relationship pitfalls, we can use distinct SKU IDs while joining the dataset. You can try the same by firing the query given in the lab. The query will aggregate the count and it will show only one row for every distinct SKU ID. As the query will return three rows of Product description, you would want to keep only one row in the result table.

Join pitfall solution: Selecting the correct join type and filtering for NULL

The default type of Join is Inner Join. It will only return the SKUs that are available in both datasets. If you want to see all the SKU records regardless of their presence in Inventory dataset, you can try out different types of Join, viz, Left Join, Right Join, Full Join and Cross Join. Here is a brief summary of the different types of Joins:

Join pitfall: Unintentional Cross Join

If you do not know the exact relationship between the keys of two datasets, you might unintentionally cross join in the database. This can significantly reduce the performance.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store