Initial Phase: Getting to Know the Data (Structure)
I wrote the title as a question because it is also still an open question for me. This thing is one of the difficult things I ran into during the Second Intern Project, where we were given a database with 95 tables, that is grouped into different departments. We have learned the basics in joins, merge, and connections in Tableau, but it's still always a challenge – at least for me –, when it comes into a real practice. It is, of course, different for each person, but for me, I still need to carefully look at the database schema and to the individual table, to see these things:
- which column contains which information,
- whether this column has a full data or should I connect it to other tables,
- whether it has duplicates,
- how much rows do I need to answer the question(s),
- whether I could use just one table as a main table, or do I need different main tables,
- will I get duplicates if I were to join/connect these tables,
- and so on.
So I would slowly write down a list of things I need from the data and mark specific tables and/or columns in the schema, that I can use to answer those questions. After I become a bit more confident to make the first contact with the data, I would take a peek at the tables, to see which table has the complete rows that I need. I would check how clean the tables are, as in whether there are nulls, or duplicates. For this, you can use different tools and/or methods, like Tableau Prep, Power BI, or create Calculated Fields in Tableau. You can, of course, use other tools like Python or anything else; nobody's stopping you from being creative, hacky, awesome, and strong analyst. The important thing is that you check on the number of rows, number of unique values of a column, number of nulls, the value distribution (not always important, but sometimes you'll need it for your analysis), and which key column you would need to take if you want to join/connect that table with another. That is what I would do when working with a database with multiple tables: to get to know the data (structure) better.
At this point, I haven't really dived into the data and read the values. I only analyzed the shape of the data as a whole structure, and as individual tables.
Navigate Through the Tables to the Information Related to the Question(s)
Next, I would want to slowly read the business question that I need to answer with the data. It helps if you previously prepared and formulated the question in such a way that you understand what it means, what we are actually looking for, and in which format we need it presented. Then check again on the data structure that we have analyzed, whether we could find all the data we need to answer the questions. At this point, I still haven't read the table contents, but instead I'd rather be scanning through the column names (and the description for the columns in the documentation, if provided), and mark the columns with color. I'd do some scribbles with MS Paint, or Excalidraw, to mark the columns, the tables, and the connections, which for me helps.
For me, these two first steps usually take a bit longer than I anticipated. That probably means either that I'm still lacking in analytical thinking, or that I tend to overthink and/or overplan the analytical process. But for now, if it works, it works.
Important to – again – notice, when (or where) on the data schema you do need a full row, or just the ones that you need. This will play a major role on deciding whether to use join or connection when preparing the data.
Join or Connection
If you have a – let's say – a transaction table as your main table, chances are you would need 100% of the rows for your analysis. If you then need a customer detail to each of the transaction, you will want to retrieve more information in the customer table. In this case, you might not need to take all of the customer information, but rather just the customers that have made a transaction stated in the transaction table. If you want to get the store addresses, where the transactions took place, then you will want to check only the stores on the store table, that is/are mentioned in the transaction table. If you have a main table, and you need to retrieve some additional information from another table that belongs to the rows on the main table, then you would want to use join in this case. We want to keep the number of rows exactly the same as our main table, and not adding any unnecessary row(s).
Now you want to manage the store logistics, as in you want to check the current total number of items for all products in the store. You will need all the information on our main table – the transaction table –, so you can monitor, which products were purchased and how many was bought. You'll need to sum up the total quantity of the purchased item. But on the same time, you also need the information of all products stored in the logistics data, so you can subtract the number of items purchased from the number of items in the store. In this case, you will need the whole rows of a new table – our logistics table–, while also wanting to have a certain connection from logistics table to the transaction table. For this, you'll need the connection. What it does, is it opens the logistics table beside the transaction (main) table, while still keeping the whole logistics data. It doesn't affect the number of rows of both the transaction and the logistics table, but now both the tables are connected, i.e. you can use column A from transaction table and use column B from logistics table in a same visualization. This way you'll have smaller chance to actually lose some of the data, because connection doesn't chip away a portion of the table like in joins, but instead it displays both tables as separate tables, which are connected through a "key" column, that both tables have. One example of this "key" column is like a "Customer ID" column, which is unique for each individual customer. The Product ID "DP-7662" should describe the exact same product, be it in transaction or in logistics table.
Umm Okay... So Which Should I Use?
It always depends on how you would extract the info from the whole database, but I'd say it makes sense to use both. That's why I wrote at the beginning, that it's crucial to first understand both the question and the data structure, so you have the gist of what you need and in what way you want them served. Keep in mind that it sometimes takes more computer memory, when you use connection, because you don't reduce the number of rows when you connect two or more tables together with connection.
