Homework 06

Fundamentals of Data Engineering

There are three problems this week: the first deals with SQL grouping, and the other two deal with data modeling. For the first problem, the repository contains a zip file which can be used to recreate a collection of tables of airplane flights from January 2020. If you are interested in more data like this, I grabbed the original data from the Bureau of Transportation Statistics here.

Accept Assignment


In the repository lives a zip file named hw6.zip. Download that zip file and extract it; inside are two SQL files (compressing them made them small enough to upload to GitHub). These .sql files are a bit different from what you have seen in the past: they are technically a database dump, which is a common way to generate an exact duplicate of the contents of a database. Interacting with them is a little different, though. The easiest and fastest method is the command-line (terminal) tool psql, which should have been installed along with the rest of PostgreSQL at the start of the semester (though it may or may not have been correctly added to your system PATH). Open a terminal (or command prompt) at the location of the hw6_fast.sql file (or open a terminal anywhere and then navigate to that location), and run:

psql -d analysis -U yourusername -f hw6_fast.sql

where analysis is presumably the name of the database where you’ve been placing everything, and yourusername is whatever username you’ve been using with your database. If you happened to set up the database server on a non-default port, you’ll also need to include -p portnum. Assuming all goes well, this will create a new hw6 schema in your analysis database and set up the six tables, complete with data, constraints, and primary and foreign keys. You should be able to confirm this by opening whatever client you usually use and querying the tables.
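As a quick sanity check, a query like the following (run in your usual client) should list the tables; the schema name hw6 comes from the dump itself:

```sql
-- List the tables created by the dump in the hw6 schema.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'hw6'
ORDER BY table_name;
```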

If this is not working, for whatever reason, I’ve tried to give you another option as well. The hw6_slow.sql file is also a database dump, but one which has been tweaked to use standard INSERT INTO statements. The result is that you should be able to run this file like any other collection of SQL commands in your client, but the price is that it will be MUCH slower to populate the initial tables. I’d recommend the first option if possible, but if you are having absolutely no luck getting psql to work in the terminal, I think this will work. If you are still having trouble getting the tables generated, please reach out! I’m happy to help you through this!

The benefit of loading a database dump, of course, is that the database structure is all ready for you, with a central table flights having information about individual flights as well as references to other tables with information about airports, carriers, and city markets (to name a few). An image of the ERD is included below, as well as in the repository, for your reference, or you can look at the version online here.

Entity Relationship Diagram of the airlines database table structure.

Problem 1: Birds of a Feather

This initial batch of questions focuses on using keywords such as GROUP BY effectively. For each part, include the queries you used to determine the answer, as well as the answer itself. You may need several queries or temporary tables to answer some parts; show all your work.

  1. What is the name of the airport which, over the given month’s data, has the greatest average delay in arriving flights? Limit your answer to only include airports with at least 100 arriving flights over the course of the month.
  2. Which 3 US cities have the greatest number of inbound flights across all airports serving that city? (Some cities have multiple airports, which is what the city_market_id is supposed to assist with.)
  3. Which 5 airports have the greatest number of outgoing flights per day on average? How many flights (on average) are leaving those airports each day?
  4. For each major airline carrier, which two airports represent the endpoints of that carrier’s longest flight (by distance)? There are 17 distinct carriers in the month represented in the dataset, so your output table should also have 17 rows with the name of the carrier and the name of both airports. Dealing with round trip duplicates here can be a little tricky, so think about how you could reliably filter out one of the directions. It doesn’t matter whether you choose PDX to SEA or SEA to PDX (for instance), you just need one of the pairings for each carrier. Export the results to a CSV named longest_flights.csv with a header and upload it back to GitHub.
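For the CSV export in part 4, one convenient option is psql’s \copy meta-command; the SELECT below is only a placeholder for whatever final query you write:

```sql
-- \copy (a psql meta-command, typed on one line) writes query results
-- to a client-side file. Replace the placeholder with your real query.
\copy (SELECT 'replace me' AS placeholder) TO 'longest_flights.csv' CSV HEADER
```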

Problem 2: That’s Not Normal

One of the more common ways that you will receive non-normalized data is through various JSON formats. Consider then the example JSON response shown below, which might depict a response from an internal API containing information about student assignment submissions and corresponding feedback and rubric items.

[
  {
    "student_id": "2948294",
    "name": "Janet",
    "submitted_date": "2022-09-22",
    "assignment": "Homework 1",
    "total_available_points": 20,
    "deductions": [
      {
          "id": 1, 
          "comment": "Didn't read instructions well", 
          "deduction": -1
      },
      {
          "id": 2, 
          "comment": "Missed a negative", 
          "deduction": -0.5
      }
    ]
  }, ...etc...
]

The repository has a more complete example that you can look through, which includes multiple submissions. Note that the same deduction item can show up multiple times for a given assignment. Your task here is to take this information and progressively create from it tables in the various normal forms. I will not ask you to show the specific contents of these tables here; I am only interested in the structure of the tables. The easiest way to represent table structure (and the relations between tables) is through ER diagrams, so for each of the parts below, upload to the repository an ER diagram image created by a tool such as DrawSQL.

  1. The first step is always to get the data / table into first normal form. As a reminder, in this form you just need one piece of information per cell, plus a primary key (possibly a compound one) that uniquely identifies each row of the table. Create an ERD for a table that would hold the above data. This will be just a single table, so you really only need to specify the column names and types and determine what your primary key will be.

    Warning

    Deduction items can show up multiple times in a submission! But each student will only submit an assignment once.

  2. To put the table into 2NF, you need to ensure that there are no partial dependencies present, where some columns only depend on a subset of the columns that comprise a compound primary key. If you do not have a compound primary key, then your table is already in 2NF. Break the above into more tables as necessary to ensure that you no longer have any partial dependencies in any table. Be sure to include in the ERD which tables and columns are related to others through foreign key dependencies.

  3. Finally, to ensure that your tables are in 3NF, you must ensure that there are no transitive dependencies, where one column actually depends on another (non-primary key) column, instead of the primary key. Continue to break apart your tables as necessary to ensure that there are no transitive dependencies between any non-primary key columns. Again, be sure to include in your ERD the foreign key constraints that indicate what rows are related.
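As a generic illustration of removing a transitive dependency (these tables are made up and unrelated to the assignment data), a state column that really depends on the city, rather than on the customer, would be split out like so:

```sql
-- Before: customers(customer_id, name, city, state) — state depends on
-- city, not on customer_id (a transitive dependency).
-- After: move the city -> state fact into its own table.
CREATE TABLE cities (
    city_id   integer PRIMARY KEY,
    city_name text NOT NULL,
    state     text NOT NULL
);

CREATE TABLE customers (
    customer_id integer PRIMARY KEY,
    name        text NOT NULL,
    city_id     integer REFERENCES cities (city_id)
);
```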

Don’t forget to include an image of the ERD for each step of the process above! If you could upload them back to GitHub with names like Prob3_1NF.png, Prob3_2NF.png, and Prob3_3NF.png, that would be very appreciated!

Problem 3: Placing Jay’s Son

In some ways, this problem might be considered an extension of Problem 2. In fact, if you have badly messed up something in Problem 2, it will likely come to light here. Your task in this problem is to create the 3NF tables you described in Problem 2, and then populate them with the contents of Prob3_Submissions.jsonlines. Some warnings and tips as you do so:

  • Warning

    Remember that you’ll need to populate the tables that are referenced before you can populate the table that references those other tables!

    Tips
    • Keep in mind that when you extract things from JSON, the result will generally be of jsonb type. So you’ll need to convert (cast) values before they can be inserted into your table.
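      For example (the key names here just mirror the sample JSON above), the -> operator returns jsonb, while ->> returns text, which can then be cast:

      ```sql
      -- ->> extracts a value as text; cast it to the column's actual type.
      SELECT
        (sub ->> 'student_id')::int      AS student_id,
        (sub ->> 'submitted_date')::date AS submitted_date
      FROM (
        SELECT '{"student_id": "2948294", "submitted_date": "2022-09-22"}'::jsonb AS sub
      ) AS t;
      ```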

    • “Unpacking” a JSON array (square brackets []) across multiple SQL rows can be extremely useful to do these sorts of bulk operations. The easiest way to do this is with the jsonb_array_elements() function. Simply pass in a JSON array to the function, and it will unnest the elements across multiple rows. For example, the code:

      SELECT
        jsonb_array_elements(
          '[{"id":1, "name":"Jed"}, {"id":2, "name":"Luke"}]'
        ) AS people;

      would give a table of two rows:

      people
      ----------------------------
      {"id": 1, "name": "Jed"}
      {"id": 2, "name": "Luke"}

      You can also then immediately chain another extraction onto this to get, for instance, just all of the ids in a column. More documentation is here.
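      For example, chaining ->> onto the unnested elements extracts just the names, one per row:

      ```sql
      -- Unnest the array, then pull out one key as text from each element.
      SELECT
        jsonb_array_elements(
          '[{"id":1, "name":"Jed"}, {"id":2, "name":"Luke"}]'
        ) ->> 'name' AS person_name;
      ```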

    • If you followed a pattern similar to what we did with the in-class normalization example, you would need to populate a column called something like “deduction number”. Populating that column in SQL would, as far as I know, require Window Functions, which we haven’t quite gotten to yet (but will!). As such, in the provided JSON I have provided a key for this value in each of the deductions. So you can use it if you need it, or if your normalization scheme didn’t require it, then you can ignore it.

    • There is a LOT of duplicated information in the JSON. As such, when unpacking it, don’t be surprised if you have lots of duplicated rows! DISTINCT will be your friend here!

    • Lastly, I encourage you to really focus on writing a SELECT statement that returns the information exactly as you want to insert it into your table. This gives you an easy way to test that each of your queries produces content that can actually be inserted into your table. I wrote all of my queries in the same tab so that, once I was ready, I could just run them all and see if everything inserted properly.
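      The pattern being described here is INSERT INTO ... SELECT; the table and column names below are hypothetical, but once your SELECT works on its own, the shape would look something like:

      ```sql
      -- Develop and test the SELECT on its own, then prepend the INSERT.
      INSERT INTO students (student_id, name)
      SELECT DISTINCT
        (sub ->> 'student_id')::int,
        sub ->> 'name'
      FROM raw_submissions;  -- hypothetical staging table holding the raw jsonb
      ```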