Homework 03

Fundamentals of Data Engineering

As per usual, I have provided a markdown template file for each problem on this assignment. The first problem involves the BASH shell, while the other two deal with calculations in Postgres. I’m going to ask that you only use techniques that we have talked about in class on this week’s homework. We will discuss GROUP BY later, which makes some of this easier, but working with more limited options is good practice when learning these specific skills.

Accept Assignment


The data CSV for problems 2 and 3 this week is too large to store in a GitHub repository, so I’ve stored it elsewhere, and I am linking you to it below. It is a large CSV (about 350 MB), so make sure you have a decent internet connection when you go to download it. I am also linking you to the official data dictionary for this table.

Taxi Data CSV
Taxi Data Dictionary

Problem 1: Name Bashing

You will need a BASH or ZSH shell for this problem. On Unix-type systems this should be simple, but on Windows you should install WSL2 to get a Unix shell. There are some other methods to get a BASH shell, but you will need WSL2 when we get to Docker later in the semester anyway.

The file Names.txt in the repository has 1000 first names within it, one name per line in random order. Write a BASH shell command which determines the number of duplicated names (those that show up more than once) within a file and prints that number (and only that number) to the shell. In addition to the programs introduced in class, you might find the BASH programs uniq and sort useful, and may want to consult their documentation for proper use. Piping information from program to program will also be useful. If you check your command against the actual Names.txt, you should get 197.
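To see how these programs compose, here is a toy demonstration on a made-up file; the file name, its contents, and the awk filter are all just for illustration, not the full solution:

```shell
# Toy demonstration of the pieces, NOT the full solution: the file,
# its contents, and the awk filter here are made up for illustration.
printf 'Ann\nBob\nAnn\nCal\nBob\nAnn\n' > toy_names.txt

# uniq only collapses *adjacent* repeats, so the input must be sorted first.
# uniq -c prefixes each distinct line with how many times it occurred.
sort toy_names.txt | uniq -c

# Keep only the lines whose count exceeds 1, then count those lines.
dupes=$(sort toy_names.txt | uniq -c | awk '$1 > 1' | wc -l)
echo "$dupes"   # 2 duplicated names in the toy file (Ann and Bob)
```

The same shape of pipeline, pointed at Names.txt, gets you most of the way there.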

Problem 2: In a New York Minute

The data for this problem is based on all the yellow taxi rides that took place within New York City over the month of January 2024. Once you’ve downloaded the data, you’ll need to begin by creating a new table named taxi_rides and populating it with the necessary columns and corresponding data types. The provided (and official!) data dictionary should help here with getting column data types correct, though you will probably still want to look at the CSV itself. In particular, note that the columns in the CSV are not entirely in the same order as provided in the data dictionary. Additionally, you should add a SERIAL or GENERATED AS IDENTITY column called ride_id, which will just be a unique integer assigned to each ride so that each one can be uniquely identified. Note that such a column does not exist in the CSV.
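The table setup might be shaped like the sketch below. The column names and types here are assumptions (the list is abbreviated); check every name, order, and type against the official data dictionary and the CSV header before using anything like this:

```sql
-- Sketch only: the column list is abbreviated and the names are assumptions.
CREATE TABLE taxi_rides (
    ride_id BIGINT GENERATED ALWAYS AS IDENTITY,
    vendor_id INT,
    pickup_time TIMESTAMP,
    dropoff_time TIMESTAMP,
    passenger_count INT,
    trip_distance NUMERIC
    -- ...remaining columns per the data dictionary...
);

-- In psql, import the CSV into the non-generated columns, e.g.:
-- \copy taxi_rides (vendor_id, pickup_time, dropoff_time, ...) FROM 'taxi_data.csv' WITH (FORMAT csv, HEADER true)
```

Note that ride_id is left out of the import column list so Postgres generates it for you.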

Importing the CSV data may take a little time, as it is a large file. Once everything is imported, you can proceed to answering the following questions.

  1. This data supposedly contains all yellow taxi rides that took place in the month of January, 2024. How many trips are included in the data which both started and ended outside the supposed January 2024 time span? (Some of these are pretty wild if you look at their times!)
  2. What percentage of the total trips had a disputed charge? (Consult the data guide for how to identify a disputed charge!)
  3. What was the median amount charged per passenger across all rides? Ignore any rides that report 0 passengers.
  4. What is the most common pickup location? Drop-off location? What about the most common route (pair of pickup + drop-off locations)? Hint: You can get the mode of a pair of values together if you include them both in a set of parentheses in the ordering statement, e.g. (ORDER BY (col1, col2)) If you want to know what these correspond to, there is a separate officially provided location lookup table here.
  5. How many taxi rides seemingly took their passengers backward in time?! What was the average duration that they traveled back in time?

Problem 3: Speedin’ Taxis

This question is still using the taxi rides data set, but is a bit more involved, so I wanted to give it more room for an explanation. Suppose you wanted information on the statistical speeds of taxi drivers across the city. You have access to both distances and times, so calculating a speed should be straightforward. However, there are a few things to consider:

  • As seen above, not all reported time durations correspond to real physical trips. So you should only consider rides that were at least 30 seconds long.

  • Also, some trip distances are somehow reported as 0. Those should be ignored.

  • Finally, by default, when you subtract two timestamps you will get an interval. Unfortunately, the way that Postgres stores intervals is not conducive to the sort of arithmetic you need to do here, as you need to be able to divide by a time in a known unit. So to get a duration in hours that you can actually use to calculate a speed, you can do the following:

    (EXTRACT( EPOCH FROM (dropoff_time - pickup_time)) / 3600 )

    where dropoff_time and pickup_time are whatever you named those columns in your table. We’ll talk more about how these functions involving times and intervals work later in the semester (Ch 12), but for now you can use the above to get the trip duration in hours, which will allow you to compute a speed in miles per hour.
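Putting that expression to work, a per-ride speed query might be shaped like this; the column names are whatever you chose for your table, and the WHERE filters come from the bullet points above:

```sql
-- Hypothetical column names; adjust to match your own table.
SELECT ride_id,
       trip_distance
         / (EXTRACT(EPOCH FROM (dropoff_time - pickup_time)) / 3600) AS mph
FROM taxi_rides
WHERE trip_distance > 0
  AND dropoff_time - pickup_time >= INTERVAL '30 seconds';
```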

Your primary objective here is to identify all the trips that meet the above criteria but whose speeds are outliers of the main speed distribution. Here we are going to define outliers as points that fall below the lower whisker or above the upper whisker on a classic boxplot, for which I am including a diagram below. In this case, the whiskers are located 1.5 IQR below the 1st (25%) quartile and 1.5 IQR above the 3rd (75%) quartile, respectively.

[Diagram: a classic boxplot. Q1, the median, and Q3 span the IQR; the whiskers extend 1.5 IQR below Q1 and 1.5 IQR above Q3; points beyond either whisker are outliers.]
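Computing the whisker bounds might look like the sketch below, assuming you have some relation of per-ride speeds to work from; ride_speeds and mph are placeholder names:

```sql
-- ride_speeds is a placeholder: substitute your own speed query,
-- a temporary table, or a view holding the valid rides and their mph.
SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY mph) AS q1,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY mph) AS q3
FROM ride_speeds;
-- lower whisker = q1 - 1.5 * (q3 - q1)
-- upper whisker = q3 + 1.5 * (q3 - q1)
```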

You need to identify all the valid rides that fall into these ranges (and are thus outliers), and write them out to a CSV file named bad_taxi_mphs.csv. The file should be ordered by increasing speed and should include just the taxi ride ID and the speed of that ride. Include a header row at the top, and make sure to upload your CSV back to GitHub (it shouldn’t be that large).
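One way to get query results out to a CSV with a header is psql’s \copy in its output direction. The inner SELECT below is a placeholder for whatever produces your outlier rows:

```sql
-- Run inside psql. my_outlier_results is a placeholder name; substitute
-- your own outlier query, temporary table, or view.
\copy (SELECT ride_id, mph FROM my_outlier_results ORDER BY mph) TO 'bad_taxi_mphs.csv' WITH (FORMAT csv, HEADER true)
```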

Do not feel like you need to do this all in a single query! Break it up as you need! And you are free to do simple calculations like arithmetic either with your own calculator or using Postgres as a calculator.