Homework 01

Fundamentals of Data Engineering

All work and answers for this assignment should be added to the corresponding template files included in the downloaded repository. All template files are in a basic Markdown format. If you have never used markdown before, you can get a breakdown of the basic formatting options here. Markdown files are plain text files and can be opened directly in anything that supports plain text files: Notepad would work, or something like VSCode, Sublime Text or Notepad++. If you are familiar with R, you can also open them in RStudio if you’d prefer to edit them in that fashion. Just remember to copy and paste your SQL commands from Beekeeper/pgAdmin into fenced code blocks in the markdown file (which I’ve provided for you on this first assignment). Just upload the altered files back to GitHub (please don’t change the filenames) in order to submit your work!

Accept Assignment


Problem 1: An Introduction

We’ll start off with a super easy question just designed to help me get to know you a bit better. Obviously there are no right or wrong answers here! I’m just trying to get to know you a bit better.

  • What name do you prefer to go by? How do you pronounce it?
  • What pronouns do you prefer I use when addressing you?
  • What are you hoping to get out of this course?
  • What do you enjoy doing when not learning about data science?

Problem 2: Database Setup

Before you can do too much, we need to ensure that you have a version of the PostgreSQL server running on your computer and can connect and run SQL queries on that server using either Beekeeper, pgAdmin, or some other client of choice. The textbook has instructions in Chapter 1 for each operating system about how you can install the PostgreSQL server, which you should follow. It also has instructions for setting up and using pgAdmin, or you can follow the guidance I gave in class or on the webpage for using Beekeeper. Regardless of which method you use, ensure that you can run an SQL command by entering in and running the below command:

SHOW server_version;

If you get an output, excellent! Record it in the template file. If you do not, something is still broken, so consider contacting me so that we can get that figured out before you attempt the rest of this assignment.

Before you continue, the book recommends creating a new database in which you can keep all the work pertaining to this semester, and I agree with that logic. If you haven’t already, create a new database called analysis that all your tables can live in going forward.

CREATE DATABASE analysis;

Problem 3: Rocking Out

Included in the repository is a metal_bands.sql script that will create a table of heavy metal bands that were scraped from the website metalstorm.net. You can either open the script in your client and run all the commands inside (there are just two, one to make the table and one to populate it) or you could just copy and paste the commands into your client and run them yourself. Either way, you should add the metal_bands table to your analysis database so that you can use it to answer the following questions. For each question, include the following:

  • The answer to the question
  • What query (or queries) you used to arrive at your answer.
  • A short description/explanation giving any other information needed concerning how you took the output of the query(ies) to arrive at your answer.

Part A

What is the earliest formed band in the data set?

Part B

I discovered an interesting fact while preparing these questions, which is that this dataset has some fully duplicated rows! How many rows are duplicates? How did you figure it out?

Part C

How many bands have the colors black or white appearing somewhere in their name (not caring about capitalization)?

Part D

What is the second most popular band originating from Germany that has a “Viking folk” style? (Possibly not their only style!)

Part E

How many metal bands from the USA existed for at least a year before splitting, but nevertheless have less than 10 fans?