Homework 07

Fundamentals of Data Engineering

This week’s data heralds largely from the Museum of Modern Art, which makes their collection data1 available publically here, though I have already included everything you need in this assignment’s repository. The data initially comes in two pieces, with one containing information about the works of art and the other containing information about the artists. As you will see though, some work can be done to clean things up and make the data easier to work with, so this assignment will be an exercise in two parts: cleaning up the data and then gaining some insight from it. There is a lot of text here, but most of it is explanatory to try to give you plenty of guidance, so don’t be intimidated.

Accept Assignment


Problem 1: Table Restructuring

The first step is to do a bit of restructuring of our tables to get them into something more closely resembling 3rd normal form (though they still won’t necessarily be perfect). You are starting mostly from scratch here, but I have included template code in the repository that you can copy and run to create the necessary starting tables with data types that will successfully import the data. I will detail further cleanup steps in each of the parts below. For each clean-up step, include the command(s) you use to perform that operation in the template.

  1. Create the initial artists and artworks tables in whatever database you like, and copy the information in from the included CSV files into the appropriate tables. If you are using the template code I provided for you, this should be simple, and you need to only include the code you used to copy in the table information. If you made your own tables with your own chosen names, please include the code for creating them as well.

  2. Perusing through the tables, you will likely notice one thing fairly quickly: there is a lot of duplicated information in the artworks table that actually describes the artists (and is also present in the artists table. Moreover, if you look around a bit more in the artworks table, you will see something unfortunate: some pieces of art have multiple artists, and so all of that information is smashed into compound strings within each column of the artworks table. Good table design in relational databases dictates that, in almost all circumstances, you want to have one (and only one) value in each column of a record. In order to accomplish this, we are going to need to generate a third table, which we might call artwork_artists, which will have only 2 columns: one containing the artwork object_ids, and the other containing the artist constituent_ids. If multiple artists worked on the same artwork, then that same artwork object_id should simply appear on multiple rows with a different constituent_id on each. This is the same idea as to how we combined students and classes earlier in the semester.

    The difficulty lies in how to accomplish this easily. We have all the needed information, but it is buried inside artworks.constituent_id. If you have worked with strings in other programming languages, you might want to do something like splitting apart the current constituent_id field at the commas, to get each individual constituent id. We can actually do the same in Postgres! The function string_to_table(some_text, text_to_split_on) will return a table of the individual pieces of some_text split at each occurrence of text_to_split_on and spread across multiple rows. If you are selecting multiple columns, the other column’s information gets duplicated over each row. For example, running

    SELECT 
      STRING_TO_TABLE('Cake,Pie,Cookies', ',') AS desserts,
      5 AS amount;

    would give the table:

    desserts amount
    Cake 5
    Pie 5
    Cookies 5

    Create the artwork_artists table such that it has one row for each artwork id and contributing artist.

  3. The constituent_ids in artwork_artists are just increasing integers, so it would be nicer to work with them as actual integers instead of the strings they came in as when you split the original string. (And importantly, this removes any extra spaces that might be still lurking around as well.) Change that column data type to be an integer (you may need to use something like USING constituent_id::INT at the end of your ALTER TABLE statement to get it to go through).

  4. You now have three related tables, but no constraints are in place. Alter the tables to add primary keys object_id and constituent_id to the artworks and artists tables, respectively. For artwork_artists, create a compound primary key that includes both columns, which we can do since we just confirmed that those combinations are all unique. Add in the necessary foreign keys as well to link the artwork_artists table columns back to the necessary reference table columns.

  5. Now that we have better related the tables, there is really no need for the artworks table to still have all the columns containing information that is already in the artists table. Remove those 7 columns from artworks.

Problem 2: Content Clean-up!

While you have gotten the basic table structure in a better place, there is some content in some of the columns that is not formatted as nicely as it could be. The below steps will describe what and how to fix some of these issues. For each clean-up step, include the command(s) you use to perform that operation in the template.

  1. If you check for missing information in the artists table, you’ll realize that there is plenty, but much of it is information that we probably do not have. One thing you might notice though is that there are artists with no recorded nationality, but who do have something about a nationality mentioned in their bio. These probably could be fixed, but there are a lot of them and no great automatic way to do so, so I’m not going to ask that of you! One other aspect of this table though is that missing or “not yet occurring” birth and death dates are recorded with 0’s, which seems potentially problematic if you wanted to do calculations with those columns. Change all instances of 0 in the birth year (begin_date) or death year (end_date) to NULL instead.

  2. Moving over to the artworks table, take a look at the date column, which is when the artwork was supposedly created. This column is an absolute mess of formatting. Some are years, some are ranges of years, some are specific days, and everything in between. For the most part though, each record contains at least one 4 digit date somewhere amidst the text. While grabbing just this 4 digit date may not be perfectly precise, it would surely be more useful than the current state of things. How can we accomplish that date extraction though? We can match off of patterns, but we haven’t talked yet about how to extract a pattern. We’ll be talking about working with text and regular expressions after Spring Break, but for now, you can extract the first 4 digit date from any string using:

    substring(date_text FROM '\d{4}')::INT

    This extracts the part of the string that matches the regular expression \d{4}, which stands for “four numeric digits adjacent to one another”. Add a new column named date_int to artworks and populate its rows with the first 4 digit date that appears in the row’s original date column. If you want a quick check that this worked, if I sum all those years in the entire column, I get a value of 291029466.

  3. You may also notice that there are a lot of missing entries in the size columns at the end of artworks. A closer inspection of these columns along with the dimensions column seems to indicate that the artwork largely falls into 4 categories:

    • 2 dimensional and squarish, in which case the width_cm and height_cm columns are used
    • 2 dimensional and round, in which case the diameter_cm column is used
    • 3 dimensional, in which case the length_cm, height_cm and depth_cm columns are used
    • No physical size, in which case none of them are used

    Interestingly, the circumference_cm column is unpopulated for many pieces of 2D art that have their sizes indicated. Rename this column to be perimeter_cm, and then, assuming them to be perfect rectangles or circles, populate the perimeter values for all pieces of art that do not already have a circumference/perimeter defined but which have the other necessary columns defined. As a reminder, to compute a perimeter:

    • For a rectangular object, it is 2 times the width + 2 times the height
    • For a circular object, it is the diameter times \pi, which you can access with pi() in Postgres

Problem 3: Query Time

The tables are looking far better now than when you started! Time to proceed to use the data set to answer the following questions. As per usual, show all queries that you used to arrive at your solution.

  1. What is the title of the piece of art with the most collaborators/contributors at the Museum of Modern Art?

  2. What is the median age of a piece of art when the museum acquires it?

  3. What artist has pieces in the MoMA collection across the greatest number of different classifications?

  4. For pieces of art that have a defined perimeter, what is the title(s) of the artwork with the 30th largest (no skipping ranks) perimeter in each department? Are there any ties? Are all departments represented?

Footnotes

  1. Robot, O. D. (2024). MoMA Collection - Automatic Update [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13933308↩︎