This week’s data heralds largely from the Museum of Modern Art, which makes their collection data1 available publically here, though I have already included everything you need in this assignment’s repository. The data initially comes in two pieces, with one containing information about the works of art and the other containing information about the artists. As you will see though, some work can be done to clean things up and make the data easier to work with, so this assignment will be an exercise in two parts: cleaning up the data and then gaining some insight from it. There is a lot of text here, but most of it is explanatory to try to give you plenty of guidance, so don’t be intimidated.
Accept AssignmentProblem 1: Table Restructuring
The first step is to do a bit of restructuring of our tables to get them into something more closely resembling 3rd normal form (though they still won’t necessarily be perfect). You are starting mostly from scratch here, but I have included template code in the repository that you can copy and run to create the necessary starting tables with data types that will successfully import the data. I will detail further cleanup steps in each of the parts below. For each clean-up step, include the command(s) you use to perform that operation in the template.
Create the initial
artistsandartworkstables in whatever database you like, and copy the information in from the included CSV files into the appropriate tables. If you are using the template code I provided for you, this should be simple, and you need to only include the code you used to copy in the table information. If you made your own tables with your own chosen names, please include the code for creating them as well.Perusing through the tables, you will likely notice one thing fairly quickly: there is a lot of duplicated information in the
artworkstable that actually describes the artists (and is also present in theartiststable. Moreover, if you look around a bit more in theartworkstable, you will see something unfortunate: some pieces of art have multiple artists, and so all of that information is smashed into compound strings within each column of theartworkstable. Good table design in relational databases dictates that, in almost all circumstances, you want to have one (and only one) value in each column of a record. In order to accomplish this, we are going to need to generate a third table, which we might callartwork_artists, which will have only 2 columns: one containing the artworkobject_ids, and the other containing the artistconstituent_ids. If multiple artists worked on the same artwork, then that same artworkobject_idshould simply appear on multiple rows with a differentconstituent_idon each. This is the same idea as to how we combined students and classes earlier in the semester.The difficulty lies in how to accomplish this easily. We have all the needed information, but it is buried inside
artworks.constituent_id. If you have worked with strings in other programming languages, you might want to do something like splitting apart the currentconstituent_idfield at the commas, to get each individual constituent id. We can actually do the same in Postgres! The functionstring_to_table(some_text, text_to_split_on)will return a table of the individual pieces ofsome_textsplit at each occurrence oftext_to_split_onand spread across multiple rows. If you are selecting multiple columns, the other column’s information gets duplicated over each row. For example, runningSELECT STRING_TO_TABLE('Cake,Pie,Cookies', ',') AS desserts, 5 AS amount;would give the table:
desserts amount Cake 5 Pie 5 Cookies 5 Create the
artwork_artiststable such that it has one row for each artwork id and contributing artist.The
constituent_ids inartwork_artistsare just increasing integers, so it would be nicer to work with them as actual integers instead of the strings they came in as when you split the original string. (And importantly, this removes any extra spaces that might be still lurking around as well.) Change that column data type to be an integer (you may need to use something likeUSING constituent_id::INTat the end of yourALTER TABLEstatement to get it to go through).You now have three related tables, but no constraints are in place. Alter the tables to add primary keys
object_idandconstituent_idto theartworksandartiststables, respectively. Forartwork_artists, create a compound primary key that includes both columns, which we can do since we just confirmed that those combinations are all unique. Add in the necessary foreign keys as well to link theartwork_artiststable columns back to the necessary reference table columns.Now that we have better related the tables, there is really no need for the
artworkstable to still have all the columns containing information that is already in theartiststable. Remove those 7 columns fromartworks.
Problem 2: Content Clean-up!
While you have gotten the basic table structure in a better place, there is some content in some of the columns that is not formatted as nicely as it could be. The below steps will describe what and how to fix some of these issues. For each clean-up step, include the command(s) you use to perform that operation in the template.
If you check for missing information in the
artiststable, you’ll realize that there is plenty, but much of it is information that we probably do not have. One thing you might notice though is that there are artists with no recordednationality, but who do have something about a nationality mentioned in their bio. These probably could be fixed, but there are a lot of them and no great automatic way to do so, so I’m not going to ask that of you! One other aspect of this table though is that missing or “not yet occurring” birth and death dates are recorded with 0’s, which seems potentially problematic if you wanted to do calculations with those columns. Change all instances of 0 in the birth year (begin_date) or death year (end_date) toNULLinstead.Moving over to the
artworkstable, take a look at thedatecolumn, which is when the artwork was supposedly created. This column is an absolute mess of formatting. Some are years, some are ranges of years, some are specific days, and everything in between. For the most part though, each record contains at least one 4 digit date somewhere amidst the text. While grabbing just this 4 digit date may not be perfectly precise, it would surely be more useful than the current state of things. How can we accomplish that date extraction though? We can match off of patterns, but we haven’t talked yet about how to extract a pattern. We’ll be talking about working with text and regular expressions after Spring Break, but for now, you can extract the first 4 digit date from any string using:substring(date_text FROM '\d{4}')::INTThis extracts the part of the string that matches the regular expression
\d{4}, which stands for “four numeric digits adjacent to one another”. Add a new column nameddate_inttoartworksand populate its rows with the first 4 digit date that appears in the row’s originaldatecolumn. If you want a quick check that this worked, if I sum all those years in the entire column, I get a value of 291029466.You may also notice that there are a lot of missing entries in the size columns at the end of
artworks. A closer inspection of these columns along with thedimensionscolumn seems to indicate that the artwork largely falls into 4 categories:- 2 dimensional and squarish, in which case the
width_cmandheight_cmcolumns are used - 2 dimensional and round, in which case the
diameter_cmcolumn is used - 3 dimensional, in which case the
length_cm,height_cmanddepth_cmcolumns are used - No physical size, in which case none of them are used
Interestingly, the
circumference_cmcolumn is unpopulated for many pieces of 2D art that have their sizes indicated. Rename this column to beperimeter_cm, and then, assuming them to be perfect rectangles or circles, populate the perimeter values for all pieces of art that do not already have a circumference/perimeter defined but which have the other necessary columns defined. As a reminder, to compute a perimeter:- For a rectangular object, it is 2 times the width + 2 times the height
- For a circular object, it is the diameter times \pi, which you can access with
pi()in Postgres
- 2 dimensional and squarish, in which case the
Problem 3: Query Time
The tables are looking far better now than when you started! Time to proceed to use the data set to answer the following questions. As per usual, show all queries that you used to arrive at your solution.
What is the title of the piece of art with the most collaborators/contributors at the Museum of Modern Art?
What is the median age of a piece of art when the museum acquires it?
What artist has pieces in the MoMA collection across the greatest number of different classifications?
For pieces of art that have a defined perimeter, what is the title(s) of the artwork with the 30th largest (no skipping ranks) perimeter in each department? Are there any ties? Are all departments represented?
Footnotes
Robot, O. D. (2024). MoMA Collection - Automatic Update [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13933308↩︎