There are 4 problems on this assignment. The first is about choosing columns and data types to fit some target questions. The remaining three problems are about building practice with the same general steps: creating tables with appropriate data types, importing data into those tables, and then doing some queries to answer some questions or export some data. The data to import for all three problems resides in the data folder in the repository, but you may have to copy this elsewhere on your system to properly import it. Also in the data folder are corresponding markdown files that briefly describe each of the columns in the CSV, which may help as you name your columns or pick your data types. For the last 3 problems, you should show the commands you used to create and populate the table, as well as whatever query commands you used to answer the individual questions. Upload your created CSV files as well as your filled out template files all back to GitHub.
Problem 1: Conscientious Eating
Suppose you wanted to track the food you purchase and eat over the course of several months, and decide that storing such information in a database might prove useful. In particular, you are interested in being able to answer questions such as:
- How much am I spending in a given week on certain types of food?
- Should I be shopping more for certain items at certain stores because they are generally cheaper there?
- How much should I be budgeting monthly for food?
Write out a possible CREATE TABLE statement that includes all the columns you think you might need to answer the above questions. Include a comment for each column about why you chose the data type that you did for that column.
Then write out an example INSERT INTO statement that adds at least 6 rows to the table. These rows do not need to contain information from your own purchase habits (though they could), but should be consistent with the table you created above.
Problem 2: Gotta Catch ’em All
The Pokemon dataset, within the data folder (pokemon.csv), contains all the available Pokemon through Gen 6 (I know this isn’t the most up to date, but other datasets had around 40-50 columns, which I didn’t want you to have to import!). Create a table within your database called pokemon and set reasonable data types for all the available columns before copying the data in from the CSV file. Sometimes it can be hard to anticipate the best data type for a column, and so occasionally your import might fail because of an illegal data type. It happens! If that occurs to you, the easiest fix is to just delete (or in SQL parlance, DROP) the table by doing DROP TABLE your_table_name;, and then recreate the table with better data types. After you have successfully gotten the table created and the data imported, use it to answer the following questions. For questions asking about the numbers of results, most clients will report back the number of rows from a query somewhere, so you can just read off those values.
- What are the names of the Pokemon have water and poison as their two types (order doesn’t matter)?
- Save to a CSV file (named
type_combos.csv) a table of all the unique combinations of dual-type Pokemon (Pokemon with two different types). Your file should include just the two types, and should be ordered alphabetically. No header line needs to be present at the top. Don’t forget to upload this file!
Problem 3: A Real Cinephile
The movies dataset (movies.csv) contains over nine thousand movies with information pulled from The Movie Database website. Create a table within your database called movies, set some reasonable data types for the columns, and then import the data from the CSV file. Once you have the data imported, answer the following questions.
- What are the top 5 movies with the highest vote average and minimum of 10000 votes to come out since the start of 2010?
- Write to a CSV (called
spider_not_man.csv) a table of all movies that reference spiders in either the title or the overview, but which are not a Spider-Man movie. Your table should include 3 columns: the release date, the title, and the popularity, and be ordered from oldest release date to newest. Include a header line at the top. Don’t forget to upload this file!
Problem 4: New Year New You
Finally, the New Year’s Resolutions dataset (resolutions.csv) contains just over 5k tweets that people made at the start of 2015 containing the hashtag #NewYearsResolution. Create a table within your database called resolutions, set some reasonable data types for the columns, and then import the data from the CSV file. (Warning: the creator of this dataset took some interesting choices in how certain columns are encoded. Be prepared to need to adjust some column types…) Once you have the data imported, answer the following questions.
- Who was the last person to post a primary topic/theme pertaining to “time management” before the start of the new year?
- Some individuals included more hashtags than just the
#NewYearsResolutiontag. Export to a CSV calledhashtag_happy.csva table of all the individuals from Oregon who included more than one hashtag in their tweet. Some clever pattern matching will be your friend here! Your table should include the username and resolution category and should be ordered alphabetically by the username. Include a header line at the top. Don’t forget to upload the file!