Grouping JSON Model Indexes

Jed Rembold

February 19, 2025

Announcements

Homework 5 due tonight!
Homework 6 due next Wednesday
No class next week!
- I’ll be in Pittsburgh for SIGCSE
- Instead, next Wednesday you’ll be sent the description and requirements of the midterm tutorial video assignment
- You’ll have until the following Wednesday (March 5th) to submit a link to that upload

Building a Model

Data Modeling

Data Modeling is the act of best deciding how to represent and store data such that it relates to the real world
As much as possible, it is generally desirable to:
- ensure the model can easily help answer the types of questions you will want to ask
- operate at as low a level of granularity as possible
- remove redundancy
- ensure referential integrity

Normalization

First introduced by Edgar Codd in the early 1970s
Codd outlined the following goals:
- to free the collection of relations from undesirable insertion, update, and deletion dependencies
- To reduce the need for restructuring the collection of relations as new types of data are introduced
- To make the relational model more informative to users
- To make the collection of relations neutral to what queries are being run on them
To assist in these efforts, Codd introduced the idea of normal forms

What is normal?

Each normal form builds on those before
The primary normal forms are:

Denormalized: No normalization. Nested and redundant data is allowed
First normal form (1NF): Each column is unique and has a single value. The table has a unique primary key.
Second normal form (2NF): Partial dependencies are removed (only necessary if compound primary key)
Third normal form (3NF): Each table contains fields only relevant to its primary key, and has no transitive dependencies

Denormalized

APIs will commonly give denormalized data, since they tend to return JSON type entries
Consider the entry to the right about a particular order

{ "OrderID": 100,
  "OrderItems": [
    {
      "sku": 1,
      "price": 10,
      "quantity": 1,
      "name": "Thingmajig"
    },
    {
      "sku": 2,
      "price": 25,
      "quantity": 2,
      "name": "Whatchamacalit"
    }],
  "CustomerID": 5,
  "CustomerName": "Jed Rembold",
  "OrderDate": "2022-11-09" }

Initial Attempt

OrderID	OrderItems	CustomerID	CustomerName	OrderDate
100	`[ { "sku": 1, "price": 10, "quantity": 1, "name": "Thingmajig" }, { "sku": 2, "price": 25, "quantity": 2, "name": "Whatchamacalit" } ]`	5	Jed Rembold	2022-11-09

1st Normal Form

OrderID	Sku	Price	Quantity	ProductName	CustomerID	CustomerName	OrderDate
100	1	50	1	Thingmajig	5	Jed Rembold	2022-11-09
100	2	25	2	Whatchamacalit	5	Jed Rembold	2022-11-09

1st Normal Form (PK)

OrderID	ItemNum	Sku	Price	Quantity	ProductName	CustomerID	CustomerName	OrderDate
100	1	1	50	1	Thingmajig	5	Jed Rembold	2022-11-09
100	2	2	25	2	Whatchamacalit	5	Jed Rembold	2022-11-09

Evaluating 2NF

To be in second normal form, there can be no partial dependencies, where columns depend on only one of the compound keys
Here though the last 3 columns all depend only on OrderID
- Solution: Split to new tables!

2nd Normal Form

OrderID	CustomerID	CustomerName	OrderDate
100	5	Jed Rembold	2022-11-09

OrderID	ItemNum	Sku	Price	Quantity	ProductName
100	1	1	50	1	Thingmajig
100	2	2	25	2	Whatchamacalit

Evaluating 3NF

The third normal form prohibits transitive dependencies, where a column depends on another (that depends on the primary key), instead of depending directly on the primary key
Here we have 2! One in each table:
- ProductName depends on Sku
- CustomerName depends on CustomerID
Solution? More tables!

3rd Normal Form

Sku	ProductName
1	Thingmajig
2	Whatchamacalit

CustomerID	CustomerName
5	Jed Rembold

OrderID	ItemNum	Sku	Price	Quantity
100	1	1	50	1
100	2	2	25	2

OrderID	CustomerID	OrderDate
100	5	2022-11-09

Visually

Beyond

Other normal forms exist (up to 6NF in the Boyce-Codd system)
Most people only care about the first 3 to get data in a “normalized” state
Unless you have specific performance reasons for wanting otherwise, you really should strive for normalized tables in your relational database, as they make maintaining, adding, updating, and adjusting your database and tables much easier

Activity

Think back to the Spotify modeling that we were playing around with last week

Suppose you have the following information you need to store for each track:

Info	Description
Track Name	The name of the track
Artists	Comma separated list of contributing artists
Album	The album this specific track appears on
Duration	The duration of the track in seconds
Genre	The general genre of the track
Track Num	The placement of this track on the album

Your task is to model this information into purely tables of 3NF

Activity Example Data

Sometimes it helps to have some example data to look at:

TName	Artists	Album	Duration	Genre	TNum
Don’t Stop Believin’	Journey	Greatest Hits	249	Rock	2
Don’t Stop Believin’	Journey	Escape	251	Rock	1
Fifteen	Taylor Swift	Fearless (Taylor’s Version)	294	Country Pop	2
Under Pressure	Queen,David Bowie	Greatest Hits	238	Rock	7

Indexing

Consulting an Index

Like in a book, an index is a precomputed guide to help find things faster
We can construct similar precomputed guides for certain columns in SQL to also help find things faster!
There are several different types of indexes
- By default, the assigned index type is a B-Tree index in Postgres
  - B-Tree stands for balanced tree and work best for orderable type data
- Index types like Generalized Inverted Index (GIN) and Generalized Search Tree (GiST) will be discussed later as they come up
- Different columns in the same table can be indexed with different methods

Climbing a B-Tree

Showing the structure and looking up a key in a B-Tree

Creating Indexes

Postgres will automatically index any column that is a primary key or which has the UNIQUE constraint
You can choose to set up indexes on other columns as well, but do so outside of the table creation
```
CREATE INDEX |||index name||| ON |||table name||| (|||column|||);
```
If you decide you want to remove an index, you do so using the index name:
```
DROP INDEX |||index name|||;
```

Benchmarking

How can we objectively test this?
- Postgres has the EXPLAIN keyword to give you information about what the database is doing in the background
- Using EXPLAIN ANALYZE will also give you timing information about how long it took a query to run

EXPLAIN always comes at the start of your query

EXPLAIN ANALYZE
SELECT * FROM |||table name|||
WHERE |||condition|||;

Other SQL variants have their own versions, but most have some method of getting information back about execution time or what is being done under the hood

Benchmarking Caveats

EXPLAIN ANALYZE reports to you the time it took the server to process the query, not necessarily the time it took your client to finish getting and rendering the response!
- Especially for queries that return a huge number of rows, you might see a significant difference
Indexes are optimized for certain types of operations. Just because you index a column doesn’t mean that Postgres will use that index if you are doing an incompatible operation!
- Check more than just the timings from EXPLAIN ANALYZE
  - See comments about Bitmaps and Heaps? Then the index was used.
  - See comments about Parallel or Seq scans? Then the index was not used.

Costs

Indexes always have a cost associated with them, both in initial setup and in every time new data is added to the column
- Also, they are stored information, so they can also inflate your database size
Consider indexing only those columns that receive the heaviest of use in filters or in joining
You never need to worry about indexing primary keys or unique columns, as those are done automatically
- Indexing of foreign keys though is not automatic, and might be a good idea for large datasets
When in doubt, benchmark your queries before and after adding an index to see if you are really gaining much from it!

Benchmarking Activity

Most likely, you already (still) have the NYC Taxi Rides data on your computer
Benchmark that data set under three different queries, for each getting a time both before creating an index and then after creating an index over the filtered column
- Select all the columns for the row where the ride id = 56789
- Select all the columns for the rows where the total amount charged was greater than $40
- Select all the columns for the rows where the total amount charged was greater than $10
Enter your values into the shared spreadsheet here (Just choose an unused trial number)

Break!

Break Time!

Grouping

Birds of a Feather

It is frequently the case that values in a particular table column all belong to a smaller subset of categories or options
- Think months of the year or political parties
With current methods, if you want to compare some sort of aggregate between those categories or options, you need to do it in multiple queries:
```
SELECT AVG(age)
FROM voters
WHERE party = 'D'
```
```
SELECT AVG(age)
FROM voters
WHERE party = 'R'
```
This rapidly becomes intractable if you want to compare across many categories

The Fix

SQL has one last fundamental trick up its sleeve with the GROUP BY command
GROUP BY gathers together all rows with matching values from a particular column
- By itself, this is basically the same as running DISTINCT on a column
- Note that once you make a grouping, you can only select entire columns that are a part of the grouping
```
SELECT |||grouped column|||
FROM |||table name|||
GROUP BY |||grouped column|||;
```
Can visualize as if many smaller tables were created, one for each grouping, which are then joined back together with each table contributing one row

Aggregating Groups

Groups by themselves are not that useful, since we already had DISTINCT

The prime use-case of GROUP BY is to be able to run aggregates across all potential groups simultaneously for comparison

SELECT 
  |||grouped column|||, 
  min(|||some other column|||), 
  avg(|||some other column|||)
FROM |||table name|||
GROUP BY |||grouped column|||;

Causes any aggregate function to just aggregate over the smaller, group tables

Group Example

Looking back at the cereal table, how can we answer:

Which manufacturer sells the most sugar-laden cereal?
What shelf has the most cereals placed on it? What about the greatest calorie average?

MULTIPLE GROUPS

Like with DISTINCT or UNIQUE, you can group by multiple columns
Essentially splits the sub-tables into even smaller tables, one for each combination

SELECT 
  |||grouped col1|||,
  |||grouped col2|||,
  min(|||other col|||), 
  avg(|||other col|||)
FROM |||table|||
GROUP BY |||grouped col1|||, |||grouped col2|||

HAVE or HAVE NOT

You can not filter based on aggregates using WHERE
- WHERE actions happen before any aggregates can be computed
If you want to filter your grouped results, you need to use HAVING
- HAVING filters take place after groups and aggregates have been computed

SELECT 
  |||grouped col|||
  min(|||other col|||), 
  avg(|||other col|||)
FROM |||table|||
GROUP BY |||grouped col|||
HAVING min(|||other col|||) > 50;

Server Ordering

Who’s Son?

JSON in Postgres

One data type that we haven’t discussed yet in Postgres is json
Postgres actually support two json types:
- json: stores data as text
  - Slower to process, but maintains whitespace and key orderings
- jsonb: stores data in a binary format
  - Slightly slow to insert, but much faster to process
  - Can have indexes defined
In general, jsonb is probably preferable unless you really need what the base json type offers

Populating JSON Tables

JSON is frequently given with the outermost type being a list, and then each record inside separated by commas
Postgres will not innately be able to import this type of document
Instead, Postgres COPY works best when the JSON is a newline delimited JSON file (sometimes called jsonlines), where each new record is on a single line, and there is no surrounding []
```
COPY |||table name|||
FROM |||your newline json file|||.json
```
Want to follow along? A jsonlines file is here!

JSON Selections

You can drill down and select specific values from json or jsonb types using two different syntaxes:
- raw_json -> 'key'/index where:
  - key is a string for key-value pairs
  - index is an integer for arrays (0-based!)
- raw_json['key'/index] with the same conventions as above
These can be “chained” to “drill down” to the desired info
```
raw_json -> 'siblings' -> 0 -> 'name'
```
These selections return more json/jsonb objects! Convert as needed
Using ->> will instead return a TEXT type object

JSON Existence

Recall that JSON does not enforce a strict schema, so some entries may have keys or values that others are missing
It can thus be useful to check for the existence of certain keys or values
You can check if an entire chunk is contained within a block of JSON using @>:
```
raw_json @> |||sub-snippet of json|||
```
You can also check if a key or value shows up in the top level of the JSON block using ?:
```
raw_json ? |||top-level key|||
```

JSON Indexes

While you can create a default B-tree index for jsonb columns, they generally wouldn’t be very useful
Instead, we can use a different type of index called a generalized inverted index, or GIN
This type of index works much better when the data is comprised of nested composite items (we’ll see it again when we look at text searching)

Just add USING GIN to your normal command to make an index

CREATE INDEX |||my index||| ON |||mytable||| USING GIN (|||colname|||);

A Question of Timing

Date-Time Reminders

We have about 4 current date or time related data types
Date time types:
- DATE: holds a single individual day
- TIME: holds a single individual time
- TIMESTAMP or variants with TIMESTAMPTZ: holds a combination of date and time, along with a potential time zone
Interval types:
- INTERVAL: holds a duration of time

Extracting Pieces

Having all the information in one value is convenient, but sometimes you only need pieces
- The hour from the time, or the month from the date
These can be particularly important with aggregates!
Two methods to extract pieces of any datetime or interval type:
- SQL standard: extract( part FROM |||datetime_value||| )
- Postgres specific: date_part( |||part|||, |||datetime_value||| )
Both will return a DOUBLE PRECISION value of whatever part was requested

Parts To Extract

You have a wide variety of what you can extract

text	Description
century	What century the date is in. 1st century starts 0001-01-01★
day	What day of the month
decade	The year divided by 10
dow	The day of the week (0-6, starting with Sunday)
doy	The day of the year
epoch	Number of seconds since 1970-01-01
hour	The current hour (0-23)
microseconds	The number of microseconds

text	Description
milliseconds	The number of milliseconds
minute	The minute
month	The month (1-12)
quarter	What quarter of the year (1-4)
second	The number of seconds
timezone	The timezone offset in seconds
timezone_hour	The timezone offset in hours
week	What week of the year. ISO weeks start on Monday
year	The year

★ – If you disagree with this, please write your complaint to: Pope, Cathedral Saint-Peter of Roma, Vatican.

Reversing it

Often times existing data sets have already separated out different aspects of the date or time
- Year, month, and day might be in different columns for example
It can be useful to “stitch” these together into an actual datetime type for further use.
Postgres gives you a handful of functions to do so:
- make_date( |||year|||, |||month|||, |||day|||): Returns a new DATE type value
- make_time( |||hour|||, |||minute|||, |||seconds||| ): Returns a new TIME type value (with no timezone)
- make_timestamptz(|||year|||,|||month|||,|||day|||,|||hour|||,|||minute|||,|||second|||,|||time zone|||): Returns a new TIMESTAMPTZ type value
- make_timestamp and make_interval also exist

Aging well

Subtracting two DATE type values will give just an INT (in days)
Subtracting two TIMESTAMP type values will give an INTERVAL, with the biggest “unit” in days
Using Postgres’s age() function can smooth over both and give units larger than days
- age( datetime1, datetime2 ): Subtracts datetime2 from datetime1
This can still give you awkward interval units at times though, so also consider using justify_interval( interval ), which breaks intervals into divisions that don’t exceed a categories max
- Hours would always be between 0 and 23 for instance, or months between 1 and 12
- Especially if you want to extract a particular part, this is highly recommended

What time is it?

Standard SQL also provides constants for grabbing the current system time and date

function	description
`current_date`	Returns the current date
`current_time`	Returns the current time with timezone
`localtime`	Returns the current time without timezone
`current_timestamp`★	Returns the current date and time with timezone
`localtimestamp`	Returns the current date and time without timezone

★ – Postgres also offers the shorter now() function to do the same thing

Time Zones

Dealing with time zones can be a headache, and it is a very nice feature that Postgres can work with them smoothly
By default, Postgres will display any timestamp with a time zone with the time as you would measure it in your current system timezone
What is your current system timezone?
- SHOW timezone;
Getting general information about timezones:
- Getting abbreviations:
  - SELECT * FROM pg_timezone_abbrevs;
- Getting full names:
  - SELECT * FROM pg_timezone_names;

Teleportation

It can sometimes be useful to switch your “current” time zone
- Maybe it is easier to compare times to someone else living in that time zone
Several methods to make the switch:
- Change your postgressql.conf file, which controls your Postgres server. Only recommended if you have permanently moved elsewhere and the database time zone has not updated appropriately.
- Set future queries in a single session to be from a new timezone:
  
  SET timezone TO time_zone_name_or_abbrv;
  - This will also adjust what values your localtime or localtimestamp report!
- Transform a single query to be reported in a different time zone:
```
SELECT |||datetime column||| AT TIME ZONE |||tz name or abbr|||
FROM |||table name|||;
```

Activity

Using the taxi rides dataset, see if you can:
- Compute the number of rides given each hour of the day
- Compute the average cost of rides over each day of the month
- Compute the median cost of rides over each day of the week
- Compute the average duration of each ride over each hour of the day