diff --git a/02_activities/assignments/DC_Cohort/Assignment2.md b/02_activities/assignments/DC_Cohort/Assignment2.md
index 01f991d02..b4e92f0c9 100644
--- a/02_activities/assignments/DC_Cohort/Assignment2.md
+++ b/02_activities/assignments/DC_Cohort/Assignment2.md
@@ -55,9 +55,21 @@ The store wants to keep customer addresses. Propose two architectures for the CU
 **HINT:** search type 1 vs type 2 slowly changing dimensions.
-```
-Your answer...
-```
+Architecture 1: Overwrite (slowly changing dimension, type 1)
+Customer_ID is unique and Customer_Address has one row per customer.
+When a customer changes address, we update the existing row in place with the new address.
+The history of old addresses is lost.
+
+Columns could include:
+Customer_ID (PK/FK), address_1, address_2, city, province_state, postal_code, updated_at (a timestamp recording when the row was last edited)
+
+Architecture 2: Retain history (slowly changing dimension, type 2)
+With a Customer_Address_History table we keep multiple rows per customer, one row per distinct address.
+When a customer changes address, we insert a new row and close out the old one.
+Because history is retained, we can report using the address that was valid at the time of each order.
+
+Columns could include:
+Customer_Address_ID (PK), Customer_ID (FK), address_1, address_2, city, province_state, postal_code, country, start_date, end_date (NULL for the current address), is_current (boolean flag marking whether this is the customer's current address)
 
 ***
 
@@ -189,7 +201,9 @@ Read: Boykis, V. (2019, October 16). _Neural nets are just people all the way do
 
 Consider, for example, concepts of labour, bias, LLM proliferation, moderating content, intersection of technology and society, ect.
 
+Boykis (2019) discusses several key ethical issues, including invisible labour and exploitation, bias in data, and the illusion of automation.
+At first glance, the term artificial intelligence suggests that these systems are purely machine driven. What this framing leaves out is that, despite being marketed as automated, AI depends on large amounts of underpaid human labour. For example, ImageNet, a large image dataset, was built through the efforts of numerous low-paid workers on platforms such as Amazon Mechanical Turk. For unreasonably low pay, these workers performed repetitive cognitive tasks whose contributions went unrecognized, even though their work laid the foundation for modern AI systems. This raises an ethical question: is it fair to promote AI as automated when it relies on uncredited and underpaid human effort?
+Because humans label the data, their biases inevitably become embedded in AI systems. This includes decisions about which categories are chosen in the first place. Critically, the structure of the datasets themselves depends on subjective cultural and political choices. Bias is therefore not merely accidental; it is built into the datasets, reinforcing discrimination, stereotypes, and social inequities.
+Humans also make a multitude of decisions when building machine learning systems, including data creation, labelling, categorization, and validation. This contradicts the idea that AI is fully autonomous, which can promote public misunderstanding of AI's capabilities and reinforce undue trust in systems that are error-prone and heavily human-influenced.
+Overall, the reading underscores that AI is a social system resting on human labour, human choices, and human biases. Recognizing this helps us see AI as more than just a technical system.
-```
-Your thoughts...
-```
diff --git a/02_activities/assignments/DC_Cohort/ERD1.png b/02_activities/assignments/DC_Cohort/ERD1.png
new file mode 100644
index 000000000..8de66038d
Binary files /dev/null and b/02_activities/assignments/DC_Cohort/ERD1.png differ
diff --git a/02_activities/assignments/DC_Cohort/ERD2.png b/02_activities/assignments/DC_Cohort/ERD2.png
new file mode 100644
index 000000000..850b4286f
Binary files /dev/null and b/02_activities/assignments/DC_Cohort/ERD2.png differ
diff --git a/02_activities/assignments/DC_Cohort/assignment2.sql b/02_activities/assignments/DC_Cohort/assignment2.sql
index f7515f625..bdd311274 100644
--- a/02_activities/assignments/DC_Cohort/assignment2.sql
+++ b/02_activities/assignments/DC_Cohort/assignment2.sql
@@ -23,7 +23,11 @@ Edit the appropriate columns -- you're making two edits -- and the NULL rows wil
 All the other rows will remain the same. */
 --QUERY 1
-
+SELECT
+    COALESCE(product_name, '') || ', ' ||
+    COALESCE(product_size, '') || ' (' ||
+    COALESCE(product_qty_type, 'unit') || ')' AS product_list
+FROM product;
 --END QUERY
@@ -41,7 +45,15 @@ HINT: One of these approaches uses ROW_NUMBER() and one uses DENSE_RANK().
 Filter the visits to dates before April 29, 2022. */
 --QUERY 2
-
+SELECT
+    cp.*,
+    DENSE_RANK() OVER (
+        PARTITION BY cp.customer_id
+        ORDER BY cp.market_date
+    ) AS visit_number
+FROM customer_purchases cp
+WHERE cp.market_date < '2022-04-29'
+ORDER BY cp.customer_id, cp.market_date;
 --END QUERY
@@ -53,6 +65,19 @@ only the customer’s most recent visit.
 HINT: Do not use the previous visit dates filter. */
 --QUERY 3
+WITH ranked AS (
+    SELECT
+        cp.*,
+        DENSE_RANK() OVER (
+            PARTITION BY cp.customer_id
+            ORDER BY cp.market_date DESC
+        ) AS recent_visit_number
+    FROM customer_purchases cp
+)
+SELECT *
+FROM ranked
+WHERE recent_visit_number = 1
+ORDER BY customer_id, market_date DESC;
@@ -66,6 +91,15 @@ You can make this a running count by including an ORDER BY within the PARTITION
 Filter the visits to dates before April 29, 2022.
 */
 --QUERY 4
+SELECT
+    cp.*,
+    COUNT(*) OVER (
+        PARTITION BY cp.customer_id, cp.product_id
+        ORDER BY cp.market_date
+    ) AS times_customer_bought_product
+FROM customer_purchases cp
+WHERE cp.market_date < '2022-04-29'
+ORDER BY cp.customer_id, cp.product_id, cp.market_date;
@@ -85,7 +119,14 @@ Remove any trailing or leading whitespaces. Don't just use a case statement for
 Hint: you might need to use INSTR(product_name,'-') to find the hyphens. INSTR will help split the column. */
 --QUERY 5
-
+SELECT
+    p.product_name,
+    CASE
+        WHEN INSTR(p.product_name, '-') > 0 THEN
+            TRIM(SUBSTR(p.product_name, INSTR(p.product_name, '-') + 1))
+        ELSE NULL
+    END AS description
+FROM product p;
 --END QUERY
@@ -95,8 +136,11 @@ Hint: you might need to use INSTR(product_name,'-') to find the hyphens. INSTR w
 --QUERY 6
-
-
+SELECT
+    p.*
+FROM product p
+WHERE p.product_size GLOB '*[0-9]*';
+
 --END QUERY
@@ -111,7 +155,35 @@ HINT: There are a possibly a few ways to do this query, but if you're struggling
 with a UNION binding them. */
 --QUERY 7
-
+WITH sales_by_day AS (
+    SELECT
+        market_date,
+        SUM(quantity * cost_to_customer_per_qty) AS total_sales
+    FROM customer_purchases
+    GROUP BY market_date
+),
+ranked AS (
+    SELECT
+        market_date,
+        total_sales,
+        DENSE_RANK() OVER (ORDER BY total_sales DESC) AS best_rank,
+        DENSE_RANK() OVER (ORDER BY total_sales ASC) AS worst_rank
+    FROM sales_by_day
+)
+SELECT
+    market_date,
+    total_sales,
+    'highest' AS day_type
+FROM ranked
+WHERE best_rank = 1
+UNION
+SELECT
+    market_date,
+    total_sales,
+    'lowest' AS day_type
+FROM ranked
+WHERE worst_rank = 1
+ORDER BY day_type, market_date;
 --END QUERY
@@ -132,7 +204,33 @@ How many customers are there (y).
 Before your final group by you should have the product of those two queries (x*y).
 */
 --QUERY 8
-
+WITH vendor_products AS (
+    SELECT DISTINCT
+        vi.vendor_id,
+        v.vendor_name,
+        vi.product_id,
+        p.product_name,
+        vi.original_price AS unit_price
+    FROM vendor_inventory vi
+    JOIN vendor v ON v.vendor_id = vi.vendor_id
+    JOIN product p ON p.product_id = vi.product_id
+),
+
+customer_count AS (
+    SELECT COUNT(*) AS num_customers
+    FROM customer
+)
+
+SELECT
+    vp.vendor_name,
+    vp.product_name,
+    5 AS units_per_customer,
+    cc.num_customers,
+    vp.unit_price,
+    (5 * cc.num_customers * vp.unit_price) AS total_revenue
+FROM vendor_products vp
+CROSS JOIN customer_count cc
+ORDER BY vp.vendor_name, vp.product_name;
 --END QUERY
@@ -145,6 +243,16 @@ It should use all of the columns from the product table, as well as a new column
 Name the timestamp column `snapshot_timestamp`. */
 --QUERY 9
+CREATE TABLE product_units AS
+SELECT
+    p.product_id,
+    p.product_name,
+    p.product_size,
+    p.product_category_id,
+    p.product_qty_type,
+    CURRENT_TIMESTAMP AS snapshot_timestamp
+FROM product p
+WHERE p.product_qty_type = 'unit';
@@ -155,7 +263,24 @@ */
 This can be any product you desire (e.g. add another record for Apple Pie). */
 --QUERY 10
-
+INSERT INTO product_units (
+    product_id,
+    product_name,
+    product_size,
+    product_category_id,
+    product_qty_type,
+    snapshot_timestamp
+)
+
+SELECT
+    p.product_id,
+    p.product_name,
+    p.product_size,
+    p.product_category_id,
+    p.product_qty_type,
+    CURRENT_TIMESTAMP
+FROM product p
+WHERE p.product_id = 7;
 --END QUERY
@@ -167,6 +292,13 @@ This can be any product you desire.
 */
 HINT: If you don't specify a WHERE clause, you are going to have a bad time.*/
 --QUERY 11
+DELETE FROM product_units
+WHERE product_id = 7
+  AND snapshot_timestamp < (
+    SELECT MAX(snapshot_timestamp)
+    FROM product_units
+    WHERE product_id = 7
+  );
@@ -191,7 +323,21 @@ Finally, make sure you have a WHERE statement to update the right row,
 When you have all of these components, you can run the update statement. */
 --QUERY 12
-
+ALTER TABLE product_units
+ADD COLUMN current_quantity INT;
+
+UPDATE product_units
+SET current_quantity = COALESCE(
+    (
+        SELECT CAST(vi.quantity AS INT)
+        FROM vendor_inventory vi
+        WHERE vi.product_id = product_units.product_id
+        ORDER BY vi.market_date DESC
+        LIMIT 1
+    ),
+    0
+);
+
 --END QUERY
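
Reviewer note: as a quick sanity check on the window-function pattern used in QUERY 3, here is a minimal sketch using Python's built-in `sqlite3` module against a toy in-memory `customer_purchases` table. The column subset and sample rows are invented for illustration; the real assignment tables differ. It shows why `DENSE_RANK` (rather than `ROW_NUMBER`) keeps every purchase from a customer's most recent visit, including ties on the same `market_date`. Requires a SQLite build with window-function support (3.25+, bundled with Python 3.8+).

```python
import sqlite3

# Toy reproduction of the "most recent visit" pattern from QUERY 3.
# Table shape and data are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_purchases (
    customer_id INTEGER,
    market_date TEXT,
    product_id  INTEGER
);
INSERT INTO customer_purchases VALUES
    (1, '2022-04-02', 10),
    (1, '2022-04-09', 11),
    (1, '2022-04-09', 12),  -- two purchases on the same (most recent) visit
    (2, '2022-04-02', 10);
""")

rows = conn.execute("""
WITH ranked AS (
    SELECT
        cp.*,
        DENSE_RANK() OVER (
            PARTITION BY cp.customer_id
            ORDER BY cp.market_date DESC
        ) AS recent_visit_number
    FROM customer_purchases cp
)
SELECT customer_id, market_date, product_id
FROM ranked
WHERE recent_visit_number = 1
ORDER BY customer_id, product_id;
""").fetchall()

for row in rows:
    print(row)
# Customer 1 keeps both rows from 2022-04-09 because DENSE_RANK assigns
# the same rank to ties on market_date; customer 2 keeps its single row.
```

With `ROW_NUMBER` instead, one of customer 1's two same-day purchases would be ranked 2 and dropped by the `= 1` filter, which is why the tie-preserving ranking function is the right choice here.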