Entity Relationship Diagrams
An entity relationship diagram (ERD) is a common way to view data in a database. Below is the ERD for the Parch & Posey database we will use. These diagrams help you visualize the data you are analyzing, including:
- The names of the tables.
- The columns in each table.
- The way the tables work together.
You can think of each of the boxes below as a spreadsheet.
What to Notice
In the Parch & Posey database there are five tables (essentially 5 spreadsheets):
- web_events
- accounts
- orders
- sales_reps
- region
You can think of each of these tables as an individual spreadsheet. The columns in each spreadsheet are listed below the table name. For example, the region table has two columns: id and name. The web_events table, on the other hand, has four columns.
The "crow's foot" that connects the tables together shows us how the columns in one table relate to the columns in another table. In this first lesson, you will be learning the basics of how to work with SQL to interact with a single table. In the next lesson, you will learn more about why these connections are so important for working with SQL and relational databases.
There are some major advantages to using traditional relational databases, which we interact with using SQL. The five most apparent are:
- SQL is easy to understand.
- Traditional databases allow us to access data directly.
- Traditional databases allow us to audit and replicate our data.
- SQL is a great tool for analyzing multiple tables at once.
- SQL allows you to analyze more complex questions than dashboard tools like Google Analytics.
You will experience these advantages first hand, as we learn to write SQL to interact with data.
SQL vs. NoSQL
You may have heard of NoSQL, which stands for "not only SQL." NoSQL databases let you write code that interacts with data a bit differently than what we will do in this course. These NoSQL environments tend to be particularly popular for web-based data, but less popular for data that lives in spreadsheets the way we have been analyzing data up to this point. One of the most popular NoSQL databases is MongoDB. Udacity has a full course on MongoDB that you can take for free, but NoSQL will not be a focus of this program.
Why Businesses Like Databases
- Data integrity is ensured - only the data you want entered is entered, and only certain users are able to enter data into the database.
- Data can be accessed quickly - SQL allows you to obtain results very quickly from the data stored in a database. Code can be optimized to quickly pull results.
- Data is easily shared - multiple individuals can access data stored in a database, and the data is the same for all users, allowing for consistent results for anyone with access to your database.
A few key points about data stored in SQL databases:
Data in databases is stored in tables that can be thought of just like Excel spreadsheets.
For the most part, you can think of a database as a bunch of Excel spreadsheets. Each spreadsheet has rows and columns, where each row holds data on a transaction, a person, a company, etc., and each column holds data pertaining to a particular aspect of the rows, like a name, a location, or a unique id. All the data in the same column must match in terms of data type.
An entire column is considered quantitative, discrete, or some sort of string. This means that if you have one row with a string in a particular column, the entire column might change to a text data type. This can be very bad if you want to do math with that column! Consistent column types are one of the main reasons working with databases is fast.
Often databases hold a LOT of data. So, knowing that the columns are all of the same type of data means that obtaining data from a database can still be fast.
Types of Databases
SQL Databases
There are many different types of SQL databases designed for different purposes. In this course we will use Postgres within the classroom, which is a popular open-source database with a very complete library of analytical functions.
Some of the most popular databases include:
- MySQL
- Access
- Oracle
- Microsoft SQL Server
- Postgres
You can also write SQL within other programming frameworks like Python, Scala, and Hadoop.
Small Differences
Each of these SQL databases may have subtle differences in syntax and available functions -- for example, MySQL doesn't have some of the functions for modifying dates that Postgres does. Most of what you see with Postgres will be directly applicable to using SQL in other frameworks and database environments. For the differences that do exist, you should check the documentation. Most SQL environments have great documentation online that you can easily access with a quick Google search.
The article here compares three of the most common SQL databases: SQLite, PostgreSQL, and MySQL.
You will use PostgreSQL for the lessons in this course, but you are not required to download it to your machine. We provide SQL workspaces in the classroom throughout the lessons. You may download PostgreSQL here if you'd like though.
The key to SQL is understanding statements. A few statements include:
- CREATE TABLE is a statement that creates a new table in a database.
- DROP TABLE is a statement that removes a table in a database.
- SELECT allows you to read data and display it. This is called a query.
The SELECT statement is the most common statement used by analysts, and you will be learning all about it throughout this course!
SQL statements are code that can read and manipulate data. A few basic syntax reminders: SQL commands are not case sensitive - you can write them in upper or lower case anywhere in your code. Additionally, you can end SQL statements with a semicolon, although some SQL environments don't require one.
We will be using the SELECT statement, which is called a query. The DROP and CREATE statements actually change the data in the database. In most companies, analysts are not given permission to use these types of statements. This is a good thing - actually changing the data in the database is powerful, and it is generally reserved for database administrators exclusively.
Here you were introduced to the SQL command that will be used in every query you write: SELECT ... FROM ....
- SELECT indicates which column(s) you want to be given the data for.
- FROM specifies from which table(s) you want to select the columns. Notice the columns need to exist in this table.
If you want to be provided with the data from all columns in the table, you use "*", like so:
- SELECT * FROM orders
Note that using SELECT does not create a new table with these columns in the database, it just provides the data to you as the results, or output, of this command.
You will use this SQL SELECT statement in every query in this course, but you will be learning a few additional statements and operators that can be used along with them to ask more advanced questions of your data.
The SELECT statement is where you put the columns for which you would like to show the data. The FROM statement is where you put the tables from which you would like to pull data.
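As a concrete sketch of SELECT and FROM, here is a tiny in-memory SQLite database run through Python's sqlite3 module. The table name mirrors Parch & Posey's orders table, but the columns and rows here are invented sample data, and SQLite is standing in for the Postgres environment used in the classroom:

```python
import sqlite3

# Build a tiny in-memory stand-in for the orders table
# (hypothetical rows; the real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, account_id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1001, 125), (2, 1001, 190), (3, 1041, 85)])

# SELECT * returns every column; naming columns returns just those.
all_rows = conn.execute("SELECT * FROM orders").fetchall()
some_cols = conn.execute("SELECT id, total FROM orders").fetchall()
print(all_rows)   # three rows, three columns each
print(some_cols)  # three rows, two columns each
```

Neither query changes the table itself - the command just hands the results back to you as output.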
Formatting Your Queries
Using Upper and Lower Case in SQL
SQL queries can be run successfully whether characters are written in upper- or lower-case. In other words, SQL queries are not case-sensitive.
It is common and best practice to capitalize all SQL commands, like SELECT and FROM, and keep everything else in your query lower case.
Capitalizing command words makes queries easier to read, which will matter more as you write more complex queries. For now, it is just a good habit to start getting into, to make your SQL queries more readable.
One other note: The text data stored in SQL tables can be either upper or lower case, and SQL is case-sensitive in regard to this text data.
Avoid Spaces in Table and Variable Names
It is common to use underscores and avoid spaces in column names. It is a bit annoying to work with spaces in SQL. In Postgres if you have spaces in column or table names, you need to refer to these columns/tables with double quotes around them (Ex: FROM "Table Name" as opposed to FROM table_name). In other environments, you might see this as square brackets instead (Ex: FROM [Table Name]).
Use White Space in Queries
SQL ignores extra whitespace, so you can add as many spaces and blank lines between pieces of code as you want, and your queries will run the same.
Semicolons
Depending on your SQL environment, your query may need a semicolon at the end to execute. Other environments are more flexible in terms of this being a "requirement." It is considered best practice to put a semicolon at the end of each statement, which also allows you to run multiple queries at once if your environment allows this.
The LIMIT statement is useful when you want to see just the first few rows of a table. This can be much faster for loading than if we load the entire dataset.
The LIMIT command is always the very last part of a query.
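To make LIMIT concrete, here is a small sketch (again SQLite via Python's sqlite3, with an invented orders table standing in for the real one). Note that LIMIT comes after everything else in the query, including ORDER BY:

```python
import sqlite3

# Hypothetical sample data: 100 orders, ids 1 through 100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 10) for i in range(1, 101)])

# LIMIT caps how many rows come back; it is the last part of the query.
first_five = conn.execute(
    "SELECT id, total FROM orders ORDER BY id LIMIT 5"
).fetchall()
print(first_five)  # five rows: (1, 10) through (5, 50)
```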
The ORDER BY statement allows us to sort our results using the data in any column. If you are familiar with Excel or Google Sheets, using ORDER BY is similar to sorting a sheet using a column. A key difference, however, is that using ORDER BY in a SQL query only has temporary effects, for the results of that query, unlike sorting a sheet by column in Excel or Sheets.
In other words, when you use ORDER BY in a SQL query, your output will be sorted that way, but then the next query you run will encounter the unsorted data again. It's important to keep in mind that this is different than using common spreadsheet software, where sorting the spreadsheet by column actually alters the data in that sheet until you undo or change that sorting. This highlights the meaning and function of a SQL "query."
The ORDER BY statement always comes in a query after the SELECT and FROM statements, but before the LIMIT statement. If you are using the LIMIT statement, it will always appear last. As you learn additional commands, the order of these statements will matter more.
Pro Tip
Remember DESC can be added after the column in your ORDER BY statement to sort in descending order, as the default is to sort in ascending order.
Here, we saw that we can ORDER BY more than one column at a time. When you provide a list of columns in an ORDER BY command, the sorting occurs using the leftmost column in your list first, then the next column from the left, and so on. We still have the ability to flip the way we order using DESC.
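The multi-column sorting described above can be sketched as follows (SQLite via sqlite3, invented rows): sorting happens by the leftmost ORDER BY column first, and DESC can flip any individual column:

```python
import sqlite3

# Hypothetical orders for two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1001, 100), (1001, 300), (1041, 200), (1041, 50)])

# Sort by account_id ascending first, then by total descending within each account.
rows = conn.execute(
    "SELECT account_id, total FROM orders ORDER BY account_id, total DESC"
).fetchall()
print(rows)  # [(1001, 300), (1001, 100), (1041, 200), (1041, 50)]
```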
Using the WHERE statement, we can display subsets of tables based on conditions that must be met. You can also think of the WHERE command as filtering the data.
The WHERE statement can also be used with non-numeric data. We can use the = and != operators here. You need to be sure to use single quotes (just be careful if you have quotes in the original text) with the text data, not double quotes.
Commonly when we are using WHERE with non-numeric data fields, we use the LIKE, NOT, or IN operators.
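Here is a quick sketch of WHERE with text data (SQLite via sqlite3; the accounts rows are made up). Note the single quotes around the text value:

```python
import sqlite3

# Hypothetical accounts rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 'Walmart'), (2, 'Apple'), (3, 'Exxon Mobil')])

# Text comparisons use single quotes around the value, not double quotes.
rows = conn.execute("SELECT id, name FROM accounts WHERE name = 'Walmart'").fetchall()
print(rows)  # [(1, 'Walmart')]
```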
Derived column - a new column that is a combination of existing columns. This derived column, and its alias, are generally only temporary, existing just for the duration of your query. The next time you run a query and access this table, the new column will not be there.
Remember PEMDAS from math class to help remember the order of operations.
Logical Operators include:
- LIKE - allows you to perform operations similar to using WHERE and =, but for cases when you might not know exactly what you are looking for.
- IN - allows you to perform operations similar to using WHERE and =, but for more than one condition.
- NOT - used with IN and LIKE to select all of the rows NOT LIKE or NOT IN a certain condition.
- AND & BETWEEN - allow you to combine operations where all combined conditions must be true.
- OR - allows you to combine operations where at least one of the combined conditions must be true.
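The operators above can be sketched in one small example (SQLite via sqlite3, invented accounts rows):

```python
import sqlite3

# Hypothetical accounts rows for demonstrating LIKE, IN, and NOT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 'Walmart'), (2, 'Walt Disney'), (3, 'Apple')])

# LIKE with % matches any run of characters; IN checks a list of values;
# NOT inverts a LIKE or IN condition.
like_rows = conn.execute("SELECT name FROM accounts WHERE name LIKE 'Wal%'").fetchall()
in_rows = conn.execute("SELECT name FROM accounts WHERE id IN (1, 3)").fetchall()
not_rows = conn.execute("SELECT name FROM accounts WHERE name NOT LIKE 'Wal%'").fetchall()
print(like_rows, in_rows, not_rows)
```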
LIKE requires the use of wildcards, most commonly %. The % tells us that we might want any number of characters leading up to a particular set of characters or following a certain set of characters.

IN works like =, but for more than one item of that particular column. We can check one, two, or many column values for which we want to pull data, all within the same query. This allows you to filter data based on several possible values.

AND works with arithmetic operators (+, *, -, /). LIKE, IN, and NOT logic can also be linked together using the AND operator.

OR works with arithmetic operators (+, *, -, /), and LIKE, IN, NOT, AND, and BETWEEN logic can all be linked together using the OR operator.

Database Normalization
When creating a database, it is really important to think about how data will be stored. This is known as normalization, and it is a huge part of most SQL classes. If you are in charge of setting up a new database, it is important to have a thorough understanding of database normalization.
There are essentially three ideas that are aimed at database normalization:
- Are the tables storing logical groupings of the data?
- Can I make changes in a single location, rather than in many tables for the same information?
- Can I access and manipulate data quickly and efficiently?
This is discussed in detail here.
However, most analysts are working with a database that was already set up with the necessary properties in place. As analysts of data, you don't really need to think too much about data normalization. You just need to be able to pull the data from the database, so you can start making insights.
https://www.itprotoday.com/sql-server/sql-design-why-you-need-database-normalization
The whole purpose of JOIN statements is to allow us to pull data from more than one table at a time. We use the ON clause to specify a JOIN condition - a logical statement that combines the tables in the FROM and JOIN statements. The ON clause specifies the column on which you'd like to merge the two tables together.
If there is not matching information in the JOINed table, then you will have columns with empty cells. These empty cells introduce a new data type called NULL. You will learn about NULLs in detail in the next lesson, but for now you have a quick introduction as you can consider any cell without data as NULL.
OUTER JOINS
The last type of join is a full outer join. This will return the inner join result set, as well as any unmatched rows from either of the two tables being joined.
Again this returns rows that do not match one another from the two tables. The use cases for a full outer join are very rare.
You can see examples of outer joins at the link here and a description of the rare use cases here. We will not spend time on these given the few instances you might need to use them.
Similar to the above, you might see the language FULL OUTER JOIN, which is the same as OUTER JOIN.
https://www.w3resource.com/sql/joins/perform-a-full-outer-join.php
https://stackoverflow.com/questions/2094793/when-is-a-good-situation-to-use-a-full-outer-join
https://stackoverflow.com/questions/7339143/why-no-many-to-many-relationships
SELECT c.countryid, c.countryName, s.stateName
FROM Country c
JOIN State s
ON c.countryid = s.countryid;
With a JOIN (technically an INNER JOIN), we only get rows that show up in both tables.
QUIZ
I recommend testing your queries with the environment below, and then saving them to a file. Then compare your file to my solutions on the next concept!
If you have two or more columns in your SELECT that have the same name after the table name, such as accounts.name and sales_reps.name, you will need to alias them; otherwise only one of the columns will show. You can alias them like accounts.name AS AccountName, sales_reps.name AS SalesRepName.
Questions
- Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest'
ORDER BY a.name;
- Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for accounts where the sales rep has a first name starting with S and is in the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest' AND s.name LIKE 'S%'
ORDER BY a.name;
- Provide a table that provides the region for each sales_rep along with their associated accounts. This time only for accounts where the sales rep has a last name starting with K and is in the Midwest region. Your final table should include three columns: the region name, the sales rep name, and the account name. Sort the accounts alphabetically (A-Z) according to account name.

SELECT r.name region, s.name rep, a.name account
FROM sales_reps s
JOIN region r
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
WHERE r.name = 'Midwest' AND s.name LIKE '% K%'
ORDER BY a.name;
- Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100. Your final table should have 3 columns: region name, account name, and unit price. In order to avoid a division by zero error, adding .01 to the denominator is helpful: total_amt_usd/(total+0.01).

SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100;
- Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100 and the poster order quantity exceeds 50. Your final table should have 3 columns: region name, account name, and unit price. Sort for the smallest unit price first. In order to avoid a division by zero error, adding .01 to the denominator is helpful: total_amt_usd/(total+0.01).

SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100 AND o.poster_qty > 50
ORDER BY unit_price;
- Provide the name for each region for every order, as well as the account name and the unit price they paid (total_amt_usd/total) for the order. However, you should only provide the results if the standard order quantity exceeds 100 and the poster order quantity exceeds 50. Your final table should have 3 columns: region name, account name, and unit price. Sort for the largest unit price first. In order to avoid a division by zero error, adding .01 to the denominator is helpful: total_amt_usd/(total+0.01).

SELECT r.name region, a.name account, o.total_amt_usd/(o.total + 0.01) unit_price
FROM region r
JOIN sales_reps s
ON s.region_id = r.id
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
WHERE o.standard_qty > 100 AND o.poster_qty > 50
ORDER BY unit_price DESC;
- What are the different channels used by account id 1001? Your final table should have only 2 columns: account name and the different channels. You can try SELECT DISTINCT to narrow down the results to only the unique values.

SELECT DISTINCT a.name, w.channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE a.id = '1001';
- Find all the orders that occurred in 2015. Your final table should have 4 columns: occurred_at, account name, order total, and order total_amt_usd.
SELECT o.occurred_at, a.name, o.total, o.total_amt_usd
FROM accounts a
JOIN orders o
ON o.account_id = a.id
WHERE o.occurred_at BETWEEN '2015-01-01' AND '2016-01-01'
ORDER BY o.occurred_at DESC;
The three JOIN statements you are most likely to use are:
- JOIN - an INNER JOIN that only pulls data that exists in both tables.
- LEFT JOIN - pulls all the data that exists in both tables, as well as all of the rows from the table in the FROM clause, even if they have no match in the JOINed table.
- RIGHT JOIN - pulls all the data that exists in both tables, as well as all of the rows from the table in the JOIN clause, even if they have no match in the table in the FROM clause.
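The difference between JOIN and LEFT JOIN can be sketched with two toy tables (SQLite via sqlite3; rows invented). The unmatched account survives the LEFT JOIN with a NULL, which Python's sqlite3 surfaces as None:

```python
import sqlite3

# Hypothetical accounts and orders; Apple has no orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, account_id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 'Walmart'), (2, 'Apple')])
conn.execute("INSERT INTO orders VALUES (1, 1, 100)")

# INNER JOIN keeps only matched rows; LEFT JOIN keeps every account.
inner = conn.execute(
    "SELECT a.name, o.total FROM accounts a JOIN orders o ON a.id = o.account_id"
).fetchall()
left = conn.execute(
    "SELECT a.name, o.total FROM accounts a LEFT JOIN orders o ON a.id = o.account_id"
).fetchall()
print(inner)  # only Walmart
print(left)   # Apple appears too, with None for total
```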
There are a few more advanced JOINs that we did not cover here: UNION and UNION ALL, CROSS JOIN, and the tricky SELF JOIN. These are beyond the scope of this course, but it is useful to be aware that they exist, as they come in handy in special cases.
SQL AGGREGATIONS
NULL is how SQL represents cells where no data exists. NULLs are often ignored by aggregation functions, which you will get a first look at in the next concept using COUNT.
Notice that NULLs are different than a zero - they are cells where data does not exist.
When identifying NULLs in a WHERE clause, we write IS NULL or IS NOT NULL. We don't use =, because NULL isn't considered a value in SQL. Rather, it is a property of the data.
NULLs - Expert Tip
There are two common ways in which you are likely to encounter NULLs:
NULLs frequently occur when performing a LEFT or RIGHT JOIN. You saw in the last lesson - when some rows in the left table of a left join are not matched with rows in the right table, those rows will contain some NULL values in the result set.
NULLs can also occur from simply missing data in our database.
Aggregation Reminder
An important thing to remember: aggregators only aggregate vertically - the values of a column. If you want to perform a calculation across rows, you would do this with simple arithmetic.
Similar to other software, AVG returns the mean of the data - that is, the sum of all of the values in the column divided by the number of values. This aggregate function ignores NULL values in both the numerator and the denominator.
If you want to count NULLs as zero, you will need to use SUM and COUNT. However, this is probably not a good idea if the NULL values truly just represent unknown values for a cell.
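Both behaviors can be checked in a quick sketch (SQLite via sqlite3, invented rows with one NULL):

```python
import sqlite3

# Hypothetical orders; the second row has a NULL poster_qty.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, poster_qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, None), (3, 20)])

# AVG ignores the NULL row entirely: (10 + 20) / 2 = 15.
avg_skip_null = conn.execute("SELECT AVG(poster_qty) FROM orders").fetchone()[0]
# Counting NULLs as zero instead: SUM over COUNT of all rows = 30 / 3 = 10.
avg_null_as_zero = conn.execute(
    "SELECT SUM(poster_qty) * 1.0 / COUNT(*) FROM orders"
).fetchone()[0]
print(avg_skip_null, avg_null_as_zero)  # 15.0 10.0
```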
MEDIAN - Expert Tip
One quick note that a median might be a more appropriate measure of center for this data, but finding the median happens to be a pretty difficult thing to get using SQL alone — so difficult that finding a median is occasionally asked as an interview question.
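As a sketch of why this is awkward, here is one common workaround for an odd number of rows: sort the column and skip to the middle row with LIMIT/OFFSET (SQLite via sqlite3, invented data). Postgres also offers PERCENTILE_CONT, which handles medians more directly:

```python
import sqlite3

# Hypothetical order totals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(5,), (90,), (20,), (40,), (10,)])

# Sort the values and jump to the middle row (works for an odd row count).
n = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
median = conn.execute(
    "SELECT total FROM orders ORDER BY total LIMIT 1 OFFSET ?", ((n - 1) // 2,)
).fetchone()[0]
print(median)  # 20
```

An even row count would require averaging the two middle rows, which is exactly the kind of extra bookkeeping that makes medians painful in plain SQL.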
The key takeaways here:
GROUP BY can be used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.
Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause.
The GROUP BY always goes between WHERE and ORDER BY.
ORDER BY works like SORT in spreadsheet software.
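These takeaways can be sketched together (SQLite via sqlite3, invented rows): the non-aggregated column appears in GROUP BY, and ORDER BY sorts the grouped result:

```python
import sqlite3

# Hypothetical orders for two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1001, 100), (1001, 200), (1041, 50)])

# account_id is not aggregated, so it must be in GROUP BY;
# the result is one row per account, largest summed total first.
rows = conn.execute(
    "SELECT account_id, SUM(total) FROM orders "
    "GROUP BY account_id ORDER BY SUM(total) DESC"
).fetchall()
print(rows)  # [(1001, 300), (1041, 50)]
```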
GROUP BY - Expert Tip
Before we dive deeper into aggregations using GROUP BY statements, it is worth noting that SQL evaluates the aggregations before the LIMIT clause. If you don’t group by any columns, you’ll get a 1-row result—no problem there. If you group by a column with enough unique values that it exceeds the LIMIT number, the aggregates will be calculated, and then some rows will simply be omitted from the results.
You can GROUP BY multiple columns at once, and the same applies to ORDER BY. The order of columns in ORDER BY determines which column is sorted first, and you can add DESC after any column in your ORDER BY to sort it in descending order.
HAVING - Expert Tip
HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.
Solutions: HAVING
How many of the sales reps have more than 5 accounts that they manage?
SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
HAVING COUNT(*) > 5
ORDER BY num_accounts;
and technically, we can get this using a SUBQUERY as shown below. This same logic can be used for the other queries, but this will not be shown.
SELECT COUNT(*) num_reps_above5
FROM (SELECT s.id, s.name, COUNT(*) num_accounts
      FROM accounts a
      JOIN sales_reps s
      ON s.id = a.sales_rep_id
      GROUP BY s.id, s.name
      HAVING COUNT(*) > 5
      ORDER BY num_accounts) AS Table1;
How many accounts have more than 20 orders?
SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING COUNT(*) > 20
ORDER BY num_orders;
Which account has the most orders?
SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY num_orders DESC
LIMIT 1;
How many accounts spent more than 30,000 usd total across all orders?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) > 30000
ORDER BY total_spent;
How many accounts spent less than 1,000 usd total across all orders?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) < 1000
ORDER BY total_spent;
Which account has spent the most with us?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent DESC
LIMIT 1;
Which account has spent the least with us?
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY total_spent
LIMIT 1;
Which accounts used facebook as a channel to contact customers more than 6 times?

SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
HAVING COUNT(*) > 6 AND w.channel = 'facebook'
ORDER BY use_of_channel;
Which account used facebook most as a channel?

SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 1;

Note: This query only works if there are no ties for the account that used facebook the most. It is a best practice to use a larger limit number first, such as 3 or 5, to see if there are ties before using LIMIT 1.
- Which channel was most frequently used by most accounts?

SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 10;

All of the top 10 are direct.
GROUPing BY a date column is not usually very useful in SQL, as these columns tend to hold transaction data down to the second. Keeping date information at such a granular level is both a blessing and a curse: it gives really precise information (a blessing), but it makes grouping information together directly difficult (a curse).
Lucky for us, there are a number of built in SQL functions that are aimed at helping us improve our experience in working with dates.
The first function you are introduced to in working with dates is DATE_TRUNC.
DATE_TRUNC allows you to truncate your date to a particular part of your date-time column. Common truncations are day, month, and year. Here is a great blog post by Mode Analytics on the power of this function.
DATE_PART can be useful for pulling a specific portion of a date, but notice that pulling month or day of the week (dow) means that you are no longer keeping the years in order. Rather, you are grouping for certain components regardless of which year they belonged in.

'dow' pulls the day of the week, with 0 as Sunday and 6 as Saturday.
For additional functions you can use with dates, check out the documentation here, but the DATE_TRUNC and DATE_PART functions definitely give you a great start!
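SQLite (used for this runnable sketch via Python's sqlite3) has neither DATE_TRUNC nor DATE_PART, so strftime stands in here just to show the grouping idea; in the classroom's Postgres you would write DATE_TRUNC('month', occurred_at) instead. The rows are invented:

```python
import sqlite3

# Hypothetical orders with second-level timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (occurred_at TEXT, total_amt_usd REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [('2016-12-02 10:15:00', 100.0),
                  ('2016-12-24 09:00:00', 50.0),
                  ('2017-01-05 12:30:00', 75.0)])

# Truncate each timestamp to its month (the DATE_TRUNC idea) and sum per month.
rows = conn.execute(
    "SELECT strftime('%Y-%m', occurred_at) ord_month, SUM(total_amt_usd) "
    "FROM orders GROUP BY 1 ORDER BY 1"
).fetchall()
print(rows)  # [('2016-12', 150.0), ('2017-01', 75.0)]
```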
You can reference the columns in your SELECT statement in GROUP BY and ORDER BY clauses with numbers that follow the order in which they appear in the SELECT statement. For example:
SELECT standard_qty, COUNT(*)
FROM orders
GROUP BY 1 (this 1 refers to standard_qty since it is the first of the columns included in the select statement)
ORDER BY 1 (this 1 refers to standard_qty since it is the first of the columns included in the select statement)
Solutions: Working With DATEs
Find the sales in terms of total dollars for all orders in each year, ordered from greatest to least. Do you notice any trends in the yearly sales totals?

SELECT DATE_PART('year', occurred_at) ord_year, SUM(total_amt_usd) total_spent
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
When we look at the yearly totals, you might notice that 2013 and 2017 have much smaller totals than all other years. If we look further at the monthly data, we see that 2013 and 2017 each have only one month of sales (12 for 2013 and 1 for 2017). Therefore, neither of these years is evenly represented. Sales have been increasing year over year, with 2016 being the largest sales to date. At this rate, we might expect 2017 to have the largest sales.

Which month did Parch & Posey have the greatest sales in terms of total dollars? Are all months evenly represented by the dataset?

In order for this to be 'fair', we should remove the sales from 2013 and 2017, for the same reasons as discussed above.

SELECT DATE_PART('month', occurred_at) ord_month, SUM(total_amt_usd) total_spent
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC;
The greatest sales amounts occur in December (12).
Which year did Parch & Posey have the greatest sales in terms of total number of orders? Are all years evenly represented by the dataset?

SELECT DATE_PART('year', occurred_at) ord_year, COUNT(*) total_sales
FROM orders
GROUP BY 1
ORDER BY 2 DESC;
Again, 2016 has by far the most orders, but 2013 and 2017 are again not evenly represented compared to the other years in the dataset.
- Which month did Parch & Posey have the greatest sales in terms of total number of orders? Are all months evenly represented by the dataset?

December still has the most sales, but interestingly, November has the second most sales (though not the most dollar sales). To make a fair comparison from one month to another, 2017 and 2013 data were removed.

SELECT DATE_PART('month', occurred_at) ord_month, COUNT(*) total_sales
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUP BY 1
ORDER BY 2 DESC;
- In which month of which year did Walmart spend the most on gloss paper in terms of dollars?

May 2016 was when Walmart spent the most on gloss paper.

SELECT DATE_TRUNC('month', o.occurred_at) ord_date, SUM(o.gloss_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
WHERE a.name = 'Walmart'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
Solutions: CASE
- Write a query to display for each order the account ID, the total amount of the order, and the level of the order - 'Large' or 'Small' - depending on whether the order is $3000 or more, or less than $3000.

SELECT account_id, total_amt_usd,
CASE WHEN total_amt_usd >= 3000 THEN 'Large' ELSE 'Small' END AS order_level
FROM orders;
Write a query to display the number of orders in each of three categories, based on the total number of items in each order. The three categories are: 'At Least 2000', 'Between 1000 and 2000', and 'Less than 1000'.

SELECT CASE WHEN total >= 2000 THEN 'At Least 2000'
WHEN total >= 1000 AND total < 2000 THEN 'Between 1000 and 2000'
ELSE 'Less than 1000' END AS order_category,
COUNT(*) AS order_count
FROM orders
GROUP BY 1;
- We would like to understand 3 different branches of customers based on the amount associated with their purchases. The top branch includes anyone with a Lifetime Value (total sales of all orders) greater than 200,000 usd. The second branch is between 200,000 and 100,000 usd. The lowest branch is anyone under 100,000 usd. Provide a table that includes the level associated with each account. You should provide the account name, the total sales of all orders for the customer, and the level. Order with the top spending customers listed first.

SELECT a.name, SUM(total_amt_usd) total_spent,
CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
WHEN SUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
GROUP BY a.name
ORDER BY 2 DESC;
- We would now like to perform a similar calculation to the first, but we want to obtain the total amount spent by customers only in 2016 and 2017. Keep the same levels as in the previous question. Order with the top spending customers listed first.

SELECT a.name, SUM(total_amt_usd) total_spent,
CASE WHEN SUM(total_amt_usd) > 200000 THEN 'top'
WHEN SUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' END AS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
WHERE occurred_at > '2015-12-31'
GROUP BY 1
ORDER BY 2 DESC;
- We would like to identify top performing sales reps, which are sales reps associated with more than 200 orders. Create a table with the sales rep name, the total number of orders, and a column with top or not depending on if they have more than 200 orders. Place the top sales people first in your final table.

It is worth mentioning that this assumes each name is unique - an assumption we have made a few times. Otherwise, we would want to group by both the name and the id of the sales rep.

SELECT s.name, COUNT(*) num_ords,
CASE WHEN COUNT(*) > 200 THEN 'top' ELSE 'not' END AS sales_rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name
ORDER BY 2 DESC;
- The previous didn't account for the middle, nor the dollar amount associated with the sales. Management decides they want to see these characteristics represented as well. We would like to identify top performing sales reps, which are sales reps associated with more than 200 orders or more than 750,000 usd in total sales. The middle group has any rep with more than 150 orders or 500,000 usd in sales. Create a table with the sales rep name, the total number of orders, total sales across all orders, and a column with top, middle, or low depending on these criteria. Place the top sales people based on dollar amount of sales first in your final table.

You might see a few sales people upset by these criteria!

SELECT s.name, COUNT(*), SUM(o.total_amt_usd) total_spent,
CASE WHEN COUNT(*) > 200 OR SUM(o.total_amt_usd) > 750000 THEN 'top'
WHEN COUNT(*) > 150 OR SUM(o.total_amt_usd) > 500000 THEN 'middle'
ELSE 'low' END AS sales_rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name
ORDER BY 3 DESC;
Solutions to Your First Subquery
- First, we needed to group by the day and channel. Then ordering by the number of events (the third column) gave us a quick way to answer the first question.
SELECT DATE_TRUNC('day',occurred_at) AS day, channel, COUNT(*) as events FROM web_events GROUP BY 1,2 ORDER BY 3 DESC;
- Here you can see that to get the entire table from question 1 back, we included an * in our SELECT statement. You will need to be sure to alias your subquery table.

SELECT * FROM (SELECT DATE_TRUNC('day',occurred_at) AS day, channel, COUNT(*) as events FROM web_events GROUP BY 1,2 ORDER BY 3 DESC) sub;
- Finally, here we are able to get a table that shows the average number of events a day for each channel.
SELECT channel, AVG(events) AS average_events FROM (SELECT DATE_TRUNC('day',occurred_at) AS day, channel, COUNT(*) as events FROM web_events GROUP BY 1,2) sub GROUP BY channel ORDER BY 2 DESC;
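The subquery is necessary here because aggregates cannot be nested directly - AVG(COUNT(*)) is not valid SQL - so the daily counts must be materialized first and averaged in the outer query. A rough check on invented rows, with SQLite standing in for Postgres and DATE() substituting for DATE_TRUNC('day', ...):

```python
import sqlite3

# Toy "web_events" rows (made up): direct has 2 events on day 1 and
# 1 event on day 2, organic has 1 event on day 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (occurred_at TEXT, channel TEXT)")
conn.executemany("INSERT INTO web_events VALUES (?, ?)", [
    ("2016-01-01", "direct"), ("2016-01-01", "direct"),
    ("2016-01-02", "direct"),
    ("2016-01-01", "organic"),
])

# Inner query counts events per day per channel; outer query averages
# those daily counts per channel.
rows = conn.execute("""
    SELECT channel, AVG(events) AS average_events
    FROM (SELECT DATE(occurred_at) AS day, channel, COUNT(*) AS events
          FROM web_events
          GROUP BY 1, 2) sub
    GROUP BY channel
    ORDER BY 2 DESC
""").fetchall()
print(rows)  # direct averages (2+1)/2 = 1.5 events per day
```
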
Subquery Formatting
When writing subqueries, it is easy for your query to look incredibly complex. Formatting your SQL will help your reader - who is often just yourself at a future date - understand your code.
The important thing to remember when using subqueries is to provide some way for the reader to easily determine which parts of the query will be executed together. Most people do this by indenting the subquery in some way - you saw this with the solution blocks in the previous concept.
The examples in this class are indented quite far—all the way to the parentheses. This isn’t practical if you nest many subqueries, but in general, be thinking about how to write your queries in a readable way.
Well Formatted Query
Now for a well formatted example - you can see the table we are pulling from much more easily than in the previous queries.
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
ORDER BY 3 DESC) sub;
Additionally, if we have a GROUP BY, ORDER BY, WHERE, HAVING, or any other statement following our subquery, we would then indent it at the same level as our outer query.
The query below is similar to the one above, but it applies additional statements to the outer query. You can see that the GROUP BY and ORDER BY statements used on the output are not indented, while the inner query's GROUP BY and ORDER BY statements are indented to match the inner table.
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
ORDER BY 3 DESC) sub
GROUP BY day, channel, events
ORDER BY 2 DESC;
These final two queries are so much easier to read!
Subqueries Part II
If you are only returning a single value, you might use that value in a logical statement like WHERE, HAVING, or even SELECT - the value could be nested within a CASE statement.
Expert Tip
Note that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN case) rather than as a table.
If we returned an entire column, IN would need to be used to perform the logical comparison. If we are returning an entire table, then we must use an alias for the table and perform additional logic on the entire table.
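Both cases can be illustrated on a toy table (invented data; SQLite standing in for Postgres): a single-value subquery sits bare inside a comparison, while a column subquery requires IN.

```python
import sqlite3

# Made-up "orders" rows: two orders for account 1, one for account 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total_amt_usd REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (1, 300.0), (2, 50.0)])

# Scalar subquery: returns one value (the overall average, 150),
# so it takes no alias and sits in a plain comparison.
big = conn.execute("""
    SELECT COUNT(*) FROM orders
    WHERE total_amt_usd > (SELECT AVG(total_amt_usd) FROM orders)
""").fetchone()[0]

# Column subquery: returns a set of account_ids, so IN is required.
hits = conn.execute("""
    SELECT COUNT(*) FROM orders
    WHERE account_id IN (SELECT account_id FROM orders
                         WHERE total_amt_usd > 200)
""").fetchone()[0]
print(big, hits)  # only the 300 order beats the average; account 1 has 2 rows
```
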
The average amount of standard paper, gloss paper and poster paper sold on the first month that any order was placed in the orders table (in terms of quantity).
SELECT AVG(standard_qty) avg_std, AVG(gloss_qty) avg_gls, AVG(poster_qty) avg_pst
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
(SELECT DATE_TRUNC('month', MIN(occurred_at)) FROM orders);
The total amount spent on all orders on the first month that any order was placed in the orders table (in terms of usd).
SELECT SUM(total_amt_usd)
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
(SELECT DATE_TRUNC('month', MIN(occurred_at)) FROM orders);
Above is the ERD for the database again - it might come in handy as you tackle the quizzes below. You should write your solution as a subquery or subqueries, not by finding one solution and copying the output. The importance of this is that it allows your query to be dynamic in answering the question - even if the data changes, you still arrive at the right answer.
Provide the name of the sales_rep in each region with the largest amount of total_amt_usd sales.
First, I wanted to find the total_amt_usd totals associated with each sales rep, and I also wanted the region in which they were located. The query below provided this information.

SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC;

Next, I pulled the max for each region, and then we can use this to pull those rows in our final result.

SELECT region_name, MAX(total_amt) total_amt
FROM (SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1;

Essentially, this is a JOIN of these two tables, where the region and amount match.

SELECT t3.rep_name, t3.region_name, t3.total_amt
FROM (SELECT region_name, MAX(total_amt) total_amt
FROM (SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) t1
GROUP BY 1) t2
JOIN (SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC) t3
ON t3.region_name = t2.region_name AND t3.total_amt = t2.total_amt;
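This aggregate-then-join-back pattern (compute the per-group max, then join it against the aggregated rows to recover who achieved it) can be sketched on a toy table. The reps, regions, and amounts below are invented, and SQLite stands in for Postgres:

```python
import sqlite3

# Stand-in for the aggregated rep/region/total table (t1 above),
# with invented names and amounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (rep TEXT, region TEXT, amt REAL)")
conn.executemany("INSERT INTO totals VALUES (?, ?, ?)", [
    ("Ann", "East", 900.0), ("Bob", "East", 400.0), ("Cam", "West", 700.0),
])

# t2 finds the max per region; joining back on (region, amt)
# recovers the rep who produced that max.
rows = conn.execute("""
    SELECT t3.rep, t3.region, t3.amt
    FROM (SELECT region, MAX(amt) AS amt FROM totals GROUP BY 1) t2
    JOIN totals t3
      ON t3.region = t2.region AND t3.amt = t2.amt
    ORDER BY t3.amt DESC
""").fetchall()
print(rows)  # Ann wins East, Cam wins West; Bob is filtered out
```
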
For the region with the largest (sum) of sales total_amt_usd, how many total (count) orders were placed?
The first query I wrote was to pull the total_amt_usd for each region.

SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name;
Then we just want the region with the max amount from this table. There are two ways I considered getting this amount. One was to pull the max using a subquery. Another way is to order descending and just pull the top value.
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub;
Finally, we want to pull the total orders for the region with this amount:
SELECT r.name, COUNT(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub);
This provides the Northeast with 2357 orders.
How many accounts had more total purchases than the account name which has bought the most standard_qty paper throughout their lifetime as a customer?
First, we want to find the account that had the most standard_qty paper. The query here pulls that account, as well as the total amount:
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
Now, I want to use this to pull all the accounts with more total sales:
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) sub);
This is now a list of all the accounts with more total orders. We can get the count with just another simple subquery.
SELECT COUNT(*)
FROM (SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) inner_tab)) counter_tab;
For the customer that spent the most (in total over their lifetime as a customer) total_amt_usd, how many web_events did they have for each channel?
Here, we first want to pull the customer with the most spent in lifetime value.
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1;
Now, we want to look at the number of events on each channel this company had, which we can match with just the id.
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
AND a.id = (SELECT id
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1) inner_table)
GROUP BY 1, 2
ORDER BY 3 DESC;
I added an ORDER BY for no real reason, and the account name to ensure I was only pulling from one account.
What is the lifetime average amount spent in terms of total_amt_usd for the top 10 total spending accounts?
First, we just want to find the top 10 accounts in terms of highest total_amt_usd.

SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10;
Now, we just want the average of these 10 amounts.
SELECT AVG(tot_spent)
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10) temp;
What is the lifetime average amount spent in terms of total_amt_usd, including only the companies that spent more per order, on average, than the average of all orders?
First, we want to pull the average of all accounts in terms of total_amt_usd:
SELECT AVG(o.total_amt_usd) avg_all
FROM orders o;
Then, we want to only pull the accounts with more than this average amount.
SELECT o.account_id, AVG(o.total_amt_usd)
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o);
Finally, we just want the average of these values.
SELECT AVG(avg_amt)
FROM (SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o)) temp_table;
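The steps above can be sketched end to end on invented data, with SQLite standing in for Postgres: a scalar subquery computes the overall average, HAVING filters the per-account averages against it, and the outer AVG summarizes what survives.

```python
import sqlite3

# Made-up orders: account 1 averages 150, account 2 averages 400,
# account 3 averages 50; the overall average order is 187.5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total_amt_usd REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (1, 200.0), (2, 400.0), (3, 50.0)])

# Only account 2 clears the overall-average bar, so the final
# answer is the average of that single per-account average: 400.
result = conn.execute("""
    SELECT AVG(avg_amt)
    FROM (SELECT account_id, AVG(total_amt_usd) AS avg_amt
          FROM orders
          GROUP BY 1
          HAVING AVG(total_amt_usd) > (SELECT AVG(total_amt_usd) FROM orders)
         ) temp_table
""").fetchone()[0]
print(result)
```
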