Authored by Ben Clothier
The other day Juan needed my help with a query that required a special technique commonly known as the Strawberry query. He needed a way to match one group of records with the first occurrence of a item in another table. From that I decided to write this post to share the technique.
There are times where we want to not just know that a maximum or minimum value in a table but also know which row contains that maximum value. This can be a challenge as SQL does not make it easy to express such a question, but strawberry query is a wonderful answer to that question. To help appreciate the technique, we’ll be using a running example. Suppose we have a company that collects and sells rare & highly prized collector items and we want to know who is the best performing salesperson. Here’s our Sales data for that hypotehical company.
Full Name Product Date Sales Amount Ed William 1957 Nourve de Champenge 7/12/11 $ 3,820 Ed William Prince Edward's Spare Tire 7/21/11 $ 8,120 Ed William 12th Centrury Ming Vase 7/31/11 $21,100 Ed William Bruce Springsteen's wristwatch 8/13/11 $ 790 Ed William Cracked Darth Vader Helmet 8/20/11 $15,210 Ed William Tamara de Lempicka's Amythyst 9/3/11 $ 8,500 Zooey Yoko 1855 S & W Blue Colt Revolver 7/20/11 $ 4,250 Zooey Yoko Grizzly Bear Taxidermy 8/1/11 $18,300 Zooey Yoko Ringo Starr's hair lock 8/20/11 $28,100 Zooey Yoko Potrait of unknown woman 9/7/11 $ 230 Zooey Yoko 1790 American Plow 9/12/11 $ 7,000 Rolf Smith Giuseppe's Nutcracker 7/20/11 $12,730 Rolf Smith Mother Goddess Figurine 7/27/11 $19,100 Rolf Smith Fragment of Dead Sea Scrolls 8/3/11 $11,500 Rolf Smith Shroud of Turin 8/13/11 $ 160 Rolf Smith 1962 Corvette 9/2/11 $42,200
Typically, When we want to know what was the largest sales amount in a given month, we can ask:
SELECT Month(s.SalesDate) AS SalesMonth, Max(s.AgreedPrice) AS SalesAmount FROM tblSales AS s GROUP BY Month(s.SalesDate);
This query would give us something like this:
Sales Month Sales Amount 7 $21,100.00 8 $28,100.00 9 $42,200.00
But let’s ask – which salesman had the biggest sales each month? This wouldn’t work:
SELECT Month(s.SalesDate) AS SalesMonth, sp.FullName AS SalesPerson, Max(s.AgreedPrice) AS SalesAmount FROM tblSales AS s INNER JOIN tblSalesPeople AS sp ON s.SalesPersonID = sp.SalesPersonID GROUP BY Month(s.SalesDate), sp.FullName ;
The query is actually equivalent to «give me all sales staff’s largest sales per month». That’s a different question from «who was the salesperson that made the biggest sales in this month?».
Sales Month Sales Person Sales Amount 7 Ed William $21,100 7 Rolf Smith $19,100 7 Zooey Yoko $ 4,250 8 Ed William $15,210 8 Rolf Smith $11,500 8 Zooey Yoko $28,100 9 Ed William $ 8,500 9 Rolf Smith $42,200 9 Zooey Yoko $ 7,000
Instead of getting only a row per month, we got a row for numbers of month times numbers of salesperson and thus 6 rows too many for this example. Fortunately, we can express this question using the ‘strawberry query.’
Strawberry Query to the rescue
A strawberry query has these essential components:
1) it is usually a self-join within one table
2) the join criteria has more than one column being compared
3) one of those join criteria is an inequality operator.
4) the outer table’s column must evaluate to NULL.
So, to get the SalesPerson, (Full Name field below), from tblSales where the sales amount was largest for the month, we can use this query:
SELECT Month(s.SalesDate) AS SalesMonth, s.SalesDate, p.FullName, s.AgreedPrice AS SalesAmount FROM (tblSalesPeople AS p INNER JOIN tblSales AS s ON p.SalesPersonID = s.SalesPersonID) LEFT JOIN tblSales AS c ON (s.AgreedPrice < c.AgreedPrice) AND (Month(s.SalesDate) = Month(c.SalesDate)) WHERE c.AgreedPrice IS NULL;
The query will produce this result:
Sales Month Sales Date Full Name Sales Amount 7 7/31/2011 Ed William $21,100 8 8/20/2011 Zooey Yoko $28,100 9 9/2/2011 Rolf Smith $42,200
Because the query did not use any aggregate function neither did it have a GROUP BY clause, we could then join the SalesPerson table, include the full name for the display as well as the actual Sales Date from the Sales table which would have been invalid in an aggregate query.
How does the strawberry query work?
The expression «Month(s.SalesDate) = Month(c.SalesDate)» is functionally equivalent to doing a «GROUP BY Month(SalesDate)», and that is how we’d define our «grouping» when we use a strawberry query. The second expression, «s.Amount < c.Amount» in conjunction with the «WHERE c.Amount IS NULL;» is equivalent to doing a «MAX(Amount)». If you’re wondering how the «c.Amount IS NULL» expression could help us extract the minimum or maximum amount, it may help to look at the query if we removed the WHERE clause.
First, we’ll reduce the query to essential skeleton of the strawberry, removing the tblSalesPerson table and other columns then include the outer columns.
SELECT Month(s.SalesDate) AS SalesMonth, s.AgreedPrice AS SalesAmount, c.AgreedPrice AS OtherSalesAmount FROM tblSales AS s LEFT JOIN tblSales AS c ON (s.AgreedPrice < c.AgreedPrice) AND (Month(s.SalesDate) = Month(c.SalesDate)) ;
Here is the listing of the result query would return without the WHERE clause: (lower sales were omitted to reduce the number of rows shown)
SalesMonth SalesAmount OtherSalesAmount ... ... ... 7 $8,120.00 $19,100.00 7 $21,100.00 NULL ... ... ... 7 $12,730.00 $21,100.00 7 $12,730.00 $19,100.00 7 $19,100.00 $21,100.00 8 $15,210.00 $18,300.00 ... ... ... 8 $11,500.00 $28,100.00 8 $11,500.00 $18,300.00 8 $11,500.00 $15,210.00 ... ... ... 8 $18,300.00 $28,100.00 8 $15,210.00 $28,100.00 8 $28,100.00 NULL 9 $7,000.00 $8,500.00 9 $8,500.00 $42,200.00 9 $7,000.00 $42,200.00 9 $42,200.00 NULL ... ... ...
A row from left table will be matched with any other rows in right table where the amount is greater. You can see how all sales that weren’t the largest sales get matches with each other’s sales that had a larger sales amount. The only row that can’t match any other sales is when it is the largest amount, and we would get a NULL from the outer sales table’s column. Hence the WHERE clause filters the set down to only those rows and we get our maximum row.
Closing Notes
Use SQL View for Strawberry Queries
Because the strawberry query uses an inequality in the join criteria, Access Query Designer cannot represent so if you attempt to view the query in design, you’ll get an error. For those who love QBE, you can add all the tables you need, drop in fields, then give the tables an alias so that it’s a simple editing when you switch to SQL and add that inequality criteria.
Optimization
In the example, I used «Month(SalesDate)» which is not sargable and therefore the query isn’t as optimized as it could be. A sargable criteria would be among the lines of «SalesDate BETWEEN #yyyy-mm-dd# AND #yyyy-mm-dd#» but that wasn’t appropriate in this situation because that’d have limited us to only a single month when we want to query for each month for all possible sales. If one wanted to optimize the performance further using «group by months», it may be necessary to consider approaches such as creating denormalized column to store the month number & indexing upon it or break the SalesDate column into SalesYear, SalesMonth, and SalesDay column. If we’re using different backends such as SQL Server, we can then use additional features such as persisted computed column or similar tools.
Having index on the column where you apply the criteria in the strawberry query will ensure that the query runs fast, handy when you start to deal with larger datasets.
If there’s any other query patterns you have noticed or used, we’d love it if you can share with us.
Deja tu comentario