Authored by Ben Clothier

The other day Juan needed my help with a query that required a special technique commonly known as the Strawberry query. He needed a way to match one group of records with the first occurrence of a item in another table. From that I decided to write this post to share the technique.

Access can work great in large companies, call us to discuss

There are times where we want to not just know that a maximum or minimum value in a table but also know which row contains that maximum value. This can be a challenge as SQL does not make it easy to express such a question, but strawberry query is a wonderful answer to that question. To help appreciate the technique, we’ll be using a running example. Suppose we have a company that collects and sells rare & highly prized collector items and we want to know who is the best performing salesperson. Here’s our Sales data for that hypotehical company.

 

Full Name	Product               	        Date	Sales Amount
Ed William	1957 Nourve de Champenge	7/12/11	 $ 3,820
Ed William	Prince Edward's Spare Tire	7/21/11	 $ 8,120
Ed William	12th Centrury Ming Vase  	7/31/11	 $21,100
Ed William	Bruce Springsteen's wristwatch	8/13/11	 $   790
Ed William	Cracked Darth Vader Helmet	8/20/11	 $15,210
Ed William	Tamara de Lempicka's Amythyst	9/3/11	 $ 8,500
Zooey Yoko	1855 S & W Blue Colt Revolver	7/20/11	 $ 4,250
Zooey Yoko	Grizzly Bear Taxidermy  	8/1/11	 $18,300
Zooey Yoko	Ringo Starr's hair lock  	8/20/11	 $28,100
Zooey Yoko	Potrait of unknown woman	9/7/11	 $   230
Zooey Yoko	1790 American Plow      	9/12/11	 $ 7,000
Rolf Smith	Giuseppe's Nutcracker   	7/20/11	 $12,730
Rolf Smith	Mother Goddess Figurine 	7/27/11	 $19,100
Rolf Smith	Fragment of Dead Sea Scrolls	8/3/11	 $11,500
Rolf Smith	Shroud of Turin         	8/13/11	 $   160
Rolf Smith	1962 Corvette           	9/2/11	 $42,200

 

Typically, When we want to know what was the largest sales amount in a given month, we can ask:

SELECT Month(s.SalesDate) AS SalesMonth,
       Max(s.AgreedPrice) AS SalesAmount
FROM tblSales AS s
GROUP BY Month(s.SalesDate);

 

This query would give us something like this:

Sales Month	Sales Amount
    7   	 $21,100.00
    8   	 $28,100.00
    9   	 $42,200.00

But let’s ask – which salesman had the biggest sales each month? This wouldn’t work:

SELECT Month(s.SalesDate) AS SalesMonth,
       sp.FullName AS SalesPerson,
       Max(s.AgreedPrice) AS SalesAmount
FROM tblSales AS s
INNER JOIN tblSalesPeople AS sp
  ON s.SalesPersonID = sp.SalesPersonID
GROUP BY Month(s.SalesDate),
         sp.FullName
;

The query is actually equivalent to «give me all sales staff’s largest sales per month». That’s a different question from «who was the salesperson that made the biggest sales in this month?».

Sales Month	Sales Person	Sales Amount
    7   	Ed William	 $21,100
    7   	Rolf Smith	 $19,100
    7   	Zooey Yoko	 $ 4,250
    8   	Ed William	 $15,210
    8   	Rolf Smith	 $11,500
    8   	Zooey Yoko	 $28,100
    9   	Ed William	 $ 8,500
    9   	Rolf Smith	 $42,200
    9   	Zooey Yoko	 $ 7,000

Instead of getting only a row per month, we got a row for numbers of month times numbers of salesperson and thus 6 rows too many for this example. Fortunately, we can express this question using the ‘strawberry query.’

Strawberry Query to the rescue

A strawberry query has these essential components:

1) it is usually a self-join within one table
2) the join criteria has more than one column being compared
3) one of those join criteria is an inequality operator.
4) the outer table’s column must evaluate to NULL.

So, to get the SalesPerson, (Full Name field below), from tblSales where the sales amount was largest for the month, we can use this query:

SELECT Month(s.SalesDate) AS SalesMonth,
       s.SalesDate,
       p.FullName,
       s.AgreedPrice AS SalesAmount
FROM (tblSalesPeople AS p
INNER JOIN tblSales AS s
  ON p.SalesPersonID = s.SalesPersonID)
LEFT JOIN tblSales AS c
  ON (s.AgreedPrice < c.AgreedPrice)
  AND (Month(s.SalesDate) = Month(c.SalesDate))
WHERE c.AgreedPrice IS NULL;

The query will produce this result:

Sales Month	Sales Date	Full Name	Sales Amount
    7   	 7/31/2011	Ed William	 $21,100
    8    	 8/20/2011	Zooey Yoko	 $28,100
    9   	  9/2/2011	Rolf Smith	 $42,200

 

Because the query did not use any aggregate function neither did it have a GROUP BY clause, we could then join the SalesPerson table, include the full name for the display as well as the actual Sales Date from the Sales table which would have been invalid in an aggregate query.

How does the strawberry query work?

The expression «Month(s.SalesDate) = Month(c.SalesDate)» is functionally equivalent to doing a «GROUP BY Month(SalesDate)», and that is how we’d define our «grouping» when we use a strawberry query. The second expression, «s.Amount < c.Amount» in conjunction with the «WHERE c.Amount IS NULL;» is equivalent to doing a «MAX(Amount)». If you’re wondering how the «c.Amount IS NULL» expression could help us extract the minimum or maximum amount, it may help to look at the query if we removed the WHERE clause.

First, we’ll reduce the query to essential skeleton of the strawberry, removing the tblSalesPerson table and other columns then include the outer columns.

SELECT Month(s.SalesDate) AS SalesMonth,
       s.AgreedPrice AS SalesAmount,
       c.AgreedPrice AS OtherSalesAmount
FROM  tblSales AS s
LEFT JOIN tblSales AS c
  ON (s.AgreedPrice < c.AgreedPrice) AND (Month(s.SalesDate) = Month(c.SalesDate))
;

 

Here is the listing of the result query would return without the WHERE clause: (lower sales were omitted to reduce the number of rows shown)

SalesMonth	SalesAmount	OtherSalesAmount
  ...    ...     ...
7	$8,120.00	$19,100.00
7	$21,100.00	 NULL
  ...    ...    ...
7	$12,730.00	$21,100.00
7	$12,730.00	$19,100.00
7	$19,100.00	$21,100.00
8	$15,210.00	$18,300.00
  ...    ...    ...
8	$11,500.00	$28,100.00
8	$11,500.00	$18,300.00
8	$11,500.00	$15,210.00
  ...    ...   ...
8	$18,300.00	$28,100.00
8	$15,210.00	$28,100.00
8	$28,100.00	 NULL
9	$7,000.00	$8,500.00
9	$8,500.00	$42,200.00
9	$7,000.00	$42,200.00
9	$42,200.00	 NULL
  ...    ...    ...

A row from left table will be matched with any other rows in right table where the amount is greater. You can see how all sales that weren’t the largest sales get matches with each other’s sales that had a larger sales amount. The only row that can’t match any other sales is when it is the largest amount, and we would get a NULL from the outer sales table’s column. Hence the WHERE clause filters the set down to only those rows and we get our maximum row.

Closing Notes

Use SQL View for Strawberry Queries

Because the strawberry query uses an inequality in the join criteria, Access Query Designer cannot represent so if you attempt to view the query in design, you’ll get an error. For those who love QBE, you can add all the tables you need, drop in fields, then give the tables an alias so that it’s a simple editing when you switch to SQL and add that inequality criteria.

Optimization

In the example, I used «Month(SalesDate)» which is not sargable and therefore the query isn’t as optimized as it could be. A sargable criteria would be among the lines of «SalesDate BETWEEN #yyyy-mm-dd# AND #yyyy-mm-dd#» but that wasn’t appropriate in this situation because that’d have limited us to only a single month when we want to query for each month for all possible sales. If one wanted to optimize the performance further using «group by months», it may be necessary to consider approaches such as creating denormalized column to store the month number & indexing upon it or break the SalesDate column into SalesYear, SalesMonth, and SalesDay column. If we’re using different backends such as SQL Server, we can then use additional features such as persisted computed column or similar tools.

Having index on the column where you apply the criteria in the strawberry query will ensure that the query runs fast, handy when you start to deal with larger datasets.

If there’s any other query patterns you have noticed or used, we’d love it if you can share with us.