Dominick’s Finer Foods Cereal Sales Data Analysis using SQL

 

Abstract

The University of Chicago published a dataset comprising of sales data for various product categories sold by the Dominick’s Finer Foods supermarket chain from 1989-1994 (James M. Kilts Center, University of Chicago Booth School of Business, 2018).

Given the various datasets’ large amount of entries (millions of rows per category) and varying table structures in the database, the aim of this project is to use SQL to explore a specific product category’s sales data, cereals, and prepare a small business report. This will also help set up the structure for a larger project that would combine all sales data and store-level data.

Through this data exploration, the project will ultimately present a summary of the sales data in the cereal category and a visualization of said data done in Python. The relevant descriptive statistics calculated were average profit by product, gross revenue by product, total units sold by product, and total average profit + total gross revenue.

 
 

Introduction


From 1984-1994, the University Chicago partnered with the Dominick’s Finer Foods supermarket chain in order to aggregate sales data on a variety of product categories, ranging from cigarettes, to soap, to cheese (James M. Kilts Center, University of Chicago Booth School of Business, 2018). This database would prove a perfect use case for data exploration and cleaning in SQL: given that the data records millions upon millions of entries, and has several tables of asymmetric shape containing critical information (namely divided into: sales data, product info, and store information.).


The ultimate aim of this project is to use SQL to query, clean, and calculate simple but relevant descriptive statistics to perform a small business report on a single category: in this case cereal data. Then, take the exported data and visualize findings.


This would be a common use-case in any industry, taking a mass of sales data and then using a database programming language to organize said data in order to draw insights or prepare descriptive statistics and visualizations to inform business decisions.

 
 

Data Exploration and Cleaning in SQL


Before sorting out the data in SQL, it is always worth while to examine the data’s codebook and see what is actually contained inside. Given that we ultimately want to describe sales data, and preferably at the product level to see which have done well or not, the following tables are of our interest:

  • Cereal UPC Dataset- Containing UPCs (Universal Product Code) of the products, product names, and other identifying information for products

    • Renamed to cerealID

  • Cereal Movement Dataset- Essentially the sales data recorded for the cereal product category

    • Renamed to cerealSales


And from these tables, the following variables are of most immediate interest:

  • UPC- Universal Product Code

  • DESCRIP[TION]- The actual name of the product

  • MOVE- Number of units sold (there can be multiple individual units of cereal sold at a sale in a bundle)

  • QTY- Number of individual product items sold in each overall unit sold (actual number of individual boxes sold)

  • Price

    • Not adjusted for inflation in USD

  • PROFIT- The “…gross margin in percent that DFF makes on a sale…,“ or the amount per dollar that DFF makes in profit per sale (James M. Kilts Center, University of Chicago Booth School of Business, 2018)

(James M. Kilts Center, University of Chicago Booth School of Business, 2018)


 

Given that the variables of interest (namely the product specifications and sales data) were recorded in different tables, SQL will prove an effective manner to start producing sales data by product, using the UPC code as a common ID to link the tables. Running the below code yields all the relevant variables into one neat, joined table for each sale of cereal:

Query of Data for Product Info by Sale Data

Here, now we can see the UPC and item name, along with its quantity, movement, gross profit, and price at sale. From here, we can simply use this query as a sort of base and start calculating other major datapoints in one go.

We will calculate: gross revenue by product (calculated as SUM(PRICE)*SUM(MOVE) / SUM(QTY)) (James M. Kilts Center, University of Chicago Booth School of Business, 2018), total units sold, and average gross profit by product as well, using a GROUP BY to generate datapoints per unique product.

Worth nothing is that UPC’s may not be the better variable to group by, it was the case that the same exact brand of cereal could be assigned multiple UPCs to reflect different product sizes (ex: 8 oz vs. 12 oz). Here we can generate data by individual brand of cereal:

Summary of Sales Data by Product

This dataset will be exported and used in the final visualization.

Moreover, we also can calculate the total average profit and total gross revenue just using the sales data table:

Total Average Profit

Total Gross Revenue

And with all of this, we can finally move the relevant data for visualization in Python.

 
 

Visualizing Sales Data in Python (Seaborn)


Now, the aim will be to take the dataset which describes total units sold, average profit, and total gross revenue by product all in one visualization. A visualization that immediately comes to mind is to simply chart a scatterplot similarly to a price X qty. graph, or the supply curve of the product. Moreover, then we can use color for each point to describe the average profit. This will present several data points and trends in a single visualization. Generating the visualization, we end up with:

 

Figure 1- Scatterplot of Units Sold X Total Gross Revenue, Colored by Average Gross Profit by Cereal Product

 

And from this visualization we can observe a few things immediately:

  • Average profit by product did not stray very far from the 0%-30% range for most products.

  • The majority of products did not sell many units and thus did not general much revenue relatively.

    • And likewise, only a few cereals performed extremely well, generating high revenue and units sold.

Moreover a few more descriptive stats of interest:

  • The product with the highest revenue and units sold was:

    • Cheerios at ~14.0 Million USD with 4.44 Million Units Sold

  • The product with the highest average profit was:

    • Life Cereals at a ~98.27% gain with 1,844 Units Sold

  • The product with the least revenue (and unsurprisingly only 1 unit sold) was:

    • Kretschmer Original Toasted Wheat Germ at 1.82 USD

  • And the product with the most negative profit was:

    • Golden Crisp at a -65.55% loss with 74 Units Sold

 
 

Concluding Remarks


From this small project, we were able to use SQL to aggregate a large database into a more useful dataset that one could use for a small business report with some descriptive statistics, and from that data generate a visualization to summarize the data. While fairly straight forward, the learnings I’ve gained from this project in writing SQL queries, working with multiple tables, joining them, and proper filtering will be used going forward for a larger project. This would aim to examine the entire available database for Dominick’s Finer Foods regardless of product category, and present information at the store-level.

Moreover, future projects would perform a deeper dive into the economics/finances by calculating marginal revenue and cost for the products, seeing where ideal price points are.


Works Cited:

Dominick’s Data Manual. James M. Kilts Center, University of Chicago Booth School of Business. Chicago 2018.

Data Source:

James M. Kilts Center, University of Chicago Booth School of Business. Dominick’s Dataset. Chicago. https://www.chicagobooth.edu/research/kilts/datasets/dominicks