Using MySQL to generate daily sales reports with filled gaps

This is a classic problem but I was surprised to see that there’s not a great deal of info on the web on how to do a proper daily report – one that doesn’t have any gaps when data is missing. In this post I’ll guide you through building such a report using MySQL and PHP for presentation. Obviously, you will need to adapt the code for your needs, but the concepts should be valid for a wide range of similar needs.

So, one sunny morning your boss comes in and says “The system we currently have for handling online transactions has a big shortcoming; it has no reporting capabilities. I want to see how our widget sales are doing. Can I have it done by noon?

Looking at the database, you will most likely have something similar to the sample orders table below with (some of) the columns & associated data looking like this:

id order_date product_id quantity customer_id
1 2009-08-15 12:20:20 1 2 123
2 2009-08-15 12:20:20 2 2 123
3 2009-08-17 16:43:09 1 1 456
4 2009-08-18 09:21:43 1 5 789
5 2009-08-18 14:23:11 3 7 123
6 2009-08-21 08:34:21 1 1 456

If you don’t have any table even remotely similar to this, but wish to follow this article, you can create your own:

CREATE TABLE orders
(
  id INT PRIMARY KEY AUTO_INCREMENT,
  order_date DATETIME,
  product_id INT,
  quantity INT,
  customer_id INT
);
INSERT INTO orders (order_date, product_id, quality, customer_id)
  VALUES
  ('2009-08-15 12:20:20', '1', '2', '123'),
  ('2009-08-15 12:20:20', '2', '2', '123'),
  ('2009-08-17 16:43:09', '1', '1', '456'),
  ('2009-08-18 09:21:43', '1', '5', '789'),
  ('2009-08-18 14:23:11', '3', '7', '123'),
  ('2009-08-21 08:34:21', '1', '1', '456');

The simple query

Obtaining the total sales for each day is very easy:

SELECT DATE(order_date) AS date, SUM(quantity) AS total_sales
FROM orders
GROUP BY date

The result (based on the table above, would be

date total_sales
2009-08-15 4
2009-08-17 1
2009-08-18 12
2009-08-21 1

As you prepare to run to your boss and tell him that the problem is solved, you might notice the gap between some of the dates. What’s happening? Well, in this particular example, there is no data for the date 2009-08-16; on the other hand, you have to show that for some dates there were zero sales, so what do you do?

Instinctively, the answer seems to be generating a range of values (dates) on the fly via something like GENERATE_RANGE(start_value, end_value) and JOIN in with the dates in the table. I know quite many people new to MySQL think this way. Let me give you the bad new first: there is no such function in MySQL. There are ways to do this in MSSQL and Oracle, but even then it’s not straightforward.

The good news is that you don’t have to generate the dates on the fly – in fact you shouldn’t even want to. Think this way: why generate the same range of dates each time you run the query? Cache. Reuse. Create a calendar table. Not a temporary one; a clean honest calendar table. The benefit is that not only it’s significantly faster, but you can also add additional information to it – for example you could add a bit for holidays and some notes, like the holiday names. You can go further and mark the dates where you started special discounts to see if they had any impact on the sales. Endless possibilities.

Creating and filling the calendar table

In this article we’ll create a calendar table with just one column – the date, but as I’ve said, there’s nothing stopping you from adding more information to it.

CREATE TABLE calendar (datefield DATE)

We now need a way to fill the calendar. For this, you can use a simple stored procedure:

DELIMITER |
CREATE PROCEDURE fill_calendar(start_date DATE, end_date DATE)
BEGIN
  DECLARE crt_date DATE;
  SET crt_date=start_date;
  WHILE crt_date < end_date DO
    INSERT INTO calendar VALUES(crt_date);
    SET crt_date = ADDDATE(crt_date, INTERVAL 1 DAY);
  END WHILE;
END |
DELIMITER ;

The code is very simple. I’ve seen other (faster but less intuitive) approaches. Considering that you’ll run this sproc just once or maybe once every couple of years, saving a few miliseconds is not a big deal.

To call it, just enter the date ranges you want created:

CALL fill_calendar('2008-01-01', '2011-12-31');

Now we can redo our SELECT query, JOIN-ing the dates in the orders with the ones from the calendar.

The complete query

SELECT calendar.datefield AS DATE, SUM(orders.quantity) AS total_sales
  FROM orders RIGHT JOIN calendar ON (DATE(orders.order_date) = calendar.datefield)
GROUP BY DATE

If you run this query, you’ll notice two flaws:

  1. it includes all dates in the calendar, so even those in the future will be included;
  2. for dates not present in orders, the query will show total_sales as NULL instead of 0.

This is very easy to fix, but as usual I wanted to show you why I’m building a query in a particular way:

SELECT calendar.datefield AS DATE,
       IFNULL(SUM(orders.quantity),0) AS total_sales
FROM orders RIGHT JOIN calendar ON (DATE(orders.order_date) = calendar.datefield)
WHERE (calendar.datefield BETWEEN (SELECT MIN(DATE(order_date)) FROM orders) AND (SELECT MAX(DATE(order_date)) FROM orders))
GROUP BY DATE

So, we’re using IFNULL function to replace NULLs with zeroes and two subselects to restrict the interval.

The result is:

date total_sales
2009-08-15 4
2009-08-16 0
2009-08-17 1
2009-08-18 12
2009-08-19 0
2009-08-20 0
2009-08-21 1

Don’t make views

At this point, if you are coming from a MSSQL, Oracle or even Access background, you might be tempted into turning the query above into a view. It makes sense to aggregate some info into a view (toal sales, total income, sales for each widget) and then query the view to focus on particular areas.

Don’t.

Views are horribly slow in MySQL. Without going in detail, views in MySQL offer nowhere near the sophistication you might be used to from other databases so it’s best to just avoid them.

Charting the results

At this point, you can display the aggregated data in a nice format. Ideally you’d use a flash charting solution, which has the benefit of looking great (always a bonus for the boss) and being very efficient (you just output some XML from PHP and the flash chart does the rest).

However, in this tutorial I’ll show you how to generate a cahrt using plain html. It’s not very efficient, but it works.

First, lets obtain the data:

1
2
3
4
5
6
7
8
9
10
11
<?
	$db_connect = mysql_connect($DB_SERVER, $DB_USERNAME, $DB_PASSWORD) or die("Could not connect");
	mysql_select_db($DB_DBName) or die("Could not select database");
 
	$dailySalesQuery  = "SELECT calendar.datefield AS date,
                                IFNULL(SUM(orders.quantity),0) AS total_sales
                           FROM orders RIGHT JOIN calendar ON (DATE(orders.order_date) = calendar.datefield)
                          WHERE (calendar.datefield BETWEEN (SELECT MIN(DATE(order_date)) FROM orders) AND (SELECT MAX(DATE(order_date)) FROM orders))
                          GROUP BY date";
	$dailySalesResult = mysql_query($dailySalesQuery, $db_connect) or die(mysql_error());
?>

Now that you have the data, it’s time to build the chart:

<? while ($line = mysql_fetch_array($dailySalesResult, MYSQL_ASSOC)) { $v = $line['total_sales']; ?> <? } ?>

<? if ($v>0) echo “$v
” ?>“>

You will need a simple colored graphic file for the bar (in this example bar.gif)

If you want to output the dates below each bar, you’ll have use mysql_data_seek($dailySalesResult, 0) to move the pointer of the results back to the beginning and output another table row. It’s not very efficient, but I couldn’t find a better solution with tables.

Below you can see a real-world example of the chart:

Chart

15 replies
  1. Ralf Neubauer
    Ralf Neubauer says:

    The problem with calendar tables: they get out of date. As most people will pre-fill them some years into the future, this wil happen, when no-one even remembers the table and how to add dates. What is the best strategy to avoid this?

    • Armand Niculescu
      Armand Niculescu says:

      This goes beyond the scope of this article; each project would probably implement this in its own way.
      For example, the report could check every now and then (say once per year) if the last date in the calendar table is at least one year in the future and if not, it could add more data.

  2. Philippe
    Philippe says:

    Very illustrative article.
    Best techniques are sometimes so simple and obvious that we do forget they exist !
    Thanks a lot for your article.

    But, there is a “but”. The output is buggy and, with data provided above you don’t obtain the good result.
    Instead, you obtain this :
    date_order, total_sales
    ’2009-08-16′, 0
    ’2009-08-17′, 1
    ’2009-08-18′, 12
    ’2009-08-19′, 0
    ’2009-08-20′, 0
    ’2009-08-21′, 1

    As you can see, the two first rows are not taken in account :
    (’2009-08-15 12:20:20′, ’1′, ’2′, ’123′),
    (’2009-08-15 12:20:20′, ’2′, ’2′, ’123′),
    Why ? Because of this part of the query : SELECT MIN(order_date) FROM orders.
    In fact, SQL returns “2009-08-15 12:20:20″ which is greater than “2009-08-15 00:00:00″ (the value for a date field only).
    Using the keyword “BETWEEN”, you say SQL retrieve values that are greater or equal then “2009-08-15 00:00:00″.
    So, the correct query would be :
    SELECT MIN(DATE(order_date)) FROM orders
    In integrality :
    SELECT calendar.datefield AS date_order,
    Ifnull(Sum(orders.quantity),0) AS total_sales
    FROM orders
    RIGHT JOIN calendar
    ON Date(orders.order_date) = calendar.datefield
    WHERE calendar.datefield BETWEEN (SELECT Min(Date(order_date))
    FROM orders) AND (SELECT Max(order_date)
    FROM orders)
    GROUP BY date_order
    And the result is :
    ’2009-08-15′, 4
    ’2009-08-16′, 0
    ’2009-08-17′, 1
    ’2009-08-18′, 12
    ’2009-08-19′, 0
    ’2009-08-20′, 0
    ’2009-08-21′, 1

    Have a nice day ;)

    • Armand Niculescu
      Armand Niculescu says:

      Hi Philippe,

      On what version of MySQL have you tested? My code works perfectly on MySQL 5.0.22, I just retested it.

      Your code does make sense, not relying on MySQL implicit casting to DATE, but I tried to keep the code to a minimum (as a sidenote, the PHP code used to generate the chart is more complex if you want to show the dates and months below, but I omitted it too for brevity).

  3. Philippe
    Philippe says:

    To respond to ralph neubauer, why not create a stored proc who creates automatically a memory table with the dates needed for the job ?
    There is the source code I suggest for the stored proct (PL/SQL for MYsql )

    DELIMITER //

    DROP PROCEDURE IF EXISTS sales_by_period//

    CREATE PROCEDURE sales_by_period(IN start_date DATE, IN end_date DATE)
    COMMENT ‘Returns a dataset for sales between start_date and end_date’
    BEGIN
    — Drop and creates a temporary table in memory for the time of the session
    DROP TABLE IF EXISTS mem_calendar;
    CREATE TEMPORARY TABLE mem_calendar (
    datefield DATE
    ) ENGINE=MEMORY;

    — fills temporary table with dates
    SET @temp_date := start_date;
    WHILE @temp_date <= end_date DO
    INSERT INTO mem_calendar VALUES(@temp_date);
    SET @temp_date = ADDDATE(@temp_date, INTERVAL 1 DAY);
    END WHILE;

    — retieves and returns data
    SELECT
    mem_calendar.datefield AS date_order,
    IFNULL(SUM(orders.quantity),0) AS total_sales
    FROM orders
    RIGHT JOIN mem_calendar ON DATE(orders.order_date) = mem_calendar.datefield
    WHERE mem_calendar.datefield BETWEEN start_date AND end_date
    GROUP BY date_order;
    END //

    DELIMITER ;

    As you can see, instead of using a physical table I use a memory table wich is faster.
    But you need to activate the "memory" engine which is not activated by default (it seems to me…).
    With such a stored proc, you prevent every situation.

    Hope this helps.
    Regards,
    Philippe

    • Armand Niculescu
      Armand Niculescu says:

      I briefly mentioned the possibility of a temporary table in the article, it was actually my first thought when I had to deal with the situation.

      I chose to use a real table instead because I can effectively “cache” the results. A temporary table has to be created and filled every time you do the query. I’ve done my share of work with sprocs, cursors and temporary tables (in TSQL mostly) and I developed a mild hatred for them :)

  4. Philippe
    Philippe says:

    Hi Armand,

    “On what version of MySQL have you tested? My code works perfectly on MySQL 5.0.22, I just retested it.”
    Response : MySql 5.1.36
    I agree for the inconvenience of temporary tables so I used a table in memory to get better perfromances.
    Plus, using a stored proc permits to delimit the period to examine instead of querying for all records contained in orders table.
    This is another approach to extend the point of view and to give more ideas for members looking at your great article. That was just my intention.
    Regards.

  5. Mikolaj
    Mikolaj says:

    Hello
    I thinkyour exaple will miss first days if no orders ware made this days.
    I mean I yu have table orders filled with dates starting from 3rd of November 
    calendar.datefield BETWEEN (SELECT Min(Date(order_date))
    wil still miss 1st and 2nd of November.

    Regards
    MJ

  6. Mikolaj
    Mikolaj says:

    OK, so query should look like this:

    SELECT calendar.datefield AS date, 
           IFNULL(SUM(orders.quantity),0) AS total_sales
    FROM calendar LEFT JOIN orders ON (calendar.datefield = DATE(orders.order_date))
    WHERE (calendar.datefield BETWEEN ('2009-10-01') AND ('2009-10-31') 
    GROUP BY date;
    And one mote thing... If you would like to get sales info per product_id or customer_id, please remember to add condition to JOIN part and not into WHERE part like:
    SELECT calendar.datefield AS date, 
           IFNULL(SUM(orders.quantity),0) AS total_sales,
           customer_id
    FROM calendar LEFT JOIN orders ON (calendar.datefield = DATE(orders.order_date)) AND customer_id=1
    WHERE (calendar.datefield BETWEEN ('2009-10-01') AND ('2009-10-31') 
    GROUP BY date;
    otherwise you will not get all NULLs from the join.
    Regards
    MJ
    
    
    
    
    
  7. Jose
    Jose says:

    Ok..Here’s my try..I think this problem is easier to solve if we think in terms of unix timestamps:
    I first create a “numbers” table, with a single field (id) that goes from 0 to <max number of dates that may get involved in the query>. The query then is this:

    select @row+numbers.id*86400 , orders.quantity  FROM (select @row:=(select UNIX_TIMESTAMP(min(order_date)) FROM orders), @end:=(select UNIX_TIMESTAMP(max
    (order_date)) FROM orders)) r , numbers LEFT JOIN orders ON UNIX_TIMESTAMP(order_date)=@row+numbers.id*86400 WHERE @row+numbers.id*86400 < @end;

    @row is initialized as the min date.The 86400 are the seconds in a day.The entries in the “numbers” table (0..1..2….n) makes the expression (@row+numbers.id*86400) jump one day at a time, which is then left joined to dates in the orders table.
    In the case you’d need hours, or minutes, instead of days, changing the constant should be enough.
    I’ve just crafted this query, so i dont know if it’s bug-free! 

  8. ipkwena
    ipkwena says:

    Thanks for the article/tutorial. Its simplicity and level of detail makes it of particular relevance to a small project I am working on.

Comments are closed.