-- Wrong! SELECT date_trunc('minute', timestamp) AS period, status_code, count(*) AS entries FROM server_log GROUP BY period, status_code; -- Correct! WITH axis AS ( SELECT status_code,...
Note that the table has a row for every minute, even if the status code was never returned in that minute. Given a table of statuses, it's very tempting to do something like this:
This is a common mistake and it can leave you with gaps in the data. Zero is a value, and it holds a significant meaning. A better approach is to create an "axis", and join to it:
If I understand correctly, the problem the author is highlighting here is that in the "wrong" result set you'll skip over time frames without any data. Wouldn't this just come down to what tool you use to visualize your data? I would expect many visualizers would fill in the gaps with 0s.
Right, the problem with the first query is that if you don't have an entry in server_log from every minute, you could get results like this: Time Entries 12:00 3 12:01 2 12:04 1 12:07 2 Some...
Right, the problem with the first query is that if you don't have an entry in server_log from every minute, you could get results like this:
Time
Entries
12:00
3
12:01
2
12:04
1
12:07
2
Some visualization systems could definitely handle that and know that it should insert fake rows with 0 entries for the missing minutes, but not always and it usually requires some extra configuration. It can also cause subtle errors if you're doing math to figure out something like "average entries per minute" and you divide by the number of rows, without accounting for all the zero rows that aren't included.
The second one is much more complex, but makes sure that the data set is actually complete instead of relying on whatever's processing it to fill in the gaps.
If I understand correctly, the problem the author is highlighting here is that in the "wrong" result set you'll skip over time frames without any data. Wouldn't this just come down to what tool you use to visualize your data? I would expect many visualizers would fill in the gaps with 0s.
Right, the problem with the first query is that if you don't have an entry in
server_log
from every minute, you could get results like this:Some visualization systems could definitely handle that and know that it should insert fake rows with 0 entries for the missing minutes, but not always and it usually requires some extra configuration. It can also cause subtle errors if you're doing math to figure out something like "average entries per minute" and you divide by the number of rows, without accounting for all the zero rows that aren't included.
The second one is much more complex, but makes sure that the data set is actually complete instead of relying on whatever's processing it to fill in the gaps.