4 votes

Simple anomaly detection using plain SQL

Posted September 30, 2020 by Deimos

Tags: sql, databases

https://hakibenita.com/sql-anomaly-detection

Link information

This data is scraped automatically and may be incorrect.

Authors: Haki Benita
Published: Sep 21 2020
Word count: 2998 words

2 comments

[2]
teaearlgraycold
September 30, 2020
Link
-- Wrong! SELECT date_trunc('minute', timestamp) AS period, status_code, count(*) AS entries FROM server_log GROUP BY period, status_code; -- Correct! WITH axis AS ( SELECT status_code,...
Note that the table has a row for every minute, even if the status code was never returned in that minute. Given a table of statuses, it's very tempting to do something like this:
```
-- Wrong!
SELECT
   date_trunc('minute', timestamp) AS period,
   status_code,
   count(*) AS entries
FROM
   server_log
GROUP BY
   period,
   status_code;
```
This is a common mistake and it can leave you with gaps in the data. Zero is a value, and it holds a significant meaning. A better approach is to create an "axis", and join to it:
```
-- Correct!
WITH axis AS (
   SELECT
       status_code,
       generate_series(
           date_trunc('minute', now()),
           date_trunc('minute', now() - interval '1 hour'),
           interval '1 minute' * -1
       ) AS period
   FROM (
       VALUES (200), (400), (404), (500)
   ) AS t(status_code)
)
SELECT
   a.period,
   a.status_code,
   count(*) AS entries
FROM
   axis a
   LEFT JOIN server_log l ON (
       date_trunc('minute', l.timestamp) = a.period
       AND l.status_code = a.status_code
   )
GROUP BY
   period,
   status_code;
```
If I understand correctly, the problem the author is highlighting here is that in the "wrong" result set you'll skip over time frames without any data. Wouldn't this just come down to what tool you use to visualize your data? I would expect many visualizers would fill in the gaps with 0s.
1 vote
1. Deimos (OP)
  October 1, 2020
  Link Parent
  Right, the problem with the first query is that if you don't have an entry in server_log from every minute, you could get results like this: Time Entries 12:00 3 12:01 2 12:04 1 12:07 2 Some...
  
  Right, the problem with the first query is that if you don't have an entry in server_log from every minute, you could get results like this:
  
  Time Entries
  
  12:00 3
  
  12:01 2
  
  12:04 1
  
  12:07 2
  
  Some visualization systems could definitely handle that and know that it should insert fake rows with 0 entries for the missing minutes, but not always and it usually requires some extra configuration. It can also cause subtle errors if you're doing math to figure out something like "average entries per minute" and you divide by the number of rows, without accounting for all the zero rows that aren't included.
  
  The second one is much more complex, but makes sure that the data set is actually complete instead of relying on whatever's processing it to fill in the gaps.
  
  2 votes

Time	Entries
12:00	3
12:01	2
12:04	1
12:07	2