4 votes

Simple anomaly detection using plain SQL

Tags: sql, databases

2 comments

  1. teaearlgraycold

    Note that the table has a row for every minute, even if the status code was never returned in that minute. Given a table of statuses, it's very tempting to do something like this:

    -- Wrong!
    SELECT
       date_trunc('minute', timestamp) AS period,
       status_code,
       count(*) AS entries
    FROM
       server_log
    GROUP BY
       period,
       status_code;
    

    This is a common mistake and it can leave you with gaps in the data. Zero is a value, and it holds a significant meaning. A better approach is to create an "axis", and join to it:

    -- Correct!
    WITH axis AS (
       SELECT
           status_code,
           generate_series(
               date_trunc('minute', now()),
               date_trunc('minute', now() - interval '1 hour'),
               interval '1 minute' * -1
           ) AS period
       FROM (
           VALUES (200), (400), (404), (500)
       ) AS t(status_code)
    )
    SELECT
       a.period,
       a.status_code,
       -- Count the log side of the join: count(*) would return 1, not 0,
       -- for minutes with no entries, because the axis row itself still counts.
       count(l.status_code) AS entries
    FROM
       axis a
       LEFT JOIN server_log l ON (
           date_trunc('minute', l.timestamp) = a.period
           AND l.status_code = a.status_code
       )
    GROUP BY
       a.period,
       a.status_code;
    
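    One subtlety in the corrected query (a minimal, self-contained sketch with made-up inline data, not from the article): the zero rows only appear if the aggregate counts the log side of the LEFT JOIN, which is why count(l.status_code) is used above rather than count(*):

    WITH axis(period) AS (
        VALUES (1), (2), (3)
    ),
    logs(period) AS (
        VALUES (1), (1), (3)
    )
    SELECT
        a.period,
        count(l.period) AS entries  -- 2, 0, 1; count(*) would give 2, 1, 1
    FROM
        axis a
        LEFT JOIN logs l ON l.period = a.period
    GROUP BY
        a.period
    ORDER BY
        a.period;
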

    If I understand correctly, the problem the author is highlighting here is that in the "wrong" result set you'll skip over time frames without any data. Wouldn't this just come down to what tool you use to visualize your data? I would expect many visualizers would fill in the gaps with 0s.

    1 vote
    1. Deimos

      Right, the problem with the first query is that if you don't have an entry in server_log from every minute, you could get results like this:

      Time  | Entries
      ------|--------
      12:00 | 3
      12:01 | 2
      12:04 | 1
      12:07 | 2

      Some visualization systems can definitely handle that and know to insert fake rows with 0 entries for the missing minutes, but not all of them do, and it usually requires extra configuration. It can also cause subtle errors if you're doing math to figure out something like "average entries per minute" and you divide by the number of rows without accounting for all the zero rows that aren't included.
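
      To make that pitfall concrete with the rows above: they sum to 8 entries across what is really an 8-minute window (12:00 through 12:07), so the true rate is 1 entry per minute, but a naive avg() over only the 4 returned rows reports 2. A minimal sketch of the same arithmetic in SQL:

      WITH gapped(period, entries) AS (
          VALUES ('12:00', 3), ('12:01', 2), ('12:04', 1), ('12:07', 2)
      )
      SELECT
          avg(entries) AS naive_avg  -- 2.0: divides by the 4 rows, not the 8 minutes
      FROM
          gapped;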

      The second one is much more complex, but makes sure that the data set is actually complete instead of relying on whatever's processing it to fill in the gaps.

      2 votes