- Why a big box store started solving violent crimes (9 votes)
- The fascinating history behind a set of miniature murder scenes (4 votes)
- The hidden background noise that can catch criminals (12 votes)
- Forensic science can be a powerful crime-fighting tool, but misdeeds, dubious methodologies, and bogus claims threaten its reputation—and the reputation of science as a whole (7 votes)
- A few easy Linux commands, and a real-world example of how to use them in a pinch (13 votes)
Below is a summary of a real-world performance investigation I recently went through. The tools I used are installed on virtually all Linux systems, but I know some people aren't familiar with them and would jump straight to heavyweight log analysis services and whatnot, or to writing their own solution.
Let's say you have sampled request logs in a bunch of log files that contain lines like these:

```
127.0.0.1 [2021-05-27 23:28:34.460] "GET /static/images/flags/2/54@3x.webp HTTP/2" 200 1806 TLSv1.3 HIT-CLUSTER SessionID:(null) Cache:max-age=31536000
127.0.0.1 [2021-05-27 23:51:22.019] "GET /pl/player/123456/changelog/ HTTP/1.1" 200 16524 TLSv1.2 MISS-CLUSTER SessionID:(null) Cache:
```
You might recognize Fastly logs there (IP anonymized). Now, there's a lot you might care about in this log file, but in my case, I wanted to get a breakdown of hits vs misses by URL.
So, first step, let's concatenate all the log files with `cat *.log > all.txt` so we can work off a single file. Then, let's split the file in two: hits and misses. There are a few different values for the cache status, but the majority are covered by either `HIT-CLUSTER` or `MISS-CLUSTER`. We can do this by just grepping for them like so:

```
grep HIT-CLUSTER all.txt > hits.txt; grep MISS-CLUSTER all.txt > misses.txt
```
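(As an aside, you can skip the intermediate all.txt entirely: grep accepts multiple files, and `-h` suppresses the per-file name prefix. A minimal equivalent sketch, assuming GNU or BSD grep:)

```
grep -h HIT-CLUSTER *.log > hits.txt
grep -h MISS-CLUSTER *.log > misses.txt
```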
However, we only care about the URL and whether it's a hit or a miss. So let's clean up those hits and misses with `cut`. The way cut works, it takes a delimiter (`-d`) and splits the input on it; you then give it a range of "fields" (`-f`) that you want. In our case, if we cut on spaces, we end up with, for example:

```
127.0.0.1
[2021-05-27
23:28:34.460]
"GET
/static/images/flags/2/54@3x.webp
HTTP/2"
200
1806
TLSv1.3
HIT-CLUSTER
SessionID:(null)
Cache:max-age=31536000
```
We care about the 5th value only, so let's do `cut -d" " -f5` to get that. We will also `sort` the result, because later operations will require us to work on a sorted list of values:

```
cut -d" " -f5 hits.txt | sort > hits-sorted.txt; cut -d" " -f5 misses.txt | sort > misses-sorted.txt
```
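(To sanity-check what that produces, here is the second sample line from earlier piped through the same cut; a quick sketch of the expected output:)

```
$ echo '127.0.0.1 [2021-05-27 23:51:22.019] "GET /pl/player/123456/changelog/ HTTP/1.1" 200 16524 TLSv1.2 MISS-CLUSTER SessionID:(null) Cache:' | cut -d" " -f5
/pl/player/123456/changelog/
```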
Now we can start doing some neat stuff. `wc` (word count) is an awesome utility: it lets you count characters, words, or lines very easily. `wc -l` counts lines in its input, and since we're operating with one value per line, we can already count our hits and misses:

```
$ wc -l hits-sorted.txt misses-sorted.txt
  132523 hits-sorted.txt
  220779 misses-sorted.txt
  353302 total
```
220779 / 132523 is a 1:1.66 ratio of hits to misses. That's not great…
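(That ratio is just the two line counts divided; if you want the shell to do the arithmetic, a quick sketch assuming `bc` is available:)

```
$ echo "scale=2; 220779 / 132523" | bc
1.66
```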
Alright, now I'm also interested in how many unique URLs are hit versus missed. The `uniq` tool deduplicates immediate sequences, so the input has to be sorted for it to deduplicate the entire file. We already did that. We can now count our URLs with:

```
uniq < hits-sorted.txt | wc -l; uniq < misses-sorted.txt | wc -l
```

We get 49778 and 201178, respectively. It's to be expected that most of our cache misses would be on "rarer" URLs; this gives us roughly a 1:4 ratio of cached to uncached URLs.
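(Equivalently, `sort -u` sorts and deduplicates in one go, so the same counts can come straight from the unsorted files; a small sketch:)

```
cut -d" " -f5 hits.txt | sort -u | wc -l
cut -d" " -f5 misses.txt | sort -u | wc -l
```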
Let's say we want to dig down further into which URLs are most often hitting the cache, specifically. We can add `-c` to `uniq` in order to get a duplicate count in front of each URL. To get the top ones at the top, we can then use `sort` in reverse mode (`-r`), and it also needs to be a numeric sort, not an alphabetic one (`-n`). `head` lets us get the top 10:

```
$ uniq -c < hits-sorted.txt | sort -nr | head
    815 /static/app/webfonts/fa-solid-900.woff2?d720146f1999
    793 /static/app/images/1.png
    786 /static/app/fonts/nunito-v9-latin-ext_latin-regular.woff2?d720146f1999
    760 /static/CACHE/js/output.cee5c4089626.js
    758 /static/images/crest/3/light/notfound.png
    757 /static/CACHE/css/output.4f2b59394c83.css
    756 /static/app/webfonts/fa-regular-400.woff2?d720146f1999
    754 /static/app/css/images/loading.gif?d720146f1999
    750 /static/app/css/images/prev.png?d720146f1999
    745 /static/app/css/images/next.png?d720146f1999
```
And same for misses:

```
$ uniq -c < misses-sorted.txt | sort -nr | head
     56 /
     14 /player/237678/
     13 /players/
     12 /teams/
     11 /players/top/
<snip>
```
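(As an aside, the intermediate files aren't strictly needed for this: the whole frequency count can be done in one pass over all.txt with the same tools; a sketch for the hits side:)

```
grep HIT-CLUSTER all.txt | cut -d" " -f5 | sort | uniq -c | sort -nr | head
```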
So far this tells us static files are most often hit, and for misses it also tells us… something, but we can't quite track it down yet (and we won't, not in this post). We're not adjusting for how often the page is hit as a whole; this is still just high-level analysis.
One last thing I want to show you! Let's take everything we learned and analyze those URLs by prefix instead. We can cut our URLs again, this time on slashes, with `cut -d"/"`. If we want the first prefix, we can do `-f1-2`, or `-f1-3` for the first two prefixes. Let's look!

```
$ cut -d'/' -f1-2 < hits-sorted.txt | uniq -c | sort -nr | head
 100189 /static
   5948 /es
   3069 /player
   2480 /fr
   2476 /es-mx
   2295 /pt-br
   2094 /tr
   1939 /it
   1692 /ru
   1626 /de

$ cut -d'/' -f1-2 < misses-sorted.txt | uniq -c | sort -nr | head
  66132 /static
  18578 /es
  17448 /player
  17064 /tr
  11379 /fr
   9624 /pt-br
   8730 /es-mx
   7993 /ru
   7689 /zh-hant
   7441 /it
```
This gives us hit-miss ratios by prefix. Neat, huh?
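To wrap up, here is the whole prefix breakdown folded into one small helper, working straight from all.txt. This is my own consolidating sketch using only the tools above; the `top_prefixes` name and its arguments are hypothetical, not from the post:

```
# Hypothetical helper: top URL prefixes for a given cache status.
#   $1 = cache status to grep for (e.g. HIT-CLUSTER or MISS-CLUSTER)
#   $2 = number of path components to keep (1 gives /static, /es, ...)
top_prefixes() {
    grep "$1" all.txt | cut -d" " -f5 | cut -d"/" -f1-"$(($2 + 1))" | sort | uniq -c | sort -nr | head
}

# Usage:
#   top_prefixes HIT-CLUSTER 1
#   top_prefixes MISS-CLUSTER 1
```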
- Forensic reconstruction of the Beirut explosion (10 votes)
- The remarkable life of Roxie Laybourne, the world’s first forensic ornithologist at the Smithsonian Institution (6 votes)
- The rape kit’s secret history - This is the story of the woman who forced the police to start treating sexual assault like a crime (8 votes)
- Pair Locking your iPhone to prevent law enforcement forensic imaging with Configurator 2 (10 votes)
- Detecting deepfakes by looking closely reveals a way to protect against them (14 votes)
- Nifty US govt chemical disaster investigation YouTube channel (4 votes)
- This genealogy database helped solve dozens of crimes. But its new privacy rules will restrict access by cops. (4 votes)
- Bad evidence: Ten years after a landmark study blew the whistle on junk science, the fight over forensics rages on (7 votes)
- Exposing the secret Office 365 forensics tool (10 votes)
- Was Kevin Cooper framed for murder? (3 votes)