[SOLVED] Tech Support Request: Finding the biggest files of a specific type
Hey Tildes!
I need some help with a specific tech issue, and I'm sure someone here can help me do it way quicker than I would be able to on my own.
General Request
I'd like to be able to scan a directory and find all of the largest files of a specific type (e.g. the largest .jpg files). I'm running Pop!_OS, and I'm assuming there's some way to do this in the terminal, or alternatively some utility I could use.
More Specific Details
I'm cleaning up my digital music library, and I realized that in setting it up I made some errors by saving some very high-res cover art. Many of my Bandcamp purchases come with a `cover.jpg` or `cover.png` file that is several megabytes large. I made the mistake of writing these into the audio files themselves (adding, for some albums, an extra, say, 100 MB across all tracks). They also take a lot longer to load when I pull them up in my cloud music player. I'd like to be able to identify the albums with the largest `cover.*` files so that I can go in and replace the album art with a lower-res version and gain back all that wasted space lost to unnecessary duplication.
I could go folder by folder and take a look at the sizes of each, but I figure there's an easier way to surface the ones that need my attention. Everything I've looked at online so far has helped me figure out how to identify the biggest files in general, but all that will do is surface the actual audio files, when it's the cover art that needs the specific attention.
Also, in case it's necessary information, the directory structure is `Music/[artist]/[album]/cover.*`.
Any help will be very appreciated!
I'm (very) far from a Bash expert, but here's the somewhat hacky solution I came up with:
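```
find . -type f -printf '%s %p\n' | grep -Ei "(png|jpg|jpeg)" | sort -n -r
```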
I think it's always good to have some of these Unix commands in your toolbelt, so here's a quick breakdown:
```
find . -type f -printf '%s %p\n'
```

Recursively lists all the files in nested directories. Usually this will only print out the filename, but we can make it print filesize and filename by passing a custom format string with the `-printf` flag. Here, `%s` is the size in bytes of the file, `%p` is the filename, and `\n` will just make sure each file is on its own line. Notice that this isn't sorted by filesize yet, and it's not filtering by filetype yet.
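For illustration, the output looks something like this (hypothetical paths and sizes, your own files will differ):

```
9481516 ./Some Artist/Some Album/03 Some Track.mp3
10695762 ./Some Artist/Some Album/cover.jpg
2231 ./Some Artist/Some Album/notes.txt
```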
```
grep -Ei "(png|jpg|jpeg)"
```

Now we pipe the output from `find` into `grep` and use a regex to filter the filetype. `-E` lets you pass an extended regex without having to escape a bunch of characters, and `-i` does a case-insensitive match. Now the output will only contain image files.

```
sort -n -r
```
Finally we can sort the filtered list of files. This is simple, since each file's size is at the start of its line. `-n` tells `sort` to do a numerical sort (as opposed to alphabetical), and `-r` reverses the sort so that it goes from largest to smallest.

Like other commenters posted, a better solution is to just install a tool that will get this done quicker (and probably better), but hacking together Bash scripts can be fun :)
(btw if anyone has any recommendations or tricks that could improve this, feel free to add them below!)
Edit: fix typo
This is one of those things where portability is not as nice as it could be. It's really unfortunate that different systems have different `find` behavior. macOS's `find` unfortunately does not have the `-printf` option (I assume because BSD's `find` doesn't?).

On macOS I have to combine `find` and `stat`.
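Something along these lines (a sketch of the combination: BSD `stat` supplies the `%z` size and `%N` name format codes, and the last step assumes GNU `numfmt` from coreutils, which Homebrew may install as `gnumfmt`):

```
find . \( -type f -o -type l \) -exec stat -f '%z %N' {} + |  # size + name for files/symlinks
    awk '$1 > 1048576' |                                      # keep files larger than 1 MiB
    sort -rn |                                                # biggest first
    numfmt --field=1 --to=iec-i                               # bytes -> IEC (e.g. 1.0Mi)
```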
Extending on this a little bit:

1. This recursively searches the current directory for files or symlinked files and runs `stat` on them to get their sizes (since macOS `find` can't report file sizes).
2. Filter the lines to those files whose size in bytes (`$1` refers to the first column, which is the size in bytes, i.e. `%z` from `stat` in the previous part of the pipeline) is greater than 1.0 mebibyte (1024 ^ 2 = 2 ^ 20 = 1,048,576 bytes).
3. Sort the lines in reverse numerical order (largest files first).
4. Convert the number of bytes to the International Electrotechnical Commission (IEC) standard binary prefix format. In other words, convert numbers like 1048576 into 1.0Mi.
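If you haven't run into `numfmt` before, you can try that last step on its own:

```
$ numfmt --to=iec-i 1048576
1.0Mi
```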
Unix is a hell of a drug!
Output from my music (restricting to `.mp3` files) puts all the long multi-track vocal trance mixes at the top, since they're much larger than typical mp3s.
Yep. `-printf` is not part of the `find` program as defined by POSIX, and not something any of the BSD userland finds have.

You can, of course, install the GNU versions of the POSIX userland on macOS, which may be useful for portability, considering how prevalent GNU/Linux (thanks Stallman, finally a time where that's useful to specify) is on the server, a very likely place to be using the command line.
Yeah, I have many GNU coreutils installed via Homebrew. I just noticed that `gfind` is not included in the `coreutils` package; it's in a separate `findutils` package. (Homebrew kindly installs these with `g`-prefixes by default so there are no name conflicts, but you can install them with their normal names if you never want to use the macOS/BSD defaults.)

As mostly an aside, if you're piping the output of `find` into `xargs`, you should almost always use the `-print0` and `-0` arguments to each, respectively, to null-delimit records. Anyone who actually puts newlines in a filename has earned themselves a spot on the express elevator to a very special hell, but it is legal, and it is therefore ideal to handle.
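For instance, a null-safe version of the cover-art hunt might look like this (a sketch, borrowing `du` and the human-readable `sort -h` mentioned elsewhere in the thread):

```
find . -type f -name 'cover.*' -print0 | xargs -0 du -h | sort -rh
```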
(`find -exec` never has to send its records over a pipe, so no special treatment is required there; and dealing with null delimiters with any tool other than `find` and `xargs`, which have convenient flags for it, is a godawful pain, so outside that one specific context, I almost never bother.)

+1. When working on my own personal machine I think I'm safe there (I'm not a monster), but it's definitely a good habit if you're ever working on a shared system or with files someone else named.
And this is the reason why every system of mine now gets ripgrep, fd, bat and broot, no exceptions. It can get tricky if you're working on remote servers where you're not allowed to install extra packages, but for a problem like OP's, the new tools are always better.
Try installing `fd` and running `fd 'cover.*' --size +10m Music/`.

If you need to narrow it down more, run `man fd` and take a look at the options there. The `--size` section is particularly helpful (to jump to it, type `/size`).

edit: the binary may be called `fdfind` instead of `fd`.

This worked PERFECTLY. Thank you so much! Exactly the kind of solution I was looking for.
Lazy bash expert here, for an additional option: `du`.

When I have to play "which folder is hogging all the space," I'll often run `du --max-depth=1 | sort -n` to find the biggest folder in a directory and drill down.

For your case, I'd probably use `du -a | sort -n | grep -i jpg` (the `-a` makes `du` report files, not just directories). Lists all the jpgs, puts the biggest at the bottom. You can get fancy using `-E` like @what showed, but I'll often just pass through grep again for subsequent filters as needed. `grep -v` is also great for excluding results.

imagemagick is also a great tool to help you on this quest. Makes short work of shrinking large images from the CLI, no downloading needed.
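For instance, a minimal sketch (`mogrify` rewrites the file in place, so keep a backup; the geometry and quality here are just illustrative choices):

```
mogrify -resize '1200x1200>' -quality 85 cover.jpg
```

The trailing `>` in the geometry tells ImageMagick to only shrink images larger than the target, never enlarge them.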
Hey, I was going to bring up du!
I'll also mention vips/libvips as an alternative to ImageMagick - it's apparently quite a bit faster, and definitely a fair bit less memory intensive than ImageMagick. It's not particularly important for most command line work, but would definitely be a boon working with many or very large files.
`du` is my go-to; I also recommend tossing in `-h` for the human-readable output. It just presents the size to you as 2T or 3G or 456M or 78K, which is handy if you're looking for a quick summary of the space used in a bunch of directories.

Yeah, having flags for human-readable numerical figures is super handy, and something I learned in the past year that I wish I knew sooner is that
`sort` also has a `-h` flag so that it can sort numerical values formatted as human-readable strings! (And also I wish I knew about `numfmt` sooner, too.)

I didn't ever check `sort` for that, live and learn, ty!

Might not be useful here, but this is straightforward in `tree`:

```
tree -shaP "*.png|*.jpeg" --sort=size
```
gives you a directory tree of your current directory with any .png/.jpeg files listed as well, sorted by filesize. I like `tree` when grokking small to medium-sized directories, but depending on how much music you have there are probably better options.

UPDATE:
Just in case anyone is curious, the most egregious offender in my library was Savant's Void. It is 20 tracks long, and the album art from Bandcamp is 10.2 MB (its resolution is 4415 x 4415). So, I saved ~200MB from fixing that one album alone. Plus my cloud music player shouldn't have to download a 10.2 MB file just to display a tiny square of cover art.
Nice! A real takeaway here is that embedding album art into individual track files, while possibly convenient, is a large duplication of data. It would be nice if music software, or even file systems themselves, could be intelligent enough to see this sort of duplication and deduplicate it automatically. When you explicitly conflate the audio data with the album art image data in a single file, though, it becomes rather messy to transparently deduplicate.
Y'all have wayyyyyyy overcomplicated this:

```
find . -name "cover.*" -size +500k
```

Edit: Reduce `5M` to `500k`.
You might also want to replace `-name` with `-iname` to make it case-insensitive.

This is equivalent to the `fd` solution. The only reason for the extra complications that have been suggested is that with `-size`, `find` doesn't report the sizes, it just filters results based on size. That may be sufficient sometimes, but it's generally less useful than reporting the sizes along with the files, as you'll probably end up doing some sort of binary search to tune your threshold (like your change from `5M` to `500k`).
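For what it's worth, GNU `find` can filter and report at once; a sketch (GNU-only, since it uses `-printf`):

```
find . -iname 'cover.*' -size +500k -printf '%s %p\n' | sort -rn
```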
It's not equivalent: the `fd` solution requires you to install an additional package.
find
sucks in so many ways that it's really worth installingfd
and saving yourself the hassle later.I tend to agree, however one doesn't always have that luxury.
If you're working on systems for a company and you find yourself on Solaris, AIX, HP-UX, BSD, or some other random Unix variant, many of those systems don't come with the incredibly rich ecosystem of programs common to almost any Linux distribution. Even if you want to add some new userland tools, you may have to go through paperwork just to get permission to introduce those programs onto a system, depending on how the organization handles changes to servers that have to comply with draconian security rules or government regulations like HIPAA or Sarbanes-Oxley. It can sometimes not be worth the effort.
Honestly, the only reason I learned `vi` is because it's simply always there on any Unix derivative.

Sure, but in terms of the features of `fd` and `find` being used, they're equivalent.

@petrichor's suggestion is good because it is very specific to your needs. I might also suggest
`ncdu`, as I find it very useful for finding what is hogging disk space and drilling down into the directories that contain the most accumulated junk when you don't know exactly where the space is going (so it can be used in a more exploratory fashion). You can give `ncdu` an `--exclude` pattern to tell it not to size things that match the pattern, so you can focus on everything else.

If you want something with a GUI, fsearch works well. Once it indexes, I have ctrl+alt+f set to bring up a search window and any mp3 is a click away. It has a standard file browser interface and can sort by size, type, modification date, etc. Angrysearch works also, but is just a bit slower in my experience.
Another handy set of GUI tools for figuring out just what the hell is eating up your space is storage analyzers. These are so popular that there are open source apps for every platform, and they make short work of filesystem analysis. I'll link one of the better tools for each platform. ;)
Windows: WinDirStat
Linux: QDirStat
Mac: Disk Inventory X