[SOLVED] Tech Support Request: Finding the biggest files of a specific type
Hey Tildes!
I need some help with a specific tech issue, and I'm sure someone here can help me do it way quicker than I would be able to on my own.
General Request
I'd like to be able to scan a directory and find all of the largest files of a specific type (e.g. the largest .jpg files). I'm running Pop!_OS, and I'm assuming there's some way to do this in the terminal, or alternatively some utility I could use.
More Specific Details
I'm cleaning up my digital music library, and I realized that in setting it up I made some errors by saving some very high-res cover art. Many of my Bandcamp purchases come with a `cover.jpg` or `cover.png` file that is several megabytes large. I made the mistake of writing these into the audio files themselves (adding, for some albums, an extra, say, 100 MB across all tracks). They also take a lot longer to load when I pull them up in my cloud music player. I'd like to be able to identify the albums with the largest `cover.*` files so that I can go in and replace the album art with a lower-res version and gain back all that wasted space lost to unnecessary duplication.
I could go folder by folder and take a look at the sizes of each, but I figure there's an easier way to surface the ones that need my attention. Everything I've looked at online so far has helped me figure out how to identify the biggest files in general, but all that will do is surface the actual audio files, when it's the cover art that needs the specific attention.
Also, in case it's necessary information, the directory structure is `Music/[artist]/[album]/cover.*`.
Any help will be very appreciated!
I'm (very) far from a Bash expert, but here's the somewhat hacky solution I came up with:
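```
find . -type f -printf '%s %p\n' | grep -Ei "(png|jpg|jpeg)" | sort -n -r
```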
I think it's always good to have some of these Unix commands in your toolbelt, so here's a quick breakdown:
```
find . -type f -printf '%s %p\n'
```

Recursively lists all the files in nested directories. Usually this will only print out the filename, but we can make it print filesize and filename by passing a custom format string with the `-printf` flag. Here, `%s` is the size in bytes of the file, `%p` is the filename, and `\n` will just make sure each file is on its own line. Notice that this isn't sorted by filesize yet, and it's not filtering by filetype yet.
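For illustration, the output looks something like this (hypothetical paths and sizes, your own files will differ):

```
9481516 ./Some Artist/Some Album/03 Some Track.mp3
10695762 ./Some Artist/Some Album/cover.jpg
2231 ./Some Artist/Some Album/notes.txt
```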
```
grep -Ei "(png|jpg|jpeg)"
```

Now we pipe the output from `find` into `grep` and use a regex to filter the filetype. `-E` lets you pass an extended regex without having to escape a bunch of characters, and `-i` does a case-insensitive match. Now the output will only contain image files.

```
sort -n -r
```
Finally we can sort the filtered list of files. This is simple, since each file's size is at the start of its line. `-n` tells `sort` to do a numerical sort (as opposed to alphabetical), and `-r` reverses the sort so that it goes from largest to smallest.

Like other commenters posted, a better solution is to just install a tool that will get this done quicker (and probably better), but hacking together Bash scripts can be fun :)
(btw if anyone has any recommendations or tricks that could improve this, feel free to add them below!)
Edit: fix typo
This is one of those things where portability is not as nice as it could be. It's really unfortunate that different systems have different `find` behavior. macOS's `find` unfortunately does not have the `-printf` option (I assume because BSD's `find` doesn't?).

On macOS I have to combine `find` and `stat`.
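Something along these lines (a sketch of the combination: BSD `stat` supplies the `%z` size and `%N` name format codes, and the last step assumes GNU `numfmt` from coreutils, which Homebrew may install as `gnumfmt`):

```
find . \( -type f -o -type l \) -exec stat -f '%z %N' {} + |  # size + name for files/symlinks
    awk '$1 > 1048576' |                                      # keep files larger than 1 MiB
    sort -rn |                                                # biggest first
    numfmt --field=1 --to=iec-i                               # bytes -> IEC (e.g. 1.0Mi)
```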
Extending on this a little bit:

1. This recursively searches the current directory for files or symlinked files and runs `stat` on them to get their sizes (since macOS `find` can't report file sizes).
2. Filter the lines to those files whose size in bytes (`$1` refers to the first column, which is the size in bytes, i.e. `%z` from `stat` in the previous part of the pipeline) is greater than 1.0 mebibyte (1024 ^ 2 = 2 ^ 20 = 1,048,576 bytes).
3. Sort the lines in reverse numerical order (largest files first).
4. Convert the number of bytes to the International Electrotechnical Commission (IEC) standard binary prefix format. In other words, convert numbers like 1048576 into 1.0Mi.
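If you haven't run into `numfmt` before, you can try that last step on its own:

```
$ numfmt --to=iec-i 1048576
1.0Mi
```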
Unix is a hell of a drug!
Output from my music (restricting to `.mp3` files) puts all the long multi-track vocal trance mixes at the top, since they're much larger than typical mp3s.
Yep. `-printf` is not part of the `find` program as defined by POSIX, and not something any of the BSD userland finds have.

You can, of course, install the GNU versions of the POSIX userland on macOS, which may be useful for portability, considering how prevalent GNU/Linux (thanks Stallman, finally a time where that's useful to specify) is on the server, a very likely place to be using the command line.
Yeah, I have many GNU coreutils installed via Homebrew. I just noticed that `gfind` is not included in the `coreutils` package; it's in a separate `findutils` package. (Homebrew kindly installs these with `g`-prefixes by default so there are no name conflicts, but you can install them with their normal names if you never want to use the macOS/BSD defaults.)

As mostly an aside, if you're piping the output of `find` into `xargs`, you should almost always use the `-print0` and `-0` arguments to each, respectively, to null-delimit records. Anyone who actually puts newlines in a filename has earned themselves a spot on the express elevator to a very special hell, but it is legal, and it is therefore ideal to handle.
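For instance, a null-safe version of the cover-art hunt might look like this (a sketch, borrowing `du` and the human-readable `sort -h` mentioned elsewhere in the thread):

```
find . -type f -name 'cover.*' -print0 | xargs -0 du -h | sort -rh
```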
(`find -exec` never has to send its records over a pipe, so no special treatment is required there; and dealing with null delimiters with any tool other than `find` and `xargs`, which have convenient flags for it, is a godawful pain, so outside that one specific context, I almost never bother.)

+1. When working on my own personal machine I think I'm safe there (I'm not a monster), but it's definitely a good habit if you're ever working on a shared system or with files someone else named.
And this is the reason why every system of mine now gets ripgrep, fd, bat and broot, no exceptions. It can get tricky if you're working on remote servers where you're not allowed to install extra packages, but for a problem like OP's, the new tools are always better.
Try installing `fd` and running `fd 'cover.*' --size +10m Music/`.

If you need to narrow it down more, run `man fd` and take a look at the options there. The `--size` section is particularly helpful (to jump to it, type `/size`).

edit: the binary may be called `fdfind` instead of `fd`.

This worked PERFECTLY. Thank you so much! Exactly the kind of solution I was looking for.
Lazy bash expert here, for an additional option: `du`.

When I have to play "which folder is hogging all the space," I'll often run `du --max-depth=1 | sort -n` to find the biggest folder in a directory and drill down.

For your case, I'd probably use `du -a | sort -n | grep -i jpg` (the `-a` makes `du` report files, not just directories). Lists all the jpgs, puts the biggest at the bottom. You can get fancy using `-E` like @what showed, but I'll often just pass through grep again for subsequent filters as needed. `grep -v` is also great for excluding results.

imagemagick is also a great tool to help you on this quest. Makes short work of shrinking large images from the CLI, no downloading needed.
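For instance, a minimal sketch (`mogrify` rewrites the file in place, so keep a backup; the geometry and quality here are just illustrative choices):

```
mogrify -resize '1200x1200>' -quality 85 cover.jpg
```

The trailing `>` in the geometry tells ImageMagick to only shrink images larger than the target, never enlarge them.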
Hey, I was going to bring up du!
I'll also mention vips/libvips as an alternative to ImageMagick - it's apparently quite a bit faster, and definitely a fair bit less memory intensive than ImageMagick. It's not particularly important for most command line work, but would definitely be a boon working with many or very large files.
`du` is my go-to; I also recommend tossing in `-h` for the human-readable output. It just presents the size to you as 2T or 3G or 456M or 78K, which is handy if you're looking for a quick summary of the space used in a bunch of directories.

Yeah, having flags for human-readable numerical figures is super handy, and something I learned in the past year that I wish I knew sooner is that
`sort` also has a `-h` flag so that it can sort numerical values formatted as human-readable strings! (And also I wish I knew about `numfmt` sooner, too.)

I didn't ever check `sort` for that, live and learn, ty!

Might not be useful here, but this is straightforward in `tree`:

```
tree -shaP "*.png|*.jpeg" --sort=size
```
gives you a directory tree of your current directory with any .png/.jpeg files listed as well, sorted by filesize. I like `tree` when grokking small to medium-sized directories, but depending on how much music you have there are probably better options.

UPDATE:
Just in case anyone is curious, the most egregious offender in my library was Savant's Void. It is 20 tracks long, and the album art from Bandcamp is 10.2 MB (its resolution is 4415 x 4415). So, I saved ~200MB from fixing that one album alone. Plus my cloud music player shouldn't have to download a 10.2 MB file just to display a tiny square of cover art.
Nice! A real takeaway here is that embedding album art into individual track files, while possibly convenient, is a large duplication of data. It would be nice if music software, or even file systems themselves, could be intelligent enough to see this sort of duplication and deduplicate it automatically. When you explicitly conflate the audio data with the album art image data in a single file, though, it becomes rather messy to transparently deduplicate.
Y'all have wayyyyyyy overcomplicated this:

```
find . -name "cover.*" -size +500k
```

Edit: Reduce `5M` to `500k`.
You might also want to replace `-name` with `-iname` to make it case-insensitive.

This is equivalent to the `fd` solution. The only reason for the extra complications that have been suggested is that with `-size`, `find` doesn't report the sizes, it just filters results based on size. That may be sufficient sometimes, but it's generally less useful than reporting the sizes along with the files, as you'll probably end up doing some sort of binary search to tune your threshold (like your change from `5M` to `500k`).
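For what it's worth, GNU `find` can filter and report at once; a sketch (GNU-only, since it uses `-printf`):

```
find . -iname 'cover.*' -size +500k -printf '%s %p\n' | sort -rn
```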
It's not equivalent: the `fd` solution requires you to install an additional package.
find
sucks in so many ways that it's really worth installingfd
and saving yourself the hassle later.I tend to agree, however one doesn't always have that luxury.
If you're working on systems for a company and you find yourself on Solaris, AIX, HP-UX, BSD, or some other random Unix variant, many of those systems don't come with the incredibly rich ecosystem of programs common to almost any Linux distribution. Even if you want to add some new userland tools, you may have to go through paperwork just to get permission to introduce those programs onto a system, depending on how the organization handles changes to servers that have to comply with draconian security rules or government regulations like HIPAA or Sarbanes-Oxley. It can sometimes not be worth the effort.
Honestly, the only reason I learned `vi` is because it's simply always there on any Unix derivative.

Sure, but in terms of the features of `fd` and `find` being used, they're equivalent.

@petrichor's suggestion is good because it is very specific to your needs. I might also suggest
`ncdu`, as I find it very useful for finding what is hogging disk space and drilling down into the directories that contain the most accumulated junk when you don't know exactly where the space is going (so it can be used in a more exploratory fashion). You can give `ncdu` an `--exclude` pattern to tell it not to size things that match the pattern, so you can focus on everything else.

If you want something with a GUI, fsearch works well. Once it indexes, I have ctrl+alt+f set to bring up a search window and any mp3 is a click away. It has a standard file browser interface and can sort by size, type, modification date, etc. Angrysearch works also, but is just a bit slower in my experience.
Another handy set of GUI tools for figuring out just what the hell is eating up your space is storage analyzers. These are so popular that there are open source apps for every platform, and they make short work of filesystem analysis. I'll link one of the better tools for each platform. ;)
Windows: WinDirStat
Linux: QDirStat
Mac: Disk Inventory X