12 votes

[SOLVED] Tech Support Request: Finding the biggest files of a specific type

Hey Tildes!

I need some help with a specific tech issue, and I'm sure someone here can help me do it way quicker than I would be able to on my own.

General Request

I'd like to be able to scan a directory and find all of the largest files of a specific type (e.g. the largest .jpg files). I'm running Pop!_OS and I'm assuming there's some way to do this in the terminal, or alternatively some utility I could use.

More Specific Details

I'm cleaning up my digital music library, and I realized that in setting it up I made some errors by saving some very high-res cover art. Many of my Bandcamp purchases come with a cover.jpg or cover.png file that is several megabytes in size. I made the mistake of writing these into the audio files themselves (adding, for some albums, an extra, say, 100 MB across all tracks). They also take a lot longer to load when I pull them up in my cloud music player. I'd like to identify the albums with the largest cover.* files so that I can go in and replace the album art with a lower-res version and gain back all that space lost to unnecessary duplication.

I could go folder by folder and take a look at the sizes of each, but I figure there's an easier way to surface the ones that need my attention. Everything I've looked at online so far has helped me figure out how to identify the biggest files in general, but all that will do is surface the actual audio files, when it's the cover art that needs the attention.

Also, in case it's necessary information, the directory structure is Music/[artist]/[album]/cover.*

Any help will be much appreciated!

26 comments

  1. [7]
    what
    (edited)

    I'm (very) far from a Bash expert, but here's the somewhat hacky solution I came up with:

    find . -type f -printf '%s %p\n' | grep -Ei "(png|jpg|jpeg)" | sort -n -r
    

    I think it's always good to have some of these Unix commands in your toolbelt, so here's a quick breakdown:

    find . -type f -printf '%s %p\n'

    Recursively lists all the files in nested directories. By default this only prints each file's path, but we can make it print the size and the path by passing a custom format string with the -printf flag. Here, %s is the size in bytes of the file, %p is the path, and \n makes sure each file is on its own line. Here's some example output:

    204164852 ./Daft Punk/Daft Punk - R.A.M. (Half-Inch 15 IPS Master Reel)/3 Giorgio By Moroder.flac
    3289016 ./Daft Punk/Daft Punk - R.A.M. (Half-Inch 15 IPS Master Reel)/DPramHQcover.png
    104411141 ./Daft Punk/Daft Punk - R.A.M. (Half-Inch 15 IPS Master Reel)/1 Give Life Back To Music.flac
    [...]
    

    Notice that this isn't sorted by filesize yet, and it's not filtering by filetype yet.

    grep -Ei "(png|jpg|jpeg)"

    Now we pipe the output from find into grep and use a regex to filter by filetype. -E enables extended regular expressions (so you don't have to escape characters like | and parentheses), and -i makes the match case-insensitive. Now the output will only contain image files:

    3289016 ./Daft Punk/Daft Punk - R.A.M. (Half-Inch 15 IPS Master Reel)/DPramHQcover.png
    89040 ./Daft Punk/Daft Punk - Discovery (2001) [24bit]/folder.jpg
    1640619 ./fusq/fusq - Lost Station/cover.png
    [...]
    

    sort -n -r

    Finally we can sort the filtered list of files. This is simple, since each file's size is at the start of each line. -n tells sort to do a numerical sort (as opposed to alphabetical), and -r reverses the sort so that it goes from largest -> smallest.
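
    One caveat: the grep above matches png/jpg anywhere in the line, so a file like ./png rips/track.flac would slip through too. If you want to be stricter, you can anchor the pattern to the extension at the end of the line (same idea, small tweak):

    find . -type f -printf '%s %p\n' | grep -Ei '\.(png|jpe?g)$' | sort -n -r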

    As other commenters have posted, a better solution is to just install a tool that will get this done quicker (and probably better), but hacking together Bash one-liners can be fun :)

    (btw if anyone has any recommendations or tricks that could improve this, feel free to add them below!)

    Edit: fix typo

    14 votes
    1. [6]
      onyxleopard
      (edited)

      This is one of those things where portability is not as nice as it could be. It's really unfortunate that different systems have different find behavior: macOS's find does not have the -printf option (I assume because BSD's find doesn't?).

      On macOS I have to combine find and stat as such:

      find . -type f -exec stat -L -f '%z %N' {} \; | sort -nr
      

      Extending this a little bit:

      find . -type f -exec stat -L -f '%z %N' {} \; | awk '$1+0 > 1048576' | sort -nr | numfmt --to=iec
      
      find . -type f -exec stat -L -f '%z %N' {} \;
      

      This recursively searches the current directory for files or symlinked files and runs stat on them to get their size (since macOS find can't report file sizes).

      | awk '$1+0 > 1048576'
      

      Filter the lines to those files whose size in bytes is greater than 1 MiB (1024^2 = 2^20 = 1,048,576 bytes). $1 refers to the first column, which is the size in bytes (the %z from stat in the previous part of the pipeline), and adding +0 forces awk to treat the field as a number.

      | sort -nr
      

      Sort the lines in reverse numerical order (largest sized files first).

      | numfmt --to=iec
      

      Convert the number of bytes to the International Electrotechnical Commission (IEC) standard binary prefix format. In other words, convert numbers like:

      $ while read n; do echo "$n -> $(numfmt --to=iec $n)"; done         
      1024
      1024 -> 1.0K
      1048576
      1048576 -> 1.0M
      1073741824
      1073741824 -> 1.0G
      

      Unix is a hell of a drug!

      Output from my music (restricting to .mp3 files):

      $ find . -type f -name '*.mp3' -exec stat -L -f '%z %N' {} \; | awk '$1+0 > 1048576' | sort -nr | numfmt --to=iec  
      116M ./DJ Doboy/Vocal Edition Volume 14/Vocal Edition Volume 14.mp3
      110M ./DJ Doboy/Vocal Edition Volume 08/Vocal Edition Volume 08.mp3
      110M ./DJ Doboy/Vocal Edition Volume 18/Vocal Edition Volume 18.mp3
      109M ./DJ Doboy/Vocal Edition Volume 10/Vocal Edition Volume 10.mp3
      108M ./DJ Doboy/Vocal Edition Volume 15/Vocal Edition Volume 15.mp3
      106M ./DJ Doboy/Vocal Edition Volume 09/Vocal Edition Volume 09.mp3
      105M ./DJ Doboy/Vocal Edition Volume 13/Vocal Edition Volume 13.mp3
      105M ./DJ Doboy/Vocal Edition Volume 05/Vocal Edition Volume 05.mp3
      104M ./DJ Doboy/Vocal Edition Volume 12/Vocal Edition Volume 12.mp3
      104M ./DJ Doboy/Vocal Edition Volume 16/Vocal Edition Volume 16.mp3
      ...
      

      (These are all long multi-track vocal trance mixes, so they're much larger than typical mp3s.)

      8 votes
      1. [2]
        stu2b50

        Yep. -printf is not part of the find program as defined by POSIX, and not something any of the BSD userland finds have.

        You can, of course, install the GNU versions of the POSIX userland on macOS. Considering how prevalent GNU/Linux (thanks stallman, finally a time where that's useful to specify) is on servers, which are a very likely place to be using the command line, that may be worth doing so you can have portability between the two.

        5 votes
        1. onyxleopard

          Yeah, I have many GNU coreutils installed via homebrew. I just noticed that gfind is not included in the coreutils package; it's in a separate findutils package. (Homebrew kindly installs these with g- prefixes by default so there are no name conflicts, but you can install them under their normal names if you never want to use the macOS/BSD defaults.)
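
          So on macOS, something like this (assuming Homebrew's default g- prefix) gets you the GNU -printf behavior from upthread:

          brew install findutils
          gfind . -type f -printf '%s %p\n' | sort -nr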

          4 votes
      2. [2]
        whbboyd

        As mostly an aside, if you're piping the output of find into xargs, you should almost always use the -print0 and -0 arguments to each, respectively, to null-delimit records. Anyone who actually puts newlines in a filename has earned themselves a spot on the express elevator to a very special hell, but it is legal, so it's best to handle it.
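
        For example (a minimal sketch; the du -h at the end is just a stand-in for whatever you'd actually run on each file):

        find . -iname 'cover.*' -print0 | xargs -0 du -h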

        (find -exec never has to send its records over a pipe, so no special treatment is required there; and dealing with null delimiters with any tool other than find and xargs, which have convenient flags for it, is a godawful pain, so outside that one specific context, I almost never bother.)

        5 votes
        1. onyxleopard

          +1. When working on my own personal machine I think I'm safe there (I'm not a monster), but definitely a good habit if you're ever working on a shared system or with files someone else named.

          2 votes
      3. helloworld

        And this is the reason why every system of mine now gets ripgrep, fd, bat and broot, no exceptions. It can get tricky if you're working on remote servers where you're not allowed to install extra packages, but for a problem like OP's, the new tools are always better.

        3 votes
  2. [2]
    petrichor
    (edited)

    Try installing fd and running fd 'cover.*' --size +10m Music/ (the pattern is quoted so the shell doesn't expand it before fd sees it).

    If you need to narrow it down more, run man fd, and take a look at the options there. The --size section is particularly helpful (to jump to it, type /size).

    edit: on Debian-based distros (which includes Pop!_OS), the binary may be called fdfind instead of fd.
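
    If you also want to see the sizes instead of just filtering on them, fd can hand its matches off to du (a rough sketch; -X batches all matches into a single du call):

    fd 'cover.*' Music/ --size +1m -X du -h | sort -rh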

    11 votes
    1. kfwyre

      This worked PERFECTLY. Thank you so much! Exactly the kind of solution I was looking for.

      5 votes
  3. [5]
    vord
    (edited)

    Lazy bash expert here, for an additional option: du.

    When I have to play "which folder is hogging all the space," I'll often run du --max-depth=1 | sort -n to find the biggest folder in a directory and drill down.

    For your case, I'd probably use du -a | sort -n | grep -i jpg (the -a makes du report files, not just directories). That lists all the jpgs and puts the biggest at the bottom. You can get fancy using -E like @what showed, but I'll often just pipe through grep again for subsequent filters as needed. grep -v is also great for excluding results.

    imagemagick is also a great tool to help you on this quest. Makes short work of shrinking large images from the CLI, no downloading needed.
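
    For example, something like this shrinks a cover in place, and only if it's bigger than the target (the geometry and quality values are just illustrative):

    mogrify -resize '1000x1000>' -quality 85 cover.jpg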

    10 votes
    1. anothersimulacrum

      Hey, I was going to bring up du!

      I'll also mention vips/libvips as an alternative to ImageMagick - it's apparently quite a bit faster, and definitely a fair bit less memory-intensive than ImageMagick. That's not particularly important for most command-line work, but it would definitely be a boon when working with many files or very large ones.
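
      If you go that route, the bundled vipsthumbnail tool covers this exact use case (a sketch; the size and output name are illustrative):

      vipsthumbnail cover.png --size 1000x1000 -o small_%s.png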

      5 votes
    2. [3]
      Amarok

      du is my go-to; I also recommend tossing in -h for the human-readable output. It just presents the size to you as 2T or 3G or 456M or 78K, which is handy if you're looking for a quick summary of the space used in a bunch of directories.
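
      e.g., for a quick per-artist summary of a music library (a sketch):

      du -h --max-depth=1 ~/Music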

      5 votes
      1. [2]
        onyxleopard

        Yeah, having flags for human-readable numerical figures is super handy. Something I learned in the past year that I wish I'd known sooner is that sort also has a -h flag, so it can sort numerical values formatted as human-readable strings! (I wish I'd known about numfmt sooner, too.)

        $ shuf sizes.txt 
        116K
        0B
        4.2M
        1.1G
        4.1M
        512B
        196K
        $ shuf sizes.txt | sort -hr
        1.1G
        4.2M
        4.1M
        196K
        116K
        512B
        0B
        
        4 votes
        1. vord

          I didn't ever check sort for that, live and learn, ty!

          1 vote
  4. streblo

    Might not be useful here, but this is straightforward in tree:

    tree -shaP "*.png|*.jpg|*.jpeg" --sort=size gives you a directory tree of your current directory with any .png/.jpg/.jpeg files listed as well, sorted by filesize. I like tree for grokking small-to-medium-sized directories, but depending on how much music you have there are probably better options.

    5 votes
  5. [2]
    kfwyre

    UPDATE:

    Just in case anyone is curious, the most egregious offender in my library was Savant's Void. It is 20 tracks long, and the album art from Bandcamp is 10.2 MB (its resolution is 4415 x 4415). So, with that art embedded in all 20 tracks (20 × 10.2 MB ≈ 204 MB), I saved ~200 MB by fixing that one album alone. Plus my cloud music player shouldn't have to download a 10.2 MB file just to display a tiny square of cover art.

    5 votes
    1. onyxleopard

      Nice! A real takeaway here is that embedding album art into individual track files, while possibly convenient, is a large duplication of data. It would be nice if music software, or even file systems themselves, could be intelligent enough to see this sort of duplication and deduplicate it automatically. When you explicitly conflate the audio data with the album art image data in a single file, though, it becomes rather messy to transparently deduplicate.

      3 votes
  6. [6]
    pArSeC
    (edited)

    Y'all have wayyyyyyy overcomplicated this:

    find . -name "cover.*" -size +500k

    Edit: Reduced 5M to 500k.
    You might also want to replace -name with -iname to make it case-insensitive.

    5 votes
    1. [5]
      onyxleopard

      This is equivalent to the fd solution. The only reason for the extra complications that have been suggested is that with -size, find doesn't report the sizes; it just filters results based on size. That may be sufficient sometimes, but it's generally less useful than reporting the sizes along with the files, as you'll probably end up doing some sort of binary search to tune your threshold (like your change from 5M to 500k).
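
      On GNU find (which OP has on Pop!_OS), you can get both at once, something like:

      find . -iname 'cover.*' -size +500k -printf '%s %p\n' | sort -nr | numfmt --to=iec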

      3 votes
      1. [4]
        pArSeC

        It's not equivalent: the fd solution requires you to install an additional package.

        2 votes
        1. [2]
          petrichor

          Sure, but find sucks in so many ways that it's really worth installing fd and saving yourself the hassle later.

          2 votes
          1. Amarok

            I tend to agree; however, one doesn't always have that luxury.

            If you're working on systems for a company and you find yourself on Solaris, AIX, HP-UX, BSD, or some other random Unix variant, many of those systems don't come with the incredibly rich ecosystem of programs common to almost any Linux distribution. Even if you want to add some new userland tools, you may have to go through paperwork just to get permission to introduce them, depending on how that organization handles changes to servers that must comply with draconian security policies or government regulations like HIPAA or Sarbanes-Oxley. It can sometimes not be worth the effort.

            Honestly, the only reason I learned vi is because it's simply always there on any unix derivative.

            2 votes
        2. onyxleopard

          Sure, but in terms of the features of fd and find being used, they're equivalent.

  7. onyxleopard

    @petrichor’s suggestion is good because it is very specific to your needs. I might also suggest ncdu, as I find it very useful for finding what is hogging disk space and drilling down into the directories that have accumulated the most junk when you don’t know exactly where the problem is (so it can be used in a more exploratory fashion). You can provide an --exclude pattern to ncdu to tell it not to size things that match your pattern, so you can focus on everything else.
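
    For example, to poke around your music library while ignoring the audio files themselves (a sketch; adjust the patterns to your formats):

    ncdu --exclude '*.flac' --exclude '*.mp3' ~/Music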

    4 votes
  8. [2]
    the_funky_buddha

    If you want something with a GUI, fsearch works well. Once it has built its index, any mp3 is a click away (I have ctrl+alt+f set to bring up a search window). It has a standard file-browser interface and can sort by size, type, modification date, etc. ANGRYsearch also works, but it's a bit slower in my experience.

    3 votes
    1. Amarok

      Another handy set of GUI tools for figuring out just what the hell is eating up your space is storage analyzers. These are so popular that there are open-source apps for every platform, and they make short work of filesystem analysis. I'll link one of the better tools for each platform. ;)

      Windows: WinDirStat

      Linux: QDirStat

      Mac: Disk Inventory X

      2 votes