Sat 9 Feb 2008
Don't Overflow the Command Line
Posted at 0:26 +1100
Today's mostly trivial and "well-known except if you haven't seen it before" system administration tip is prompted by one of the recent comments over at James' blog.
Be careful when using shell wildcard patterns as arguments to scripts. There are two ways to mess up and you'll usually forget to test both of them.
The Unix Way, whatever that means, when writing a script that works on filenames is to allow it to take more than one filename at a time and then operate on them one at a time. Very handy. However, this conceals a trap if you use it without thinking in combination with shell globs (wildcards). Suppose you attempt to run something like this:
make_thumbnails.sh *.jpg
and let's imagine it's being run via a cronjob, so that "out of sight, out of mind" has a good chance to kick in. What could possibly go wrong?
Firstly, there's the possibility that your script is going to see the literal argument "*.jpg", which won't correspond to any file it can open. This will happen if there are no files matching the "*.jpg" glob.
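One way to make the script itself robust against that first case is to check each argument before touching it. What follows is only a sketch of how the argument loop of a script like make_thumbnails.sh might look; process_one and process_all are hypothetical names, and the real work is replaced by a printf.

```shell
#!/bin/sh
# Hypothetical stand-in for the real per-file work (e.g. generating a thumbnail).
process_one() {
    printf 'would process: %s\n' "$1"
}

# Walk the arguments one at a time, skipping anything that is not a regular file.
process_all() {
    for f in "$@"; do
        # An unmatched glob arrives as the literal pattern, which fails this test.
        if [ ! -f "$f" ]; then
            printf 'skipping %s: not a regular file\n' "$f" >&2
            continue
        fi
        process_one "$f"
    done
}

process_all "$@"
```

Run with an unmatched *.jpg, this prints a "skipping" note on stderr instead of trying to open a file that isn't there.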
Secondly, there's the exact opposite of this "no files matched" problem: too many files matched. Imagine if your directory contains thousands of files with long names. All those files are put onto the same command line and if it gets too large, the exec() call to start the subprocess will fail with an error (E2BIG). Fortunately, on modern Linux systems, that limit is huge (128K historically, and since kernel 2.6.23 a quarter of the stack size limit, which typically works out to 2M), but on many other Unix systems, particularly older ones, it can be quite limiting. POSIX only requires it to be at least 4K, and the environment counts against the same limit. You can find out the maximum argument length by running
getconf ARG_MAX
at the shell prompt, for example. There's also a limit (in Linux) on the number of distinct arguments, but that's much less likely to interfere. If you don't have getconf, but do have Python, the following will also do the trick:
>>> import os
>>> os.sysconf('SC_ARG_MAX')
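If you're curious how close a particular glob comes to the limit, something like the following gives a rough idea. It's a sketch only: args_size is a hypothetical helper, it counts characters rather than bytes (the same thing only for single-byte text), and it ignores the environment, which also eats into the same limit.

```shell
# Hypothetical helper: approximate size of the given arguments on a command line.
args_size() {
    total=0
    for a in "$@"; do
        # ${#a} is the length in characters; +1 for each argument's NUL terminator.
        total=$((total + ${#a} + 1))
    done
    printf '%s\n' "$total"
}

limit=$(getconf ARG_MAX)
used=$(args_size ./*.jpg)
echo "arguments: $used bytes (approx.), limit: $limit bytes"
```

Since args_size is a shell function, calling it doesn't involve exec() at all, so the measurement itself can't trip over E2BIG.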
Avoiding both of these problems is fairly easy, providing you remember they're possible. Use the find command and, possibly, xargs. Either
find . -name \*.jpg -exec make_thumbnails.sh {} +
or
find . -name \*.jpg -print0 | xargs -0 make_thumbnails.sh
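will do what you expect, with a couple of caveats. With find's -exec, if no files are found, the command isn't run; GNU xargs, on the other hand, runs the command once even on empty input unless you pass -r (--no-run-if-empty). If more files are found than will fit on the command line, multiple commands are run. Unlike the glob, find also descends into subdirectories (GNU and BSD find accept -maxdepth 1 to stop that). You have to remember to use -print0 with find (and -0 with xargs) so that filenames containing spaces or newlines aren't split apart, but the little appreciated '+' terminator to find's -exec (also read up on -execdir) serves the same purpose. It's just not quite as portable if you have to use Ye Olde Ancient System at some point.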
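To see the "multiple commands are run" behaviour without needing thousands of files, you can force xargs to use tiny batches. In this sketch, echo stands in for make_thumbnails.sh and -n 2 artificially caps each invocation at two arguments; demo_batches is just a name made up for the example.

```shell
# Feed three NUL-terminated names to xargs, capping each invocation at two
# arguments (-n 2) to imitate a command line too small to hold all of them.
demo_batches() {
    printf '%s\0' 'a b.jpg' 'c.jpg' 'd.jpg' | xargs -0 -n 2 echo
}

demo_batches
# First invocation prints:  a b.jpg c.jpg
# Second invocation prints: d.jpg
```

Note that 'a b.jpg' survives as a single argument, thanks to the NUL separators.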
Finally, I guess there's actually a third common way to mess up this type of job: forgetting that files can have spaces in their names and iterating over $@ rather than "$@", for example. Remember, folks, every character except NUL and '/' is valid in filenames. Test them all.
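The difference between the two is easy to demonstrate. A small sketch, assuming a POSIX-style shell (sh, bash) with the default IFS; the two function names are invented for the example:

```shell
# Unquoted $@ is subject to word splitting, so one filename can become several.
count_unquoted() {
    n=0
    for f in $@; do n=$((n + 1)); done
    echo "$n"
}

# Quoted "$@" preserves each argument exactly as it was passed in.
count_quoted() {
    n=0
    for f in "$@"; do n=$((n + 1)); done
    echo "$n"
}

count_unquoted 'holiday photo.jpg'   # prints 2
count_quoted 'holiday photo.jpg'     # prints 1
```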
Topics: software/linux, technology/sysadmin