When I searched for this, I found a lot of questions but no real answers. Based on a couple of hints I found online, I put these scripts together which might be of use to someone else.
First, the script to scan the cache for images we want to extract.
#!/bin/bash
# -- Edit these as needed !
cache=/var/spool/squid
output=/home/johncc/Out/
filebase=foo-XXXX
# --
match="$1"
set -u
# Limit to files above 15k size in order to skip thumbnails where possible
find "$cache" -size +15k -type f | while IFS= read -r file
do
if head "$file" | grep -q "$match"; then
content_type="$(head "$file" | grep -ai "Content-Type:" | cut -f 2 -d " " | tr -d '[:cntrl:]')"
echo "Matched $file ... content type $content_type" >&2
type=""
if [[ "$content_type" = 'image/jpeg' ]]; then
type=jpg
elif [[ "$content_type" = 'image/png' ]]; then
type=png
elif [[ "$content_type" = 'image/gif' ]]; then
type=gif
fi
if [[ "$type" != "" ]]; then
outfile="$output$(mktemp $filebase).$type"
echo "Converting $file as $outfile with type $type" >&2
squid_get_image -q -t "$type" "$file" "$outfile" < /dev/null
fi
fi
done
Save this as ~/bin/squid_scan or similar (if ~/bin is in your path) and do chmod 755 ~/bin/squid_scan
This uses some fairly rudimentary filtering, which could be improved. To use it to grab all images from mysite.com, you call: squid_scan mysite
As you can see, this calls squid_get_image to actually do the image extraction. Here’s the code:-
#!/bin/bash
set -u
usage()
{
echo "squid_get_image [-q] [-t type] infile outfile" >&2
echo >&2
}
im_type=jpg
quiet=false
while getopts ":qt:" OPTION
do
case $OPTION in
t)
im_type=$OPTARG
;;
q)
quiet=true
;;
\?)
echo "Invalid option: -$OPTARG" >&2
usage
exit 1
;;
esac
done
shift $(($OPTIND - 1))
if [[ $# != 2 ]]; then
echo "Expecting 2 parameters, got $#" >&2
usage
exit 1
fi
infile=$1
outfile=$2
$quiet || echo "Converting $infile to $outfile" >&2
case $im_type in
jpg)
bvi -f <(echo -e '\FFD8\,$w' $outfile\\nq) "$infile" &>/dev/null
;;
png)
bvi -f <(echo -e '\89504E470D0A1A0A\,$w' $outfile\\nq) "$infile" &>/dev/null
;;
gif)
bvi -f <(echo -e '/GIF8[79]a/,$w' $outfile\\nq) "$infile" &>/dev/null
;;
*)
echo "Bad Image Type." >&2
exit 1
;;
esac
NOTE: You need the bvi utility (binary vi). On Debian/Ubuntu, just run sudo apt-get install bvi. On Arch Linux it’s in the AUR.
Save this as ~/bin/squid_get_image or similar and do chmod 755 ~/bin/squid_get_image
This uses bvi’s ability to run a script of ex commands. It uses the fact that the squid cache files are some binary data, the HTTP headers, and then the file itself. The gif, jpg and png formats all start with “magic numbers”, and the ex commands look for those before writing out everything that follows to the specified file name.
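For anyone without bvi, the same "find the magic number, then write out everything after it" idea can be sketched with grep and tail. This is only a rough illustration and not the script above: the function name extract_jpeg is made up, it handles only the JPEG marker FF D8, and it assumes GNU grep (for -a, -b and -o) and a tail that accepts -c +N.

```shell
# Hypothetical helper, not part of the scripts above.
# Copies everything from the first JPEG magic number (FF D8) to the end of the file.
# Assumes GNU grep (-a, -b, -o) and a tail that accepts -c +N.
extract_jpeg() {
    infile=$1
    outfile=$2
    magic=$(printf '\377\330')    # the bytes FF D8, written in octal for portability
    # -a: treat binary data as text; -b -o: print "offset:match" for each hit.
    # Only the first hit is kept.
    offset=$(LC_ALL=C grep -abo -F -m 1 -e "$magic" "$infile" | head -n 1 | LC_ALL=C cut -d: -f1)
    [ -n "$offset" ] || return 1
    # grep offsets are 0-based, while tail -c +N counts from 1.
    tail -c "+$((offset + 1))" "$infile" > "$outfile"
}
```

Calling extract_jpeg cachefile out.jpg would then write the image part of cachefile to out.jpg. Like the bvi version, it trusts the first FF D8 it sees, so a stray match in the leading binary data would give a bad file.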
I hope this is useful to someone!

Hi John,
Various comments.
match="$0" looks wrong, do you mean $1?
It seems one way to get the HTTP headers from a squid cache file is
perl -0777pe '/HTTP.*?\r\n\r\n/s and ($_ = $&) =~ y/\r//d'
I tried that on all of mine here and all of them produce at least one line of output. It avoids the possibility that the ten lines head(1) is displaying don’t contain the pattern but later lines of the header do.
A pipeline by default has the exit status of the last command, so you can do
if perl … | grep -q "$match"; then …
Would it be better to run "file -i" to identify the type of the HTTP reply’s body? The body can be extracted with a similar perl statement to before.
perl -0777ne 's/.*?HTTP.*?\r\n\r\n//s and print' $f | file -i -
That seems less error-prone than searching for a different hex pattern depending on which kind of file the Content-Type header says it is with bvi(1).
The output filename built doesn’t seem unique enough. Two foo-1270029072.jpg could easily come along in the same second? Perhaps it should ditch the seconds for the XXX of mktemp(1) instead? Then it’s up to mktemp to find a unique filename, which should be possible if given enough Xs.
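A minimal sketch of that idea, assuming GNU mktemp (its --suffix option is not available everywhere). The function name unique_outfile and the foo- prefix are only for illustration:

```shell
# Hypothetical helper: let mktemp pick a unique name that already carries the extension.
# Assumes GNU mktemp, whose --suffix option appends text after the X's.
unique_outfile() {
    dir=$1
    ext=$2
    # mktemp creates an empty file and prints its path.
    mktemp --suffix=".$ext" "$dir/foo-XXXXXX"
}
```

Note that mktemp leaves an empty file behind under that name, so whatever writes the image afterwards simply overwrites it.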
“[ quiet ]” is testing whether “quiet” is an empty string, and it isn’t. I think you mean $quiet.
Or, as bash has true and false built-in, I tend to just use “$quiet || …” and set them to true or false.
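A small illustration of the difference, with a made-up log function. The variable holds the name of a command (true or false), and the shell runs it:

```shell
# Hypothetical example of using true/false as a flag.
quiet=false

log() {
    # $quiet expands to "true" or "false", which is then run as a command.
    # When it is false, the right-hand side of || runs and the message is printed.
    $quiet || echo "$@" >&2
}

# By contrast, [ quiet ] tests a non-empty literal string, so it always succeeds.
```

With quiet=false, log "Converting a to b" prints to stderr; after quiet=true it prints nothing.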
usage() is writing to stdout but the error messages before it are to stderr.
I’ve grovelled through squid’s cache before looking for bits, so it’s handy to find this post explaining how to do it.
Cheers, Ralph.
Thanks Ralph!
I agree with all your points, and apart from the perl I’ve made the changes you suggest. $0 was a typo created while I edited the hardcoded value I had for publishing! I knew I should have re-tested it.
I think you’re right that the perl is better. Ideally I’d like to code something (probably in python) to do general manipulation of the squid archive format, including extracting the HTTP headers and whatever binary gubbins is at the start. That way you could extract any file type, and also get its original name and so on. Sounds like a fun project for a quiet day.
Still, it was fun to learn about bvi, which looks like a useful tool.
What are you extracting the images for?
I was on a heavily capped satellite broadband link, and I’d been browsing a lot of images on Flickr. I later decided I wanted to save them all, but without incurring extra traffic. Since I knew they were all on a disk on my squid box, I thought there had to be an easy way to get them off.