How to extract images from your squid cache

When I searched for this, I found a lot of questions but no real answers. Based on a couple of hints I found online, I put together these scripts, which might be of use to someone else.

First, the script to scan the cache for images we want to extract.


#!/bin/bash

# -- Edit these as needed !
cache=/var/spool/squid
output=/home/johncc/Out/
filebase=foo-XXXX
# --

match="$1"

set -u

# Limit to files above 15k size in order to skip thumbnails where possible
find "$cache" -size +15k -type f | while read -r file
do
    if head "$file" | grep -q "$match"; then

        content_type="$(head "$file" | grep -ai "Content-Type:" | cut -f 2 -d " " | tr -d '[:cntrl:]')"
        echo "Matched $file ... content type $content_type" >&2

        type=""
        if [[ "$content_type" = 'image/jpeg' ]]; then
            type=jpg
        elif [[ "$content_type" = 'image/png' ]]; then
            type=png
        elif [[ "$content_type" = 'image/gif' ]]; then
            type=gif
        fi

        if [[ "$type" != "" ]]; then
            # mktemp -u just generates a unique name from the template
            outfile="$(mktemp -u "$output$filebase").$type"
            echo "Converting $file as $outfile with type $type" >&2
            squid_get_image -q -t "$type" "$file" "$outfile" < /dev/null
        fi

    fi
done

Save this as ~/bin/squid_scan or similar (assuming ~/bin is in your PATH) and do chmod 755 ~/bin/squid_scan

This uses some fairly rudimentary filtering, which could be improved. To use it to grab all images from mysite.com, run: squid_scan mysite
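If you're curious what the head | grep match is actually seeing, a squid cache object starts with some binary metadata, then the stored URL, then the HTTP response headers. Here's a small sketch using a simulated cache object (the file contents below are made up for illustration; real objects live under your cache directory):

```shell
# Simulate the start of a cache object: binary metadata, the stored
# URL, then the HTTP response headers (this content is made up).
printf '\x03\x00\x01http://mysite.com/pic.jpg\r\nHTTP/1.0 200 OK\r\nContent-Type: image/jpeg\r\n\r\n' > cacheobj

# strings pulls out the URL and headers that the script greps on
strings cacheobj | head -n 15
```

Running strings | head on a real cache file shows the same kind of output, which is why a simple match on the site name works.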

As you can see, this calls squid_get_image to actually do the image extraction. Here's the code:

#!/bin/bash

set -u

usage()
{
    echo "squid_get_image [-q] [-t type] infile outfile" >&2
    echo >&2
}

im_type=jpg
quiet=false

while getopts ":qt:" OPTION
do
    case $OPTION in
    t)
        im_type=$OPTARG
        ;;
    q)
        quiet=true
        ;;
    \?)
        echo "Invalid option: -$OPTARG" >&2
        usage
        exit 1
        ;;
    esac
done

shift $(($OPTIND - 1))

if [[ $# != 2 ]]; then
    echo "Expecting 2 parameters, got $#" >&2
    usage
    exit 1
fi

infile=$1
outfile=$2

$quiet || echo "Converting $infile to $outfile" >&2
case $im_type in

jpg)

    bvi -f <(echo -e '\FFD8\,$w' "$outfile"\\nq) "$infile" &>/dev/null
    ;;

png)

    bvi -f <(echo -e '\89504E470D0A1A0A\,$w' "$outfile"\\nq) "$infile" &>/dev/null
    ;;

gif)

    bvi -f <(echo -e '/GIF8[79]a/,$w' "$outfile"\\nq) "$infile" &>/dev/null
    ;;

*)

    echo "Bad Image Type." >&2
    exit 1

esac

NOTE: You need the bvi utility (binary vi). In Debian/Ubuntu just sudo apt-get install bvi. On Arch Linux it's in the AUR.

Save this as ~/bin/squid_get_image or similar and do chmod 755 ~/bin/squid_get_image

This uses bvi's ability to run a script of ex commands. It relies on the fact that squid cache files consist of some binary metadata, then the HTTP headers, and then the file itself. The gif, jpg and png formats all start with "magic numbers", and the ex commands search for those before writing out everything that follows to the specified file name.
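If you'd rather not depend on bvi, the same magic-number idea can be sketched with GNU grep and tail: find the byte offset of the magic bytes, then copy everything from that offset onwards. This is just an illustrative alternative (demonstrated here on a fabricated cache object), not the method the scripts above use:

```shell
# Fake cache object: headers followed by the JPEG magic (FF D8 FF)
# and the image data (contents made up for the demo).
printf 'HTTP/1.0 200 OK\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8\xffJPEGDATA' > cacheobj

# -a: treat binary as text, -b: print byte offset, -o: offset of the
# match itself, -m1: first match only; cut keeps just the offset.
offset=$(LC_ALL=C grep -abo -m1 $'\xff\xd8\xff' cacheobj | head -n1 | cut -d: -f1)

# tail -c +N starts output at byte N (1-based), so add 1 to the offset
tail -c "+$((offset + 1))" cacheobj > recovered.jpg
```

On a real cache object you'd point this at a file under your cache directory; like the bvi approach, it writes everything from the magic bytes to the end of the file.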

I hope this is useful to someone!


7 Responses to “How to extract images from your squid cache”


  1. Ralph Corderoy 31/03/2010 at 10:57

    Hi John,

    Various comments.

    match="$0" looks wrong, do you mean $1?

    It seems one way to get the HTTP headers from a squid cache file is

    perl -0777pe '/HTTP.*?\r\n\r\n/s and ($_ = $&) =~ y/\r//d'

    I tried that on all of mine here and all of them produce at least one line of output. It avoids the possibility that the ten lines head(1) displays don't contain the pattern but later lines of the header do.

    A pipeline by default has the exit status of the last command, so you can do

    if perl … | grep -q "$match"; then …

    Would it be better to run `file -i' to identify the type of the HTTP reply's body? The body can be extracted with a similar perl statement to before.

    perl -0777ne 's/.*?HTTP.*?\r\n\r\n//s and print' $f | file -i -

    That seems less error-prone than searching for a different hex pattern depending on which kind of file the Content-Type header says it is with bvi(1).

    The output filename built doesn't seem unique enough; two foo-1270029072.jpg could easily come along in the same second. Perhaps it should ditch the seconds in favour of the XXXs of mktemp(1) instead? Then it's up to mktemp to find a unique filename, which should be possible if given enough Xs.

    "[ quiet ]" is testing whether "quiet" is an empty string, and it isn't. I think you mean $quiet. 🙂 Or, as bash has true and false built-in, I tend to just use "$quiet || …" and set them to true or false.

    usage() is writing to stdout but the error messages before it are to stderr.

    I’ve grovelled through squid’s cache before looking for bits, so it’s handy to find this post explaining how to do it.

    Cheers, Ralph.

  2. vspike 31/03/2010 at 20:33

    Thanks Ralph!

    I agree with all your points, and apart from the perl I’ve made the changes you suggest. $0 was a typo created while I edited the hardcoded value I had for publishing! I knew I should have re-tested it.

    I think you're right that the perl is better. Ideally I'd like to code something (probably in python) to do general manipulation of the squid archive format, including extracting the HTTP headers and whatever binary gubbins is at the start. That way you could extract any file type, and also get its original name and so on. Sounds like a fun project for a quiet day.

    Still, it was fun to learn about bvi, which looks like a useful tool.

  3. incero 23/11/2010 at 06:30

    What are you extracting the images for?

    • vspike 23/11/2010 at 10:03

      I was on a heavily capped satellite broadband link, and I’d been browsing a lot of images on flickr. I later decided I wanted to save them all, but without incurring extra traffic. Since I knew they were all on a disk on my squid box, I thought there had to be an easy way to get them off 🙂

  4. Glen Lund 16/08/2012 at 15:51

    Hi John, I work at a library in Zambia and I would like to extract pdfs from our squid cache. Students and staff download pdfs for their research and study, and we would like to store copies in the library for the use of others without having to download them repeatedly on our limited internet. Not all our users are good about passing on their pdfs, so the easiest thing to do would be to recover them from the squid cache. I guess that your bash scripts could be modified to do this but I don't have the skill to do it. Could you please help or advise me how to do it? Cheers, Glen

  5. g-su 18/06/2014 at 19:10

    I can’t thank you enough for this! I’ve been searching all morning for a way to decode gzip encoded squid cache files!


