Getting Clip-Art

Interoperability of software is still a major issue. Not only do closed systems not play well with others; open systems sometimes (often, even) have the same problems exchanging information. One that only plays well with others when forced to is, of course, our good friend Microsoft. Sometimes they pretend to play well, and other software developers must reverse-engineer the file formats to read and write data in a compatible format.

One minor annoyance is Microsoft Office’s clip-art bundle file format, which is not supported (at the time of writing, anyway) by Open Office. This means that you can download clip-art for your presentation only to discover that it is perfectly useless. Or, you can take 10 minutes and look at what the bundles actually contain!

First, you go here and download a few pieces of clip-art. You get a file, ClipArt.mpf. Using any editor (Firefox, with its XML mode, is a good choice), you look at its contents. The first few lines look like:

<?xml version="1.0"?>
<D:multistatus xmlns:D="DAV:" xmlns:C="urn:schemas-microsoft-com:office:clipgallery" xmlns:dt="urn:schemas-microsoft-com:datatypes">
			<D:status>HTTP/1.1 200 OK</D:status>
				<C:subject>CD drives,CD-ROM drives,CDs,computers,hands,males,men,people,photographs,technology</C:subject>
				<C:xsubject>Cd,CD Drive,CD-ROM,CD-ROM drive,cd-roms,compact disc,compact discs,compact disk,compact disks,computer,hand,male,man,j0406815</C:xsubject>
				<C:collections>/Downloaded Clips,/Downloaded Clips/People,/Downloaded Clips/Technology</C:collections>
				<D:getlastmodified>08/08/2006 00:00:00</D:getlastmodified>
					<C:contents dt:dt="bin.base64">

So clearly the format is merely an XML file. It contains some meta-information about the clip-art (such as categories and keywords) but the image itself is very conveniently encoded as a Base64 representation of the original image file.
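To make the encoding concrete, here is a minimal round trip with the base64 tool (the one from GNU coreutils is assumed). The sample string is an arbitrary stand-in for the image bytes; nothing here comes from the actual bundle.

```shell
# Encode a short sample string, then decode it back.
# 'clip-art' is only a stand-in for real image data.
encoded=$(printf 'clip-art' | base64)
echo "$encoded"

# -d reverses the encoding, which is all an extractor has to do
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```

Decoding the contents of the C:contents element in the same way should give back the original image file, byte for byte.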

A Bash script extracting the file contents is quick to write once you’ve decided what exactly you want to extract. I only use the file name and the image file. Using Bash’s =~ regexp comparison, the script is quite straightforward:


#!/bin/bash

while read -r line
do
    # check for filepath
    if [[ "$line" =~ filepath ]]
    then
        # strip the XML tags, keeping only the file name
        name=$( echo "$line" | sed -e 's/<[^>]*>//g' )
        echo "$name"

        # hunt for <C:contents ... >, giving up at end of file
        until [[ "$line" =~ C:contents ]]
        do
            read -r line || break
        done
        read -r data

        # extract and decode the file
        echo "$data" | base64 -d > "$name"
    fi
done < "$1"
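As an aside, the tag-stripping step could also be done without sed, since =~ fills the BASH_REMATCH array with whatever the parenthesized groups matched. The snippet below is only a sketch: the sample line is made up, and the real element in a bundle may carry a namespace prefix or attributes.

```shell
# Made-up sample line; the real element may look different.
line='<filepath>example.jpg</filepath>'

# Keeping the regexp in a variable avoids quoting trouble with < and >.
re='>([^<]*)<'
if [[ "$line" =~ $re ]]
then
    # group 1 holds the text between the first pair of tags
    name=${BASH_REMATCH[1]}
fi
echo "$name"
```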

Of course this is only a basic script. You could extract the meta-data and do something with it, like adding it in a JFIF comment tag in the file (if it’s a JPEG). Running the script in a shell would yield:

> extract-mpf ClipArt.mpf 

And so the image files are extracted to the current directory.
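The meta-data could be pulled out in much the same way. As a rough sketch, the function below prints the keywords found in the C:subject elements, one per line. The name mpf_keywords is my own invention, and the function assumes that each element sits on a single line, as in the excerpt shown earlier. Writing the keywords into a JFIF comment is left out here.

```shell
# Print the keywords of every <C:subject> element, one per line.
# mpf_keywords is a hypothetical helper name, not part of any tool.
# Assumes each element sits on a single line of the bundle.
mpf_keywords() {
    sed -n 's/.*<C:subject>\(.*\)<\/C:subject>.*/\1/p' "$1" | tr ',' '\n'
}
```

Calling it as mpf_keywords ClipArt.mpf would list the keywords of all the clips in the bundle, ready to be piped through sort or grep.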

* *

I am well aware that we could do a lot more than merely extracting images from the clip-art bundle. Having a nice, plays-well-with-others, open source clip-art manager, with federated clip-art repositories, keyword searchable, etc., etc., would be quite nice. However, sometimes a minor annoyance needs only a minor fix.
