Search all your BibTeX files

When I write papers or other things, I tend to create separate bib files, so that I don’t end up with a giant, unsearchable, and unmaintainable blob. Moreover, topics tend to be transient, and the bibliography may or may not be interesting in a few years’ time, so, if unused, it can safely sleep in a directory with the paper it’s attached to.


But once in a while, I need one of those old references, and since they’re scattered just about everywhere… it may take a while to find them again. Unless you have a script. Scripts are nice.

So basically, a script to find references again should be able to enumerate all the .bib files and search them, at least for author names and titles. A simple grep doesn’t quite cut it, because it prints only the matching lines, while BibTeX files are structured in multi-line entries, in a format that may be reminiscent of JSON:

@book{bell-PH-1990,
	author = "Timothy C. Bell and John G. Cleary and Ian H. Witten",
	publisher = "Prentice Hall",
	title = "{Text Compression}",
	year = "1990"
}
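To see why a plain grep falls short, here is a minimal sketch that runs it on the entry above. The temporary file and the variable names are only for illustration.

```shell
# Write the sample entry to a temporary file (path chosen by mktemp).
bibfile=$(mktemp)
cat > "$bibfile" <<'EOF'
@book{bell-PH-1990,
	author = "Timothy C. Bell and John G. Cleary and Ian H. Witten",
	publisher = "Prentice Hall",
	title = "{Text Compression}",
	year = "1990"
}
EOF

# grep prints only the line that matches, not the entry around it:
# the key, the title and the year are all lost.
grep_hit=$(grep -i 'witten' "$bibfile")
printf '%s\n' "$grep_hit"

rm -f "$bibfile"
```

Only the author line comes out, which is not enough to cite anything.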

So to be useful, the script should at least print the whole block. Fortunately, there’s a tool for that, bib2bib. Unfortunately, it’s difficult to use, so wrapping it in a script is tricky. In particular, it tends to output more than you want: it exports comments, strings, preambles, and other cr*p you don’t necessarily want. Some options, like --quiet or --no-comment, have no effect. Messages are printed to either stdout or stderr, and bib2bib returns 0 even if it terminated with an error. Some grep and sed magic will be needed.

*
* *

Since bib2bib isn’t really meant to be used in a script, I had to call it twice. The first call only checks the output, standard error included, for error messages; remember, it doesn’t return an error status, it always returns “success”. If no error is printed, the second call produces the output, which I then filter to remove all the extraneous stuff.
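The check boils down to scanning captured output for diagnostic phrases, since the exit status can't be trusted. A minimal sketch, assuming the two phrases the script greps for ("no matching" and "parse error"). The helper name and the sample message are made up for illustration and are not actual bib2bib output.

```shell
# Hypothetical helper: reads captured output on stdin and succeeds
# (returns 0) when it contains one of the diagnostic phrases.
looks_like_failure() {
    grep -a -i -q -e "no matching" -e "parse error"
}

# Made-up message standing in for whatever bib2bib prints.
sample_output='Warning: no matching reference found'

if printf '%s\n' "$sample_output" | looks_like_failure
then
    echo "skip this file"
else
    echo "print the matches"
fi
```

The same test, inverted, decides whether the second bib2bib call is worth running.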

#!/usr/bin/env bash

# Usage: find-bib.sh <pattern>
# The pattern is matched against the author and title fields.
query='author:"'"$1"'" or title:"'"$1"'"'

# Quote the glob so the shell passes it to locate unexpanded.
locate '*.bib' |
    (
        while IFS= read -r filename
        do
            # hack because bib2bib still returns 0 (success) even if no
            # match: look for its diagnostics instead.
            #
            # grep with -a forces interpretation as "ascii" since some
            # encodings (ex. Windows, iso-latin1) may be detected as
            # "binary" (and grep whines).
            nul=$(bib2bib -c "$query" \
                          < "$filename" 2>&1 \
                      | grep -a -i -e "no matching" \
                             -e "parse error" )

            if [ -z "$nul" ]
            then
                echo "---- $filename"
                # hacky seds because --no-comment has no effect. It also
                # exports strings, preambles, etc.
                #
                bib2bib \
                    -c "$query" \
                    < "$filename" \
                    | sed '/comment\|string\|preamble/{:1;N;s/{.*}//;T1}' \
                    | grep -a -v '^@comment\|^@string\|^@preamble' \
                    | sed '/^$/N;/^\n$/D'
            fi
        done
    ) 2> /dev/null

Some of the dark sed magic comes from here. The first sed removes the contents of nested {curly {braces}} in comment, string, and preamble blocks. The second sed, at the end of the pipeline, compresses multiple empty lines into a single empty line. Invoked in a shell, the script produces the following output:
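The blank-line squeezing can be tried in isolation. A small sketch, assuming GNU sed (the script already relies on GNU extensions such as T and \|); the wrapper name squeeze_blank_lines is hypothetical.

```shell
# Hypothetical wrapper around the last sed of the pipeline.
# On an empty line, N appends the next line; if that one is empty too,
# D drops the first of the pair and restarts on what is left.
squeeze_blank_lines() {
    sed '/^$/N;/^\n$/D'
}

# Three empty lines between "first" and "second" collapse into one.
printf 'first\n\n\n\nsecond\n' | squeeze_blank_lines
```

A single empty line passes through unchanged, so entries stay separated in the output.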

> find-bib.sh huffman
---- /home/steven/somewhere/part-ii.bib

@article{capocelli-TIT-1986,
  author = {R. M. Capocelli and R. Giancarlo and I. J. Taneja},
  journal = {IEEE Trans. Information Theory},
  month = nov,
  number = {6},
  pages = {854--857},
  title = {{Bounds on the Redundancy of Huffman Codes}},
  volume = {32},
  year = {1986}
}

*
* *

The script isn’t bulletproof. For one thing, the sed regexp doesn’t quite deal with stuff like this:

@Preamble{"\input bibnames.sty "
# "\input path.sty "
# "\ifx \k \undefined \let \k = \c
   \immediate\write16{Ogonek accent unavailable: replaced by cedilla}\fi "
# "\ifx \undefined \FEATPOST \def \FEATPOST {{\manfnt FEAT}\-{\manfnt POST}\spacefactor1000 }\fi"
# "\ifx \undefined \MP \def \MP {{\manfnt META}\-{\manfnt POST}\spacefactor1000 } \fi"
# "\ifx \undefined \Xy \def \Xy {{\sc Xy}} \fi"
# "\ifx \undefined \manfnt \font\manfnt=logo10 \fi"
# "\ifx \undefined \pdfTeX \def \pdfTeX {pdf\TeX}\fi"
# "\def \toenglish #1\endtoenglish{[{\em English:} #1\unskip]} "
# "\hyphenation{
                An-wen-der-ver-ei-ni-gung
                Bie-mes-der-fer
                Co-lo-phon
                Deutsch-spra-chi-ge
                Ge-leit-wort
                Hol-dys
                Katz-en-beiss-er
                Ko-lo-dziej-ska
                la-da-mi
                Lar-ra-bee
                Manu-scripts
                mark-up
                Rijks-uni-ver-si-teit
                South-all
                Stutt-gart
}"
}

I have no idea why this kind of stuff is necessary in a bib file. Still, the script will strip everything except the hyphenation list. ¯\_(ツ)_/¯
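One possible way around this, which is not what the script does, would be to count braces instead of relying on a regexp. The sketch below swaps the sed and grep filtering for a small awk program. The function name strip_meta_blocks is hypothetical, and it assumes blocks are delimited by balanced braces (not parentheses) and that each @comment, @string, or @preamble starts at the beginning of a line.

```shell
# Hypothetical filter: drops @comment, @string and @preamble blocks by
# tracking brace depth, and prints every other line unchanged.
strip_meta_blocks() {
    awk '
    # Net depth change for one line: number of "{" minus number of "}".
    function brace_delta(s,   opens, closes, t) {
        t = s; opens  = gsub(/[{]/, "", t)
        t = s; closes = gsub(/[}]/, "", t)
        return opens - closes
    }
    # Inside a block to drop: keep counting until the depth is back to 0.
    skipping {
        depth += brace_delta($0)
        if (depth <= 0) skipping = 0
        next
    }
    # Start of a block to drop (case-insensitive match).
    tolower($0) ~ /^@(comment|string|preamble)/ {
        depth = brace_delta($0)
        if (depth > 0) skipping = 1
        next
    }
    { print }
    '
}

# Usage: a multi-line preamble with nested braces, then a real entry.
printf '%s\n' \
    '@Preamble{"\input bibnames.sty "' \
    '# "\hyphenation{' \
    '                Co-lo-phon' \
    '}"' \
    '}' \
    '@book{x,' \
    '  year = {1990}' \
    '}' | strip_meta_blocks
```

Only the @book entry should survive, hyphenation list included in what gets dropped. Braces escaped inside strings or unbalanced on purpose would still confuse it.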
