Checking a LaTeX Index

This week again, LaTeX. This time, the index. At the end of a document, you will usually find an index so that, if you don’t have a magical ctrl-f, you can find something fast in the book.

In LaTeX, creating an index is really easy. You load the makeidx package, put \makeindex in the preamble, and plant \index{topic!subtopic} tags in the text (preferably right beside the word you want to index; the \index command doesn’t understand paragraphs). You add \printindex where the index should appear, run pdflatex (or just latex), then the makeindex program, then pdflatex again to get the index generated. That’s all fine, except that nothing is checked. It adds to the index whatever you typed, and doesn’t warn you if you have an entry “compression” and an entry “compresion” (because, you know, typos happen). Let’s see how we can somewhat fix that.

The first thing is to find out how the index is generated, or, at least, what part of the generated data can be used. If you’re using makeindex, you’ll find that your project is enriched by one file, yourproject.idx, which holds all the tags you’ve added with \index{topic!subtopic}. Indeed, if you inspect it, you’ll find it contains something like this:

\indexentry{Maths!103|hyperpage}{1}
\indexentry{Calcul!Diff\IeC {\'e}rentiel|hyperpage}{1}
\indexentry{Maths!203|hyperpage}{1}
\indexentry{Calcul!Int\IeC {\'e}gral|hyperpage}{1}
\indexentry{Maths!105|hyperpage}{1}
\indexentry{Alg\IeC {\`e}bre!Lin\IeC {\'e}aire|hyperpage}{1}
\indexentry{Parenth\IeC {\`e}ses|(hyperpage}{2}
\indexentry{Crochets|(hyperpage}{2}


You see that hyperpage is inserted by the hyperref package, which makes the document navigable in a PDF viewer. Let’s simplify the information. The \indexentry part is only useful for the actual index generation, not so much for human readability. We can chop it off. The page numbers are also irrelevant, as they are generated automatically. If we’re using hyperref, everything after the | is also more or less irrelevant, so we might want to cut it too.

We can do this with a string of shell commands, and, why not, put them in the Makefile:

checkindex:
	cut -b 13- < $(source).idx \
	| sed 's/}{[0-9]*}//g' \
	| cut -d\| -f 1 \
	| sort --ignore-case --unique \
	| tee $(source).unique.idx


The first part, cut, chops off the leading \indexentry{ (12 bytes, hence -b 13-), and the sed part eliminates the page numbers. Page ranges, such as 2–5, are generated later on: they do not appear in the .idx file, so we needn’t worry about them. The second cut may be optional (if you want to keep the “see” cross-references, you may want to put something else there, maybe another sed that unwraps the references?). The sort command sorts ignoring case and eliminates duplicated entries. The tee just serves to write to the file and the console at the same time.
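
Outside of a Makefile, the same pipeline can be tried directly in a shell. What follows is only a sketch for experimenting: the file name sample.idx and the function name clean_index are made up for the example, and the three entries are modelled on the listing above (one of them duplicated on a different page).

```shell
#!/bin/sh
# Hypothetical sample file, with entries like those shown earlier.
cat > sample.idx <<'EOF'
\indexentry{Maths!103|hyperpage}{1}
\indexentry{Crochets|(hyperpage}{2}
\indexentry{Maths!103|hyperpage}{4}
EOF

# clean_index FILE: print the unique index keys found in FILE, one per line.
clean_index() {
    # 1. drop the 12-byte prefix \indexentry{
    # 2. drop the closing brace and the {page} group
    # 3. drop everything from the first | on (hyperpage and friends)
    # 4. sort ignoring case, keeping one copy of each key
    #    (-f -u are the short forms of --ignore-case --unique)
    cut -b 13- < "$1" |
        sed 's/}{[0-9]*}//g' |
        cut -d'|' -f 1 |
        sort -f -u
}

clean_index sample.idx
```

The two Maths!103 lines collapse into one because their only difference, the page number, is removed before the sort.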

The output looks like this:

~@Symboles!$\square$
~@Symboles!$\subset$
~@Symboles!$\subseteq$
~@Symboles!$\to$
~@Symboles!$\varnothing$
~@Symboles!$\vee$
~@Symboles!$\wedge$
~@Symboles!$[x]$
Syst\IeC {\`e}me!Binaire
Tailles!Authentifiantes
Tailles!Romaines
tally sticks@\textit  {Tally sticks}
Texte!Repr\IeC {\'e}sentation du
Th\IeC {\'e}or\IeC {\`e}me!de Pythagore
Th\IeC {\'e}orie!De l'information
Th\IeC {\'e}orie!Du Codage


This allows us, with some patience, to spot problem entries:

Nombres!Premiers
Nombres!Rationels
Nombres!Rationnels
Nombres!R\IeC {\'e}els


The only thing left to do is to go back to the LaTeX source and fix the defective \index{} entry.
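
To locate the offending tag, grep can search the sources for the misspelled key. A minimal sketch, assuming the sources are .tex files; the file chapter.tex created here is a stand-in invented for the example, and in practice you would point grep at your own files (for instance *.tex).

```shell
#!/bin/sh
# Stand-in source file, for illustration only.
cat > chapter.tex <<'EOF'
Les nombres rationnels\index{Nombres!Rationels} sont des fractions.
Les nombres premiers\index{Nombres!Premiers} sont partout.
EOF

# -n prints the line number of each match.
# -F treats the pattern as a fixed string, which is safer because index
# keys often contain characters that are special in regular expressions.
grep -n -F 'Nombres!Rationels' chapter.tex
```

Keys with accented letters show up in the .idx file in their \IeC {...} form, which is probably not how they were typed in the source, so for those it is easier to search for an unaccented fragment of the key.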

*
* *

I know it is not much of an improvement, and it still calls for human intervention that requires a lot of attention. Maybe we could output this into some other program that would compute, say,