adelton

Czech strxfrm and strcoll implementing four-pass collation

I wrote an implementation of the function strxfrm that converts Czech (ISO-8859-2) text to a sequence that can be compared using strcmp. The conversion is defined so that it matches the Czech standard (ČSN 97 6030), and its interpretation by Petr Olšák, as closely as possible. If you have a problem with the result, for example you are not happy that numbers are ordered only after the letters z and ž, read the standard first. The same algorithm is used in the czech character set in MySQL and in the Cz::Sort module.
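As a rough illustration of the strxfrm-then-strcmp pattern described above, here is a minimal sketch that uses only the standard C interface. The helper name compare_via_strxfrm is hypothetical and is not part of csort.c. The sketch also assumes that the strxfrm in use follows the standard convention of returning the required length when called with a size of 0.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of csort.c): compare two strings by
   transforming both with strxfrm and comparing the results with strcmp.
   Returns <0, 0 or >0, like strcmp. */
int compare_via_strxfrm(const char *a, const char *b)
{
    /* With size 0, strxfrm only reports the length of the transformed
       string (without the terminating NUL), so add 1 for the buffer. */
    size_t na = strxfrm(NULL, a, 0) + 1;
    size_t nb = strxfrm(NULL, b, 0) + 1;
    char *ka = malloc(na);
    char *kb = malloc(nb);
    int result;

    if (ka == NULL || kb == NULL) {
        /* Out of memory: fall back to strcoll, which needs no buffers. */
        result = strcoll(a, b);
    } else {
        strxfrm(ka, a, na);
        strxfrm(kb, b, nb);
        result = strcmp(ka, kb);
    }
    free(ka);
    free(kb);
    return result;
}
```

Transforming is mostly worthwhile when the same strings are compared many times, for example during a sort: each string is converted once and the cheap strcmp is used for every comparison after that.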

The file also contains the function strcoll, which compares two strings in constant memory, without the need to convert them first. The file is compiled using

cc -c -o csort.o csort.c

and using

ld -shared -o csort.so csort.o

we turn it into a shared library. We can then use it, for example, by setting the environment variable

export LD_PRELOAD=/path/to/csort.so

which ensures that the Czech strxfrm and strcoll are used instead of the default ones.
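To show what a program that benefits from the preloaded library might look like, here is a small sketch that sorts an array of strings with qsort and strcoll. The names by_collation and sort_words are hypothetical. Nothing in the code is specific to csort.so: it simply calls whichever strcoll the dynamic linker resolves, so under the assumption that LD_PRELOAD is set as above, the Czech version would be picked up.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of char pointers; defers to strcoll. */
static int by_collation(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Hypothetical helper: sort n strings in place using the active strcoll. */
void sort_words(const char **words, size_t n)
{
    qsort(words, n, sizeof words[0], by_collation);
}
```

Calling sort_words on an array of ISO-8859-2 encoded words and printing the result is enough to compare the ordering with and without the library preloaded.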

Also included are the definition table used to generate the code of the functions, the Perl script that converts it to C source, and examples of correct Czech ordering.