44 lines
1.6 KiB
Plaintext
44 lines
1.6 KiB
Plaintext
# pyuca: Python Unicode Collation Algorithm implementation
|
|
(http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/)
|
|
|
|
This is my preliminary attempt at a Python implementation of the
|
|
[Unicode Collation Algorithm (UCA)](http://unicode.org/reports/tr10/).
|
|
I originally posted it to my blog in 2006 but it seems to get enough
|
|
usage it really belongs here (and in PyPI).
|
|
|
|
What do you use it for? In short, sorting non-English strings properly.
|
|
|
|
The core of the algorithm involves multi-level comparison. For example,
|
|
``café`` comes before ``caff`` because at the primary level, the accent
|
|
is ignored and the first word is treated as if it were ``cafe``.
|
|
The secondary level (which considers accents) only applies then to words
|
|
that are equivalent at the primary level.
|
|
|
|
The Unicode Collation Algorithm and pyuca also support contraction and
|
|
expansion. **Contraction** is where multiple letters are treated as a
|
|
single unit. In Spanish, ``ch`` is treated as a letter coming between
|
|
``c`` and ``d`` so that, for example, words beginning ``ch`` should
|
|
sort after all other words beginnings with ``c``. **Expansion** is where
|
|
a single letter is treated as though it were multiple letters. In German,
|
|
``ä`` is sorted as if it were ``ae``, i.e. after ``ad`` but before ``af``.
|
|
|
|
## Here is how to use the ``pyuca`` module:
|
|
``
|
|
git clone https://github.com/jtauber/pyuca.git
|
|
cd pyuca
|
|
pip install pyuca
|
|
``
|
|
|
|
**Usage example:**
|
|
``
|
|
from pyuca import Collator
|
|
c = Collator("allkeys.txt")
|
|
|
|
sorted_words = sorted(words, key=c.sort_key)
|
|
``
|
|
|
|
``allkeys.txt`` (1 MB) is available at
|
|
|
|
http://www.unicode.org/Public/UCA/latest/allkeys.txt
|
|
|