# pyuca: Python Unicode Collation Algorithm implementation
(http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/)

This is my preliminary attempt at a Python implementation of the
[Unicode Collation Algorithm (UCA)](http://unicode.org/reports/tr10/).
I originally posted it to my blog in 2006, but it seems to get enough
usage that it really belongs here (and in PyPI).

What do you use it for? In short, sorting non-English strings properly.

The core of the algorithm involves multi-level comparison. For example,
``café`` comes before ``caff`` because at the primary level, the accent
is ignored and the first word is treated as if it were ``cafe``.
The secondary level (which considers accents) then applies only to words
that are equivalent at the primary level.

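The two-level idea can be sketched with the standard library alone. This is a toy illustration, not how pyuca builds its keys: ``toy_sort_key`` is a hypothetical helper that strips accents for the primary level and keeps them as a tie-breaker for the secondary level.

```python
import unicodedata

def toy_sort_key(word):
    """Two-level key: base letters first, accents only as a tie-breaker."""
    primary = []
    secondary = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Secondary level: attach the accent to the preceding base letter.
            if secondary:
                secondary[-1] += ch
        else:
            # Primary level: the base letter with no accent.
            primary.append(ch)
            secondary.append("")
    return ("".join(primary), tuple(secondary))

words = ["caff", "caf\u00e9", "cafe"]  # "caf\u00e9" is café

by_code_point = sorted(words)               # ['cafe', 'caff', 'café']
by_levels = sorted(words, key=toy_sort_key) # ['cafe', 'café', 'caff']
```

Sorting by code point puts ``café`` last because ``é`` has a higher code point than ``f``; the two-level key compares ``cafe`` against ``caff`` first and consults the accent only to separate ``cafe`` from ``café``.
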
The Unicode Collation Algorithm and pyuca also support contraction and
expansion. **Contraction** is where multiple letters are treated as a
single unit. In traditional Spanish ordering, ``ch`` is treated as a
letter coming between ``c`` and ``d`` so that, for example, words
beginning with ``ch`` sort after all other words beginning with ``c``.
**Expansion** is where a single letter is treated as though it were
multiple letters. In German phone-book ordering, ``ä`` is sorted as if
it were ``ae``, i.e. after ``ad`` but before ``af``.

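Both ideas can be sketched with a toy key function. Everything here is an assumption made for illustration: ``toy_key`` and its ``contractions`` and ``expansions`` parameters are hypothetical, the weights are invented rather than read from ``allkeys.txt``, and none of it is pyuca's API.

```python
def toy_key(word, contractions=(), expansions=None):
    """Build a sort key from (letter, rank) pairs. Hypothetical helper."""
    expansions = expansions or {}
    # Try longer contractions first so the longest match wins.
    ordered = sorted(contractions, key=len, reverse=True)
    key = []
    i = 0
    while i < len(word):
        for seq in ordered:
            if word.startswith(seq, i):
                # Contraction: one unit ranked after the plain first
                # letter and before the next letter of the alphabet.
                key.append((seq[0], 1))
                i += len(seq)
                break
        else:
            # Expansion: one letter contributes several units.
            for ch in expansions.get(word[i], word[i]):
                key.append((ch, 0))
            i += 1
    return key

spanish = ["chico", "dama", "cuna", "cosa"]
spanish_sorted = sorted(spanish, key=lambda w: toy_key(w, contractions=["ch"]))
# ['cosa', 'cuna', 'chico', 'dama']

german = ["af", "\u00e4", "ad"]  # "\u00e4" is precomposed ä
german_sorted = sorted(german, key=lambda w: toy_key(w, expansions={"\u00e4": "ae"}))
# ['ad', 'ä', 'af']
```
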
## Here is how to use the ``pyuca`` module:

```
git clone https://github.com/jtauber/pyuca.git
cd pyuca
pip install pyuca
```

**Usage example:**

```
from pyuca import Collator

# Load the collation table from a local copy of allkeys.txt
c = Collator("allkeys.txt")

words = ["caff", "café", "cafe"]  # sample input
sorted_words = sorted(words, key=c.sort_key)
```

``allkeys.txt`` (1 MB) is available at

http://www.unicode.org/Public/UCA/latest/allkeys.txt

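If no local copy exists yet, the file can be fetched once and reused. A minimal sketch using only the standard library; ``ensure_allkeys`` is a hypothetical helper name, not part of pyuca, and the URL is the one given above.

```python
import os
import urllib.request

ALLKEYS_URL = "http://www.unicode.org/Public/UCA/latest/allkeys.txt"

def ensure_allkeys(path="allkeys.txt", url=ALLKEYS_URL):
    """Download the table to `path` unless something already exists there."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

The returned path can then be passed to ``Collator``, as in the usage example.
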