# pyuca: Python Unicode Collation Algorithm implementation
(http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm/)
This is my preliminary attempt at a Python implementation of the
[Unicode Collation Algorithm (UCA)](http://unicode.org/reports/tr10/).
I originally posted it to my blog in 2006, but it seems to get enough
usage that it really belongs here (and in PyPI).
What do you use it for? In short, sorting non-English strings properly.
The core of the algorithm involves multi-level comparison. For example,
``café`` comes before ``caff`` because at the primary level, the accent
is ignored and the first word is treated as if it were ``cafe``.
The secondary level (which considers accents) then applies only to words
that are equivalent at the primary level.
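
The two-level idea can be sketched with the standard library alone. This is a toy model, not how pyuca or the real UCA computes keys: the function name ``toy_sort_key`` is made up for illustration, and it assumes accents can be separated from base letters by NFD normalization.

```python
import unicodedata


def toy_sort_key(word):
    """Build a (primary, secondary) key: base letters first, accents second.

    Toy illustration only; the real UCA takes its weights from allkeys.txt.
    """
    primary = []
    secondary = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # A combining mark: record it against the preceding base letter.
            if secondary:
                secondary[-1] += ch
        else:
            primary.append(ch)
            secondary.append("")
    return (tuple(primary), tuple(secondary))


words = ["caff", "caf\u00e9", "cafe"]  # "caf\u00e9" is café

# Plain code point order puts café after caff, because é > f.
print(sorted(words))
# With the two-level key, café sorts next to cafe and before caff.
print(sorted(words, key=toy_sort_key))
```

Because Python compares tuples element by element, the secondary part is consulted only when the primary parts are equal, which mirrors the behaviour described above.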
The Unicode Collation Algorithm and pyuca also support contraction and
expansion. **Contraction** is where multiple letters are treated as a
single unit. In traditional Spanish ordering, ``ch`` is treated as a letter
coming between ``c`` and ``d`` so that, for example, words beginning with
``ch`` sort after all other words beginning with ``c``. **Expansion** is where
a single letter is treated as though it were multiple letters. In German
phone-book ordering, ``ä`` is sorted as if it were ``ae``, i.e. after ``ad``
but before ``af``.
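
A minimal sketch of both ideas follows. It is a hand-rolled illustration rather than pyuca's implementation. The helper names ``contraction_key`` and ``expansion_key`` are hypothetical, and the sketch assumes lowercase input and covers only the two rules mentioned above.

```python
def contraction_key(word):
    """Toy contraction: treat 'ch' as one letter sorting between 'c' and 'd'."""
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            # ("c", 1) sorts after a plain "c", which is ("c", 0), and before "d".
            key.append(("c", 1))
            i += 2
        else:
            key.append((word[i], 0))
            i += 1
    return key


def expansion_key(word):
    """Toy expansion: sort 'ä' (U+00E4) as if it were written 'ae'."""
    return word.replace("\u00e4", "ae")


spanish = sorted(["chico", "cuna", "dama", "casa"], key=contraction_key)
german = sorted(["af", "\u00e4", "ad"], key=expansion_key)
print(spanish)  # 'chico' lands after 'cuna' and before 'dama'
print(german)   # 'ä' lands between 'ad' and 'af'
```

In the real algorithm these rules are not hard-coded like this. They come from entries in the collation element table, possibly tailored for a particular language.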
## Here is how to use the ``pyuca`` module:
``
git clone https://github.com/jtauber/pyuca.git
cd pyuca
pip install .
``
**Usage example:**
``
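from pyuca import Collator

# Build the collator from a local copy of allkeys.txt (see below).
c = Collator("allkeys.txt")
words = [u"caff", u"caf\u00e9", u"cafe"]
sorted_words = sorted(words, key=c.sort_key)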
``
``allkeys.txt`` (1 MB) is available at
http://www.unicode.org/Public/UCA/latest/allkeys.txt
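
To give a feel for what that file contains, here is a small sketch of parsing one entry. It assumes the usual layout of ``allkeys.txt``: hexadecimal code points, a semicolon, one or more bracketed collation elements, then a ``#`` comment. The function name ``parse_allkeys_line`` is made up, and the weights in the sample line are illustrative rather than taken from any particular release. pyuca does its own parsing internally, so this is only for orientation.

```python
import re

# One collation element looks like [.1C47.0020.0008] or [*0209.0020.0002].
ELEMENT = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")


def parse_allkeys_line(line):
    """Return (code_points, collation_elements), or None for non-entry lines."""
    line = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not line or line.startswith("@"):  # blank, comment-only, or directive
        return None
    chars, elements = line.split(";", 1)
    code_points = tuple(int(cp, 16) for cp in chars.split())
    weights = [
        tuple(int(w, 16) for w in body.split(".") if w)
        for _marker, body in ELEMENT.findall(elements)
    ]
    return code_points, weights


# Illustrative entry; the actual weights differ between UCA versions.
sample = "0041  ; [.1C47.0020.0008] # LATIN CAPITAL LETTER A"
print(parse_allkeys_line(sample))
```

An entry with several code points on the left describes a contraction, and an entry with several bracketed elements on the right describes an expansion.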