```
mailbox(text("Test München West", charsets::UTF_8), "a@b.de").generate();
```
produces
```
=?us-ascii?Q?Test_?= =?utf-8?Q?M=C3=BCnchen?= =?us-ascii?Q?West?= <a@b.de>
```
The first space between ``Test`` and ``München`` is encoded as an
underscore along with the first word: ``Test_``. The second space
between ``München`` and ``West`` is encoded with neither of the two
words and thus lost. Decoding the text results in ``Test
MünchenWest`` instead of ``Test München West``.
This is caused by how ``vmime::text::createFromString()`` handles
transitions between 7-bit and 8-bit words: if an 8-bit word follows a
7-bit word, a space is appended to the previous word. The opposite
case, a 7-bit word following an 8-bit word, *lacks* this handling.
When one fixes this problem, a follow-up issue appears:
``text::createFromString("a b\xFFc d")`` tokenizes the input into
``m_words={word("a "), word("b\xFFc ", utf8), word("d")}``. This
"right-side alignment" of the whitespace is a problem for
``word::generate()``:
As per RFC 2047, spaces between adjacent encoded words are just
separators but not meant to be displayed. A space between an encoded
word and a regular ASCII text is not just a separator but also meant
to be displayed.
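The display rule above can be sketched in a few lines. This is only an illustration, not vmime code: ``Token`` and ``joinForDisplay`` are hypothetical names, the tokens are assumed to be already decoded, and each pair of neighbours is assumed to have been separated by exactly one space on the wire.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one header token after decoding.
struct Token {
    std::string text;  // decoded content
    bool encoded;      // true if it arrived as an RFC 2047 encoded-word
};

// Rebuild the display string from tokens that were separated by a
// single space on the wire. The space is dropped only when both
// neighbours are encoded-words; otherwise it is kept for display.
std::string joinForDisplay(const std::vector<Token>& tokens) {
    std::string out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i > 0 && !(tokens[i - 1].encoded && tokens[i].encoded)) {
            out += ' ';
        }
        out += tokens[i].text;
    }
    return out;
}
```

Feeding it three encoded tokens ``"Test "``, ``"München"`` and ``"West"`` reproduces the lost space, while marking ``"West"`` as unencoded keeps it.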
When ``word::generate()`` outputs the b-word, it would have to strip
one space, but only when there is a transition from an encoded word
to an unencoded word. ``word::generate()`` does not know whether
``d`` will be encoded or unencoded.
The idea is to change the tokenization of
``text::createFromString`` such that whitespace is at the *start* of
words rather than at the end. With that, ``word::generate()`` need not
know anything about the next word, but only about the *previous*
one.
Thus, in this patch,
1. The tokenization of ``text::createFromString`` is changed to
left-align spaces and the function is fixed to account for
the missing space on transition.
2. ``word::generate`` learns how to steal a space character.
3. Testcases are adjusted to account for the shifted
position of the space.
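The left-aligned tokenization from item 1 can be illustrated with a minimal sketch. It is not the code from this patch: ``tokenizeLeftAligned`` is a hypothetical helper that splits on spaces only and ignores the 7-bit/8-bit grouping and charset handling that ``text::createFromString`` performs.

```cpp
#include <string>
#include <vector>

// Split `in` into words so that every word except the first starts
// with the whitespace that preceded it (left-aligned spaces).
std::vector<std::string> tokenizeLeftAligned(const std::string& in) {
    std::vector<std::string> words;
    std::string cur;
    bool inText = false;  // true once `cur` holds a non-space character
    for (char c : in) {
        if (c == ' ' && inText) {
            // A space after text starts the next word.
            words.push_back(cur);
            cur.clear();
            inText = false;
        }
        cur += c;
        if (c != ' ') {
            inText = true;
        }
    }
    if (!cur.empty()) {
        words.push_back(cur);
    }
    return words;
}
```

With this layout, concatenating the words yields the input again, and a generator only has to look at the previous word to decide what to do with a leading space.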
Fixes: #283, #284
Co-authored-by: Vincent Richard <vincent@vincent-richard.net>
Replace the byte sequence with one that is similarly random, but
which happens to decode as valid UTF-8, such that the expected and
actual strings are shown with reasonable characters on a terminal.
* tests: add case for getRecommendedEncoding
* vmime: avoid integer multiply wraparound in wordEncoder::guessBestEncoding
If the input string is 42949673 characters or longer, multiplying
its length by 100 overflows on 32-bit platforms.
Switch that one computation to floating point.
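The wraparound is easy to reproduce in isolation. The following sketch uses ``std::uint32_t`` to stand in for a 32-bit ``size_t``; ``percentInt`` and ``percentFloat`` are hypothetical helpers and do not mirror the actual expression in ``wordEncoder::guessBestEncoding``.

```cpp
#include <cstdint>

// Integer version: count * 100 wraps modulo 2^32 once
// count >= 42949673 (42949673 * 100 = 4294967300 > 4294967295).
std::uint32_t percentInt(std::uint32_t count, std::uint32_t total) {
    return count * 100u / total;
}

// Floating-point version: the product is computed as a double,
// which represents these magnitudes exactly.
double percentFloat(std::uint32_t count, std::uint32_t total) {
    return 100.0 * static_cast<double>(count) / static_cast<double>(total);
}
```

For ``count == total == 42949673`` the integer version yields 0 instead of 100, while the floating-point version yields 100.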
* vmime: update comment in wordEncoder::guessBestEncoding
* vmime: avoid changing SEVEN_BIT when encoding::decideImpl sees U+007F
Do not switch to QP/B64 when encountering U+007F.
U+007F is part of ASCII just as much as U+0001 is.
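As an illustration of the point about U+007F, a 7-bit check only needs to look at the high bit of each byte. ``isSevenBit`` is a hypothetical helper, not the logic of ``encoding::decideImpl``, which takes more into account than this sketch.

```cpp
#include <string>

// True if every byte fits in 7 bits. DEL (0x7F) passes, just like
// any other ASCII control character such as 0x01.
bool isSevenBit(const std::string& s) {
    for (unsigned char c : s) {
        if (c & 0x80) {
            return false;
        }
    }
    return true;
}
```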
---------
Co-authored-by: Vincent Richard <vincent@vincent-richard.net>
When the display name contains an at sign, or anything of the sort,
libvmime would forcibly encode it as =?...?=, even if the name
is plain ASCII that only needs quoting.
rspamd takes excessive encoding as a sign of spam and penalizes
such mails by raising the score (rule/match: TO_EXCESS_QP et al.)
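The distinction between quoting and encoding can be sketched as follows. This is an assumption-laden illustration, not libvmime's decision logic: ``NameForm`` and ``chooseNameForm`` are hypothetical, the special characters are the RFC 5322 ``specials``, and control characters are ignored for brevity.

```cpp
#include <string>

enum class NameForm { Plain, Quoted, Encoded };

// Pick the least intrusive representation for a display name:
// non-ASCII bytes require an encoded-word, ASCII specials only
// require a quoted-string, anything else can be emitted as is.
NameForm chooseNameForm(const std::string& name) {
    static const std::string specials = "()<>[]:;@\\,.\"";
    bool needsQuoting = false;
    for (unsigned char c : name) {
        if (c >= 0x80) {
            return NameForm::Encoded;
        }
        if (specials.find(static_cast<char>(c)) != std::string::npos) {
            needsQuoting = true;
        }
    }
    return needsQuoting ? NameForm::Quoted : NameForm::Plain;
}
```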
There is crap software out there that generates mails violating the
prefix ban clause from RFC 2046 §5.1 ¶2.
Switch vmime from a prefix match to an equality match, similar to
what Alpine and Thunderbird do.
The way I read the RFC is that whitespace is not allowed before the
boundary marker, only afterwards, so the checks for leading WS are
removed, and the missing check for trailing WS is added.
See RFC 2046 §5.1.1: """The boundary delimiter line is then defined
as a line consisting entirely of two hyphen characters ("-", decimal
value 45) followed by the boundary parameter value from the
Content-Type header field, optional linear whitespace, and a
terminating CRLF."""
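The quoted rule translates into a check like the one below. It is a minimal sketch, not the vmime implementation: ``isBoundaryLine`` is a hypothetical helper, the line is assumed to have its CRLF already stripped, and the closing delimiter (``--boundary--``) is deliberately not handled.

```cpp
#include <cstddef>
#include <string>

// True if `line` is "--" + boundary followed only by optional
// linear whitespace. Leading whitespace and longer boundary
// strings (prefix matches) are rejected.
bool isBoundaryLine(const std::string& line, const std::string& boundary) {
    const std::string delim = "--" + boundary;
    if (line.compare(0, delim.size(), delim) != 0) {
        return false;
    }
    for (std::size_t i = delim.size(); i < line.size(); ++i) {
        if (line[i] != ' ' && line[i] != '\t') {
            return false;
        }
    }
    return true;
}
```

A line such as ``--abcdef`` therefore no longer matches the boundary ``abc``, whereas ``--abc`` with trailing spaces or tabs still does.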
Spammers use "Name <addr> <addr>" to trick some parsers.
My expectations as to what the outcome should be are presented
in the updated mailboxTest.cpp.
The DFA in mailbox::parseImpl is hereby redone so as to pick the
rightmost address-looking portion as the address, rather than
something in between. While doing so, it will also no longer mangle
the name part (it does this by keeping an "as_if_name"
variable around until the end).
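The "rightmost address wins" idea can be shown with a much simpler stand-in than the DFA. The sketch below is not ``mailbox::parseImpl``: ``ParsedMailbox`` and ``splitRightmostAddress`` are hypothetical, only angle-bracket addresses are recognized, and the exact expected outputs are the ones in mailboxTest.cpp, which are not reproduced here.

```cpp
#include <cstddef>
#include <string>

struct ParsedMailbox {
    std::string name;
    std::string address;
};

// Take the last <...> group as the address and keep everything
// before it, unmodified apart from trailing spaces, as the name.
ParsedMailbox splitRightmostAddress(const std::string& in) {
    ParsedMailbox r;
    const std::size_t close = in.rfind('>');
    const std::size_t open =
        (close == std::string::npos) ? std::string::npos : in.rfind('<', close);
    if (open == std::string::npos) {
        // Simplification: without angle brackets, treat the input as a bare address.
        r.address = in;
        return r;
    }
    r.address = in.substr(open + 1, close - open - 1);
    r.name = in.substr(0, open);
    while (!r.name.empty() && r.name.back() == ' ') {
        r.name.pop_back();
    }
    return r;
}
```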
Added test for UTF-7 encoding availability. Added test for input buffer
underflow in charsetFilteredOutputStream. Refactored charset conversion
tests and removed useless tests.