Commit 11a337f1 by Jason Rhinelander Committed by Wenzel Jakob

Unicode fixes and docs (#624)

* Propagate unicode conversion failure

If returning a std::string with invalid utf-8 data, we currently fail
with an uninformative TypeError instead of propagating the
UnicodeDecodeError that Python sets on failure.
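
For illustration only, here is the kind of error that now reaches the caller. This is a plain-Python sketch, not pybind11 code: it decodes the same bytes that the `bad_utf8_string` test in this commit returns, and the variable names are made up for the example.

```python
# Same bytes as the bad_utf8_string test: 0xd0 opens a two-byte UTF-8
# sequence, but the next byte ('d', 0x64) is not a continuation byte.
data = b"abc\xd0" b"def"

try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # This is the exception type the commit propagates instead of
    # replacing it with a generic TypeError.
    error = exc

print(type(error).__name__, "at byte offset", error.start)
```

Because the exception carries the offending offset, it is considerably more useful to the caller than a bare TypeError.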

* Add support for u16/u32strings and literals

This adds support for char{16,32}_t character literals and the
associated std::u{16,32}string types.  It also folds the
character/string conversion into a single type_caster template, since
the type casters for string and wstring were mostly the same anyway.
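
As a rough illustration of what the UTF-16 and UTF-32 variants have to carry, the plain-Python sketch below lists the code units of a string in each encoding. The helper names `utf16_units` and `utf32_units` are hypothetical and exist only for this example; the values match the test constants used in this commit (for instance `cake16_1 = 0xd83c`, `cake16_2 = 0xdf82`).

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def utf32_units(text):
    """Return the UTF-32 code units (code points) of text as a list of ints."""
    raw = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))


cake = u"\U0001f382"  # birthday cake, outside the Basic Multilingual Plane
print([hex(u) for u in utf16_units(cake)])  # two units: a surrogate pair
print([hex(u) for u in utf32_units(cake)])  # one unit: the code point itself
```

A code point above 0xFFFF therefore fits in a single `char32_t` but needs two `char16_t` units, which is why it can be returned in a `std::u16string` but not as a single `char16_t`.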

* Added too-long and too-big character conversion errors

With this commit, when casting to a single character, as opposed to a
C-style string, we make sure the input wasn't a multi-character string
or a single character with codepoint too large for the character type.
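
The two checks can be sketched in plain Python as follows. This is not the pybind11 implementation (which works on the encoded buffer in C++); `ord_checked` and its `char_bits` parameter are hypothetical names, Python 3 string semantics are assumed, and the error messages are the ones the tests in this commit expect.

```python
def ord_checked(value, char_bits):
    """Return the code point of a one-character string, or raise ValueError.

    char_bits is the assumed width of the target character type:
    8 for char, 16 for char16_t, 32 for char32_t.
    """
    if len(value) != 1:
        raise ValueError("Expected a character, but multi-character string found")
    limit = 1 << char_bits
    code_point = ord(value)
    if code_point >= limit:
        raise ValueError("Character code point not in range({0:#x})".format(limit))
    return code_point


print(hex(ord_checked(u"\u00e9", 8)))       # fits in a char
print(hex(ord_checked(u"\U0001f382", 32)))  # fits in a char32_t
```

Passing `u"\u0100"` with `char_bits=8`, or a code point above 0xFFFF with `char_bits=16`, raises the "not in range" error; any string of length other than one raises the "multi-character" error.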

This also changes the character cast op to CharT instead of CharT& (we
need to be able to return a temporary decoded char value, and there is
little to gain from an lvalue return here anyway).

Finally it changes the char caster to 'has-a-string-caster' instead of
'is-a-string-caster' because, with the cast_op change above, there's
nothing at all gained from inheritance.  This also lets us move the
`success` flag out of the string caster (where it existed only for the
char caster) and into the char caster itself.  (I also renamed it to
'none' and inverted its value to better reflect its purpose).  The None -> nullptr
loading also now takes place only under a `convert = true` load pass.
Although it's unlikely that a function taking a char also has overloads
that can take a None, it seems marginally more correct to treat it as a
conversion.

This commit also simplifies the assumptions about character sizes, with
static_asserts to back them up.
parent ada763b9
......@@ -94,14 +94,26 @@ as arguments and return values, refer to the section on binding :ref:`classes`.
+------------------------------------+---------------------------+-------------------------------+
| ``char``                           | Character literal         | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``char16_t``                       | UTF-16 character literal  | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``char32_t``                       | UTF-32 character literal  | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``wchar_t``                        | Wide character literal    | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``const char *``                   | UTF-8 string literal      | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``const char16_t *``               | UTF-16 string literal     | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``const char32_t *``               | UTF-32 string literal     | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``const wchar_t *``                | Wide string literal       | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``std::string``                    | STL dynamic UTF-8 string  | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``std::u16string``                 | STL dynamic UTF-16 string | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``std::u32string``                 | STL dynamic UTF-32 string | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``std::wstring``                   | STL dynamic wide string   | :file:`pybind11/pybind11.h`   |
+------------------------------------+---------------------------+-------------------------------+
| ``std::pair<T1, T2>``              | Pair of two custom types  | :file:`pybind11/pybind11.h`   |
......
......@@ -111,6 +111,7 @@
#define PYBIND11_BYTES_FROM_STRING_AND_SIZE PyBytes_FromStringAndSize
#define PYBIND11_BYTES_AS_STRING_AND_SIZE PyBytes_AsStringAndSize
#define PYBIND11_BYTES_AS_STRING PyBytes_AsString
#define PYBIND11_BYTES_SIZE PyBytes_Size
#define PYBIND11_LONG_CHECK(o) PyLong_Check(o)
#define PYBIND11_LONG_AS_LONGLONG(o) PyLong_AsLongLong(o)
#define PYBIND11_LONG_AS_UNSIGNED_LONGLONG(o) PyLong_AsUnsignedLongLong(o)
......@@ -129,6 +130,7 @@
#define PYBIND11_BYTES_FROM_STRING_AND_SIZE PyString_FromStringAndSize
#define PYBIND11_BYTES_AS_STRING_AND_SIZE PyString_AsStringAndSize
#define PYBIND11_BYTES_AS_STRING PyString_AsString
#define PYBIND11_BYTES_SIZE PyString_Size
#define PYBIND11_LONG_CHECK(o) (PyInt_Check(o) || PyLong_Check(o))
#define PYBIND11_LONG_AS_LONGLONG(o) (PyInt_Check(o) ? (long long) PyLong_AsLong(o) : PyLong_AsLongLong(o))
#define PYBIND11_LONG_AS_UNSIGNED_LONGLONG(o) (PyInt_Check(o) ? (unsigned long long) PyLong_AsUnsignedLong(o) : PyLong_AsUnsignedLongLong(o))
......
......@@ -17,6 +17,11 @@
# include <fcntl.h>
#endif
#if defined(_MSC_VER)
# pragma warning(push)
# pragma warning(disable: 4127) // warning C4127: Conditional expression is constant
#endif
class ExamplePythonTypes {
public:
static ExamplePythonTypes *new_instance() {
......@@ -426,4 +431,41 @@ test_initializer python_types([](py::module &m) {
"l"_a=l
);
});
// Some test characters in utf16 and utf32 encodings. The last one (the 𝐀) contains a null byte
char32_t a32 = 0x61 /*a*/, z32 = 0x7a /*z*/, ib32 = 0x203d /*‽*/, cake32 = 0x1f382 /*🎂*/, mathbfA32 = 0x1d400 /*𝐀*/;
char16_t b16 = 0x62 /*b*/, z16 = 0x7a, ib16 = 0x203d, cake16_1 = 0xd83c, cake16_2 = 0xdf82, mathbfA16_1 = 0xd835, mathbfA16_2 = 0xdc00;
std::wstring wstr;
wstr.push_back(0x61); // a
wstr.push_back(0x2e18); // ⸘
if (sizeof(wchar_t) == 2) { wstr.push_back(mathbfA16_1); wstr.push_back(mathbfA16_2); } // 𝐀, utf16
else { wstr.push_back((wchar_t) mathbfA32); } // 𝐀, utf32
wstr.push_back(0x7a); // z
m.def("good_utf8_string", []() { return std::string(u8"Say utf8\u203d \U0001f382 \U0001d400"); }); // Say utf8‽ 🎂 𝐀
m.def("good_utf16_string", [=]() { return std::u16string({ b16, ib16, cake16_1, cake16_2, mathbfA16_1, mathbfA16_2, z16 }); }); // b‽🎂𝐀z
m.def("good_utf32_string", [=]() { return std::u32string({ a32, mathbfA32, cake32, ib32, z32 }); }); // a𝐀🎂‽z
m.def("good_wchar_string", [=]() { return wstr; }); // a⸘𝐀z
m.def("bad_utf8_string", []() { return std::string("abc\xd0" "def"); });
m.def("bad_utf16_string", [=]() { return std::u16string({ b16, char16_t(0xd800), z16 }); });
// Under Python 2.7, invalid unicode UTF-32 characters don't appear to trigger UnicodeDecodeError
if (PY_MAJOR_VERSION >= 3)
m.def("bad_utf32_string", [=]() { return std::u32string({ a32, char32_t(0xd800), z32 }); });
if (PY_MAJOR_VERSION >= 3 || sizeof(wchar_t) == 2)
m.def("bad_wchar_string", [=]() { return std::wstring({ wchar_t(0x61), wchar_t(0xd800) }); });
m.def("u8_Z", []() -> char { return 'Z'; });
m.def("u8_eacute", []() -> char { return '\xe9'; });
m.def("u16_ibang", [=]() -> char16_t { return ib16; });
m.def("u32_mathbfA", [=]() -> char32_t { return mathbfA32; });
m.def("wchar_heart", []() -> wchar_t { return 0x2665; });
m.attr("wchar_size") = py::cast(sizeof(wchar_t));
m.def("ord_char", [](char c) -> int { return static_cast<unsigned char>(c); });
m.def("ord_char16", [](char16_t c) -> uint16_t { return c; });
m.def("ord_char32", [](char32_t c) -> uint32_t { return c; });
m.def("ord_wchar", [](wchar_t c) -> int { return c; });
});
#if defined(_MSC_VER)
# pragma warning(pop)
#endif
# Python < 3 needs this: coding=utf-8
import pytest
from pybind11_tests import ExamplePythonTypes, ConstructorStats, has_optional, has_exp_optional
......@@ -410,3 +411,93 @@ def test_implicit_casting():
'int_i1': 42, 'int_i2': 42, 'int_e': 43, 'int_p': 44
}
assert z['l'] == [3, 6, 9, 12, 15]
def test_unicode_conversion():
    """Tests unicode conversion and error reporting."""
    import pybind11_tests
    from pybind11_tests import (good_utf8_string, bad_utf8_string,
                                good_utf16_string, bad_utf16_string,
                                good_utf32_string,  # bad_utf32_string,
                                good_wchar_string,  # bad_wchar_string,
                                u8_Z, u8_eacute, u16_ibang, u32_mathbfA, wchar_heart)
    assert good_utf8_string() == u"Say utf8‽ 🎂 𝐀"
    assert good_utf16_string() == u"b‽🎂𝐀z"
    assert good_utf32_string() == u"a𝐀🎂‽z"
    assert good_wchar_string() == u"a⸘𝐀z"
    with pytest.raises(UnicodeDecodeError):
        bad_utf8_string()
    with pytest.raises(UnicodeDecodeError):
        bad_utf16_string()
    # These are provided only if they actually fail (they don't when 32-bit and under Python 2.7)
    if hasattr(pybind11_tests, "bad_utf32_string"):
        with pytest.raises(UnicodeDecodeError):
            pybind11_tests.bad_utf32_string()
    if hasattr(pybind11_tests, "bad_wchar_string"):
        with pytest.raises(UnicodeDecodeError):
            pybind11_tests.bad_wchar_string()
    assert u8_Z() == 'Z'
    assert u8_eacute() == u'é'
    assert u16_ibang() == u'‽'
    assert u32_mathbfA() == u'𝐀'
    assert wchar_heart() == u'♥'

def test_single_char_arguments():
    """Tests failures for passing invalid inputs to char-accepting functions"""
    from pybind11_tests import ord_char, ord_char16, ord_char32, ord_wchar, wchar_size

    def toobig_message(r):
        return "Character code point not in range({0:#x})".format(r)
    toolong_message = "Expected a character, but multi-character string found"

    assert ord_char(u'a') == 0x61  # simple ASCII
    assert ord_char(u'é') == 0xE9  # requires 2 bytes in utf-8, but can be stuffed in a char
    with pytest.raises(ValueError) as excinfo:
        assert ord_char(u'Ā') == 0x100  # requires 2 bytes, doesn't fit in a char
    assert str(excinfo.value) == toobig_message(0x100)
    with pytest.raises(ValueError) as excinfo:
        assert ord_char(u'ab')
    assert str(excinfo.value) == toolong_message

    assert ord_char16(u'a') == 0x61
    assert ord_char16(u'é') == 0xE9
    assert ord_char16(u'Ā') == 0x100
    assert ord_char16(u'‽') == 0x203d
    assert ord_char16(u'♥') == 0x2665
    with pytest.raises(ValueError) as excinfo:
        assert ord_char16(u'🎂') == 0x1F382  # requires surrogate pair
    assert str(excinfo.value) == toobig_message(0x10000)
    with pytest.raises(ValueError) as excinfo:
        assert ord_char16(u'aa')
    assert str(excinfo.value) == toolong_message

    assert ord_char32(u'a') == 0x61
    assert ord_char32(u'é') == 0xE9
    assert ord_char32(u'Ā') == 0x100
    assert ord_char32(u'‽') == 0x203d
    assert ord_char32(u'♥') == 0x2665
    assert ord_char32(u'🎂') == 0x1F382
    with pytest.raises(ValueError) as excinfo:
        assert ord_char32(u'aa')
    assert str(excinfo.value) == toolong_message

    assert ord_wchar(u'a') == 0x61
    assert ord_wchar(u'é') == 0xE9
    assert ord_wchar(u'Ā') == 0x100
    assert ord_wchar(u'‽') == 0x203d
    assert ord_wchar(u'♥') == 0x2665
    if wchar_size == 2:
        with pytest.raises(ValueError) as excinfo:
            assert ord_wchar(u'🎂') == 0x1F382  # requires surrogate pair
        assert str(excinfo.value) == toobig_message(0x10000)
    else:
        assert ord_wchar(u'🎂') == 0x1F382
    with pytest.raises(ValueError) as excinfo:
        assert ord_wchar(u'aa')
    assert str(excinfo.value) == toolong_message