PEP: 393
Title: Flexible String Representation
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 24-Jan-2010
Python-Version: 3.3
Post-History:


Abstract
========

The Unicode string type is changed to support multiple internal
representations, depending on the character with the largest Unicode
ordinal (1, 2, or 4 bytes). This will allow a space-efficient
representation in common cases, but give access to full UCS-4 on all
systems. For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out.

Rationale
=========

There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character,
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

One problem with the approach is support for existing applications
(e.g. extension modules). For compatibility, redundant representations
may be computed. Applications are encouraged to phase out reliance on
a specific internal representation if possible. As interaction with
other libraries will often require some sort of internal
representation, the specification chooses UTF-8 as the recommended way
of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually
share memory (e.g. the shortest form may be shared with the UTF-8 form
if all characters are ASCII).
With such sharing, the overhead of compatibility representations is
reduced.

Specification
=============

The Unicode object structure is changed to this definition::

  typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    void *str;
    Py_hash_t hash;
    int state;
    Py_ssize_t utf8_length;
    void *utf8;
    Py_ssize_t wstr_length;
    void *wstr;
  } PyUnicodeObject;

These fields have the following interpretations:

- length: number of code points in the string (result of sq_length)
- str: shortest-form representation of the unicode string.
  The string is null-terminated (in its respective representation).
- hash: same as in Python 3.2
- state:

  * lowest 2 bits (mask 0x03) - interned-state (SSTATE_*) as in 3.2
  * next 2 bits (mask 0x0C) - form of str:

    + 00 => reserved
    + 01 => 1 byte (Latin-1)
    + 10 => 2 byte (UCS-2)
    + 11 => 4 byte (UCS-4)

  * next bit (mask 0x10): 1 if str memory follows PyUnicodeObject

- utf8_length, utf8: UTF-8 representation (null-terminated)
- wstr_length, wstr: representation in platform's wchar_t
  (null-terminated). If wchar_t is 16-bit, this form may use surrogate
  pairs (in which case wstr_length differs from length).

All three representations are optional, although the str form is
considered the canonical representation which can be absent only
while the string is being created. If a representation is absent,
its pointer is NULL, and the corresponding length field may contain
arbitrary data.

The Py_UNICODE type is still supported but deprecated. It is always
defined as a typedef for wchar_t, so the wstr representation can
double as the Py_UNICODE representation.

The str and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient). The str
and wstr pointers point to the same memory if the string happens to
fit exactly to the wchar_t type of the platform (i.e. uses some
BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some
non-BMP characters if sizeof(wchar_t) is 4).

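
As an illustration of the state layout listed above, the following
minimal sketch decodes the three bit groups with plain masks and
shifts. The DEMO_* macros and demo_* functions are hypothetical names
chosen for this example and are not part of the proposed API; only the
mask values and form codes are taken from the field list. The
demo_form_for_maxchar helper additionally assumes that the form is
selected from the largest ordinal using the usual Latin-1 (U+00FF) and
UCS-2 (U+FFFF) limits. For example, a state value of 0x19 would decode
to interned state 1, form 2 (UCS-2), with the str memory following the
object.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; the PEP specifies the
   bit layout of `state`, not these macros or functions. */
#define DEMO_INTERNED_MASK 0x03  /* lowest 2 bits: interned state */
#define DEMO_FORM_MASK     0x0C  /* next 2 bits: form of str */
#define DEMO_FORM_SHIFT    2
#define DEMO_FOLLOWS_MASK  0x10  /* next bit: str memory follows object */

/* Interned state (SSTATE_*), taken from the lowest two bits. */
static int demo_interned_state(int state)
{
    return state & DEMO_INTERNED_MASK;
}

/* Form of str: 0 = reserved, 1 = Latin-1, 2 = UCS-2, 3 = UCS-4. */
static int demo_form(int state)
{
    return (state & DEMO_FORM_MASK) >> DEMO_FORM_SHIFT;
}

/* Bytes per character for a given form, or 0 for the reserved form. */
static size_t demo_char_size(int form)
{
    switch (form) {
    case 1:  return 1;
    case 2:  return 2;
    case 3:  return 4;
    default: return 0;
    }
}

/* Nonzero if the str memory is allocated right after the object. */
static int demo_str_follows_object(int state)
{
    return (state & DEMO_FOLLOWS_MASK) != 0;
}

/* Assumption: the narrowest form able to hold maxchar is chosen. */
static int demo_form_for_maxchar(unsigned long maxchar)
{
    if (maxchar <= 0xFFUL)
        return 1;   /* fits in Latin-1 */
    if (maxchar <= 0xFFFFUL)
        return 2;   /* fits in UCS-2 */
    return 3;       /* needs UCS-4 */
}
```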
If the string is created directly with the canonical representation
(see below), this representation doesn't take a separate memory block,
but is allocated right after the PyUnicodeObject struct.

String Creation
---------------

The recommended way to create a Unicode object is to use the function
PyUnicode_New::

  PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);

Both parameters must denote the eventual size/range of the string.
In particular, codecs using this API must compute both the number of
characters and the maximum character in advance. A string is
allocated according to the specified size and character range and is
null-terminated; the actual characters in it may be uninitialized.

PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported
for processing UTF-8 input; the input is decoded, and the UTF-8
representation is not yet set for the string.

PyUnicode_FromUnicode remains supported but is deprecated. If the
Py_UNICODE pointer is non-null, the str representation is set. If the
pointer is NULL, a properly-sized wstr representation is allocated,
which can be modified until PyUnicode_Ready() is called (explicitly
or implicitly). Resizing a Unicode string remains possible until it
is finalized.

PyUnicode_Ready() converts a string containing only a wstr
representation into the canonical representation. Unless wstr and str
can share the memory, the wstr representation is discarded after the
conversion.

String Access
-------------

The canonical representation can be accessed using two macros,
PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
values PyUnicode_1BYTE (1), PyUnicode_2BYTE (2), or PyUnicode_4BYTE
(3). PyUnicode_Data gives the void pointer to the data, masking out
the pointer kind. Both macros call PyUnicode_Ready in case the
canonical representation hasn't been computed yet.

A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation.
It is thus identical to the existing _PyUnicode_AsString, which is
removed. The function will compute the utf8 representation when first
called. Since this representation will consume memory until the
string object is released, applications should use the existing
PyUnicode_AsUTF8String where possible (which generates a new string
object every time). APIs that implicitly convert a string to a char*
(such as the ParseTuple functions) will use PyUnicode_AsUTF8 to
compute the conversion.

PyUnicode_AsUnicode is deprecated; it computes the wstr representation
on first use.

String Operations
-----------------

Various convenience functions will be provided to deal with the
canonical representation, in particular with respect to concatenation
and slicing.

Stable ABI
----------

None of the functions in this PEP become part of the stable ABI.

GDB Debugging Hooks
-------------------

Tools/gdb/libpython.py contains debugging hooks that embed knowledge
about the internals of CPython's data types, including
PyUnicodeObject instances. It will need to be slightly updated to
track the change.

Discussion
==========

Several concerns have been raised about the approach presented here:

It makes the implementation more complex. That's true, but considered
worth it given the gains.

The Py_UNICODE representation is not instantaneously available,
slowing down applications that request it. While this is also true,
applications that care about this problem can be rewritten to use the
str representation.

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: