Seed7 Library: Unicode

Unicode

Function Summary

string toUtf8 (in string: stri)
	Convert a string to a UTF-8 encoded string of bytes.

string fromUtf8 (in string: utf8)
	Convert a string with bytes in UTF-8 encoding to UTF-32.

string toUtf16Be (in string: stri)
	Convert a string to a UTF-16BE encoded string of bytes.

string fromUtf16Be (in string: utf16Be)
	Convert a UTF-16BE encoded string of bytes to UTF-32.

string toUtf16Le (in string: stri)
	Convert a string to a UTF-16LE encoded string of bytes.

string fromUtf16Le (in string: utf16Le)
	Convert a UTF-16LE encoded string of bytes to UTF-32.

string replaceUtf16SurrogatePairs (in string: stri)
	Return a string in which all surrogate pairs are replaced by single chars.

string fromNullTerminatedUtf16Be (in string: stri, in integer: startPos)
	Convert a null terminated UTF-16BE encoded string of bytes to UTF-32.

string fromNullTerminatedUtf16Le (in string: stri, in integer: startPos)
	Convert a null terminated UTF-16LE encoded string of bytes to UTF-32.

string getNullTerminatedUtf16Be (in string: stri, inout integer: currPos)
	Read a null terminated UTF-16BE encoded string of bytes and convert it to UTF-32.

string getNullTerminatedUtf16Be (inout file: inFile)
	Read a null terminated UTF-16BE encoded string of bytes and convert it to UTF-32.

string getNullTerminatedUtf16Le (in string: stri, inout integer: currPos)
	Read a null terminated UTF-16LE encoded string of bytes and convert it to UTF-32.

string getNullTerminatedUtf16Le (inout file: inFile)
	Read a null terminated UTF-16LE encoded string of bytes and convert it to UTF-32.

string fromUtf7 (in string: stri7)
	Convert a string from a UTF-7 encoding to UTF-32.

Function Detail

toUtf8

const func string: toUtf8 (in string: stri)

Convert a string to a UTF-8 encoded string of bytes.

toUtf8("€")          returns "â\130;¬"

Surrogate pairs are converted into a CESU-8 encoded string:

toUtf8("\16#d834;\16#dd1e;")  returns "\237;\160;\180;\237;\180;\158;"  (surrogate pair)

This function accepts unpaired surrogate characters.

toUtf8("\16#dc00;")  returns "\16#ed;\16#b0;\16#80;"  (unpaired surrogate char)

Note that a Unicode string should not contain surrogate characters. If the string contains surrogate pairs use

toUtf8(replaceUtf16SurrogatePairs(stringWithSurrogatePairs))

to create a correct (not CESU-8 encoded) UTF-8 string.

Parameters: stri - Normal (UTF-32) string to be converted to UTF-8.

Returns: stri converted to a string of bytes with UTF-8 encoding.
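
The difference between the CESU-8 result and the correct UTF-8 result can be checked with a short program. This is a minimal sketch, not part of the library. It assumes the library is included as unicode.s7i and uses only functions described on this page. It prints lengths and a comparison instead of the converted strings, because the standard output file may not accept arbitrary Unicode characters.

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

const proc: main is func
  local
    # U+1D11E written as a UTF-16 surrogate pair.
    var string: withPair is "\16#d834;\16#dd1e;";
    var string: cesu8 is "";
    var string: utf8 is "";
  begin
    cesu8 := toUtf8(withPair);
    utf8 := toUtf8(replaceUtf16SurrogatePairs(withPair));
    # CESU-8 uses 3 bytes per surrogate, UTF-8 uses 4 bytes for U+1D11E.
    writeln("CESU-8 length: " & str(length(cesu8)));
    writeln("UTF-8 length:  " & str(length(utf8)));
    writeln("UTF-8 bytes are F0 9D 84 9E: " &
            str(utf8 = "\16#f0;\16#9d;\16#84;\16#9e;"));
  end func;
```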

fromUtf8

const func string: fromUtf8 (in string: utf8)

Convert a string with bytes in UTF-8 encoding to UTF-32.

fromUtf8("â\130;¬")                         returns "€"

Surrogate pairs from a CESU-8 encoded string are kept intact:

fromUtf8("\237;\160;\180;\237;\180;\158;")  returns "\16#d834;\16#dd1e;" (surrogate pair)

To decode a CESU-8 encoded string use:

replaceUtf16SurrogatePairs(fromUtf8(cesu8String))

Overlong encodings and unpaired surrogate chars are accepted.

fromUtf8("\16#c0;\16#80;")                  returns "\0;"        (overlong encoding)
fromUtf8("\16#ed;\16#b0;\16#80;")           returns "\16#dc00;"  (unpaired surrogate char)

Parameters: utf8 - String of bytes encoded with UTF-8.

Returns: utf8 converted to a normal (UTF-32) string.

Raises: RANGE_ERROR - If utf8 contains a char beyond '\255;' or if utf8 is not encoded with UTF-8.
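
The accepted and rejected inputs can be explored with a small test program. The sketch below is illustrative only: tryDecode and codePoints are hypothetical helpers, the file name unicode.s7i is an assumption, and the byte 16#ff is chosen because it never occurs in UTF-8. Strings are shown as hexadecimal numbers so that the output stays plain ASCII.

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

# Hypothetical helper: list the chars of stri as hexadecimal numbers.
const func string: codePoints (in string: stri) is func
  result
    var string: shown is "";
  local
    var char: ch is ' ';
  begin
    for ch range stri do
      shown &:= " 16#" & (ord(ch) radix 16);
    end for;
  end func;

# Hypothetical helper: decode and report either the result or the error.
const proc: tryDecode (in string: encoded) is func
  local
    var string: decoded is "";
  begin
    block
      decoded := fromUtf8(encoded);
      writeln(codePoints(encoded) & " ->" & codePoints(decoded));
    exception
      catch RANGE_ERROR:
        writeln(codePoints(encoded) & " -> RANGE_ERROR");
    end block;
  end func;

const proc: main is func
  begin
    tryDecode("\226;\130;\172;");  # Well-formed UTF-8 for the euro sign.
    tryDecode("\16#c0;\16#80;");   # Overlong encoding, documented as accepted.
    tryDecode("\16#ff;");          # Not a valid UTF-8 byte.
    tryDecode("\16#20ac;");        # Not a byte string: char beyond '\255;'.
  end func;
```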

toUtf16Be

const func string: toUtf16Be (in string: stri)

Convert a string to a UTF-16BE encoded string of bytes.

Parameters: stri - Normal (UTF-32) string to be converted to UTF-16BE.

Returns: stri converted to a string of bytes with UTF-16BE encoding.

Raises: RANGE_ERROR - If a character is not representable as UTF-16 or a surrogate character is present.

fromUtf16Be

const func string: fromUtf16Be (in string: utf16Be)

Convert a UTF-16BE encoded string of bytes to UTF-32.

Parameters: utf16Be - String of bytes encoded with UTF-16 in big endian byte order.

Returns: utf16Be converted to a normal (UTF-32) string.

Raises: RANGE_ERROR - If the length of utf16Be is odd or if utf16Be contains a char beyond '\255;' or if utf16Be contains an invalid surrogate pair.
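
A round trip through toUtf16Be and fromUtf16Be illustrates the byte layout and the odd-length check. This is a sketch under the assumption that the library file is named unicode.s7i. The expected byte string in the comparison follows from the UTF-16 definition (two bytes per BMP character, a four-byte surrogate pair for U+1D11E).

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

const proc: main is func
  local
    # 'A', the euro sign U+20AC and U+1D11E (outside the BMP).
    var string: original is "A\16#20ac;\16#1d11e;";
    var string: encoded is "";
    var string: unused is "";
  begin
    encoded := toUtf16Be(original);
    writeln("number of bytes: " & str(length(encoded)));
    writeln("bytes are 00 41 20 AC D8 34 DD 1E: " &
            str(encoded = "\0;A\16#20;\16#ac;\16#d8;\16#34;\16#dd;\16#1e;"));
    writeln("round trip ok: " & str(fromUtf16Be(encoded) = original));
    block
      # Three bytes cannot be a sequence of 16-bit code units.
      unused := fromUtf16Be(encoded[.. 3]);
    exception
      catch RANGE_ERROR:
        writeln("odd length raised RANGE_ERROR");
    end block;
  end func;
```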

toUtf16Le

const func string: toUtf16Le (in string: stri)

Convert a string to a UTF-16LE encoded string of bytes.

Parameters: stri - Normal (UTF-32) string to be converted to UTF-16LE.

Returns: stri converted to a string of bytes with UTF-16LE encoding.

Raises: RANGE_ERROR - If a character is not representable as UTF-16 or a surrogate character is present.

fromUtf16Le

const func string: fromUtf16Le (in string: utf16Le)

Convert a UTF-16LE encoded string of bytes to UTF-32.

Parameters: utf16Le - String of bytes encoded with UTF-16 in little endian byte order.

Returns: utf16Le converted to a normal (UTF-32) string.

Raises: RANGE_ERROR - If the length of utf16Le is odd or if utf16Le contains a char beyond '\255;' or if utf16Le contains an invalid surrogate pair.
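
The only difference between the big endian and little endian functions is the byte order inside each 16-bit code unit. The following sketch compares both encodings of the euro sign (U+20AC) and shows the documented RANGE_ERROR for a surrogate character. The file name unicode.s7i is an assumption.

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

const proc: main is func
  local
    var string: euro is "\16#20ac;";
    var string: unused is "";
  begin
    # Same 16-bit code unit 16#20ac, opposite byte order.
    writeln("BE is 20 AC: " & str(toUtf16Be(euro) = "\16#20;\16#ac;"));
    writeln("LE is AC 20: " & str(toUtf16Le(euro) = "\16#ac;\16#20;"));
    writeln("round trip ok: " & str(fromUtf16Le(toUtf16Le(euro)) = euro));
    block
      # A lone surrogate character is not allowed in the input.
      unused := toUtf16Le("\16#d834;");
    exception
      catch RANGE_ERROR:
        writeln("surrogate character raised RANGE_ERROR");
    end block;
  end func;
```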

replaceUtf16SurrogatePairs

const func string: replaceUtf16SurrogatePairs (in string: stri)

Return a string in which all surrogate pairs are replaced by single chars.

replaceUtf16SurrogatePairs("\16#d834;\16#dd1e;")  returns "\16#1d11e;"

This function can be used to decode CESU-8 encoded strings:

replaceUtf16SurrogatePairs(fromUtf8(cesu8String))

In CESU-8 a Unicode code point from the Basic Multilingual Plane (BMP) is encoded in the same way as in UTF-8. A Unicode code point outside the BMP is first represented as a surrogate pair, like in UTF-16, and then each surrogate code point is encoded in UTF-8.

Parameters: stri - String of UTF-16 or UTF-32 Unicode characters, which may contain surrogate pairs.

Returns: stri with all surrogate pairs replaced by single UTF-32 chars.

Raises: RANGE_ERROR - If an invalid surrogate pair is present.
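
The replacement of a surrogate pair follows the UTF-16 arithmetic: the high surrogate contributes the upper 10 bits, the low surrogate the lower 10 bits, and 16#10000 is added. The sketch below recomputes this by hand and compares it with the library result. combineSurrogates is a hypothetical helper written for this illustration, not a library function, and the file name unicode.s7i is an assumption.

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

# Hypothetical helper: combine a high and a low surrogate into one char.
const func char: combineSurrogates (in char: high, in char: low) is
  return chr(16#10000 + ((ord(high) - 16#d800) << 10) + (ord(low) - 16#dc00));

const proc: main is func
  local
    var string: pair is "\16#d834;\16#dd1e;";
    var char: combined is ' ';
  begin
    combined := combineSurrogates(pair[1], pair[2]);
    # (16#34 << 10) + 16#11e + 16#10000 = 16#1d11e
    writeln("is U+1D11E: " & str(combined = '\16#1d11e;'));
    writeln("matches library: " &
            str(replaceUtf16SurrogatePairs(pair) = str(combined)));
  end func;
```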

fromNullTerminatedUtf16Be

const func string: fromNullTerminatedUtf16Be (in string: stri, in integer: startPos)

Convert a null terminated UTF-16BE encoded string of bytes to UTF-32. The UTF-16BE encoded string starts at startPos and ends with a UTF-16BE encoded null ('\0;') character. If there is no null character, the UTF-16BE encoded string is assumed to extend to the end of stri.

Parameters:
	stri - UTF-16BE encoded string of bytes (starting from startPos).
	startPos - Start position for the UTF-16BE encoded null terminated string.

Returns: the string found in UTF-32 encoding without the null ('\0;') character.

Raises: RANGE_ERROR - If the conversion from UTF-16BE to UTF-32 fails.
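
A typical use is extracting a name from a binary record. The following sketch uses a made-up buffer that holds "Hi" with a terminating null, followed by "XY" without one. Positions are byte positions in stri and start at 1. The file name unicode.s7i is an assumption.

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

const proc: main is func
  local
    # Bytes 1-4: "Hi", bytes 5-6: null, bytes 7-10: "XY" (unterminated).
    var string: buffer is "\0;H\0;i\0;\0;\0;X\0;Y";
  begin
    # Stops at the UTF-16BE null in bytes 5-6.
    writeln(fromNullTerminatedUtf16Be(buffer, 1));
    # No null follows, so the rest of the buffer is converted.
    writeln(fromNullTerminatedUtf16Be(buffer, 7));
  end func;
```

According to the description above the two lines written should be Hi and XY.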

fromNullTerminatedUtf16Le

const func string: fromNullTerminatedUtf16Le (in string: stri, in integer: startPos)

Convert a null terminated UTF-16LE encoded string of bytes to UTF-32. The UTF-16LE encoded string starts at startPos and ends with a UTF-16LE encoded null ('\0;') character. If there is no null character, the UTF-16LE encoded string is assumed to extend to the end of stri.

Parameters:
	stri - UTF-16LE encoded string of bytes (starting from startPos).
	startPos - Start position for the UTF-16LE encoded null terminated string.

Returns: the string found in UTF-32 encoding without the null ('\0;') character.

Raises: RANGE_ERROR - If the conversion from UTF-16LE to UTF-32 fails.

getNullTerminatedUtf16Be

const func string: getNullTerminatedUtf16Be (in string: stri, inout integer: currPos)

Read a null terminated UTF-16BE encoded string of bytes and convert it to UTF-32. The UTF-16BE encoded string starts at currPos and ends with a UTF-16BE encoded null ('\0;') character. The position currPos is advanced past the null ('\0;') character. If there is no null character, the UTF-16BE encoded string is assumed to extend to the end of stri. In this case currPos is advanced beyond the length of stri.

Parameters:
	stri - UTF-16BE encoded string of bytes (starting from currPos).
	currPos - Start position for the UTF-16BE encoded null terminated string. The function advances currPos to refer to the position after the terminating null ('\0;') character.

Returns: the string found in UTF-32 encoding without the null ('\0;') character.

Raises: RANGE_ERROR - If the conversion from UTF-16BE to UTF-32 fails.
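
Because currPos is advanced by each call, consecutive strings can be read in a loop. The sketch below walks over a made-up buffer containing two null terminated strings. The loop ends when currPos has moved beyond the length of the buffer. The file name unicode.s7i is an assumption.

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

const proc: main is func
  local
    # "Hi" and "Ok", each followed by a UTF-16BE null.
    var string: buffer is "\0;H\0;i\0;\0;\0;O\0;k\0;\0;";
    var integer: pos is 1;
    var string: item is "";
  begin
    while pos <= length(buffer) do
      item := getNullTerminatedUtf16Be(buffer, pos);
      # pos now refers to the position after the terminating null.
      writeln(literal(item) & ", next position: " & str(pos));
    end while;
  end func;
```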

getNullTerminatedUtf16Be

const func string: getNullTerminatedUtf16Be (inout file: inFile)

Read a null terminated UTF-16BE encoded string of bytes and convert it to UTF-32. The reading ends when a UTF-16BE encoded null ('\0;') character has been read.

Parameters: inFile - File with UTF-16BE encoded bytes.

Returns: the string read in UTF-32 encoding without the null ('\0;') character.

Raises: RANGE_ERROR - If the conversion from UTF-16BE to UTF-32 fails.
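
The file variant can be tried with a scratch file. This sketch writes two null terminated UTF-16BE strings, reads them back and removes the file afterwards. The scratch file name is hypothetical, the file name unicode.s7i is an assumption, and open, write, close and removeFile come from the standard file libraries rather than from this one.

```seed7
$ include "seed7_05.s7i";
  include "osfiles.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

const proc: main is func
  local
    var string: fileName is "utf16be_demo.bin";  # Hypothetical scratch file.
    var file: aFile is STD_NULL;
  begin
    aFile := open(fileName, "w");
    if aFile <> STD_NULL then
      # Two strings, each terminated by a UTF-16BE null.
      write(aFile, toUtf16Be("Hi") & "\0;\0;" & toUtf16Be("Ok") & "\0;\0;");
      close(aFile);
      aFile := open(fileName, "r");
      if aFile <> STD_NULL then
        # Each call reads up to and including the next null.
        writeln(getNullTerminatedUtf16Be(aFile));
        writeln(getNullTerminatedUtf16Be(aFile));
        close(aFile);
      end if;
      removeFile(fileName);
    end if;
  end func;
```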

getNullTerminatedUtf16Le

const func string: getNullTerminatedUtf16Le (in string: stri, inout integer: currPos)

Read a null terminated UTF-16LE encoded string of bytes and convert it to UTF-32. The UTF-16LE encoded string starts at currPos and ends with a UTF-16LE encoded null ('\0;') character. The position currPos is advanced past the null ('\0;') character. If there is no null character, the UTF-16LE encoded string is assumed to extend to the end of stri. In this case currPos is advanced beyond the length of stri.

Parameters:
	stri - UTF-16LE encoded string of bytes (starting from currPos).
	currPos - Start position for the UTF-16LE encoded null terminated string. The function advances currPos to refer to the position after the terminating null ('\0;') character.

Returns: the string found in UTF-32 encoding without the null ('\0;') character.

Raises: RANGE_ERROR - If the conversion from UTF-16LE to UTF-32 fails.

getNullTerminatedUtf16Le

const func string: getNullTerminatedUtf16Le (inout file: inFile)

Read a null terminated UTF-16LE encoded string of bytes and convert it to UTF-32. The reading ends when a UTF-16LE encoded null ('\0;') character has been read.

Parameters: inFile - File with UTF-16LE encoded bytes.

Returns: the string read in UTF-32 encoding without the null ('\0;') character.

Raises: RANGE_ERROR - If the conversion from UTF-16LE to UTF-32 fails.

fromUtf7

const func string: fromUtf7 (in string: stri7)

Convert a string from a UTF-7 encoding to UTF-32.

Parameters: stri7 - String of bytes encoded with UTF-7.

Returns: stri7 converted to a normal (UTF-32) string.

Raises: RANGE_ERROR - If the string is not UTF-7 encoded.
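
In UTF-7 a '+' starts a run of modified Base64 that encodes UTF-16 code units, a '-' ends the run, and "+-" stands for a literal '+'. The sketch below checks two inputs, the first taken from the examples in RFC 2152. It assumes that fromUtf7 follows that RFC and that the library file is named unicode.s7i. If both assumptions hold, both lines should report TRUE.

```seed7
$ include "seed7_05.s7i";
  include "unicode.s7i";  # Assumed file name of this library.

const proc: main is func
  begin
    # "+Jjo-" encodes the single code unit 16#263a (white smiling face).
    writeln("smiley decoded: " &
            str(fromUtf7("Hi Mom -+Jjo--!") = "Hi Mom -\16#263a;-!"));
    # "+-" is the escape for a literal plus sign.
    writeln("plus decoded: " &
            str(fromUtf7("1 +- 1 = 2") = "1 + 1 = 2"));
  end func;
```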