
The 5-6 byte variants (and also 4 at the time) exist because of the need to round-trip UCS surrogate pairs through UTF-8, no? That's what I assume the "political reasons" are...


Others have already answered why surrogate pairs are irrelevant (and not UCS), but I think it's worth saying what the probable actual reason for 5-6 byte variants was. Remember that UCS and Unicode were at this point still two separate things; Unicode was supposed to be 16-bit (and later it got expanded, causing the whole surrogates mess), while UCS was supposed to be 31-bit. I assume the 5-6 byte variants were for UCS (back before it got merged with Unicode).


Surrogate pairs are only in UTF-16 so as to encode code points that require more than 16 bits. UTF-8 has no need of them because it's already a variable width encoding.

If there were no code points larger than 16 bits, then UTF-8 would need at most 3 bytes per code point and UTF-16 wouldn't need surrogate pairs. Well, actually, UTF-16 probably wouldn't exist at all, because UCS-2 would have been enough for everybody.
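To make the surrogate-pair point concrete, here is a small sketch. The helper name `utf16_surrogates` is made up for illustration; the arithmetic is the standard UTF-16 split of a code point above U+FFFF into a high and a low surrogate. The same code point needs no such trick in UTF-8, it just takes 4 bytes.

```python
def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded with surrogates")
    v = cp - 0x10000               # 20-bit value left after removing the offset
    high = 0xD800 | (v >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 | (v & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


cp = 0x1F600                       # a code point outside the 16-bit range
high, low = utf16_surrogates(cp)
print(f"UTF-16: {high:04X} {low:04X}")             # two 16-bit code units
print("UTF-8: ", chr(cp).encode("utf-8").hex(" "))  # four bytes, no surrogates
print("BMP:   ", chr(0xFFFD).encode("utf-8").hex(" "))  # 16-bit code point, three bytes
```

Anything at or below U+FFFF fits in one UTF-16 code unit and at most 3 UTF-8 bytes, which is the hypothetical "no code points larger than 16 bits" world described above.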


No. They exist to encode 31 bits of code point space, but the Unicode Consortium later decided to limit the code point space to only 21 bits because that is what UTF-16 is limited to, and then UTF-8 no longer needed to support sequences of 5 and 6 bytes.
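For anyone who wants to see where the 31 bits come from, here is a rough sketch of the original (pre-restriction) UTF-8 layout. The function name `encode_original_utf8` is made up, and this deliberately does not apply the modern rules (the U+10FFFF cap, rejecting surrogates), so treat it as an illustration rather than a conforming encoder. Payload bits per length are 7, 11, 16, 21, 26, and 31.

```python
def encode_original_utf8(cp: int) -> bytes:
    """Encode cp using the original 1- to 6-byte UTF-8 layout (up to 31 bits).

    Hypothetical helper: no U+10FFFF cap and no surrogate check, unlike
    modern UTF-8.
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])         # 1 byte, 7 payload bits
    # (byte count, largest value that fits): 11, 16, 21, 26, 31 payload bits
    for nbytes, max_cp in ((2, 0x7FF), (3, 0xFFFF), (4, 0x1FFFFF),
                           (5, 0x3FFFFFF), (6, 0x7FFFFFFF)):
        if cp <= max_cp:
            break
    else:
        raise ValueError("value does not fit in 31 bits")
    # Lead byte starts with nbytes one-bits followed by a zero bit.
    lead_prefix = (0xFF << (8 - nbytes)) & 0xFF
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    return bytes([lead_prefix | cp] + tail[::-1])


for value in (0x10FFFF, 0x200000, 0x7FFFFFFF):
    print(f"{value:#x}: {len(encode_original_utf8(value))} bytes")
```

The 21-bit limit lands exactly on the 4-byte form, which is why the 5- and 6-byte forms became unnecessary once the code point space was capped at U+10FFFF.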


I don't think so? Aren't UCS surrogate pairs at most 16 bits each by their very purpose? Also, code points beyond 16 bits came much later, I believe: in Unicode 2.0 in 1996 according to Wikipedia (vs. UTF-8, which is from around 1992).



