Bug #5812
SendMessage doesn't support Unicode (open)
Description
I was wondering why SendMessage wasn't working as expected with AdiIRC - it seems that even if the Unicode flag is enabled, AdiIRC doesn't read/write to the page file in Unicode.
It looks like AdiIRC is still reading and writing in either Latin1 / UTF8 (I didn't check outside of the Latin1 character set)
Using WM_MEVALUATE with WPARAM and LPARAM with the page file containing "Client version is $version on Win $+ $os $chr(169)" (in Unicode format)
Window Title: AdiIRC
Result (UTF8): Ok("C\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0")
Result (Unicode): C
Message sent
Window Title: mIRC
Result (UTF8): Ok("C\0l\0i\0e\0n\0t\0 \0v\0e\0r\0s\0i\0o\0n\0 \0i\0s\0 \07\0.\07\08\0 \0o\0n\0")
Result (Unicode): Client version is 7.78 on Win11 ©
JB Updated by JD Byrnes 9 months ago · Edited
Fixed read length for UTF8 data in my code (doesn't change the bug, just the output)
Initial Data Sent:
Result (UTF8): Ok("C\0l\0i\0e\0n\0t\0 \0v\0e\0r\0s\0i\0o\0n\0 \0i\0s\0 \0$\0v\0e\0r\0s\0i\0o\0n\0 \0o\0n\0 \0W\0i\0n\0 \0$\0+\0 \0$\0o\0s\0")
Result (Unicode): Client version is $version on Win $+ $os
Window Title: AdiIRC
Result (UTF8): Ok("C")
Result (Unicode): C
Window Title: mIRC
Result (UTF8): Ok("C\0l\0i\0e\0n\0t\0 \0v\0e\0r\0s\0i\0o\0n\0 \0i\0s\0 \07\0.\07\08\0 \0o\0n\0 \0W\0i\0n\01\01\0")
Result (Unicode): Client version is 7.78 on Win11
JB Updated by JD Byrnes 9 months ago · Edited
Related:
It might be worth noting that without the Unicode flag set, mIRC appears to handle data as UTF-8. AdiIRC uses a different codepage, which seems to destroy emoji etc.
- Tried sending UTF8 data (non Unicode) and AdiIRC converts to ?????? while mIRC returns UTF-8 data.
OO Updated by Oui Ouims 8 months ago
Hello, you have a fundamental misunderstanding about UTF-8.
UTF-8 doesn't use the NUL character (\0) to encode Unicode; you're using \0 in your UTF-8, so that's just plain wrong as far as UTF-8 is concerned.
AdiIRC reports "C" correctly because a NUL character is the end of the string representation, as far as UTF-8 is concerned.
Now, it looks like you're encoding to UTF-16 instead of UTF-8 (it could be UCS-2), which mIRC seems to understand correctly while AdiIRC doesn't, leading to mIRC writing back UTF-16 while AdiIRC writes back UTF-8 (so just a C).
So, there's no bug, AdiIRC just needs to support UTF-16 in the mapped file.
If mIRC decodes the mapped file as UTF-8 when you're not using the Unicode flag, this is a bug in mIRC, as it should decode as ANSI, and AdiIRC would be correct to destroy your emoji (decode as Latin-1).
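To make the "C" result concrete, here is a minimal Python sketch of what happens when UTF-16 data is read by a byte-oriented, NUL-terminated reader. It is only an illustration: `read_c_string` is a hypothetical helper, not code from AdiIRC, mIRC, or the test client used in this report.

```python
# Illustrative only: mimics a reader that treats the buffer as a
# NUL-terminated 8-bit string. Not AdiIRC's or mIRC's actual code.
text = "Client version is $version"

# UTF-16LE stores each ASCII character as the byte followed by 0x00.
utf16_bytes = text.encode("utf-16-le")


def read_c_string(buf: bytes) -> bytes:
    """Return the bytes up to (not including) the first NUL byte."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


# An 8-bit reader stops at the NUL that follows the first character.
as_8bit = read_c_string(utf16_bytes).decode("utf-8")

# A UTF-16 reader sees the whole string.
as_utf16 = utf16_bytes.decode("utf-16-le")

print(repr(as_8bit))   # 'C'
print(as_utf16)        # Client version is $version
```

The first byte pair of the buffer is `C` followed by `\0`, which is why a reader that stops at the first NUL returns only "C".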
JB Updated by JD Byrnes 8 months ago
Oui Ouims wrote in #note-3:
> UTF-8 doesn't use the NUL character (\0) to encode Unicode; you're using \0 in your UTF-8, so that's just plain wrong as far as UTF-8 is concerned.
I'm not using UTF-8 in anything, just showing the response when decoded as UTF-8.
> AdiIRC reports "C" correctly because a NUL character is the end of the string representation, as far as UTF-8 is concerned.
I understand this, but expect AdiIRC to support UTF-16 as mIRC does.
> Now, it looks like you're encoding to UTF-16 instead of UTF-8 (it could be UCS-2), which mIRC seems to understand correctly while AdiIRC doesn't, leading to mIRC writing back UTF-16 while AdiIRC writes back UTF-8 (so just a C).
Correct. It's UTF-16.
> So, there's no bug, AdiIRC just needs to support UTF-16 in the mapped file.
> If mIRC decodes the mapped file as UTF-8 when you're not using the Unicode flag, this is a bug in mIRC, as it should decode as ANSI, and AdiIRC would be correct to destroy your emoji (decode as Latin-1).
Are you sure that mIRC can decode a UTF-8 mapped file with the Unicode switch enabled?
Whenever mIRC mentions "Unicode" (DLL/SendMessage) it seems to refer to UTF-16. Whenever mIRC mentions "UTF"/"UTF-8" (everywhere else), it seems to refer to "UTF-8". I've just searched the help file to confirm this.
From my understanding of Latin-1, it shouldn't destroy an emoji, as it supports the full range of possible bytes (other similar codepages such as ASCII would destroy things). It's possible (and likely) mIRC is using Latin-1 encoding for SendMessage and then displaying the data received in UTF-8.
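As a rough illustration of the byte-preservation point (a Python sketch, not a claim about what mIRC or AdiIRC do internally): Latin-1 maps every byte value to a character, so a decode/encode round trip is lossless, while ASCII cannot represent bytes above 127.

```python
# UTF-8 bytes of U+1F601, used here purely as sample data.
emoji_utf8 = "\U0001F601".encode("utf-8")   # b'\xf0\x9f\x98\x81'

# Latin-1 maps each byte 0-255 to the code point of the same value,
# so decoding and re-encoding returns the original bytes unchanged.
round_trip = emoji_utf8.decode("latin-1").encode("latin-1")

# ASCII only covers 0-127; with errors="replace" every other byte
# becomes U+FFFD, so the original bytes cannot be recovered.
ascii_view = emoji_utf8.decode("ascii", errors="replace")

print(round_trip == emoji_utf8)                       # True
print(round_trip.decode("utf-8") == "\U0001F601")     # True
print(ascii_view == "\ufffd" * 4)                     # True
```

So if a receiver passes the bytes through Latin-1 untouched and a later stage decodes them as UTF-8, the emoji would survive; with a lossy codepage it would not.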
OO Updated by Oui Ouims 8 months ago
Well, maybe you're not using UTF-8 in anything and that's my mistake, but you're still using the term UTF-8 incorrectly in your sentences, for example "Tried sending UTF8 data (non Unicode)". UTF-8 is an encoding for Unicode; you can't possibly be sending UTF-8 without sending Unicode, that does not make sense. UTF-8 uses 8-bit code units and is compatible with ASCII: the letter C in ASCII/ANSI/Latin-1/UTF-8 always takes the same single byte, 67.
It is true that mIRC's documentation tends to use the term Unicode, but sometimes it uses it to mean UTF-16 and sometimes UTF-8.
> Are you sure that mIRC can decode a UTF-8 mapped file with the Unicode switch enabled?
No. I assumed it was the case from your first couple of messages, and I cannot test this at the moment. It would make sense that it only supports UTF-16, in which case this report IS a bug in AdiIRC.
Latin-1 is an 8-bit encoding, but the fact that it uses the full range of available bytes 0-255 doesn't mean anything in this case.
UTF-8 encodes each Unicode character in up to four bytes, and for an emoji it would be 4 bytes. Now 4 bytes are 4 bytes, and those 4 bytes can be interpreted in any way, shape, or form.
If you decode those 4 bytes using Latin-1, each individual byte then represents one character.
U+1F601    \xF0\x9F\x98\x81    GRINNING FACE WITH SMILING EYES
Here \xF0\x9F\x98\x81 is the (hexadecimal) UTF-8 representation of that emoji. If you decode those bytes with Latin-1 you get 4 characters whose bytes are 240, 159, 152, 129, which display as "ð" followed by three non-printing control characters.
Latin-1 does not destroy your bytes, no, but it just won't do what UTF-8 was made for. ASCII is the same shit: it doesn't use 0-255, but it also converts one byte to one character just like Latin-1.
If you're not using the Unicode flag, mIRC is just not doing anything, it just writes the bytes it gets from your data. The only thing is that if you don't use the Unicode flag and what mIRC must write contains Unicode characters (codepoint > 255), mIRC does convert those Unicode characters to UTF-8 before writing, because it must then be writing using an 8-bit encoding (ANSI/Latin-1, same shit).
When you write to the mapped file without the Unicode flag set, mIRC does not decode using UTF-8 but with Latin-1, so your emoji will result in 4 characters being displayed, not an emoji. Once again, if it does not do that, it's a bug in mIRC.
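The U+1F601 example can be checked with a few lines of Python. This is just a sketch of the encoding arithmetic; it says nothing about which decoding either client actually applies.

```python
utf8_bytes = "\U0001F601".encode("utf-8")

# The four UTF-8 bytes of the emoji, in decimal.
print(list(utf8_bytes))                  # [240, 159, 152, 129]

# Decoded as Latin-1, each byte becomes its own character.
as_latin1 = utf8_bytes.decode("latin-1")
print(len(as_latin1))                    # 4
print([ord(c) for c in as_latin1])       # [240, 159, 152, 129]
print(as_latin1[0])                      # ð

# Decoded as UTF-8, the same four bytes are one character.
as_utf8 = utf8_bytes.decode("utf-8")
print(len(as_utf8))                      # 1
```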
JB Updated by JD Byrnes 8 months ago
At this point I feel like you're just getting argumentative for no reason. It's clear we both know how text encodings work (and popular ones).
Oui Ouims wrote in #note-5:
> Well, maybe you're not using UTF-8 in anything and that's my mistake, but you're still using the term UTF-8 incorrectly in your sentences, for example "Tried sending UTF8 data (non Unicode)". UTF-8 is an encoding for Unicode; you can't possibly be sending UTF-8 without sending Unicode, that does not make sense. UTF-8 uses 8-bit code units and is compatible with ASCII: the letter C in ASCII/ANSI/Latin-1/UTF-8 always takes the same single byte, 67.
I understand that UTF-8 is Unicode. "Non-Unicode" refers to the fact that the Unicode flag is unset.
> It is true that mIRC's documentation tends to use the term Unicode, but sometimes it uses it to mean UTF-16 and sometimes UTF-8.
I checked both mIRC.chm and versions.txt; "UTF16"/"UTF-16" is never mentioned. "Unicode" is used in its place, as stated earlier. The UTF-8 stuff is referred to as "UTF-8" or simply "UTF".
> If you're not using the Unicode flag, mIRC is just not doing anything, it just writes the bytes it gets from your data. The only thing is that if you don't use the Unicode flag and what mIRC must write contains Unicode characters (codepoint > 255), mIRC does convert those Unicode characters to UTF-8 before writing, because it must then be writing using an 8-bit encoding (ANSI/Latin-1, same shit).
It seems mIRC is treating text as UTF-8 when the Unicode bit is unset (or mUnicode = false for DLLs). This can be verified by sending a "©". I recommend sending it via $chr(169) so you can be sure the text file's encoding isn't affecting it.
> When you write to the mapped file without the Unicode flag set, mIRC does not decode using UTF-8 but with Latin-1, so your emoji will result in 4 characters being displayed, not an emoji. Once again, if it does not do that, it's a bug in mIRC.
My theory was that mIRC may receive Latin-1 and then UTF-8 encode it for display (we used to do this in mIRC 6.xx). Testing reveals mIRC is actually receiving the UTF-8 data and decoding it before display.
-
To sum it up, if mUnicode = false, mIRC sends a UTF-8 encoded PCSTR (on my system, maybe not in other locales). If mUnicode = true, mIRC sends a UTF-16 encoded PCWSTR.
The same is true when mIRC is receiving data (decodes UTF8, etc)
Previous to v7 (ie. v6.35 where mUnicode didn't exist), mIRC would send a PSTR which was probably latin1.
Which to me seems a little silly to have mUnicode (UTF-16), since it's defaulting to UTF-8 anyway. This is probably related to https://forums.mirc.com/ubbthreads.php/topics/223676/mirc-v7-1-languages-and-codepages
OO Updated by Oui Ouims 8 months ago · Edited
I'm trying to be informative, not argumentative. I do understand that you know what the UTF-8/UTF-16 encodings are and where they are (or should be) used.
In my previous message I already explained why you think mIRC is using UTF-8 to encode data when the Unicode flag is not set: you're using Unicode characters in a non-Unicode environment (Unicode flag not set), which results in mIRC using UTF-8 regardless, because that's about its best option. The other option would be to stop the script execution and report an error (you're asking it to use Unicode in a non-Unicode environment).
Testing with $chr(169) is not a good test (read: it's a incorrect test, not proving anything) as that character can fit in a single byte even in utf8, byte 169. This character is not unicode in itself and exists in Latin-1.
I'll repeat once again, if you are not using the unicode flag and you get mIRC to write back data containing unicode code point > 255, it will utf8 encode them instead of reporting an error. mIRC is not 'treating anything as utf8' (this is extremely not well said and why I originally thought you are confused).
To sum it up, you are mistesting, you should test with the character 'é', which exists in Latin-1 as byte 233, but in utf8 is encoded as two bytes é (195 169).
With the unicode flag not set, if you ask mIRC to evaluate $chr(233), it will/should write the single byte 233.
With the unicode flag set, it will send the two bytes 195 169 (é).
If you test this with an emoji with the unicode flag NOT set, because it's a codepoint > 255, you'll be in the context I described twice now, and mIRC WILL use utf8:
https://www.compart.com/en/unicode/U+1F600 it's 0xD83D 0xDE00 in utf16 aka you can evaluate $chr($base(D83D,16,10,2)) $+ $chr($base(DE00,16,10,2)) in mIRC to generate it.
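For reference, the two UTF-16 code units quoted for U+1F600 follow from the standard surrogate-pair arithmetic. A short Python sketch (variable names are illustrative):

```python
code_point = 0x1F600

# Code points above U+FFFF are split into a high and a low surrogate:
# subtract 0x10000, then take the top 10 bits and the bottom 10 bits.
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print(hex(high), hex(low))   # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(code_point).encode("utf-16-be")
print(encoded.hex())         # d83dde00
```

These are the same values passed to `$chr($base(D83D,16,10,2))` and `$chr($base(DE00,16,10,2))` in the mIRC expression above.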
Edit:
> Which to me seems a little silly to have mUnicode (UTF-16), since it's defaulting to UTF-8 anyway
You are confused again, it does not default to UTF8, mUnicode is not silly.
Again, if mIRC 7.x were defaulting to UTF-8 with SendMessage, then asking mIRC to write the character é to the mapped file would always send the two bytes é (195 169).
But if you don't use the unicode flag, it doesn't do that, it sends the single byte 233, so it does not default to utf8.
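For anyone reproducing the $chr(233) test, this Python sketch lists the byte sequences to look for in the mapped file under each candidate encoding. It only shows the encodings themselves; which one a given client writes with or without the Unicode flag is exactly what the test is meant to reveal.

```python
ch = chr(233)  # 'é', U+00E9

latin1 = list(ch.encode("latin-1"))
utf8 = list(ch.encode("utf-8"))
utf16le = list(ch.encode("utf-16-le"))

print(latin1)    # [233]
print(utf8)      # [195, 169]
print(utf16le)   # [233, 0]

# The two UTF-8 bytes viewed through Latin-1 show up as two characters.
mojibake = ch.encode("utf-8").decode("latin-1")
print([ord(c) for c in mojibake])   # [195, 169]
```

A single byte 233 points at an 8-bit codepage, 195 169 at UTF-8, and 233 0 at UTF-16 (little-endian).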