Topic: Displaying Windows Unicode in Hex View

storkinsj

  • Community Member
  • Posts: 11
  • Hero Points: 3
Displaying Windows Unicode in Hex View
« on: October 03, 2019, 05:03:38 AM »
Hi,
TL;DR:

    I am finding that SlickEdit 2018 and older do not display Windows Unicode characters correctly, but only in Hex View.
    "UTF8" (as selected in the Notepad save dialog) works fine in Hex View.
    Windows "Unicode" (as selected in the Notepad save dialog, which is actually UTF16) shows garbage characters in Hex View.

Background:
    We write windows software that processes UTF16 characters on Windows. I wish it were UTF8 but it's UTF16.
    Some parts of the program may be written using legacy string processing functions and I want to catch those bugs.
    I have saved Asian characters in two file formats: UTF8 and UTF16 (known as "Unicode" in Notepad). I hope to search for null characters, and other characters that could be mistakenly processed as control characters in the UTF16 stream, by using SlickEdit regular expressions such as \x00. The search works well, but I can't determine which character I found because the UTF16 display in Hex View is garbage. As you can see from the image, UTF8 seems to work perfectly. LOL I wish the data was UTF8.
    Interestingly, the file displays perfectly in Slickedit when I leave Hex View! The UTF16 BOM (first bytes in some unicode files) seems intact in the UTF16 so it's a bummer the hex display is not showing the characters correctly.
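For readers who want to see the difference between the two Notepad formats at the byte level, here is a minimal Python sketch. It assumes Notepad's "Unicode" option writes little-endian UTF-16 with a BOM; the sample string and the hex_dump helper are made up for illustration and have nothing to do with SlickEdit itself.

```python
import codecs

def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

# Sample text (illustrative): ASCII 'A' followed by the CJK character U+4E2D.
text = "A\u4e2d"

# UTF-8: 'A' is one byte, U+4E2D is three bytes.
utf8_bytes = text.encode("utf-8")

# UTF-16 little-endian with a BOM: every BMP character is two bytes,
# so plain ASCII text is interleaved with 0x00 bytes.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(hex_dump(utf8_bytes))   # 41 E4 B8 AD
print(hex_dump(utf16_bytes))  # FF FE 41 00 2D 4E
```

The 00 byte after 41 is why a \x00 search matches on ordinary ASCII text in a UTF16 file.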

    Is this a known bug or is there some way I can get these characters to display properly?

Greg

Clark

  • SlickEdit Team Member
  • Senior Community Member
  • Posts: 6944
  • Hero Points: 531
Re: Displaying Windows Unicode in Hex View
« Reply #1 on: October 03, 2019, 12:42:19 PM »
This is currently a limitation.

You've probably noticed that Utf-8 unicode characters get displayed correctly in Hex view. This is because Utf-8 is one of SlickEdit's native buffer storage formats.

I'm curious. Would it be useful to you if SlickEdit converted the Utf-16 to Utf-8 on load? This would mean that you would not be able to search for Utf-16 byte sequences. It would also mean that the BOM at the top of the file would not be displayed. In addition, seek positions on the left would not be correct. This might not matter for you though. '\x00' would still find a Utf-8 character 0. You could also type in Asian characters to search for since they would actually search for the correct Utf-8 sequence of bytes. On save, SlickEdit would have to convert the data from Utf-8 back to Utf-16. This is already done when you are not in hex mode. I hope this makes sense.

If the above is helpful, I'm not sure you need to use Hex view. You can search for '\x00' when not in Hex view. In fact, you can search for any unicode character if you know the code point. Note that turning on "View>Other Ctrl Characters" will help you view null characters and other control characters. Normal text view could work for you if the entire file is Utf-16 data or can be interpreted that way.
« Last Edit: October 03, 2019, 02:51:31 PM by Clark »

storkinsj

Re: Displaying Windows Unicode in Hex View
« Reply #2 on: October 07, 2019, 04:50:12 AM »
Hi,
   Thanks very much for keeping me from thinking I'm crazy- at least in this way.  ;D
 
   I don't think your workaround would work for me because the display may be incorrect: glyphs (displayed chars) shown from the converted UTF8 may not match their UTF16 counterparts.

   While I couldn't seem to search for the code point correctly in text view, I have worked around the issue with this workflow:
  • Switch UTF16 file to hex view
  • Search with Regex
  • Switch back to Text View and locate cursor

At first I was not sure I could properly identify the character (since 2 bytes are involved and the match could land on the first or the second byte), but then I inserted the character using its hex codes and switched back to text view to confirm I had the right one.
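The "first or second byte" ambiguity can also be resolved outside the editor. The following is a rough Python sketch, not anything SlickEdit does internally: it assumes little-endian UTF-16 data with the BOM already stripped, and the function name find_byte_in_utf16le is hypothetical.

```python
import re

def find_byte_in_utf16le(data: bytes, target: int):
    """Find every occurrence of one byte value in UTF-16-LE data.

    Returns a list of (byte_offset, code_unit_index, which_byte, code_unit),
    where which_byte is "low" or "high" within the 16-bit code unit.
    Assumes the BOM has already been removed from data.
    """
    results = []
    pattern = re.escape(bytes([target]))
    for match in re.finditer(pattern, data):
        offset = match.start()
        unit_index = offset // 2
        unit_bytes = data[unit_index * 2 : unit_index * 2 + 2]
        code_unit = int.from_bytes(unit_bytes, "little")
        # In little-endian order the low byte comes first (even offset).
        which = "low" if offset % 2 == 0 else "high"
        results.append((offset, unit_index, which, code_unit))
    return results

# Illustrative data: 'a' followed by U+5C3A, whose high byte is 0x5C ('\').
sample = "a\u5c3a".encode("utf-16-le")   # bytes: 61 00 3A 5C
for offset, index, which, unit in find_byte_in_utf16le(sample, 0x5C):
    print(f"offset {offset}: {which} byte of code unit {index} = U+{unit:04X}")
```

Dividing the byte offset by two gives the code unit index, which is the position to look at after switching back to text view (for files without surrogate pairs before that point).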

Overall I was able to identify several "risky" UTF16 characters through this exercise, meaning characters whose bytes contain the ASCII codes of Windows PATH control chars.
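A check like this can be automated. Below is a minimal Python sketch; the set of "risky" ASCII characters is an assumption (the usual Windows path separators and reserved filename characters), since the exact list used in the exercise above is not given, and risky_bytes_in_utf16 is a hypothetical helper name.

```python
from typing import List

# Assumed set of ASCII bytes that legacy path handling may treat specially.
RISKY_ASCII = set(b'\\/:*?"<>|')

def risky_bytes_in_utf16(ch: str) -> List[int]:
    """Return the bytes of ch's UTF-16-LE encoding that collide with RISKY_ASCII."""
    return [b for b in ch.encode("utf-16-le") if b in RISKY_ASCII]

# U+5C3A encodes as 3A 5C, i.e. the ASCII codes for ':' and '\'.
for ch in ["\u5c3a", "\u4e2d"]:
    hits = risky_bytes_in_utf16(ch)
    shown = ", ".join(f"0x{b:02X} ({chr(b)!r})" for b in hits) or "none"
    print(f"U+{ord(ch):04X}: {shown}")
```

Looping the same function over a range of code points would produce a full list of characters that byte-oriented string functions could misread.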

I hope this is useful to someone dealing with UTF16. I don't think changing the behavior as you mentioned will help but it's good to know you can switch back and forth from hex view to get the same result as the cursor is updated.

Clark

Re: Displaying Windows Unicode in Hex View
« Reply #3 on: October 07, 2019, 01:21:30 PM »
The regex syntax for searching for a unicode code point is like this:

\N{U+hhhhhhhh}

or

\x{hhhhhhhhh}

If you can figure out what the Utf-16 conversion for the surrogates is, the above will work. The other cases should be simple.
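For reference, the surrogate conversion is standard UTF-16 arithmetic and can be sketched in a few lines of Python. This only shows how a code point above U+FFFF maps to its two 16-bit code units; whether the regex syntax above expects the code point or the individual surrogates is not something this sketch settles, and utf16_surrogates is a hypothetical helper name.

```python
def utf16_surrogates(code_point: int):
    """Return the (high, low) UTF-16 surrogate pair for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = utf16_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00
```

Code points at or below U+FFFF (outside the surrogate range) are stored as a single 16-bit unit with the same value, which is the simple case.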
« Last Edit: October 07, 2019, 03:29:14 PM by Clark »