2020-09-27

How to write Chinese characters to a file based on Unicode code points in Python 3

I am trying to write Chinese characters to a CSV file based on their Unicode code points, which come from a text file in unicode.org/Public/zipped/13.0.0/Unihan.zip. One example character is U+9109.

In the example below I get the correct output when I hard-code the value (line 8), but every variation I've tried for generating the bytes from the code point (lines 14-16) produces the wrong result.

I'm running this in Python 3.8.3 on a Debian-based Linux distro.

Minimal working (broken) example:

  1 #!/usr/bin/env python3
  2 
  3 def main():
  4 
  5     output = open("test.csv", "wb")
  6 
  7     # Hardcoded values work just fine
  8     output.write('\u9109'.encode("utf-8"))
  9 
 10     # Comma separation
 11     output.write(','.encode("utf-8"))
 12 
 13     # Problem is here
 14     codepoint = '9109'
 15     u_str = '\\' + 'u' + codepoint
 16     output.write(u_str.encode("utf-8"))
 17 
 18     # End with newline
 19     output.write('\n'.encode("utf-8"))
 20 
 21     output.close()
 22 
 23 if __name__ == "__main__":
 24     main()

Executing and viewing results:

example $
example $./test.py 
example $
example $cat test.csv 
鄉,\u9109
example $


The expected output would look like this (Chinese character occurring on both sides of the comma):

example $
example $./test.py 
example $cat test.csv 
鄉,鄉
example $
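One possible way to get the expected output, as a minimal sketch rather than a definitive fix: the `\u9109` escape is only interpreted by the Python parser inside a string literal, so concatenating a backslash, `u`, and the digits at runtime just produces six ordinary characters. Converting the hexadecimal digits to an integer with `int(codepoint, 16)` and passing that to `chr()` yields the character itself. The helper name `char_from_codepoint` and the handling of an optional `U+` prefix are assumptions made for illustration; they are not part of the original script.

```python
#!/usr/bin/env python3

def char_from_codepoint(codepoint):
    """Return the character for a hexadecimal code point string.

    Hypothetical helper: assumes input such as '9109' or 'U+9109'.
    """
    # Drop an optional 'U+' prefix (assumed input format)
    if codepoint.upper().startswith("U+"):
        codepoint = codepoint[2:]
    # Parse the hex digits as an integer, then map it to a character
    return chr(int(codepoint, 16))


def main():
    # Text mode with an explicit encoding lets Python do the UTF-8
    # encoding, so no .encode() call is needed on each write
    with open("test.csv", "w", encoding="utf-8") as output:
        # Hardcoded escape, resolved by the parser
        output.write("\u9109")
        # Comma separation
        output.write(",")
        # Character built at runtime from the code point string
        output.write(char_from_codepoint("9109"))
        # End with newline
        output.write("\n")


if __name__ == "__main__":
    main()
```

Opening the file in text mode is a design choice, not a requirement: keeping `"wb"` and writing `char_from_codepoint("9109").encode("utf-8")` should produce the same bytes.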



from Recent Questions - Stack Overflow https://ift.tt/30e5gpH
