Sanitizing encoding errors in input data in JRuby
In my JRuby application, I get input from two sources:
- External files
- A Java program, which calls my JRuby code and passes data to me
Some of the external data is supposed to be encoded as ISO-8859-1, while I internally process it as UTF-8 and also produce UTF-8 as output.
Unfortunately, there are sometimes encoding errors: the data occasionally contains bytes which are not valid ISO-8859-1, and this is not going to be fixed. The specification requires me to simply throw away those illegal input bytes.
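To make the requirement concrete (using the same bytes as in the error case further down), this is what I expect; the desired result is only stated as a comment, since I do not yet have working code for it:
raw = [0x31, 0x8A, 0x43].pack('C*')   # "1\x8AC"; the 0x8A byte is not legal input
# After sanitizing and transcoding I expect the two-character UTF-8
# string "1C", i.e. the offending 0x8A byte is simply dropped.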
For a file, I'm reading the data using
string = File.new(filename,
                  external_encoding: Encoding::ISO_8859_1,
                  internal_encoding: Encoding::UTF_8,
                  converters: UTF8_CONVERTER).read
The converters clause takes care that illegal input bytes are skipped.
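For reference, a stripped-down, self-contained version of the file path looks roughly like this (my converter is omitted here, and the example only uses legal bytes):
require 'tempfile'

# Write two legal ISO-8859-1 bytes and read them back, letting the IO
# layer transcode from ISO-8859-1 to UTF-8.
Tempfile.create('demo') do |f|
  f.binmode
  f.write([49, 67].pack('C*'))   # "1C" as raw bytes
  f.flush
  io = File.new(f.path,
                external_encoding: Encoding::ISO_8859_1,
                internal_encoding: Encoding::UTF_8)
  text = io.read
  io.close
  puts text            # => 1C
  puts text.encoding   # => UTF-8
end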
For a string received from the Java side, I could of course convert it to UTF-8 by doing
string = iso_string.encode(Encoding::UTF_8)
but how could I catch illegal characters here? From my understanding of the Ruby docs for the encode
method, the options which can be given after the destination encoding don't provide a converters key.
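The only options I can find documented for encode are :invalid, :undef, :replace and :fallback. A sketch of how I would use them (I have not verified whether this actually drops the offending byte the way my file-based converter does):
iso_string = [49, 138, 67].pack('C*').force_encoding(Encoding::ISO_8859_1)
utf_string = iso_string.encode(Encoding::UTF_8,
                               invalid: :replace,   # bytes invalid in the source encoding
                               undef:   :replace,   # characters with no UTF-8 mapping
                               replace: '')         # replace them with nothing, i.e. drop them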
UPDATE
Here is a simple example to demonstrate the problem:
(1) Good case (no error)
s = [49, 67].pack('C*')
puts s
puts s.encoding
u = s.encode(Encoding::UTF_8)
puts u
puts u.encoding
This prints
1C
ASCII-8BIT
1C
UTF-8
(2) Error case
x = [49, 138, 67].pack('C*')
x.encode(Encoding::UTF_8)
raises, as expected, UndefinedConversionError: "\x8A" from ASCII-8BIT to UTF-8
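(Side note: the ASCII-8BIT tag comes from pack. If I understand the two-argument form of encode correctly, I could also state the source encoding explicitly, but as far as I can tell that would merely map the byte to U+008A instead of dropping it:
x.encode(Encoding::UTF_8, Encoding::ISO_8859_1)   # presumably "1\u008AC"; the byte is kept
so that alone does not satisfy the specification.)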
What I tried (although it is not documented):
t = x.encode(external_encoding: Encoding::ISO_8859_1, internal_encoding: Encoding::UTF_8, converters: UTF8_CONVERTER)
Interestingly, this got rid of the exception, but nevertheless the conversion did not succeed. If I do
t.encoding
I still see ASCII-8BIT. It seems that nothing was converted at all. What I would like is for the illegal character to be removed, i.e. replaced by the empty string, so that in this case t would come out as the UTF-8 string "1C".
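In case it clarifies what I am after, this is the kind of crude fallback I could write by hand. The assumption that the illegal bytes are exactly the C1 range 0x80..0x9F is mine and only serves as an illustration:
# Crude fallback sketch. Assumption (mine, for illustration only): the
# "illegal" bytes are exactly the C1 control range 0x80..0x9F, which is
# where \x8A falls.
def sanitize_iso(raw)
  raw.bytes
     .reject { |b| (0x80..0x9F).cover?(b) }   # drop the offending bytes
     .pack('C*')
     .force_encoding(Encoding::ISO_8859_1)
     .encode(Encoding::UTF_8)
end

puts sanitize_iso([49, 138, 67].pack('C*'))   # => 1C
I would prefer a built-in way, something like the converters option in the file case, over such a hand-written filter.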