2023-09-11

Sanitizing encoding errors on input data in JRuby

In my JRuby application, I get input from two sources:

  • External files
  • A Java program, which calls my JRuby code and passes data to me

Some of the external data is supposed to be encoded as ISO_8859_1, while I process it internally as UTF_8 and also produce UTF_8 as output.

Unfortunately, there are sometimes encoding errors: the data occasionally contains bytes which are not valid ISO_8859_1, and this isn't going to be fixed. The specification requires simply throwing away those illegal input bytes.

For a file, I open it using

string = File.new(filename, {external_encoding: Encoding::ISO_8859_1, internal_encoding: Encoding::UTF_8, converters: UTF8_CONVERTER})

The converters option ensures that illegal input bytes are skipped.

For a string received from the Java side, I could of course convert it to UTF_8 by doing

string = iso_string.encode(Encoding::UTF_8)

but how can I catch illegal characters here? From my understanding of the Ruby docs for the encode method, the options that can be given after the destination encoding don't include a converters key.

UPDATE

Here is a simple example to demonstrate the problem:

(1) Good case (no error)

s = [49, 67].pack('C*')
puts s
puts s.encoding
u = s.encode(Encoding::UTF_8)
puts u
puts u.encoding

This prints

1C
ASCII-8BIT
1C
UTF-8

(2) Error case

x = [49, 138, 67].pack('C*')
x.encode(Encoding::UTF_8)

raises, as expected, UndefinedConversionError: "\x8A" from ASCII-8BIT to UTF-8

What I tried (though it is not documented):
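A minimal sketch of one possible way around the exception, assuming the string is still tagged ASCII-8BIT as in the example: String#encode accepts the invalid:, undef: and replace: options after the destination encoding, and an empty replacement string drops the offending bytes. Note that with an ASCII-8BIT source every byte above 0x7F counts as undefined, so this would also discard legitimate ISO_8859_1 characters.

```ruby
# Same bytes as in the error case: "1", 0x8A, "C", tagged ASCII-8BIT.
x = [49, 138, 67].pack('C*')

# undef: :replace with replace: '' drops bytes that have no UTF-8 mapping;
# invalid: :replace does the same for malformed byte sequences.
cleaned = x.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')

puts cleaned           # 1C
puts cleaned.encoding  # UTF-8
```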

t = x.encode(external_encoding: Encoding::ISO_8859_1, internal_encoding: Encoding::UTF_8, converters: UTF8_CONVERTER)

Interestingly, this got rid of the exception, but the conversion nevertheless did not succeed. If I do

t.encoding

I still see ASCII-8BIT. It seems that nothing has been converted. I would like the illegal character to be removed, i.e. in this case t should be the string 1C.
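As far as I can tell from the String#encode documentation, calling encode without a destination encoding transcodes to Encoding.default_internal, and when that is unset the string comes back unchanged, which would explain the ASCII-8BIT result. The following is only a sketch of the behaviour I am after. It rests on two assumptions of mine: that "illegal" means the C1 control range 0x80..0x9F (Ruby's ISO-8859-1 transcoder maps every byte to a code point, so nothing is rejected during the conversion itself), and that the helper name latin1_to_utf8 is purely hypothetical.

```ruby
# Hypothetical helper, not part of Ruby or JRuby.
# Assumption: "illegal" bytes are those in the C1 control range 0x80..0x9F.
def latin1_to_utf8(bytes)
  # Re-tag a copy of the raw bytes as ISO-8859-1, then transcode to UTF-8.
  utf8 = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  # Strip the code points U+0080..U+009F that the C1 bytes were mapped to.
  utf8.gsub(/[\u0080-\u009F]/, '')
end

t = latin1_to_utf8([49, 138, 67].pack('C*'))
puts t           # 1C
puts t.encoding  # UTF-8

# A legitimate ISO-8859-1 byte (0xE9, e with acute accent) is preserved.
e_acute = latin1_to_utf8([233].pack('C*'))
puts e_acute == "\u00E9"  # true
```

Unlike passing undef: :replace on an ASCII-8BIT string, this keeps valid accented characters and removes only the range assumed to be illegal.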


