2023-02-27

In ruby, optparse raises an error when a filename contains certain characters

I'm using optparse in a ruby program (ruby 2.7.1p83) under Linux. If any of the command-line arguments are filenames with "special" characters in them, the parse! method fails with this error:

invalid byte sequence in UTF-8

This is the code which fails ...

parser = OptionParser.new {
  |opts|
  ... etc. ...
}
parser.parse! # error occurs here

I know about the scrub method and other ways to do encoding in ruby. However, the place where the error occurs is in a library routine (OptionParser#parse!), and I have no control over how this library routine deals with strings.

I could pre-process the command-line arguments and replace the special characters in these arguments with an acceptable encoding, but then, in the case where the argument is a file name, I will be unable to open that file later in the program, because the filename I have accepted into the program will have been altered from the file's original name.

I could do something complicated: pre-traverse the arguments, build a hashmap where the key is the encoded argument and the value is the original argument, change the ARGV values to the encoded values, parse the encoded arguments using OptionParser, and then, after OptionParser completes, go through the resulting arguments and use the hashmap to replace the encoded arguments with their original values ... and then continue with the program.
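The mapping approach just described could be sketched roughly as follows. The helper name parse_with_originals is hypothetical, String#scrub with a '?' replacement stands in for "an acceptable encoding", and the sketch assumes that no two arguments scrub to the same text (if they did, the later one would win in the hash).

```ruby
require 'optparse'

# Hypothetical helper (not part of optparse): parse a scrubbed copy of argv,
# then translate the leftover arguments back to their original byte strings.
def parse_with_originals(parser, argv)
  originals = {}
  safe_argv = argv.map do |arg|
    safe = arg.scrub('?')                     # valid copy; arg itself is untouched
    originals[safe] = arg unless safe == arg  # remember only the altered ones
    safe
  end
  # parse (without the bang) works on a copy and returns the non-option arguments.
  parser.parse(safe_argv).map { |arg| originals.fetch(arg, arg) }
end

# Stand-in for ARGV: one name with the raw byte 0xC4, tagged as UTF-8.
argv = ["\xC4foo".b.force_encoding(Encoding::UTF_8), 'rtest.rb']
rest = parse_with_originals(OptionParser.new, argv)
puts rest.map { |name| name.bytes.length }.inspect
```

Only the leftover (non-option) arguments are mapped back here; a value handed to an option block, such as the part after --file=, would still arrive in its scrubbed form.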

But I'm hoping that there would be a much simpler way to solve this problem in ruby.
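One possibility that may be simpler, offered as a sketch rather than a confirmed fix: re-tag each argument as binary with String#b before parsing. This leaves the bytes unchanged, so the name can still be used to open the file. It assumes that the patterns optparse matches against are ASCII-only, as the one quoted from optparse.rb in this post is. The helper name binary_args is made up for illustration.

```ruby
require 'optparse'

# Hypothetical helper: copy each argument and tag the copy as ASCII-8BIT.
# The bytes are identical; only the encoding label changes, and a binary
# string is never "invalid".
def binary_args(argv)
  argv.map(&:b)
end

# Stand-in for ARGV: one name with the raw byte 0xC4, tagged as UTF-8.
argv = ["\xC4foo".b.force_encoding(Encoding::UTF_8), 'rtest.rb']
rest = OptionParser.new.parse(binary_args(argv))

rest.each { |name| puts name.inspect }
```

In the real script, argv would be ARGV. The strings that come back are tagged ASCII-8BIT, so anything that prints them or compares them with UTF-8 text may need force_encoding or scrub at that point.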

Thank you in advance for any ideas or suggestions.

UPDATE: Here is more detailed info ...

I wrote the following minimal program called rtest.rb in order to test this:

#!/usr/bin/env run-ruby
# -*- ruby -*-

require 'optparse'

parser = OptionParser.new {
}
parser.parse!

Process.exit(0)

I ran it as follows, with the only files present in the current directory being rtest.rb itself, and another file having this name: Äfoo ...

export LC_TYPE='en_us.UTF-8'
export LC_COLLATE='en_us.UTF-8'
./rtest.rb *

It generated the following error and stack trace ...

Traceback (most recent call last):
    7: from /home/hippo/bin/rtest.rb:8:in `<main>'
    6: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1691:in `parse!'
    5: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1666:in `permute!'
    4: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1569:in `order!'
    3: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `parse_in_order'
    2: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1575:in `catch'
    1: from /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `block in parse_in_order'
/opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb:1579:in `===': invalid byte sequence in UTF-8 (ArgumentError)

Here is what appears in the pertinent section of the file /opt/rubies/ruby-2.7.1/lib/ruby/2.7.0/optparse.rb. See line 1579 ...

 1572   def parse_in_order(argv = default_argv, setter = nil, &nonopt)  # :nodoc:
 1573     opt, arg, val, rest = nil
 1574     nonopt ||= proc {|a| throw :terminate, a}
 1575     argv.unshift(arg) if arg = catch(:terminate) {
 1576       while arg = argv.shift
 1577         case arg
 1578           # long option
 1579           when /\A--([^=]*)(?:=(.*))?/m
 1580             opt, rest = $1, $2

In other words, the regex match on the argument is failing due to this encoding issue.
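The failing comparison can probably be reproduced without optparse at all. The sketch below assumes that the argument arrives as the single byte 0xC4 followed by "foo" and that ruby tags it as UTF-8; the string is built by hand here instead of being read from ARGV.

```ruby
# Assumption: this mirrors the argument the shell passes for the odd file name.
arg = "\xC4foo".b.force_encoding(Encoding::UTF_8)

# The long-option pattern quoted from optparse.rb.
long_option = /\A--([^=]*)(?:=(.*))?/m

puts arg.valid_encoding?   # false: 0xC4 opens a two-byte sequence, but "f" cannot continue it

begin
  long_option === arg      # the comparison made by `case arg` / `when /.../`
  puts 'no error raised'
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end

# The same bytes tagged as binary can be tested against the pattern.
puts long_option === arg.b # false: it is simply not a long option
```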

When I have time (not right away, unfortunately), I'll put some code into that module to do encoding of the arg variable, to see if this might fix the problem.

FURTHER UPDATE: I am running under Ubuntu 20.04, and the version of ruby that's offered is 2.7.0. I also managed to get 2.7.1 running on my ancient Debian 8 box. This error occurs in both environments. I would have to install a newer version of ruby or compile it from source before I could try version 2.7.7 or version 3.x.

YET ANOTHER UPDATE: I had some unexpected spare time, and so I built ruby-3.3.0 from source and re-ran the test. I got the same error!

% /opt/local/rubies/ruby-3.3.0/bin/ruby ./rtest.rb *
/opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `===': invalid byte sequence in UTF-8 (ArgumentError)
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1640:in `block in parse_in_order'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `catch'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1636:in `parse_in_order'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1630:in `order!'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1739:in `permute!'
    from /opt/local/rubies/ruby-3.3.0/lib/ruby/3.3.0+0/optparse.rb:1764:in `parse!'
    from ./rtest.rb:8:in `<main>'

However, I now think the error occurs because the filename is encoded in an unusual manner. If I do echo * in that directory, I see this, which is what I expect:

% echo *
Äfoo rtest.rb

However, if I do /bin/ls in the same directory, I see this:

% /bin/ls *
''$'\304''foo'   rtest.rb

And even the OS can't recognize the file with the name specified as follows ...

% /bin/cat 'Äfoo'
/bin/cat: Äfoo: No such file or directory

But if I use the longer, encoded file name, the OS has no trouble accessing the file ...

% /bin/cat ''$'\304''foo'
File contents
File contents

The ls command seems to know how to encode the Äfoo filename into ''$'\304''foo', but ruby doesn't seem to know how to do this.
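Another reading of the ls output, offered as an interpretation rather than something confirmed above: \304 is octal for the byte 0xC4, which is Ä in ISO-8859-1 (Latin-1), whereas UTF-8 encodes Ä as the two bytes C3 84. Under that assumption the file name on disk is a Latin-1 byte string, ls is only quoting the raw byte, and typing Äfoo in a UTF-8 terminal produces a different byte sequence, which would explain why cat cannot find it. A small sketch to compare the two:

```ruby
# Helper for display only: show a string's bytes as hex.
def hex(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

on_disk = "\xC4foo".b   # the bytes ls reports: \304 f o o
typed   = "\u00C4foo"   # "Äfoo" as a UTF-8 terminal would send it

puts hex(on_disk)       # C4 66 6F 6F
puts hex(typed)         # C3 84 66 6F 6F

# Reading the on-disk bytes as ISO-8859-1 and converting gives the UTF-8 text.
as_utf8 = on_disk.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts as_utf8 == typed   # true
```

The converted string is only suitable for display; opening the file still requires the original bytes.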


