2023-10-31

How to make a grammar using Python textX exclude tokens

I am trying to parse a network device configuration file. Although the parser has to go through the whole file, I only want to capture a subset of its contents.

So assume that the configuration file is as follows:

bootfile abc.bin
motd "this is a
message for everyone to follow"
.
.
.
group 1
group 2
.
.
.
permit tcp ab.b.c.d b.b.b.y eq 222
permit tcp ab.b.c.d b.b.b.y eq 222
permit tcp ab.b.c.d b.b.b.y eq 222
.
.
.
interface a
  description this is interface a
  vlan 33 

interface b
  description craigs list
  no shut
  vlan 33
  no ip address
.
.
.

I am only trying to capture the interface line and the description and vlan lines, each as is; everything else would be ignored. The contents of each interface would be split into two attributes: valid and invalid.

So the grammar would look something like this:

Config[noskipws]:
  interfaces *= !InterfaceDefinition | InterfaceDefinition
;

InterfaceDefinition:
  interface = intf
  valids *= valid
  invalids *= invalid
;

intf: /^interface .*\n/;
valid: /^ (description|vlan) .*\n/;
invalid: /^(?!(interface|description|vlan) ).*\n/;

The goal is to obtain a Python list of interfaces where each interface has two attributes, valids and invalids, both of which are lists. The valids list would contain the description and vlan entries, and the invalids list would contain everything else.

There are two challenges that I can't seem to address: 1- How do I ignore all the other content that is not an interface definition? 2- How do I ensure that every interface ends up as its own interface and not in the invalids attribute of another interface?

Unfortunately, although the grammar itself is accepted without errors, my understanding of how the parser goes through the text appears to be at fault, since it complains the moment it tries to read any text past the 'interface .*' section.

Additionally, I am currently testing with a file containing only interface definitions, but the goal is to process full files, targeting only the interfaces, so the grammar needs to be able to discard all other content.


Updated progress

Originally, after Igor's first answer, I was able to create a grammar that would successfully parse a dummy configuration file I had, though the results were not the ones desired, probably due to my ignorance. With Igor's second, updated answer, I decided to refactor the original grammar and simplify it to try to match my sample dummy configuration.

My goal at the model level is to have an object resembling the following pseudo structure:

class network_config:

    def __init__(self):
        self.invalid = []     # Entries that do not match any of the
                              # higher-level hierarchy objects
        self.interfaces = []  # Interface definitions

class Interface:

    def __init__(self):
        self.name = ""
        self.vlans = []
        self.description = ""
        self.junk = []  # All other configuration lines within the
                        # interface that match neither vlan nor
                        # description

The dummy configuration file (data to be parsed) looks as follows:

junk
irrelevant configuration line
interface GigabitEthernet0/0/0
   description 3 and again
   nonsense
   vlan 33
   this and that
   vlan 16
interface GigabitEthernet0/0/2
   something for the nic
   vlan 22
   description here and there
! a simple comment
intermiediate
more nonrelated information

interface GigabitEthernet0/0/3
   this is junk too
   vlan 99
don't forget this
interface GigabitEthernet0/0/1
interface GigabitEthernet0/0/9
nothing of interest
silly stuff
some final data
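
As a reference point for checking any grammar against this data, here is a minimal plain-Python sketch (not textX) that builds the structure described above from a shortened copy of the sample. It assumes that indentation marks the lines belonging to an interface, which the sample suggests but the post does not state. The names NetworkConfig, parse_config, and SAMPLE are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field

# Shortened excerpt of the dummy configuration shown above.
SAMPLE = """\
junk
interface GigabitEthernet0/0/0
   description 3 and again
   nonsense
   vlan 33
   vlan 16
! a simple comment
interface GigabitEthernet0/0/1
some final data
"""

@dataclass
class Interface:
    name: str
    vlans: list = field(default_factory=list)
    description: str = ""
    junk: list = field(default_factory=list)

@dataclass
class NetworkConfig:  # hypothetical name for the top-level model
    invalid: list = field(default_factory=list)
    interfaces: list = field(default_factory=list)

def parse_config(text):
    """Classify each line; ASSUMES indented lines belong to the last interface."""
    config = NetworkConfig()
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no information
        if line.startswith("interface "):
            current = Interface(name=line.split(None, 1)[1])
            config.interfaces.append(current)
        elif current is not None and line[0].isspace():
            body = line.strip()
            if body.startswith("vlan "):
                current.vlans.append(body.split(None, 1)[1])
            elif body.startswith("description "):
                current.description = body.split(None, 1)[1]
            else:
                current.junk.append(body)
        else:
            current = None  # an unindented line closes the interface
            config.invalid.append(line.strip())
    return config

config = parse_config(SAMPLE)
for interface in config.interfaces:
    print(interface)
print(config.invalid)
```

A textX model that produces the same grouping for the full sample would satisfy both challenges listed earlier: top-level junk is kept apart, and every interface line starts a new interface.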

And the new textX grammar that I have created is as follows:

Config:
    (
        invalid*=Junk
        | interfaces*=Interface
    )*
;

Junk:
   /(?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n/  // <- consume that is not a 'vlan', 'description', nor 'interface'
;

Interface:
   'interface' name=/[^\n]+\n/
   ( description+=Description
   | vlans*=Vlan
   | invalids*=InterfaceJunk
   )*
;

Description:
    /description[^\n]+\n/
;

Vlan:
    /vlan[^\n]+\n/
;

InterfaceJunk:
    /(?!((interface)|(vlan)|(description))).[^\n]\n/  // <- consume everything that is not an interface, vlan, or description
;
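
The regular expressions in these rules can be checked in isolation with Python's re module before involving the parser. The sketch below tests the InterfaceJunk pattern against a few lines with leading whitespace already removed, on the assumption (consistent with the trace further down, where matching resumes after the indentation) that whitespace is skipped before each match. VARIANT is a hypothetical alternative, not part of the original grammar.

```python
import re

# Pattern copied from the InterfaceJunk rule above.
ORIGINAL = r"(?!((interface)|(vlan)|(description))).[^\n]\n"
# Hypothetical variant: '+' added so the rest of the line is consumed.
VARIANT = r"(?!((interface)|(vlan)|(description)))[^\n]+\n"

def matches(pattern, line):
    """Return True if the pattern matches at the start of the line."""
    return re.match(pattern, line) is not None

for line in ["nonsense\n", "ab\n", "vlan 33\n", "interface x\n"]:
    print(repr(line), matches(ORIGINAL, line), matches(VARIANT, line))
```

As written, the original pattern only accepts lines of exactly two characters, because `.[^\n]` consumes two characters and then requires the newline. A line such as "nonsense" is therefore rejected by InterfaceJunk as well as by Vlan and Description.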

To my surprise, when I ran it against the data, it went into an infinite loop. I also noticed that changing the root rule from

Config:
    (
        invalid*=Junk
        | interfaces*=Interface
    )*
;

*** PARSING MODEL ***
>> Matching rule Model=Sequence at position 0 => *junk irrel
   >> Matching rule Config=ZeroOrMore in Model at position 0 => *junk irrel
      >> Matching rule OrderedChoice in Config at position 0 => *junk irrel
         >> Matching rule __asgn_zeroormore=ZeroOrMore[invalid] in Config at position 0 => *junk irrel
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 0 => *junk irrel
            ++ Match 'junk
' at 0 => '*junk *irrel'
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 5 => junk *irrelevant
            ++ Match 'irrelevant configuration line
' at 5 => 'junk *irrelevant configuration line *'
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 35 => tion line *interface
            -- NoMatch at 35
         <<+ Matched rule __asgn_zeroormore=ZeroOrMore[invalid] in __asgn_zeroormore at position 35 => tion line *interface
      <<+ Matched rule OrderedChoice in Config at position 35 => tion line *interface
      >> Matching rule OrderedChoice in Config at position 35 => tion line *interface
         >> Matching rule __asgn_zeroormore=ZeroOrMore[invalid] in Config at position 35 => tion line *interface
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 35 => tion line *interface
            -- NoMatch at 35
         <<- Not matched rule __asgn_zeroormore=ZeroOrMore[invalid] in __asgn_zeroormore at position 35 => tion line *interface
      <<- Not matched rule OrderedChoice in Config at position 35 => tion line *interface
      >> Matching rule OrderedChoice in Config at position 35 => tion line *interface
         >> Matching rule __asgn_zeroormore=ZeroOrMore[invalid] in Config at position 35 => tion line *interface
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 35 => tion line *interface
            -- NoMatch at 35
         <<- Not matched rule __asgn_zeroormore=ZeroOrMore[invalid] in __asgn_zeroormore at position 35 => tion line *interface

to

Config:
    (
        invalid*=Junk interfaces*=Interface
    )*
;

*** PARSING MODEL ***
>> Matching rule Model=Sequence at position 0 => *junk irrel
   >> Matching rule Config=ZeroOrMore in Model at position 0 => *junk irrel
      >> Matching rule Sequence in Config at position 0 => *junk irrel
         >> Matching rule __asgn_zeroormore=ZeroOrMore[invalid] in Config at position 0 => *junk irrel
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 0 => *junk irrel
            ++ Match 'junk
' at 0 => '*junk *irrel'
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 5 => junk *irrelevant
            ++ Match 'irrelevant configuration line
' at 5 => 'junk *irrelevant configuration line *'
            ?? Try match rule Junk=RegExMatch((?s)(?!((interface)|(vlan)|(description)).)[^\n]+\n) in __asgn_zeroormore at position 35 => tion line *interface
            -- NoMatch at 35
         <<+ Matched rule __asgn_zeroormore=ZeroOrMore[invalid] in __asgn_zeroormore at position 35 => tion line *interface
         >> Matching rule __asgn_zeroormore=ZeroOrMore[interfaces] in Config at position 35 => tion line *interface
            >> Matching rule Interface=Sequence in __asgn_zeroormore at position 35 => tion line *interface
               ?? Try match rule StrMatch(interface) in Interface at position 35 => tion line *interface
               ++ Match 'interface' at 35 => 'tion line *interface* '
               >> Matching rule __asgn_plain=Sequence[name] in Interface at position 44 =>  interface* GigabitEt
                  ?? Try match rule RegExMatch([^\n]+\n) in __asgn_plain at position 45 => interface *GigabitEth
                  ++ Match 'GigabitEthernet0/0/0
' at 45 => 'interface *GigabitEthernet0/0/0 *'
               <<+ Matched rule __asgn_plain=Sequence[name] in __asgn_plain at position 66 => rnet0/0/0 *   descrip
               >> Matching rule ZeroOrMore in Interface at position 66 => rnet0/0/0 *   descrip
                  >> Matching rule OrderedChoice in Interface at position 66 => rnet0/0/0 *   descrip
                     >> Matching rule __asgn_oneormore=OneOrMore[description] in Interface at position 66 => rnet0/0/0 *   descrip
                        ?? Try match rule Description=RegExMatch(description[^\n]+\n) in __asgn_oneormore at position 69 => t0/0/0    *descriptio
                        ++ Match 'description 3 and again
' at 69 => 't0/0/0    *description 3 and again *'
                        ?? Try match rule Description=RegExMatch(description[^\n]+\n) in __asgn_oneormore at position 96 =>  again    *nonsense
                        -- NoMatch at 96

gave two different results, though neither was what I was hoping for. In the first form, the parser ends up stuck in a loop, continuously looking for the invalid patterns (i.e., Junk). In the second form, the parser gets past the search for invalids and at least finds the first interface, GigabitEthernet0/0/0, but once inside the interface it again gets into an infinite loop.

I was under the impression that ( attr1*=pattern1 | attr2*=pattern2 | attr3*=pattern3 ) meant that the parser would try each of the patterns in turn, but it seems to be stuck on pattern1 for as long as pattern1 is not being found (ordered choice is described as such). I must have something in the grammar that is causing this.

Any hints as to where my misconceptions are?
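
One general PEG property seems relevant to the first trace: a zero-or-more expression always succeeds, even when it consumes nothing, so an ordered choice whose first alternative is zero-or-more never reaches its later alternatives. The following is a toy model of those semantics in plain Python, not Arpeggio's or textX's actual implementation. All function names are hypothetical, and the outer repetition includes a guard against empty matches that the real parser may or may not have.

```python
import re

def regex(pattern):
    """Terminal: return the end position of a match at pos, or None."""
    compiled = re.compile(pattern)
    def parse(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None
    return parse

def zero_or_more(rule):
    """Always succeeds; may return pos unchanged (an empty match)."""
    def parse(text, pos):
        while True:
            nxt = rule(text, pos)
            if nxt is None or nxt == pos:
                return pos
            pos = nxt
    return parse

def one_or_more(rule):
    """Fails (None) unless the rule matches at least once."""
    def parse(text, pos):
        first = rule(text, pos)
        return None if first is None else zero_or_more(rule)(text, first)
    return parse

def ordered_choice(*alternatives):
    """Return the result of the first alternative that succeeds."""
    def parse(text, pos):
        for alt in alternatives:
            nxt = alt(text, pos)
            if nxt is not None:
                return nxt
        return None
    return parse

def repeat(choice, text):
    """Outer ( ... )* with a guard: stop as soon as nothing is consumed."""
    pos = 0
    while True:
        nxt = choice(text, pos)
        if nxt is None or nxt == pos:
            return pos
        pos = nxt

TEXT = "junk\ninterface a\n"
junk = regex(r"(?!interface)[^\n]+\n")
interface = regex(r"interface[^\n]+\n")

star = ordered_choice(zero_or_more(junk), zero_or_more(interface))
plus = ordered_choice(one_or_more(junk), one_or_more(interface))

print(repeat(star, TEXT))  # stops at 5, just before 'interface a'
print(repeat(plus, TEXT))  # reaches 17, the end of the text
```

In the toy model, the zero-or-more version stalls at the first interface line, much like position 35 in the trace, while the one-or-more version consumes the whole text. In textX terms this would presumably correspond to using += rather than *= for the assignments inside the repeated group, though that has not been verified against the grammar above.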


