2022-08-23

Parse p nodes text including sibling nodes until the next p node

Weird title, I know. I am trying to parse an XML document that is more or less structured in paragraphs. However, sometimes there are additional nodes which should be inside a paragraph but simply aren't.

What I need is to find each paragraph and also select everything that follows it, up to the next paragraph or a "termination" node (here, the title node).

Here's an example:

<p typ="ct">(1) This is rule one</p>
<ol>
  <li>With some text</li>
  <li>that I want to parse</li>
</ol>
<p typ="ct">(2) And here is rule two</p>
<p typ="ct">(3) and rule three</p>
<title>Another section</title>

My desired output would be something like:

[
  "(1) This is rule one\nWith some text\nthat I want to parse", 
  "(2) And here is rule two", 
  "(3) and rule three"
]

I know I can select each paragraph using something like soup.select("p[typ=ct]") or soup.find_all("p", attrs=dict(typ="ct")), but it's the parts in between that I am not sure how to parse in a soupy way.
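For reference, one possible approach is to walk each paragraph's following siblings and stop at the next p or at a terminating tag. This is only a minimal sketch under a few assumptions: the markup is parsed with Python's built-in "html.parser", the paragraphs and the stray nodes are siblings at the same level (as in the example above), and the names `collect_rules` and `stop_tags` are made up for illustration.

```python
from bs4 import BeautifulSoup, Tag

xml = """
<p typ="ct">(1) This is rule one</p>
<ol>
  <li>With some text</li>
  <li>that I want to parse</li>
</ol>
<p typ="ct">(2) And here is rule two</p>
<p typ="ct">(3) and rule three</p>
<title>Another section</title>
"""


def collect_rules(markup, stop_tags=("title",)):
    """Return one string per <p typ="ct">, including trailing sibling text.

    `stop_tags` is a hypothetical parameter listing the "termination" tags.
    """
    soup = BeautifulSoup(markup, "html.parser")
    rules = []
    for p in soup.find_all("p", attrs={"typ": "ct"}):
        parts = [p.get_text(strip=True)]
        for sibling in p.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace strings between tags
            if sibling.name == "p" or sibling.name in stop_tags:
                break  # next paragraph or terminating node reached
            # Collect every non-empty text fragment inside the stray node
            parts.extend(sibling.stripped_strings)
        rules.append("\n".join(parts))
    return rules


rules = collect_rules(xml)
```

With the sample above, `rules` should match the desired output: the first entry carries the two list items on their own lines, and the other two entries contain only the paragraph text. If the stray nodes can be nested differently from the paragraphs, the sibling walk would need to be adapted.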


