Parse p nodes text including sibling nodes until the next p node
Weird title, I know. I am trying to parse an XML document that is loosely structured into paragraphs. However, sometimes there are additional nodes that should be inside a paragraph but simply aren't.
What I need is to find each paragraph and also collect everything that follows it, up to the next paragraph or a "termination" node, which here is the title node.
Here's an example:
<p typ="ct">(1) This is rule one</p>
<ol>
<li>With some text</li>
<li>that I want to parse</li>
</ol>
<p typ="ct">(2) And here is rule two</p>
<p typ="ct">(3) and rule three</p>
<title>Another section</title>
My desired output would be something like:
[
"(1) This is rule one\nWith some text\nthat I want to parse",
"(2) And here is rule two",
"(3) and rule three"
]
I know I can select each paragraph using something like soup.select("p[typ=ct]")
or soup.find_all("p", attrs=dict(typ="ct")),
but it's the parts in between that I am not sure how to parse in a soupy way.
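One possible approach, as a rough sketch rather than a definitive answer: for each matching paragraph, walk its following siblings with `.next_siblings` and stop as soon as another `p` or the `title` node appears. The helper name `collect_paragraphs` and the `stop_names` parameter are made up for illustration. The sketch also assumes the built-in `html.parser` is good enough for this fragment and that everything to be collected lives in sibling tags, not in loose text between tags.

```python
from bs4 import BeautifulSoup, Tag

xml = """<p typ="ct">(1) This is rule one</p>
<ol>
<li>With some text</li>
<li>that I want to parse</li>
</ol>
<p typ="ct">(2) And here is rule two</p>
<p typ="ct">(3) and rule three</p>
<title>Another section</title>"""


def collect_paragraphs(soup, stop_names=("p", "title")):
    """Hypothetical helper: text of each p[typ=ct] plus its trailing siblings."""
    results = []
    for p in soup.find_all("p", attrs={"typ": "ct"}):
        parts = [p.get_text(strip=True)]
        for sibling in p.next_siblings:
            # Skip whitespace and other bare strings between tags (assumption:
            # no meaningful text sits directly between sibling tags).
            if not isinstance(sibling, Tag):
                continue
            # The next paragraph or the terminating node ends this group.
            if sibling.name in stop_names:
                break
            # Otherwise gather every non-empty text fragment inside the sibling.
            parts.extend(sibling.stripped_strings)
        results.append("\n".join(parts))
    return results


soup = BeautifulSoup(xml, "html.parser")
paragraphs = collect_paragraphs(soup)
print(paragraphs)
```

With the example document above, this should yield the three strings from the desired output. If the real file needs a stricter XML parser, or has other terminating nodes, the parser argument and `stop_names` would have to be adjusted accordingly.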