2022-05-21

How can I extract text from a flex container?

I'm a beginner in Java and I'm trying to extract some text from a website. The text, however, sits between an opening and a closing tag, and when I use getByXPath I get everything except the text I need.

This is the layout of the website I'm scraping from (screenshot: Website HTML Layout).

The two highlighted portions are the pieces of text I actually need.

And this is the code I've got so far:

List<HtmlElement> name = (List<HtmlElement>) page.getByXPath("//ul/li/a[@class='title']");
List<HtmlElement> subText = (List<HtmlElement>) page.getByXPath("//ul/li/p[@data-af=' (Secret)']");

This, however, results in two lists:

name, which contains HtmlAnchor objects:

[HtmlAnchor[<a class="title" data-af="10" href="/a180775/daddys-home-achievement">], HtmlAnchor[<a class="title" data-af="11" href="/a180776/protector-achievement">], HtmlAnchor[<a class="title" data-af="12" href="/a180777/sinclairs-solution-achievement">]]

subText, which contains HtmlParagraph objects:

[HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">], HtmlParagraph[<p data-af=" (Secret)">]]

URL if you want to take a look at the whole website: https://truesteamachievements.com/game/BioShock-2-Remastered/achievements

I need the lists to look something like these:

["Daddy's Home", "Protector", "Sinclair's Solution"]
["Found your way back to the ruins of Rapture.", "Defended yourself against Lamb's assault in the train station.", "Joined forces with Sinclair in Ryan Amusements."]

This is the HTML library I'm using: https://htmlunit.sourceforge.io/apidocs/overview-summary.html

Appreciate any help.
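
One possible direction, sketched under assumptions: the XPath expressions above select whole elements, and printing an element shows only its opening tag, so the text still has to be read from each element. The sketch below uses only the JDK's XML and XPath classes on a hypothetical, simplified stand-in for the page markup (the SAMPLE string, the class name and the textOf helper are made up for illustration; the real page is likely structured differently). With HtmlUnit the same idea would presumably be to loop over the elements returned by getByXPath and call getTextContent() on each one, since HtmlUnit elements implement org.w3c.dom.Node; recent HtmlUnit releases also appear to offer asNormalizedText().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TextExtractionSketch {

    // Hypothetical, well-formed stand-in for the page markup (assumption:
    // the achievement name is the text of the <a> and the description is
    // the text of the <p>; the real page may be laid out differently).
    static final String SAMPLE =
          "<ul>"
        + "<li><a class=\"title\" data-af=\"10\" href=\"/a180775/daddys-home-achievement\">Daddy's Home</a>"
        + "<p data-af=\" (Secret)\">Found your way back to the ruins of Rapture.</p></li>"
        + "<li><a class=\"title\" data-af=\"11\" href=\"/a180776/protector-achievement\">Protector</a>"
        + "<p data-af=\" (Secret)\">Defended yourself against Lamb's assault in the train station.</p></li>"
        + "<li><a class=\"title\" data-af=\"12\" href=\"/a180777/sinclairs-solution-achievement\">Sinclair's Solution</a>"
        + "<p data-af=\" (Secret)\">Joined forces with Sinclair in Ryan Amusements.</p></li>"
        + "</ul>";

    // Evaluates the XPath expression and returns the trimmed text content
    // of every matching node, instead of the nodes themselves.
    static List<String> textOf(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() returns the text between the opening and closing tag
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(SAMPLE, "//ul/li/a[@class='title']"));
        System.out.println(textOf(SAMPLE, "//ul/li/p[@data-af=' (Secret)']"));
    }
}
```

The XPath expressions are the same ones as in the question; only the step that turns each matched element into a String is new.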


