Theprogrammersfirst: Parsing HTML using jsoup into a Document creates &quot; conversion

2021-04-01

I am trying to parse an HTML response with jsoup, but the resulting Document shows the quote characters in attribute values converted to &quot; entities. Neither the ASCII output settings suggested elsewhere nor calling StringEscapeUtils.unescapeHtml before Jsoup.parse helps.

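     // Try to undo HTML entity escaping before handing the string to jsoup.
     String decodedHTML = StringEscapeUtils.unescapeHtml(htmlD);
     // Feed the previous result forward; passing htmlD here would discard it.
     decodedHTML = Parser.unescapeEntities(decodedHTML, false);
     // The second argument of Jsoup.parse(String, String) is a base URI, not a charset.
     Document docs = Jsoup.parse(decodedHTML);
     System.out.println(docs);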

The output includes the following in 'docs':

<div class="\&quot;search-video-card\&quot;"><div 
     class="\&quot;video__cover\&quot;"

Why is this happening, and how can I prevent it? I tried the following, found in similar questions:

     // Adjust how jsoup serializes the document: no pretty-printing, extended escape mode.
     Document.OutputSettings settings = docs.outputSettings();
     settings.prettyPrint(false).escapeMode(EscapeMode.extended);
     settings.charset("ASCII");
     String modifiedFileHtmlStr = docs.html();

     System.out.println(modifiedFileHtmlStr);

None of this worked.
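
One plausible reading of the output above, which the post itself does not confirm: the backslash in front of each &quot; suggests that the input string still carries JSON- or Java-style escaped quotes (`class=\"search-video-card\"`), for example because the HTML was cut out of a JSON payload without decoding it. In that case the parser would treat `\"search-video-card\"` as an unquoted attribute value and escape the embedded quotes when printing, and HTML entity unescaping or output settings would have nothing to act on. The sketch below uses only plain Java to strip such backslash escapes before parsing. The class name `EscapedHtmlCleaner` and the helper `unescapeBackslashes` are hypothetical, and the sample input is an assumption reconstructed from the printed output.

```java
final class EscapedHtmlCleaner {

    // Hypothetical helper: undoes simple backslash escapes (\" \\ \/ \n \t \r).
    // It is a sketch and does not handle \\uXXXX sequences.
    static String unescapeBackslashes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case 'r': out.append('\r'); break;
                    default:  out.append(next); // covers \" \\ and \/
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed input: HTML whose attribute quotes are still backslash-escaped.
        String raw = "<div class=\\\"search-video-card\\\"><div class=\\\"video__cover\\\">";
        String clean = unescapeBackslashes(raw);
        System.out.println(clean);
        // With jsoup on the classpath, the cleaned string would then be parsed:
        // Document docs = Jsoup.parse(clean);
    }
}
```

If the response really is JSON, decoding it with a JSON parser and reading the HTML field from the result is likely the more robust route; a library routine such as Commons Lang's StringEscapeUtils.unescapeJava could also stand in for the hand-written helper.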

from Recent Questions - Stack Overflow https://ift.tt/2QRv8pv