2022-01-21

Align whitespace after segmentation into sentences

I am using a segmentation library to split a string into sentences:

s = "Lucas ipsum dolor sit amet darth\n 
     mandalore kit. Endor Mr. Wookiee wicket\n 
     jawa yavin ackbar jabba? Padmé\n
     utapau palpatine kenobi moff.\n
     Sidious anakin mace:\n
      - Ben darth.\n
      - Ben vader."

segment(s)
# => 
["Lucas ipsum dolor sit amet darth mandalore kit.",
 "Endor Mr. Wookiee wicket jawa yavin ackbar jabba?",
 "Padmé utapau palpatine kenobi moff.",
 "Sidious anakin mace:",
 "- Ben darth.",
 "- Ben vader."]

Unfortunately the library has no way to retain the whitespacing, but strips all newlines, leading and trailing newlines and collapses multiple spaces into a single space.

Given that the segmentation library is awesome otherwise (it keeps stuff like Mr. Wookie): What would be a concise code to re-insert the whitespace into the split sentences given the original text is still available?

Expected outcome:

["Lucas ipsum dolor sit amet darth\nmandalore kit. ",
 "Endor Mr. Wookiee wicket\njawa yavin ackbar jabba? ",
 "Padmé\nutapau palpatine kenobi moff.\n",
 "Sidious anakin mace:\n",
 " - Ben darth.\n",
 " - Ben vader."]

Original code is Ruby, but solutions could be in any language.



from Recent Questions - Stack Overflow https://ift.tt/3qMbTxb
https://ift.tt/eA8V8J

No comments:

Post a Comment