Align whitespace after segmentation into sentences
I am using a segmentation library to split a string into sentences:
s = "Lucas ipsum dolor sit amet darth\n" \
    "mandalore kit. Endor Mr. Wookiee wicket\n" \
    "jawa yavin ackbar jabba? Padmé\n" \
    "utapau palpatine kenobi moff.\n" \
    "Sidious anakin mace:\n" \
    " - Ben darth.\n" \
    " - Ben vader."
segment(s)
# =>
["Lucas ipsum dolor sit amet darth mandalore kit.",
"Endor Mr. Wookiee wicket jawa yavin ackbar jabba?",
"Padmé utapau palpatine kenobi moff.",
"Sidious anakin mace:",
"- Ben darth.",
"- Ben vader."]
Unfortunately, the library has no way to retain the original whitespace: it strips all newlines, removes leading and trailing whitespace, and collapses multiple spaces into a single space.
Given that the segmentation library is otherwise awesome (it keeps abbreviations such as "Mr. Wookiee" inside one sentence): what would be a concise way to re-insert the whitespace into the split sentences, given that the original text is still available?
Expected outcome:
["Lucas ipsum dolor sit amet darth\nmandalore kit. ",
"Endor Mr. Wookiee wicket\njawa yavin ackbar jabba? ",
"Padmé\nutapau palpatine kenobi moff.\n",
"Sidious anakin mace:\n",
" - Ben darth.\n",
" - Ben vader."]
The original code is Ruby, but solutions in any language are welcome.
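One possible approach, as a minimal sketch rather than a definitive answer: walk the original text once with a StringScanner and, for each sentence, match its tokens with `\s+` allowed between them. The helper name `realign_whitespace` is hypothetical. The rule for dividing the gap between two sentences is an assumption inferred from the expected outcome above: whitespace up to and including the last newline stays with the preceding sentence, any remaining indentation goes to the following one, and a gap without a newline belongs entirely to the preceding sentence.

```ruby
require "strscan"

# Hypothetical helper: gives each segmented sentence back the whitespace
# that the segmenter stripped, by scanning the original text in order.
def realign_whitespace(original, sentences)
  scanner = StringScanner.new(original)
  sentences.map do |sentence|
    # The sentence's tokens, separated by any run of whitespace.
    body = Regexp.new(sentence.split.map { |t| Regexp.escape(t) }.join('\s+'))

    # Indentation left over from the previous gap (or the start of the text).
    leading = scanner.scan(/\s*/)

    matched = scanner.scan(body)
    if matched.nil?
      raise ArgumentError,
            "sentence not found at offset #{scanner.pos}: #{sentence.inspect}"
    end

    # Assumed split rule: take whitespace through the last newline of the gap,
    # or the whole gap when it contains no newline.
    trailing = scanner.scan(/\s*\n|\s*/)

    leading + matched + trailing
  end
end

original  = "One two.\n  - Three\nfour."
sentences = ["One two.", "- Three four."]
p realign_whitespace(original, sentences)
# prints ["One two.\n", "  - Three\nfour."]
```

With the sample text from the question, `realign_whitespace(s, segment(s))` should reproduce the expected outcome, provided the list items in `s` really are indented by one space. The sketch relies on the segmenter returning sentences in order and altering nothing but whitespace; if it rewrites any other characters, the match fails and an ArgumentError is raised.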
from Recent Questions - Stack Overflow https://ift.tt/3qMbTxb