How to iterate and apply text preprocessing on sublists
I have a corpus of 20k rows of Twitter data, which I have already lowercased and tokenised using the tweet tokenizer.
For example:
X = [
    ["i","love","to","play","games","","."],
    ["my","favourite","colour","is","purple","!"],
    ["@ladygaga","we","love","you","#stan","someurl"]
]
tweet_tokens = []
for tweet in tweets:
    tweet = tweet.lower()
    tweet_tokens.append(tweet)
This is how I lowercased my tokens.
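For reference, this is roughly how I'm doing the tokenisation (assuming nltk's TweetTokenizer, which can lowercase via preserve_case=False; my actual call may differ slightly):

from nltk.tokenize import TweetTokenizer

# Assumption: nltk's TweetTokenizer; preserve_case=False lowercases while tokenising
tokenizer = TweetTokenizer(preserve_case=False)
X = [tokenizer.tokenize(tweet) for tweet in tweets]  # one token sublist per tweet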
How can I iterate through the sublists and process each one to remove stopwords, punctuation, blank strings, and URLs, while keeping the content of @ mentions?
This is what I thought of/tried, but it's not giving me the right results (showing only stopword removal as an example):
filtered_sentence = []
filtered_word = []
for sent in X:
    for word in sent:
        if word not in stopwords:
            filtered_word.append(word)
            filtered_sentence.append(word)
What would be the correct way to iterate through each sublist and process it without disrupting the list structure?
Ideally, the output should look like this:
Cleaned_X = [
    ["love","play","games"],
    ["favourite","colour","purple"],
    ["ladygaga","love","#stan"]
]
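For what it's worth, this is the kind of nested loop I imagine might get me to that output (only a rough sketch on my part, assuming nltk's English stopword list, string.punctuation for punctuation-only tokens, and a crude startswith check for URLs), but I'm not sure it's correct or idiomatic:

import string
from nltk.corpus import stopwords

# Assumption: nltk's English stopword list is what I should be filtering against
stop_words = set(stopwords.words("english"))

cleaned_X = []
for sent in X:
    filtered_sentence = []                                      # one cleaned sublist per tweet
    for word in sent:
        if word in stop_words:                                  # drop stopwords
            continue
        if not word.strip() or all(ch in string.punctuation for ch in word):
            continue                                            # drop blanks and punctuation-only tokens
        if word.startswith(("http", "www")):                    # crude URL check; a real one may need a regex
            continue
        if word.startswith("@"):                                # keep the mention text, drop the "@"
            word = word[1:]
        filtered_sentence.append(word)
    cleaned_X.append(filtered_sentence)

The main difference from my attempt above is that I build a fresh filtered_sentence for each tweet and append that whole list to cleaned_X, rather than appending individual words to one flat list.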