2022-03-23

How to iterate over sublists and apply text preprocessing to each one

I have a corpus of 20k rows of Twitter data, which I have already lowercased and tokenised using a tweet tokenizer.

For example:

X = [
  ["i","love","to","play","games","","."],
  ["my","favourite","colour","is","purple","!"],
  ["@ladygaga","we","love","you","#stan","'someurl"]
]

tweet_tokens = []

for tweet in tweets:
    tweet = tweet.lower()
    tweet_tokens.append(tweet)

The loop above is how I lowercased the tweets.

How can I iterate through the sublists and, within each one, remove stopwords, punctuation, blank strings and URLs, while keeping the content of @ mentions?

This is what I tried, but it's not giving me the right results (only the stopword step is shown, as an example):

filtered_sentence = []
filtered_word = []

for sent in X:
    for word in sent:
        if word not in stopwords:
            filtered_word.append(word)
            filtered_sentence.append(word)

What would be the correct way to iterate through each sublist and process it without disrupting the list-of-lists structure?

Ideally, the output should look like this:

Cleaned_X = [
  ["love","play","games"],
  ["favourite","colour","purple"],
  ["ladygaga","love","#stan"]
]
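
One possible way to get that shape is to clean each sublist on its own and collect the results in a new outer list, rather than appending every word to a single flat list. The sketch below is only an illustration under some assumptions: the stopword set is a tiny hand-written stand-in (a real run would probably use a fuller list, such as one from a library), the helper names `is_url`, `is_punctuation` and `clean_tweet` are made up for this example, a URL is assumed to start with "http://", "https://" or "www.", and the placeholder URL token is replaced with "https://example.com".

```python
import string

# Hypothetical, deliberately tiny stopword set for illustration only;
# in practice a fuller list (e.g. from a library) would likely be used.
STOPWORDS = {"i", "to", "my", "is", "we", "you"}


def is_url(token):
    # Assumption: URL tokens start with one of these prefixes.
    return token.startswith(("http://", "https://", "www."))


def is_punctuation(token):
    # True when every character is ASCII punctuation, e.g. "." or "!!".
    return bool(token) and all(ch in string.punctuation for ch in token)


def clean_tweet(tokens):
    """Return a new list with the unwanted tokens of one tweet removed."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        # Drop blanks, URLs and pure-punctuation tokens.
        if not token or is_url(token) or is_punctuation(token):
            continue
        # Keep the text of a mention but drop the leading "@".
        if token.startswith("@"):
            token = token[1:]
        if token in STOPWORDS:
            continue
        cleaned.append(token)
    return cleaned


X = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["@ladygaga", "we", "love", "you", "#stan", "https://example.com"],
]

# One cleaned sublist per original sublist, so the nesting is preserved.
cleaned_X = [clean_tweet(tweet) for tweet in X]
print(cleaned_X)
```

Because `clean_tweet` builds a fresh list for every tweet, the outer list keeps one entry per tweet, and the original `X` is left unmodified. Hashtags such as "#stan" survive because they are not made up entirely of punctuation characters.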

