2023-09-05

Is there a step in recipes that uses relative frequency instead of step_tokenfilter()?

I'm building a regression model in R, following the approach by Emil Hvitfeldt and Julia Silge (https://smltar.com/mlregression#fnref7), and I was wondering whether it is possible to use relative frequency instead of absolute frequency in the preprocessing step step_tokenfilter(). I looked into it but couldn't find a function for this.
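To make clear what I mean by relative frequency: each token's count divided by the total number of tokens in the same document. Here is a minimal base-R sketch of that definition. The two toy sentences and the variable names are made up for illustration and are not my real data.

```r
# Toy documents, made up for illustration only (assumption, not the real data).
docs <- c("the cat sat on the mat", "the dog barked")

# Split each document into tokens on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Absolute frequency: raw count of each token within each document.
abs_freq <- lapply(tokens, table)

# Relative frequency: each count divided by the document's total token count.
rel_freq <- lapply(abs_freq, prop.table)

# "the" occurs 2 times among the 6 tokens of the first document.
rel_freq[[1]]["the"]
```

Ideally I would get this kind of value per token from a recipe step, rather than computing it by hand.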

Here is my current code, which uses tf-idf on the 1,000 most frequent tokens instead.

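  library(recipes)
  library(textrecipes)  # provides the text steps used below

  # data_train and stopwords_list are defined elsewhere (not shown)
  data_rec <- recipe(year ~ sentence_lemma, data = data_train) %>%
    step_tokenize(sentence_lemma) %>%                          # split text into tokens
    step_stopwords(sentence_lemma, custom_stopword_source = stopwords_list) %>%
    step_tokenfilter(sentence_lemma, max_tokens = 1e3) %>%     # keep the 1,000 most frequent tokens
    step_tfidf(sentence_lemma) %>%                             # tf-idf weights as predictors
    step_normalize(all_predictors())                           # center and scale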

Thanks in advance for any help ;)


