2023-05-21

Update Estimates in Party / Partykit model with averages from unseen holdout data

I want to fit a decision tree (using evtree, which has a very long run time on large datasets) on a subsample of the data, then take the fitted model and update its terminal node estimates with estimates computed from holdout data. I don't care about n, err, variance, etc. This is analogous to the concept of "honesty" in the GRF package, where the bias introduced by building the splits on a sample is countered by re-estimating the leaves on holdout data. Ideally I'd then be able to use the updated model for inference on new data.

library(partykit)
library(dplyr)

data(mtcars)
set.seed(12)
train = sample(nrow(mtcars), floor(nrow(mtcars) / 1.5))
sample_tree = ctree(mpg ~ ., data = mtcars[train, ])
sample_tree %>% as.simpleparty()

# Fitted party:
# [1] root
# |   [2] cyl <= 6: 23.755 (n = 11, err = 224.8)
# |   [3] cyl > 6: 15.380 (n = 10, err = 42.1)

# mean of the observed holdout mpg within each terminal node of the fitted tree
data.frame(node = predict(sample_tree, newdata = mtcars[-train, ], type = "node"),
           observed = mtcars[-train, ]$mpg) %>%
  group_by(node) %>%
  summarize(mpg = mean(observed)) %>%
  as.list()

# $node
# [1] 2 3
#
# $mpg
# [1] 24.31429 14.40000

In this case I'd update nodes 2 and 3 in the tree to 24.31429 and 14.40000, respectively.
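The closest I can see to doing that by hand is to edit the stored node info of a simpleparty directly. This is only a sketch built on my assumptions about the internals: that as.list()/as.partynode() round-trip the node structure, and that info$prediction is what print() and predict() report for a simpleparty (honest_tree is just my name for the copy).

# sketch: overwrite the stored predictions of a simpleparty by hand
honest_tree <- as.simpleparty(sample_tree)

# holdout means per terminal node (24.31429 and 14.40000 from above)
holdout_nodes <- predict(sample_tree, newdata = mtcars[-train, ], type = "node")
holdout_means <- tapply(mtcars[-train, ]$mpg, holdout_nodes, mean)

# flatten the recursive node structure, overwrite info$prediction for the
# terminal nodes, then rebuild the partynode and plug it back into the tree
nodelist <- as.list(honest_tree$node)
for (i in seq_along(nodelist)) {
  id <- as.character(nodelist[[i]]$id)
  if (id %in% names(holdout_means)) {
    nodelist[[i]]$info$prediction <- holdout_means[[id]]
  }
}
honest_tree$node <- as.partynode(nodelist)

honest_tree  # terminal nodes should now print the holdout means

Even if that works, it feels fragile, which is why I'm looking for a supported way to do it.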

Things I've tried: asking ChatGPT over and over, a lot of googling, and jumping through hoops to figure out how to extract the terminal node values.

I've also "successfully" updated the model's data, just not its estimates:

sample_tree$data = mtcars[-train, ]
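Assigning to $data doesn't change the reported estimates, I think, because a ctree/constparty recomputes its node means from the $fitted slot (the "(fitted)" node ids and "(response)" values) rather than from $data. That column-name convention is my assumption about the internals; if it holds, replacing $fitted with the holdout rows routed through the existing splits might be all that's needed. A sketch:

honest_tree <- sample_tree

# route the holdout rows through the existing splits and store them as the
# tree's fitted data; node means should then be recomputed from the holdout set
honest_tree$fitted <- data.frame(
  "(fitted)"   = predict(sample_tree, newdata = mtcars[-train, ], type = "node"),
  "(response)" = mtcars[-train, ]$mpg,
  "(weights)"  = 1,
  check.names  = FALSE
)
honest_tree$data <- mtcars[-train, ]

honest_tree %>% as.simpleparty()
# terminal node estimates should now be 24.31429 and 14.40000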

The ideal would be an update method similar to:

names(dataframe) = c(1,2,3,4)

or

update(tree_model)  # tree with updated data attached
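In other words, something like the hypothetical wrapper below (honest_update() is my own name, not a partykit function), which just packages the $fitted replacement from above:

# hypothetical helper: return a copy of a constparty whose node estimates
# are recomputed from a holdout set
honest_update <- function(tree, holdout, response = "mpg") {
  tree$fitted <- data.frame(
    "(fitted)"   = predict(tree, newdata = holdout, type = "node"),
    "(response)" = holdout[[response]],
    "(weights)"  = 1,
    check.names  = FALSE
  )
  tree$data <- holdout
  tree
}

honest_tree <- honest_update(sample_tree, mtcars[-train, ])
predict(honest_tree, newdata = mtcars)  # predictions should now come from the holdout means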

Edit: updated the example to use set.seed(12) instead of set.seed(123). The figures/split rules changed, but the model now splits properly.


