I'm trying to work out how to get two regressors to interact when using BigQuery ML.
In the example below (apologies for the rough fake data!), I'm trying to predict total_hire_duration using trip_count as well as the month of the year. BQ tends to treat the month part as a constant added on to the linear regression equation, but I actually want its effect to grow with trip_count. For my real dataset I can't just supply the timestamp, as BQML seems to over-parameterise.
I should add that if I supply month as a numeric value, I just get a single coefficient, which doesn't really work for my dataset (patterns form around parts of the academic year rather than the calendar year).
If the month part is a constant, then as trip_count gets very large, the constant b in the equation y = ax + b becomes inconsequential. It's almost as if I want something like y = ax + bx + c, where x is trip_count, a is its coefficient, and b is a coefficient that depends on the value of month.
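Written out a little more explicitly, the model described above might look like the following. This is only a sketch of the intended form: the subscript m (the month of the observation) is notation added here and does not appear in the question's own equation.

```latex
\[
  \hat{y} = a\,x + b_{m}\,x + c = (a + b_{m})\,x + c,
  \qquad m = \text{month} \in \{1, \dots, 12\}
\]
```

Since a and b_m only ever appear as the sum a + b_m, this amounts to fitting a separate slope on trip_count for each month, with one shared intercept c.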
This is quite easy to do in R; I'd just run glm(bike$totalHireDuration ~ bike$tripCount:bike$month).
Here's some fake data to reproduce:
CREATE OR REPLACE MODEL
  my_model_name
OPTIONS (model_type = 'linear_reg',
         input_label_cols = ['total_hire_duration']) AS (
  -- One row per day: the month as a categorical (STRING) feature plus daily totals.
  SELECT
    CAST(EXTRACT(MONTH FROM DATE(start_date)) AS STRING) AS month,
    COUNT(*) AS trip_count,
    SUM(duration_sec) AS total_hire_duration
  FROM
    `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
  GROUP BY
    DATE(start_date), month)
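For comparison, here is a minimal sketch of what the R interaction term might look like if written out by hand in BigQuery SQL: one slope column per month, each equal to trip_count in that month and 0 otherwise. The model name my_interaction_model and the trips_mNN column names are made up for illustration, the daily grouping is an assumption about the intended grain, and whether this behaves well on the real dataset is untested.

```sql
-- Sketch only: my_interaction_model and the trips_mNN columns are hypothetical names.
CREATE OR REPLACE MODEL
  my_interaction_model
OPTIONS (model_type = 'linear_reg',
         input_label_cols = ['total_hire_duration']) AS (
  WITH daily AS (
    -- Assumed grain: one row per day, with the month kept as a number.
    SELECT
      EXTRACT(MONTH FROM DATE(start_date)) AS month_num,
      COUNT(*) AS trip_count,
      SUM(duration_sec) AS total_hire_duration
    FROM
      `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
    GROUP BY
      DATE(start_date), month_num
  )
  SELECT
    total_hire_duration,
    -- Each column carries trip_count only in its own month, so its
    -- coefficient acts as that month's slope on trip_count.
    IF(month_num = 1,  trip_count, 0) AS trips_m01,
    IF(month_num = 2,  trip_count, 0) AS trips_m02,
    IF(month_num = 3,  trip_count, 0) AS trips_m03,
    IF(month_num = 4,  trip_count, 0) AS trips_m04,
    IF(month_num = 5,  trip_count, 0) AS trips_m05,
    IF(month_num = 6,  trip_count, 0) AS trips_m06,
    IF(month_num = 7,  trip_count, 0) AS trips_m07,
    IF(month_num = 8,  trip_count, 0) AS trips_m08,
    IF(month_num = 9,  trip_count, 0) AS trips_m09,
    IF(month_num = 10, trip_count, 0) AS trips_m10,
    IF(month_num = 11, trip_count, 0) AS trips_m11,
    IF(month_num = 12, trip_count, 0) AS trips_m12
  FROM
    daily)
```

With this layout the fitted equation has a single intercept and twelve slopes, which is roughly what tripCount:month gives in R when month is a factor. Months could also be grouped into academic terms by changing the IF conditions, for example IF(month_num IN (9, 10, 11, 12), trip_count, 0).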
Any help would be greatly appreciated!