What data type does VectorAssembler require for an input?
The core problem is this:
from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()
This fails with the error: IllegalArgumentException: Data type array<bigint> of column a is not supported.
I know this is a bit of a toy problem, but I'm trying to integrate this into a longer pipeline with the following steps:
- StringIndexer
- OneHotEncoder
- Custom UnaryTransformer to multiply all the 1's by 10
  - What datatype should be returned here?
- Then VectorAssembler to combine the vectors into a single vector for modeling.
If I can determine the proper input datatype for the VectorAssembler, I should be able to string everything together properly. I think the input type is a Vector, but I can't figure out how to build one.
from Recent Questions - Stack Overflow https://ift.tt/3t0pqjH