2021-01-28

What data type does VectorAssembler require for an input?

The core problem is the following snippet:

from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([([1, 2, 3], 0, 3)], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features", inputCols=["a", "b", "c"])
vecAssembler.transform(df).show()

It fails with the error: IllegalArgumentException: Data type array<bigint> of column a is not supported.

I know this is a bit of a toy problem, but I'm trying to integrate this into a longer pipeline with the following steps:

  • StringIndexer
  • OneHotEncoder
  • Custom UnaryTransformer to multiply all the 1's by 10
    • What data type should be returned here?
  • Then VectorAssembler to combine the vectors into a single vector for modeling.

If I can determine the proper input datatype for the VectorAssembler I should be able to string everything together properly. I think the input type is a Vector, but I can't figure out how to build one.



from Recent Questions - Stack Overflow https://ift.tt/3t0pqjH