How to overwrite a pyspark DataFrame schema without a data scan?
This question is related to https://stackoverflow.com/a/37090151/1661491. Let's assume I have a pyspark DataFrame with a certain schema, and I would like to overwrite that schema with a new schema that I know is compatible. I could do:
df: DataFrame
new_schema = ...
df.rdd.toDF(schema=new_schema)
Unfortunately this triggers computation as described in the link above. Is there a way to do that at the metadata level, without triggering computation or conversions?
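For concreteness, a minimal reproduction of the pattern above; the SparkSession setup and the toy schema here are illustrative assumptions, not part of my real use case:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "a")], schema="id long, name string")

# A compatible schema that only changes nullability and column metadata.
new_schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=True, metadata={"comment": "updated description"}),
])

# Round-trips through the RDD API; this conversion is what I want to avoid.
df2 = df.rdd.toDF(schema=new_schema)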
Edit, note:
- the schema can be arbitrarily complicated (nested, etc.)
- the new schema includes updates to descriptions, nullability, and additional metadata (bonus points for updates to types)
- I would like to avoid writing a custom query expression generator, unless there is one already built into Spark that can generate a query based on the schema/StructType (a rough sketch of what such a generator would entail follows this list)
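For reference, the kind of expression generator I am hoping to avoid might look roughly like the sketch below. This is my own illustrative code: it handles flat schemas only, nested types would need recursive expression building, and nullability cannot be forced through select/alias at all, which is part of the problem:

from pyspark.sql import functions as F

def apply_schema(df, new_schema):
    # Rebuild every column as a Catalyst expression: cast to the target type,
    # then re-alias with the target metadata. This is purely a plan rewrite
    # (no RDD round-trip), but it covers flat schemas only and cannot
    # change nullability.
    cols = [
        F.col(field.name).cast(field.dataType).alias(field.name, metadata=field.metadata)
        for field in new_schema.fields
    ]
    return df.select(*cols)

df2 = apply_schema(df, new_schema)  # reuses df/new_schema from the snippet above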