2022-03-28

How to overwrite pyspark DataFrame schema without data scan?

This question is related to https://stackoverflow.com/a/37090151/1661491. Let's assume I have a pyspark DataFrame with a certain schema, and I would like to overwrite it with a new schema that I know is compatible. I could do:

df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)

Unfortunately, this triggers computation, as described in the link above. Is there a way to do this at the metadata level, without triggering computation or conversions?

Edit, note:

  • the schema can be arbitrarily complicated (nested, etc.)
  • the new schema includes updates to description, nullability and additional metadata (bonus points for updates to the type)
  • I would like to avoid writing a custom query expression generator, unless there's one already built into Spark that can generate a query based on the schema/StructType

