How to overwrite pyspark DataFrame schema without data scan?

By Ritesh Sahu - March 28, 2022

This question is related to https://stackoverflow.com/a/37090151/1661491. Let's assume I have a PySpark DataFrame with a certain schema, and I would like to overwrite that schema with a new schema that I know is compatible. I could do:

df: DataFrame
new_schema = ...

df.rdd.toDF(schema=new_schema)

Unfortunately this triggers computation as described in the link above. Is there a way to do that at the metadata level, without triggering computation or conversions?

Edit, note:

- The schema can be arbitrarily complicated (nested etc.).
- The new schema includes updates to description, nullability and additional metadata (bonus points for updates to the type).
- I would like to avoid writing a custom query expression generator, unless there's one already built into Spark that can generate a query based on the schema/StructType.
