2023-10-04

How to create a Spark DataFrame from a txt file with multiple data points, each having a label and 4 features whose values are given like 1:0.4

So, I have a txt file with rows of numbers:

0.0 1:5.1 2:3.5 3:1.4 4:0.2
0.0 1:4.9 2:3.0 3:1.4 4:0.2
...

and so on.

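The first value on each line (0.0 here) is the label; labels are 0.0, 0.1, 0.2 and so on. The colon-separated entries are the features: the number before the colon is the feature index and the number after it is that feature's value.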
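
To make the layout concrete, here is a small plain-Python sketch that splits one such line into its label and a mapping from feature index to value. The function name `parse_line` is hypothetical, and the sketch assumes whitespace-separated tokens with every feature written as `index:value`, as in the sample rows above.

```python
def parse_line(line):
    """Split '0.0 1:5.1 2:3.5 ...' into (label, {index: value})."""
    label_token, *feature_tokens = line.split()
    features = {}
    for token in feature_tokens:
        # Everything before the first colon is the index, the rest is the value.
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return float(label_token), features


label, features = parse_line("0.0 1:5.1 2:3.5 3:1.4 4:0.2")
print(label, features)
```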

I want to load this into a Spark DataFrame.

I did the following, but the features are returned as nulls.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType

# Reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

# Define the schema
schema = StructType([
    StructField("label", FloatType(), nullable=False),
    StructField("feature1", FloatType(), nullable=False),
    StructField("feature2", FloatType(), nullable=False),
    StructField("feature3", FloatType(), nullable=False),
    StructField("feature4", FloatType(), nullable=False)
])

# Load the data into a DataFrame using the specified schema
df = spark.read.option("delimiter", " ").csv("file_path", schema=schema)

# Show the first few rows of the DataFrame
df.show()
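
The nulls most likely appear because each space-delimited field still carries its `index:` prefix, so a token such as `1:5.1` cannot be parsed as a float. Spark's CSV reader, in its default PERMISSIVE mode, reports such fields as null instead of failing. Below is a minimal sketch of one possible workaround, assuming every line has exactly four features listed in index order: read all columns as strings, then keep only the part after the colon and cast it. The names `can_cast_to_float` and `load_labeled_rows` are hypothetical, and the path is whatever was passed as "file_path" in the attempt above.

```python
def can_cast_to_float(token):
    """Plain-Python stand-in for the cast the CSV reader attempts on each field."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def load_labeled_rows(spark, path):
    """Read the file with string columns, then strip the 'index:' prefixes."""
    # Imported lazily so the helper above can be tried without pyspark installed.
    from pyspark.sql import functions as F

    # Without a schema or header, the columns are named _c0.._c4 and are strings.
    raw = spark.read.option("delimiter", " ").csv(path)
    names = ["feature1", "feature2", "feature3", "feature4"]
    return raw.select(
        F.col("_c0").cast("float").alias("label"),
        *[
            # "1:5.1" -> ["1", "5.1"] -> "5.1" -> 5.1
            F.split(F.col(f"_c{i}"), ":").getItem(1).cast("float").alias(name)
            for i, name in enumerate(names, start=1)
        ],
    )


# The label token casts cleanly, the feature tokens do not.
label_ok = can_cast_to_float("0.0")
feature_ok = can_cast_to_float("1:5.1")
```

With an active session this would be called as `load_labeled_rows(spark, "file_path").show()`. Since this `label index:value` layout is the LIBSVM format, another option is Spark's built-in reader, `spark.read.format("libsvm").load("file_path")`, which returns a `label` column and a single `features` vector column rather than four separate float columns.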

