2023-10-04

How to pass values (string and array) via args to generate a DataFrame?

I'm having trouble transforming a column from String to Array: the way I'm doing it, the DataFrame comes out empty. Example:

What it should produce:

df.show() 

+----+----+----+--------------+
|col1|col2|col3|col4          |
+----+----+----+--------------+
|   a|   b|   c|[[book,phone]]|
|   e|   f|   g|[[phone,home]]|
+----+----+----+--------------+

What the code actually produces:

+----+----+----+----+                                                           
|col1|col2|col3|col4|
+----+----+----+----+
+----+----+----+----+

Example of how it currently looks and how it should look:

Current:
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: string (nullable = true)

How it should look:
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: array (nullable = true)
 |    |-- element: string (containsNull = true)

But I'm stuck, because the DataFrame ends up empty:

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val args = Seq("a,b, c, [[book,phone]] | e,f ,g, [[phone,home]] ", "col1, col2, col3 , col4")

// Split the first argument into rows on "|" and each row into fields on ",",
// keeping only rows whose field count matches the number of column names.
val dataClient = args(0).split("\\|").map(_.trim).map { raw =>
  val rawSeq = raw.split(",").map(_.trim)
  if (rawSeq.length == args(1).split(",").length) {
    rawSeq
  } else {
    null
  }
}.filter(_ != null).toSeq

val columns = args(1).split(",").map(_.trim)
val schema = StructType(columns.map(c => StructField(c, StringType)))

val spark = SparkSession.builder().appName("example").getOrCreate()

val rdd = spark.sparkContext.parallelize(dataClient.map(Row.fromSeq(_)))

val df = spark.createDataFrame(rdd, schema)
df.show()
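I think I see where the rows disappear: splitting on every comma also splits inside the bracketed array literal, so the field count never matches the four column names and every row is rejected. A minimal check outside Spark (my own debugging sketch):

```scala
object SplitCheck extends App {
  // One raw row from args(0): the array literal itself contains a comma.
  val row = "a,b, c, [[book,phone]]"
  val fields = row.split(",").map(_.trim)

  // The bracketed value is cut in two, giving 5 fields for 4 columns,
  // so the length check in the code above rejects the row.
  println(fields.mkString("|")) // a|b|c|[[book|phone]]
  println(fields.length)        // 5
}
```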


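For reference, this is roughly how I'd expect it to work (a sketch, only tried against this sample input; the `local[*]` master and the bracket-stripping are my assumptions): split on commas that are outside brackets, parse the bracketed value into a `Seq[String]`, and declare `col4` as `ArrayType(StringType)` in the schema:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val args = Seq("a,b, c, [[book,phone]] | e,f ,g, [[phone,home]] ", "col1, col2, col3 , col4")
val columns = args(1).split(",").map(_.trim)

// Split only on commas that are NOT inside a bracketed literal,
// so "[[book,phone]]" survives as a single token.
val fieldSep = ",(?![^\\[]*\\])"

val rows = args(0).split("\\|").map(_.trim).flatMap { raw =>
  val fields = raw.split(fieldSep).map(_.trim)
  if (fields.length == columns.length) {
    // Turn the bracketed literal into a Seq[String] for the array column.
    val arr = fields.last.stripPrefix("[[").stripSuffix("]]").split(",").map(_.trim).toSeq
    Some(Row.fromSeq(fields.init.toSeq :+ arr))
  } else None
}

// Same schema as before, except col4 is an array of strings.
val schema = StructType(
  columns.init.map(c => StructField(c, StringType)) :+
    StructField(columns.last, ArrayType(StringType))
)

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)
df.printSchema()
df.show(false)
```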
