How do I pass values (strings and an array) via args to generate a DataFrame?
I'm having trouble converting a column's type from String to Array: the way I'm parsing the input, the DataFrame comes back empty. Example:
Expected output:
df.show()
+----+----+----+--------------+
|col1|col2|col3|col4 |
+----+----+----+--------------+
| a| b| c|[[book,phone]]|
| e| f| g|[[phone,home]]|
+----+----+----+--------------+
What the code actually returns:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
+----+----+----+----+
Current schema versus expected schema:
Current:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
|-- col4: string (nullable = true)
Expected:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
|-- col4: array (nullable = true)
| |-- element: string (containsNull = true)
Here is my code; the resulting DataFrame ends up with no rows:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val args = Seq("a,b, c, [[book,phone]] | e,f ,g, [[phone,home]] ", "col1, col2, col3 , col4")

// split rows on '|' and fields on ','; discard any row whose field count
// does not match the number of column names
val dataClient = args(0).split("\\|").map(_.trim).map { raw =>
  val rawSeq = raw.split(",").map(_.trim)
  if (rawSeq.length == args(1).split(",").length) {
    rawSeq
  } else {
    null
  }
}.filter(_ != null).toSeq

val columns = args(1).split(",").map(_.trim)
val schema = StructType(columns.map(c => StructField(c, StringType)))

val spark = SparkSession.builder().appName("example").getOrCreate()
val rdd = spark.sparkContext.parallelize(dataClient.map(Row.fromSeq(_)))
val df = spark.createDataFrame(rdd, schema)
df.show()
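A likely cause: raw.split(",") also splits inside "[[book,phone]]", so every row ends up with more fields than there are columns, fails the length check, and is filtered out, leaving the DataFrame empty. Below is a minimal sketch of just the parsing step, assuming a comma-split that skips commas inside square brackets; the fieldSplit regex and the bracket-stripping are my assumptions, not part of the original code:

```scala
val args = Seq("a,b, c, [[book,phone]] | e,f ,g, [[phone,home]] ", "col1, col2, col3 , col4")

val columns = args(1).split(",").map(_.trim)

// split only on commas that are NOT inside square brackets:
// a comma followed by non-'[' characters and then ']' sits inside brackets,
// so the negative lookahead leaves it alone
val fieldSplit = ",(?![^\\[]*\\])"

val rows: Seq[Seq[Any]] = args(0).split("\\|").map(_.trim).toSeq.map { raw =>
  raw.split(fieldSplit).map(_.trim).toSeq.map {
    case s if s.startsWith("[") =>
      // "[[book,phone]]" -> Seq("book", "phone")
      s.replaceAll("[\\[\\]]", "").split(",").map(_.trim).toSeq
    case s => s
  }
}
```

With rows shaped this way, the schema could declare the last column as an array (for example StructField("col4", ArrayType(StringType)) instead of StringType), and Row.fromSeq over each parsed row should then populate the array column rather than producing an empty DataFrame.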