Pyspark: Concat Function Generated Columns Into New Dataframe
I have a PySpark DataFrame (df) with n columns. I would like to generate another DataFrame of n columns, where each column records the percentage difference between consecutive rows in the corresponding column of df.
Solution 1:
In this case, you can use a list comprehension inside a call to select.
To make the code a little more compact, we can first get the columns we want to diff in a list:
diff_columns = [c for c in df.columns if c != 'index']
Next, select the index and iterate over diff_columns to compute the new columns. Use .alias() to rename each resulting column:
from pyspark.sql import functions as func
from pyspark.sql import Window

# Window ordered by index, so lag() fetches the previous row's value
w = Window.orderBy('index')

df_diff = df.select(
    'index',
    *[(func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
      for c in diff_columns]
)
df_diff.show()
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#|    3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
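Note that these values are log differences, not raw percentage changes; exp(diff) - 1 converts a log difference to the exact fractional change between rows. As a quick sanity check of that arithmetic in plain Python (the prices list below is an assumed toy series matching col1 in the sample output, where the values step 1 → 2 → 3):

```python
import math

prices = [1.0, 2.0, 3.0]  # assumed values of col1 in the example

# Same computation as the Spark expression, row by row:
# log(current) - log(previous)
log_diffs = [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]
# ≈ [0.6931, 0.4055], matching the _diff columns above

# Convert each log difference to an exact fractional change
pct_changes = [math.exp(d) - 1 for d in log_diffs]
# 1 -> 2 is a +100% change, 2 -> 3 is a +50% change
```

If you want plain percentage differences directly in Spark, you could instead compute (col / lag(col)) - 1 with the same window.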