Higor Carmanini

05/25/2023, 10:31 PM
I have an issue with Kedro and SparkDatasets. I am using a
to read many CSVs into Spark DataFrames. I just found an issue where, apparently, Spark automatically appends the column position to the column name (as read from the header) to create the actual final name. See example in image. As this is sometimes done for deduplication, I investigated whether something similar was happening, and sure enough there is another dataset in this same
that reads another column of the same name. This could "explain" Spark's funky behavior of treating it as a duplicate. Of course, though, these are two separate DataFrames. Has anyone stumbled upon this issue before? I can't find any references online. Thank you! EDIT: Solved! It was due to Spark's default case insensitivity.
Adding an
to the table does not help with this, unfortunately.
The weird thing is: I just tried to do this in a
session to test if that's Spark's default behavior, and it doesn't seem like it:
In [15]: df1 =, sep=',', header=True)

In [16]: df1
Out[16]: DataFrame[col1: string, col2: string, col3: string]

In [17]: df2 =, sep=',', header=True)

In [18]: df2
Out[18]: DataFrame[col1: string, col2: string, col3: string]
Could it be a bug in
Welp, I missed that the same CSV had both
columns, so I eventually found out it's due to Spark's default case insensitivity. Not Kedro-related at all! Problem solved 😄
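For anyone landing here later, this is a rough pure-Python sketch of the renaming rule described above, not Spark's actual code. It assumes the CSV reader compares header names case-insensitively by default and appends the column position to every name that collides. The function name `make_safe_header` and the sample column names are hypothetical, and handling of empty or null header cells is left out.

```python
def make_safe_header(header, case_sensitive=False):
    """Simplified emulation of renaming duplicate CSV header names."""
    # Normalise names for comparison: lowercase unless case-sensitive.
    keys = [name if case_sensitive else name.lower() for name in header]
    # Normalised names that occur more than once are treated as duplicates.
    duplicates = {key for key in keys if keys.count(key) > 1}
    # Append the zero-based column position to every colliding name.
    return [
        f"{name}{index}" if key in duplicates else name
        for index, (name, key) in enumerate(zip(header, keys))
    ]


# Hypothetical header where two columns differ only by case.
print(make_safe_header(["id", "Name", "name"]))
print(make_safe_header(["id", "Name", "name"], case_sensitive=True))
```

In a real session the comparison mode is governed by the `spark.sql.caseSensitive` setting, which defaults to `false`. Setting it to `true` before reading should leave names that differ only by case untouched, though that is worth checking against the Spark version in use.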

Nok Lam Chan

05/26/2023, 11:52 AM
@Higor Carmanini Nice debugging! Thank you for sharing back your findings!