Kedro #questions

Higor Carmanini

05/25/2023, 10:31 PM

I have an issue with Kedro and SparkDatasets. I am using a `PartitionedDataSet` to read many CSVs into Spark DataFrames. I just found an issue where, apparently, Spark automatically appends the column position to the column name (as read from the header) to create the actual final name. See example in image. Since this is sometimes done for deduplication, I investigated whether something similar was happening here, and sure enough there is another dataset in this same `PartitionedDataSet` that reads another column of the same name. This could "explain" this funky behavior of Spark thinking it is a duplicate. Of course, though, these are two separate DataFrames. Has anyone stumbled upon this issue before? I can't find any references online. Thank you! EDIT: Solved! It was due to Spark's default setting of case insensitivity.

✅ 1

Higor Carmanini

05/25/2023, 10:41 PM

Adding an `.alias()` to the table does not help with this, unfortunately.

Higor Carmanini

05/25/2023, 11:04 PM

The weird thing is: I just tried to do this in an `ipython` session to test if that's Spark's default behavior, and it doesn't seem like it:

In [15]: df1 = spark.read.csv(filepath, sep=',', header=True)

In [16]: df1
Out[16]: DataFrame[col1: string, col2: string, col3: string]

In [17]: df2 = spark.read.csv(filepath, sep=',', header=True)

In [18]: df2
Out[18]: DataFrame[col1: string, col2: string, col3: string]

Could it be a bug in `PartitionedDataSet`? 🤔

Higor Carmanini

05/26/2023, 12:17 AM

Welp, I missed that the same CSV had both `min(rank)` and `min(Rank)` columns, so I eventually found out it's due to Spark's default case insensitivity. Not Kedro-related at all! Problem solved 😄
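
For anyone hitting the same thing, here is a minimal sketch of how the colliding columns could be spotted before the CSV ever reaches Spark. It is plain Python, not Kedro or Spark code. The helper name `find_case_insensitive_duplicates` and the sample CSV content are made up for illustration; only the `min(rank)` and `min(Rank)` column names come from this thread. As far as I know, the Spark setting involved is `spark.sql.caseSensitive`, which defaults to false.

```python
import csv
import io
from collections import defaultdict


def find_case_insensitive_duplicates(header):
    """Group column names that collide when compared case-insensitively.

    Hypothetical helper for illustration; not part of Kedro or Spark.
    """
    groups = defaultdict(list)
    for position, name in enumerate(header):
        groups[name.lower()].append((position, name))
    # Keep only the groups that contain more than one column.
    return {key: cols for key, cols in groups.items() if len(cols) > 1}


# Made-up CSV content; only the two min(...) column names come from the thread.
sample_csv = "id,min(rank),min(Rank)\n1,2,3\n"
header = next(csv.reader(io.StringIO(sample_csv)))

collisions = find_case_insensitive_duplicates(header)
print(collisions)
# {'min(rank)': [(1, 'min(rank)'), (2, 'min(Rank)')]}
```

Running a check like this on each partition's header would flag the `min(rank)` / `min(Rank)` pair, which a case-insensitive reader treats as the same name.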

Nok Lam Chan

05/26/2023, 11:52 AM

@Higor Carmanini Nice debugging! Thank you for sharing back your findings!

🙏 1
