# troubleshoot
d
Hey, I think I ran into a bug. I want to run profiling on the S3DataLake ingestion source with an include path like
<s3://bucket_name/{table}/20220729/*.csv>
but it doesn't seem to be working. Spark gets initialized, but that's about it. Thanks in advance. Actually, it doesn't look like profiling runs at all; Spark gets initialized but isn't used 🤔 More specifically, I'm getting this error:
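For context, the recipe boils down to something like this, here run via the Python API; a minimal sketch where the bucket, region, and server values are placeholders (and depending on your DataHub version the key may be path_spec rather than path_specs):
```python
from datahub.ingestion.run.pipeline import Pipeline

# Minimal sketch of the recipe described above, run programmatically.
# Bucket, region, and server are placeholders; adjust for your setup.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                # {table} is the DataHub placeholder; the date folder sits below it.
                "path_specs": [
                    {"include": "s3://bucket_name/{table}/20220729/*.csv"}
                ],
                "aws_config": {"aws_region": "us-east-1"},
                "profiling": {"enabled": True},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```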
Unable to infer schema for CSV. It must be specified manually.
Looking at the debug logs, it only goes down to the
{table}
level when trying to open the path in Spark:
DEBUG:datahub.ingestion.source.s3.source:Opening file <s3://bucket/jordan-test/dataset_a> for profiling in spark
even though the files live two folders further down.
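The failure is reproducible outside DataHub, since by default Spark only lists files at the top level of the directory it's given (plus key=value partition folders), so a folder containing only date subfolders yields no files to infer a schema from. A minimal local repro (paths made up):
```python
import os
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.master("local[1]").appName("repro").getOrCreate()

# Mimic the S3 layout locally: <table>/<date>/<file>.csv
os.makedirs("/tmp/dataset_a/20220729", exist_ok=True)
with open("/tmp/dataset_a/20220729/part-0.csv", "w") as f:
    f.write("id,name\n1,foo\n")

try:
    # Pointing Spark at the table folder: it sees only the 20220729
    # subfolder, finds no CSV files, and cannot infer a schema.
    spark.read.csv("/tmp/dataset_a", header=True).show()
except AnalysisException as e:
    print(e)  # Unable to infer schema for CSV. It must be specified manually.

# Pointing it at the leaf folder works fine.
spark.read.csv("/tmp/dataset_a/20220729", header=True).show()
```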
c
@delightful-barista-90363 Are you able to view the schema for the table in the UI? From the log line
Unable to infer schema for CSV. It must be specified manually.
it seems the issue occurred during schema extraction. Can you confirm whether the given recipe works without profiling? Also, the complete log file, preferably with debug logging enabled, would be very helpful in pinning down the issue.
d
Yeah, the schema gets ingested with or without profiling enabled.
But the profiling statistics can't be ingested.
The error is thrown when Spark tries to open the CSV, because the path it's given is a folder.
I'll send the debug logs in a bit.
c
DataHub treats a non-leaf-level {table} as a partitioned table and tries to read the entire table folder for profiling.
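In other words, profiling truncates the path at the table level, roughly like this (a toy illustration of the behavior, not DataHub's actual code):
```python
# Toy illustration of the truncation described above (not DataHub's actual code).
include = "s3://bucket-name/jordan-test/{table}/20220729/*.csv"
matched_file = "s3://bucket-name/jordan-test/dataset_a/20220729/part-0.csv"

prefix = include.split("{table}")[0]              # "s3://bucket-name/jordan-test/"
table = matched_file[len(prefix):].split("/")[0]  # "dataset_a"
profile_path = prefix + table                     # cut off at the table level

print(profile_path)  # s3://bucket-name/jordan-test/dataset_a  <- what Spark is handed
```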
d
The partition isn't specified, though.
Just {table}.
Here are the debug logs, with company info and non-DataHub logs removed.
It fails specifically at the read.csv call; the path it's trying to read is
<s3://bucket-name/jordan-test/{table}>
So, I was able to get the Stats tab to show up. The file structure we had was
<s3://bucket-name/jordan-test/{table}/20220715/*.csv>
but that extra folder after
{table}
means Spark ends up being handed a folder it can't read.
Even then, the only profiling available was row counts and columns, no statistics.
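For what it's worth, outside of DataHub you can confirm Spark is able to read that nested layout when told to recurse; a manual check, assuming Spark 3.0+ and that your S3 credentials/connector are already configured (this isn't something the profiler does today):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-check").getOrCreate()

# Manual check: with recursiveFileLookup (Spark 3.0+), the table folder from the
# failing log line is readable even though the CSVs sit one date folder down.
# Note: recursiveFileLookup disables partition discovery.
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .csv("s3://bucket-name/jordan-test/dataset_a")  # folder from the failing log line
)
df.show()
```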