# troubleshoot
d
Hey, I think I ran into a bug. I want to run profiling on the S3DataLake ingestion source with an include path like
<s3://bucket_name/{table}/20220729/*.csv>
but it doesn't seem to be working. Spark gets initialized, but that's about it. Thanks in advance. Actually, it doesn't look like profiling runs at all; Spark gets initialized but isn't used 🤔 More specifically, I'm getting this error:
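For context, the recipe boils down to something like this, here run via the Python API; a minimal sketch where the bucket, region, and server values are placeholders (and depending on your DataHub version the key may be path_spec rather than path_specs):
```python
from datahub.ingestion.run.pipeline import Pipeline

# Minimal sketch of the recipe described above, run programmatically.
# Bucket, region, and server are placeholders; adjust for your setup.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                # {table} is the DataHub placeholder; the date folder sits below it.
                "path_specs": [
                    {"include": "s3://bucket_name/{table}/20220729/*.csv"}
                ],
                "aws_config": {"aws_region": "us-east-1"},
                "profiling": {"enabled": True},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```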
Unable to infer schema for CSV. It must be specified manually.
Looking at the debug logs, it only goes down to the
{table}
level when trying to open the path in Spark:
DEBUG:datahub.ingestion.source.s3.source:Opening file <s3://bucket/jordan-test/dataset_a> for profiling in spark
even though the files live two folders further down.
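The failure is reproducible outside DataHub, since by default Spark only lists files at the top level of the directory it's given (plus key=value partition folders), so a folder containing only date subfolders yields no files to infer a schema from. A minimal local repro (paths made up):
```python
import os
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.master("local[1]").appName("repro").getOrCreate()

# Mimic the S3 layout locally: <table>/<date>/<file>.csv
os.makedirs("/tmp/dataset_a/20220729", exist_ok=True)
with open("/tmp/dataset_a/20220729/part-0.csv", "w") as f:
    f.write("id,name\n1,foo\n")

try:
    # Pointing Spark at the table folder: it sees only the 20220729
    # subfolder, finds no CSV files, and cannot infer a schema.
    spark.read.csv("/tmp/dataset_a", header=True).show()
except AnalysisException as e:
    print(e)  # Unable to infer schema for CSV. It must be specified manually.

# Pointing it at the leaf folder works fine.
spark.read.csv("/tmp/dataset_a/20220729", header=True).show()
```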
c
@delightful-barista-90363 Are you able to view the schema for the table in the UI? From the log line
Unable to infer schema for CSV. It must be specified manually.
it seems the issue occurred during schema extraction. Can you confirm whether the given recipe works without profiling? Also, the complete log file, preferably with debug logging enabled, would be very helpful in pinning down the issue.
d
Yeah, the schema gets ingested with or without profiling enabled.
But the profiling statistics can't be ingested.
The error is thrown when Spark tries to open the CSV, because the path it's given is a folder.
I'll send the debug logs in a bit.
c
DataHub treats a non-leaf-level {table} as a partitioned table and tries to read the entire table folder for profiling.
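In other words, profiling truncates the path at the table level, roughly like this (a toy illustration of the behavior, not DataHub's actual code):
```python
# Toy illustration of the truncation described above (not DataHub's actual code).
include = "s3://bucket-name/jordan-test/{table}/20220729/*.csv"
matched_file = "s3://bucket-name/jordan-test/dataset_a/20220729/part-0.csv"

prefix = include.split("{table}")[0]              # "s3://bucket-name/jordan-test/"
table = matched_file[len(prefix):].split("/")[0]  # "dataset_a"
profile_path = prefix + table                     # cut off at the table level

print(profile_path)  # s3://bucket-name/jordan-test/dataset_a  <- what Spark is handed
```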
d
The partition isn't specified, though.
Just {table}.
Here are the debug logs, with company info and non-DataHub logs removed.
It fails specifically at the read.csv call; the path it's trying to read is
<s3://bucket-name/jordan-test/{table}>
So, I was able to get the Stats tab to show up. The file structure we had was
<s3://bucket-name/jordan-test/{table}/20220715/*.csv>
but that extra folder after
{table}
means Spark ends up being handed a folder it can't read.
Even then, the only profiling available was row counts and columns, no statistics.
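For what it's worth, outside of DataHub you can confirm Spark is able to read that nested layout when told to recurse; a manual check, assuming Spark 3.0+ and that your S3 credentials/connector are already configured (this isn't something the profiler does today):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-check").getOrCreate()

# Manual check: with recursiveFileLookup (Spark 3.0+), the table folder from the
# failing log line is readable even though the CSVs sit one date folder down.
# Note: recursiveFileLookup disables partition discovery.
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .csv("s3://bucket-name/jordan-test/dataset_a")  # folder from the failing log line
)
df.show()
```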