Hi guys, can you help me with this issue? <https:/...
# ingestion
p
l
Hey there 👋 I'm The DataHub Community Support bot. I'm here to help make sure the community can best support you with your request. Let's double-check a few things first: ✅ There's a lot of good information on our docs site: www.datahubproject.io/docs. Have you searched there for a solution? ✅ It's not uncommon that someone has run into your exact problem before in the community. Have you searched Slack for similar issues? Did you find a solution to your issue? ❌ Sorry you weren't able to find a solution. I'm sending you some tips on info you can provide to help the community troubleshoot. Whenever you feel your issue is solved, please react ✅ to your original message to let us know!
p
can you help me, please?
a
Hi, what’s your desired outcome for this ingestion? Should all of the files come in as a single table?
And what source are you ingesting from here?
We support Parquet from multiple sources, including Google Cloud, S3, and Spark
double-checking the path_spec would also be helpful here, just to rule out a typo
p
Hi Paul, yes, these are the same table, so I expect one table definition for all files, not a dedicated table definition for every file.
I'm ingesting a folder structure with different Parquet "tables", each of which contains multiple Parquet files
The issue still exists
The GCS source works only with Google Cloud Storage
But my files are not on Google Cloud
and Spark is not a DataHub source
so the only possible option is s3
m
Where are your files located?
And can you share your recipe?
d
Sorry, can you elaborate on where your files are and whether you can use the s3 or gcs source? Those sources support path_spec, which groups paths and should solve your individual-file problem.
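For reference, a minimal s3-source recipe with a grouping path_spec might look roughly like this (the bucket name, table layout, and sink address below are illustrative placeholders, not taken from this thread):

```yaml
# Hypothetical sketch of an s3-source recipe; all names/paths are placeholders.
source:
  type: s3
  config:
    path_specs:
      - include: "s3://my-bucket/data/{table}/*.parquet"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```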
p
The files are local, on my local drive.
GCS works only with Google Cloud
I use s3
d
path_specs should work fine with gcs and s3 locations, and if you have the {table} placeholder and the path is correct, then it should be grouped properly and shouldn't ingest individual files
p
this is my path_specs:
path_specs: [{ include: "c:/users/szger/PARQUET/{table}/{partition_key[0]}={partition[0]}/*.parquet", "sample_files": False, "table_name": "parquet_gsz" }]
it still generates individual files
I uploaded the test files in this thread
PARQUET.7z
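As a standalone sketch of what the {table} placeholder above is supposed to do (plain Python for illustration, not DataHub's actual matcher), the placeholder can be read as a capture group so that many partition files collapse to a single table name:

```python
import re

def table_of(path_spec: str, file_path: str):
    """Illustrative only: turn a simplified path_spec into a regex and
    pull out the {table} component, so many files map to one table."""
    pattern = re.escape(path_spec)
    # {table} becomes a named capture group over one path segment.
    pattern = pattern.replace(re.escape("{table}"), r"(?P<table>[^/]+)")
    # Any other {placeholder} matches an arbitrary segment piece.
    pattern = re.sub(r"\\\{[^}]*\\\}", r"[^/]+", pattern)
    # A glob star matches within a single path segment.
    pattern = pattern.replace(re.escape("*"), r"[^/]*")
    m = re.fullmatch(pattern, file_path)
    return m.group("table") if m else None

spec = "c:/users/szger/PARQUET/{table}/{partition_key[0]}={partition[0]}/*.parquet"
# Two different partition files resolve to the same table name:
print(table_of(spec, "c:/users/szger/PARQUET/EMPLOYEE/dt=2021-02-18/part-0.parquet"))
print(table_of(spec, "c:/users/szger/PARQUET/EMPLOYEE/dt=2021-02-28/part-0.parquet"))
```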
d
Please, can you attach your ingestion logs in debug mode as well? Also, what did you mean by "But my files are not on Google Cloud"?
p
The GCS module works only on Google Cloud
How can I execute it in debug mode?
image.png
m
You want to do it on the local file system?
d
datahub --debug ingest wherever it didn't work
Have you tried to run the ingestion for data on GCS? Did it work there?
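Putting the suggestion above together, a full invocation might look like this (the recipe filename is a placeholder, not from this thread):

```shell
# Run the ingestion with global debug logging enabled;
# "recipe.yaml" stands in for your actual recipe file.
datahub --debug ingest -c recipe.yaml
```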
p
[2023-08-29 10:05:37,706] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-02-18\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,711] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-02-28\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,711] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-02-28\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,717] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-04-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,717] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-04-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,722] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-09-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,722] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-09-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,779] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-10-03\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,780] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-10-03\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,786] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-01-01\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,787] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-01-01\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,791] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-02-10\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,792] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-02-10\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
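One detail visible in this log, for what it's worth: the emitted paths mix separators (c:/users/... followed by EMPLOYEE\dt=...\...), and a forward-slash pattern generally won't match backslash segments. A tiny standalone check with plain fnmatch (not DataHub's matcher; the pattern below only mimics the shape of the recipe's include path) shows the effect:

```python
from fnmatch import fnmatch

# Forward-slash glob, shaped like the recipe's include path.
spec = "c:/users/szger/PARQUET/*/dt=*/*.parquet"

posix_path = "c:/users/szger/PARQUET/EMPLOYEE/dt=2021-02-18/part-0.parquet"
mixed_path = "c:/users/szger/PARQUET/EMPLOYEE\\dt=2021-02-18\\part-0.parquet"

print(fnmatch(posix_path, spec))  # separators agree -> matches
print(fnmatch(mixed_path, spec))  # backslash segments never match the "/" literals
```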
d
Can you share the whole log, please?
p
this is the debug log
image.png
I tested it with multiple acryl-datahub versions
d
Is this only an issue on local fs?
p
we tested it only on the local fs; we currently have local storage for Parquet
Were you able to reproduce the error, or spot where the problem is?