Hi guys, can you help me with this issue? <https:/...
# ingestion
p
l
Hey there 👋 I'm The DataHub Community Support bot. I'm here to help make sure the community can best support you with your request. Let's double-check a few things first: ✅ There's a lot of good information on our docs site: www.datahubproject.io/docs. Have you searched there for a solution? ✅ It's not uncommon that someone has run into your exact problem before in the community. Have you searched Slack for similar issues? Did you find a solution to your issue? ❌ Sorry you weren't able to find a solution. I'm sending you some tips on info you can provide to help the community troubleshoot. Whenever you feel your issue is solved, please react ✅ to your original message to let us know!
p
can you help me, please?
a
Hi, what’s your desired outcome for this ingestion? Should all of the files come in as a single table?
And what source are you ingesting from here?
We support Parquet from multiple sources, including Google Cloud, S3, and Spark
double-checking the path_spec would also be helpful here, just to rule out a typo
p
Hi Paul, yes, these are the same table, so I expect one table definition for all files, not a dedicated table definition for every file.
I'm ingesting a folder structure with different Parquet "tables", each of which contains multiple Parquet files
The issue still exists
The GCS source works only with Google Cloud Storage
But my files are not on Google Cloud
and Spark is not a DataHub source
so the only possible option is s3
m
Where are your files located?
And can you share your recipe?
d
Sorry, can you elaborate on where your files are and whether you can use the s3 or gcs source? Those sources support path_spec, which groups paths and should solve your individual-file problem.
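For reference, a minimal s3-source recipe with a grouping path_spec might look roughly like this (the bucket name, table layout, and sink address below are illustrative placeholders, not taken from this thread):

```yaml
# Hypothetical sketch of an s3-source recipe; all names/paths are placeholders.
source:
  type: s3
  config:
    path_specs:
      - include: "s3://my-bucket/data/{table}/*.parquet"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```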
p
The files are local, on my local drive.
GCS works only with Google Cloud
I use s3
d
path_specs should work fine with gcs and s3 locations, and if you have the {table} placeholder and the path is correct, then it should be grouped properly and shouldn't ingest individual files
p
this is my path_specs:
path_specs: [{ include: "c:/users/szger/PARQUET/{table}/{partition_key[0]}={partition[0]}/*.parquet", "sample_files": False, "table_name": "parquet_gsz" }]
it still generates individual files
I uploaded the test files in this thread
PARQUET.7z
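As a standalone sketch of what the {table} placeholder above is supposed to do (plain Python for illustration, not DataHub's actual matcher), the placeholder can be read as a capture group so that many partition files collapse to a single table name:

```python
import re

def table_of(path_spec: str, file_path: str):
    """Illustrative only: turn a simplified path_spec into a regex and
    pull out the {table} component, so many files map to one table."""
    pattern = re.escape(path_spec)
    # {table} becomes a named capture group over one path segment.
    pattern = pattern.replace(re.escape("{table}"), r"(?P<table>[^/]+)")
    # Any other {placeholder} matches an arbitrary segment piece.
    pattern = re.sub(r"\\\{[^}]*\\\}", r"[^/]+", pattern)
    # A glob star matches within a single path segment.
    pattern = pattern.replace(re.escape("*"), r"[^/]*")
    m = re.fullmatch(pattern, file_path)
    return m.group("table") if m else None

spec = "c:/users/szger/PARQUET/{table}/{partition_key[0]}={partition[0]}/*.parquet"
# Two different partition files resolve to the same table name:
print(table_of(spec, "c:/users/szger/PARQUET/EMPLOYEE/dt=2021-02-18/part-0.parquet"))
print(table_of(spec, "c:/users/szger/PARQUET/EMPLOYEE/dt=2021-02-28/part-0.parquet"))
```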
d
Please, can you attach your ingestion logs in debug mode as well? Also, what did you mean by "But my files are not on Google Cloud"?
p
The GCS module works only on Google Cloud
How can I execute it in debug mode?
image.png
m
You want to do it on the local file system?
d
datahub --debug ingest wherever it didn't work
Have you tried to run the ingestion for data on GCS? Did it work there?
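Putting the suggestion above together, a full invocation might look like this (the recipe filename is a placeholder, not from this thread):

```shell
# Run the ingestion with global debug logging enabled;
# "recipe.yaml" stands in for your actual recipe file.
datahub --debug ingest -c recipe.yaml
```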
p
[2023-08-29 10:05:37,706] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-02-18\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,711] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-02-28\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,711] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-02-28\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,717] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-04-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,717] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-04-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,722] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-09-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,722] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-09-04\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,779] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-10-03\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,780] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2021-10-03\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,786] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-01-01\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,787] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-01-01\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,791] INFO {datahub.ingestion.source.s3.source:519} - Extracting table schema from file: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-02-10\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
[2023-08-29 10:05:37,792] INFO {datahub.ingestion.source.s3.source:527} - Creating dataset urn with name: c:/users/szger/PARQUET/EMPLOYEE\dt=2022-02-10\part-00000-d9367d37-7df8-40d0-84b8-bd344398ec26.c000.snappy.parquet
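One detail visible in this log, for what it's worth: the emitted paths mix separators (c:/users/... followed by EMPLOYEE\dt=...\...), and a forward-slash pattern generally won't match backslash segments. A tiny standalone check with plain fnmatch (not DataHub's matcher; the pattern below only mimics the shape of the recipe's include path) shows the effect:

```python
from fnmatch import fnmatch

# Forward-slash glob, shaped like the recipe's include path.
spec = "c:/users/szger/PARQUET/*/dt=*/*.parquet"

posix_path = "c:/users/szger/PARQUET/EMPLOYEE/dt=2021-02-18/part-0.parquet"
mixed_path = "c:/users/szger/PARQUET/EMPLOYEE\\dt=2021-02-18\\part-0.parquet"

print(fnmatch(posix_path, spec))  # separators agree -> matches
print(fnmatch(mixed_path, spec))  # backslash segments never match the "/" literals
```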
d
Can you share the whole log, please?
p
this is the debug log
image.png
I tested it with multiple acryl-datahub versions
d
Is this only an issue on local fs?
p
we tested it only on the local fs; we currently have local storage for Parquet
Were you able to reproduce the error, or spot where the problem is?