Can someone give a sample recipe.yaml where you sp...
# ingestion
b
Can someone give a sample recipe.yaml where you specify both paths to include and to exclude? I am trying to ingest data from s3 data lake.
h
Hi @bumpy-journalist-41369. For an s3 bucket with a structure similar to this:
• test-bucket
  ◦ food
    ▪︎ part1.csv
    ▪︎ part2.csv
  ◦ games
    ▪︎ part1.csv
    ▪︎ part2.csv
  ◦ archive
    ▪︎ a.csv
    ▪︎ b.csv
To ingest food and games as datasets but ignore the archive folder, the recipe would look like:
source:
  type: "s3"
  config:
    path_spec: 
      include: "s3://test-bucket/{table}/*.*"
      exclude:
        - "**/archive/**"
    aws_config:
      aws_access_key_id: accessKey
      aws_secret_access_key: secretKey
      aws_region: us-east-2

sink:
....sink configs...
b
Thanks for the response. In my case I have tables named foo and foo-test-{unique_id} at the same level. Is there a way to include only the paths for foo but exclude foo-test-{unique_id}? My current recipe looks like this:
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
source:
  type: s3
  config:
    profiling:
      enabled: false
    path_spec:
      include: 's3://cdca-dev-us-east-1-product-metrics/{table}/sh_date={partition[0]}/*.parquet'
    env: DEV
    aws_config:
      aws_region: us-east-1
h
I believe you can configure it like this:
source:
    type: s3
    config:
        profiling:
            enabled: false
        path_spec:
            include: 's3://cdca-dev-us-east-1-product-metrics/{table}/sh_date={partition[0]}/*.parquet'
            exclude: 
                - '**/foo-test-*/**'
        env: DEV
        aws_config:
            aws_region: us-east-1
b
Thanks again. Would it work like this: '**/{table}-test*/**'? I want to exclude all {table}-test-{id} folders.
h
No, but you can use the pattern below to exclude all folders with -test- in their name.
exclude:
    - 's3://cdca-dev-us-east-1-product-metrics/*-test-*/**'
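For reference, here is how that exclude pattern might slot into the full recipe from the question. This is only a sketch: it reuses the bucket name, include path, and sink settings posted earlier in the thread, and it has not been checked against a specific DataHub version, so the exact matching behaviour is an assumption.

```yaml
# Sketch only: the include path and sink come from the question above,
# the exclude pattern from the last reply. Not verified against a
# specific DataHub version.
source:
  type: s3
  config:
    profiling:
      enabled: false
    path_spec:
      include: 's3://cdca-dev-us-east-1-product-metrics/{table}/sh_date={partition[0]}/*.parquet'
      exclude:
        # Intended to skip every top-level folder with "-test-" in its name,
        # such as foo-test-{unique_id}, while keeping foo.
        - 's3://cdca-dev-us-east-1-product-metrics/*-test-*/**'
    env: DEV
    aws_config:
      aws_region: us-east-1

sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
```

Checking the ingestion output on a small run first is a cheap way to confirm that only foo shows up as a dataset.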