botocore.exceptions.PaginationError: Error duri...
# ingestion
h
botocore.exceptions.PaginationError: Error during pagination: The same next token was received twice: {'Marker': 'dwh/dev/fact/fact_gross_profit/order_date_key_07%3D20230109/part-00018-f1470254-2c8b-4a23-aaad-0260cdca7054.c000.snappy.parquet'}
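A side note on the marker in this error: the `%3D` in `order_date_key_07%3D20230109` is the percent-encoding of `=`, so the object most likely sits under a Hive-style partition directory. The sketch below uses only the Python standard library to decode it. The variable names are illustrative, and the log alone cannot tell whether the stored object name contains a literal `=` or the literal text `%3D`.

```python
from urllib.parse import unquote

# Marker value copied verbatim from the error message above.
marker = (
    "dwh/dev/fact/fact_gross_profit/order_date_key_07%3D20230109/"
    "part-00018-f1470254-2c8b-4a23-aaad-0260cdca7054.c000.snappy.parquet"
)

# Percent-decode the key and pick out the partition directory (5th path segment).
decoded = unquote(marker)
partition_dir = decoded.split("/")[4]
print(partition_dir)  # order_date_key_07=20230109
```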
g
Hi @hundreds-airline-29192, which source are you using? Some info is available on GitHub: https://github.com/aws/aws-cli/issues/3917
h
[2023-05-27 09:05:28,029] ERROR    {datahub.entrypoints:195} - Command failed: Error during pagination: The same next token was received twice: {'Marker': 'dwh/dev/fact/fact_gross_profit/order_date_key_07%3D20230109/part-00018-f1470254-2c8b-4a23-aaad-0260cdca7054.c000.snappy.parquet'}
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/entrypoints.py", line 182, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
    raise e
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
    res = func(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
    return func(ctx, *args, **kwargs)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
    loop.run_until_complete(run_func_check_upgrade(pipeline))
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 158, in run_func_check_upgrade
    ret = await the_one_future
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 149, in run_pipeline_async
    return await loop.run_in_executor(
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_to_completion
    raise e
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 132, in run_pipeline_to_completion
    pipeline.run()
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 359, in run
    for wu in itertools.islice(
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/ingestion/source/gcs/gcs_source.py", line 156, in get_workunits
    yield from auto_workunit_reporter(
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 115, in auto_workunit_reporter
    for wu in stream:
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 42, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 765, in get_workunits
    for file, timestamp, size in file_browser:
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 726, in s3_browser
    for obj in bucket.objects.filter(Prefix=prefix).page_size(PAGE_SIZE):
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/boto3/resources/collection.py", line 81, in __iter__
    for page in self.pages():
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/boto3/resources/collection.py", line 171, in pages
    for page in pages:
  File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/botocore/paginate.py", line 327, in __iter__
    raise PaginationError(message=message)
botocore.exceptions.PaginationError: Error during pagination: The same next token was received twice: {'Marker': 'dwh/dev/fact/fact_gross_profit/order_date_key_07%3D20230109/part-00018-f1470254-2c8b-4a23-aaad-0260cdca7054.c000.snappy.parquet'}
@gentle-hamburger-31302 these are the full logs
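The last frames of the traceback show the paginator giving up because the listing returned the same `Marker` twice in a row. The sketch below illustrates that kind of guard. It is not botocore's actual code: `iterate_pages`, `RepeatedTokenError`, and both fake servers are hypothetical names made up for this example.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One page of results: (keys on this page, marker for the next page or None).
Page = Tuple[List[str], Optional[str]]


class RepeatedTokenError(Exception):
    """Hypothetical stand-in for botocore.exceptions.PaginationError."""


def iterate_pages(fetch: Callable[[Optional[str]], Page]) -> Iterator[List[str]]:
    marker: Optional[str] = None
    while True:
        keys, next_marker = fetch(marker)
        yield keys
        if next_marker is None:
            return  # the server says there are no more pages
        if next_marker == marker:
            # The server handed back the marker we just sent: no progress is possible.
            raise RepeatedTokenError(f"The same next token was received twice: {next_marker!r}")
        marker = next_marker


def good_server(marker: Optional[str]) -> Page:
    # Two pages, then the listing ends.
    pages = {None: (["a", "b"], "b"), "b": (["c"], None)}
    return pages[marker]


def stuck_server(marker: Optional[str]) -> Page:
    # Always answers with the same marker, simulating a backend that never advances.
    return ["part-00018"], "part-00018"


print(list(iterate_pages(good_server)))
try:
    list(iterate_pages(stuck_server))
except RepeatedTokenError as exc:
    print("stopped:", exc)
```

Without such a guard the loop over `stuck_server` would never terminate, which is why the client raises instead of retrying.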
g
@hundreds-airline-29192 how many objects (files) do you have in your bucket?
h
about 100
g
@hundreds-airline-29192 which version of DataHub?
h
0.10.2.3
g
let me try it at my end
h
@gentle-hamburger-31302 if you find anything, please notify me
a
ok
h
@gentle-hamburger-31302 did you find something?
a
trying now
g
It is working fine at my end
h
?
g
@hundreds-airline-29192 Could you please check your aws and boto3 versions?
aws version: aws-cli/2.11.23 Python/3.11.3 Linux/5.19.0-42-generic exe/x86_64.ubuntu.22 prompt/off
boto3: 1.26.143
h
I am using the quickstart and ingesting metadata from Google Cloud Storage.
Why is this related to aws and boto3?
g
In the log that you shared I saw
File "/tmp/datahub/ingest/venv-gcs-0.10.2.3/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 726, in s3_browser
    for obj in bucket.objects.filter(Prefix=prefix).page_size(PAGE_SIZE):
s3/source
Please share your recipe
You can obfuscate your credentials and storage path.
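The traceback shows `gcs_source.py` handing off to `s3/source.py`, which lists objects through boto3. GCS offers an S3-compatible XML API that accepts HMAC keys, which is presumably how the gcs source reuses the S3 code path. The sketch below points boto3 at that endpoint. The helper names and the page size of 1000 are hypothetical, and the endpoint URL is an assumption based on Google's interoperability documentation, not something stated in this thread.

```python
from typing import Dict, Iterator

# Assumed S3-compatible (interoperability) endpoint for GCS.
GCS_S3_ENDPOINT = "https://storage.googleapis.com"


def gcs_boto3_kwargs(hmac_access_id: str, hmac_access_secret: str) -> Dict[str, str]:
    """Build keyword arguments for boto3.resource("s3", ...) (hypothetical helper)."""
    return {
        "endpoint_url": GCS_S3_ENDPOINT,
        "aws_access_key_id": hmac_access_id,
        "aws_secret_access_key": hmac_access_secret,
    }


def list_keys(bucket_name: str, prefix: str, kwargs: Dict[str, str]) -> Iterator[str]:
    # boto3 is imported lazily so the helper above works without it installed.
    import boto3

    bucket = boto3.resource("s3", **kwargs).Bucket(bucket_name)
    # Same call shape as the s3_browser line in the traceback; 1000 is illustrative.
    for obj in bucket.objects.filter(Prefix=prefix).page_size(1000):
        yield obj.key


kwargs = gcs_boto3_kwargs("<hmac_access_id>", "<hmac_access_secret>")
print(kwargs["endpoint_url"])
```

Calling `list_keys` with real HMAC keys outside DataHub would show whether the pagination error also appears with plain boto3.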
h
source:
  type: gcs
  config:
    path_specs:
      - include: 'gs://abc-lakehouse/dwh/dev/*.*'
    credential:
      hmac_access_id:
      hmac_access_secret:
g
@dazzling-judge-80093 Could you please check this issue
h
@gentle-hamburger-31302 is it the same as your recipe?
g
nope my recipe is
source:
  type: s3
  config:
    path_specs:
      - include: "s3://pansurg-curation-raw-open-data/*.*"
    aws_config:
      aws_region: xxxxxxx
      aws_profile: xxxxxxxxx
    env: "PROD"
    profiling:
      enabled: false
h
😞 I am using the gcs source,
not s3,
but maybe it has the same error
f
You said that you got a botocore exception from a gcs ingestion job. No way!
h
@famous-florist-7218 why "no way"? My team is using gcs,
so please explain.
@dazzling-judge-80093 can you help me with this?
d
I have to investigate this as I have never seen this issue before.
h
Yup, I hope you find the problem soon.
@dazzling-judge-80093 did you find the problem?
a
Can you try what happens if you use the aws s3 command to list files from this gcs bucket?
I wonder if this fails only with our API call or even with the vanilla aws command.
h
Is there any other way? I'm not the system admin, so I don't have the permission.
@dazzling-judge-80093 can you provide the full command?
g
aws s3 ls --recursive s3://<bucket-name>/
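As written, this command talks to AWS S3. To aim it at a GCS bucket it would presumably need the CLI's global `--endpoint-url` option and HMAC keys supplied as AWS credentials. The sketch below only assembles such an invocation. The helper name and placeholder values are hypothetical, and the endpoint is an assumption based on Google's S3-interoperability documentation, not confirmed in this thread.

```python
import os
from typing import Dict, List, Tuple


def aws_ls_for_gcs(bucket: str, access_id: str, secret: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for listing a GCS bucket with the aws CLI (hypothetical helper)."""
    argv = [
        "aws", "s3", "ls", "--recursive",
        "--endpoint-url", "https://storage.googleapis.com",  # assumed GCS XML API endpoint
        f"s3://{bucket}/",
    ]
    # The HMAC key pair is passed through the standard AWS credential variables.
    env = dict(os.environ, AWS_ACCESS_KEY_ID=access_id, AWS_SECRET_ACCESS_KEY=secret)
    return argv, env


argv, env = aws_ls_for_gcs("<bucket-name>", "<hmac_access_id>", "<hmac_access_secret>")
print(" ".join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```

If this listing succeeds while the ingestion fails, the problem is more likely in the API call path than in the bucket or the credentials.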
h
@gentle-hamburger-31302 I can't use this command to list files in my gcs bucket
a
OK, I am aware of the S3 command; please check whether an equivalent is available for gcs.
h
In my case, I am using the gsutil ls -r gs://<bucket-name>/ command to list files in the gcs bucket,
and it works
a
fyi.. @dazzling-judge-80093
d
@hundreds-airline-29192 did you set up all these prerequisites? https://datahubproject.io/docs/generated/ingestion/sources/gcs/#prerequisites
h
@dazzling-judge-80093 yes, I have the "Storage Object Viewer" role and I got the HMAC key
@dazzling-judge-80093 did you find anything new?