alert-fall-82501
01/09/2023, 7:04 AM
curved-planet-99787
01/09/2023, 7:37 AM
Why is s3_staging_dir used as a parameter name in the Athena source recipe? To me it is rather unintuitive, and I would expect something like query_result_location, since that is also the term AWS uses. But I'm also interested in what others in the community think about this. I could try to come up with a PR to change this if you agree with my suggestion.
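For reference, this is roughly where the parameter sits in an Athena recipe today (a minimal sketch; the region, workgroup, bucket, and server values are placeholders):

source:
  type: athena
  config:
    aws_region: us-east-1
    work_group: primary
    s3_staging_dir: "s3://example-athena-results/"  # the parameter in question
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"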
01/09/2023, 9:31 AM"'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fdad48c4d90>, 'Connection to <http://pypi.org|pypi.org> timed out. "
"(connect timeout=15)')': /simple/wheel/\n"
'WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by '
"'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fdad48c4f10>, 'Connection to <http://pypi.org|pypi.org> timed out. "
"(connect timeout=15)')': /simple/wheel/\n"
'WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by '
"'ConnectTimeoutError(<pip
best-umbrella-88325
01/09/2023, 9:56 AM
datahub-ingestion-cron:
  enabled: true
  crons:
    s3:
      schedule: "* * * * *" # Every Minute
      recipe:
        configmapName: s3-ingestion
        fileName: s3-ingestion.yaml
  image:
    repository: acryldata/datahub-ingestion
    tag: "v0.9.5"
Config Map:
Data
====
s3-ingestion.yaml:
----
source:
  type: "s3"
  config:
    path_spec:
      include: 's3://*****/datafiles/*.*'
    platform: s3
    aws_config:
      aws_access_key_id: ******
      aws_region: us-west-1
      aws_secret_access_key: *****
sink:
  type: "datahub-rest"
  config:
    server: 'http://datahub-datahub-gms:8080'
BinaryData
====
Events: <none>
Command used to create config map
kubectl create configmap s3-ingestion --from-file=s3-ingestion.yaml
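If the cron fires but nothing lands in DataHub, the usual first step is to inspect the generated CronJob and the logs of its most recent pod (a sketch; pod and job names depend on your release and will differ):

kubectl get cronjobs                  # confirm the s3 ingestion CronJob and its schedule
kubectl get pods --sort-by=.metadata.creationTimestamp  # find the latest ingestion pod
kubectl logs <ingestion-pod-name>     # check the ingestion run output for errors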
refined-tent-35319
01/09/2023, 9:32 AM
salmon-motorcycle-36881
01/09/2023, 11:36 AM
salmon-motorcycle-36881
01/09/2023, 11:43 AM
gorgeous-memory-27579
01/09/2023, 4:45 PM
ambitious-room-6707
01/09/2023, 4:41 PM
adorable-summer-43339
01/10/2023, 1:37 AM
fresh-processor-63024
01/10/2023, 5:34 AM
profiling:
  enabled: true
  limit: 5000
But if I use the limit option, the reported row count is set to the limit count. Is this normal?
microscopic-machine-90437
01/10/2023, 6:26 AM
polite-actor-701
01/10/2023, 7:55 AM
fresh-processor-63024
01/10/2023, 8:10 AM
cool-tiger-42613
01/10/2023, 10:55 AM
rich-policeman-92383
01/10/2023, 10:27 AM
lively-engine-55407
01/10/2023, 12:49 PM
alert-fall-82501
01/10/2023, 1:47 PM
hallowed-lizard-92381
01/10/2023, 4:55 PM
from datahub.ingestion.run.pipeline import Pipeline

USERNAME = ***
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "username": "user",
                "password": "pass",
                "database": "db_name",
                "host_port": "localhost:3306",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()
hallowed-lizard-92381
01/10/2023, 5:00 PM
from datahub.ingestion.run.pipeline import Pipeline

USERNAME = ***
PASS = ***
DB_NAME = ***
HOST = ***

def pipeline1():
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "username": USERNAME,
                    "password": PASS,
                    "database": DB_NAME,
                    "host_port": HOST,
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    # Run the pipeline and report the results.
    pipeline.run()
    pipeline.pretty_print_summary()
Instead it would be nice to have

import json

def run_pipeline(pipeline_str, **kwargs):
    # %-style placeholders are used so the JSON braces don't have to be escaped
    pipeline = Pipeline.create(json.loads(pipeline_str % kwargs))
    pipeline.run()
    pipeline.pretty_print_summary()

Invoked by...

run_pipeline('''{
    "source": {
        "type": "mysql",
        "config": {
            "username": "%(username)s",
            "password": "%(password)s",
            "database": "%(database)s",
            "host_port": "%(host_port)s"
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"}
    }
}''', username=USERNAME, password=PASS, database=DB_NAME, host_port=HOST)

Anybody doing this?
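For comparison, recipe YAML files handle this kind of parameterization natively: DataHub expands ${ENV_VAR} references in recipes, so credentials can come straight from the environment (a sketch; the variable names are placeholders):

source:
  type: mysql
  config:
    username: ${MYSQL_USER}
    password: ${MYSQL_PASSWORD}
    database: ${MYSQL_DB}
    host_port: ${MYSQL_HOST_PORT}
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080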
bland-lighter-26751
01/10/2023, 5:11 PM
source:
  type: bigquery
  config:
    include_table_lineage: true
    include_usage_statistics: true
    include_tables: true
    include_views: true
    profiling:
      enabled: true
      profile_table_level_only: false
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    credential:
      project_id: study-342717
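In case it helps anyone reading along: a recipe like this is normally saved to a file and run with the CLI (a sketch; the filename is a placeholder):

# assuming the recipe above is saved as bigquery-recipe.yaml
datahub ingest -c bigquery-recipe.yaml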
microscopic-carpet-71950
01/10/2023, 5:22 PM
plain-cricket-83456
01/11/2023, 2:38 AM
rich-policeman-92383
01/11/2023, 7:37 AM
transformers:
  - type: "simple_add_dataset_domain"
    config:
      replace_existing: true  # false is default behaviour
      domains:
        - "urn:li:domain:engineering"
        - "urn:li:domain:hr"
astonishing-cartoon-6079
01/11/2023, 8:52 AM
better-orange-49102
01/11/2023, 9:16 AM
{
  "auditHeader": null,
  "entityType": "container",
  "entityUrn": "urn:li:container:19c4d1f6538241d930dba76ede90e9a9",
  "entityKeyAspect": null,
  "changeType": "UPSERT",
  "aspectName": "containerProperties",
  "aspect": {
    "value": "{\"customProperties\": {\"platform\": \"mysql\", \"instance\": \"mycustomMySQL\", \"database\": \"datahub\"}, \"name\": \"datahub\"}",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": 1673423105823,
    "runId": "mysql-2023_01_11-15_45_03",
    "registryName": null,
    "registryVersion": null,
    "properties": null
  }
}
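An MCP like this can also be produced programmatically instead of hand-writing the JSON, e.g. with the Python emitter (a sketch; the GMS address mirrors the defaults used elsewhere in this thread, and the container values copy the example above):

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ContainerPropertiesClass

# Build the containerProperties aspect shown in the JSON above.
aspect = ContainerPropertiesClass(
    name="datahub",
    customProperties={
        "platform": "mysql",
        "instance": "mycustomMySQL",
        "database": "datahub",
    },
)

# Wrap it in a change proposal; entityType and aspectName are inferred.
mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:container:19c4d1f6538241d930dba76ede90e9a9",
    aspect=aspect,
)

# Send it to GMS over REST.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit(mcp)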
refined-hamburger-93459
01/11/2023, 9:57 AM
plain-cricket-83456
01/11/2023, 10:10 AM
magnificent-lock-58916
01/11/2023, 11:03 AM