# troubleshoot
w
Hi! we are in the process of upgrading to `0.8.33` and we have found this exception recurring quite often across different connectors:
[2022-04-21 09:40:45,039] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
Any idea what it could be?
d
which ingestion source had this issue?
w
This log is for snowflake. Also in the hive one. And we have also found this in some custom connectors.
Redshift connector too.
I have executed the snowflake connector using the `--debug` flag:
[2022-04-21 10:15:46,579] DEBUG    {datahub.ingestion.source.sql.snowflake:481} - Upstream lineage of 'avalanche_dev.dwh_bridge.b_xiti_traffic': ['urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_core_green.f_xiti_daily_by_level2,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_core_green.f_xiti_daily_by_site,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_core_green.f_xiti_monthly_by_level2,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_core_green.f_xiti_monthly_by_site,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_core_green.f_xiti_weekly_by_level2,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_core_green.f_xiti_weekly_by_site,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_parameter.lu_vertical,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_parameter.p_xiti_daily_by_level2_corrections,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_parameter.p_xiti_daily_by_site_corrections,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_parameter.p_xiti_monthly_by_level2_corrections,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_parameter.p_xiti_monthly_by_site_corrections,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_parameter.p_xiti_weekly_by_level2_corrections,DEV)', 'urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_parameter.p_xiti_weekly_by_site_corrections,DEV)']
[2022-04-21 10:15:46,626] INFO     {datahub.ingestion.run.pipeline:84} - sink wrote workunit snowflake-urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_bridge.b_xiti_traffic,DEV)-upstreamLineage
[2022-04-21 10:15:46,676] INFO     {datahub.ingestion.run.pipeline:84} - sink wrote workunit avalanche_dev.dwh_bridge.b_xiti_traffic
[2022-04-21 10:15:46,722] INFO     {datahub.ingestion.run.pipeline:84} - sink wrote workunit avalanche_dev.dwh_bridge.b_xiti_traffic-subtypes
2022-04-21 10:15:46,722 INFO sqlalchemy.engine.base.Engine SHOW /* sqlalchemy:get_view_names */ VIEWS IN dwh_bridge
[2022-04-21 10:15:46,722] INFO     {sqlalchemy.engine.base.Engine:110} - SHOW /* sqlalchemy:get_view_names */ VIEWS IN dwh_bridge
2022-04-21 10:15:46,722 INFO sqlalchemy.engine.base.Engine {}
[2022-04-21 10:15:46,722] INFO     {sqlalchemy.engine.base.Engine:110} - {}
[2022-04-21 10:15:46,857] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 10:15:46,858] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 10:15:46,858] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 10:15:46,859] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
2022-04-21 10:15:46,859 INFO sqlalchemy.engine.base.Engine SHOW /* sqlalchemy:get_table_names */ TABLES IN dwh_core_ad_content_red
[2022-04-21 10:15:46,859] INFO     {sqlalchemy.engine.base.Engine:110} - SHOW /* sqlalchemy:get_table_names */ TABLES IN dwh_core_ad_content_red
2022-04-21 10:15:46,859 INFO sqlalchemy.engine.base.Engine {}
[2022-04-21 10:15:46,859] INFO     {sqlalchemy.engine.base.Engine:110} - {}
2022-04-21 10:15:46,968 INFO sqlalchemy.engine.base.Engine SHOW /* sqlalchemy:_get_schema_primary_keys */PRIMARY KEYS IN SCHEMA avalanche_dev.dwh_core_ad_content_red
2022-04-21 10:15:46,968 INFO sqlalchemy.engine.base.Engine {}
[2022-04-21 10:15:46,968] INFO     {sqlalchemy.engine.base.Engine:110} - SHOW /* sqlalchemy:_get_schema_primary_keys */PRIMARY KEYS IN SCHEMA avalanche_dev.dwh_core_ad_content_red
[2022-04-21 10:15:46,968] INFO     {sqlalchemy.engine.base.Engine:110} - {}
2022-04-21 10:15:47,071 INFO sqlalchemy.engine.base.Engine
            SELECT /* sqlalchemy:_get_schema_columns */
Not sure if this can tell you where it comes from 😅
m
@witty-butcher-82399 can you check the summary_report (usually at the end of the run) and see if there’s any other useful information? This is a catch-all try/except block in the outermost call.
I suspect you have some permission issue. Can you run `SHOW /* sqlalchemy:get_view_names */ VIEWS IN dwh_bridge` in your Snowflake using the same account you use for DataHub ingestion?
w
Hi @modern-artist-55754! I haven’t found anything specific about the error in the summary report. Just in case, I have shared the full log for the job in a DM with you.
I don’t have network access from local, so I cannot debug from my machine; however, I was able to set up a SQLAlchemy client from my pod instance:
>>> from sqlalchemy import create_engine
>>> engine = create_engine('snowflake://XXXX:YYY@ZZZZ')
>>> connect = engine.connect()
>>> results = connect.execute("SHOW /* sqlalchemy:get_view_names */ VIEWS IN dwh_bridge").fetchone()
>>> print(results)
None
and got no error when running the SHOW VIEWS, so there is no permission issue
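Worth noting about the check above: `fetchone()` returning `None` only proves the statement did not raise; it does not say whether any rows came back. A stdlib sketch (using `sqlite3` as a stand-in for the Snowflake connection) of the difference:

```python
import sqlite3

# In-memory database standing in for Snowflake; the empty table plays
# the role of a schema in which the current role sees no views.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (name TEXT)")

# fetchone() on an empty result set returns None without raising...
row = conn.execute("SELECT name FROM views").fetchone()
print(row)  # None

# ...so fetchall() is the clearer check: it distinguishes "no error,
# zero rows" from an actual permission failure (which would raise).
rows = conn.execute("SELECT name FROM views").fetchall()
print(len(rows))  # 0
```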
Failed to extract some records due to: 'NoneType' object has no attribute 'group'
This error looks to me like calling the `group` method on the result of a regular-expression match (the `NoneType` suggests there was no match).
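That failure mode is easy to reproduce in isolation. A minimal sketch (using the same dataset-URN regex that shows up later in the trace, not the actual connector code):

```python
import re

# re.search returns a Match object on success and None on failure.
match = re.search(r"(^urn:li:dataset:)\(([^)]+)\)",
                  "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,DEV)")
print(match.group(2))  # urn:li:dataPlatform:snowflake,db.schema.table,DEV

# With a non-matching input, re.search returns None, and calling
# .group on it raises the exact error seen in the logs.
no_match = re.search(r"(^urn:li:dataset:)\(([^)]+)\)", "urn:li:container:abc123")
try:
    no_match.group(2)
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'group'
```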
From the pod instance, I also tried to run with a debugger and set breakpoints, but I couldn’t make it stop at the error. I don’t know if anyone has been able to do something similar.
datahub@demo-ingestion-snowflake-willhaben-manual-wgn-r2dwn:/$ /usr/local/bin/python -m pdb /usr/local/bin/datahub ingest -c /etc/recipe/recipe.yaml 
> /usr/local/bin/datahub(3)<module>()
-> import re
(Pdb) b /datahub-ingestion/src/datahub/ingestion/run/pipeline.py:210
Breakpoint 1 at /datahub-ingestion/src/datahub/ingestion/run/pipeline.py:210
(Pdb) b /datahub-ingestion/build/lib/datahub/ingestion/run/pipeline.py:210
Breakpoint 2 at /datahub-ingestion/build/lib/datahub/ingestion/run/pipeline.py:210
(Pdb) c
[2022-04-21 13:58:15,270] INFO     {datahub.cli.ingest_cli:96} - DataHub CLI version: 0.8.33.post1.dev0+b84ccb6
[2022-04-21 13:58:20,916] INFO     {datahub.ingestion.source_config.sql.snowflake:107} - using authenticator type 'DEFAULT_AUTHENTICATOR'
/usr/local/lib/python3.8/site-packages/datahub/ingestion/transformer/add_dataset_browse_path.py:33: DeprecationWarning: Call to deprecated class DatasetTransformer. (Legacy transformer that supports transforming MCE-s using transform_one method. Use BaseTransformer directly and implement the transform_aspect method)
  return cls(config, ctx)
/usr/local/lib/python3.8/site-packages/datahub/ingestion/transformer/add_dataset_ownership.py:174: DeprecationWarning: Call to deprecated class DatasetTransformer. (Legacy transformer that supports transforming MCE-s using transform_one method. Use BaseTransformer directly and implement the transform_aspect method)
  return cls(config, ctx)
[2022-04-21 13:58:21,057] INFO     {datahub.cli.ingest_cli:112} - Starting metadata ingestion
[2022-04-21 13:58:21,062] INFO     {datahub.ingestion.source.sql.snowflake:89} - Checking current version
[2022-04-21 13:58:23,923] INFO     {datahub.ingestion.source.sql.snowflake:106} - Current role is META_DATA_READER
[2022-04-21 13:58:23,923] INFO     {datahub.ingestion.source.sql.snowflake:110} - Checking grants for role META_DATA_READER
[2022-04-21 13:58:33,597] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,598] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,599] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,681] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,683] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,684] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,685] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,875] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,877] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,879] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:33,881] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
[2022-04-21 13:58:36,485] INFO     {datahub.ingestion.run.pipeline:84} - sink wrote workunit container-urn:li:container:fdb05ecbd6619a97ff103dafe85caf0b-to-urn:li:dataset:(urn:li:dataPlatform:snowflake,avalanche_dev.dwh_bridge.b_ad_active_ads,DEV)
[2022-04-21 13:58:44,387] INFO     {datahub.ingestion.source.sql.snowflake:406} - A total of 12395 Table->Table edges found for 4102 downstream tables.
^C
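A possible reason the breakpoints never fired (an assumption; the session above sets them on two candidate paths) is that neither file is the copy of `pipeline.py` the interpreter actually imported. Printing the imported module's `__file__` first gives the exact path to use with `b`; a generic sketch with a stdlib module:

```python
import importlib

# Resolve the path the interpreter actually loaded the module from, so a
# pdb breakpoint like `b <that path>:210` targets the code that really
# runs. (json stands in here; for the real case you would import
# datahub.ingestion.run.pipeline instead.)
mod = importlib.import_module("json")
print(mod.__file__)
```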
🎉 I managed to find the error with `python -m trace`!
It seems the problem is a custom transform that we apply in most of our recipes; for some reason it is failing with the new version. CC: @quick-pizza-8906
add_custom_dataplatform.py(77):     full_name = result.group(2)
Sharing the trick here in case someone is in a similar situation. Thanks @modern-artist-55754 @dazzling-judge-80093 for the support, and sorry for the false alarm.
datahub@demo-ingestion-snowflake-willhaben-manual-77v-fxsj6:/$ python -m trace -t /usr/local/bin/datahub ingest -c /etc/recipe/recipe.yaml | grep -C 10 "Failed to extract some records due to:"
[2022-04-21 14:13:54,824] INFO     {datahub.cli.ingest_cli:96} - DataHub CLI version: 0.8.33.post1.dev0+b84ccb6
[2022-04-21 14:14:31,074] INFO     {datahub.ingestion.source_config.sql.snowflake:107} - using authenticator type 'DEFAULT_AUTHENTICATOR'
/usr/local/lib/python3.8/site-packages/datahub/ingestion/transformer/add_dataset_browse_path.py:33: DeprecationWarning: Call to deprecated class DatasetTransformer. (Legacy transformer that supports transforming MCE-s using transform_one method. Use BaseTransformer directly and implement the transform_aspect method)
  return cls(config, ctx)
/usr/local/lib/python3.8/site-packages/datahub/ingestion/transformer/add_dataset_ownership.py:174: DeprecationWarning: Call to deprecated class DatasetTransformer. (Legacy transformer that supports transforming MCE-s using transform_one method. Use BaseTransformer directly and implement the transform_aspect method)
  return cls(config, ctx)
[2022-04-21 14:14:31,321] INFO     {datahub.cli.ingest_cli:112} - Starting metadata ingestion
[2022-04-21 14:14:31,375] INFO     {datahub.ingestion.source.sql.snowflake:89} - Checking current version
%6|1650550495.245|FAIL|rdkafka#producer-1| [thrd:sasl_ssl://kafka-rapidpaper-internal.storage.mpi-internal.com:9]: sasl_ssl://kafka-rapidpaper-internal.storage.mpi-internal.com:9094/bootstrap: Disconnected (after 59623ms in state UP)
%6|1650550495.246|FAIL|rdkafka#producer-2| [thrd:sasl_ssl://kafka-rapidpaper-internal.storage.mpi-internal.com:9]: sasl_ssl://kafka-rapidpaper-internal.storage.mpi-internal.com:9094/bootstrap: Disconnected (after 59619ms in state UP)
[2022-04-21 14:14:55,966] INFO     {datahub.ingestion.source.sql.snowflake:106} - Current role is META_DATA_READER
[2022-04-21 14:14:55,967] INFO     {datahub.ingestion.source.sql.snowflake:110} - Checking grants for role META_DATA_READER
enum.py(635):         if type(value) is cls:
enum.py(640):         try:
enum.py(641):             return cls._value2member_map_[value]
re.py(306):         if len(_cache) >= _MAXCACHE:
re.py(308):             try:
re.py(309):                 del _cache[next(iter(_cache))]
re.py(312):         _cache[type(pattern), pattern, flags] = p
re.py(313):     return p
add_custom_dataplatform.py(77):     full_name = result.group(2)
pipeline.py(209):             except Exception as e:
pipeline.py(210):                 logger.error(f"Failed to extract some records due to: {e}")
 --- modulename: __init__, funcname: error
__init__.py(1474):         if self.isEnabledFor(ERROR):
 --- modulename: __init__, funcname: isEnabledFor
__init__.py(1693):         if self.disabled:
__init__.py(1696):         try:
__init__.py(1697):             return self._cache[level]
__init__.py(1698):         except KeyError:
__init__.py(1699):             _acquireLock()
 --- modulename: __init__, funcname: _acquireLock
__init__.py(224):     if _lock:
[2022-04-21 14:15:24,773] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
--
 --- modulename: add_custom_dataplatform, funcname: extract_dataset_from_urn
add_custom_dataplatform.py(76):     result = re.search(r"(^urn:li:dataset:)\(([^)]+)\)", urn)
 --- modulename: re, funcname: search
re.py(201):     return _compile(pattern, flags).search(string)
 --- modulename: re, funcname: _compile
re.py(291):     if isinstance(flags, RegexFlag):
re.py(293):     try:
re.py(294):         return _cache[type(pattern), pattern, flags]
add_custom_dataplatform.py(77):     full_name = result.group(2)
pipeline.py(209):             except Exception as e:
pipeline.py(210):                 logger.error(f"Failed to extract some records due to: {e}")
 --- modulename: __init__, funcname: error
__init__.py(1474):         if self.isEnabledFor(ERROR):
 --- modulename: __init__, funcname: isEnabledFor
__init__.py(1693):         if self.disabled:
__init__.py(1696):         try:
__init__.py(1697):             return self._cache[level]
__init__.py(1475):             self._log(ERROR, msg, args, **kwargs)
 --- modulename: __init__, funcname: _log
__init__.py(1571):         sinfo = None
__init__.py(1572):         if _srcfile:
[2022-04-21 14:15:24,783] ERROR    {datahub.ingestion.run.pipeline:210} - Failed to extract some records due to: 'NoneType' object has no attribute 'group'
b
cc @square-activity-64562 for visibility on Snowflake
s
Ack. This looks like the regex has a problem.
```
 --- modulename: add_custom_dataplatform, funcname: extract_dataset_from_urn
add_custom_dataplatform.py(76):     result = re.search(r"(^urn:li:dataset:)\(([^)]+)\)", urn)
```
I will try to see if I can add some unit tests to reliably reproduce it and change the regex
@witty-butcher-82399 I missed that this was in your custom transformer. In that case just adding a logger before that regex to see the input and adding unit tests for that should help out.
Let us know if you find incorrect data being created by the connectors.
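The advice above (log the input before the regex, plus unit tests) might look like the sketch below. `extract_dataset_from_urn` and the regex are taken from the trace earlier in the thread; the logging line, the `None` guard, and the tests are illustrative additions, not the actual transformer code.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)
DATASET_URN_RE = re.compile(r"(^urn:li:dataset:)\(([^)]+)\)")

def extract_dataset_from_urn(urn: str) -> Optional[str]:
    # Log the input first, so URNs that don't match show up in debug output.
    logger.debug("extracting dataset from urn: %s", urn)
    result = DATASET_URN_RE.search(urn)
    if result is None:
        # e.g. container URNs, which appear in the ingestion log above;
        # returning None here avoids the 'NoneType' .group crash.
        return None
    return result.group(2)

# Minimal unit tests: one matching dataset URN, one non-dataset URN
# that reproduces the original failure mode.
def test_dataset_urn_matches() -> None:
    urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)"
    assert extract_dataset_from_urn(urn) == "urn:li:dataPlatform:hive,db.table,PROD"

def test_non_dataset_urn_returns_none() -> None:
    assert extract_dataset_from_urn("urn:li:container:abc") is None

test_dataset_urn_matches()
test_non_dataset_urn_returns_none()
```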