Hi all a question about the lookml ingestion we have seen re DataHub #ingestion

Hi all, a question about the lookml ingestion, we ...

thankful-ghost-61888

09/28/2022, 9:02 AM

Hi all, a question about the lookml ingestion, we have seen recurring errors with the the SQL lineage analyzer, it looks like it doesn’t recognise some lookML syntax like the

${EXTENDED}

keyword (docs here). From the ingestion logs:

Copy code

b"2022-09-25 10:05:57,890 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: EXTENDED] instead.' for query: 'SELECT\n"
[2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b"      date_trunc('week',purchase_date) as purchase_date,\n"
[2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      user_id,\n'
[2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      buyer_country,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      count(id) as items_bought_week,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      sum(GMV) as gmv_bought_week,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      min(user_order_sequence_number) as user_order_sequence_number_minweek,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      max(user_order_sequence_number) as user_order_sequence_number_maxweek,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,\n"
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country\n"
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    FROM (EXTENDED)\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    GROUP BY 1,2,3\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get tables\n'
[2022-09-25 10:05:57,894] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get columns\n'

The original lookML sql looks like this:

Copy code

include: "dt_engine_purchasetransaction.view"

view: dt_buyer_transactions_weekly {
  extends: [dt_engine_purchasetransaction]
  derived_table: {
  sql:
    SELECT
      date_trunc('week',purchase_date) as purchase_date,
      user_id,
      buyer_country,
      count(id) as items_bought_week,
      sum(GMV) as gmv_bought_week,
      min(user_order_sequence_number) as user_order_sequence_number_minweek,
      max(user_order_sequence_number) as user_order_sequence_number_maxweek,
      percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,
      percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country
    FROM (${EXTENDED})
    GROUP BY 1,2,3
    ;;
  }

We’re also seeing an issue with parsing a new line after the

FROM

statement:

Copy code

b"2022-09-25 10:06:10,069 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: \n"
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b"] instead.' for query: 'SELECT\n"
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        __d_a_t_e\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , userid\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , query\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , sum(total_searches) AS total_searches\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        FROM\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (SELECT\n'...

The original syntax looks like this:

Copy code

view: product_interaction_searches {
    derived_table: {
      sql_trigger_value: SELECT max(event_date::date) FROM datalake_processed.etl_tracking_searches;;
      distribution: "date"
      sortkeys: ["date"]
      sql:
       SELECT
        date
        , userid
        , query
        , sum(total_searches) AS total_searches
        FROM
            (
            (SELECT
              cast(date as date) AS date
              , userid AS userid
              , query
              , count(*) AS total_searches
            FROM datalake_compacted.mixpanel_tracking_search_results_query_action...

Any thoughts about the above?

modern-artist-55754

09/28/2022, 9:41 AM

Newline and double

, this is the open issue in

sqllineage

https://github.com/reata/sqllineage/issues/278

2 Views

Open in Slack

Previous Next