Hi all, a question about the lookml ingestion, we ...
# ingestion
t
Hi all, a question about the lookml ingestion, we have seen recurring errors with the the SQL lineage analyzer, it looks like it doesn’t recognise some lookML syntax like the
${EXTENDED}
keyword (docs here). From the ingestion logs:
Copy code
b"2022-09-25 10:05:57,890 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: EXTENDED] instead.' for query: 'SELECT\n"
[2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b"      date_trunc('week',purchase_date) as purchase_date,\n"
[2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      user_id,\n'
[2022-09-25 10:05:57,891] {{pod_launcher.py:156}} INFO - b'      buyer_country,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      count(id) as items_bought_week,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      sum(GMV) as gmv_bought_week,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      min(user_order_sequence_number) as user_order_sequence_number_minweek,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'      max(user_order_sequence_number) as user_order_sequence_number_maxweek,\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,\n"
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b"      percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country\n"
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    FROM (EXTENDED)\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'    GROUP BY 1,2,3\n'
[2022-09-25 10:05:57,892] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get tables\n'
[2022-09-25 10:05:57,894] {{pod_launcher.py:156}} INFO - b'2022-09-25 10:05:57,890 ERROR    sql holder not present so cannot get columns\n'
The original lookML sql looks like this:
Copy code
include: "dt_engine_purchasetransaction.view"

view: dt_buyer_transactions_weekly {
  extends: [dt_engine_purchasetransaction]
  derived_table: {
  sql:
    SELECT
      date_trunc('week',purchase_date) as purchase_date,
      user_id,
      buyer_country,
      count(id) as items_bought_week,
      sum(GMV) as gmv_bought_week,
      min(user_order_sequence_number) as user_order_sequence_number_minweek,
      max(user_order_sequence_number) as user_order_sequence_number_maxweek,
      percent_rank() over (partition by date_trunc('week',purchase_date) order by items_bought_week, sum(GMV)) as rank_items_bought,
      percent_rank() over (partition by date_trunc('week',purchase_date),buyer_country order by items_bought_week, sum(GMV)) as rank_items_bought_country
    FROM (${EXTENDED})
    GROUP BY 1,2,3
    ;;
  }
We’re also seeing an issue with parsing a new line after the
FROM
statement:
Copy code
b"2022-09-25 10:06:10,069 ERROR    SQL lineage analyzer error 'An Identifier is expected, got Token[value: \n"
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b"] instead.' for query: 'SELECT\n"
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        __d_a_t_e\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , userid\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , query\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        , sum(total_searches) AS total_searches\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'        FROM\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (\n'
[2022-09-25 10:06:10,070] {{pod_launcher.py:156}} INFO - b'            (SELECT\n'...
The original syntax looks like this:
Copy code
view: product_interaction_searches {
    derived_table: {
      sql_trigger_value: SELECT max(event_date::date) FROM datalake_processed.etl_tracking_searches;;
      distribution: "date"
      sortkeys: ["date"]
      sql:
       SELECT
        date
        , userid
        , query
        , sum(total_searches) AS total_searches
        FROM
            (
            (SELECT
              cast(date as date) AS date
              , userid AS userid
              , query
              , count(*) AS total_searches
            FROM datalake_compacted.mixpanel_tracking_search_results_query_action...
Any thoughts about the above?
m
Newline and double
(
, this is the open issue in
sqllineage
https://github.com/reata/sqllineage/issues/278