Title
#shameless-plugs
Christian Persson

Christian Persson

09/29/2021, 10:17 AM
Hi! Not sure this is the right channel, maybe we should have a channel for general data engineering discussions that might be unrelated to airbyte but interesting for the community. I'm toying around with a query engine/data engineering toolbox. One operation that is interesting is to merge two data sets, (similar to the case when you have a (airbyte) destination and insert new data to it, depending on the dedup mode). The canonical example is to update a record if newer data arrives in the sync stream. One way to express that operation with SQL would be to create something like this, that is both cumbersome and error-prone to type:
WITH merged_stream AS (
  SELECT id, 0 updated_at, expression1, expression2, ... FROM old_table
  UNION ALL
  SELECT id, 1 updated_at, expression1, expression2, ... FROM new_table
)

SELECT id, FIRST(expression1), FIRST(expression2),... FROM
(
	SELECT id, expression1, expression2, ... 
	FROM merged_stream
	ORDER BY id, updated_at
) tmp
GROUP BY id
Wouldn't it be nice if you instead could do something like this instead?
WITH merged_stream AS (
  SELECT id, 0 updated_at, expression1, expression2, ... FROM old_table
  UNION ALL
  SELECT id, 1 updated_at, expression1, expression2, ... FROM new_table
)

SELECT * FROM
merged_stream
DEDUP BY id ON DUPLICATE KEY (TAKE LAST RECORD | TAKE FIRST RECORD | COALESCE | PANIC ) ORDERED BY updated_at
Do you know if this kind of construction exist somewhere already? Or do you know a more elegant way to express it with known SQL (instead of the GROUP BY example).
11:21 AM
postgres have
DISTINCT ON
, that is close enough to what I'am looking for I think.https://www.geekytidbits.com/postgres-distinct-on/
d

Deividas J

09/29/2021, 11:36 AM
distinct on
does the job
Christian Persson

Christian Persson

09/29/2021, 11:36 AM
WITH merged_stream AS (
  SELECT id, 0 updated_at, expression1, expression2, ... FROM old_table
  UNION ALL
  SELECT id, 1 updated_at, expression1, expression2, ... FROM new_table
)

SELECT DISTINCT ON(id) * FROM
merged_stream
ORDER BY id, updated_at DESC
would be the postgres equivalent to the
TAKE LAST RECORD
and
TAKE FIRST RECORD
modes. Nice.
m

Marco Fontana

09/29/2021, 1:30 PM
you can use
RANK
function