Hi Not sure this is the right channel maybe we should have a Airbyte #releases

Hi! Not sure this is the right channel, maybe we s...

Christian Persson

09/29/2021, 10:17 AM

Hi! Not sure this is the right channel, maybe we should have a channel for general data engineering discussions that might be unrelated to airbyte but interesting for the community. I'm toying around with a query engine/data engineering toolbox. One operation that is interesting is to merge two data sets, (similar to the case when you have a (airbyte) destination and insert new data to it, depending on the dedup mode). The canonical example is to update a record if newer data arrives in the sync stream. One way to express that operation with SQL would be to create something like this, that is both cumbersome and error-prone to type:

Copy code

WITH merged_stream AS (
  SELECT id, 0 updated_at, expression1, expression2, ... FROM old_table
  UNION ALL
  SELECT id, 1 updated_at, expression1, expression2, ... FROM new_table
)

SELECT id, FIRST(expression1), FIRST(expression2),... FROM
(
	SELECT id, expression1, expression2, ... 
	FROM merged_stream
	ORDER BY id, updated_at
) tmp
GROUP BY id

Wouldn't it be nice if you instead could do something like this instead?

Copy code

WITH merged_stream AS (
  SELECT id, 0 updated_at, expression1, expression2, ... FROM old_table
  UNION ALL
  SELECT id, 1 updated_at, expression1, expression2, ... FROM new_table
)

SELECT * FROM
merged_stream
DEDUP BY id ON DUPLICATE KEY (TAKE LAST RECORD | TAKE FIRST RECORD | COALESCE | PANIC ) ORDERED BY updated_at

Do you know if this kind of construction exist somewhere already? Or do you know a more elegant way to express it with known SQL (instead of the GROUP BY example).

Christian Persson

09/29/2021, 11:21 AM

postgres have

DISTINCT ON

, that is close enough to what I'am looking for I think. https://www.geekytidbits.com/postgres-distinct-on/

Deividas J

09/29/2021, 11:36 AM

distinct on

does the job

👍 1

Christian Persson

09/29/2021, 11:36 AM

Copy code

WITH merged_stream AS (
  SELECT id, 0 updated_at, expression1, expression2, ... FROM old_table
  UNION ALL
  SELECT id, 1 updated_at, expression1, expression2, ... FROM new_table
)

SELECT DISTINCT ON(id) * FROM
merged_stream
ORDER BY id, updated_at DESC

would be the postgres equivalent to the

TAKE LAST RECORD

and

TAKE FIRST RECORD

modes. Nice.

👍 1

Marco Fontana

09/29/2021, 1:30 PM

you can use

RANK

function

4 Views

Open in Slack

Previous Next