Hi team! Could you please confirm how DataHub obtains profiling stats for BigQuery under the hood? Does it query each table in BigQuery to compute its statistics, or does it obtain them directly from logs?
cc. @acceptable-potato-35922
dazzling-judge-80093
02/25/2022, 5:56 PM
We query the tables directly, but there are some optimisations:
• Running approx queries wherever possible
• Profiling only the latest partition for partitioned/sharded tables
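For the sharded-table case, a minimal sketch of how the latest shard could be picked, assuming shards follow the usual `_YYYYMMDD` suffix convention. The function name `latest_shards` and the table names in the usage note are hypothetical; this is an illustration, not DataHub's actual code.

```python
import re
from typing import Dict, Iterable

# Date-sharded BigQuery tables conventionally end in _YYYYMMDD.
SHARD_RE = re.compile(r"^(?P<base>.+)_(?P<date>\d{8})$")


def latest_shards(table_names: Iterable[str]) -> Dict[str, str]:
    """Map each base name to its most recent shard.

    Tables without a date suffix map to themselves.
    """
    latest: Dict[str, str] = {}
    for name in table_names:
        match = SHARD_RE.match(name)
        base = match.group("base") if match else name
        # Shards of one base share a prefix, and YYYYMMDD suffixes
        # sort chronologically as plain strings.
        if base not in latest or name > latest[base]:
            latest[base] = name
    return latest
```

Calling `latest_shards(["events_20220224", "events_20220225", "users"])` would leave only `events_20220225` and `users` to be profiled.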
dazzling-judge-80093
02/25/2022, 5:57 PM
Do you happen to know what stats BigQuery can provide?
gifted-queen-80042
02/25/2022, 7:49 PM
Thanks @dazzling-judge-80093! As far as I know, bq can get schema-related information. Not sure it can get stats like row & column counts directly without querying the table. That's where profiling would come in, right?
dazzling-judge-80093
02/25/2022, 7:50 PM
Yes, exactly, we collect distinct count, sample values, min/max values from columns etc…
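Roughly, a per-column profiling query could look like the sketch below. It uses BigQuery's `APPROX_COUNT_DISTINCT`, `MIN`, `MAX` and `ARRAY_AGG`; the helper name `column_profile_sql`, the `sample_size` parameter and the output aliases are assumptions for illustration, not the queries DataHub actually issues.

```python
def column_profile_sql(table: str, column: str, sample_size: int = 5) -> str:
    """Build one BigQuery query returning basic stats for a single column."""
    return (
        "SELECT "
        # Approximate distinct count is cheaper than COUNT(DISTINCT ...).
        f"APPROX_COUNT_DISTINCT(`{column}`) AS distinct_count, "
        f"MIN(`{column}`) AS min_value, "
        f"MAX(`{column}`) AS max_value, "
        # Keep a handful of non-null values as samples.
        f"ARRAY_AGG(`{column}` IGNORE NULLS LIMIT {sample_size}) AS sample_values "
        f"FROM `{table}`"
    )
```

For a partitioned table, a `WHERE` clause restricting the scan to the latest partition would be appended to the same query.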