Postgres ingestion - allow_deny_pattern Hello! S...
# ingestion
d
Postgres ingestion - allow_deny_pattern Hello! Starting this thread to discussion postgres ingestion profiling Under profiling, what does
allow_deny_pattern
signify?
Copy code
profiling:
      enabled: true 
      allow_deny_patterns:
        allow: 
          - .*
        deny:
          - 
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'
Is that filtering for data within the columns? If so, are there any examples to refer to? I am interested in knowing if those can be regexes to do Luhn algorithm checks.
Thanks in advance!
b
It only profiles datasets that meets the regex. So while ingest will generate datasets that meets regex1, you can profile more narrowly with regex2 in profiling.allow_deny_pattern (provided your regex is narrower)
If regex2 is wider, the profiles get ingested but doesn't show up
d
Thanks @better-orange-49102 So if I were to allow only American Express credit cards (simple regex
^3[47][0-9]{13}$
) and deny every other credit card in a table column from profiling, what would be the syntax for that?
b
Not sure on this, I don't think it's possible to filter, then sample
d
Something like this?
Copy code
profiling:
      enabled: true 
      allow_deny_patterns:
        allow: 
          - ^3[47][0-9]{13}$
        deny:
          - ^(^3[47][0-9]{13}$)
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'
you can't sample on a queried subset, if thats what you mean
m
the profiling allow deny patterns are meant to allow you to select which datasets (tables) you want to be profiled. For example if you have a large warehouse and only want to profile the tables in the
finance
database with the
public
schema (to keep profiling costs and load on operational system low), then you would use the allow_deny patterns. It is not meant to filter the values within the columns themselves.
d
I see, thanks @mammoth-bear-12532 and @better-orange-49102.
@mammoth-bear-12532 , @better-orange-49102 is the granularity of allow_deny at the table level or it can be at the column level if a specification such as below (the table creation script) is used Table
Copy code
CREATE TABLE IF NOT EXISTS public.test_to_exclude_table_columns_from_datahub
(
    id integer NOT NULL,
    val character varying(255) COLLATE pg_catalog."default" NOT NULL,
    time_stamp timestamp without time zone NOT NULL DEFAULT clock_timestamp(),
    secret_value character varying(255) COLLATE pg_catalog."default" NOT NULL,
    secret_value2 character varying(255) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT test_to_exclude_table_columns_from_datahub_pkey PRIMARY KEY (id)
)
Specification for
allow_deny
Copy code
profiling:
      allow_deny_patterns:
        allow: 
         - .*
        deny:
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value*'
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'
OR
Copy code
profiling:
      allow_deny_patterns:
        allow: 
         - .*
        deny:
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'
b
my understanding is that it should be possible to deny specific columns in a table from being sampled
d
To add more details… I created the following three tables
Copy code
dvdrental.public.test_to_exclude_table_from_datahub
dvdrental.public.test_to_exclude_table_from_datahub_2
dvdrental.public.test_to_exclude_table_from_datahub_3
I tried the following configuration/syntax by moving the table deny
- 'dvdrental.public.test_to_exclude_table_from_datahub*'
from above
profiling
to under it. I noticed that tables got ingested (which was expected) however they got profiled as well, as in the images above, which was not expected. Previously the tables did not get ingested.
Copy code
profiling:
        allow_deny_patterns:
        # allow: 
        #  - .*
        deny:
          - 'dvdrental.public.test_to_exclude_table_from_datahub*'
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'
It is suggesting that the deny under
profiling
either needs a different specification or not working as expected AFAICT.
Also, I was expecting min/max/median values populated for the non-textual columns… which is not happening. Here’s the profiling section from my latest iteration:
Copy code
profiling:
      enabled: true # default false
      limit: 
      offset: 
      report_dropped_profiles: False
      turn_off_expensive_profiling_metrics: False
      profile_table_level_only: false # default false
      include_field_null_count: True
      include_field_min_value: True
      include_field_max_value: True
      include_field_mean_value: True
      include_field_median_value: True
      include_field_stddev_value: True
      include_field_quantiles: False
      include_field_distinct_value_frequencies: False
      include_field_histogram: True
      include_field_sample_values: True
      allow_deny_patterns:
        # allow: 
        #  - .*
        deny:
          - 'dvdrental.public.test_to_exclude_table_from_datahub*'
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'
      max_number_of_fields_to_profile: 
      # profile_if_updated_since_days: 1 # BigQuery only                 
      # profile_table_size_limit: 1 # BigQuery only
      # profile_table_row_limit: 50000 # BigQuery only
      max_workers: 10
      query_combiner_enabled: True
      catch_exceptions: True
      partition_profiling_enabled: True
      # bigquery_temp_table_schema: None # BigQuery only
      # partition_datetime: None # BigQuery only
An update on the above (re: profiling). Empirically, I noticed that non-textual columns with the exception of primary key columns are showing min/max/mean/median values.
I would appreciate if someone can guide on
allow_deny_patterns
on profiling. As per my testing I could not get it to work. TY CC @little-megabyte-1074
l
Hi @dazzling-insurance-83303, many apologies for the delayed response here! We have a handful of our teammates on some much-deserved PTO so we’re doing our best to keep up with support 🙂 I’ll escalate this with the team so we can get you some help; most of our ingestion experts are in later timezones & have singed off for the weekend; thanks for your patience!!
d
Happy Friday Maggie. No worries. It’s the long weekend here in Canada. I should probably get my headspace in the weekend mode 😇 Let’s sync up next week. Have a great weekend!
h
Hi @dazzling-insurance-83303, could you change
*
to
.*
and try e.g.:
'dvdrental.public.test_to_exclude_table_from_datahub.*'
?
d
Thanks @helpful-optician-78938. What I have setup is as follows
Copy code
Database : dvdrental
Schema : public
Tables : test_to_exclude_table_from_datahub, test_to_exclude_table_from_datahub_1, test_to_exclude_table_from_datahub_2
I will give it you suggestion a try.
CC @helpful-optician-78938.
g
Yes this is a bug in the config spec - I believe you should use
profile_pattern
instead of the
profiling.allow_deny_patterns
option