Is anyone here using AWS Athena with SQL Profile? ...
# ingestion
a
Is anyone here using AWS Athena with SQL Profile? How are you guys using it? We just can’t find a way around SQL Profiles on big tables in AWS Athena 😞
s
SQL profiling in datahub is currently not suited for big tables. I had to limit the number of rows even in our postgres DB, let alone AWS Athena. For big tables, profiling over time is going to be more useful for our case. I am exploring using great expectations directly for that purpose.
a
I see. Yeah, our team is also exploring that route… we are using aws deequ. Hmm… that’s sad… SQL Profile is such a great feature
l
@helpful-optician-78938 is looking into short-term optimizations to make it run acceptably. Please stay tuned and don't give up yet on sql profiling in datahub 🙂
We're also looking into a more fundamental rewrite since this is an important area of investment for us for data quality related items on the roadmap cc @gray-shoe-75895
b
We would love a deequ-based integration @adventurous-scooter-52064 🙂
s
@adventurous-scooter-52064 I have heard of AWS deequ many times. I'm considering great expectations because it is open source. We have stuff running on both AWS and GCP. I agree sql profile is a good feature. Even if we use great expectations, it doesn't give us a timeseries view over dates for the columns. That should be possible in datahub. While stats per dataset are good for testing, I would prefer a single page per pipeline. I am sure datahub will figure this out as this is on the roadmap. I haven't thought about how pipeline testing might be done with datahub. There is also the concern of test re-use if we rely on datahub's sql profiling: we have to move some pipelines, so we will need to be able to re-use our data tests. Will see. Currently just exploring GE.
h
I've done some exploring of GE myself, and failed to even get off the ground. @square-activity-64562 let me know if you figure out how to use it at any grander scale than in-memory Pandas dataframes. We've been more successful with https://github.com/sodadata/soda-sql