# ask-community-for-troubleshooting
d
hey everyone 🙂 I'm trying to set up my custom DBT transformations for my connection, but I haven't been able to get it working. I'm running Airbyte on Kubernetes. This is my transformations project: https://github.com/delucca-workspaces/analytics/tree/feat/user-access/dbt This is the repository URL I'm using inside Airbyte: https://github.com/delucca-workspaces/analytics.git This is the branch name I'm using in my custom DBT transformer inside Airbyte:
feat/user-access
This is the command I'm using for DBT inside Airbyte:
run --project-dir dbt
(if anyone needs it, this is the link directly to the branch: https://github.com/delucca-workspaces/analytics/tree/feat/user-access) I've attached the logs. In a nutshell, it fails after a few minutes. I've tried running
dbt run --project-dir dbt
locally (inside the root of my repository) and it works. It only fails in Airbyte
u
@Daniel it looks like the normalization container didn't find the
dbt_project.yml
file
d
@[DEPRECATED] Marcos Marx Thanks for your answer 🙂 Going through Airbyte's logs, I found out that it always passes its own
--project-dir
to the run command. So my manual flag was simply ignored. IMHO there should be a way to change the project-dir, since my repo has the DBT project inside a subdir, not the root dir. Anyway, I moved my project file to the root and added custom paths in it for the models and the rest. It's not optimal, but it was the only way to fix it. I'm running the job again to see if it worked
@[DEPRECATED] Marcos Marx for some reason it is still not working. I've tried both commands that Airbyte runs (
git clone branch
and
dbt run
) and locally they both work. Here are the logs
it seems to be missing the
dbt_project.yml
file, but it is already in the root of my repo
and the cloned repo seems to be up to date. Check the chunk where Airbyte logs the latest commits:
2021-07-28 21:20:14 INFO  Last 5 commits in git_repo:
2021-07-28 21:20:14 INFO  e7ada0c fix(dbt): moving dbt_project since Airbyte has issues
2021-07-28 21:20:14 INFO  943da2c bugfix: typo
2021-07-28 21:20:14 INFO  708198c feat(dbt): adds user_acesses fact
2021-07-28 21:20:14 INFO  c3af0e5 feat(dbt): adds last_access_time to dim__user
2021-07-28 21:20:14 INFO  c0759ed feat(dbt): adds basic structure
those are indeed the latest commits from that repo. The latest one is where I moved the
dbt_project.yml
file from
./dbt
to the root
@[DEPRECATED] Marcos Marx do you want me to move this thread to #C01MFR03D5W? Sorry about that, I missed the channel for some reason
c
the
--project-dir
is always used, but if you specify it in the arguments, it'll take precedence over the one that Airbyte usually supplies: https://github.com/airbytehq/airbyte/blob/master/airbyte-workers/src/main/resources/dbt_transformation_entrypoint.sh
can you try to run it with:
--project-dir=git_repo/dbt
? i think the dbt command is run from the folder one level up from your git repo… it'd probably be more intuitive if it did
cd git_repo
first instead though…
or
--project-dir=/config/git_repo/dbt
d
@Chris (deprecated profile) thanks for your answer. If you check the original log I've shared (
logs-7
) you will see that Airbyte's own
--project-dir
is actually passed at the end, so it overwrites the previous flag (the one I provided)
but, in any case, I've already moved the
dbt_project
to the root of the project (you can check in my branch: https://github.com/delucca-workspaces/analytics/tree/feat/user-access) and it still won't work
regarding logs-7, check line 344 of the log:
2021-07-28 21:01:34 INFO  Running: dbt run --project-dir dbt --profiles-dir=/config --project-dir=/config/git_repo
as you can see, Airbyte passes its
--project-dir
at the end of the command. I've tested it locally, and when you do that, DBT uses only the last value
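The "last flag wins" behavior described above can be illustrated with a tiny shell loop. This is not dbt itself, just a hypothetical sketch of repeated-flag parsing, using the `=` form for simplicity:

```shell
# Hypothetical sketch: when the same flag appears twice, a typical
# CLI parser keeps whichever value it sees last.
PROJECT_DIR=""
for arg in run --project-dir=dbt --profiles-dir=/config --project-dir=/config/git_repo; do
  case "$arg" in
    --project-dir=*) PROJECT_DIR="${arg#--project-dir=}" ;;  # later value overwrites earlier one
  esac
done
echo "$PROJECT_DIR"
```

So the user-supplied `dbt` is silently replaced by `/config/git_repo`, which matches the log line above.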
c
Yes, it's probably because you are passing it as
--project-dir dbt
and the grep is expecting the syntax with
=
instead
--project-dir=dbt
So we should grep without the
=
and in the meantime you should use that syntax instead
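A minimal sketch of the detection logic Chris describes (hypothetical, simplified from the real `dbt_transformation_entrypoint.sh`): the grep only matches the `=` form, so the space-separated form slips through and Airbyte appends its own flag at the end, where it wins:

```shell
# Hypothetical, simplified version of the entrypoint check described above.
ARGS="run --project-dir dbt"                       # space-separated form: not matched by the grep
if ! echo "$ARGS" | grep -q -- "--project-dir="; then
  ARGS="$ARGS --project-dir=/config/git_repo"      # Airbyte's default, appended last
fi
echo "$ARGS"
```

With `--project-dir=dbt` instead, the grep matches and nothing is appended, so the user's value survives.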
d
ah ok! I'm going to try that, 1 sec
c
but, in any case, I’ve already moved the 
dbt_project
 to the root of the project (you can check in my branch: https://github.com/delucca-workspaces/analytics/tree/feat/user-access) and it still won’t work
i don’t see a
dbt_project.yml
file at the root of your repo but in the
dbt
folder instead, so you'd need to use
--project-dir=/config/git_repo/dbt
d
I've just changed that, probably that's why you can't see it. The commit I was testing before was this: https://github.com/delucca-workspaces/analytics/tree/e7ada0ce81449ad6bee7e542e91bb41ce0d71465
I'm waiting for it to finish, I'll let you know in a sec if it worked
c
i think it might complain about your profile next 🙂
re-reading the script i linked earlier, you might have to move your dbt_project.yml back to the root of your git_repo…
d
No luck 😞
@Chris (deprecated profile) I can move it back to the root, but that would lead me to the same error as before (the commit I sent you)
for some reason, even with the dbt_project in root it was not working
c
this is a kube deployment right?
d
yes
do you think this could be a Kubernetes-related issue? :S
c
on a docker-compose deployment, the container that does the
git clone --depth 5 -b feat/user-access --single-branch  $GIT_REPO git_repo
command and the container that runs
dbt run
inside the git repo are different, but they are both started with the same volume mount. So everything is fine and the
git_repo
folder is shared between the two
DockerProcess
@Davin Chia (Airbyte) / @Jared Rhizor (Airbyte): on kube, does this behave the same way?
j
There aren’t any shared mounts for the single-container processes launched on Kube. Looks like we’ll have to change this to remove the dependency on the shared mount.
@Chris (deprecated profile) is it easy to make the second container also perform the clone?
c
There's a clone to do, but also translating an Airbyte destination config into a dbt profile config file... These tasks were done by a container using the normalization image,
whereas the custom dbt transformations are run by a different docker image
j
hmm
there are multiple ways we could handle this
c
And if users define multiple operations (different containers), they might want to share "artifact files" (shared volume mounts) between the different steps
j
hmm
we could probably accomplish this by running a sequence of initContainers for the process we spin up
with a shared mount across all of them
then they’d run in order on the same node with a shared mount
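Jared's initContainers idea could look roughly like this. A hypothetical sketch, not Airbyte's actual pod spec; the pod name, image tags, and paths are made up for illustration:

```yaml
# Hypothetical pod spec: the clone/configure step runs as an initContainer,
# then the dbt container sees the same emptyDir volume at /config.
apiVersion: v1
kind: Pod
metadata:
  name: custom-dbt-transform        # placeholder name
spec:
  volumes:
    - name: workspace
      emptyDir: {}                  # shared scratch space, lives as long as the pod
  initContainers:
    - name: normalization-configure # clones git_repo and writes the dbt profile
      image: airbyte/normalization:example   # placeholder tag
      volumeMounts:
        - name: workspace
          mountPath: /config
  containers:
    - name: dbt-run                 # runs after all initContainers succeed
      image: fishtownanalytics/dbt:example   # placeholder tag
      workingDir: /config/git_repo
      volumeMounts:
        - name: workspace
          mountPath: /config
```

Because initContainers run to completion, in order, on the same node as the main containers, the `git_repo` folder written by the first step is visible to the dbt step, mirroring the docker-compose shared-volume behavior described earlier.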
another option would be to use cloud storage to store the directory and load it between the processes
c
Back to Daniel's issue... To answer his question, it seems the answer is yes... We need to fix custom dbt transformations to work on kube... https://airbytehq.slack.com/archives/C021JANJ6TY/p1627512382385400?thread_ts=1627506369.378200&cid=C021JANJ6TY
d
😞
do you see any possible workaround for this?
my startup is building out its ETL structure. It is pretty simple, but Airbyte fits it so well
it would be a shame to stop using it because of this issue. I don't know how hard you think it would be to fix, or how long it could take
our entire infrastructure is inside Kubernetes
c
Maybe to work around it for the moment you can build your docker image with the git repo already loaded inside...
j
We’ll probably be fixing this pretty soon. I can brainstorm with Davin a bit on this tonight and we’ll try to get a rough timeline.
d
that would be awesome. If the idea is to fix it quickly, I could run it locally until it gets fixed
I think it would be easier for us (since we're just starting our ETL structure) to do that instead of creating a custom image
d
discussed with Chris today, there are 2 parts to this:
* Problem 1 - need to run 'configure' operation before we can run the actual DBT runner
	* we do not share file space today between Kube processes as we do in docker
	* Solution
	Approach 1)  Make the user install git and the base normalize folder in their submitted docker image. This way we can run the operation in the container
	Approach 2)  Migrate the transform_config directory to Java. This way the scheduler can run this and transfer the yml file over to the container. 
		* Submitted image will still need git
		* need to modify normalization as well (all we need to do is remove this from the entrypoint.sh in base-normalization and make sure we also copy the yaml over)
		
* Problem 2 - need to share file space between operations with multiple sequential steps
	* Solution
		* 'Create' a new multi-step operation to be executed in the same container/pod. This will take the form of a 'script' the user can submit.
		* User controls the docker image + entrypoint script, so has as much flexibility as possible.
	* Users that aren't as technical can still use the CustomDbtRunner and do sequential operations. The operations won't be able to share the same file space, but they will still be able to operate on the same warehouse.
@Jared Rhizor (Airbyte) Chris and I think approach 2 is a good approach for problem 1 and would solve Daniel's immediate problem. What do you think?
d
Looking forward to that 🙏
j
@Davin Chia (Airbyte) sounds good to me
c
This PR should fix the first confusing part of this thread (before discovering limitations on kube): https://github.com/airbytehq/airbyte/pull/5076
d
@Chris (deprecated profile) after that PR, will I be able to define a custom --project-dir? 🙂
c
you'd need to upgrade, but yes, moving your
--project-dir
around will be better handled after that PR is released
d
@charles I think we can slot this into next sprint as a stretch? I think this is 2 days of work so it won't take too much away from Cloud. We can plan to work on it later in the sprint so we cross over with when Chris comes back. Wdyt?
c
how does the priority of this stack up against the private beta launch? chris's next project after the beta launch was going to be schema evolution, which is also pretty high priority.
o
Dear all, any news about this? @Daniel did the workaround work for you? If so, could you please share how you made it work? And what did you put in the airbyte “Add transformation” fields, please? Thanks
j
Any status on this?
What exactly does “Migrate the transform_config directory to Java. This way the scheduler can run this and transfer the yml file over to the container.” mean?
@Parker (Airbyte)? Are you working on this?
@Davin Chia (Airbyte)?
Could the approach be to copy the files generated during k8s normalization back to the worker after it finishes, then copy those files up to the k8s dbt transformer before it does its work?
p
@Jonathan Alvarado I'm not currently working on this - I did some scoping/investigation work earlier that included a rough spike, but the actual implementation hasn't been prioritized yet. The approach we scoped out was to rewrite the spec transformation script from Python to Java so that it could run on one of our Java workers, and then copy the transformed spec into the container. But there are other details to work out around when and where to check out the dbt repository, testing that the changes work with docker, etc.
j
Thanks for the update @Parker (Airbyte)
d
Why wouldn't I be able to create a persistent volume / persistent volume claim that is shared across the deployment specs to manage this issue? Has anyone considered or tried this?
I don't really understand exactly how airbyte bootstraps the whole dbt process in k8s, but if someone can point me to any kind of documentation, or even the place in the source where it happens, I can probably find a solution to these issues in k8s itself. Any pointers are greatly appreciated. We are experimenting with airbyte+dbt -- we are also all k8s, and I would like to see if we can get this working.
g
@Davin Chia (Airbyte) I wanted to confirm this is a viable workaround: https://airbytehq.slack.com/archives/C021JANJ6TY/p1627556825392100?thread_ts=1627506369.378200&cid=C021JANJ6TY? If so, we're prepared to do just about anything to get dbt working with Airbyte on K8s at this point.
I took a shot at implementing it - any advice welcome. More details here: https://github.com/airbytehq/airbyte/issues/5091#issuecomment-1382307149
a
@Gabriel Levine @Davin Chia (Airbyte) Any update? Seems like the kube issue still persists. I tried downloading the git repo in my Dockerfile
FROM fishtownanalytics/dbt:1.0.0

RUN mkdir /git_repo
WORKDIR /git_repo
RUN apt update \
    && apt install curl bash git openssh-server libpq-dev gcc -y
RUN git clone https://github.com/ajyadav013/dbt-test.git

ENTRYPOINT ["dbt", "--project-dir=/git_repo/dbt-test"]
Stuck in the same loop
2023-01-16 14:29:29 destination > completed destination: class io.airbyte.integrations.destination.postgres.PostgresDestination
2023-01-16 14:29:34 normalization > Running: git clone --depth 5 -b main --single-branch  $GIT_REPO git_repo
2023-01-16 14:29:35 normalization > Last 5 commits in git_repo:
2023-01-16 14:29:35 normalization > d88428f added more column transformation
2023-01-16 14:29:35 normalization > 9516cd0 added more column transformation
2023-01-16 14:29:35 normalization > 7b96100 added more column transformation
2023-01-16 14:29:35 normalization > a7546aa added more column transformation
2023-01-16 14:29:35 normalization > f0be440 added more column transformation
2023-01-16 14:29:35 normalization > /config
2023-01-16 14:29:35 normalization > Running: transform-config --config destination_config.json --integration-type postgres --out /config
2023-01-16 14:29:36 normalization > Namespace(config='destination_config.json', integration_type=<DestinationType.POSTGRES: 'postgres'>, out='/config')
2023-01-16 14:29:36 normalization > transform_postgres
2023-01-16 14:29:36 normalization > Cloning into 'git_repo'...
2023-01-16 14:29:42 dbt > entrypoint.sh: line 6: cd: git_repo: No such file or directory
p
Hi guys, I'm facing the same problem using CustomDbtTransformation on k8s. I'm working on a workaround; if I succeed, I will share it with you all...
g
@Davin Chia (Airbyte) It would at least be nice to migrate the reference to the official dbt registry, since fishtownanalytics has been deprecated since 2021: https://github.com/dbt-labs/dbt-core/pkgs/container/dbt-core. I understand if the team doesn't want to deal with testing new minor versions, but I feel it would be reasonable to pull in some of the patch-level fixes: https://github.com/dbt-labs/dbt-core/blob/1.0.latest/CHANGELOG.md
a
Hey y'all, I'm having the same issue and would really appreciate a hint on the workaround implementation
a
@Anas El Mhamdi You can use Airflow to trigger the dbt transformation. I've done the same - an API hit from the UI to an Airflow DAG:
1 - Check the Airbyte connection and create it if necessary
2 - Sync data from the connections
3 - Run the DBT transformation on the synced data
Airbyte API Docs - https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#auth