# getting-started
i
Hello, is there a particular reason why docker images are created directly from datahub's source instead of relying on published artifacts? I.e. published jars for GMS, published packages for python? Right now, if I need to modify a particular image I need to have the entire codebase locally available to make relatively minor changes.
m
No specific reason other than convenience (of writing the docker script I imagine) ... @microscopic-receptionist-23548 or @steep-airplane-62865 might be able to shed more light
i
Would it be reasonable to release the code artifacts for each module to their respective places and then modify the dockerfiles? I understand the value of convenience for testing local changes, but perhaps we could find a middle ground, like having two sets of dockerfiles: the first for local development and the second as part of the release process.
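(For illustration, a release-oriented dockerfile along these lines could install a published artifact instead of compiling from source. This is only a sketch: no python artifact was published at the time of this thread, so the package name `acryl-datahub` and the `datahub` entrypoint are assumptions.)

```dockerfile
# Hypothetical "release" dockerfile: pull a published package rather
# than building the whole repo. Package name and entrypoint are
# assumptions for illustration.
FROM python:3.8-slim
RUN pip install --no-cache-dir acryl-datahub
ENTRYPOINT ["datahub"]
```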
m
we have docker files for local development, see `docker/dev.sh`. I guess I'm not sure I see the advantage of using prepublished packages to build images. Can you explain further what kind of changes you're making?
Also, I'm no docker expert, but I do think it is wise to build the code on the image it is going to run on, to help ensure compatibility
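(The "build where you run" approach being described looks roughly like this single-stage sketch; the base image, build command, and artifact path are generic placeholders, not datahub's actual dockerfiles.)

```dockerfile
# Sketch of building the code on the same image it will run on, so the
# build toolchain and runtime match. Paths and commands are placeholders.
FROM openjdk:8
COPY . /src
WORKDIR /src
RUN ./gradlew build                         # assumes a gradle wrapper in the repo
CMD ["java", "-jar", "build/libs/app.jar"]  # artifact path is illustrative
```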
i
In my case I was trying to modify the datahub-ingestion docker image to add support for druid, see this PR: https://github.com/linkedin/datahub/pull/2235 I've tested it on my end well enough for my use-case. I wanted to make the docker image available so that I could use it, but it implies having the entire datahub repository on my end during the build phase, which uses my company's resources and is essentially duplicated work once the PR is merged.
I imagined similar use-cases of smallish changes may occur for devs at other companies, and wondered whether there could be an easier way to extend/modify these dockerfiles
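(If a published base image and published python packages existed, a small change like this one could be layered on top instead of rebuilding from source. A rough sketch, assuming a published `linkedin/datahub-ingestion` image and a locally built, patched wheel; both names are illustrative.)

```dockerfile
# Hypothetical: extend a published ingestion image with a locally
# patched wheel instead of rebuilding the whole repo. Image name and
# wheel filename are assumptions.
FROM linkedin/datahub-ingestion:latest
COPY dist/acryl_datahub-0.0.1.dev0-py3-none-any.whl /tmp/
RUN pip install --no-cache-dir --force-reinstall /tmp/acryl_datahub-0.0.1.dev0-py3-none-any.whl
```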
m
> I wanted to make the docker image available so that I could use it
use it for what?
i
so sorry, so that it could be used in a deployment at my company without having to wait for a release on datahub's side.
m
so, assume we did publish python artifacts here, for the sake of argument. how would that "fix" this issue? the artifact itself would still need to be built and published, and then a new docker image that pulls in the artifact would need to be built...
right?
i
Yes, my only intention is to reduce the dependency overhead and the number of steps in the docker image build process (having to copy the entire codebase, compile it during the docker build, move compiled objects, etc.)
m
all you did was offload it to some prior step that builds artifacts; that still needs to be run
i
Very true, though in my experience that's the normal procedure anyway, whether in CI or locally; there's no need to duplicate it again in the dockerfile.
If those build artifacts are published somewhere, it allows others to create modified versions of the official docker images without having the codebase at hand when it isn't needed.
m
Hmm, that makes sense, though I don't think it helps this specific PR, since you're modifying artifacts anyway. Really this is the opposite? You want to modify the artifact but not the docker image 😛 This specific PR aside, I'll look into it a bit. Again, not a docker expert, so I'm not sure what best practices are. As far as I know, building on the image is a good idea, but python/java should be portable, so it shouldn't matter...
i
Probably; altering the artifact alters the docker image, I guess. Since it is python I think I could replace the relevant python files, but yes, it would not be the easiest thing.
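(The file-replacement idea mentioned here might look like the following; the image name and site-packages path are guesses for illustration only, not datahub's actual layout.)

```dockerfile
# Rough sketch of overlaying a patched python file onto the installed
# package inside an existing image. Image name and target path are
# assumptions.
FROM linkedin/datahub-ingestion:latest
COPY druid.py /usr/local/lib/python3.8/site-packages/datahub/ingestion/source/druid.py
```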