# getting-started
n
Hello, I just ran DataHub with docker-compose and opened the frontend. How do I learn how to "load data" into DataHub? It's still kind of foggy to me
b
Hi Cesar -- To load sample data, you can use the ./ingestion.sh script under docker/ingestion. If you want to start loading in your own metadata, you can use the Python Ingestion framework 🙂 cc @gray-shoe-75895
n
I see, just loaded the sample data. My main issue now is figuring out how to create metadata about, let's say, the PostgreSQL/Greenplum or Oracle schemas/tables in my company
Is there a step-by-step guide for that?
g
Yes, the Python ingestion framework is perfect for that - see https://github.com/linkedin/datahub/tree/master/metadata-ingestion. We already have support for Postgres as a metadata source, and adding other databases is fairly straightforward as well
Let me know if you need any help with it! It'd also be helpful to know if you find anything confusing so that I can improve the docs
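[Editor's note: as a sketch of what a Postgres recipe might look like, here is a minimal source-to-file example. All connection values below are placeholders, and the exact config keys may differ between metadata-ingestion versions - check the README linked above for the current schema.]

```shell
# Write a hypothetical Postgres -> file recipe; host_port, database,
# username, and password are placeholders you would replace.
cat > postgres_to_file.yml <<'EOF'
source:
  type: postgres
  config:
    host_port: localhost:5432
    database: mydb
    username: datahub_reader
    password: example-password
sink:
  type: file
  config:
    filename: ./postgres_mces.json
EOF
# then run it with:
#   datahub ingest -c postgres_to_file.yml
```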
n
I'll check that, thanks
Where do I run datahub ingest -c examples/recipes/file_to_file.yml?
i
Make sure to read the README there - you will have to compile and install a few things first
👍 2
n
What if I see things like "Failed building wheel for avro-python3" or "Failed building wheel for avro-gen", among others, when running pip install -e .?
error: invalid command 'bdist_wheel'
m
@gray-shoe-75895: ^^
g
Huh, that's odd - can you try pip install wheel?
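[Editor's note: the "invalid command 'bdist_wheel'" error usually just means the wheel package is missing from the active virtualenv. A quick way to check, as a sketch:]

```shell
# "error: invalid command 'bdist_wheel'" typically means the 'wheel'
# package is not installed in the active environment
if python3 -c "import wheel" 2>/dev/null; then
    echo "wheel is present"
else
    echo "wheel is missing - run: pip install wheel"
fi
```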
n
Sure, that solved it! New problem:
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
g
That means you're missing some Python headers - did you run sudo apt install librdkafka-dev python3-dev python3-venv?
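[Editor's note: gcc failures during pip install usually mean the compiler can't find Python.h. As a sketch, this prints where the interpreter expects its C headers; if python3-dev isn't installed, that directory won't contain Python.h.]

```shell
# Print the directory where this interpreter expects its C headers
# (Python.h); useful for diagnosing gcc failures in pip builds.
python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])"
```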
n
Yep, will do again 🙂
g
Also, what python version are you using?
n
(venv) me:~/git-github/datahub/metadata-ingestion$ sudo apt install librdkafka-dev python3-dev python3-venv
[sudo] password for cesarribeiro: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
librdkafka-dev is already the newest version (0.11.3-1build1).
python3-dev is already the newest version (3.6.7-1~18.04).
python3-venv is already the newest version (3.6.7-1~18.04).
The following packages were automatically installed and are no longer required:
  golang-docker-credential-helpers python-asn1crypto python-backports.ssl-match-hostname python-cached-property python-certifi python-cffi-backend python-chardet
  python-cryptography python-docker python-dockerpty python-dockerpycreds python-docopt python-enum34 python-funcsigs python-functools32 python-idna python-ipaddress
  python-jsonschema python-mock python-openssl python-pbr python-requests python-six python-texttable python-urllib3 python-websocket python-yaml
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 128 not upgraded.
(venv) me:~/git-github/datahub/metadata-ingestion$ python --version
Python 3.6.9
g
That all seems fine
What was the full error you got with the pip install?
n
Several 😄 1 sec..
https://gist.github.com/carrbrpoa/f441a3e1a4bc59125e970d557c1e1070 I'm not familiar with python, sorry if it's something trivial 😅
g
That's a new error to me as well - definitely not trivial 🙂
It seems there's a version mismatch between the librdkafka you have installed and the python wrapper library. The other surprising thing is that it's trying to build from source instead of using the prebuilt packages
Can you try running pip install confluent_kafka==1.5.0?
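[Editor's note: pinning to a release that ships prebuilt wheels sidesteps the source build. Once installed, you can check which librdkafka the Python client was built against - libversion() is part of confluent_kafka's public API. A sketch that also handles the package being absent:]

```shell
python3 - <<'EOF'
# Report the librdkafka version the confluent_kafka wrapper was
# built against, or a notice if the package is not installed.
try:
    import confluent_kafka
    print("librdkafka:", confluent_kafka.libversion()[0])
except ImportError:
    print("confluent_kafka is not installed in this environment")
EOF
```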
n
g
Great - that seems to have worked. Now can you try the original pip install command again?
n
I tried! Line 7 in that gist 😄
g
Ah didn't catch that
Then you should be good to go!
n
Yep, will follow next steps! Thanks for the help
g
Glad I could help, and I'll be updating the docs so people don't run into these issues in the future 🙂
👍 1
n
g
Yet another odd one - can you try pip install avro-python3==1.10.0?
n
Done; just datahub ingest again?
Same error running again; maybe I should redo some step?
g
Yeah perhaps
Can you list your installed packages with pip freeze?
n
avro-gen==0.3.0
avro-python3===file-.avro-VERSION.txt
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
confluent-kafka==1.5.0
dataclasses==0.8
-e git+https://github.com/linkedin/datahub.git@12ff330a54bf1eb69b4364a3d622464077cfac5e#egg=datahub&subdirectory=metadata-ingestion
fastavro==1.3.2
frozendict==1.2
idna==2.10
mypy-extensions==0.4.3
pkg-resources==0.0.0
pydantic==1.7.3
pytz==2021.1
PyYAML==5.4.1
requests==2.25.1
six==1.15.0
SQLAlchemy==1.3.23
toml==0.10.2
typing-extensions==3.7.4.3
tzlocal==2.1
urllib3==1.26.3
g
That avro-python3 part looks weird
it should’ve been 1.10.0
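[Editor's note: to confirm what pip actually recorded for the package, something like this works as a sketch - a mangled version string such as "file-.avro-VERSION.txt" points at a broken sdist build.]

```shell
# Show pip's recorded version for avro-python3, or a notice if
# the package is not installed in this environment.
python3 -m pip show avro-python3 2>/dev/null | grep -i "^Version" \
    || echo "avro-python3 is not installed"
```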
n
Here's what it showed when I tried that command to install 1.10.0:
(venv) me@carrbrpoa:~/git-github/datahub/metadata-ingestion$ pip install avro-python3==1.10.0
Collecting avro-python3==1.10.0
  Downloading https://files.pythonhosted.org/packages/b2/5a/819537be46d65a01f8b8c6046ed05603fb9ef88c663b8cca840263788d58/avro-python3-1.10.0.tar.gz
  Requested avro-python3==1.10.0 from https://files.pythonhosted.org/packages/b2/5a/819537be46d65a01f8b8c6046ed05603fb9ef88c663b8cca840263788d58/avro-python3-1.10.0.tar.gz#sha256=a455c215540b1fceb1823e2a918e94959b54cb363307c97869aa46b5b55bde05, but installing version file-.avro-VERSION.txt
Building wheels for collected packages: avro-python3
  Running setup.py bdist_wheel for avro-python3 ... done
  Stored in directory: /home/me/.cache/pip/wheels/3f/15/cd/fe4ec8b88c130393464703ee8111e2cddebdc40e1b59ea85e9
Successfully built avro-python3
Installing collected packages: avro-python3
  Found existing installation: avro-python3 file-.avro-VERSION.txt
    Uninstalling avro-python3-file-.avro-VERSION.txt:
      Successfully uninstalled avro-python3-file-.avro-VERSION.txt
Successfully installed avro-python3-file-.avro-VERSION.txt
g
Hmm, it tried to build from source, which I suspect is broken. Can you try pip uninstall avro-python3 && pip cache purge && pip install -e .
n
pip cache purge didn't work. ok?
ERROR: unknown command "cache" - maybe you meant "check"
g
Do you have an older version of pip? Maybe a pip install --upgrade pip would help
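[Editor's note: the pip cache subcommand was added in pip 20.1, so the "unknown command" error is itself a sign of an old pip. Checking and upgrading, as a sketch:]

```shell
# 'pip cache' was added in pip 20.1; "unknown command" means an
# older pip. Print the version in the active environment:
python3 -m pip --version
# Upgrade in place if it is older than 20.1:
#   python3 -m pip install --upgrade pip
```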
n
could be! will do
Rerunning ingest:
(venv) me@carrbrpoa:~/git-github/datahub/metadata-ingestion$ datahub ingest -c examples/recipes/file_to_file.yml
Traceback (most recent call last):
  File "/home/me/git-github/datahub/venv/bin/datahub", line 11, in <module>
    load_entry_point('datahub', 'console_scripts', 'datahub')()
  File "/home/me/git-github/datahub/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 480, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/home/me/git-github/datahub/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
    return ep.load()
  File "/home/me/git-github/datahub/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2324, in load
    return self.resolve()
  File "/home/me/git-github/datahub/venv/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2330, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/me/git-github/datahub/metadata-ingestion/src/datahub/entrypoints.py", line 13, in <module>
    from datahub.ingestion.run.pipeline import Pipeline
  File "/home/me/git-github/datahub/metadata-ingestion/src/datahub/ingestion/run/pipeline.py", line 9, in <module>
    from datahub.ingestion.sink import sink_class_mapping
  File "/home/me/git-github/datahub/metadata-ingestion/src/datahub/ingestion/sink/__init__.py", line 6, in <module>
    from .datahub_kafka import DatahubKafkaSink
  File "/home/me/git-github/datahub/metadata-ingestion/src/datahub/ingestion/sink/datahub_kafka.py", line 13, in <module>
    from datahub.metadata.schema_classes import SCHEMA_JSON_STR
ImportError: cannot import name 'SCHEMA_JSON_STR'
g
Huh, the codegen also didn't run correctly. Can you try pip uninstall avro-gen && pip cache purge && pip install -e . - my bet is that the old pip installed an old avro-gen
n
Hello, resuming today. It worked! Thanks a lot. Now I'll try to follow the Postgres recipe
It's me again in this thread 😄 Today I tried to install things in another environment (Windows 10, the ./gradlew :metadata-events:mxe-schemas:build step) and got several errors: https://gist.github.com/carrbrpoa/04e3c5bb5fe9c92b596089375b1f4c1c (Python 3.8.2)
(this step doesn't depend on datahub services running, right?)
g
It shouldn’t depend on anything else. That step is actually included as part of the normal build, so it points to a broader issue here
If you run a simple ./gradlew build, what happens?
n