# ingestion
a
Hi DataHub ingestion experts, I followed the instructions to configure and build the ingestion CLI. Specifically I ran the following:
```
cd metadata-ingestion
../gradlew :metadata-ingestion:installDev
source venv/bin/activate
```
Then I tried to ingest something from mysql using the command below
```
python3 -m datahub ingest -c ../test.mysql.localhost.dhub.yml
```
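For context, the recipe is a standard mysql → DataHub REST recipe along these lines (generic example with placeholder values, not my actual file):
```
# generic example recipe (placeholders, not the actual test.mysql.localhost.dhub.yml)
cat > recipe.yml <<'EOF'
source:
  type: mysql
  config:
    host_port: localhost:3306
    database: my_db
    username: datahub
    password: datahub
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
EOF
```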
And I got the following mysterious error.
```
Failed to create source due to mysql is disabled due to an error in initialization
```
Some small instrumentation of code revealed the exception to be
```
dlopen(/Users/jinlin/Code/datahub/metadata-ingestion/venv/lib/python3.9/site-packages/greenlet/_greenlet.cpython-39-darwin.so, 0x0002): tried: '/Users/jinlin/Code/datahub/metadata-ingestion/venv/lib/python3.9/site-packages/greenlet/_greenlet.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e'))
```
I am on a Mac with an M1 chip, and this looks like a mismatch between an M1 binary and an x86 binary. What should I do to make this work?
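(For reference, the mismatch can be confirmed by comparing the architecture of the compiled extension with what the interpreter reports; the paths below are taken from the error above and may differ on other setups.)
```
# architecture of the compiled extension that failed to load
file venv/lib/python3.9/site-packages/greenlet/_greenlet.cpython-39-darwin.so
# architecture the Python interpreter itself reports
python3 -c "import platform; print(platform.machine())"
```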
c
@able-evening-90828 Was installDev successful? Seems like something was missed in installDev.
a
Yes, it was. I can post the build transcript once I am at my computer.
m
Could you also post the result of `uname -a`?
Some users have reported that Python installs on Mac in emulator mode by default, which results in Python thinking it is running on x86 when it is actually running on an M1.
Also run this:
```
python -c "import platform; print(f'{platform.uname().system},{platform.uname().machine}')"
```
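Another check worth running (assuming macOS 11+ on Apple Silicon) is whether the current shell is being translated by Rosetta 2:
```
# prints 1 if the current shell is running translated under Rosetta 2, 0 if it is native
sysctl -n sysctl.proc_translated
```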
a
@mammoth-bear-12532, `uname -m` gave different results when running in `datahub_preflight.sh` as part of a build vs. when running in the shell directly. In the former it showed `x86_64`; in the latter it showed `arm64`. Below is the output of `uname -m` from the terminal directly.
```
$ uname -m
arm64
```
Then I modified `datahub_preflight.sh` like below to log the output of `uname -m`:
```
$ git diff scripts/datahub_preflight.sh
diff --git a/metadata-ingestion/scripts/datahub_preflight.sh b/metadata-ingestion/scripts/datahub_preflight.sh
index 2450d8d287..18ce365d0f 100755
--- a/metadata-ingestion/scripts/datahub_preflight.sh
+++ b/metadata-ingestion/scripts/datahub_preflight.sh
@@ -98,6 +98,8 @@ if [ "$(basename "$(pwd)")"    != "metadata-ingestion" ]; then
        exit 123
 fi
 printf 'āœ… Current folder is metadata-ingestion (%s) folder\n' "$(pwd)"
+printf 'uname -m result: (%s)\n' "$(uname -m)"
+printf 'uname result: (%s)\n' "$(uname)"
 if [[ $(uname -m) == 'arm64' && $(uname) == 'Darwin' ]]; then
   printf "šŸ‘Ÿ Running preflight for m1 mac\n"
   arm64_darwin_preflight
```
I got the following when I built it again.
```
> Task :metadata-ingestion:runPreFlightScript
šŸ”Ž Checking if current directory is metadata-ingestion folder
āœ… Current folder is metadata-ingestion (/Users/jinlin/Code/datahub/metadata-ingestion) folder
uname -m result: (x86_64)
uname result: (Darwin)

āœ… Preflight was successful
```
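That `x86_64` result is what you would see if the process tree running the script were being translated by Rosetta 2; the same effect can be reproduced directly from a terminal with Apple's `arch` tool (assuming macOS 11+ on Apple Silicon with Rosetta installed):
```
arch -x86_64 uname -m   # forces the command to run under Rosetta 2; prints x86_64
arch -arm64 uname -m    # runs the command natively; prints arm64
```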
I dug into it a bit more and found out that it is because my Oracle JDK 1.8 is `x86_64`. Because Gradle depends on the JDK, that is why `uname -m` returned `x86_64` when running inside the Gradle build. The only JDK 1.8 for arm64 on macOS I could find is at the link below: https://www.azul.com/downloads/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk Once I installed it and configured `JAVA_HOME` to use it, `datahub_preflight.sh` ran `arm64_darwin_preflight`, and my ingestion from mysql worked fine. I still had one build error for `confluent-kafka`, which I haven't dug into, but I assume it won't be a problem if I don't use the kafka sink.
```
/private/var/folders/bq/6j4gngqj3vlbthrxsz3v89zw0000gn/T/pip-install-hopsry44/confluent-kafka_9d021ea4647441e3b9ec907d3915ef2b/src/confluent_kafka/src/confluent_kafka.h:23:10: fatal error: 'librdkafka/rdkafka.h' file not found
      #include <librdkafka/rdkafka.h>
               ^~~~~~~~~~~~~~~~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

Ɨ Encountered error while trying to install package.
╰─> confluent-kafka
```
It would be good to update the prerequisites on the page below with a note about installing the JDK I mentioned above on M1 chips. I am still not quite sure how to update/build the docs yet. https://datahubproject.io/docs/developers @careful-pilot-86309 and @mammoth-bear-12532
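For reference, the note could boil down to something like this (the Zulu install path shown is the typical one; adjust if the JDK from the Azul link above was installed elsewhere):
```
# point JAVA_HOME at an arm64 JDK 1.8, e.g. Azul Zulu 8 from the link above
export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home
# confirm the JVM binary is arm64 rather than x86_64 before rebuilding
file "$JAVA_HOME/bin/java"
```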
Want to note that if I do the following before kicking off the build, then both `confluent-kafka` and `psycopg2-binary` built successfully. What is puzzling is that `datahub_preflight.sh` already does this, and it is unclear why it didn't work.
```
export GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1
export GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1
export CPPFLAGS="-I/opt/homebrew/opt/openssl@1.1/include -I/opt/homebrew/opt/librdkafka/include"
export LDFLAGS="-L/opt/homebrew/opt/openssl@1.1/lib -L/opt/homebrew/opt/librdkafka/lib"
export CPATH="/opt/homebrew/opt/librdkafka/include"
export C_INCLUDE_PATH="/opt/homebrew/opt/librdkafka/include"
export LIBRARY_PATH="/opt/homebrew/opt/librdkafka/lib"
```
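Those flags assume Homebrew on Apple Silicon (installed under /opt/homebrew) with the relevant libraries present; something like this should confirm that before the build:
```
# install the native headers/libs the flags above point at
brew install librdkafka openssl@1.1
# should print /opt/homebrew/opt/librdkafka on an Apple Silicon Homebrew install
brew --prefix librdkafka
```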
m
That's great info. We recently fixed a confluent_kafka-related issue on OSS master, so that failure might have been a transient one.
a
Ok, great. I saw this too and just tested it: https://github.com/datahub-project/datahub/pull/5489