Zirui Xu
10/27/2022, 4:34 PM
Can I use `kedro.extras.datasets.spark.SparkDataSet` without installing the dependencies specified in `kedro[spark]`? I am on a Databricks cluster where the installation of `pyspark` is blocked.
Nok Lam Chan
10/27/2022, 4:45 PM
As long as you have `pyspark` and `s3fs` installed it should be fine.
Zirui Xu
10/27/2022, 4:59 PM
When I run `pip install kedro[spark]` it still tries to install `pyspark`.
Annoyingly, even though this is a Databricks cluster, `pyspark` does not show up when I run `pip freeze`, even though I can `import pyspark`.
Zirui Xu
10/27/2022, 5:00 PM
• `pip install kedro[spark]` -> pip cannot see `pyspark` (although it is importable), so it tries to install it
• `pyspark` is blocked on our cluster
• fail
Nok Lam Chan
10/27/2022, 5:13 PM
Do you need to run `pip install kedro[spark]` if you don’t want to? As long as you have the library there it should be fine.
Nok Lam Chan
10/27/2022, 5:14 PM
Did you use `!pip freeze` or `%pip`? The shell environment on Databricks is different from your Python environment, if I remember correctly.
Zirui Xu
10/27/2022, 5:24 PM
• I assumed I needed to `pip install kedro[spark]` to make `spark.SparkDataSet` available, but it seems the code is always in the main kedro package.
• When I ran the pipeline with `__main__.py`, the error message hid the actual import error (it just showed a message pointing me to a page in the kedro documentation on managing dependencies).
• I tried `from kedro.extras.datasets.spark import SparkDataSet`. However, possibly due to `suppress(ImportError)`, the error was still not helpful: it just said it cannot import `SparkDataSet` from `kedro.extras.datasets.spark`.
• Finally I ran `from kedro.extras.datasets.spark.spark_dataset import SparkDataSet`. That showed the real errors: `hdfs` and `s3fs` were not installed.
• After I installed the two packages, all import errors were resolved and the pipeline is now happy.
Zirui Xu
10/27/2022, 5:25 PM
Nok Lam Chan
10/28/2022, 11:05 AM
Kedro lets you write `pandas.CSVDataSet` instead of `kedro.extras.datasets.pandas.CSVDataSet`, so it is hard to determine whether a module is not found because of a missing dependency or because the module does not exist. Under the hood, kedro looks for the module in a couple of places until it finds one.
I’ll try to look at it and see if there is something that we can improve.
For the time being, your debugging strategy is correct. I would also open up a Python console and import the full path!
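
A minimal sketch of the debugging strategy from this thread, using only the Python standard library. The helper names `real_import_error` and `importable_but_not_pip_visible` are hypothetical and not part of kedro. The second helper assumes the distribution name matches the import name, which is not true for every package.

```python
import importlib
import importlib.metadata
import importlib.util


def real_import_error(module_path):
    """Import the full module path and return the underlying error text, or None.

    Importing the concrete module (rather than the package that re-exports
    the class) avoids any error suppression in the package's __init__.
    """
    try:
        importlib.import_module(module_path)
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        return f"{type(exc).__name__}: {exc}"
    return None


def importable_but_not_pip_visible(name):
    """Return True if `import name` would find a module but no installed
    distribution metadata exists under that name.

    Assumption: the distribution name equals the top-level import name.
    """
    if importlib.util.find_spec(name) is None:
        return False  # not importable at all
    try:
        importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        return True  # importable, but no metadata found for that name
    return False
```

For the case above, one would call `real_import_error("kedro.extras.datasets.spark.spark_dataset")` to see which dependency is actually missing, and `importable_but_not_pip_visible("pyspark")` to check whether the cluster provides `pyspark` without package metadata in the current environment.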