Gilad Rubin
07/15/2024, 9:34 AM
from pathlib import Path
from typing import List, Union

from langchain_community.document_loaders.pdf import BasePDFLoader
from langchain_core.documents import Document

def pdf_pages(pdf_path: Union[str, Path],
              pdf_loader: BasePDFLoader) -> List[Document]:
    return pdf_loader(pdf_path).load()
Driver:
from langchain_community.document_loaders import PyMuPDFLoader
results = dr.execute(final_vars=outputs, inputs={"pdf_loader": PyMuPDFLoader})
PyMuPDFLoader inherits from BasePDFLoader.
I'm getting the following error:
Error: Type requirement mismatch. Expected pdf_loader:typing.Type[langchain_community.document_loaders.pdf.BasePDFLoader] got <class 'langchain_community.document_loaders.pdf.PyMuPDFLoader'>:<class 'abc.ABCMeta'> instead.
My motivation is to test the DAG with different LangChain PDF loaders.
Thanks!
Thierry Jean
07/15/2024, 1:41 PM
..., inputs={"pdf_loader": PyMuPDFLoader})
Should be (it's missing the () for object instantiation)
..., inputs={"pdf_loader": PyMuPDFLoader()})
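A minimal sketch of why the error message mentions abc.ABCMeta. BaseLoader and MyLoader are hypothetical stand-ins for BasePDFLoader and PyMuPDFLoader, and the sketch assumes the input check behaves roughly like isinstance; it is not Hamilton's actual implementation.

```python
from abc import ABC, abstractmethod


class BaseLoader(ABC):
    """Hypothetical stand-in for BasePDFLoader."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    @abstractmethod
    def load(self) -> list:
        ...


class MyLoader(BaseLoader):
    """Hypothetical stand-in for PyMuPDFLoader."""

    def load(self) -> list:
        return [f"page from {self.file_path}"]


# The class itself is an object whose type is the metaclass ABCMeta,
# so it is not an instance of BaseLoader.
class_is_instance = isinstance(MyLoader, BaseLoader)
class_metatype = type(MyLoader).__name__

# An instantiated loader is an instance of BaseLoader.
loader = MyLoader("doc.pdf")
object_is_instance = isinstance(loader, BaseLoader)

# The class does pass a subclass check, which is what Type[BaseLoader] describes.
class_is_subclass = issubclass(MyLoader, BaseLoader)
```

Passing the class and passing an instance are therefore two different inputs, and the annotation on the node has to describe the one the function body actually uses.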
Gilad Rubin
07/15/2024, 1:43 PM
Thierry Jean
07/15/2024, 1:45 PM
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

# returns specific
def trained_model() -> LinearRegression:
    return LinearRegression()

# expects generic, which includes the specific
def prediction(trained_model: BaseEstimator):
    return ...
While this isn't:
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression

# returns generic
def trained_model() -> BaseEstimator:
    return LinearRegression()

# expects specific, which may conflict with the generic
def prediction(trained_model: LinearRegression):
    return ...
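The two cases above come down to a subclass relationship between the upstream return annotation and the downstream parameter annotation. A simplified sketch, where GenericEstimator, SpecificEstimator, and output_fits_input are hypothetical names and the real check in Hamilton may handle more cases (generics, unions, and so on):

```python
class GenericEstimator:
    """Hypothetical stand-in for a generic base class such as BaseEstimator."""


class SpecificEstimator(GenericEstimator):
    """Hypothetical stand-in for a specific subclass such as LinearRegression."""


def output_fits_input(produced: type, expected: type) -> bool:
    """Return True when every value of `produced` is also a value of `expected`."""
    return issubclass(produced, expected)


# Upstream returns the specific type, downstream accepts the generic one.
specific_into_generic = output_fits_input(SpecificEstimator, GenericEstimator)

# Upstream returns the generic type, downstream demands the specific one.
generic_into_specific = output_fits_input(GenericEstimator, SpecificEstimator)
```

Here specific_into_generic is True and generic_into_specific is False, which mirrors the valid and invalid pairs of annotations above.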
Thierry Jean
07/15/2024, 1:46 PM
pdf_loader?
Gilad Rubin
07/15/2024, 1:46 PM
Thierry Jean
07/15/2024, 1:47 PM
inputs={"pdf_loader": PyMuPDFLoader(pdf_path=...)}
because the pdf_path is injected in the Hamilton DAG (i.e., it comes from another node)?
Gilad Rubin
07/15/2024, 1:52 PM
Thierry Jean
07/15/2024, 1:52 PM
Error: Type requirement mismatch. Expected pdf_loader:typing.Type[langchain_community.document_loaders.pdf.BasePDFLoader] got <class 'langchain_community.document_loaders.pdf.PyMuPDFLoader'>:<class 'abc.ABCMeta'> instead.
This error comes from Driver.execute(). It tells me that you have a valid DAG with pdf_loader: langchain_community.document_loaders.pdf.BasePDFLoader, but there's a mismatch with the pdf_loader value passed in inputs.
Thierry Jean
07/15/2024, 1:54 PM
def pdf_loader(pdf_type: abc.ABCMeta, pdf_loader_config: dict) -> BasePDFLoader:
    return pdf_type(**pdf_loader_config)  # this creates a `BasePDFLoader`

def my_other_func(pdf_loader: BasePDFLoader, ...) -> ...:
    return ...
Driver code
dr.execute(..., inputs={"pdf_type": PyMuPDFLoader, "pdf_loader_config": {"pdf_path": ...}})
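A runnable sketch of the pattern above: pass the loader class and its config as inputs, and let one function build the instance that downstream functions consume. BaseLoader and FakePdfLoader are hypothetical stand-ins so the example runs without LangChain, and the functions are called directly here, whereas in Hamilton the driver would wire pdf_loader into pdf_pages by name.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Type


class BaseLoader(ABC):
    """Hypothetical stand-in for BasePDFLoader."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    @abstractmethod
    def load(self) -> List[str]:
        ...


class FakePdfLoader(BaseLoader):
    """Hypothetical stand-in for a concrete loader such as PyMuPDFLoader."""

    def load(self) -> List[str]:
        return [f"{self.file_path}: page 1", f"{self.file_path}: page 2"]


def pdf_loader(pdf_type: Type[BaseLoader], pdf_loader_config: Dict[str, Any]) -> BaseLoader:
    # Instantiate the chosen class; the config keys must match its constructor.
    return pdf_type(**pdf_loader_config)


def pdf_pages(pdf_loader: BaseLoader) -> List[str]:
    # Downstream code only ever sees an instance, never the class.
    return pdf_loader.load()


loader = pdf_loader(FakePdfLoader, {"file_path": "report.pdf"})
pages = pdf_pages(loader)
```

The config keys are forwarded as keyword arguments, so they have to use whatever parameter names the selected loader's constructor defines (file_path for the stand-in here).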
Thierry Jean
07/15/2024, 1:55 PM
Gilad Rubin
07/15/2024, 1:56 PM
def pdf_pages(pdf_path: Union[str, Path],
              pdf_loader: Any) -> List[Document]:
    return pdf_loader(pdf_path).load()
Gilad Rubin
07/15/2024, 1:59 PM
Thierry Jean
07/15/2024, 1:59 PM
pdf_loader is a type rather than an object?
Gilad Rubin
07/15/2024, 2:00 PM
Thierry Jean
07/15/2024, 2:06 PM
def model_config(model_config_overrides: Optional[dict] = None) -> dict:
    config = dict(...)
    if model_config_overrides:
        config.update(**model_config_overrides)
    return config

def base_model(model_config: dict) -> BaseEstimator:
    return LinearRegression(**model_config)

def trained_model(base_model: BaseEstimator, ...) -> BaseEstimator:
    # fit model
    return base_model
This has the following benefits:
• the most common config is directly in my code model_config() and I don't have to specify a config to use dr.execute()
• if I want to change a few values at the same time, I can use model_config_overrides. This will be very explicit from dr.execute() and I can inspect the return value of model_config() if I need to debug
• for full control or testing, I can use dr.execute(..., overrides={"model_config": dict()}) to explicitly pass a config and skip potential bugs in model_config()
It might be a bit overengineered, but you might find it useful if you do a lot of experimentation!
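The first two usage modes in the list above can be sketched with plain function calls. The default values below (fit_intercept, positive) are illustrative assumptions, since the original elides the real defaults with dict(...). The third mode, passing overrides={"model_config": ...} to dr.execute(), bypasses this function entirely and is not shown.

```python
from typing import Optional


def model_config(model_config_overrides: Optional[dict] = None) -> dict:
    # Hypothetical defaults; the original does not show the actual values.
    config = dict(fit_intercept=True, positive=False)
    if model_config_overrides:
        # Only the keys present in the overrides replace the defaults.
        config.update(**model_config_overrides)
    return config


# Mode 1: no input given, the defaults defined in code are used.
default_config = model_config()

# Mode 2: change a few values while keeping the remaining defaults.
tuned_config = model_config({"positive": True})
```

Because the merged dict is the node's return value, it can be requested as an output and inspected when debugging.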
example: https://github.com/zilto/ordinal-forecasting-digital-phenotyping/blob/master/src/tabular_model.py#L164
Gilad Rubin
07/15/2024, 2:46 PM
Thierry Jean
07/15/2024, 2:55 PM
@config.when() (e.g., swap between LinearRegression, SVM, HistGradientBoostingRegressor).
• If you want even more flexibility / dynamism, you can write
def model_config(model_config_overrides: Optional[dict] = None) -> dict:
    config = dict(...)
    if model_config_overrides:
        config.update(**model_config_overrides)
    return config

# you can set a default value if you want
def base_model(model_config: dict, model_type: type = LinearRegression) -> BaseEstimator:
    return model_type(**model_config)

def trained_model(base_model: BaseEstimator, ...) -> BaseEstimator:
    # fit model
    return base_model
With this much flexibility though, you probably want to check if the model_config will be valid for the model_type.
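One possible way to do that check, sketched with the standard library's inspect.signature. TinyModel and unknown_config_keys are hypothetical names introduced for illustration; for real scikit-learn estimators there may be better-suited tools, and this sketch only compares config keys against constructor parameter names.

```python
import inspect


class TinyModel:
    """Hypothetical stand-in for an estimator class."""

    def __init__(self, fit_intercept: bool = True, alpha: float = 1.0):
        self.fit_intercept = fit_intercept
        self.alpha = alpha


def unknown_config_keys(model_type: type, model_config: dict) -> list:
    """Return the config keys that the constructor of `model_type` does not accept."""
    params = inspect.signature(model_type).parameters
    # A **kwargs parameter accepts any key, so nothing can be flagged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return sorted(key for key in model_config if key not in params)


def base_model(model_config: dict, model_type: type = TinyModel):
    # Fail early with a readable message instead of a TypeError from the constructor.
    bad_keys = unknown_config_keys(model_type, model_config)
    if bad_keys:
        raise ValueError(f"{model_type.__name__} does not accept: {bad_keys}")
    return model_type(**model_config)


model = base_model({"alpha": 0.5})
```

This catches misspelled or model-specific keys before instantiation; it does not validate the values themselves.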
Gilad Rubin
07/15/2024, 2:56 PM
model_type"
Thierry Jean
07/15/2024, 2:57 PM