Jakub Bartczuk
06/06/2022, 12:43 PM
Jakub Bartczuk
06/06/2022, 12:44 PM
Eduardo
Jakub Bartczuk
06/06/2022, 12:45 PM
Jakub Bartczuk
06/06/2022, 12:45 PM
Jakub Bartczuk
06/06/2022, 12:46 PM
Jakub Bartczuk
06/06/2022, 12:46 PM
vector_data = pd .read_csv
Jakub Bartczuk
06/06/2022, 12:47 PM
Eduardo
Jakub Bartczuk
06/06/2022, 1:02 PM
Eduardo
Jakub Bartczuk
06/06/2022, 1:08 PM
Jakub Bartczuk
06/06/2022, 1:09 PM
Jakub Bartczuk
06/06/2022, 1:09 PM
Eduardo
Eduardo
ploomber interact
then:
# replace 'task-name' with the actual name
print(dag['task-name'].status(return_code_diff=True)['Code diff'])
and show me the output?
Jakub Bartczuk
06/06/2022, 4:43 PM
def train(
    product,
    upstream,
    classes_to_use: List[str],
    class_maping: Dict,
    model: str,
    model_parameters: Optional[Dict] = None,
    search_type: Optional[str] = None,
    parameters_search: Optional[Dict] = None,
    cv: int = 5,
    test_size: float = 0.2,
    perform_data_scaling: bool = True,
):
    model_path = str(product["model_path"])
    vector_data = pd .read_csv(str(upstream["train.enrich_tif_metadata"]))
    vector_data = process_labels(vector_data, classes_to_use, class_maping)
    X = vector_data .drop(["labels", "filename"], axis=1).to_numpy()
    y = vector_data["labels"].to_numpy()
    X_train, X_test, y_train, y_test = model_selection .train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    logging .info("used metadata features")
    logging .info(
        [
            col
            for col in vector_data .columns
            if "embedding"not in col and col not in ["labels", "filename"]
        ]
    )
    logging .info("also using embeddings")
    if model_parameters is None:
        model_parameters = {}
    classifier = MODELS[model](**model_parameters)
    if perform_data_scaling:
        classifier = pipeline .make_pipeline(
            preprocessing .StandardScaler(), classifier)
    if search_type in PARAMETER_SEARCH_TYPES .keys():
        logging .info("performing parameter search")
        if parameters_search is None:
            parameters_search = {}
        classifier = PARAMETER_SEARCH_TYPES[search_type](
            classifier, parameters_search, random_state=42, cv=cv
        )
        logging .info("used metadata features")
        logging .info(
            [
                col
                for col in vector_data .columns
                if "embedding"not in col and col not in ["labels", "filename"]
            ]
        )
        classifier .fit(X_train, y_train)
        logging .info(classifier .best_params_)
        classifier = classifier .best_estimator_
    else:
        logging .info("fitting model")
        classifier .fit(X_train, y_train)
    y_pred = classifier .predict(X_test)
    logging .info("classification report")
    logging .info(metrics .classification_report(y_test, y_pred))
    with open(model_path, "wb")as handle:
        pickle .dump(classifier, handle)
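The pasted function has unusual spacing such as `pd .read_csv`, which Python still accepts. As a rough illustration only (this is not how Ploomber compares code, and the helper names `same_syntax_tree` and `text_diff` are made up for this sketch), the standard library can show that two snippets differing only in spacing produce a textual diff but parse to the same syntax tree:

```python
import ast
import difflib

# Hypothetical snippets: the same statement with conventional and unusual spacing.
original = 'vector_data = pd.read_csv(str(upstream["train.enrich_tif_metadata"]))\n'
respaced = 'vector_data = pd .read_csv(str(upstream["train.enrich_tif_metadata"]))\n'


def same_syntax_tree(a: str, b: str) -> bool:
    """Return True if both sources parse to the same AST (positions ignored)."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


def text_diff(a: str, b: str) -> list:
    """Return a line-based unified diff; changed lines start with '-' or '+'."""
    return list(
        difflib.unified_diff(
            a.splitlines(),
            b.splitlines(),
            fromfile="original",
            tofile="respaced",
            lineterm="",
        )
    )


diff_lines = text_diff(original, respaced)
print("\n".join(diff_lines))
print("same syntax tree:", same_syntax_tree(original, respaced))
```

A plain text comparison flags the line as changed, while the AST comparison treats both versions as equivalent, so spacing alone can make a diff look noisy without any real change in behavior.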
Jakub Bartczuk
06/06/2022, 4:43 PM
Eduardo
- and + to the diff to show what it's detecting, but I don't see any. The whitespace definitely looks weird, though. Let me do some debugging.
Jakub Bartczuk
06/07/2022, 9:13 AM
Eduardo
ploomber interact
then:
dag[task_name].status()
and share the table that appears
Eduardo