# ask-anything
j
How does ploomber actually check if something is outdated? I have a function that is ALWAYS considered outdated, and it gives a spurious diff
Is there some way to freeze the file version?
e
it stores the source code of the function and compares it to the current one. it normalizes whitespace and ignores comments. can you show the spurious diff? what do you mean by freezing the file version?
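For context, a minimal sketch of that store-and-compare idea (an illustration, not Ploomber's actual implementation; the `normalize` below is a stand-in for its real normalizer):

```python
import inspect

def normalize(source: str) -> str:
    # stand-in normalizer: strip trailing whitespace per line; Ploomber's
    # real normalizer additionally removes comments and reformats the code
    return "\n".join(line.rstrip() for line in source.splitlines())

def is_outdated(func, stored_source: str) -> bool:
    # a task is outdated when the normalized current source no longer
    # matches the normalized source stored at the last execution
    return normalize(inspect.getsource(func)) != normalize(stored_source)
```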
j
how does it normalize whitespace?
That might be it
uhm sorry
`vector_data = pd .read_csv`
in the original file there is no space between "pd" and ".read_csv"
e
It runs autopep8, but I don't think that's the problem. Let me do some digging; I'll send you some commands you can run to debug it.
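One way to check this by hand is to run autopep8's Python API on the offending line and see whether it touches the odd spacing (a quick sketch; requires `pip install autopep8`):

```python
import autopep8

original = 'vector_data = pd .read_csv("data.csv")\n'
normalized = autopep8.fix_code(original)
# if input and output are identical, autopep8 isn't introducing the spacing
print(normalized == original, repr(normalized))
```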
j
I ran autopep8 and then black, which I use for formatting, and black didn't detect any changes. But `ploomber status` says the code changed
e
Oh, I see what the problem is. This has happened before: I think black changes the quotation marks (autopep8 leaves them alone); I remember someone having this problem when using black. Try skipping black and see if that fixes it. We still have to provide a long-term solution, since black is pretty popular
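To see the quote change black makes, here's a quick sketch using black's Python API (requires `pip install black`):

```python
import black

source = "name = 'ploomber'\n"
# black rewrites single quotes to double quotes, so source cached before
# running black won't match the post-black file character-for-character
print(black.format_str(source, mode=black.Mode()))
# -> name = "ploomber"
```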
j
but when I run autopep8 it also doesn't change anything
I tried `normalize_python` from your codediffer and it returns code with this weird new whitespace, which is different from autopep8's normalization
or maybe you use some specific options?
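A quick way to compare the two normalizers side by side (a sketch; the import path for `normalize_python` is an assumption based on this thread):

```python
import autopep8
from ploomber.codediffer import normalize_python  # assumed module path

snippet = 'vector_data = pd.read_csv("data.csv")\n'
# print both normalized forms to spot where the extra whitespace comes from
print(repr(autopep8.fix_code(snippet)))
print(repr(normalize_python(snippet)))
```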
e
alright, let me take a look at the source code
ok, can you run the task that's always marked as outdated, then execute:
```
ploomber interact
```
then:
```python
# replace 'task-name' with the actual name
print(dag['task-name'].status(return_code_diff=True)['Code diff'])
```
and show me the output?
j
```python
def train(
    product,
    upstream,
    classes_to_use: List[str],
    class_maping: Dict,
    model: str,
    model_parameters: Optional[Dict] = None,
    search_type: Optional[str] = None,
    parameters_search: Optional[Dict] = None,
    cv: int = 5,
    test_size: float = 0.2,
    perform_data_scaling: bool = True,
):
    model_path = str(product["model_path"])
    vector_data = pd .read_csv(str(upstream["train.enrich_tif_metadata"]))
    vector_data = process_labels(vector_data, classes_to_use, class_maping)
    X = vector_data .drop(["labels", "filename"], axis=1).to_numpy()
    y = vector_data["labels"].to_numpy()
    X_train, X_test, y_train, y_test = model_selection .train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    logging .info("used metadata features")
    logging .info(
        [
            col
            for col in vector_data .columns
            if "embedding"not in col and col not in ["labels", "filename"]
        ]
    )
    logging .info("also using embeddings")
    if model_parameters is None:
        model_parameters = {}
    classifier = MODELS[model](**model_parameters)
    if perform_data_scaling:
        classifier = pipeline .make_pipeline(
            preprocessing .StandardScaler(), classifier)
    if search_type in PARAMETER_SEARCH_TYPES .keys():
        logging .info("performing parameter search")
        if parameters_search is None:
            parameters_search = {}
        classifier = PARAMETER_SEARCH_TYPES[search_type](
            classifier, parameters_search, random_state=42, cv=cv
        )
        logging .info("used metadata features")
        logging .info(
            [
                col
                for col in vector_data .columns
                if "embedding"not in col and col not in ["labels", "filename"]
            ]
        )
        classifier .fit(X_train, y_train)
        logging .info(classifier .best_params_)
        classifier = classifier .best_estimator_
    else:
        logging .info("fitting model")
        classifier .fit(X_train, y_train)
    y_pred = classifier .predict(X_test)
    logging .info("classification report")
    logging .info(metrics .classification_report(y_test, y_pred))
    with open(model_path, "wb")as handle:
        pickle .dump(classifier, handle)
```
that's basically the whole function
e
interesting. so that should add `-` and `+` markers to the diff to show what it's detecting, but I don't see any. the whitespace definitely looks weird, though. let me do some debugging
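For reference, the `-`/`+` markers look like this in a unified diff (a sketch using Python's standard difflib; Ploomber may use a different differ internally):

```python
import difflib

stored = 'vector_data = pd.read_csv("data.csv")\n'
current = 'vector_data = pd .read_csv("data.csv")\n'
# removed lines are prefixed with '-', added lines with '+'
print("".join(difflib.unified_diff(
    stored.splitlines(keepends=True),
    current.splitlines(keepends=True),
    fromfile="stored", tofile="current",
)))
```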
j
BTW I first tried this with ploomber 0.15, then updated to 0.19.6, and it's the same
e
yeah, I was expecting that. we haven't changed the code that compares the cached source vs the current one in a while. please try this:
```
ploomber interact
```
then:
```python
# replace 'task-name' with the actual name
dag['task-name'].status()
```
and share the table that appears
were you able to fix this?