Jess Mankewitz (they/she)
05/23/2022, 11:03 PM
Eduardo
ploomber build
so once we have a few historical runs, we could flag anomalies like a sudden increase in NAs. I think this could be used for data quality, ML model tracking, etc. Thoughts?
Jess Mankewitz (they/she)
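An editor's aside: a toy sketch of the per-run check Eduardo describes, flagging a sudden jump in the NA rate against historical runs (the function names and the tolerance are made up for illustration; a real implementation would persist the history between builds):

```python
import pandas as pd

def na_rate(df):
    """Fraction of missing cells across the whole frame."""
    return float(df.isna().sum().sum()) / df.size

def flag_na_anomaly(history, current, tolerance=0.05):
    """True if the current run's NA rate exceeds the historical mean by more than tolerance."""
    baseline = sum(history) / len(history)
    return current > baseline + tolerance

history = [0.01, 0.02, 0.01]  # NA rates recorded from previous runs
current = na_rate(pd.DataFrame({"x": [1.0, None, None, 4.0]}))  # 0.5
print(flag_na_anomaly(history, current))  # → True
```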
05/27/2022, 7:45 PM
grid, I’m setting two parameters so that my output is /path/[[parameter]]-sampled.csv
but when I build, I get /path/[[parameter]]-sampled-0.csv
Jess Mankewitz (they/she)
05/27/2022, 7:46 PM
grid, so I’m having trouble programmatically selecting the correct downsampled file…am I just missing something?
Spruha Vashi
06/02/2022, 2:27 PM
Jess Mankewitz (they/she)
06/02/2022, 10:18 PM
- source: modeling/scripts/downsample_corpora.R
  name: downsample-corpora-analysis1
  product:
    nb: modeling/output/notebooks/downsample_corpora.html
    data: modeling/output/data/prepped_data/[[analysis]]-downsampled.csv
  params:
    - analysis: 'analysis1'
but I’m getting the following error:
Error: Failed to initialize NotebookRunner task with source 'modeling/scripts/downsample_corpora.R'.
Params must be initialized with a mapping, got: [{'analysis': 'analysis1'}] ('list')
What am I missing?
Jess Mankewitz (they/she)
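An editor's aside: the error message is literal here. params received a one-element list because of the leading dash, and Ploomber wants a mapping. Writing that section without the dash should fix it (a sketch of just the params block):

```yaml
params:
  analysis: 'analysis1'
```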
06/02/2022, 10:19 PM
Jess Mankewitz (they/she)
06/02/2022, 10:21 PM
Nikhil Reddy
06/03/2022, 7:33 AM
Jakub Bartczuk
06/03/2022, 8:59 PM
ploomber build --partial something.task1
Jakub Bartczuk
06/06/2022, 12:43 PM
Julien Roy
06/07/2022, 5:27 PM
Nikhil Reddy
06/08/2022, 3:01 AM
bicepcurl
06/08/2022, 9:52 AM
bicepcurl
06/08/2022, 9:52 AM
Edward Wang
06/08/2022, 10:33 AM
clients.py and changing the pipeline.yaml file. However, is there a more intuitive way? Reason being, we can read GCS parquet files easily using dd.read_parquet("gs://file/path/*.parquet") without having to set up the GCS client. I think a user would expect writing to be just as easy, using dd.to_parquet() without the client. Or is that already handled and I’m missing something?
2. I’m trying to create a mono repo with multiple data pipelines in it, with each sub-directory being a workflow. Are there any tips on how to structure such a mono repo? Happy to hear them 🙂
3. How can I fully utilise Dask’s distributed computing while using Ploomber? Does it actually work, or is Ploomber itself already distributed?
Julien Roy
06/08/2022, 4:18 PM
Jess Mankewitz (they/she)
06/10/2022, 12:53 AM
Spruha Vashi
06/10/2022, 10:26 PM
Edward Wang
06/13/2022, 10:56 AM
If the soopervisor export command had an option to use buildx to build images for other architectures, that would be great! Right now I'm not running soopervisor export but instead running the docker commands directly. This is slightly cumbersome since I'll have to look at soopervisor's source code to understand what export does under the hood. Of course, this would be solved if we were to write a CI/CD pipeline, but at the very early stages of experimenting, I think users would want to build the image locally and push it to their container registries.
Feel free to push back on this though!
Eduardo
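An editor's aside: until such an option exists, the multi-arch build Edward describes can be done with buildx directly. An illustrative invocation (image name and platform list are placeholders, and this is not a description of what soopervisor export actually runs):

```shell
# one-time: create and select a builder that supports multi-platform builds
docker buildx create --use --name multiarch
# build for both architectures and push to the registry in one step
docker buildx build --platform linux/amd64,linux/arm64 \
  -t registry.example.com/my-pipeline:latest --push .
```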
Jess Mankewitz (they/she)
06/14/2022, 6:54 PM
Jess Mankewitz (they/she)
06/14/2022, 6:55 PM
SHUBHAM AGRAWAL
06/15/2022, 9:13 AM
gaoyang liu
06/16/2022, 6:04 AM
Eduardo Blancas - Develop and deploy a Machine Learning pipeline in 30 minutes with Ploomber - YouTube
ploomber scaffold --conda --empty
to establish a new project.
Modify the pipeline.yaml.
cd demo
ploomber scaffold
But I do not get the product dict in get.py as shown in the video; instead, product = None.
Did I miss something?
gaoyang liu
06/17/2022, 1:45 PM
Spruha Vashi
06/18/2022, 6:23 AM
gaoyang liu
06/19/2022, 9:25 AM
bias_variance_decomp:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from mlxtend.evaluate import bias_variance_decomp

raw_data = fetch_california_housing()
X = pd.DataFrame(raw_data.data[:200], columns=raw_data.feature_names)
y = pd.DataFrame(raw_data.target[:200], columns=['price'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

out = pd.DataFrame(columns=["MSE", 'Bias^2', "Variance"])
for min_samples_leaf in list(range(1, 11)):
    model = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
    mse, bias, variance = bias_variance_decomp(
        model,
        X_train.to_numpy(),
        y_test.to_numpy(),
        X_test.to_numpy(),
        y_test.to_numpy(),
        loss="mse"
    )
    print(mse, bias, variance)
It reports an error:
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_19900\3802776880.py in <cell line: 13>()
     13 for min_samples_leaf in list(range(1, 11)):
     14     model = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf)
---> 15     mse, bias, variance = bias_variance_decomp(
     16         model,
     17         X_train.to_numpy(),

C:\ProgramData\Anaconda3\envs\ag10\lib\site-packages\mlxtend\evaluate\bias_variance_decomp.py in bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, loss, num_rounds, random_seed, **fit_params)
    104
    105 for i in range(num_rounds):
--> 106     X_boot, y_boot = _draw_bootstrap_sample(rng, X_train, y_train)
    107
    108 # Keras support

C:\ProgramData\Anaconda3\envs\ag10\lib\site-packages\mlxtend\evaluate\bias_variance_decomp.py in _draw_bootstrap_sample(rng, X, y)
     14     sample_indices, size=sample_indices.shape[0], replace=True
     15 )
---> 16 return X[bootstrap_indices], y[bootstrap_indices]
     17
     18

IndexError: index 124 is out of bounds for axis 0 with size 40
In PyCharm I can set a breakpoint at line 16 (---> 16 return X[bootstrap_indices], y[bootstrap_indices]) to check and manipulate variables and find out what is wrong. Quite easy and intuitive.
However, in JupyterLab it is much more complicated. I did not even know how to debug it in Jupyter. The newly shipped JupyterLab debugger is slow and cannot step into imported modules (at least in my case).
Recently I found that ipdb may be a solution:
import ipdb
# put this above the entry function
ipdb.set_trace(context=8)
# then in Pdb
ipdb> b C:\ProgramData\Anaconda3\envs\ag10\lib\site-packages\mlxtend\evaluate\bias_variance_decomp.py:16
ipdb> c
# then I can manipulate variables.
But the process is so cumbersome. Also, ipdb seems hacky to me compared with debugging in PyCharm or VS Code.
I think I am not the only one who has trouble debugging in Jupyter. How do others debug in JupyterLab? Any suggestions? Thanks very much.
gaoyang liu
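An editor's aside on the traceback itself: per the signature shown in the trace, bias_variance_decomp(estimator, X_train, y_train, X_test, y_test, ...), the snippet passes y_test where y_train belongs, so bootstrap indices are drawn for the 160 training rows but then index a 40-row array. A minimal numpy-only reproduction of that mismatch:

```python
import numpy as np

# _draw_bootstrap_sample draws indices based on X's length...
rng = np.random.RandomState(0)
X_train = np.zeros((160, 8))  # 160 training rows
y_wrong = np.zeros(40)        # y_test (40 rows) passed by mistake

sample_indices = np.arange(X_train.shape[0])
bootstrap_indices = rng.choice(sample_indices, size=sample_indices.shape[0], replace=True)

# ...so indexing the 40-element y with indices up to 159 raises IndexError
try:
    _ = X_train[bootstrap_indices], y_wrong[bootstrap_indices]
except IndexError as exc:
    print("IndexError:", exc)
```

So before reaching for a debugger at all, passing y_train.to_numpy() as the third argument should make the original call work.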
06/24/2022, 3:31 AM
Eduardo
"Jupyter notebooks don't scale well to requirements typical for running ML in a large-scale production environment. These requirements include secure and privacy-respecting access to large datasets, reproducibility, high performance, scalability, documentation, and observability (logging, monitoring, debugging)."
I don't get why they said notebooks are not reproducible (they could run them on a CI server on each change). Performance and scalability don't sound like problems with notebooks per se; it's more that what comes out of a notebook is an early prototype. On documentation and observability I agree: notebooks are hardly ever documented, and it's hard to have good observability on them. Thoughts?