Ploomber #hacktoberfest-team-8

Laura Gutierrez Funderburk

10/30/2023, 7:12 PM

Eva is also really close. I think she is working on dev-eva

Ben Marsh

10/30/2023, 7:21 PM

ok, my work is slowly pushing to a new branch.

Ben Marsh

10/30/2023, 7:33 PM

ok, now my work is on the ben-dev branch. src/app/app_precalc_index.py is what I'm trying to get working. You can see the issue I'm having in my notebook, src/etl/Ben_ETL.ipynb, about halfway down in the Document Store section, when I try to run the pipeline.

Ben Marsh

10/30/2023, 7:34 PM

I initialize the FAISS document store, write the documents in, and load the index I had previously calculated, but it doesn't seem to recognize the existence of the documents when I run the pipeline.

Laura Gutierrez Funderburk

10/30/2023, 7:34 PM

ok let me take a look

Laura Gutierrez Funderburk

10/30/2023, 9:25 PM

Hello @Ben Marsh, I took a look

Laura Gutierrez Funderburk

10/30/2023, 9:25 PM

which of the 2 apps is giving issues?

Laura Gutierrez Funderburk

10/30/2023, 9:25 PM

I see src/app/app_precalc_index.py

Laura Gutierrez Funderburk

10/30/2023, 9:25 PM

and src/app/app.py

Laura Gutierrez Funderburk

10/30/2023, 9:25 PM

are using 2 different approaches to load the document store

Laura Gutierrez Funderburk

10/30/2023, 9:26 PM

From the pipeline's perspective, it seems the app is currently incomplete. I see an extraction script, but I don't see the indexing script. In the solution I shared, the team had created one script for each of the steps:
• extract
• index
• q&a

Laura Gutierrez Funderburk

10/30/2023, 9:28 PM

also, index_path and config_path are not specified

if __name__ == "__main__":
    # Load environment variables (if any)
    openai_key = os.environ['OPENAI_HACKTOBERFEST_KEY']

    # Initialize documents
    # documents = initialize_documents('../../data/recipe_docs.csv')

    # Initialize document store and retriever
    # document_store, retriever = initialize_faiss_document_store(documents=documents)
    document_store, retriever = preloaded_faiss(index_path=index_path, config_path=)

    # Initialize pipeline
    query_pipeline = initialize_rag_pipeline(retriever=retriever, openai_key=openai_key)
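
One possible way to supply the two missing arguments, sketched under the assumption that the project uses Haystack 1.x's FAISSDocumentStore, whose load() method takes an index_path and a config_path. The helper names resolve_faiss_paths and load_document_store are hypothetical; they stand in for the project's own preloaded_faiss, whose body is not shown in this thread.

```python
from pathlib import Path


def resolve_faiss_paths(index_path, config_path):
    """Return both paths as Path objects, failing early if either file is missing."""
    index_path, config_path = Path(index_path), Path(config_path)
    missing = [str(p) for p in (index_path, config_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError("Missing FAISS files: " + ", ".join(missing))
    return index_path, config_path


def load_document_store(index_path, config_path):
    """Load a previously saved FAISS document store (assumes Haystack 1.x)."""
    index_path, config_path = resolve_faiss_paths(index_path, config_path)
    # Imported lazily so the path check above works without Haystack installed.
    from haystack.document_stores import FAISSDocumentStore

    return FAISSDocumentStore.load(
        index_path=str(index_path), config_path=str(config_path)
    )
```

Usage might look like load_document_store("faiss_index.faiss", "faiss_index.json"), where both file names are placeholders for wherever the Colab output was saved. In Haystack 1.x the FAISS index holds only the vectors, while the document text lives in a separate SQL database, so an index computed elsewhere generally needs the matching database file copied alongside it. Writing the documents into a freshly initialized store and then loading an older index may leave the two out of sync.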

Ben Marsh

10/30/2023, 9:29 PM

So I got app.py working when loading just a sample of 1000 rows of our data and calculating the index. For app_precalc_index, I was trying to load an index I had saved from running it in Google Colab (my computer doesn't have a GPU, so calculating the index for the full data set was taking about 30 hrs).

Laura Gutierrez Funderburk

10/30/2023, 9:30 PM

ok, so from what I understand, app.py is functional, with the second approach taking longer due to computational issues

Ben Marsh

10/30/2023, 9:32 PM

yes. in theory app.py would work for the full data set if i comment out the part of the initialize_documents function that takes a sample of the data, but it will take forever on my computer. the second approach was an attempt at a workaround, using colab to calculate the index, then saving that locally and loading it
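
Rather than commenting code in and out, the sample size could be made a parameter. The following is a minimal sketch using only the standard library; load_rows is a hypothetical helper, not the project's initialize_documents, and the fixed seed is an assumption added so that the same sample comes back on every run.

```python
import csv
import random


def load_rows(csv_path, sample_size=None, seed=42):
    """Read CSV rows as dicts; optionally return a reproducible random sample."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # No sample requested, or the file is smaller than the sample: use every row.
    if sample_size is None or sample_size >= len(rows):
        return rows
    # A seeded generator makes the sample identical across runs.
    return random.Random(seed).sample(rows, sample_size)
```

Calling load_rows(path, sample_size=1000) would give the small working set, and load_rows(path) the full data set, without editing the function body.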

Ben Marsh

10/30/2023, 9:33 PM

but loading it doesn't seem to work properly

Laura Gutierrez Funderburk

10/30/2023, 9:33 PM

I am also a bit confused by the 2 approaches. Here is a second solution (with some similarities) from Eva: https://github.com/btmarsh6/rag-pipeline-chatbot/tree/dev-eva/src/app This solution is a bit more complete in terms of the packaging of the application and the extraction pipeline.

Laura Gutierrez Funderburk

10/30/2023, 9:34 PM

It is confusing to know which solution to take

Laura Gutierrez Funderburk

10/30/2023, 9:36 PM

My take is this: given that the goal is to complete an MVP, if your team has found a solution for a smaller subset of the data (100 rows or 1000 rows) and was able to connect the pipeline to a Chainlit application, then this smaller application is what gets packaged for deployment.

Laura Gutierrez Funderburk

10/30/2023, 9:38 PM

From the requirements perspective, this means:
1. having finalized extraction scripts
2. having an indexing pipeline
3. having an app.py that can read and set up RAG for the smaller subset of the data
4. having a Dockerfile and requirements.txt
5. packaging these for deployment to Ploomber Cloud

Laura Gutierrez Funderburk

10/30/2023, 9:39 PM

I can get on a call tomorrow or Wednesday to help out with merging the two approaches

Ben Marsh

10/30/2023, 9:41 PM

Yes, i think a call would be helpful. Could you let me know what time would be best for you?

Laura Gutierrez Funderburk

10/30/2023, 9:49 PM

Sounds good. I can meet on Wednesday at 12 PM Pacific Time or later (until 5 PM). Between now and then, can you follow some of the steps that Eva added by incorporating the Dockerfile and requirements.txt for the smaller working app? Other steps that she followed included adding a complete download script. I will add you as a reviewer to her PR. Your goal is to merge her work with your work for the smaller working app.

Laura Gutierrez Funderburk

10/30/2023, 9:49 PM

The larger app with the full data set, split into an indexing pipeline + q&a pipeline, is something we can explore as future work.

Eva Draganova

10/30/2023, 10:34 PM

Hi @Ben Marsh and @Laura Gutierrez Funderburk, I just came back from work. Let me know if you want to meet tonight. I may be available tomorrow evening, after the Halloween candy giving... so much candy to give tomorrow

Laura Gutierrez Funderburk

10/30/2023, 10:38 PM

Hi @Eva Draganova and @Ben Marsh. I can meet tomorrow evening (anytime after 3 PM PT). Tonight I can meet a bit later (around 9 PM PT).

Ben Marsh

10/30/2023, 10:40 PM

I could do tonight though I know that's late for you Eva. Tomorrow, I can meet, but I have something else starting at 6 pm PT.

Laura Gutierrez Funderburk

10/31/2023, 12:54 AM

Ploomber Cloud deployment: https://docs.cloud.ploomber.io/en/latest/intro.html

Laura Gutierrez Funderburk

11/16/2023, 9:35 PM

archived the channel