# general
• Sam Yu (05/04/2023, 3:19 AM)
    Week 1: Hiroki will set up the StackLLaMA pipeline on Hugging Face or another platform. Sam will write a script to scrape all the Yahoo Japan Q&A to generate a dataset for training.
• Sam Yu (05/05/2023, 4:57 AM)
    Discussion on the scraped data format. Please have your input in by Monday; then I'll start the scraping process. https://vigorous-tuna-52b.notion.site/Yahoo-jp-Q-A-Data-format-d9da3e8f915140e4936cf2c00515bcf9
• hiroki (05/07/2023, 7:43 AM)
    I'm doing some research on scraping Yahoo Japan Chiebukuro (知恵袋), which we thought would be a good data source for the StackLLaMA model. Before scraping, I'm researching Yahoo Japan's policies. I want to make sure web scraping is allowed before we retrieve anything, since I want to be transparent about what kind of data we used if we were to open source it.
    Scraping is against the terms of some services like Yahoo Finance: https://support.yahoo-net.jp/PccFinance/s/article/H000011276
    The rules for 知恵袋 seem somewhat ambiguous, but there are some people posting about web scraping it, so I think it might be fine; I will research further tonight: https://gist.github.com/jshirius/e8992c0e7620de098a43d77e4bd91859
    Yahoo seems to provide NII with some datasets from Chiebukuro: https://www.nii.ac.jp/dsc/idr/yahoo/chiebkr3/Y_chiebukuro.html
    If we conclude that web scraping is safe (i.e., not against the policy), I think we should only scrape a reasonable amount of data per run. I can't really find any info about web scraping and load on the Yahoo servers.
• Sam P (05/08/2023, 12:50 AM)
    Hey guys, thanks for this, catching up now
• Sam P (05/08/2023, 1:17 AM)
    So if I understand correctly, we're thinking of training a Japanese instruct model based on a LLaMA foundation. I think that could be a good start. Other people have done similar things: see, e.g., https://github.com/masa3141/japanese-alpaca-lora , which made a Japanese instruction-following model by fine-tuning LLaMA on a GPT-translated version of the Alpaca instructions, which are a set of ~50,000 answers by ChatGPT to questions.
    There are other public instruction/answer sets commonly used, which can have different rights restrictions. One is the Dolly dataset, a set of ~15,000 instruction/answer pairs. I'm not sure if anyone has made a Japanese instruct model with this dataset. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
    One lesson from instruction tuning/RLHF tuning is that in principle we don't need a truly massive dataset of millions of instructions. This is good, since I'm not sure what compute we have access to. On the order of 10^4~10^5 high-quality instructions should be sufficient. So I'm not convinced scraping all of Chiebukuro is the right way to go. It could be better to just extract high-quality answers, or to use a different dataset entirely.
• Sam Yu (05/08/2023, 2:52 AM)
    OKWAVE [https://okwave.co.jp](https://okwave.co.jp) This could be an alternative to Yahoo Answers, as it has a good number of Q&A, and only a portion has more than one answer. After reading its policy, I don't see anything against scraping. [OKWAVE's Usage Policy](https://okwave.co.jp/about/policy/#policy04) Also, after reading through NII's Yahoo data policy for Chiebukuro: [Yahoo! Chiebukuro (「Yahoo!知恵袋」) Data Usage Policy](https://www.nii.ac.jp/dsc/idr/yahoo/chiebkr3/documents/chiebkr3-policy.html)
    1. 利用者が、本データを使用して開発した技術、システム等に関連する知的財産権は利用者に帰属するものとする。
    The intellectual property rights related to the technology, systems, etc. developed by the user using this data shall belong to the user.
    This means that if we develop a model based on Yahoo data, they do not have ownership of the model. However, their restrictive control of the data may require us to consider the best approach for acquiring the data in the future.
• Sam Yu (05/08/2023, 2:55 AM)
    Regarding high-quality answers, Chiebukuro has `expert` and `categoryMaster` tags which can help us filter for high-quality answers when it comes to it.
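    A minimal sketch of such a filter, assuming our scraped records carry those tags in a `tags` field (the record schema here is an assumption about our own scraper output, not Chiebukuro's actual markup):
    ```python
    # Hypothetical record schema for our scraped data: each answer dict
    # carries a "tags" list, where "expert" and "categoryMaster" mark
    # answers from Chiebukuro's vetted answerers.
    records = [
        {"question": "...", "answer": "...", "tags": ["expert"]},
        {"question": "...", "answer": "...", "tags": []},
    ]

    def is_high_quality(record):
        # True if the record carries at least one of the quality tags.
        return bool({"expert", "categoryMaster"} & set(record.get("tags", [])))

    high_quality = [r for r in records if is_high_quality(r)]  # keeps only the first record
    ```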
• Sam Yu (05/13/2023, 5:07 AM)
    OKWAVE's page format is so "classic" ( https://okwave.jp/qa/q00000001.html ), it's so easy to scrape them.
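    A minimal sketch of what that scraper could look like (requests + BeautifulSoup; the CSS selectors and the crawl delay are assumptions, not checked against the live site):
    ```python
    import time

    import requests
    from bs4 import BeautifulSoup

    # OKWAVE question pages follow the sequential URL pattern
    # https://okwave.jp/qa/q00000001.html, so we can simply walk the IDs.
    for qid in range(1, 101):
        url = f"https://okwave.jp/qa/q{qid:08d}.html"
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            continue  # deleted or missing question
        soup = BeautifulSoup(resp.text, "html.parser")
        question = soup.select_one(".q_article")  # placeholder selector
        answers = soup.select(".a_article")       # placeholder selector
        if question is not None:
            print(url, question.get_text(strip=True)[:50], len(answers))
        time.sleep(1.0)  # be gentle with the server, per the earlier discussion
    ```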
• Sam Yu (05/13/2023, 5:24 AM)
    https://vigorous-tuna-52b.notion.site/OKWave-Scrap-Note-600a0efe4bdb4a3598539508822e7cf1 <- I wrote about the format of the data. Please take a look; if there's no objection, I'll start writing the script.
• hiroki (05/13/2023, 5:29 AM)
    logo ideas: https://app.logomaster.ai/share/tK6nRrX2
• hiroki (05/13/2023, 9:25 AM)
    Any favorites? Powered by Stable Diffusion. Making prompts was harder than expected: https://playgroundai.com/profile/clhlk3hvk0mr2s60106sz2hub Personally, I like this one: https://playgroundai.com/post/clhlsampe0ms8s601m73iz64a
• hiroki (05/15/2023, 1:21 PM)
    Working on converting the JSON file to CSV right now. Two points.

    1. Code snippet from sample.json

    Taking a look at the sample.json file and attempting to create a CSV file from it. Do you think the following snippet from the JSON file is necessary? (It's the second object.)
    {
      "@context": "http://schema.org",
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": "1",
          "name": "パソコン・スマートフォン",
          "item": "https://okwave.jp/c207.html"
        },
        {
          "@type": "ListItem",
          "position": "2",
          "name": "PCパーツ・周辺機器",
          "item": "https://okwave.jp/c689.html"
        },
        {
          "@type": "ListItem",
          "position": "3",
          "name": "その他(PCパーツ・周辺機器)",
          "item": "https://okwave.jp/c248.html"
        }
      ]
    }
    I'm importing the JSON file into Python right now, and this second JSON object is throwing off my code. If I exclude this snippet, I can import it into Python successfully. I think it may be better to leave the second {} out of the web scraping if it's really not that essential. What do you guys think? (A parsing-side workaround is sketched at the end of this message.)

    2. JSON file per question?

    Do we have a JSON file per question, or are we thinking of appending all the question-answer pairs into one single file?
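    For point 1, a minimal sketch of a parsing-side workaround, assuming sample.json holds several top-level JSON objects back to back (which is what breaks a plain json.load): read the objects one at a time and drop any whose @type is BreadcrumbList, instead of relying on the scraper to omit it.
    ```python
    import json

    def iter_json_objects(path):
        """Yield the top-level JSON objects from a file that contains
        several of them in a row (not wrapped in a list)."""
        decoder = json.JSONDecoder()
        with open(path, encoding="utf-8") as f:
            text = f.read()
        pos = 0
        while pos < len(text):
            while pos < len(text) and text[pos].isspace():
                pos += 1  # skip whitespace between objects
            if pos >= len(text):
                break
            obj, pos = decoder.raw_decode(text, pos)
            yield obj

    # Keep every object except the breadcrumb one.
    records = [o for o in iter_json_objects("sample.json")
               if o.get("@type") != "BreadcrumbList"]
    ```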
• Sam P (05/15/2023, 1:55 PM)
    did you see what I do in clean.py?
• Sam P (05/15/2023, 1:56 PM)
    At the end of the day we want files in this format: https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data.json
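    For reference, alpaca_data.json is a single JSON list of records, each with "instruction", "input", and "output" fields ("input" is an empty string when the instruction needs no extra context). A minimal sketch of writing that format; the Japanese example content and the output filename are made up for illustration:
    ```python
    import json

    # Illustrative records in the alpaca_data.json structure; the real
    # question/answer text would come from the scraped Q&A data.
    records = [
        {
            "instruction": "日本で一番高い山は何ですか？",
            "input": "",
            "output": "富士山です。標高は3,776メートルです。",
        },
    ]

    # "alpaca_data_ja.json" is a hypothetical filename.
    with open("alpaca_data_ja.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)
    ```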
• Sam P (05/15/2023, 1:56 PM)
    I think clean.py does that already
• hiroki (05/15/2023, 2:14 PM)
    My bad, I didn't check your clean.py. I checked it just now, and it seems like it's a list of JSON objects. Thanks! I'll switch to making a pipeline for the model.
• Sam P (05/15/2023, 2:24 PM)
    Thank you! Maybe take a look at what that alpaca-lora repo is doing; I'm thinking we'll follow the same procedure.
• Sam P (05/15/2023, 2:24 PM)
    Though alternative approaches are also welcome if you come upon something that looks better
• hiroki (05/15/2023, 2:30 PM)
    I agree! Maybe first follow the same process, and then explore ways of doing things better. I'll take a closer look at it today. Should we make a new repo for the pipeline? scrap-script seems to be an independent repo.
• Sam P (05/15/2023, 4:13 PM)
    @Sam Yu any thoughts on project organization? should hiroki start a new repo for the training code?
• hiroki (05/28/2023, 12:23 AM)
    5/27: How to improve the model.
    1. Train with more LoRA matrices (a sketch of what this could mean is below).
    2. Longer context length.
    3. Train with 4-bit quantization (not a robust method yet).
    How to compare with other models (we should think of a more systematic way of evaluating):
    1. Testing on benchmarks: https://github.com/declare-lab/flan-eval
    2. Website: create a blog to curate some models for comparison.
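    For point 1 above, a minimal sketch of "more LoRA matrices" using the peft library (the module names are the standard LLaMA projection names; the r/alpha/dropout values are guesses, not tuned settings):
    ```python
    from peft import LoraConfig

    # Default LLaMA LoRA setups often adapt only the q_proj/v_proj
    # attention projections; one way to add capacity is to also adapt
    # the remaining attention and MLP projections.
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    # Apply with peft.get_peft_model(base_model, config) before training.
    ```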