Commit d9e965f

Update data_source.md
1 parent 200d5f3 commit d9e965f

1 file changed: +19 -6 lines changed

docs/data_source/data_source.md

+19-6
@@ -1,16 +1,19 @@
11
## Data Source \ Copyright of Raw Counseling Data
22

3+
We collect and filter data from the following raw common crawl datasets based on dialogue structure and psychotherapy labels. Psy-Insight inherits its license terms from these raw datasets, which include the MIT License, the Apache-2.0 License, and the Chunsong Public License.
4+
35
### Raw Datasets
46

57
* **MNBVC dataset**
6-
* MNBVC: Massive Never-ending BT Vast Chinese corpus
8+
* [MNBVC: Massive Never-ending BT Vast Chinese corpus](http://mnbvc.253874.net/)
79

8-
* **Book3** dataset
9-
* The Pile: An 800GB Dataset of Diverse Text for Language Modeling
10+
* **Book3 dataset**
11+
* [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://huggingface.co/datasets/defunct-datasets/the_pile_books3)
1012

1113
* Crawled Data Source
12-
* **psyarxiv**
14+
  * **PsyArXiv Website**
1315
* [Emotional First Aid Raw Dataset](https://github.com/chatopera/efaqa-corpus-raw)
16+
* [CBook](https://github.com/FudanNLPLAB/CBook-150K)
1417

1518

1619

@@ -72,10 +75,20 @@ The MNBVC Open Source Chinese Corpus Project provides a platform for researchers
7275
* Crawler scripts [isLinXu/xxarxiv_mnbvc](https://github.com/isLinXu/xxarxiv_mnbvc)
7376
* License: [License and copyright - arXiv info](https://info.arxiv.org/help/license/index.html)
7477

78+
7579
### [Emotional First Aid Raw Dataset](https://github.com/chatopera/efaqa-corpus-raw)
7680

77-
Emotional First Aid Raw Dataset (EFARD) is a common crawl paid dataset for psychological dataset. This dataset is composed of data crawled from numerous open data websites, including Yixinli, Douban, etc. We retrieve about 1300 turns of counseling conversations from it.
81+
Emotional First Aid Raw Dataset (EFARD) is a paid psychological dataset. This dataset is composed of data crawled from numerous open data websites, including Yixinli, Douban, etc. We retrieve about 1300 turns of counseling conversations from it.
7882

79-
* license [cskefu.com/licenses/v1.html](https://www.cskefu.com/licenses/v1.html)
83+
* Website: [EFARD](https://github.com/chatopera/efaqa-corpus-raw)
84+
* License: [cskefu.com/licenses/v1.html](https://www.cskefu.com/licenses/v1.html)
8085
* Crawler scripts: cn_pipeline
8186

87+
### [CBook](https://github.com/FudanNLPLAB/CBook-150K)
88+
A crawled Chinese book dataset from the Fudan NLP Lab. We extract 10+ sessions from this dataset.
89+
* License: Apache 2.0
90+
* Scripts: http://www.doc-ai.cn/
91+
* Policy: Use for scientific research is permitted.
92+
93+
94+
