|
1 | 1 | ## Data Source \ Copyright of Raw Counseling Data
|
2 | 2 |
|
| 3 | +We collect and filter data from the following raw datasets based on dialogue structure and psychotherapy labels. The Psy-Insight license inherits from the licenses of these raw datasets, including the MIT license, the Apache-2.0 license, and the Chunsong Public License. |
| 4 | + |
3 | 5 | ### Raw Datasets
|
4 | 6 |
|
5 | 7 | * **MNBVC dataset**
|
6 |
| - * MNBVC: Massive Never-ending BT Vast Chinese corpus |
| 8 | + * [MNBVC: Massive Never-ending BT Vast Chinese corpus](http://mnbvc.253874.net/) |
7 | 9 |
|
8 |
| -* **Book3** dataset |
9 |
| - * The Pile: An 800GB Dataset of Diverse Text for Language Modeling |
| 10 | +* **Book3 dataset** |
| 11 | + * [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://huggingface.co/datasets/defunct-datasets/the_pile_books3) |
10 | 12 |
|
11 | 13 | * Crawled Data Source
|
12 |
| - * **psyarxiv** |
| 14 | + * **PsyArXiv Website** |
13 | 15 | * [Emotional First Aid Raw Dataset](https://github.com/chatopera/efaqa-corpus-raw)
|
| 16 | + * [CBook](https://github.com/FudanNLPLAB/CBook-150K) |
14 | 17 |
|
15 | 18 |
|
16 | 19 |
|
@@ -72,10 +75,20 @@ The MNBVC Open Source Chinese Corpus Project provides a platform for researchers
|
72 | 75 | * Crawler scripts [isLinXu/xxarxiv_mnbvc](https://github.com/isLinXu/xxarxiv_mnbvc)
|
73 | 76 | * License: [License and copyright - arXiv info](https://info.arxiv.org/help/license/index.html)
|
74 | 77 |
|
| 78 | + |
75 | 79 | ### [Emotional First Aid Raw Dataset](https://github.com/chatopera/efaqa-corpus-raw)
|
76 | 80 |
|
77 |
| -Emotional First Aid Raw Dataset (EFARD) is a common crawl paid dataset for psychological dataset. This dataset is composed of data crawled from numerous open data websites, including Yixinli, Douban, etc. We retrieve about 1300 turns of counseling conversations from it. |
| 81 | +Emotional First Aid Raw Dataset (EFARD) is a paid psychological dataset composed of data crawled from numerous open data websites, including Yixinli and Douban. We retrieve about 1,300 turns of counseling conversations from it. |
78 | 82 |
|
79 |
| -* license [cskefu.com/licenses/v1.html](https://www.cskefu.com/licenses/v1.html) |
| 83 | +* Website: [EFARD](https://github.com/chatopera/efaqa-corpus-raw) |
| 84 | +* License: [cskefu.com/licenses/v1.html](https://www.cskefu.com/licenses/v1.html) |
80 | 85 | * Crawler scripts: cn_pipeline
|
81 | 86 |
|
| 87 | +### [CBook](https://github.com/FudanNLPLAB/CBook-150K) |
| 88 | +A crawled dataset of Chinese books from the Fudan NLP Lab. We extract more than 10 sessions from this dataset. |
| 89 | +* License: Apache 2.0 |
| 90 | +* Scripts: http://www.doc-ai.cn/ |
| 91 | +* Policy: Use for scientific research is permitted. |
| 92 | + |
| 93 | + |
| 94 | + |