Commit 2ddd18d

thomwolf, LysandreJik, and lhoestq authored
Starting to add some real doc (huggingface#358)
* Starting to add some doc
* WIP on the doc
* hide _type from __repr__
* WIP on loading
* upgrading csv
* working on doc
* loading dataset should be ok
* starting exploring
* WIP exploring
* finishing exploring and format
* Add a `build_doc` job
* Deploy doc
* Update config.yml
* Deploy should be executable
* Deploy + Documentation
* Only build on master branch
* working on processing
* fix bug in train_test_split
* add a verbose option
* finishing processing page
* add dataset guide plus features
* add create your dataset page and share your dataset
* updating the sharing dataset section
* fix doc style and quality
* clean up
* further clean ups
* typos
* start docs on faiss
* various updates and clean-ups
* fix doc
* update faiss docs

Co-authored-by: Lysandre <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
1 parent 37289d6 commit 2ddd18d

42 files changed (+3769 -329 lines)

.circleci/config.yml

+29
@@ -26,10 +26,39 @@ jobs:
             - run: black --check --line-length 119 --target-version py36 tests src
             - run: isort --check-only --recursive tests src
             - run: flake8 tests src
+    build_doc:
+        working_directory: ~/nlp
+        docker:
+            - image: circleci/python:3.6
+        steps:
+            - checkout
+            - run: sudo pip install .[docs]
+            - run: cd docs && make html SPHINXOPTS="-W"
+            - store_artifacts:
+                path: ./docs/_build
+    deploy_doc:
+        working_directory: ~/nlp
+        docker:
+            - image: circleci/python:3.6
+        steps:
+            - add_ssh_keys:
+                fingerprints:
+                    - "5b:7a:95:18:07:8c:aa:76:4c:60:35:88:ad:60:56:71"
+            - checkout
+            - run: sudo pip install .[docs]
+            - run: ./.circleci/deploy.sh
+
+workflow_filters: &workflow_filters
+    filters:
+        branches:
+            only:
+                - master
 
 workflows:
     version: 2
     build_and_test:
         jobs:
             - check_code_quality
             - run_dataset_script_tests
+            - build_doc
+            - deploy_doc: *workflow_filters

.circleci/deploy.sh

+40
@@ -0,0 +1,40 @@
+cd docs
+
+function deploy_doc(){
+    echo "Creating doc at commit $1 and pushing to folder $2"
+    git checkout $1
+    if [ ! -z "$2" ]
+    then
+        if [ "$2" == "master" ]; then
+            echo "Pushing master"
+            make clean && make html && scp -r -oStrictHostKeyChecking=no _build/html/* $doc:$dir/$2/
+            cp -r _build/html/_static .
+        elif ssh -oStrictHostKeyChecking=no $doc "[ -d $dir/$2 ]"; then
+            echo "Directory" $2 "already exists"
+            scp -r -oStrictHostKeyChecking=no _static/* $doc:$dir/$2/_static/
+        else
+            echo "Pushing version" $2
+            make clean && make html
+            rm -rf _build/html/_static
+            cp -r _static _build/html
+            scp -r -oStrictHostKeyChecking=no _build/html $doc:$dir/$2
+        fi
+    else
+        echo "Pushing stable"
+        make clean && make html
+        rm -rf _build/html/_static
+        cp -r _static _build/html
+        scp -r -oStrictHostKeyChecking=no _build/html/* $doc:$dir
+    fi
+}
+
+# You can find the commit for each tag on https://github.com/huggingface/nlp/tags
+# Deploys the master documentation on huggingface.co/nlp/master
+deploy_doc "master" master
+
+# Example of how to deploy a doc on a certain commit (the commit doesn't have to be on the master branch).
+# The following commit would live on huggingface.co/nlp/v1.0.0
+#deploy_doc "b33a385" v1.0.0
+
+# Replace this by the latest stable commit. It is recommended to pin on a version release rather than master.
+deploy_doc "master"
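
The branching inside `deploy_doc` may be easier to review when written as a pure decision function. The following is a minimal Python sketch and is not part of the commit: `plan_deploy` and the labels it returns are hypothetical names chosen for illustration, and the `remote_dir_exists` flag stands in for the `ssh ... "[ -d $dir/$2 ]"` check.

```python
from typing import Optional


def plan_deploy(folder: Optional[str], remote_dir_exists: bool) -> str:
    """Return a label for the branch deploy_doc would take.

    Hypothetical helper for illustration only; the script itself uses no such labels.
    """
    if not folder:
        # No second argument: build and push to the root docs directory ("stable").
        return "build-and-push-stable"
    if folder == "master":
        # master is always rebuilt, and its _static folder is copied locally for reuse.
        return "build-and-push-master"
    if remote_dir_exists:
        # A versioned folder that is already deployed only gets fresh _static assets.
        return "refresh-static-only"
    # A new versioned folder: build, swap in the shared _static, then push.
    return "build-and-push-version"
```

Under these assumptions, the three calls in the script (`deploy_doc "master" master`, the commented-out `deploy_doc "b33a385" v1.0.0`, and `deploy_doc "master"`) map to the master, version, and stable branches respectively.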

.gitignore

+3
@@ -51,3 +51,6 @@ venv.bak/
 # playground
 /playground
 
+# Sphinx documentation
+docs/_build/
+docs/source/_build/

CONTRIBUTING.md

+2 -2
@@ -24,8 +24,8 @@
 pip install -e ".[dev]"
 ```
 
-(If transformers was already installed in the virtual environment, remove
-it with `pip uninstall transformers` before reinstalling it in editable
+(If nlp was already installed in the virtual environment, remove
+it with `pip uninstall nlp` before reinstalling it in editable
 mode with the `-e` flag.)
 
 Right now, we need an unreleased version of `isort` to avoid a

datasets/csv/csv.py

+16 -8
@@ -1,6 +1,7 @@
 # coding=utf-8
 
 from dataclasses import dataclass
+from typing import List
 
 import pyarrow.csv as pac
 
@@ -11,26 +12,33 @@
 class CsvConfig(nlp.BuilderConfig):
     """BuilderConfig for CSV."""
 
-    skip_rows: int = 0
-    header_as_column_names: bool = True
-    delimiter: str = ","
-    quote_char: str = '"'
+    skip_rows: int = None
+    column_names: List[str] = None
+    autogenerate_column_names: bool = None
+    delimiter: str = None
+    quote_char: str = None
     read_options: pac.ReadOptions = None
     parse_options: pac.ParseOptions = None
     convert_options: pac.ConvertOptions = None
 
     @property
     def pa_read_options(self):
         read_options = self.read_options or pac.ReadOptions()
-        read_options.skip_rows = self.skip_rows
-        read_options.autogenerate_column_names = not self.header_as_column_names
+        if self.skip_rows is not None:
+            read_options.skip_rows = self.skip_rows
+        if self.column_names is not None:
+            read_options.column_names = self.column_names
+        if self.autogenerate_column_names is not None:
+            read_options.autogenerate_column_names = self.autogenerate_column_names
         return read_options
 
     @property
     def pa_parse_options(self):
         parse_options = self.parse_options or pac.ParseOptions()
-        parse_options.delimiter = self.delimiter
-        parse_options.quote_char = self.quote_char
+        if self.delimiter is not None:
+            parse_options.delimiter = self.delimiter
+        if self.quote_char is not None:
+            parse_options.quote_char = self.quote_char
         return parse_options
 
     @property
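
One visible effect of switching the field defaults to `None` is that a value already set on a user-supplied `read_options` object is no longer overwritten by a config default. The sketch below illustrates that pattern in isolation. It is not the commit's code: `FakeReadOptions` is a hypothetical stand-in for `pyarrow.csv.ReadOptions` (so the example runs without pyarrow), and `apply_overrides` is an assumed helper name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeReadOptions:
    """Hypothetical stand-in for pyarrow.csv.ReadOptions."""

    skip_rows: int = 0
    column_names: Optional[List[str]] = None
    autogenerate_column_names: bool = False


def apply_overrides(options, **overrides):
    """Copy onto `options` only the overrides that were explicitly set (not None)."""
    for name, value in overrides.items():
        if value is not None:
            setattr(options, name, value)
    return options


# The caller configured skip_rows=2 on the options object and left the
# config-level skip_rows unset (None), so the value 2 survives.
opts = apply_overrides(FakeReadOptions(skip_rows=2), skip_rows=None, column_names=["a", "b"])
print(opts.skip_rows, opts.column_names)  # 2 ['a', 'b']
```

With the previous unconditional assignment, the same call would have reset `skip_rows` to the config default.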

docs/Makefile

+20
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/README.md

+211
@@ -0,0 +1,211 @@
+# Generating the documentation
+
+To generate the documentation, you first have to build it. Several packages are necessary to build the doc;
+you can install them with the following command, at the root of the code repository:
+
+```bash
+pip install -e ".[docs]"
+```
+
+---
+**NOTE**
+
+You only need to generate the documentation to inspect it locally (if you're planning changes and want to
+check what they look like before committing, for instance). You don't have to commit the built documentation.
+
+---
+
+## Packages installed
+
+Here's an overview of all the packages installed. If you ran the previous command, which installs all of them
+through the `docs` extra, you do not need to run the following commands.
+
+Building the documentation requires the `sphinx` package, which you can
+install using:
+
+```bash
+pip install -U sphinx
+```
+
+You also need the custom [theme](https://github.com/readthedocs/sphinx_rtd_theme) by
+[Read The Docs](https://readthedocs.org/). You can install it using the following command:
+
+```bash
+pip install sphinx_rtd_theme
+```
+
+The third necessary package is `recommonmark`, which lets Sphinx accept Markdown as well as reStructuredText:
+
+```bash
+pip install recommonmark
+```
+
+## Building the documentation
+
+Once you have set up `sphinx`, you can build the documentation by running the following command in the `/docs` folder:
+
+```bash
+make html
+```
+
+A folder called ``_build/html`` should have been created. You can now open the file ``_build/html/index.html`` in your
+browser.
+
+---
+**NOTE**
+
+If you are adding/removing elements from the toc-tree or from any structural item, it is recommended to clean the build
+directory before rebuilding. Run the following command to clean and build:
+
+```bash
+make clean && make html
+```
+
+---
+
+It should build the static app that will be available under `/docs/_build/html`.
+
+## Adding a new element to the tree (toc-tree)
+
+Accepted files are reStructuredText (.rst) and Markdown (.md). Create a file with the matching extension and put it
+in the source directory. You can then link it to the toc-tree by putting the filename without the extension.
+
+## Preview the documentation in a pull request
+
+Once you have made your pull request, you can check what the documentation will look like after it's merged by
+following these steps:
+
+- Look at the checks at the bottom of the conversation page of your PR (you may need to click on "show all checks" to
+  expand them).
+- Click on "details" next to the `ci/circleci: build_doc` check.
+- In the new window, click on the "Artifacts" tab.
+- Locate the file "docs/_build/html/index.html" (or any specific page you want to check) and click on it to get a
+  preview.
+
+## Writing Documentation - Specification
+
+The `huggingface/transformers` documentation follows the
+[Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style. It is
+mostly written in ReStructuredText
+([Sphinx simple documentation](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html),
+[Sourceforge complete documentation](https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html)).
+
+### Adding a new section
+
+A section is a page held in the `Notes` toc-tree on the documentation. Adding a new section is done in two steps:
+
+- Add a new file under `./source`. This file can either be ReStructuredText (.rst) or Markdown (.md).
+- Link that file in `./source/index.rst` on the correct toc-tree.
+
+### Adding a new model
+
+When adding a new model:
+
+- Create a file `xxx.rst` under `./source/model_doc`.
+- Link that file in `./source/index.rst` on the `model_doc` toc-tree.
+- Write a short overview of the model:
+    - Overview with paper & authors
+    - Paper abstract
+    - Tips and tricks and how to use it best
+- Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and
+  every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow.
+  The order is generally:
+    - Configuration
+    - Tokenizer
+    - PyTorch base model
+    - PyTorch head models
+    - TensorFlow base model
+    - TensorFlow head models
+
+These classes should be added using the RST syntax, usually as follows:
+```
+XXXConfig
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XXXConfig
+    :members:
+```
+
+This will include every public method of the configuration. If for some reason you wish for a method not to be
+displayed in the documentation, you can do so by specifying which methods should be in the docs:
+
+```
+XXXTokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.XXXTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+```
+
+### Writing source documentation
+
+Values that should be put in `code` should either be surrounded by double backticks: \`\`like so\`\` or be written as
+an object using the :obj: syntax: :obj:\`like so\`.
+
+When mentioning a class, it is recommended to use the :class: syntax as the mentioned class will be automatically
+linked by Sphinx: :class:\`transformers.XXXClass\`
+
+When mentioning a function, it is recommended to use the :func: syntax as the mentioned method will be automatically
+linked by Sphinx: :func:\`transformers.XXXClass.method\`
+
+Links should be written like so (note the double underscore at the end): \`text for the link <./local-link-or-global-link#loc>\`__
+
+#### Defining arguments in a method
+
+Arguments should be defined with the `Args:` prefix, followed by a line return and an indentation.
+The argument should be followed by its type, with its shape if it is a tensor, and a line return.
+Another indentation is necessary before writing the description of the argument.
+
+Here's an example showcasing everything so far:
+
+```
+    Args:
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary.
+
+            Indices can be obtained using :class:`transformers.AlbertTokenizer`.
+            See :func:`transformers.PreTrainedTokenizer.encode` and
+            :func:`transformers.PreTrainedTokenizer.__call__` for details.
+
+            `What are input IDs? <../glossary.html#input-ids>`__
+```
+
+#### Writing a multi-line code block
+
+Multi-line code blocks can be useful for displaying examples. They are done like so:
+
+```
+Example::
+
+    # first line of code
+    # second line
+    # etc
+```
+
+The `Example` string at the beginning can be replaced by anything as long as there are two colons following it.
+
+#### Writing a return block
+
+Return values should be introduced with the `Returns:` prefix, followed by a line return and an indentation.
+The first line should be the type of the return, followed by a line return. No need to indent further for the elements
+building the return.
+
+Here's an example for tuple return, comprising several objects:
+
+```
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
+        loss (`optional`, returned when ``masked_lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
+            Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
+        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+```
+
+Here's an example for a single value return:
+
+```
+    Returns:
+        A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+```
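
To see how the `Args:`, `Returns:`, and `Example::` conventions from this README fit together, here is a small self-contained Python function documented in that style. It is only a sketch and does not appear in the commit: `special_tokens_mask` and its parameters are hypothetical, and the ids in the example are arbitrary.

```python
from typing import List


def special_tokens_mask(token_ids: List[int], special_ids: List[int]) -> List[int]:
    """
    Flags which positions in a sequence hold special tokens.

    Args:
        token_ids (:obj:`List[int]`):
            Token ids of the sequence.
        special_ids (:obj:`List[int]`):
            Ids that should be treated as special tokens.

    Returns:
        A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

    Example::

        special_tokens_mask([101, 7592, 102], [101, 102])
    """
    # A set makes each membership check constant time.
    special = set(special_ids)
    return [1 if token_id in special else 0 for token_id in token_ids]
```

The type of each argument sits in parentheses after its name, the description is indented one level further, and the single-value return needs no extra indentation, as described above.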
