PLSUM

Generating pt-br Wikipedia articles from reference documents from the web.

This is the official repository for the paper "PLSUM: Generating PT-BR Wikipedia by Summarizing Websites", by André Seidel Oliveira¹ and Anna Helena Reali Costa¹, to be presented at ENIAC 2021. Our work is inspired by WikiSum (LIU, Peter J. et al., 2018), a similar work for the English language.

¹ Researchers at the Department of Computer Engineering and Digital Systems (PCS) of the University of São Paulo (USP).

The challenge: Generate Brazilian Wikipedia leads from multiple website texts!

PLSUM takes as input (1) a title and (2) a set of texts related to the title (both in Portuguese), and returns an original wiki-like summary about the title. PLSUM has two stages: the extractive stage filters the set of input documents, returning a limited number of sentences, while the abstractive stage generates an abstractive (authorial) summary from the title and the extracted sentences. The model was fine-tuned and tested on BrWac2Wiki, a dataset whose records associate a title, multiple documents from the web, and a Wikipedia lead (the first section of a Wikipedia article).
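As a rough illustration, the framework composes the two stages as in the sketch below (a minimal sketch; the function names are illustrative and are not the actual API of this repository):

# Minimal sketch of the two-stage PLSUM flow; plsum, extractor, and
# abstractor are illustrative names, not this repository's actual API.
def plsum(title, documents, extractor, abstractor, n_sentences=32):
    # Extractive stage: keep only the N most relevant sentences.
    sentences = extractor(title, documents, n=n_sentences)
    # Abstractive stage: generate a wiki-like lead from the title and sentences.
    return abstractor(title, sentences)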

Modules

Below is a brief description of what you will find in the src/ and notebooks/ folders:

extractive_stage:

The extractive stage filters prominent sentences from the input documents. It returns a list of N sentences in order of importance, where N is a hyperparameter. In src/extractive_stage/sparse_models.py we implement TF-IDF, Random, and Cheating as described in the paper. In src/extractive_stage/cluster_embbeding.py and src/extractive_stage/generate_embeddings.py we implement an extractive stage based on sentence embeddings (in progress).
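A minimal sketch of a TF-IDF-style extractive stage is shown below, assuming sentences are ranked by cosine similarity to the title. It is an illustration with scikit-learn, not the code in sparse_models.py:

# Sketch: rank every input sentence by TF-IDF cosine similarity to the title
# and keep the top N. Illustrative only; see src/extractive_stage/sparse_models.py
# for the actual implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_extract(title, documents, n=10):
    # Naive sentence split, for illustration only.
    sentences = [s.strip() for doc in documents for s in doc.split('.') if s.strip()]
    matrix = TfidfVectorizer().fit_transform([title] + sentences)
    # Row 0 is the title; compare it against every sentence row.
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sorted(zip(scores, sentences), reverse=True)
    return [s for _, s in ranked[:n]]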

abstractive_stage:

We compare two Transformer encoder-decoders, fine-tuned on the BrWac2Wiki dataset for multi-document abstractive summarization: PTT5 (CARMO, Diedre et al., 2020) and Longformer (BELTAGY, Iz; PETERS, Matthew E.; COHAN, Arman, 2020).

Our fine-tuned checkpoints for both models are available on Hugging Face.
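A minimal sketch of running the abstractive stage with the transformers library is shown below. The checkpoint id is a placeholder (use the actual hub name of the published checkpoint), and the way the title is concatenated with the extracted sentences is an assumption, not necessarily the exact input format used by app.py:

# Sketch: load a fine-tuned PTT5 checkpoint and generate a wiki-like lead.
# The checkpoint id and the input formatting below are assumptions.
from transformers import T5Tokenizer, T5ForConditionalGeneration

checkpoint = "plsum-base-ptt5"  # placeholder: replace with the Hugging Face hub id
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

title = "Universidade de São Paulo"
extracted_sentences = ["A Universidade de São Paulo é uma universidade pública ..."]
source = title + " </s> " + " ".join(extracted_sentences)  # assumed separator

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))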

search_tools:

Code for searching the web for content related to a title. In src/search_tools/get_web_urls.py we use the googlesearch lib to search for the title on Google. In src/search_tools/get_urls_text.py we apply html2text, nltk, and langdetect to scrape and filter Portuguese texts from the retrieved URLs.
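The flow of these two scripts is roughly the one sketched below (library calls simplified; the function name is illustrative and not part of the repository):

# Sketch: google a title, scrape the pages, keep only Portuguese text.
# Simplified illustration of src/search_tools/; not the actual scripts.
import requests
import html2text
from googlesearch import search
from langdetect import detect

def fetch_portuguese_texts(title, max_urls=5):
    texts = []
    for url in search(title):                    # google the title
        if len(texts) >= max_urls:
            break
        try:
            html = requests.get(url, timeout=10).text
            text = html2text.html2text(html)     # strip HTML markup
            if detect(text) == 'pt':             # keep only Portuguese pages
                texts.append(text)
        except Exception:
            continue                             # skip unreachable or undetectable pages
    return texts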

Usage of src/app.py:

Run summary inferences with:

python app.py -t '[TITLE_1]' ... '[TITLE_N]' -o [OUTPUT_FILE]

or

python app.py -f [FILE_NAME] -o [OUTPUT_FILE], where [FILE_NAME] is a file with one title per line.

The algorithm will google the list of titles, scrape texts from the retrieved URLs, and apply the PLSUM summarization framework to each title, printing the predicted summaries and storing them in [OUTPUT_FILE]. The default extractive stage is TF-IDF and the default abstractive stage is plsum-base-ptt5.
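For example (file names here are only illustrative):

echo 'Universidade de São Paulo' > titles.txt
python app.py -f titles.txt -o summaries.txt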

Results

We compared ROUGE scores for 7 different combinations of extractive and abstractive stages on unseen examples from BrWac2Wiki. TF-IDF + PTT5 with J = 512 (number of input tokens) had the highest ROUGE-L score.


Acknowledgements

This research was supported by Itaú Unibanco S.A., through the scholarship program Programa de Bolsas Itaú (PBI), and partially financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Finance Code 001, and CNPq (grant 310085/2020-9), Brazil. Any opinions, findings, and conclusions expressed in this manuscript are those of the authors and do not necessarily reflect the views, official policy, or position of Itaú Unibanco, CAPES, or CNPq.
