Generating PT-BR Wikipedia articles from reference documents found on the web.

This is the official repository for the paper "PLSUM: Generating PT-BR Wikipedia by Summarizing Websites", by André Seidel Oliveira¹ and Anna Helena Reali Costa¹, which is to be presented at ENIAC 2021. Our work is inspired by WikiSum (LIU, Peter J. et al., 2018), a similar work for the English language.

1 - Researchers at the Department of Computer Engineering and Digital Systems (PCS) of the University of São Paulo (USP)

The challenge: Generate Brazilian Wikipedia leads from multiple website texts!

PLSUM takes as input (1) a title and (2) a set of texts related to the title (both in Portuguese), and returns an original wiki-like summary about the title. PLSUM has two stages: the extractive stage filters the set of input documents and returns a limited number of sentences, while the abstractive stage generates an abstractive (authorial) summary given the title and the extracted sentences. The model was fine-tuned and tested on BrWac2Wiki, a dataset whose records associate a title, multiple documents from the web, and a Wikipedia lead (the first section of a Wikipedia article).
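The two-stage flow described above can be sketched as a small function that chains an extractive step and an abstractive step. This is only an illustration: `plsum_pipeline`, `first_sentences`, `join_summary`, and the `n_sentences` parameter are hypothetical names, and the toy stand-ins below are not the models from the paper.

```python
from typing import Callable, List


def plsum_pipeline(
    title: str,
    documents: List[str],
    extract: Callable[[str, List[str], int], List[str]],
    abstract: Callable[[str, List[str]], str],
    n_sentences: int = 20,  # assumed default, not taken from the paper
) -> str:
    """Select sentences from the documents, then write a summary from them."""
    selected = extract(title, documents, n_sentences)
    return abstract(title, selected)


# Toy stand-ins so the sketch runs end to end (NOT the real stages).
def first_sentences(title: str, documents: List[str], n: int) -> List[str]:
    sentences = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    return sentences[:n]


def join_summary(title: str, sentences: List[str]) -> str:
    return f"{title}: " + ". ".join(sentences) + "."


summary = plsum_pipeline(
    "São Paulo",
    ["São Paulo é uma cidade do Brasil. É a mais populosa do país."],
    first_sentences,
    join_summary,
    n_sentences=1,
)
print(summary)
```

Passing the stages as callables mirrors the way the repository lets different extractive and abstractive stages be combined.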


Below is a brief description of what you will find in the src/ and notebooks/ folders:


The extractive stage filters prominent sentences from the input documents. It returns a list of N sentences in order of importance, where N is a hyperparameter. In src/extractive_stage/ we implement TF-IDF, Random, and Cheating as described in the paper. In src/extractive_stage/ and src/extractive_stage/ we implement an extractive stage based on sentence embeddings (IN PROGRESS).
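As a rough illustration of a TF-IDF extractive stage, the sketch below scores each sentence by the TF-IDF weight of the title words it contains and returns the top N. It is an assumption about how such a ranking could work, not the repository's implementation: `tokenize`, `tfidf_rank`, the smoothed IDF formula, and the choice to treat each sentence as a document are all illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; \\w matches accented letters in Python 3."""
    return re.findall(r"\w+", text.lower())


def tfidf_rank(title: str, sentences: List[str], n: int) -> List[str]:
    """Return the n sentences with the highest TF-IDF score for the title words."""
    tokenized = [tokenize(s) for s in sentences]
    # Document frequency: in how many sentences each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    total = len(sentences)
    query = set(tokenize(title))

    def score(tokens: List[str]) -> float:
        if not tokens:
            return 0.0
        tf = Counter(tokens)
        # Smoothed IDF (assumed variant), always positive.
        return sum(
            (tf[w] / len(tokens)) * (math.log((1 + total) / (1 + df[w])) + 1)
            for w in query
            if w in tf
        )

    order = sorted(range(total), key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in order[:n]]


example_sentences = [
    "O gato dorme no sofá.",
    "Brasília é a capital do Brasil.",
    "A capital federal fica no Planalto Central.",
]
top_two = tfidf_rank("Brasília capital", example_sentences, 2)
print(top_two)
```

Here N corresponds to the `n` argument; sentences that share no word with the title receive a score of zero and fall to the end of the ranking.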


We compare two Transformer encoder-decoders, fine-tuned on the BrWac2Wiki dataset for multi-document abstractive summarization: PTT5 (CARMO, Diedre et al., 2020) and Longformer (BELTAGY, Iz; PETERS, Matthew E.; COHAN, Arman., 2020).

Our fine-tuned checkpoints for both models are available on Hugging Face.


Code for searching the web for content related to a title. In src/search_tools/ we use the googlesearch library to search for the title on Google. In src/search_tools/ we apply html2text, nltk, and langdetect to scrape and filter Portuguese texts from the retrieved URLs.
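The language-filtering step can be sketched as a function that keeps only the texts a detector labels as Portuguese. This is a hedged sketch, not the repository's code: `keep_portuguese`, the `min_chars` threshold, and `toy_detector` are hypothetical. The detector is passed in as a callable so that the example runs without third-party packages.

```python
from typing import Callable, Iterable, List


def keep_portuguese(
    texts: Iterable[str],
    detect_language: Callable[[str], str],
    min_chars: int = 20,  # assumed threshold; very short texts are hard to classify
) -> List[str]:
    """Keep texts whose detected language code is 'pt'."""
    kept = []
    for text in texts:
        text = text.strip()
        if len(text) < min_chars:
            continue
        try:
            if detect_language(text) == "pt":
                kept.append(text)
        except Exception:
            # Some detectors raise on text without usable features; skip it.
            continue
    return kept


# Toy detector for demonstration only (NOT a real language detector).
def toy_detector(text: str) -> str:
    return "pt" if " é " in f" {text.lower()} " else "en"


scraped = [
    "O Brasil é um país da América do Sul.",
    "Brazil is a country in South America.",
    "é",
]
portuguese_only = keep_portuguese(scraped, toy_detector)
print(portuguese_only)
```

With langdetect installed, its `detect` function, which returns codes such as `"pt"`, could be passed in place of `toy_detector`.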

Usage of src/

Run summary inferences with:

python -t '[TITLE_1]' ... '[TITLE_N]' -o [OUTPUT_FILE]


python -f [FILE_NAME] -o [OUTPUT_FILE], where [FILE_NAME] is a file with one title per line.

The script searches Google for each title, scrapes text from the retrieved URLs, and applies the PLSUM summarization framework to each title, printing the predicted summaries and storing them in [OUTPUT_FILE]. The default extractive stage is TF-IDF and the default abstractive stage is plsum-base-ptt5.
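For readers who want to see how the `-t`, `-f`, and `-o` options could be parsed, here is a minimal argparse sketch. Only the three short flags come from the usage lines above; the long option names, the mutual exclusivity of `-t` and `-f`, the required `-o`, and the helper names `build_parser` and `read_titles` are assumptions for illustration.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Summarize web content about one or more titles."
    )
    # Assumption: titles come either from the command line or from a file.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-t", "--titles", nargs="+", help="one or more titles")
    source.add_argument("-f", "--file", help="file with one title per line")
    parser.add_argument("-o", "--output", required=True, help="output file")
    return parser


def read_titles(args: argparse.Namespace) -> List[str]:
    """Return titles from -t, or from the file given with -f (blank lines skipped)."""
    if args.titles:
        return args.titles
    with open(args.file, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


cli_args = build_parser().parse_args(["-t", "Brasil", "São Paulo", "-o", "out.txt"])
print(read_titles(cli_args), cli_args.output)
```

Because `-t` uses `nargs="+"`, titles containing spaces need to be quoted on the command line, as in the usage lines above.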


We compared ROUGE scores for 7 different combinations of extractive and abstractive stages on unseen examples from BrWac2Wiki. TF-IDF + PTT5 with J = 512 (number of input tokens) had the highest ROUGE-L score.
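As a reminder of what ROUGE-L measures, the sketch below computes a sentence-level ROUGE-L F1 from the longest common subsequence (LCS) of whitespace tokens. It is a simplified illustration: the evaluation in the paper may use a different tokenizer, stemming, or F-measure weighting, and the names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Balanced F-measure of LCS precision and recall over lowercase tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("o gato dorme no sofá", "o gato está no sofá")
print(round(score, 3))
```

In this example the LCS is "o gato no sofá" (4 tokens out of 5 on each side), so precision and recall are both 0.8.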



This research was supported by Itaú Unibanco S.A., through the scholarship program Programa de Bolsas Itaú (PBI), and partially financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Finance Code 001, and CNPq (grant 310085/2020-9), Brazil. Any opinions, findings, and conclusions expressed in this manuscript are those of the authors and do not necessarily reflect the views, official policy, or position of Itaú Unibanco, CAPES, or CNPq.