
WikiBio (Wikipedia Biography Dataset)

This dataset gathers 728,321 biographies from English Wikipedia and is intended for evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox, both tokenized.


Citation Credit

Neural Text Generation from Structured Data with Application to the Biography Domain
Rémi Lebret, David Grangier and Michael Auli, EMNLP 2016
http://arxiv.org/abs/1603.07771

This publication provides further information about the data, and we kindly ask you to cite it when using the data. The data was extracted from the English Wikipedia dump (enwiki-20150901), relying on the articles referenced by WikiProject Biography.

@inproceedings{Lebret_EMNLP2016,
  author    = {Lebret, R. and Grangier, D. and Auli, M.},
  title     = {{Neural Text Generation from Structured Data with Application to the Biography Domain}},
  booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2016}
}

Dataset Description

For each article, we extracted the first paragraph (text) and the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP to preprocess the data: we split the text into sentences and tokenized both the text and the field values. The dataset was randomly split into three subsets: train (80%), valid (10%) and test (10%). We strongly recommend using the test set only for the final evaluation.

The data is organised in three subdirectories, one each for train, valid and test. Each directory contains 7 files named after the subset (SET below stands for train, valid or test); the ones described here are SET.sent (the tokenized sentences), SET.nb (the number of sentences per article) and SET.box (the infoboxes).

All of these files are aligned by line number, so line i in each file refers to article i; the exception is SET.sent, which holds one sentence per line and must first be grouped per article using the sentence counts in SET.nb. The infobox file SET.box follows this scheme: each line encodes one infobox; each infobox is a list of tab-separated tokens; and each token has the form fieldname_position:wordtype. A field that is empty or contains no readable tokens is indicated with fieldname: alone. (The Python sketch after the example below shows how to read both files.) For instance, the first box of the valid set starts with

type_1:pope name_1:michael  name_2:iii      name_3:of
name_4:alexandria title_1:56th    title_2:pope    title_3:of      title_4:alexandria
title_5:&       title_6:patriarch       title_7:of      title_8:the
title_9:see       title_10:of     title_11:st.    title_12:mark   image:

which indicates that the field "type" contains one token, "pope"; the field "name" contains 4 tokens, "michael iii of alexandria"; the field "title" contains 12 tokens, "56th pope of alexandria & patriarch of the see of st. mark"; and the field "image" is empty.
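
To make this concrete, below is a minimal Python sketch of how the aligned files could be loaded and a SET.box line parsed. The SET.* file layout comes from this README; the paths in the usage comments (e.g. valid/valid.sent) and the parsing code itself are illustrative assumptions, not the authors' reference implementation.

import re
from collections import defaultdict

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def split_sentences_per_article(sent_path, nb_path):
    # Group the flat sentence list (SET.sent) into one list per article,
    # using the per-article sentence counts in SET.nb.
    sents = read_lines(sent_path)
    counts = [int(n) for n in read_lines(nb_path)]
    articles, start = [], 0
    for n in counts:
        articles.append(sents[start:start + n])
        start += n
    return articles

def parse_box(line):
    # Parse one SET.box line into {field name: list of word types}.
    # Tokens look like fieldname_position:wordtype; an empty field is
    # encoded as "fieldname:" with no position and no word type.
    fields = defaultdict(list)
    for token in line.split("\t"):
        key, _, value = token.partition(":")
        m = re.match(r"(.+)_(\d+)$", key)
        if m and value:
            # Assumes positions appear in increasing order, as in the
            # example above, so appending preserves the field's word order.
            fields[m.group(1)].append(value)
        else:
            fields.setdefault(key, [])  # empty or unreadable field
    return dict(fields)

# Example usage on the valid set (paths assumed):
# articles = split_sentences_per_article("valid/valid.sent", "valid/valid.nb")
# boxes = [parse_box(l) for l in read_lines("valid/valid.box")]
# boxes[0]["name"]  ->  ['michael', 'iii', 'of', 'alexandria']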

Dataset Statistics

                               Mean   Q-5%   Q-95%
# tokens per sentence          26.1     13      46
# tokens per table             53.1     20     108
# table tokens per sentence     9.5      3      19
# fields per table             19.7      9      36
On average, the first sentence is about half as long as the table (26.1 vs. 53.1 tokens), and about a third of its tokens (9.5 of 26.1) also occur in the table.
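
As a rough illustration of this overlap statistic, the snippet below counts how many tokens of a sentence also occur among the word types of its table, reusing the helpers from the sketch above. The exact counting behind the numbers in the table (e.g. how duplicate tokens are treated) may differ; this is only a sketch.

def table_tokens_in_sentence(sentence_tokens, box_fields):
    # Count sentence tokens that also appear as word types in the infobox.
    table_vocab = {tok for toks in box_fields.values() for tok in toks}
    return sum(1 for tok in sentence_tokens if tok in table_vocab)

# first_sentence = articles[0][0].split()
# table_tokens_in_sentence(first_sentence, boxes[0])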

Published Results

Publication            Model                         Perplexity   BLEU   ROUGE   NIST
Lebret et al. (2016)   Template Kneser-Ney                 7.46   19.8    10.7   5.19
Lebret et al. (2016)   Table Neural Language Model         4.40   34.7    25.8   7.98
For neural models, we report the mean over five training runs with different initializations; the decoding beam width is 5.

Version Information

v1.0 (this version): Initial release.

License

License information is provided in License.txt.

Decompressing zip files

We split the archive into multiple files. To extract the dataset, run:

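# concatenate the split archive parts (in order) into a single zip, then extract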
cat wikipedia-biography-dataset.z?? > tmp.zip
unzip tmp.zip
rm tmp.zip