| Document | Description | Source |
| --- | --- | --- |
| EXI-Telecomp.xml, EXI-weblog.xml, EXI-Invoice.xml, EXI-Array.xml, EXI-factbook.xml, EXI-GeogCoord.xml | A variety of XML documents collected by the EXI Working Group. | Efficient XML Interchange Working Group |
| XMark1.xml, XMark2.xml, XMark3.xml | An auction database with deeply nested, regular elements. | XMark - An XML Benchmark Project |
| DCSD-Small.xml, DCSD-Normal.xml | Data-centric documents containing e-commerce catalog and transactional data. | XBench - A Family of Benchmarks for XML DBMSs |
| TCSD-Small.xml, TCSD-Normal.xml | Text-centric documents. | XBench - A Family of Benchmarks for XML DBMSs |
| EnWikiNews.xml, EnWikiQuote.xml, EnWikiSource.xml, EnWikiVersity.xml, EnWikiTionary.xml | A variety of XML documents representing backups of the Wikipedia sources. | Wikipedia |
| DBLP.xml | A database of bibliographic information on computer science journals and conference proceedings. | University of Washington Corpus |
| USHouse.xml | Legislative documents that provide information about the ongoing work of the U.S. House of Representatives. | University of Washington Corpus |
| SwissProt.xml | A protein sequence database. | University of Washington Corpus |
| NASA.xml | An astronomical database. | University of Washington Corpus |
| Shakespare.xml | A collection of marked-up Shakespeare plays in a single XML file. | University of Washington Corpus |
| Lineitem.xml | An XML representation of the transactional relational database benchmark (TPC-H). | University of Washington Corpus |
| Mondial.xml | Basic statistical information on the countries of the world. | University of Washington Corpus |
| Baseball.xml | Statistics for all players on each team that participated in the 1998 Major League Baseball season. | University of Washington Corpus |
| Treebank.xml | A large collection of parsed English sentences from the Wall Street Journal. | University of Washington Corpus |
| Random-R1.xml, Random-R2.xml, Random-R3.xml | Irregular, randomly generated XML documents with arbitrary depths, large numbers of unique tags, and no data values. | Java-Based Random XML Generator |

* Each archive file contains the Structural and Original copies of the test XML document.
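
As a rough illustration of how the two copies might relate, the sketch below derives a structure-only document from an original one. It rests on an assumption the corpus description does not spell out: that a "Structural" copy keeps the element and attribute skeleton while dropping character data and attribute values. The function name `structural_copy` is hypothetical and is not part of the corpus tooling.

```python
import xml.etree.ElementTree as ET


def structural_copy(xml_text: str) -> str:
    """Return a structure-only version of an XML document (assumed definition).

    Assumption: "structural" means tags and attribute names are kept,
    while text content and attribute values are removed.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop character data inside and after each element.
        elem.text = None
        elem.tail = None
        # Keep attribute names but blank their values.
        for name in list(elem.attrib):
            elem.attrib[name] = ""
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = '<item id="42"><name>Widget</name><price>9.99</price></item>'
    print(structural_copy(original))
```

Comparing the compressed size of the original with that of such a structure-only copy is one way to separate how well a compressor handles markup from how well it handles data values.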

Corpus Characteristics