Abstract / Description of output
We established a working Extensible Markup Language (XML) compression benchmark based on text compression, and found that bzip2 compresses XML best, albeit more slowly than gzip. Our experiments verified that XMILL speeds up and improves compression using gzip and bounded-context PPM by up to 15%, but found that it worsens the compression for bzip2 and PPM. We describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called Encoded SAX (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on prediction by partial match (PPM) called multiplexed hierarchical modeling (MHM) that compresses up to 35% better than any existing method but is fairly slow.
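The text-compression baseline mentioned in the abstract can be illustrated with a minimal sketch that compares gzip and bzip2 output sizes on an XML document. This is not the authors' benchmark: the synthetic document and the helper names `make_sample_xml` and `compressed_sizes` are assumptions made for illustration, and the resulting sizes say nothing about the figures reported in the paper.

```python
import bz2
import gzip


def make_sample_xml(n_records: int = 2000) -> bytes:
    """Build a small synthetic XML document (hypothetical test data)."""
    rows = "".join(
        f'<record id="{i}"><name>item{i % 50}</name>'
        f"<price>{(i * 7) % 1000}</price></record>"
        for i in range(n_records)
    )
    return f'<?xml version="1.0"?><catalog>{rows}</catalog>'.encode("utf-8")


def compressed_sizes(xml_bytes: bytes) -> dict:
    """Return the raw size and the gzip and bzip2 compressed sizes in bytes."""
    return {
        "raw": len(xml_bytes),
        "gzip": len(gzip.compress(xml_bytes, compresslevel=9)),
        "bzip2": len(bz2.compress(xml_bytes, compresslevel=9)),
    }


sample = make_sample_xml()
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    # Report each size and its ratio to the uncompressed document.
    print(f"{name:>5}: {size:>8} bytes ({size / sizes['raw']:.3f} of raw)")
```

Running the same function over real XML files, and timing each call, would give a rough view of the speed versus effectiveness tradeoff the abstract describes.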
Original language | English |
---|---|
Title of host publication | Data Compression Conference, 2001. Proceedings. DCC 2001. |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 163-172 |
Number of pages | 10 |
ISBN (Print) | 0-7695-1031-0 |
DOIs | |
Publication status | Published - 2001 |
Keywords / Materials (for Non-textual outputs)
- adaptive codes
- data compression
- document image processing
- hypermedia markup languages
- multiplexing
- prediction theory
- PPM
- XMILL
- XML compression
- XML-conscious encoding
- adaptive encoding
- bounded-context PPM
- bzip2
- encoded SAX
- extensible markup language
- gzip
- multiplexed hierarchical PPM models
- multiplexed hierarchical modeling
- online binary encoding
- online encoding
- prediction by partial match
- text compression
- text compressors
- Computer industry
- Encoding
- Entropy
- HTML
- Markup languages
- SGML
- Software systems
- Testing
- Tree data structures
- XML