Injecting Information into Atomic Units of Text

Yannis Haralambous, Gábor Bella

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper presents a new approach to text processing based on textemes: atomic text units that generalise the concepts of character and glyph by merging them in a common data structure, together with an arbitrary number of user-defined properties. In the first part, we survey the notions of character and glyph and their relation to Natural Language Processing models, some visual text representation issues, and the strategies adopted by file formats (SVG, PDF, DVI) and software (Uniscribe, Pango). In the second part, we show applications of textemes to various text processing issues: ligatures, variant glyphs and other OpenType-related properties, hyphenation, color and other presentation attributes, Arabic form and morphology, CJK spacing, metadata, etc. Finally, we describe how the Omega typesetting system implements texteme processing as an example of a generalised approach to input character stream parsing, internal representation of text, and modular typographic transformations. In the data flow from input to output, whether in memory or through serializations in auxiliary data files, textemes progressively accumulate information that is used by Omega's paragraph builder engine and included in the output DVI file. We show how this additional information increases the efficiency of conversions to other file formats such as PDF or SVG. We conclude by presenting potential applications of texteme methods in document engineering.
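To make the idea in the abstract concrete, the following is a minimal sketch of what a texteme-like structure might look like: one record holding the underlying characters, a glyph identifier, and an open-ended property map, passed through small transformations that each add information. Everything here is an assumption made for illustration: the class name `Texteme`, its fields, the glyph name `f_i`, and the functions `parse`, `apply_color` and `form_fi_ligature` are hypothetical and are not taken from the paper or from Omega.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Texteme:
    """Hypothetical texteme: characters, a glyph, and user-defined properties."""
    characters: str                      # underlying character(s)
    glyph: Optional[str] = None          # glyph identifier, assigned by a later step
    properties: Dict[str, Any] = field(default_factory=dict)


def parse(text: str) -> List[Texteme]:
    """Turn an input character stream into textemes; only characters are known yet."""
    return [Texteme(characters=ch) for ch in text]


def apply_color(textemes: List[Texteme], color: str) -> List[Texteme]:
    """Attach a presentation attribute to every texteme."""
    for t in textemes:
        t.properties["color"] = color
    return textemes


def form_fi_ligature(textemes: List[Texteme]) -> List[Texteme]:
    """Merge each 'f' + 'i' pair into one texteme that uses a ligature glyph
    (the glyph name 'f_i' is an assumption) while keeping the original
    characters and the accumulated properties. Other textemes simply get a
    glyph named after their character."""
    out: List[Texteme] = []
    i = 0
    while i < len(textemes):
        cur = textemes[i]
        nxt = textemes[i + 1] if i + 1 < len(textemes) else None
        if cur.characters == "f" and nxt is not None and nxt.characters == "i":
            props = {**cur.properties, **nxt.properties, "ligature": True}
            out.append(Texteme(characters="fi", glyph="f_i", properties=props))
            i += 2
        else:
            if cur.glyph is None:
                cur.glyph = cur.characters
            out.append(cur)
            i += 1
    return out


if __name__ == "__main__":
    for t in form_fi_ligature(apply_color(parse("fin"), "red")):
        print(t)
```

In this sketch the ligature texteme still records that it stands for the characters "fi", which is the kind of retained information that could make later conversion or text extraction simpler.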
Original language: English
Title of host publication: Proceedings of the 2005 ACM Symposium on Document Engineering
Place of Publication: New York, NY, USA
Publisher: ACM
Pages: 134-142
Number of pages: 9
ISBN (Print): 1-59593-240-2
Publication status: Published - 2005

Publication series

Name: DocEng '05
Publisher: ACM
