4. A Closer Look
In this section, we analyze some important
aspects of the processing model.
4.1 Memory Usage
The fact that the
processing model makes extensive use of 64-bit integers has strong
implications for its memory usage, which we summarize as follows:
- Avoid per-object memory overhead: Per-object allocation typically
incurs a small amount of memory overhead in many modern object-oriented,
VM-based languages. For JDK 1.4.2, our measurements show an 8-byte
overhead associated with every object allocation; for an array, that
overhead goes up to 16 bytes. A VTD record or an LC entry is immune to
Java's per-object overhead because it is an integer, not an object.
- Use arrays whenever possible: We feel that the biggest memory-saving
factor is that both VTD records and LC entries are constant in length and
can be stored in array-like memory chunks. For example, by allocating a
single large array for 4096 VTD records, one incurs the 16-byte per-array
overhead only once across all 4096 records, reducing the per-record
overhead to almost nothing. Furthermore, by taking advantage of the
spatial locality of associated records, one avoids the cost of explicitly
stitching objects together with pointers, as sketched below.
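To make the array-based storage concrete, here is a minimal Java sketch.
It is illustrative only: the field widths and packing order below are
invented for the example and do not reflect the actual VTD record layout.

    // Illustrative only: field widths/packing are NOT the real VTD layout.
    public class RecordBuffer {
        // 4096 fixed-length records in one array: the 16-byte array
        // overhead is paid once, not once per record.
        private final long[] records = new long[4096];
        private int count = 0;

        // Pack offset (32 bits), length (20 bits) and token type (4 bits)
        // into one 64-bit integer; bits 4-11 are left unused here.
        public void add(long offset, long length, long type) {
            records[count++] = ((offset & 0xFFFFFFFFL) << 32)
                             | ((length & 0xFFFFFL) << 12)
                             | (type & 0xFL);
        }

        public int offset(int i) { return (int) (records[i] >>> 32); }
        public int length(int i) { return (int) ((records[i] >>> 12) & 0xFFFFF); }
        public int type(int i)   { return (int) (records[i] & 0xF); }
    }

Record i is found by plain index arithmetic against the array, so neither
a per-record object header nor inter-record pointers are ever paid for.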
Our benchmark results
indicate that the total size of the internal representation is usually one
third to one fifth that of a DOM tree. For document-oriented XML, the
multiplying factor is usually around 1.3, i.e., the internal representation
is 30% larger than the document itself (assuming a single-byte character
encoding). For data-oriented XML, the multiplying factor is between 1.5 and
2. Compared with DOM, the memory reduction is most significant when the XML
is "taggy" and DOM's memory consumption is near the high end of its 5x to
10x estimate.
4.2 Inherent Persistence
For applications that
demand frequent read-only access to XML documents, the processing model
provides the option of saving the internal representation to disk after an
XML document is processed for the first time. Afterwards, those
applications only need to load the "pre-parsed" form of the XML back into
memory. In other words, the in-memory representation of the processing
model is inherently persistent. This again is due to the fact that both VTD
and LC use 64-bit integers, not objects, to represent tokens and hierarchy.
When processing performance is critical, one can potentially create the
XML and the VTD at the same time, and then package them into a single file
before sending it across the network. The downside of this approach is
that it adds to the network transmission cost, which may not be suitable
for bandwidth-constrained environments. The upside is that XML-aware
network appliances sitting on the data path can use the binary portion of
the data to process the XML directly (without parsing it again). In
addition, the "pre-parsed" form contains the XML text, and therefore loses
none of the inherent benefits of XML, e.g. extensibility and human
readability.
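As an illustration of the "pre-parsed" file idea, here is a minimal Java
sketch. It is hypothetical: the file layout (XML text followed by a raw
dump of the records) and the class and method names are invented for this
example and are not VTD-XML's actual index format.

    import java.io.*;

    // Hypothetical pre-parsed file:
    // [xml length][xml bytes][record count][records].
    public class PreParsedFile {

        // Save the document text together with its 64-bit records.
        static void save(File f, byte[] xml, long[] vtd) throws IOException {
            try (DataOutputStream out =
                     new DataOutputStream(new FileOutputStream(f))) {
                out.writeInt(xml.length);   // length of the XML text
                out.write(xml);             // the XML text itself
                out.writeInt(vtd.length);   // number of records
                for (long rec : vtd)        // fixed-length records:
                    out.writeLong(rec);     // a straight dump, no encoding
            }
        }

        // Load it back: no re-parsing, no per-token object allocation.
        static long[] load(File f, ByteArrayOutputStream xmlOut)
                throws IOException {
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream(f))) {
                byte[] xml = new byte[in.readInt()];
                in.readFully(xml);
                xmlOut.write(xml);          // hand the XML text back
                long[] vtd = new long[in.readInt()];
                for (int i = 0; i < vtd.length; i++)
                    vtd[i] = in.readLong();
                return vtd;
            }
        }
    }

Because the records are integers, saving them is a straight memory dump,
and loading them back constructs no objects beyond the two arrays;
contrast this with serializing a DOM tree, where every node would have to
be re-allocated on load.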
4.3 XML on a Chip
Recently, "XML on a chip"
has gathered much attention as many companies are racing to deliver
ASIC-based XML processing solutions. A big reason is that the performance
of XML parsing is often cited as a major bottleneck for many
performance-sensitive applications. Matthias et al. [5] identify XML
parsing performance to be a key obstacle to a successful XML deployment,
and recommend hardware-assisted parsing engine to be a potential research
topic. This is hardly surprising, considering the fact that hardware
implementations of processing algorithms free CPU from performing
processor-intensive computations and sometimes provide the only ways to
guarantee QoS. However, one must take into account design constraints that
are specific to XML and custom hardware when thinking of hardware-based
XML parsing. In that context, our processing model possesses a few
attributes critical to the success of implementing "XML on a chip," which
we summarize below:
- Inherent persistence: For the custom hardware to co-process XML and
still interface with the application logic, the system most certainly has
to transport "something" from the hardware (e.g. a PCI card) into the main
address space in which the programs reside. To make the hardware work
seamlessly with the application logic, one of the fundamental assumptions
of distributed computing applies: the "something" must be persistent,
because the PCI bus, though it carries a different name, is physically no
different from a set of wires like Ethernet, and is therefore subject to
the same set of design constraints. Notice that VTD and LCs satisfy this
constraint, as both are inherently persistent. Implementing DOM on a chip,
on the other hand, is prone to failure because a DOM tree is not
persistent.
- Constant record/entry length: Alternatively, the hardware could produce
some serialized form of a DOM tree or of SAX events. The problem with this
approach is that, to rebuild the DOM tree or regenerate the SAX events,
the software application at the receiving end has no choice but to go
through the most expensive part of DOM/SAX parsing, i.e., allocating and
garbage-collecting objects, all over again, essentially defeating the
purpose of hardware-based parsing. The culprit is the variable length of
"extractive" tokens. In contrast, because VTD records and LC entries are
constant in length and addressable using integers, the persistence of the
internal representation is post-binding: the software application at the
receiving end does not incur the cost of allocating a large number of
objects, and thus reaps the maximum benefit of hardware-based parsing.
- Minimum buffering: To obtain maximum performance, a good design should
strive to be as stateless as possible. Our processing model, when
implemented on a chip, features a flow-through, process-then-discard
architecture quite similar to that of DES (Data Encryption Standard) [11],
an algorithm well known for its high-performance hardware implementations.
Like a DES chip, a VTD chip scans through the entire document one page at
a time to record all the relevant information, and no character is
accessed more than once. In addition, to avoid the propagation delays
associated with accessing off-chip memory elements, the design should keep
as many variables/states as possible on chip, in fixed-size register
files. In that regard, offset/length-based tokenization plays right into
the strengths of custom hardware; a software analogue of this flow-through
design is sketched after this list.
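To make the flow-through, process-then-discard idea concrete in software
terms, here is a toy Java sketch, an illustration only and not the actual
VTD algorithm: a single forward pass over the document buffer emits
fixed-length offset/length records into a long array, with no per-token
object allocation and no backtracking. Real XML tokenization (attributes,
comments, CDATA, entities) is deliberately omitted.

    // Toy single-pass scanner: records every start-tag name as a
    // fixed-length (offset, length) pair packed into a 64-bit integer.
    public class FlowThroughScanner {

        // Offset in the high 32 bits, length in the low 32 bits.
        static long record(int offset, int length) {
            return ((long) offset << 32) | (length & 0xFFFFFFFFL);
        }

        static long[] scan(byte[] doc) {
            long[] recs = new long[64];     // grown on demand below
            int n = 0;
            for (int i = 0; i < doc.length; i++) {
                if (doc[i] != '<' || i + 1 >= doc.length || doc[i + 1] == '/')
                    continue;               // this toy only handles start tags
                int start = ++i;            // the name begins after '<'
                while (i < doc.length && doc[i] != '>' && doc[i] != ' ')
                    i++;                    // forward only: no re-reads
                if (n == recs.length)
                    recs = java.util.Arrays.copyOf(recs, n * 2);
                recs[n++] = record(start, i - start);
            }
            return java.util.Arrays.copyOf(recs, n);
        }

        public static void main(String[] args) {
            byte[] doc = "<a><b>text</b></a>".getBytes();
            for (long r : scan(doc))
                System.out.println("offset=" + (r >>> 32) + " length=" + (int) r);
        }
    }

The consumer of the returned array locates any token by index arithmetic
against the original document buffer, which is exactly the post-binding
property described above.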
To sum up, the constant-length token representation is the most important
attribute of the processing model. We can also look at this from the
opposite direction: if tokens are not constant in length, then they are
not addressable using integers, and the processing model can no longer
bulk-allocate memory buffers to store tokens. Consequently, one probably
has no choice but to allocate many objects to represent tokens and
hierarchy, making per-object memory/processing overhead all but
inevitable. And in so doing, applications won't be able to reap
significant performance benefits from custom hardware.