1. If the non-standard but common "UTF-8 lead bytes" (0xef 0xbb 0xbf)
begin the file or data stream, TinyXML will read it as UTF-8.
2. If the declaration tag is read, and it has an encoding="UTF-8", then
TinyXML will read it as UTF-8.
3. If the declaration tag is read, and it has no encoding specified, then
TinyXML will read it as UTF-8.
4. If the declaration tag is read, and it has an encoding="something
else", then TinyXML will read it as Legacy Mode. In legacy mode,
TinyXML will work as it did before. It's not clear what that mode does
exactly, but old content should keep working.
5. Until one of the above criteria is met, TinyXML runs in Legacy Mode.
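The five rules above can be sketched as a single decision function. This is only an illustration and not TinyXML's own code: the names Encoding and DetectEncoding are hypothetical, and the exact string match on "UTF-8" is a simplifying assumption.

```cpp
#include <string>

// Hypothetical names for illustration; these are not TinyXML types.
enum class Encoding { Legacy, Utf8 };

// Applies rules 1-5 to the leading bytes of the data and to the
// declaration, if one has been read.
//   hasDeclaration   - true once the <?xml ...?> declaration was read
//   declaredEncoding - value of its encoding attribute, empty if absent
Encoding DetectEncoding(const std::string& data,
                        bool hasDeclaration,
                        const std::string& declaredEncoding)
{
    // Rule 1: the UTF-8 lead bytes 0xef 0xbb 0xbf begin the data.
    if (data.size() >= 3 &&
        static_cast<unsigned char>(data[0]) == 0xEF &&
        static_cast<unsigned char>(data[1]) == 0xBB &&
        static_cast<unsigned char>(data[2]) == 0xBF) {
        return Encoding::Utf8;
    }
    if (hasDeclaration) {
        // Rule 3: a declaration with no encoding specified.
        if (declaredEncoding.empty()) return Encoding::Utf8;
        // Rule 2: encoding="UTF-8" (assumption: exact match only).
        if (declaredEncoding == "UTF-8") return Encoding::Utf8;
        // Rule 4: any other encoding value.
        return Encoding::Legacy;
    }
    // Rule 5: nothing matched yet, so stay in Legacy Mode.
    return Encoding::Legacy;
}
```

Called with the first bytes of a file and whatever the declaration said, the function returns Encoding::Utf8 for rules 1 to 3 and Encoding::Legacy otherwise.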
What happens if the encoding is incorrectly set or detected? TinyXML will
try to read and pass through text seen as improperly encoded. You may
get some strange results or mangled characters. You may want to force
TinyXML to the correct mode.

You may force TinyXML to Legacy Mode by using
LoadFile( TIXML_ENCODING_LEGACY ) or
LoadFile( filename, TIXML_ENCODING_LEGACY ). You may force it to use
legacy mode all the time by setting
TIXML_DEFAULT_ENCODING = TIXML_ENCODING_LEGACY. Likewise, you may
force it to TIXML_ENCODING_UTF8 with the same technique.

For English users, using English XML, UTF-8 is the same as low-ASCII.
You don't need to be aware of UTF-8 or change your code in any way.
You can think of UTF-8 as a "superset" of ASCII.
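To make the "superset" idea concrete, here is a minimal sketch of how a single Unicode code point turns into UTF-8 bytes. It is not taken from TinyXML; the function name EncodeUtf8 is hypothetical, and the sketch does not reject surrogates or values above U+10FFFF.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8.
// Simplification: no validation of surrogates or out-of-range values.
std::string EncodeUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte, identical to the ASCII value.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

An English letter such as 'A' comes out as the single byte it always was, while U+00E9 takes two bytes and U+4E2D takes three. The length varies per character, which is why UTF-8 is not a double byte format.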
UTF-8 is not a double byte format - but it is a standard encoding of
Unicode! TinyXML does not use or directly support wchar, TCHAR, or
Microsoft's _UNICODE at this time. It is common to see the term
"Unicode" improperly refer to UTF-16, a wide byte encoding of Unicode.
This is a source of confusion.

For "high-ascii" languages - everything not English, pretty much -
TinyXML can handle all languages, at the same time, as long as the XML
is encoded in UTF-8. That can be a little tricky, because older programs
and operating systems tend to use the "default" or "traditional" code
page. Many apps (and almost all modern ones) can output UTF-8, but older
or stubborn (or just broken) ones still output text in the default code
page.

For example, Japanese systems traditionally use SHIFT-JIS encoding.
Text encoded as SHIFT-JIS cannot be read by TinyXML. A good text editor
can import SHIFT-JIS and then save as UTF-8.

The Skew.org link does a great job covering the encoding issue.

The test file "utf8test.xml" is an XML file containing English, Spanish,
Russian, and Simplified Chinese. (Hopefully they are translated
correctly.) The file "utf8test.gif" is a screen capture of the XML file,
rendered in IE. Note that if you don't have the correct fonts
(Simplified Chinese or Russian) on your system, you won't see output
that matches the GIF file even if you can parse it correctly. Also note
that (at least on my Windows machine) console output is in a Western
code page, so Print() or printf() cannot correctly display the file.
This is not a bug in TinyXML - just an OS issue. No data is lost or
destroyed by TinyXML. The console just doesn't render UTF-8.
Entities
TinyXML recognizes the pre-defined "character entities", meaning special
characters. Namely:
@verbatim
	&amp;	&
	&lt;	<
	&gt;	>
	&quot;	"
	&apos;	'
@endverbatim
These are recognized when the XML document is read, and translated to
their UTF-8 equivalents. For instance, text with the XML of:
@verbatim
	Far &amp; Away
@endverbatim
will have the Value() of "Far & Away" when queried from the TiXmlText
object, and will be written back to the XML stream/file with the
ampersand escaped as &amp;. Older versions of TinyXML "preserved"
character entities, but the newer versions will translate them into
characters.

Additionally, any character can be specified by its Unicode code point:
the syntax "&#xA0;" or "&#160;" both refer to the non-breaking space
character.
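The translation described above can be sketched as a small stand-alone function. This is an illustration only, not TinyXML's parser: DecodeEntities and AppendUtf8 are hypothetical names, and the choice to pass unrecognized or malformed references through unchanged is an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Appends the UTF-8 encoding of a code point (up to U+10FFFF) to out.
static void AppendUtf8(unsigned long cp, std::string& out)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replaces the five pre-defined entities and numeric character
// references (&#NNN; and &#xHHH;) with their UTF-8 characters.
// Anything that is not recognized is copied through unchanged.
std::string DecodeEntities(const std::string& in)
{
    static const struct { const char* name; char value; } kPredefined[] = {
        { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };

    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool handled = false;
        if (in[i] == '&') {
            // Try the pre-defined entities first.
            for (const auto& e : kPredefined) {
                const std::string name(e.name);
                if (in.compare(i, name.size(), name) == 0) {
                    out += e.value;
                    i += name.size();
                    handled = true;
                    break;
                }
            }
            // Then try a numeric reference, decimal or hexadecimal.
            const std::size_t end = handled ? std::string::npos
                                            : in.find(';', i);
            if (end != std::string::npos && in.compare(i, 2, "&#") == 0) {
                const bool hex = in[i + 2] == 'x';
                const std::size_t start = i + (hex ? 3 : 2);
                const std::string digits = in.substr(start, end - start);
                char* stop = nullptr;
                const unsigned long cp =
                    std::strtoul(digits.c_str(), &stop, hex ? 16 : 10);
                if (!digits.empty() &&
                    std::isxdigit(static_cast<unsigned char>(digits[0])) &&
                    *stop == '\0' && cp != 0 && cp <= 0x10FFFF) {
                    AppendUtf8(cp, out);
                    i = end + 1;
                    handled = true;
                }
            }
        }
        if (!handled) {
            out += in[i++];  // ordinary character, or an '&' left as-is
        }
    }
    return out;
}
```

Feeding it the XML text of the example above gives back the string with a plain ampersand, and both spellings of the non-breaking space decode to the same two UTF-8 bytes.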
Printing
TinyXML can print output in several different ways that all have strengths
and limitations.

- Print( FILE* ). Output to a std-C stream, which includes all C files
  as well as stdout.
    - "Pretty prints", but you don't have control over printing options.
    - The output is streamed directly to the FILE object, so there is no
      memory overhead in the TinyXML code.
    - Used by Print() and SaveFile().

- operator<<. Output to a C++ stream.
    - Integrates with standard C++ iostreams.
    - Outputs in "network printing" mode without line breaks. Good for
      network transmission and moving XML between C++ objects, but hard
      for a human to read.

- TiXmlPrinter. Output to a std::string or memory buffer.
    - API is less concise.
    - Future printing options will be put here.
    - Printing may change slightly in future versions as it is refined
      and expanded.
Streams
With TIXML_USE_STL on, TinyXML supports C++ streams (operator <<, >>)
as well as C (FILE*) streams. There are some differences that you may
need to be aware of.

C style output:
    - based on FILE*
    - the Print() and SaveFile() methods

    Generates formatted output, with plenty of white space, intended to
    be as human-readable as possible. They are very fast, and tolerant
    of ill formed XML documents. For example, an XML document that
    contains 2 root elements and 2 declarations will still print.

C style input:
    - based on FILE*
    - the Parse() and LoadFile() methods

    A fast, tolerant read. Use whenever you don't need the C++ streams.

C++ style output:
    - based on std::ostream
    - operator<<

    Generates condensed output, intended for network transmission rather
    than readability. Depending on your system's implementation of the
    ostream class, these may be somewhat slower. (Or may not.) Not
    tolerant of ill formed XML: a document should contain exactly one
    root element. Additional root level elements will not be streamed
    out.

C++ style input:
    - based on std::istream
    - operator>>

    Reads XML from a stream, making it useful for network transmission.
    The tricky part is knowing when the XML document is complete, since
    there will almost certainly be other data in the stream. TinyXML
    will assume the XML data is complete after it reads the root
    element. Put another way, documents that are ill-constructed with
    more than one root element will not read correctly. Also note that
    operator>> is somewhat slower than Parse, due to both implementation
    of the STL and limitations of TinyXML.
White space
The world simply does not agree on whether white space should be
kept, or condensed. For example, pretend the '_' is a space, and look at
"Hello____world". HTML, and at least some XML parsers, will interpret
this as "Hello_world". They condense white space. Some XML parsers
do not, and will leave it as "Hello____world". (Remember to keep
pretending the _ is a space.) Others suggest that __Hello___world__
should become Hello___world.

It's an issue that hasn't been resolved to my satisfaction. TinyXML
supports the first 2 approaches. Call
TiXmlBase::SetCondenseWhiteSpace( bool ) to set the desired behavior.
The default is to condense white space.

If you change the default, you should call
TiXmlBase::SetCondenseWhiteSpace( bool ) before parsing any XML data,
and I don't recommend changing it after it has been set.
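The first two approaches can be sketched as one function with a flag. This is a stand-alone illustration, not TinyXML's parser: the name ApplyWhiteSpacePolicy is hypothetical, and the sketch only collapses runs of white space to a single space. How leading and trailing white space is treated is left out.

```cpp
#include <cctype>
#include <string>

// condense == true : every run of white space becomes one space.
// condense == false: the text is returned exactly as it came in.
std::string ApplyWhiteSpacePolicy(const std::string& text, bool condense)
{
    if (!condense) {
        return text;
    }
    std::string out;
    bool inRun = false;  // true while inside a run of white space
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            if (!inRun) {
                out += ' ';
            }
            inRun = true;
        } else {
            out += static_cast<char>(c);
            inRun = false;
        }
    }
    return out;
}
```

With condensing on, "Hello    world" comes back with a single space between the words. With it off, all four spaces survive.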
Handles
When browsing an XML document in a robust way, it is important to
check for null returns from method calls. An error safe implementation
can generate a lot of code like:
@verbatim
TiXmlElement* root = document.FirstChildElement( "Document" );
if ( root )
{
	TiXmlElement* element = root->FirstChildElement( "Element" );
	if ( element )
	{
		TiXmlElement* child = element->FirstChildElement( "Child" );
		if ( child )
		{
			TiXmlElement* child2 = child->NextSiblingElement( "Child" );
			if ( child2 )
			{
				// Finally do something useful.
			}
		}
	}
}
@endverbatim
Handles have been introduced to clean this up. Using the TiXmlHandle
class, the previous code reduces to:
@verbatim
TiXmlHandle docHandle( &document );
TiXmlElement* child2 = docHandle.FirstChild( "Document" ).FirstChild( "Element" ).Child( "Child", 1 ).ToElement();
if ( child2 )
{
	// do something useful
}
@endverbatim
Which is much easier to deal with. See TiXmlHandle for more information.
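The idea behind a handle is easy to show without TinyXML at all. The sketch below is not the TiXmlHandle implementation. Node and Handle are hypothetical stand-ins that wrap a pointer which may be null and let every lookup on a null pointer quietly return another null handle, so the chain needs only one check at the end.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal tree node, standing in for an XML element.
struct Node {
    std::string name;
    std::vector<Node> children;
};

// Wraps a pointer that may be null. Lookups never dereference null;
// they just hand back another empty handle.
class Handle {
public:
    explicit Handle(const Node* node) : node_(node) {}

    // First child with the given name, or an empty handle.
    Handle FirstChild(const std::string& name) const
    {
        return Child(name, 0);
    }

    // The index-th child (zero-based) with the given name, or an
    // empty handle if there is no such child or this handle is empty.
    Handle Child(const std::string& name, std::size_t index) const
    {
        if (node_) {
            for (const Node& c : node_->children) {
                if (c.name == name) {
                    if (index == 0) {
                        return Handle(&c);
                    }
                    --index;
                }
            }
        }
        return Handle(nullptr);
    }

    // The wrapped pointer; null if any step of the chain failed.
    const Node* Get() const { return node_; }

private:
    const Node* node_;
};
```

A chain such as Handle( &root ).FirstChild( "Document" ).FirstChild( "Element" ).Child( "Child", 1 ).Get() then yields either the node or null, whichever step went missing.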
Row and Column tracking
Being able to track nodes and attributes back to their origin location in
source files can be very important for some applications. Additionally,
knowing where parsing errors occurred in the original source can save a
lot of time.

TinyXML can track the row and column origin of all nodes and attributes
in a text file. The TiXmlBase::Row() and TiXmlBase::Column() methods
return the origin of the node in the source text. The tab size used to
compute columns can be configured with TiXmlDocument::SetTabSize().
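To see why the tab size matters for the column, here is a rough sketch of how a byte offset can be mapped to a row and column. It is not TinyXML's code: Position and Locate are hypothetical names, and the 1-based numbering, the tab-stop arithmetic, and counting one column per byte are all assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: both values are 1-based in this sketch.
struct Position {
    int row;
    int column;
};

// Walks the text up to (not including) the byte at offset and tracks
// where that byte sits. A newline starts a new row. A tab jumps to the
// next tab stop, where stops sit at columns 1, 1 + tabSize, and so on.
// Assumption: every other byte, including each byte of a multi-byte
// UTF-8 character, advances the column by one.
Position Locate(const std::string& text, std::size_t offset, int tabSize)
{
    Position pos = { 1, 1 };
    for (std::size_t i = 0; i < offset && i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\n') {
            ++pos.row;
            pos.column = 1;
        } else if (c == '\t' && tabSize > 0) {
            pos.column = ((pos.column - 1) / tabSize + 1) * tabSize + 1;
        } else {
            ++pos.column;
        }
    }
    return pos;
}
```

For an element indented by one tab on the second line, a tab size of 4 reports column 5 and a tab size of 8 reports column 9. That is the kind of difference the tab size setting is there to control.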
Using and Installing
To Compile and Run xmltest: A Linux Makefile and a Windows Visual
C++ .dsw file are provided. Simply compile and run. It will write the file
demotest.xml to your disk and generate output on the screen. It also
tests walking the DOM by printing out the number of nodes found using
different techniques. The Linux makefile is very generic and runs on many
systems - it is currently tested on mingw and MacOSX. You do not need
to run 'make depend'. The dependencies have been hard coded.