This section records some design and implementation details.
[TOC]
# Architecture {#Architecture}
## SAX and DOM
The basic relationships of SAX and DOM is shown in the following UML diagram.

The core of the relationship is the `Handler` concept. From the SAX side, `Reader` parses a JSON from a stream and publish events to a `Handler`. `Writer` implements the `Handler` concept to handle the same set of events. From the DOM side, `Document` implements the `Handler` concept to build a DOM according to the events. `Value` supports a `Value::Accept(Handler&)` function, which traverses the DOM to publish events.
With this design, SAX is not dependent on DOM. Even `Reader` and `Writer` have no dependencies between them. This provides flexibility to chain event publisher and handlers. Besides, `Value` does not depends on SAX as well. So, in addition to stringify a DOM to JSON, user may also stringify it to a XML writer, or do anything else.
## Utility Classes
Both SAX and DOM APIs depends on 3 additional concepts: `Allocator`, `Encoding` and `Stream`. Their inheritance hierarchy is shown as below.

# Value {#Value}
`Value` (actually a typedef of `GenericValue<UTF8<>>`) is the core of DOM API. This section describes the design of it.
`Value` is a [variant type](http://en.wikipedia.org/wiki/Variant_type). In RapidJSON's context, an instance of `Value` can contain 1 of 6 JSON value types. This is possible by using `union`. Each `Value` contains two members: `union Data data_` and a`unsigned flags_`. The `flags_` indicates the JSON type, and also additional information.
* Zero padding for 32-bit number may be placed after or before the actual type, according to the endianness. This makes possible for interpreting a 32-bit integer as a 64-bit integer, without any conversion.
* An `Int` is always an `Int64`, but the converse is not always true.
## Flags {#Flags}
The 32-bit `flags_` contains both JSON type and other additional information. As shown in the above tables, each JSON type contains redundant `kXXXType` and `kXXXFlag`. This design is for optimizing the operation of testing bit-flags (`IsNumber()`) and obtaining a sequential number for each type (`GetType()`).
String has two optional flags. `kCopyFlag` means that the string owns a copy of the string. `kInlineStrFlag` means using [Short-String Optimization](#ShortString).
Number is a bit more complicated. For normal integer values, it can contains `kIntFlag`, `kUintFlag`, `kInt64Flag` and/or `kUint64Flag`, according to the range of the integer. For numbers with fraction, and integers larger than 64-bit range, they will be stored as `double` with `kDoubleFlag`.
## Short-String Optimization {#ShortString}
[Kosta](https://github.com/Kosta-Github) provided a very neat short-string optimization. The optimization idea is given as follow. Excluding the `flags_`, a `Value` has 12 or 16 bytes (32-bit or 64-bit) for storing actual data. Instead of storing a pointer to a string, it is possible to store short strings in these space internally. For encoding with 1-byte character type (e.g. `char`), it can store maximum 11 or 15 characters string inside the `Value` type.
A special technique is applied. Instead of storing the length of string directly, it stores (MaxChars - length). This make it possible to store 11 characters with trailing `\0`.
This optimization can reduce memory usage for copy-string. It can also improve cache-coherence thus improve runtime performance.
// \param pointer to the memory block. Null pointer is permitted.
static void Free(void *ptr);
};
~~~
Note that `Malloc()` and `Realloc()` are member functions but `Free()` is static member function.
## MemoryPoolAllocator {#MemoryPoolAllocator}
`MemoryPoolAllocator` is the default allocator for DOM. It allocate but do not free memory. This is suitable for building a DOM tree.
Internally, it allocates chunks of memory from the base allocator (by default `CrtAllocator`) and stores the chunks as a singly linked list. When user requests an allocation, it allocates memory from the following order:
1. User supplied buffer if it is available. (See [User Buffer section in DOM](doc/dom.md))
2. If user supplied buffer is full, use the current memory chunk.
3. If the current block is full, allocate a new block of memory.
# Parsing Optimization {#ParsingOptimization}
## Skip Whitespaces with SIMD {#SkipwhitespaceWithSIMD}
When parsing JSON from a stream, the parser need to skip 4 whitespace characters:
To accelerate this process, SIMD was applied to compare 16 characters with 4 white spaces for each iteration. Currently RapidJSON supports SSE2, SSE4.2 and ARM Neon instructions for this. And it is only activated for UTF-8 memory streams, including string stream or *in situ* parsing.
To enable this optimization, need to define `RAPIDJSON_SSE2`, `RAPIDJSON_SSE42` or `RAPIDJSON_NEON` before including `rapidjson.h`. Some compilers can detect the setting, as in `perftest.h`:
Note that, these are compile-time settings. Running the executable on a machine without such instruction set support will make it crash.
### Page boundary issue
In an early version of RapidJSON, [an issue](https://code.google.com/archive/p/rapidjson/issues/104) reported that the `SkipWhitespace_SIMD()` causes crash very rarely (around 1 in 500,000). After investigation, it is suspected that `_mm_loadu_si128()` accessed bytes after `'\0'`, and across a protected page boundary.
In [Intel® 64 and IA-32 Architectures Optimization Reference Manual
> To support algorithms requiring unaligned 128-bit SIMD memory accesses, memory buffer allocation by a caller function should consider adding some pad space so that a callee function can safely use the address pointer safely with unaligned 128-bit SIMD memory operations.
> The minimal padding size should be the width of the SIMD register that might be used in conjunction with unaligned SIMD memory access.
This is not feasible as RapidJSON should not enforce such requirement.
To fix this issue, currently the routine process bytes up to the next aligned address. After tha, use aligned read to perform SIMD processing. Also see [#85](https://github.com/Tencent/rapidjson/issues/85).
During optimization, it is found that some compilers cannot localize some member data access of streams into local variables or registers. Experimental results show that for some stream types, making a copy of the stream and used it in inner-loop can improve performance. For example, the actual (non-SIMD) implementation of `SkipWhitespace()` is implemented as:
Depending on the traits of stream, `StreamLocalCopy` will make (or not make) a copy of the stream object, use it locally and copy the states of stream back to the original stream.
## Parsing to Double {#ParsingDouble}
Parsing string into `double` is difficult. The standard library function `strtod()` can do the job but it is slow. By default, the parsers use normal precision setting. This has has maximum 3 [ULP](http://en.wikipedia.org/wiki/Unit_in_the_last_place) error and implemented in `internal::StrtodNormalPrecision()`.
When using `kParseFullPrecisionFlag`, the parsers calls `internal::StrtodFullPrecision()` instead, and this function actually implemented 3 versions of conversion methods.
The naive algorithm for integer-to-string conversion involves division per each decimal digit. We have implemented various implementations and evaluated them in [itoa-benchmark](https://github.com/miloyip/itoa-benchmark).
Although SSE2 version is the fastest but the difference is minor by comparing to the first running-up `branchlut`. And `branchlut` is pure C++ implementation so we adopt `branchlut` in RapidJSON.
## Double-to-String conversion {#dtoa}
Originally RapidJSON uses `snprintf(..., ..., "%g")` to achieve double-to-string conversion. This is not accurate as the default precision is 6. Later we also find that this is slow and there is an alternative.
) implemented a newer, fast algorithm called Grisu3 (Loitsch, Florian. "Printing floating-point numbers quickly and accurately with integers."ACM Sigplan Notices45.6 (2010): 233-243.).
However, since it is not header-only so that we implemented a header-only version of Grisu2. This algorithm guarantees that the result is always accurate. And in most of cases it produces the shortest (optimal) string representation.
The header-only conversion function has been evaluated in [dtoa-benchmark](https://github.com/miloyip/dtoa-benchmark).
# Parser {#Parser}
## Iterative Parser {#IterativeParser}
The iterative parser is a recursive descent LL(1) parser
implemented in a non-recursive manner.
### Grammar {#IterativeParserGrammar}
The grammar used for this parser is based on strict JSON syntax:
~~~~~~~~~~
S -> array | object
array -> [ values ]
object -> { members }
values -> non-empty-values | ε
non-empty-values -> value addition-values
addition-values -> ε | , non-empty-values
members -> non-empty-members | ε
non-empty-members -> member addition-members
addition-members -> ε | , non-empty-members
member -> STRING : value
value -> STRING | NUMBER | NULL | BOOLEAN | object | array
~~~~~~~~~~
Note that left factoring is applied to non-terminals `values` and `members`
to make the grammar be LL(1).
### Parsing Table {#IterativeParserParsingTable}
Based on the grammar, we can construct the FIRST and FOLLOW set.
| non-empty-values | value addition-values | value addition-values | | | | | value addition-values | value addition-values | value addition-values | value addition-values |