osrm-backend/third_party/protozero/doc/tutorial.md

627 lines
20 KiB
Markdown

# Protozero Tutorial
## Getting to know Protocol Buffers
Protozero is a very low level library. You really have to know some of the
insides of Protocol Buffers to work with it!
So before reading any further in this document, read the following from the
Protocol Buffer documentation:
* [Developer Guide - Overview](https://developers.google.com/protocol-buffers/docs/overview)
* [Language Guide](https://developers.google.com/protocol-buffers/docs/proto)
* [Encoding](https://developers.google.com/protocol-buffers/docs/encoding)
Make sure you understand the basic types of values supported by Protocol
Buffers. Refer to this
[handy table](https://developers.google.com/protocol-buffers/docs/proto#scalar)
and [the cheat sheet](cheatsheet.md) if you are getting lost.
## Prerequisites
You need a C++11-capable compiler for Protozero to work. Copy the files in the
`include/protozero` directory somewhere where your build system can find them.
Keep the `protozero` directory and include the files in the form
```cpp
#include <protozero/FILENAME.hpp>
```
## Parsing protobuf-encoded messages
### Using `pbf_reader`
To use the `pbf_reader` class, add this include to your C++ program:
```cpp
#include <protozero/pbf_reader.hpp>
```
The `pbf_reader` class contains asserts that will detect some programming
errors. We encourage you to compile with asserts enabled in your debug builds.
### An introductory example
Lets say you have a protocol description in a `.proto` file like this:
```cpp
message Example1 {
required uint32 x = 1;
optional string s = 2;
repeated fixed64 r = 17;
}
```
To read messages created according to that description, you will have code that
looks somewhat like this:
```cpp
#include <protozero/pbf_reader.hpp>
// get data from somewhere into the input string
std::string input = get_input_data();
// initialize pbf message with this data
protozero::pbf_reader message{input};
// iterate over fields in the message
while (message.next()) {
// switch depending on the field tag (the field name is not available)
switch (message.tag()) {
case 1:
// get data for tag 1 (in this case an uint32)
auto x = message.get_uint32();
break;
case 2:
// get data for tag 2 (in this case a string)
std::string s = message.get_string();
break;
case 17:
// ignore data for tag 17
message.skip();
break;
default:
// ignore data for unknown tags to allow for future extensions
message.skip();
}
}
```
You always have to call `next()` and then either one of the accessor functions
(like `get_uint32()` or `get_string()`) to get the field value or `skip()` to
ignore this field. Then call `next()` again, and so forth. Never call `next()`
twice in a row or any if the accessor or skip functions twice in a row.
Because the `pbf_reader` class doesn't know the `.proto` file it doesn't know
which field names or tags there are and it doesn't known the types of the
fields. You have to make sure to call the right `get_...()` function for each
tag. Some `assert()s` are done to check you are calling the right functions,
but not all errors can be detected.
Note that it doesn't matter whether a field is defined as `required`,
`optional`, or `repeated`. You always have to be prepared to get zero, one, or
more instances of a field and you always have to be prepared to get other
fields, too, unless you want your program to break if somebody adds a new
field.
### If you only need a single field
If, out of a protocol buffer message, you only need the value of a single
field, you can use the version of the `next()` function with a parameter:
```cpp
// same .proto file and initialization as above
// get all fields with tag 17, skip all others
while (message.next(17)) {
auto r = message.get_fixed64();
std::cout << r << "\n";
}
```
### Handling scalar fields
As you saw in the example, handling scalar field types is reasonably easy. You
just check the `.proto` file for the type of a field and call the corresponding
function called `get_` + _field type_.
For `string` and `bytes` types the internal handling is exactly the same, but
both `get_string()` and `get_bytes()` are provided to make the code
self-documenting. Both theses calls allocate and return a `std::string` which
can add some overhead. You can call the `get_view()` function instead which
returns a `data_view` containing a pointer into the data (access with `data()`)
and the length of the data (access with `size()`).
### Handling repeated packed fields
Fields that are marked as `[packed=true]` in the `.proto` file are handled
somewhat differently. `get_packed_...()` functions returning an iterator range
are used to access the data.
So, for example, if you have a protocol description in a `.proto` file like
this:
```cpp
message Example2 {
repeated sint32 i = 1 [packed=true];
}
```
You can get to the data like this:
```cpp
protozero::pbf_reader message{input.data(), input.size()};
// set current field
message.next(1);
// get an iterator range
auto pi = message.get_packed_sint32();
// iterate to get to all values
for (auto it = pi.begin(); it != pi.end(); ++it) {
std::cout << *it << '\n';
}
```
Or, with a range-based for-loop:
```cpp
for (auto value : pi) {
std::cout << v << '\n';
}
```
So you are getting a pair of normal forward iterators wrapped in an iterator
range object. The iterators can be used with any STL algorithms etc.
Note that the previous only applies to repeated **packed** fields, normal
repeated fields are handled in the usual way for scalar fields.
### Handling embedded messages
Protocol Buffers can embed any message inside another message. To access an
embedded message use the `get_message()` function. So for this description:
```cpp
message Point {
required double x = 1;
required double y = 2;
}
message Example3 {
repeated Point point = 10;
}
```
you can parse with this code:
```cpp
protozero::pbf_reader message{input};
while (message.next(10)) {
protozero::pbf_reader point = message.get_message();
double x, y;
while (point.next()) {
switch (point.tag()) {
case 1:
x = point.get_double();
break;
case 2:
y = point.get_double();
break;
default:
point.skip();
}
}
std::cout << "x=" << x << " y=" << y << "\n";
}
```
### Handling enums
Enums are stored as varints and they can't be differentiated from them. Use
the `get_enum()` function to get the value of the enum, you have to translate
this into the symbolic name yourself. See the `enum` test case for an example.
### Asserts and exceptions in the Protozero library
Protozero uses `assert()` liberally to help you find bugs in your own code when
compiled in debug mode (ie with `NDEBUG` not set). If such an assert "fires",
this is a very strong indication that there is a bug in your code somewhere.
(Protozero will disable those asserts and "convert" them into exception in its
own test code. This is done to make sure the asserts actually work as intended.
Your test code will not need this!)
Exceptions, on the other hand, are thrown by Protozero if some kind of data
corruption was detected while it is trying to parse the data. This could also
be an indicator for a bug in the user code, but because it can happen if the
data was (intentionally or not intentionally) been messed with, it is reported
to the user code using exceptions.
Most of the functions on the writer side can throw a `std::bad_alloc`
exception if there is no space to grow a buffer. Other than that no exceptions
can occur on the writer side.
All exceptions thrown by the reader side derive from `protozero::exception`.
Note that all exceptions can also happen if you are expecting a data field of
a certain type in your code but the field actually has a different type. In
that case the `pbf_reader` class might interpret the bytes in the buffer in
the wrong way and anything can happen.
#### `end_of_buffer_exception`
This will be thrown whenever any of the functions "runs out of input data".
It means you either have an incomplete message in your input or some other
data corruption has taken place.
#### `unknown_pbf_wire_type_exception`
This will be thrown if an unsupported wire type is encountered. Either your
input data is corrupted or it was written with an unsupported version of a
Protocol Buffers implementation.
#### `varint_too_long_exception`
This exception indicates an illegal encoding of a varint. It means your input
data is corrupted in some way.
#### `invalid_tag_exception`
This exception is thrown when a tag has an invalid value. Tags must be
unsigned integers between 1 and 2^29-1. Tags between 19000 and 19999 are not
allowed. See
https://developers.google.com/protocol-buffers/docs/proto#assigning-tags
#### `invalid_length_exception`
This exception is thrown when a length field of a packed repeated field is
invalid. For fixed size types the length must be a multiple of the size of
the type.
### The `pbf_reader` class
The `pbf_reader` class behaves like a value type. Objects are reasonably small
(two pointers and two `uint32_t`, so 24 bytes on a 64bit system) and they can
be copied and moved around trivially.
`pbf_reader` objects can be constructed from a `std::string` or a `const char*`
and a length field (either supplied as separate arguments or as a `std::pair`).
In all cases objects of the `pbf_reader` class store a pointer into the input
data that was given to the constructor. You have to make sure this pointer
stays valid for the duration of the objects lifetime.
## Parsing protobuf-encoded messages using `pbf_message`
One problem in the code above are the "magic numbers" used as tags for the
different fields that you got from the `.proto` file. Instead of spreading
these magic numbers around your code you can define them once in an `enum
class` and then use the `pbf_message` template class instead of the
`pbf_reader` class.
Here is the first example again, this time using this new technique. So you
have the following in a `.proto` file:
```cpp
message Example1 {
required uint32 x = 1;
optional string s = 2;
repeated fixed64 r = 17;
}
```
Add the following declaration in one of your header files:
```cpp
enum class Example1 : protozero::pbf_tag_type {
required_uint32_x = 1,
optional_string_s = 2,
repeated_fixed64_r = 17
};
```
The message name becomes the name of the `enum class` which is always built
on top of the `protozero::pbf_tag_type` type. Each field in the message
becomes one value of the enum. In this case the name is created from the
type (including the modifiers like `required` or `optional`) and the name of
the field. You can use any name you want, but this convention makes it easier
later, to get everything right.
To read messages created according to that description, you will have code that
looks somewhat like this, this time using `pbf_message` instead of
`pbf_reader`:
```cpp
#include <protozero/pbf_message.hpp>
// get data from somewhere into the input string
std::string input = get_input_data();
// initialize pbf message with this data
protozero::pbf_message<Example1> message{input};
// iterate over fields in the message
while (message.next()) {
// switch depending on the field tag (the field name is not available)
switch (message.tag()) {
case Example1::required_uint32_x:
auto x = message.get_uint32();
break;
case Example1::optional_string_s:
std::string s = message.get_string();
break;
case Example1::repeated_fixed64_r:
message.skip();
break;
default:
// ignore data for unknown tags to allow for future extensions
message.skip();
}
}
```
Note the correspondance between the enum value (for instance
`required_uint32_x`) and the name of the getter function (for instance
`get_uint32()`). This makes it easier to get the correct types. Also the
naming makes it easier to keep different message types apart if you have
multiple (or embedded) messages.
See the `test/t/complex` test case for a complete example using this interface.
Using `pbf_message` in favour of `pbf_reader` is recommended for all code.
Note that `pbf_message` derives from `pbf_reader`, so you can always fall
back to the more generic interface if necessary.
One problem you might run into is the following: The enum class lists all
possible values you know about and you'll have lots of `switch` statements
checking those values. Some compilers will know that your `switch` covers
all possible cases and warn you if you have a `default` case that looks
unneccessary to the compiler. But you still want that `default` case to allow
for future extension of those messages (and maybe also to detect corrupted
data). You can switch of this warning with `-Wno-covered-switch-default`).
## Writing protobuf-encoded messages
### Using `pbf_writer`
To use the `pbf_writer` class, add this include to your C++ program:
```cpp
#include <protozero/pbf_writer.hpp>
```
The `pbf_writer` class contains asserts that will detect some programming
errors. We encourage you to compile with asserts enabled in your debug builds.
### An introductory example
Lets say you have a protocol description in a `.proto` file like this:
```cpp
message Example {
required uint32 x = 1;
optional string s = 2;
repeated fixed64 r = 17;
}
```
To write messages created according to that description, you will have code
that looks somewhat like this:
```cpp
#include <protozero/pbf_writer.hpp>
std::string data;
protozero::pbf_writer pbf_example{data};
pbf_example.add_uint32(1, 27); // uint32_t x
pbf_example.add_fixed64(17, 1); // fixed64 r
pbf_example.add_fixed64(17, 2);
pbf_example.add_fixed64(17, 3);
pbf_example.add_string(2, "foobar"); // string s
```
First you need a string which will be used as buffer to assemble the
protobuf-formatted message. The `pbf_writer` object contains a reference to
this string buffer and through it you add data to that buffer piece by piece.
The buffer doesn't have to be empty, the `pbf_writer` will simply append its
data to whatever is there already.
### Handling scalar fields
As you could see in the introductory example handling any kind of scalar field
is easy. The type of field doesn't matter and it doesn't matter whether it is
optional, required or repeated. You always call one of the `add_TYPE()` method
on the pbf writer object.
The first parameter of these methods is always the *tag* of the field (the
field number) from the `.proto` file. The second parameter is the value you
want to set. For the `bytes` and `string` types several versions of the add
method are available taking a `const std::string&` or a `const char*` and a
length.
For `enum` types you have to use the numeric value as the symbolic names from
the `.proto` file are not available.
### Handling repeated packed fields
Repeated packed fields can easily be set from a pair of iterators:
```cpp
std::string data;
protozero::pbf_writer pw{data};
std::vector<int> v = { 1, 4, 9, 16, 25, 36 };
pw.add_packed_int32(1, std::begin(v), std::end(v));
```
If you don't have an iterator you can use the alternative form:
```cpp
std::string data;
protozero::pbf_writer pw{data};
{
protozero::packed_field_int32 field{pw, 1};
field.add_element(1);
field.add_element(10);
field.add_element(100);
}
```
Of course you can add as many elements as you want. If you add no elements
at all, this code will still work, Protozero detects this special case and
pretends you never even initialized this field.
The nested scope is important in this case, because the destructor of the
`field` object will make sure the length stored inside the field is set to
the right value. You must close that scope before adding other fields to the
`pw` pbf writer.
If you know how many elements you will add to the field and your field contains
fixed length elements, you can tell Protozero and it can optimize this case:
```cpp
std::string data;
protozero::pbf_writer pw{data};
{
protozero::packed_field_fixed32 field{pw, 1, 2}; // exactly two elements
field.add_element(42);
field.add_element(13);
}
```
In this case you have to supply exactly as many elements as you promised,
otherwise you will get a broken protobuf message.
This works for `packed_field_fixed32`, `packed_field_sfixed32`,
`packed_field_fixed64`, `packed_field_sfixed64`, `packed_field_float`, and
`packed_field_double`.
You can abandon writing of the packed field if this becomes necessary by
calling `rollback()`:
```cpp
std::string data;
protozero::pbf_writer pw{data};
{
protozero::packed_field_int32 field{pw, 1};
field.add_element(42);
// some error occurs, you don't want to have this field at all
field.rollback();
}
```
The result is the same as if the lines inside the nested brackets had never
been called. Do not try to call `add_element()` after a rollback.
### Handling sub-messages
Nested sub-messages can be handled by first creating the submessage and then
adding to the parent message:
```cpp
std::string buffer_sub;
protozero::pbf_writer pbf_sub{buffer_sub};
// add fields to sub-message
pbf_sub.add_...(...);
// ...
// sub-message is finished here
std::string buffer_parent;
protozero::pbf_writer pbf_parent{buffer_parent};
pbf_parent.add_message(1, buffer_sub);
```
This is easy to do but it has the drawback of needing a separate `std::string`
buffer. If this concerns you (and why would you use protozero and not the
Google protobuf library if it doesn't?) there is another way:
```cpp
std::string data;
protozero::pbf_writer pbf_parent{data};
// optionally add fields to parent here
pbf_parent.add_...(...);
// open a new scope
{
// create new pbf_writer with parent and the tag (field number)
// as parameters
protozero::pbf_writer pbf_sub{pbf_parent, 1};
// add fields to sub here...
pbf_sub.add_...(...);
} // closing the scope will close the sub-message
// optionally add more fields to parent here
pbf_parent.add_...(...);
```
This can be nested arbitrarily deep.
Internally the sub-message writer re-uses the buffer from the parent. It
reserves enough space in the buffer to later write the length of the submessage
into it. It then adds the contents of the submessage to the buffer. When the
`pbf_sub` writer is destructed the length of the submessage is calculated and
written in the reserved space. If less space was needed for the length field
than was available, the rest of the buffer is moved over a few bytes.
You can abandon writing of submessage if this becomes necessary by
calling `rollback()`:
```cpp
std::string data;
protozero::pbf_writer pbf_parent{data};
// open a new scope
{
// create new pbf_writer with parent and the tag (field number)
// as parameters
protozero::pbf_writer pbf_sub{pbf_parent, 1};
// add fields to sub here...
pbf_sub.add_...(...);
// some problem occurs and you want to abandon the submessage:
pbf_sub.rollback();
}
// optionally add more fields to parent here
pbf_parent.add_...(...);
```
The result is the same as if the lines inside the nested brackets had never
been called. Do not try to call any of the `add_*` functions on the submessage
after a rollback.
## Writing protobuf-encoded messages using `pbf_builder`
Just like the `pbf_message` template class wraps the `pbf_reader` class, there
is a `pbf_builder` template class wrapping the `pbf_writer` class. It is
instantiated using the same `enum class` described above and used exactly
like the `pbf_writer` class but using the values of the enum instead of bare
integers.
See the `test/t/complex` test case for a complete example using this interface.