This tutorial provides a basic Java programmer's introduction to working with protocol buffers. By walking through creating a simple example application, it shows you how to define message formats in a .proto file, use the protocol buffer compiler, and use the Java protocol buffer API to write and read messages.
This isn't a comprehensive guide to using protocol buffers in Java. For more detailed reference information, see the Protocol Buffer Language Guide, the Java API Reference, the Java Generated Code Guide, and the Encoding Reference.
Why Use Protocol Buffers?
The example we're going to use is a very simple "address book" application that can read and write people's contact details to and from a file. Each person in the address book has a name, an ID, an email address, and a contact phone number.
How do you serialize and retrieve structured data like this? There are several ways to solve this problem.
Protocol buffers are a flexible, efficient, automated solution to exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports extending the format over time in such a way that the code can still read data encoded with the old format.
Where to Find the Example Code
The example code is included in the source code package, under the "examples" directory.
Defining Your Protocol Format
To create your address book application, you'll need to start with a .proto file. The definitions in a .proto file are simple: you add a message for each data structure you want to serialize, then specify a name and a type for each field in the message. Here is the .proto file that defines your messages, addressbook.proto.
syntax = "proto2";

package tutorial;

option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phones = 4;
}

message AddressBook {
  repeated Person people = 1;
}
As you can see, the syntax is similar to C++ or Java. Let's go through each part of the file and see what it does.
The .proto file starts with a package declaration, which helps to prevent naming conflicts between different projects. In Java, the package name is used as the Java package unless you have explicitly specified a java_package, as we have here. Even if you do provide a java_package, you should still define a normal package as well to avoid name collisions in the Protocol Buffers name space as well as in non-Java languages.
After the package declaration, you can see two options that are Java-specific: java_package and java_outer_classname. java_package specifies in what Java package name your generated classes should live. If you don't specify this explicitly, it simply matches the package name given by the package declaration, but these names usually aren't appropriate Java package names (since they usually don't start with a domain name). The java_outer_classname option defines the class name which should contain all of the classes in this file. If you don't give a java_outer_classname explicitly, it will be generated by converting the file name to camel case. For example, "my_proto.proto" would, by default, use "MyProto" as the outer class name.
Next, you have your message definitions. A message is just an aggregate containing a set of typed fields. Many standard simple data types are available as field types, including bool, int32, float, double, and string. You can also add further structure to your messages by using other message types as field types – in the above example the Person message contains PhoneNumber messages, while the AddressBook message contains Person messages. You can even define message types nested inside other messages – as you can see, the PhoneNumber type is defined inside Person. You can also define enum types if you want one of your fields to have one of a predefined list of values – here you want to specify that a phone number can be one of MOBILE, HOME, or WORK.
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional elements. Each element in a repeated field requires re-encoding the tag number, so repeated fields are particularly good candidates for this optimization.
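To see where the 1-15 boundary comes from, the following sketch computes how many bytes the key in front of a field value occupies. It assumes the standard wire format, in which the key is the varint encoding of (field number << 3) | wire type and each varint byte carries 7 bits of payload. The class and method names (TagSizeSketch, varintSize, keySize) are hypothetical and are not part of the protocol buffer API.

```java
// Hypothetical helper for illustration; not part of the protocol buffer library.
class TagSizeSketch {

    /** Number of bytes a varint needs to hold the given non-negative value. */
    static int varintSize(long value) {
        int size = 1;
        // Each varint byte carries 7 bits of payload.
        while ((value >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    /** Size of the key written before a field value: (fieldNumber << 3) | wireType. */
    static int keySize(int fieldNumber) {
        // The wire type only fills the low 3 bits, so it does not change the size.
        return varintSize((long) fieldNumber << 3);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 15, 16, 2047, 2048}) {
            System.out.println("field " + n + " -> " + keySize(n) + " byte(s)");
        }
        // field 1 -> 1 byte(s)
        // field 15 -> 1 byte(s)
        // field 16 -> 2 byte(s)
        // field 2047 -> 2 byte(s)
        // field 2048 -> 3 byte(s)
    }
}
```

Because 15 << 3 is 120 and still fits in 7 bits, tags up to 15 take a single byte, while 16 << 3 is 128 and spills into a second byte.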
Each field must be annotated with one of the following modifiers:
required: a value for the field must be provided, otherwise the message will be considered "uninitialized". Trying to build an uninitialized message will throw a RuntimeException. Parsing an uninitialized message will throw an IOException. Other than this, a required field behaves exactly like an optional field.
optional: the field may or may not be set. If an optional field value isn't set, a default value is used. For simple types, you can specify your own default value, as we've done for the phone number type in the example. Otherwise, a system default is used: zero for numeric types, the empty string for strings, false for bools. For embedded messages, the default value is always the "default instance" or "prototype" of the message, which has none of its fields set. Calling the accessor to get the value of an optional (or required) field which has not been explicitly set always returns that field's default value.
repeated: the field may be repeated any number of times (including zero). The order of the repeated values will be preserved in the protocol buffer. Think of repeated fields as dynamically sized arrays.
You'll find a complete guide to writing .proto files – including all the possible field types – in the Protocol Buffer Language Guide. Don't go looking for facilities similar to class inheritance, though – protocol buffers don't do that.
Compiling Your Protocol Buffers
Now that you have a .proto, the next thing you need to do is generate the classes you'll need to read and write AddressBook (and hence Person and PhoneNumber) messages. To do this, you need to run the protocol buffer compiler protoc on your .proto:
Run the compiler, specifying the source directory ($SRC_DIR), the destination directory for the generated code ($DST_DIR), and the path to your .proto. In this case, you...:
protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/addressbook.proto
Because you want Java classes, you use the --java_out option – similar options are provided for other supported languages.
This generates com/example/tutorial/AddressBookProtos.java in your specified destination directory.
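If you drive the compiler from a build script written in Java, the pieces above can be assembled as in the following sketch. It only builds the argument list for the protoc command shown earlier and derives the relative path of the generated file from the java_package and java_outer_classname options. The class name, method names, and the "src" and "gen" directories are hypothetical placeholders.

```java
import java.util.List;

// Hypothetical helper for illustration; directory names are placeholders.
class ProtocInvocationSketch {

    /** Builds the protoc command line for generating Java classes. */
    static List<String> protocCommand(String srcDir, String dstDir, String protoFile) {
        return List.of(
            "protoc",
            "-I=" + srcDir,            // where protoc looks for .proto files
            "--java_out=" + dstDir,    // where the generated Java code goes
            srcDir + "/" + protoFile); // the .proto file to compile
    }

    /** Relative path of the generated file for the given Java options. */
    static String generatedFilePath(String javaPackage, String outerClassName) {
        // Java packages map to directories, so dots become path separators.
        return javaPackage.replace('.', '/') + "/" + outerClassName + ".java";
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", protocCommand("src", "gen", "addressbook.proto")));
        // protoc -I=src --java_out=gen src/addressbook.proto
        System.out.println(generatedFilePath("com.example.tutorial", "AddressBookProtos"));
        // com/example/tutorial/AddressBookProtos.java
    }
}
```

To actually run the compiler, you could pass the list to new ProcessBuilder(command).inheritIO().start(), assuming protoc is installed and on your PATH.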
The Protocol Buffer API
Let's look at some of the generated code and see what classes and methods the compiler has created for you. If you look in AddressBookProtos.java, you can see that it defines a class called AddressBookProtos, nested within which is a class for each message you specified in addressbook.proto. Each class has its own Builder class that you use to create instances of that class. You can find out more about builders in the Builders vs. Messages section below.
Both messages and builders have auto-generated accessor methods for each field of the message; messages have only getters while builders have both getters and setters. Here are some of the accessors for the Person class (implementations omitted for brevity):
// required string name = 1;
public boolean hasName();
public String getName();

// required int32 id = 2;
public boolean hasId();
public int getId();

// optional string email = 3;
public boolean hasEmail();
public String getEmail();

// repeated .tutorial.Person.PhoneNumber phones = 4;
public List<PhoneNumber> getPhonesList();
public int getPhonesCount();
public PhoneNumber getPhones(int index);
Meanwhile, Person.Builder has the same getters plus setters:
// required string name = 1;
public boolean hasName();
public String getName();
public Builder setName(String value);
public Builder clearName();

// required int32 id = 2;
public boolean hasId();
public int getId();
public Builder setId(int value);
public Builder clearId();

// optional string email = 3;
public boolean hasEmail();
public String getEmail();
public Builder setEmail(String value);
public Builder clearEmail();

// repeated .tutorial.Person.PhoneNumber phones = 4;
public List<PhoneNumber> getPhonesList();
public int getPhonesCount();
public PhoneNumber getPhones(int index);
public Builder setPhones(int index, PhoneNumber value);
public Builder addPhones(PhoneNumber value);
public Builder addAllPhones(Iterable<PhoneNumber> value);
public Builder clearPhones();
As you can see, there are simple JavaBeans-style getters and setters for each field. There are also has getters for each singular field which return true if that field has been set. Finally, each field has a clear method that un-sets the field back to its empty state.
Repeated fields have some extra methods – a Count method (which is just shorthand for the list's size), getters and setters which get or set a specific element of the list by index, an add method which appends a new element to the list, and an addAll method which adds an entire container full of elements to the list.
Notice how these accessor methods use camel-case naming, even though the .proto file uses lowercase-with-underscores. This transformation is done automatically by the protocol buffer compiler so that the generated classes match standard Java style conventions. You should always use lowercase-with-underscores for field names in your .proto files; this ensures good naming practice in all the generated languages. See the style guide for more on good .proto style.
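The field names in addressbook.proto happen to be single words, so the naming transformation is easy to miss. The following sketch approximates it for a hypothetical field called home_address, which is not part of the tutorial's .proto file. The helper names are also hypothetical, and the compiler's actual rules may cover cases (such as digits in names) that this sketch does not model.

```java
// Hypothetical helper for illustration; approximates the compiler's naming rule.
class AccessorNameSketch {

    /** Converts a lowercase-with-underscores name such as "home_address" to "HomeAddress". */
    static String toUpperCamel(String fieldName) {
        StringBuilder out = new StringBuilder();
        for (String word : fieldName.split("_")) {
            if (word.isEmpty()) {
                continue; // skip empty pieces from doubled or leading underscores
            }
            out.append(Character.toUpperCase(word.charAt(0))).append(word.substring(1));
        }
        return out.toString();
    }

    /** Accessor name for a singular field, e.g. prefix "get" and field "email" give "getEmail". */
    static String accessorName(String prefix, String fieldName) {
        return prefix + toUpperCamel(fieldName);
    }

    public static void main(String[] args) {
        for (String prefix : new String[] {"has", "get", "set", "clear"}) {
            System.out.println(accessorName(prefix, "home_address"));
        }
        // hasHomeAddress
        // getHomeAddress
        // setHomeAddress
        // clearHomeAddress
    }
}
```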
For more information on exactly what members the protocol compiler generates for any particular field definition, see the Java generated code reference.
Enums and Nested Classes
The generated code includes a PhoneType Java 5 enum, nested within Person:
public static enum PhoneType {
  MOBILE(0, 0),
  HOME(1, 1),
  WORK(2, 2),
  ;
  ...
}
The nested type Person.PhoneNumber is generated, as you'd expect, as a nested class within Person.
Builders vs. Messages
The message classes generated by the protocol buffer compiler are all immutable. Once a message object is constructed, it cannot be modified, just like a Java String. To construct a message, you must first construct a builder, set any fields you want to set to your chosen values, then call the builder's build() method.
You may have noticed that each method of the builder which modifies the message returns another builder. The returned object is actually the same builder on which you called the method. It is returned for convenience so that you can string several setters together on a single line of code.
Here's an example of how you would create an instance of Person:
Person john =
    Person.newBuilder()
        .setId(1234)
        .setName("John Doe")
        .setEmail("jdoe@example.com")
        .addPhones(
            Person.PhoneNumber.newBuilder()
                .setNumber("555-4321")
                .setType(Person.PhoneType.HOME))
        .build();
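If the builder mechanics are unfamiliar, the following hand-written sketch shows the same pattern in plain Java, without the protocol buffer library. ContactSketch is a hypothetical class and is not the code protoc generates; it only illustrates the two properties described above: setters return the same builder so that calls can be chained, and a built message has no setters of its own.

```java
// Hand-written analogue of a generated message; NOT the code protoc emits.
final class ContactSketch {
    private final String name;
    private final int id;

    private ContactSketch(Builder builder) {
        this.name = builder.name;
        this.id = builder.id;
    }

    // The message exposes getters only, so it cannot change after build().
    String getName() { return name; }
    int getId() { return id; }

    static Builder newBuilder() { return new Builder(); }

    static final class Builder {
        private String name = "";
        private int id;

        // Each setter returns this builder, which is what makes chaining possible.
        Builder setName(String value) { this.name = value; return this; }
        Builder setId(int value) { this.id = value; return this; }

        // Copies the current builder state into a new immutable object.
        ContactSketch build() { return new ContactSketch(this); }
    }

    public static void main(String[] args) {
        Builder builder = ContactSketch.newBuilder();
        Builder returned = builder.setName("John Doe");
        System.out.println(returned == builder); // true

        ContactSketch first = builder.setId(1234).build();
        ContactSketch second = builder.setId(5678).build();
        System.out.println(first.getId());  // 1234, unaffected by the later setId call
        System.out.println(second.getId()); // 5678
    }
}
```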
Each message and builder class also contains a number of other methods that let you check or manipulate the entire message, including:
isInitialized(): checks if all the required fields have been set.
toString(): returns a human-readable representation of the message, particularly useful for debugging.
mergeFrom(Message other): (builder only) merges the contents of other into this message, overwriting singular scalar fields, merging composite fields, and concatenating repeated fields.
clear(): (builder only) clears all the fields back to the empty state.
These methods implement the Message and Message.Builder interfaces shared by all Java messages and builders. For more information, see the complete API documentation for Message.
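The merge rules listed above can be illustrated with a small hand-written sketch. It is a simplified stand-in, not the library's implementation: the hypothetical MergeSketch.Builder merges from another builder rather than from a Message, stores phone numbers as plain strings, and does not model the merging of composite (embedded message) fields.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for illustration; not the protocol buffer library's code.
class MergeSketch {

    static final class Builder {
        private String email; // null means "not set"
        private final List<String> phones = new ArrayList<>();

        Builder setEmail(String value) { email = value; return this; }
        Builder addPhones(String value) { phones.add(value); return this; }

        boolean hasEmail() { return email != null; }
        String getEmail() { return email == null ? "" : email; } // default: empty string
        List<String> getPhonesList() { return phones; }

        Builder mergeFrom(Builder other) {
            if (other.hasEmail()) {
                email = other.email;     // singular scalar: overwritten when set in other
            }
            phones.addAll(other.phones); // repeated: concatenated, order preserved
            return this;
        }

        Builder clear() {
            email = null;
            phones.clear();
            return this;
        }
    }

    public static void main(String[] args) {
        Builder target = new Builder().setEmail("old@example.com").addPhones("555-1111");
        Builder other = new Builder().setEmail("new@example.com").addPhones("555-2222");

        target.mergeFrom(other);
        System.out.println(target.getEmail());      // new@example.com
        System.out.println(target.getPhonesList()); // [555-1111, 555-2222]

        target.clear();
        System.out.println(target.hasEmail());      // false
    }
}
```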
Finally, each protocol buffer class has methods for writing and reading messages of your chosen type using the protocol buffer binary format. These include:
byte[] toByteArray(): serializes the message and returns a byte array containing its raw bytes.
static Person parseFrom(byte[] data): parses a message from the given byte array.
void writeTo(OutputStream output): serializes the message and writes it to an OutputStream.
static Person parseFrom(InputStream input): reads and parses a message from an InputStream.
These are just a couple of the options provided for parsing and serialization. Again, see the Message API reference for a complete list.
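To give a feel for what the binary format looks like, the following sketch hand-encodes the name and id fields of a Person using only the JDK. It assumes the standard wire format (a varint key of (field number << 3) | wire type, wire type 2 for length-delimited strings, wire type 0 for varint integers). WireFormatSketch and its methods are hypothetical; in a real application you would call the generated toByteArray() or writeTo() methods instead of encoding by hand.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled encoder for illustration; real code should use the generated classes.
class WireFormatSketch {

    /** Writes a value as a varint: 7 bits per byte, high bit set on all but the last byte. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Writes the key that precedes every field value. */
    static void writeKey(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    /** Encodes name (field 1, wire type 2) followed by id (field 2, wire type 0). */
    static byte[] encodePerson(String name, int id) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        writeKey(out, 1, 2);
        writeVarint(out, nameBytes.length); // length prefix of the string
        out.write(nameBytes, 0, nameBytes.length);
        writeKey(out, 2, 0);
        writeVarint(out, id);
        return out.toByteArray();
    }

    /** Formats bytes as space-separated lowercase hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodePerson("John Doe", 1234)));
        // 0a 08 4a 6f 68 6e 20 44 6f 65 10 d2 09
    }
}
```

The first byte, 0a, is the key for field 1 with wire type 2, and 08 is the length of "John Doe". The byte 10 is the key for field 2 with wire type 0, and d2 09 is 1234 as a varint.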