Multi-host docker cluster using OVS/VxLAN on CentOS 7
Introduction
Community Enterprise OS (CentOS) 7, the community version of RHEL 7, ships with Docker 1.8.x, which does not support multi-host networking because it only supports Linux bridges. In other words, a container on host A can't talk to a container on host B. If you have a cluster of 5 nodes, you can't utilize them all.
In this article I'm going to show you how to set up a multi-host docker cluster where containers on different hosts can talk to each other without NATing (Network Address Translation). I'm going to use Open vSwitch with VxLAN tunnels (or GRE tunnels); both are Linux kernel technologies that are tried and tested (they have been used in production for years).
Open vSwitch (OVS) is an L2 switch just like a physical switch: it has ports into which you plug Ethernet devices, and it works at the MAC address / ARP level to connect them.
Docker 1.9.x, shipped in Fedora 23, does support this feature via something called overlay networking, but even in that case I still prefer the OVS method.
I don't like creating tunnels in user space (like CoreOS Flannel), especially in languages with automatic garbage collection. Remember, all inter-container communication goes through this tunnel, so it could easily become a bottleneck.
In this article I'll keep SELinux enabled and I'll use "firewalld".
My Setup - your mileage may vary!
I used some VM nodes, each with its own LAN IP. The docker bridge (called docker0 or br0) on each node should have its own non-overlapping subnet, and all those subnets should belong to one bigger subnet assigned to the virtual switch (called sw0). I used 172.x.y.z/16 for the bridges and 172.x.y.z/12 for the bigger switch subnet; one might prefer 10.x.y.z/16 and 10.x.y.z/8.
So we have:
- vm1:
- eth0: 192.168.122.81
- sw0: 172.30.0.1/12
- br0 / docker0: 172.18.0.1/16
- vm2:
- eth0: 192.168.122.82
- sw0: 172.30.0.2/12
- br0 / docker0: 172.19.0.1/16
- vm3:
- eth0: 192.168.122.83
- sw0: 172.30.0.3/12
- br0 / docker0: 172.20.0.1/16
- ... and so on
The address of each host on the virtual switch is 172.30.0.1/12, 172.30.0.2/12, 172.30.0.3/12, following the node number, and the /12 means it can route IPs in the range 172.16.0.0 - 172.31.255.255, as they all belong to the same network.
I made the docker bridge start at 172.18.y.z on node 1, then added 1 to the second octet for each subsequent node. It's a /16, which means it only routes IPs in the range 172.18.0.0 - 172.18.255.255 (in other words, packets with IPs outside that range do not get routed to this bridge but to the sw0 switch).
In a similar way, if you choose the docker bridge IP to be 10.1.0.1/16 for node 1 (and 10.2.0.1/16 for node 2, etc.), each bridge will cover a range like 10.1.0.0 - 10.1.255.255 for single-host communication, while 10.200.0.1/8, 10.200.0.2/8, 10.200.0.3/8 for sw0 on each node will cover the range 10.0.0.0 - 10.255.255.255 for communication across hosts.
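If you want to double-check which range a given prefix covers, the ipcalc tool that ships with CentOS 7 (part of the initscripts package) can print the network and broadcast boundaries; for example:

ipcalc -n -b 172.18.0.1/16   # NETWORK=172.18.0.0, BROADCAST=172.18.255.255 (one node's bridge)
ipcalc -n -b 172.30.0.1/12   # NETWORK=172.16.0.0, BROADCAST=172.31.255.255 (the whole cluster)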
Step By Step
Prerequisites
You need to run all of these steps on all nodes. Let's install some dependencies. We need to add some repos first: let's enable "Extra Packages for Enterprise Linux" (EPEL) and "RedHat Community Distribution of OpenStack" (RDO).
yum install -y epel-release https://www.rdoproject.org/repos/rdo-release.rpm
yum install -y firewalld docker openvswitch bridge-utils nc
yum update -y
systemctl start openvswitch firewalld
systemctl enable openvswitch firewalld
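Nothing here is specific to this setup, but a quick sanity check that everything came up doesn't hurt:

systemctl is-active openvswitch firewalld   # both should print "active"
ovs-vsctl show                              # should print an empty config with a UUID
docker --version                            # expect 1.8.x on stock CentOS 7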
Let's open the VxLAN ports (in production you should restrict this to a trusted network interface or source network, as shown below):
firewall-cmd --add-port=4789/udp --add-port=8472/udp
firewall-cmd --permanent --add-port=4789/udp --add-port=8472/udp
If you want GRE instead of VxLAN, note that GRE is IP protocol 47 rather than a UDP port, so you would open the GRE protocol instead (see the sketch below).
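A sketch of how that could look using a firewalld direct rule (this would replace the two UDP port commands above):

firewall-cmd --direct --add-rule ipv4 filter INPUT 0 -p gre -j ACCEPT
firewall-cmd --permanent --direct --add-rule ipv4 filter INPUT 0 -p gre -j ACCEPT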
And to restrict tunnel traffic to the trusted source network:

firewall-cmd --permanent --zone=trusted --add-source=172.16.0.0/12
Now let's customize the default docker bridge "docker0" on each node by editing /etc/sysconfig/docker-network so that it looks like this:
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS="--bip=172.18.0.1/16 --iptables=false --ip-masq=false --ip-forward=true"
Be sure to replace 172.18.0.1/16 with 172.19.0.1/16 on node 2, and so on for the other nodes.
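Once docker is (re)started (we start and enable it in the verification section below), the bridge should pick up the new address; a quick check:

systemctl restart docker
ip addr show docker0   # expect "inet 172.18.0.1/16" on node 1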
Another option is to create a different bridge br0 using
brctl addbr br0
ip addr add 172.18.0.1/16 dev br0 && ip link set dev br0 up
To make that permanent, instead of running the commands above, just create the file ifcfg-br0:
# /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
IPADDR=172.18.0.1
NETMASK=255.255.0.0
ONBOOT=yes
BOOTPROTO=none
DELAY=0
and pass "-b=br0" instead of "--bip" to docker. I prefer the first method of just customize "docker0" via "--bip".
Verifying
Now let's verify that this works. Let's start docker, run a busybox container, and type "ip a":
systemctl start docker
systemctl enable docker
ip a # verify docker0 exists and has the needed IP address
docker run -ti busybox /bin/sh
# ip a
At this point, containers like the busybox above still can't talk to each other across different hosts. Let's proceed.
Adding Virtual Switch
Let's create a switch (a.k.a. bridge) on each node and connect it to the other nodes using VxLAN tunnels. Just type:
ovs-vsctl add-br sw0
ip addr add 172.30.0.1/12 dev sw0 && ip link set dev sw0 up
ovs-vsctl add-port sw0 vxlan0 -- set Interface vxlan0 type=vxlan options:remote_ip=192.168.122.82
ovs-vsctl add-port sw0 vxlan1 -- set Interface vxlan1 type=vxlan options:remote_ip=192.168.122.83
Here 172.30.0.1/12 is the sw0 address of node 1, and the vxlan ports tell it how to reach the other nodes (192.168.122.82 and 192.168.122.83). Note that each tunnel port needs a unique name (vxlan0, vxlan1, and so on).
On the second node:
ovs-vsctl add-br sw0
ip addr add 172.30.0.2/12 dev sw0 && ip link set dev sw0 up
ovs-vsctl add-port sw0 vxlan0 -- set Interface vxlan0 type=vxlan options:remote_ip=192.168.122.81
ovs-vsctl add-port sw0 vxlan1 -- set Interface vxlan1 type=vxlan options:remote_ip=192.168.122.83
and so on for the remaining nodes. You can use GRE instead of VxLAN.
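For GRE the commands have the same shape; only the interface type changes. A sketch for node 1:

ovs-vsctl add-port sw0 gre0 -- set Interface gre0 type=gre options:remote_ip=192.168.122.82
ovs-vsctl add-port sw0 gre1 -- set Interface gre1 type=gre options:remote_ip=192.168.122.83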
The good thing about ovs-vsctl is that its configuration (the bridge and its tunnel ports) persists after reboot, as it is stored in the OVS database.
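Note, however, that the IP address we put on sw0 with "ip addr" is not stored in the OVS database, so it will not survive a reboot by itself. One way to persist it is the network-scripts integration shipped with the openvswitch package (documented in its README.RHEL); a sketch for node 1:

# /etc/sysconfig/network-scripts/ifcfg-sw0
DEVICE=sw0
DEVICETYPE=ovs
TYPE=OVSBridge
ONBOOT=yes
BOOTPROTO=static
IPADDR=172.30.0.1
NETMASK=255.240.0.0
HOTPLUG=no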
Now on each node create a pair of virtual Ethernet (veth) interfaces, one end connected to the docker bridge "docker0" and the other end to the virtual switch "sw0". We shall call them veth_sw0 and veth_d0 for the switch end and the docker end respectively. This is done by typing:
ip link add veth_sw0 type veth peer name veth_d0
ovs-vsctl add-port sw0 veth_sw0
brctl addif docker0 veth_d0
ip link set dev veth_sw0 up
ip link set dev veth_d0 up

(If you went the br0 route instead, use "brctl addif br0 veth_d0".)
Verify that by typing "ovs-vsctl show" and "ip a"
# ovs-vsctl show
a3a48453-c88c-4df9-acb9-65496c65645c
Bridge "sw0"
Port "veth_sw0"
Interface "veth_sw0"
Port "vxlan0"
Interface "vxlan0"
type: vxlan
options: {remote_ip="192.168.122.82"}
Port "sw0"
Interface "sw0"
type: internal
Verifying cross-host connectivity
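Before testing containers, check that the hosts themselves can reach each other over the tunnel via their sw0 addresses; from node 1, for example:

ping -c 3 172.30.0.2   # node 2 over the tunnel
ping -c 3 172.30.0.3   # node 3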
Now on one host start a container:
docker run -ti busybox /bin/sh
# ip a
# nc -l -p 8080
and on the other host start a container and try to reach the first one (where OTHER_IP is the first container's address, as seen in its "ip a" output above):
docker run -ti busybox /bin/sh
# ip a
# echo test | nc $OTHER_IP 8080
Containers can also reach the hosts themselves using their sw0 addresses (like 172.30.0.1).
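For example, from inside any container:

# ping -c 3 172.30.0.1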
Talking to the outside world
Let's add a NAT rule so that containers can talk to the outside world. If a packet coming from a container (172.16.0.0/12) wants to reach the outside world (let's say google.com), in other words its destination does not belong to 172.16.0.0/12, then it needs NAT: it will go out through the host's 192.168.x.y address to the outside world.
firewall-cmd --direct --add-rule ipv4 nat POSTROUTING 0 -s 172.16.0.0/12 ! -d 172.16.0.0/12 -j MASQUERADE
firewall-cmd --permanent --direct --add-rule ipv4 nat POSTROUTING 0 -s 172.16.0.0/12 ! -d 172.16.0.0/12 -j MASQUERADE
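You can confirm the rule was recorded with:

firewall-cmd --direct --get-all-rules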
To test this:
docker run -ti busybox /bin/sh
# ping google.com