Multi-host docker cluster using OVS/VxLAN on CentOS 7

Introduction



The community rebuild of RHEL 7, Community Enterprise OS (CentOS) 7, comes with Docker 1.8.x, which does not support multi-host networking because it only supports Linux bridges. In other words, a container on Host A can't talk to a container on Host B, so if you have a cluster of 5 nodes you can't utilize them all.

In this article I'm going to show you how to set up a multi-host Docker cluster where containers on different hosts can talk to each other without NAT (Network Address Translation). I'm going to use Open vSwitch with VxLAN tunnels (or GRE tunnels); both are Linux kernel technologies that are tried and tested (they have been used in production for years).

Open vSwitch (OVS) is an L2 switch, just like a physical switch: it has ports where you plug in Ethernet cables, and it works at the MAC address / ARP level to connect them.

Docker 1.9.x, shipped in Fedora 23, does support this feature via something called an overlay network, but even in that case I still prefer the OVS method.

I don't like creating tunnels in user space (like CoreOS Flannel), especially in languages with automatic garbage collection. Remember, all inter-container communication goes through this tunnel, so it would certainly be a bottleneck.

In this article I'll keep SELinux enabled and I'll use "firewalld".

My Setup - your mileage may vary!

I used a few VM nodes, each with its own LAN IP. The docker bridge (called docker0 or br0) on each node should have its own non-overlapping subnet, and all of those subnets should belong to one bigger subnet handled by the virtual switch (called sw0).

I used 172.x.y.z/16 for the bridge and 172.x.y.z/12 for the bigger switch; one might prefer 10.x.y.z/16 and 10.x.y.z/8.

So we have:

  • vm1:
    • eth0: 192.168.122.81
    • sw0: 172.30.0.1/12
    • br0 / docker0: 172.18.0.1/16
  • vm2:
    • eth0: 192.168.122.82
    • sw0: 172.30.0.2/12
    • br0 / docker0: 172.19.0.1/16
  • vm3:
    • eth0: 192.168.122.83
    • sw0: 172.30.0.3/12
    • br0 / docker0: 172.20.0.1/16
  • ... and so on


The address of each host on the virtual switch is 172.30.0.1/12, 172.30.0.2/12, 172.30.0.3/12, ... matching the node number. The /12 means it can route IPs in the range 172.16.0.0 - 172.31.255.255, since they all belong to the same network.

I made the docker bridge start at 172.18.0.1 on node 1, then added 1 to the second octet for each subsequent node. It's a /16, which means it only routes IPs in the range 172.18.0.0 - 172.18.255.255 (in other words, packets with IPs outside that range are not routed to this bridge but to the sw0 switch).

In a similar way, if you choose the docker bridge IP to be 10.1.0.1/16 on node 1 (and 10.2.0.1/16 on node 2, etc.), then single-host communication covers 10.1.0.0 - 10.1.255.255 on node 1, while sw0 addresses of 10.200.0.1/8, 10.200.0.2/8, 10.200.0.3/8 on each node cover the range 10.0.0.0 - 10.255.255.255 for communication across hosts.
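To see the split concretely, here is a quick sanity check you can run on node 1 once the bridges described below are configured (a sketch; the example addresses are hypothetical and assume the vm1 layout above):

ip route get 172.18.42.7   # a container on this same host: should go via docker0
ip route get 172.19.42.7   # a container on another node: should go via sw0
ip route get 8.8.8.8       # anything outside 172.16.0.0/12: should use the default route via eth0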

Step By Step

Prerequisites


You need to run all of these steps on all nodes. Let's install some dependencies. We need some extra repos, so let's enable "Extra Packages for Enterprise Linux" (EPEL) and the "RedHat Community Distribution of OpenStack" (RDO):

yum install -y epel-release https://www.rdoproject.org/repos/rdo-release.rpm
yum install -y firewalld docker openvswitch bridge-utils nc
yum update -y
systemctl start openvswitch firewalld
systemctl enable openvswitch firewalld

Let's open the VxLAN ports (in production you should restrict this to a trusted network interface or network, as we do below):

firewall-cmd --add-port=4789/udp --add-port=8472/udp
firewall-cmd --permanent --add-port=4789/udp --add-port=8472/udp

You would adjust these firewall rules accordingly if you want GRE instead of VxLAN.

Let's trust our internal network so that our containers can talk to each other:

firewall-cmd --zone=trusted --add-source=172.16.0.0/12
firewall-cmd --permanent --zone=trusted --add-source=172.16.0.0/12
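You can double-check the runtime firewall state at this point:

firewall-cmd --list-ports                     # should include 4789/udp and 8472/udp
firewall-cmd --zone=trusted --list-sources    # should include 172.16.0.0/12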

Now let's customize the default docker bridge "docker0" on each node by editing /etc/sysconfig/docker-network so that it looks like this:

# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS="--bip=172.18.0.1/16 --iptables=false --ip-masq=false --ip-forward=true"

Be sure to replace 172.18.0.1/16 with 172.19.0.1/16 on node 2, 172.20.0.1/16 on node 3, and so on.
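Since the only thing that changes between nodes is the second octet, a tiny script can generate the file. This is just a sketch, assuming you set a hypothetical NODE_ID of 1 on vm1, 2 on vm2, and so on:

# node 1 -> 172.18.0.1/16, node 2 -> 172.19.0.1/16, ...
NODE_ID=1
cat > /etc/sysconfig/docker-network <<EOF
DOCKER_NETWORK_OPTIONS="--bip=172.$((17 + NODE_ID)).0.1/16 --iptables=false --ip-masq=false --ip-forward=true"
EOF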

Another option is to create a separate bridge br0 using:

brctl addbr br0
ip addr add 172.18.0.1/16 dev br0 && ip link set dev br0 up

To make that permanent, instead of the above just create the file ifcfg-br0:

cat /etc/sysconfig/network-scripts/ifcfg-br0 
DEVICE=br0
TYPE=Bridge
IPADDR=172.18.0.1
NETMASK=255.255.0.0
ONBOOT=yes
BOOTPROTO=none
DELAY=0


and pass "-b=br0" instead of "--bip" to docker, as sketched below. I prefer the first method of just customizing "docker0" via "--bip".
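For reference, with the br0 approach the file would look roughly like this (a sketch, assuming the ifcfg-br0 above):

# /etc/sysconfig/docker-network (br0 variant)
DOCKER_NETWORK_OPTIONS="-b=br0 --iptables=false --ip-masq=false --ip-forward=true"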


Verifying

Now let's verify that this works. Start docker, run a busybox container, and type "ip a":

systemctl start docker
systemctl enable docker
ip a # verify that docker0 exists and has the expected IP address
docker run -ti busybox /bin/sh
# ip a 

At this point containers like the busybox above still can't talk to each other across hosts. Let's proceed.

Adding Virtual Switch

Let's create a switch (a.k.a. bridge) on each node and connect it to the other nodes using VxLAN tunnels. On the first node, type:

ovs-vsctl add-br sw0
ip addr add 172.30.0.1/12 dev sw0 && ip link set dev sw0 up
ovs-vsctl add-port sw0 vxlan0 -- set Interface vxlan0 type=vxlan options:remote_ip=192.168.122.82
ovs-vsctl add-port sw0 vxlan1 -- set Interface vxlan1 type=vxlan options:remote_ip=192.168.122.83


Here 172.30.0.1/12 is the sw0 address of node 1, and the VxLAN ports tell it how to reach the other nodes (192.168.122.82 and 192.168.122.83). Note that each tunnel port needs its own unique name (vxlan0, vxlan1, ...).

On the second node:

ovs-vsctl add-br sw0
ip addr add 172.30.0.2/12 dev sw0 && ip link set dev sw0 up
ovs-vsctl add-port sw0 vxlan0 -- set Interface vxlan0 type=vxlan options:remote_ip=192.168.122.81
ovs-vsctl add-port sw0 vxlan1 -- set Interface vxlan1 type=vxlan options:remote_ip=192.168.122.83


And so on for the remaining nodes. You can use type=gre instead of type=vxlan if you prefer GRE.
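Since every node runs the same commands with only its own IP left out, you may prefer a small loop. This is just a sketch, where MY_IP is the node's own LAN IP:

MY_IP=192.168.122.81   # this node's own address; change per node
i=0
for remote in 192.168.122.81 192.168.122.82 192.168.122.83; do
    [ "$remote" = "$MY_IP" ] && continue
    ovs-vsctl add-port sw0 "vxlan$i" -- set Interface "vxlan$i" type=vxlan options:remote_ip="$remote"
    i=$((i + 1))
done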

A nice thing about ovs-vsctl is that its configuration persists across reboots.

Now, on each node, create a virtual Ethernet (veth) pair that connects the docker bridge "docker0" on one end to the virtual switch "sw0" on the other. We'll call the two ends veth_sw0 (the switch end) and veth_d0 (the docker end). This is done by typing:

ip link add veth_sw0 type veth peer name veth_d0
ovs-vsctl add-port sw0 veth_sw0
brctl addif docker0 veth_d0 # use br0 here instead if you created the separate br0 bridge
ip link set dev veth_sw0 up
ip link set dev veth_d0 up



Verify that by typing "ovs-vsctl show" and "ip a":

# ovs-vsctl show
a3a48453-c88c-4df9-acb9-65496c65645c
    Bridge "sw0"
        Port "veth_sw0"
            Interface "veth_sw0"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {remote_ip="192.168.122.82"}
        Port "sw0"
            Interface "sw0"
                type: internal
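You can also confirm the docker side of the wiring with the Linux bridge tools:

brctl show docker0     # veth_d0 should appear in the interfaces column
ip link show veth_d0   # the link should be UP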



Verifying cross-host connectivity

Now, on one host, start a container:

docker run -ti busybox /bin/sh
# ip a
# nc -l -p 8080 0.0.0.0

On another host, start a container and try to reach the first one (where OTHER_IP is the address seen in the first container's "ip a" output):

docker run -ti busybox /bin/sh
# ip a
# echo test | nc $OTHER_IP 8080

Containers can also reach the hosts themselves using their sw0 addresses (like 172.30.0.1).
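For example, from inside a container on node 2 you can ping node 1's sw0 address:

docker run -ti busybox /bin/sh
# ping -c 3 172.30.0.1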

Talking to the outside world

Let's add a NAT rule so that containers can talk to the outside world. If a packet coming from a container (172.16.0.0/12) wants to reach something outside (say google.com), in other words its destination does not belong to 172.16.0.0/12, then it needs NAT, i.e. it will go out through the host's 192.168.x.y address to the outside world.

firewall-cmd --direct --add-rule ipv4 nat POSTROUTING 0 -s 172.16.0.0/12 ! -d 172.16.0.0/12 -j MASQUERADE
firewall-cmd --permanent --direct --add-rule ipv4 nat POSTROUTING 0 -s 172.16.0.0/12 ! -d 172.16.0.0/12 -j MASQUERADE
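You can confirm the direct rule is in place with:

firewall-cmd --direct --get-all-rules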

To test this:

docker run -ti busybox /bin/sh
# ping google.com


Comments

  1. The command to add interface `veth_d0` to docker0 bridge should be:
    `$ brctl addif docker0 veth_d0`

  2. Great post. What if the LXC containers have more than one virtual interface with different IP ranges (and the hosts have sw0, sw1, sw2, etc.)? If the physical hosts have only one interface, then only one GRE tunnel can be built for the sw0 traffic. How can the traffic on sw1, sw2 etc. also be shared across containers on different hosts? Thanks.
