title | expires_at | tags
---|---|---
Large Deployment Best Practices | never |
- Large Deployment best practices for CF-Networking and Silk Release
- Problem 0: Default overlay IP CIDR block too small when there are 250+ diego cells
- Problem 1: Silk Daemon uses too much CPU
- Problem 2: ARP Cache on diego-cell not large enough
- Problem 3: Too frequent and in-sync polling from the silk-daemon and the vxlan-policy-agent
- Problem 4: Reaching the Upper Limit of Network Policies
- Problem 5: NAT Gateway port exhaustion
Some users have larger deployments than we regularly test with. We have heard of large deployments with 500-1000 diego cells. These deployments have specific considerations that smaller deployments don't need to worry about.
Please submit a PR or create an issue if you have come across other large deployment considerations.
The silk daemon on some diego cells fails because it cannot get a lease.
Increase the size of the silk-controller.network CIDR in the silk controller spec.
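To get a feel for how large the CIDR needs to be, the sketch below computes an upper bound on the number of per-cell leases that fit in an overlay network. It assumes each cell leases one fixed-size subnet; the `/24` per-cell subnet size and the example CIDRs are assumptions for illustration, so check the values in your own silk-controller spec. The function name `max_cell_leases` is hypothetical.

```python
import ipaddress


def max_cell_leases(overlay_cidr: str, subnet_prefix_length: int = 24) -> int:
    """Upper bound on per-cell subnets that fit in the overlay network.

    Assumes every cell leases exactly one subnet of the given prefix length.
    Some subnets may be reserved, so the usable number can be slightly lower.
    """
    network = ipaddress.ip_network(overlay_cidr)
    if subnet_prefix_length < network.prefixlen:
        raise ValueError("per-cell subnet cannot be larger than the overlay network")
    return 2 ** (subnet_prefix_length - network.prefixlen)


# Example CIDRs are assumptions, not recommendations.
for cidr in ("10.255.0.0/16", "10.240.0.0/12"):
    print(cidr, max_cell_leases(cidr))
```

Under these assumptions a `/16` overlay holds at most 256 leases, which is consistent with trouble starting at around 250 cells, while a `/12` holds at most 4096.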
The silk daemon begins using too much CPU on the cells. This causes the app health checks to fail, which causes the apps to evacuate the cell.
The silk daemon is deployed on every cell. It is in charge of getting the IP leases for every other cell from the silk controller. The silk daemon calls out to the silk controller every 5 seconds (by default) to get updated lease information. Every time it gets new information, the silk daemon makes Linux system calls to set up the networking. These calls are relatively slow and become expensive when many cells have new leases, which causes the silk daemons to use a lot of CPU.
Change the property lease_poll_interval_seconds on the silk-daemon job to be greater than 5 seconds. This will cause the silk-daemon to poll the silk-controller less frequently and thus make Linux system calls less frequently. However, increasing this property means that when a cell gets a new lease (this happens when a cell is rolled, recreated, or for whatever reason doesn't renew its lease properly), it will take longer for the other cells to know how to route container-to-container traffic to it. To start with, we suggest setting this property to 300 seconds (5 minutes). Then you can tweak accordingly.
Silk daemon fails to converge leases. Errors in the silk-daemon logs might look like this:
    {
      "timestamp": "TIME",
      "source": "cfnetworking.silk-daemon",
      "message": "cfnetworking.silk-daemon.poll-cycle",
      "log_level": 2,
      "data": {
        "error": "converge leases: del neigh with ip/hwaddr 10.255.21.2 : no such file or directory"
      }
    }
Also kernel logs might look like this:
neighbour: arp_cache: neighbor table overflow
ARP cache on the diego cell is not large enough to handle the number of entries the silk-daemon is trying to write.
Increase the ARP cache size on the diego cells.
- Look at the current size of your ARP cache
  - ssh onto a diego-cell and become root
  - inspect the following kernel variables

        sysctl net.ipv4.neigh.default.gc_thresh1
        sysctl net.ipv4.neigh.default.gc_thresh2
        sysctl net.ipv4.neigh.default.gc_thresh3
- Manually increase the ARP cache size on the cell. This is good for fixing the issue in the moment, but isn't a good long-term solution because the values will be reset when the cell is recreated.
  - set new, larger values for the kernel variables. These sizes were used successfully for a deployment of ~800 cells.

        sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
        sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
        sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
- For a more permanent solution, set these variables by adding the os-conf-release sysctl job to the diego-cell instance group. A conf file will be autogenerated into /etc/sysctl.d/71-bosh-os-conf-sysctl.conf.
  - the manifest changes will look similar to this:

        instance_groups:
        - name: diego-cell
          jobs:
          - name: sysctl
            properties:
              sysctl:
              - net.ipv4.neigh.default.gc_thresh3=8192
              - net.ipv4.neigh.default.gc_thresh2=4096
              - net.ipv4.neigh.default.gc_thresh1=2048
            release: os-conf
        ...
        releases:
        - name: "os-conf"
          version: "20.0.0"
          url: "https://bosh.io/d/github.com/cloudfoundry/os-conf-release?v=20.0.0"
          sha1: "a60187f038d45e2886db9df82b72a9ab5fdcc49d"
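When checking many cells, it can help to compare the `sysctl` output against the number of neighbor entries you expect. The sketch below parses the output of the three `sysctl` commands and checks `gc_thresh3`, the hard upper limit on neighbor table entries. The helper names are hypothetical, the sample values are typical Linux defaults used here as an assumption, and `expected_entries` is an estimate you have to supply yourself.

```python
def parse_gc_thresholds(sysctl_output: str) -> dict:
    """Parse lines such as 'net.ipv4.neigh.default.gc_thresh3 = 1024'."""
    thresholds = {}
    for line in sysctl_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and "gc_thresh" in key:
            # Keep only the last path component, e.g. 'gc_thresh3'.
            name = key.strip().rsplit(".", 1)[-1]
            thresholds[name] = int(value.strip())
    return thresholds


def arp_cache_can_hold(sysctl_output: str, expected_entries: int) -> bool:
    """True if the hard limit (gc_thresh3) covers the expected entry count."""
    return parse_gc_thresholds(sysctl_output)["gc_thresh3"] >= expected_entries


# Sample output with typical default values (an assumption for illustration).
sample = """\
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
"""
print(parse_gc_thresholds(sample))
print(arp_cache_can_hold(sample, expected_entries=2000))
```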
- All silk-daemons can't connect to the silk-controller
- Silk-controller is overwhelmed with connections
- All vxlan-policy-agents can't connect to the network-policy-server
- Network-policy-server is overwhelmed with connections
The silk-daemon and the vxlan-policy-agent both live on the Diego Cell. The silk-daemon polls the silk-controller every 30 seconds by default. The vxlan-policy-agent polls the network-policy-server every 5 seconds by default. If there is a high "max_in_flight" set for the Diego Cell instance group, then it is possible for many cells (50+) to start at the same time. This means that many silk-daemons and vxlan-policy-agents start polling at nearly the exact same time. This can overwhelm the jobs that they are polling.
- Lower max in flight
- Increase the polling interval for the silk-daemon and/or the vxlan-policy-agent
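The effect of both mitigations can be estimated with simple arithmetic. The sketch below is a simplified model, not a description of how the jobs schedule their polls: it assumes every cell polls at a fixed interval starting from its own start time, and the function names are hypothetical.

```python
def average_requests_per_second(cells: int, poll_interval_seconds: float) -> float:
    """Average request rate seen by the polled job."""
    return cells / poll_interval_seconds


def peak_requests_in_one_second(start_offsets, poll_interval_seconds: int) -> int:
    """Largest number of polls that land in the same one-second slot.

    start_offsets holds each cell's start time in seconds. Cells that start
    together keep polling together under this fixed-interval model.
    """
    slots = {}
    for offset in start_offsets:
        slot = int(offset % poll_interval_seconds)
        slots[slot] = slots.get(slot, 0) + 1
    return max(slots.values())


# 1000 cells polling every 5 seconds.
print(average_requests_per_second(1000, 5))
# 50 cells started at the same instant versus one second apart, 30 second interval.
print(peak_requests_in_one_second([0] * 50, 30))
print(peak_requests_in_one_second(range(50), 30))
```

In this model, raising the polling interval lowers the average load, while lowering max in flight spreads the start times and therefore lowers the peak.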
To our knowledge, no one has actually run into this problem, even in the largest of deployments. However, our team is often asked about this, so it seems important to cover it.
The quick answer is that you are limited to 65,535 apps used in network policies. This allows at least 32,767 network policies.
Container networking policies are implemented using Linux marks. Each source and destination app in a networking policy is assigned a mark at policy creation time. If the source or destination app already has a mark assigned to it from a different policy, then the app uses that mark and does not get a new one. The overlay network for container networking uses VXLAN. VXLAN limits the marks to 16 bits. With 16 bits there are 2^16 (or 65,536) distinct values for marks. The first mark is reserved and not given to apps, which leaves 65,535 marks available for apps.
Let's imagine that there are 65,535 different apps. A user could create 32,767 network policies from appA --> appB, where appA and appB are only ever used in ONE network policy. Each of the 32,767 policies includes two apps (the source and the destination) and each of those apps needs a mark. This would use 65,534 marks, which effectively reaches the upper limit of network policies.
Let's imagine that there are 5 apps. Let's say a user wants all 5 apps to be able to talk to every other app. This would result in 20 network policies (25 if each app also has a policy to itself). However, this would only use up 5 marks (one per app). There are still 65,530 marks available for other apps. This scenario shows how the more "overlapping" the policies are, the more policies you can have.
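The two scenarios above can be checked with a short sketch. It models only the counting rule described here (one mark per distinct app, reused across policies); the app names and the `marks_used` helper are hypothetical, and the fully connected case counts only policies between distinct apps.

```python
from itertools import permutations

MARK_BITS = 16
AVAILABLE_MARKS = 2 ** MARK_BITS - 1  # 65,535: the first mark is reserved


def marks_used(policies) -> int:
    """Count distinct apps across (source, destination) policy pairs."""
    apps = set()
    for source, destination in policies:
        apps.add(source)
        apps.add(destination)
    return len(apps)


# Scenario 1: every app appears in exactly one policy.
disjoint = [(f"src-{i}", f"dst-{i}") for i in range(32767)]
print(len(disjoint), marks_used(disjoint))

# Scenario 2: five apps, each allowed to talk to every other app.
five_apps = [f"app-{i}" for i in range(5)]
mesh = list(permutations(five_apps, 2))
print(len(mesh), marks_used(mesh), AVAILABLE_MARKS - marks_used(mesh))
```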
- Apps timeout while trying to connect to particular endpoints
- Multiple port allocation issues on NAT Gateways
- Multiple apps try to open multiple connections to a single service
Each foundation has a finite number of NAT Gateways, each of which can open up to 2^16 = 65,536 ports per destination IP and destination port (explanation). By default, the number of outbound connections per app is not limited. This opens the door to the noisy neighbour problem, where badly behaving apps exhaust the number of connections that can be opened to a given service and thus block access to it. The issue could be fixed by applying hard limits on the number of connections each app can open, which would contain the badly behaving ones.
NAT Gateway ports can also be exhausted in another way that is easier for an app to trigger. Instead of opening long-lived connections, a bad app could frequently open short-lived ones. Because of the way TCP works, after each connection is closed the client-side port is kept in a TIME_WAIT state for a few minutes before it is released (explanation). The way to fix this is by applying rate limits on the number of outbound connections.
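A rough calculation shows why short-lived connections are enough to exhaust the ports. The sketch below assumes a steady connection rate and a fixed TIME_WAIT duration; the 120 second value is an assumption for illustration, since the actual duration depends on the NAT Gateway, and the function names are hypothetical.

```python
PORTS_PER_DESTINATION = 2 ** 16  # per destination IP and destination port


def ports_in_time_wait(connections_per_second: float, time_wait_seconds: float) -> float:
    """Steady-state number of ports held in TIME_WAIT at a constant rate."""
    return connections_per_second * time_wait_seconds


def max_sustained_rate(time_wait_seconds: float,
                       ports: int = PORTS_PER_DESTINATION) -> float:
    """Highest constant connection rate that never runs out of ports."""
    return ports / time_wait_seconds


# Assumed TIME_WAIT of 120 seconds.
print(ports_in_time_wait(100, 120))
print(max_sustained_rate(120))
```

Under these assumptions, all apps behind one gateway together can open only a few hundred short-lived connections per second to a single destination before ports run out.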
Currently the implementation of hard limits is blocked by a netfilter issue.
Rate limiting, on the other hand, is implemented as part of the silk-cni job and can be used through optional parameters under the outbound_connections field:
- limit is an on/off switch for the feature.
- burst is the maximum number of outbound connections per destination host allowed to be opened at once per container.
- rate_per_sec is the maximum number of outbound connections to be opened per second per destination host per container, given that the burst is exhausted.

Additionally, iptables logging of connections denied due to rate limits is available when iptables_logging is set to true. Such a log message is expected to have a prefix in the format DENY_ORL_<container-id>.
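To illustrate how burst and rate_per_sec interact, here is a minimal token bucket sketch. It is a simplified model for one container and one destination host, written under the assumption that the limit behaves like a token bucket; it is not the silk-cni implementation, and the `TokenBucket` class is hypothetical.

```python
class TokenBucket:
    """Simplified rate limiter for one (container, destination host) pair."""

    def __init__(self, burst: int, rate_per_sec: float):
        self.capacity = burst          # connections allowed at once
        self.rate = rate_per_sec       # refill rate once the burst is used up
        self.tokens = float(burst)
        self.last = 0.0                # time of the previous attempt, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a new outbound connection is allowed at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(burst=3, rate_per_sec=1)
# Five attempts at the same instant: the burst allows the first three.
print([bucket.allow(0.0) for _ in range(5)])
# One second later a single token has been refilled.
print(bucket.allow(1.0), bucket.allow(1.0))
```

In this model a denied attempt corresponds to a connection that would be dropped and, with iptables_logging enabled, logged.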