Tuesday, March 20, 2012

Traffic Shaping

Note: This is a good article about traffic shaping. Let's learn it together… :)
Source : http://blog.edseek.com/~jasonb/articles/traffic_shaping/scenarios.html

1. Introduction

1.1. Intended Audience

I have written this document with the experienced Linux administrator in mind. I make the assumption that the reader is familiar with compiling a kernel, downloading and configuring source packages from project Web sites, the Bash shell, patching sources, and so forth. It is also assumed that the reader has at least passing familiarity with Linux's Netfilter packet filter. Those already familiar with traffic control on other devices or platforms may wish to skip Introduction to QoS and start with Linux Traffic Control.

1.2. Copyright and License

This document, A Practical Guide to Linux Traffic Control, is copyrighted (c) 2004 by Jason Boxman.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is available at http://www.gnu.org/copyleft/fdl.html.
Linux is a registered trademark of Linus Torvalds.

1.3. Disclaimer

No liability for the contents of this document can be accepted. Use the concepts, examples and information at your own risk. There may be errors and inaccuracies that could be damaging to your system. Proceed with caution; although damage is highly unlikely, the author(s) do not take any responsibility.
All copyrights are held by their respective owners, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark. Naming of particular products or brands should not be seen as endorsements.

1.4. Feedback

Please feel free to contact me with corrections, omissions, and questions. <jasonb@edseek.com>.

1.5. New Versions of this Document

The newest version of this HOWTO will always first be made available here.

1.6. Credits

I'd like to thank the following people for suggestions and corrections, in no particular order. Thanks to Tomasz Orzechowski, Andy Lievertz, Andreas Klauer, and anyone else I didn't mention. Special thanks to Adria, my girlfriend, for strongly encouraging me to spellcheck anything before I publicly release it.

2. Introduction to QoS

2.1. What is QoS?

What exactly is Quality of Service? On a network, QoS refers to measurables like latency and throughput, things that directly affect the user experience. By default, network traffic (packets) is handled in a best effort fashion. However, if you have ever witnessed your interactive Internet applications experiencing network delays, it becomes clear that best effort is often not good enough. Some flows need preferential treatment. Fortunately, the possibility exists to handle different flows of packets differently; to recognize that some traffic requires low latency or a rate guarantee for the best user experience. QoS is sometimes referred to as traffic control, or traffic shaping.

2.2. Traffic Control Overview

Traffic classification and shaping take place at the egress of a gateway router. To understand the benefits of traffic classification and preferential treatment, let's take an example. Let us assume a basic router is set up with the default FIFO scheduler for best effort handling of flows. The pipe is at or near capacity and a queue forms. You can think of the queue as a line at the entrance to a toll booth on the highway. The toll plaza can only accept a limited number of cars simultaneously. Any additional travelers queue up in a lengthy line. Now, imagine an ambulance with a critical patient is enqueued at the back of the line. A packet scheduler can make the determination, based upon some given criteria, that this particular flow, the ambulance, is important enough to warp to the front of the line, allowing it quick passage. (Now, if only such a device could warp my vehicle to the beginning of the line at such toll plazas.) The above scenario is beneficial when an artificial queue is employed at the local egress point you have control over to thwart an unfriendly upstream queue, such as at a broadband Internet service provider.
Ingress traffic, flows that are already upon you, is a different story. It is generally considered difficult to shape ingress traffic, since you have no control over QoS policy decisions outside your network infrastructure.

3. Linux Traffic Control

Traffic control capabilities have been available in the Linux kernel since the 2.2 series. Since then it has matured greatly. Shaping configurations are implemented using a variety of available packet schedulers and shapers which can be configured using the tc binary included in the iproute2 package of tools. qdisc is the term used to refer to these schedulers under Linux.

3.1. The Qdisc

The Linux traffic shaping implementation allows you to build arbitrarily complicated configurations, based upon the building block of the qdisc. You can choose from two kinds of qdiscs: classless and classful. By default, all Ethernet interfaces get a classless qdisc for free, which is essentially a FIFO. You will likely replace this with something more interesting. Classful qdiscs, on the other hand, can contain classes to arbitrary levels of depth. Leaf classes, those which are not parents of additional classes, hold a default FIFO style qdisc. Classless qdiscs are schedulers. Classful qdiscs are generally shapers, but some schedule as well.

3.2. Hooks

Each network interface effectively has two places where you can attach qdiscs: root and ingress. root is essentially a synonym for egress. The distinction between these two hooks into Linux QoS is essential. The root hook sits on the inside of your Ethernet interface. You can effectively apply traffic classification and scheduling against the root hook as any queue you create is under your control. The ingress hook rests on the outside of your Ethernet interface. You cannot shape this traffic, but merely throttle it, because it has already arrived. What follows is a discussion of utilizing the former, the root hook.
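As a rough sketch of the latter, the ingress qdisc can be attached and a policing filter used to drop traffic that exceeds a given rate; the interface name and the 1mbit rate here are placeholders only (ffff: is the conventional handle of the ingress qdisc):
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 match ip src 0.0.0.0/0 \
  police rate 1mbit burst 10k drop flowid :1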

4. Class Structure Revealed

Paramount to creating a useful traffic control configuration is understanding how to manipulate the class hierarchy you can attach to each root hook. You configure all aspects of the actual shaping using the tc binary. (Traffic classification can be done with the tc binary as well, but we will instead look at a more powerful method later.)

4.1. A Classless qdisc

Let us take an example, using the tc binary.
tc qdisc add dev eth2 parent root handle 1:0 pfifo
Now, that is a lot to take in at once. I have added some verbosity you seldom see to further attempt to clarify what is happening here.
First, we specify that we want to work with a qdisc. Next, we indicate we wish to add a new qdisc to the Ethernet device eth2. (You can specify del in place of add to remove the qdisc in question.) Then, we specify the special parent root. We discussed the importance of the root hook earlier. It is the hook on the egress side of your Ethernet interface. The handle is the magic userspace way of naming a particular qdisc, but more on that later. Finally, we specify the qdisc we wish to add. Because pfifo is a classless qdisc, there is nothing more to do.
A graphical representation of the structure we just created is depicted below. The color blue is used to represent a qdisc. Later, you will see green used to represent a class.
[Figure: a single pfifo qdisc attached to the root hook]
It is important to discuss the naming convention for qdiscs before proceeding. qdiscs are always referred to using a combination of a major node number and a minor node number. For any given qdisc, the major node number has to be unique for the root hook for a given Ethernet interface. The minor number for any given qdisc will always be zero. By convention, the first qdisc created is named 1:0. You could, however, choose 7:0 instead. These numbers are actually in hexadecimal, so any values within the range of 1 to ffff are also valid. For readability the digits 0 to 9 are generally used exclusively.
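For example, the same pfifo could just as well be attached with a different, arbitrary major node number, provided it is unique on this interface:
tc qdisc add dev eth2 parent root handle 7:0 pfifo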

4.2. A Classful qdisc

Now, let us look at an example where we add a classful qdisc and a single class.
tc qdisc add dev eth2 parent root handle 1:0 htb default 1
Initially, it looks almost identical to the previous classless qdisc. The htb qdisc was used here. However, to actually benefit from using this classful qdisc, we need to add some classes. You will notice above the parameter default is specified. It is an option for htb qdiscs described later.
tc class add dev eth2 parent 1:0 classid 1:1 htb rate 1mbit
Now, we use the tc binary with the class argument, instead of the qdisc argument. The class argument allows you to create a class hierarchy. We are working with Ethernet device eth2 again. The parent parameter enables you to attach this class to either an existing classful qdisc or another class of the same type. Above we attach to the htb qdisc we assigned to the root hook earlier.
The value for the parent parameter must be the major and minor node number of the qdisc or class you wish to attach to. Earlier, the identifier 1:0 was chosen for handle of the htb qdisc. That must be used as the argument to the parent parameter so tc knows where you are assigning this class.
The classid parameter serves the same purpose for classes that the handle parameter serves for qdiscs. It is the identifier for this specific class. The major node number has already been chosen for you. It must be the same as the major node number specified for the parent argument earlier. You are free to choose any minor node number you want for the classid parameter. Traditionally numbering starts at 1, but numbers from 1 to ffff are valid. For this class the classid 1:1 was chosen, because the qdisc it is being attached to has a major node number 1 for its handle parameter.
Finally we specify the type of class and any options that class requires. In this instance, the htb class was chosen, as the htb qdisc can only have htb classes assigned to it. (This is generally true of classful qdiscs.) The rate option for htb is discussed later.
Another graphical representation, this time of an htb qdisc and an associated class in a simple hierarchy, is shown below. The qdisc is blue and the class green.
[Figure: an htb qdisc with a single attached htb class]

4.3. A Combined Example

Finally, we can reveal a complete example using both classful and classless qdiscs.
tc qdisc add dev eth2 parent root handle 1:0 htb default 20
tc class add dev eth2 parent 1:0 classid 1:1 htb rate 1000kbit
tc class add dev eth2 parent 1:1 classid 1:10 htb rate 500kbit
tc class add dev eth2 parent 1:1 classid 1:20 htb rate 500kbit
tc qdisc add dev eth2 parent 1:20 handle 2:0 sfq
Now we have a nested structure, with an htb classful qdisc assigned to the root hook, three htb classes, and an sfq qdisc as the leaf qdisc for one htb class. The other leaf class has an implicit pfifo attached. The careful reader will notice each qdisc has a minor node number of zero, as is required.
A graphical representation of the class hierarchy just created should be beneficial.
[Figure: an htb qdisc with three htb classes and an sfq leaf qdisc]

First, notice that at the top of the hierarchy sits an htb qdisc. Three classes are assigned to it. Only the first is immediately attached to it, using the parent 1:0. The other two classes are children of the first class. If you examine the tc command with the class option, you will see that the parent refers to the parent class in the hierarchy via its classid.
Each of the three htb classes attached to the htb qdisc are assigned a major node number of 1 for the classid, as the qdisc they are attached to has a handle with 1 as the major node number. The minor node number for each classid must merely be a unique number between 1 and ffff in hexadecimal.
Finally, a sfq qdisc is attached to the leaf class with classid 1:20. Notice the qdisc is added nearly the same as the htb. However, instead of being assigned to the magic root hook, the target is 1:20. The handle is chosen based on the rules discussed earlier. Briefly, the major node number must be a unique number between 1 and ffff and the minor node must be 0.
Last, the whole structure can be deleted simply by deleting the root hook as demonstrated below.
tc qdisc del dev eth2 root

4.4. Displaying qdisc and class Details

Now we can take a look at the details of the hierarchy we have created. Using the tc tool again the following output is produced.
# tc -s -d qdisc show dev eth2
qdisc sfq 2: quantum 1514b limit 128p flows 128/1024
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0)
 
qdisc htb 1: r2q 10 default 0 direct_packets_stat 0 ver 3.16
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0)
Each qdisc in your hierarchy is shown along with statistics and the details of its parameters. The information available for each qdisc varies. The number of sent bytes and packets is self explanatory. For classful qdiscs, the totals are class wide and include the dequeue totals from children and siblings.
# tc -s -d class show dev eth2
class htb 1:10 root leaf 2: prio 0 quantum 13107 rate 1Mbit ceil 1Mbit
  burst 2909b/8 mpu 0b cburst 2909b/8 mpu 0b level 0
 Sent 0 bytes 0 pkts (dropped 0, overlimits 0)
 lended: 0 borrowed: 0 giants: 0
 tokens: 28791 ctokens: 2879
The detailed output for classes is similar to that of qdiscs. The information available varies depending on the type of class. The above is typical of a htb qdisc's class.
Next, we will discuss some of the more interesting qdiscs available, including pfifo, sfq, tbf, prio, and htb.

5. An Overview of Common qdiscs

Linux traffic control affords you access to many different shapers and schedulers. A brief overview of each and its capabilities is needed before you can employ them.

5.1. Classless qdiscs

5.1.1. pfifo / bfifo

The most basic scheduler available, pfifo is your standard first-in first-out queue. pfifo is the scheduler all leaf nodes of classful qdiscs are assigned by default if you choose not to override it with a different scheduler.
Usage: ... [p|b]fifo [ limit NUMBER ]
The only available option is a limit on the queue size. It is assumed to be packets if you are using pfifo and bytes if you are using bfifo.
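For example, either of the following (one at a time, of course) would cap the queue at 100 packets or 20000 bytes respectively; the numbers are arbitrary:
tc qdisc add dev eth2 root handle 1:0 pfifo limit 100
tc qdisc add dev eth2 root handle 1:0 bfifo limit 20000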

5.1.2. sfq

A scheduler designed to be CPU and flow friendly, sfq employs a stochastic algorithm to ensure reasonable fairness amongst flows. Stochastic is a fancy way of saying probabilities are employed instead of exacting precision. In a nutshell, sfq uses a constantly changing hashing algorithm to file packets into internal FIFOs, which are dequeued in round robin fashion.
Usage: ... sfq [ perturb SECS ] [ quantum BYTES ]
The perturb parameter allows you to specify how often sfq changes its hashing algorithm. The quantum parameter controls how many bytes are released from each internal FIFO in round robin fashion. You cannot set this below your maximum transmission unit (MTU) size.
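For example, the following attaches an sfq to the root hook that rehashes every 10 seconds; the values are merely illustrative:
tc qdisc add dev eth2 root handle 1:0 sfq perturb 10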

5.1.3. tbf

An excellent scheduler for throttling traffic, the token bucket filter is just that, a bucket. Each token corresponds to a byte. When a flow runs below the specified rate, tokens accumulate, allowing for bursts when the flow later exceeds that rate. When packets build up without matching tokens for each byte, new packets are dropped.
As of Linux 2.6.1, tbf is now a classful qdisc. By default it will behave as it did in prior versions of Linux. The classful variant automatically creates a class with a minor node of 1. The major node number will be what you assigned to the tbf qdisc. You can attach both classless and classful qdiscs to the new tbf in 2.6.1 and later.
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
[ peakrate KBPS ] [ latency TIME ]
The rate parameter allows you to specify the speed limit for this scheduler. You must also specify the burst parameter, which is essentially the size of your token bucket in bytes. The limit parameter is the number of bytes to queue before packets are tail dropped. (In Linux 2.6.1 and later, if you attach a qdisc to your tbf class, the limit is ignored in favor of your attached qdisc.) A detailed explanation of using the peakrate and latency parameters is beyond the scope of this document.
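As a quick sketch, the following throttles eth2 to 256Kbps with an arbitrary bucket and queue size, and (on Linux 2.6.1 or later) attaches an sfq to the class tbf creates automatically; the numbers are placeholders only:
tc qdisc add dev eth2 root handle 1:0 tbf rate 256kbit burst 10kb limit 15kb
tc qdisc add dev eth2 parent 1:1 handle 10:0 sfq perturb 10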

5.2. Classful qdiscs

5.2.1. prio

The prio qdisc, a priority scheduler, is an interesting classful scheduler. The prio qdisc automatically creates three classes when used, with the first's minor node number starting at 1 and incrementing from there. Each of these is queried in turn, starting with the first, and all available packets are sent. It continues this until it runs out of classes or available bandwidth has been exhausted. By default, traffic is assigned to each of the three priority classes based on type of service (TOS) bits set in each packet, but you can override this behavior.
Usage: ... prio bands NUMBER priomap P1 P2...
You can specify the number of bands using the bands parameter. The priomap parameter is a bit more complicated. The priomap parameter takes sixteen integers as an argument, with each corresponding to a priority class. While the prio qdisc classes begin at 1, priorities begin at 0. Each position in this map of numbers corresponds to a specific TOS bit. The ordering is hardcoded. In practice you likely will not need to change these assignments.
Traffic must be classified into one of the classes created automatically by prio and not into any attached qdisc or the classification will silently fail to work as expected.
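For example, the following attaches a prio qdisc with its default three bands and uses a u32 filter (covered later) to force SSH traffic into the first, highest priority band, class 1:1; the port and interface are merely illustrative:
tc qdisc add dev eth2 root handle 1:0 prio
tc filter add dev eth2 parent 1:0 protocol ip u32 match ip dport 22 0xffff classid 1:1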

5.2.2. htb

The htb qdisc, or hierarchical token bucket, is a classful shaping qdisc. It packs a lot of flexibility and has numerous options available. It calculates reasonable default values for anything you do not specify, which is generally fine. A htb qdisc accepts the default parameter, which specifies which class to direct unclassified flows to. All other options pertain to the actual classes.
Usage:
... qdisc add ... htb [default N] [r2q N]
... class add ... htb rate R1 burst B1 [prio P] [slot S] [pslot PS]
                      [ceil R2] [cburst B2] [mtu MTU] [quantum Q]
Generally, a htb parent class is given a rate and a ceil (ceiling) in bits per second. This defines how much bandwidth is available. If not specified, ceil defaults to whatever you specified rate to be. The parent class serves as a container for its children and should have a rate equal to the practical amount of bandwidth available.
If you choose a value for your rate for a class that is less than your ceil, that class can borrow up to your ceil value. For example, if your ceil is 512Kbps and your rate is 128Kbps, a total of 384Kbps could be borrowed. It's important to note that, unlike borrowing something in meatspace, bandwidth borrowed amongst htb classes is not returned. Because rate is the bandwidth guaranteed to the class, it cannot exceed the number specified for the outermost parent class. If a particular class requests more bandwidth than its specified rate, it borrows from its parent in proportion to its rate.
For example, if a parent class with a 100Kbps rate has two children with rates of 80Kbps and 20Kbps respectively, the former can borrow bandwidth in a proportion of 4:1 with respect to the latter. In other words, classes with a higher rate in comparison to other siblings can borrow more.
The default option specifies which class unclassified traffic should be associated with. Any traffic not classified via one of the methods described later will be assigned to this class. If you do not specify the default option at all, any unclassified traffic will not be subject to traffic control. In other words, you probably want to specify a default.
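The borrowing example above can be sketched as follows, assuming eth2 and arbitrary class numbers; the parent guarantees 100Kbps overall, and the children may borrow up to that ceiling in a 4:1 proportion:
tc qdisc add dev eth2 root handle 1:0 htb default 20
tc class add dev eth2 parent 1:0 classid 1:1 htb rate 100kbit ceil 100kbit
tc class add dev eth2 parent 1:1 classid 1:10 htb rate 80kbit ceil 100kbit
tc class add dev eth2 parent 1:1 classid 1:20 htb rate 20kbit ceil 100kbit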

6. Classifying Flows

A variety of methods exist to classify flows. You can use tc to classify traffic, but it suffers from being entirely stateless. The Netfilter framework is a stateful firewall which can be used to classify flows in addition to providing firewalling services. Moreover, it's often more convenient to simply add a classification chain to an existing Netfilter configuration. The simplest method for classifying traffic with Netfilter is the CLASSIFY target, although the MARK target combined with tc filters is also effective. In either case, you should target your classifications at the most specific class possible in your hierarchy.

6.1. Using tc and the u32 selector

The u32 selector allows you to match specific bits in the headers of IP, TCP, UDP, and ICMP packets. Most commonly it is used to classify packets based on the usual suspects, including source or destination address and source or destination port number. I only intend to cover aliases for commonly sought after bit offsets included within the tc binary. Matching portions of packets by hand is covered in light detail in the Linux Advanced Routing and Traffic Control HOWTO.
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip sport 80 0xffff classid 1:10
The syntax for tc when adding filters is verbose, but few of the values will change. First we specify that we want to work with a filter. We indicate we want to add a new filter for dev eth0. The parent is generally the parent qdisc for the specified interface, often 1:0 unless you choose a different major node number. Next, protocol is specified with ip as its value. Last, the u32 selector is specified.
With that, the stage is set for values to be passed off to the u32 selector itself. These values will always take the format of the keyword match followed by the keyword ip. Next, for readability, a selector of src, dst, sport, or dport is specified along with relevant arguments. Finally, classid is specified and should correspond to the qdisc or class you wish to assign the packet to.
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dport 22 0xffff \
  match ip dst 192.168.0.70/32 classid 1:20
For example, the above u32 match will assign packets with a destination port of 22 and a destination IP address of 192.168.0.70 to the qdisc described by 1:20. Notice you can include multiple instances of match in a single filter. The syntax for matching a source IP address and source port are the same, with only the selector name changing. You can match entire ranges of IP addresses using standard CIDR notation.
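For instance, assuming the same hierarchy, the following hypothetical filter would classify all traffic destined for an entire /24 subnet into 1:20:
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dst 192.168.0.0/24 classid 1:20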
Deleting filters is an involved process. The syntax presented above lacks some of the keywords necessary to allow effective, consistent deletion of individual filters. You can delete all filter entries by deleting the egress qdisc for a device, as demonstrated earlier.

6.2. Using the Netfilter CLASSIFY Target

Since Linux 2.6 the CLASSIFY target has been part of the standard distribution, so you need not patch your kernel. The CLASSIFY extension was added to Netfilter in version 1.2.9.
The CLASSIFY target is simple to use, provided you have some existing familiarity with Netfilter.
iptables -t mangle -A POSTROUTING -o eth2 -p tcp --sport 80 -j CLASSIFY --set-class 1:10
Briefly, iptables is being instructed to append a rule to the POSTROUTING section of mangle table. The rule matches TCP packets with a source port of 80 that are passing out of the eth2 network interface. The target of this rule is the CLASSIFY extension, which is directed to classify this traffic into the class described by the major node number 1 and the minor node number 10. The careful reader will notice that, based on the minor node number being greater than zero, the target must be a class assigned to a classful qdisc. A detailed discussion of how to use Netfilter and iptables is beyond this document's scope.
You can only use CLASSIFY from the POSTROUTING chain of the mangle table. It is prohibited elsewhere. If you find you need to classify packets elsewhere, you may need to use the MARK target instead.

6.3. Using the Netfilter MARK Target

If you cannot use the CLASSIFY target, you can use the MARK target in conjunction with tc to classify flows.
iptables -t mangle -A POSTROUTING -o eth2 -p tcp --sport 80 -j MARK --set-mark 1
The above iptables rule will set an invisible mark on any packet it matches. The mark exists in kernel space only. The packet is not actually modified. The tc binary can be used to classify flows based on these marks.
tc filter add dev eth0 protocol ip parent 1:0 prio 1 handle 1 fw classid 1:10
The above tc command is not unlike the familiar qdisc and class variants, except now you're adding a filter instead. The parent parameter will always refer to the root qdisc for the given interface, which must exist prior to creating the filter. The handle parameter refers to the mark that you gave the flow earlier. The classid parameter refers, unsurprisingly, to the handle of the class you wish to assign this flow to. It's generally only useful to add filters for interfaces which have classful qdiscs configured.

6.4. Using the L7 Filter Netfilter Module

L7 is an exciting new extension module for Netfilter which lets you match layer 7 traffic. L7 will investigate the contents of one or more packets in a flow to match patterns you specify. L7 already has a rich database of matching patterns for different application layer protocols. L7 is very useful for matching the flows of peer to peer (p2p) applications, many of which use random ports, but generally do not (yet) encrypt their data payloads to avoid detection.
Since L7 is not yet officially part of the standard Netfilter distribution, you will need to download the sources and patch both your kernel and iptables. There is an excellent HOWTO that details the steps to compile L7. Once you have installed your patched kernel and iptables with L7 support, you can start using it to match flows based on layer 7 protocols.
iptables -t mangle -A POSTROUTING -m layer7 --l7proto edonkey \
  -j CLASSIFY --set-class 2:1
For example, assuming the patterns are installed in the default location, the above will match flows generated by the popular p2p application eMule and use the CLASSIFY target described above to classify the flow appropriately. The POSTROUTING and PREROUTING chains are recommended since a flow's packets pass through them in both directions, so you aren't listening to only half a conversation, which would break some layer 7 matches.

7. Building a QoS Ready Kernel

In order to make use of these splendid traffic control features, you need to build your kernel with appropriate support. If you are interested in L7 Filter, make sure you patch your kernel accordingly.

7.1. Kernel Options for Traffic Control Support

The selections for traffic control for a 2.6 series kernel are listed under Device Drivers -> Networking support -> Networking options -> QoS and/or fair queuing. At a minimum you will want to enable the options selected below. Unselected options have been pruned.
[*] QoS and/or fair queueing
<M>   HTB packet scheduler
<M>   The simplest PRIO pseudoscheduler
<M>   RED queue
<M>   SFQ queue
<M>   TBF queue
<M>   Ingress Qdisc
[*]   QoS support
[*]     Rate estimator
[*]   Packet classifier API
<M>     Firewall based classifier
<M>     U32 classifier
[*]     Traffic policing (needed for in/egress)
For 2.6.9 and later kernels, you have the additional option of specifying the scheduler clock source directly during kernel configuration. The option is available from within the QoS and/or fair queuing configuration section above.
Packet scheduler clock source (Timer interrupt)  --->
 ( ) Timer interrupt
 ( ) gettimeofday
 (X) CPU cycle counter
For modern x86 machines you can select CPU cycle counter without incident. The scheduler clock source selection above replaces the need to hand edit the include/net/pkt_sched.h file as described below.

7.2. Kernel Options for Netfilter Support

The selections for Netfilter for a 2.6 series kernel are listed under Device Drivers -> Networking support -> Networking options -> Network packet filtering (replaces ipchains) -> IP: Netfilter Configuration. You will want to enable at least the options selected below to use Netfilter effectively to classify traffic flows. Include anything else you use for firewalling, too. Unselected options have been pruned.
<M> Connection tracking (required for masq/NAT + layer7)
<M>   FTP protocol support
<M>   IRC protocol support
<M> IP tables support (required for filtering/masq/NAT)
<M>   limit match support
<M>   IP range match support
<M>   Layer 7 match support (EXPERIMENTAL)
[ ]     Layer 7 debugging output(2048)  Buffer size for application layer data
<M>   Packet type match support
<M>   netfilter MARK match support
<M>   Multiple port match support
<M>   TOS match support
<M>   LENGTH match support
<M>   Helper match support
<M>   Connection state match support
<M>   Connection tracking match support
<M>   Packet filtering
<M>     REJECT target support
<M>   Full NAT
<M> MASQUERADE target support
<M> Packet mangling
<M>   TOS target support
<M>   MARK target support
<M>   CLASSIFY target support
<M> LOG target support
<M> ULOG target support

7.3. Kernel Source Changes

To get the most mileage out of your kernel, there are a few options in the kernel source you will want to change. You can find all these files under your kernel source tree. Paths are specified from that root.

7.3.1. PSCHED_CPU

If you have a Pentium class CPU or better and are using a kernel prior to 2.6.9, open include/net/pkt_sched.h, change PSCHED_CLOCK_SOURCE to PSCHED_CPU, and save the file. Your CPU must support the tsc flag to use PSCHED_CPU, which you can verify as shown below. You can skip this modification if you are using a kernel newer than 2.6.8.1.
# cat /proc/cpuinfo
...
flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov \
  pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
In days of old the default was PSCHED_GETTIMEOFDAY; today PSCHED_JIFFIES is used, and it isn't terribly bad. PSCHED_CPU can't hurt, though.

7.3.2. HTB_HYSTERESIS

When working with peak bandwidth rates of less than 1Mbps, the HTB_HYSTERESIS option is set to your detriment. It trades accuracy for faster calculations, but with really slow network links this is not necessary. Open up net/sched/sch_htb.c and change HTB_HYSTERESIS to 0. This setting also affects bursts.
#define HTB_HYSTERESIS 0/* whether to use mode hysteresis for speedup */

7.3.3. SFQ_DEPTH

When dealing with smaller bandwidth quantities, the default queue length of 128 is too long. Flows that demand low latency can suffer if sfq begins to fill up its queue. You can edit SFQ_DEPTH in net/sched/sch_sfq.c and shorten the queue to your liking. A popular depth is 10.
#define SFQ_DEPTH               128
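After the edit, assuming the popular depth of 10 mentioned above, the line would simply read:
#define SFQ_DEPTH               10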

 

8. A Traffic Control Journey: Real World Scenarios

Having read the previous sections and familiarized yourself with traffic control concepts and the tools available under GNU/Linux to deploy QoS, you should be ready to rock. Now, let us examine some real world scenarios and effective resolutions.
Below I overview two popular scenarios, guaranteeing a specific rate and guaranteeing flow priority. The first involves a basic Web server, the second a consumer broadband Internet connection. First, let us examine a few strategies to deal with situations that exist in many environments that may wish to employ traffic control.

8.1. Common Traffic Control Situations

Whether you're trying to guarantee a specific rate or priority for flows, you need to handle situations where TOS flags are improperly set (especially in the case of the prio qdisc), handle TCP handshake packets, and classify network resource intensive p2p traffic flows. What follows are Netfilter based solutions, although Netfilter need not be employed for actual classification. It is often easier to classify with Netfilter if you are already using it for stateful packet inspection.

8.1.1. Handling TOS Flags

Some applications, specifically OpenSSH, provide incorrect type of service (TOS) information which can result in misclassification of tunnels and bulk data transfers. With reliable remote shell connectivity typically being a must for servers, this can be a problem. What's more, p2p applications like to mask bulk data packets as TCP acknowledgment packets. Erik Hensema has an excellent two pronged Netfilter based solution for this situation.
iptables -t mangle -N tosfix
iptables -t mangle -A tosfix -p tcp -m length --length 0:512 -j RETURN
iptables -t mangle -A tosfix -m limit --limit 2/s --limit-burst 10 -j RETURN
iptables -t mangle -A tosfix -j TOS --set-tos Maximize-Throughput
iptables -t mangle -A tosfix -j RETURN
...
iptables -t mangle -A POSTROUTING -p tcp -m tos --tos Minimize-Delay -j tosfix
First, a new chain is created to examine Minimize-Delay packets. They are evaluated for length and then a short burst is allowed for. When both of these sanity checks fail, packets larger than 512 bytes with the TOS Minimize-Delay flag set have their TOS reclassified to Maximize-Throughput instead. The underlying assumption is that packets that need Minimize-Delay priority are small and only exceed 512 bytes for short bursts. Traffic flows from OpenSSH mesh well with this rule. Without it, using OpenSSH tunneling and copying files with scp or sftp can render your OpenSSH session rather useless for the duration if your packets are queuing.
iptables -t mangle -N ack
iptables -t mangle -A ack -m tos ! --tos Normal-Service -j RETURN
iptables -t mangle -A ack -p tcp -m length --length 0:128 \
  -j TOS --set-tos Minimize-Delay
iptables -t mangle -A ack -p tcp -m length --length 128: \
  -j TOS --set-tos Maximize-Throughput
iptables -t mangle -A ack -j RETURN
...
iptables -t mangle -A POSTROUTING -p tcp -m tcp --tcp-flags SYN,RST,ACK ACK -j ack
Last, a new chain is created specifically for modifying the TOS bits if they are not sane. TCP packets with the ACK flag set that already have TOS assigned are ignored. If the TCP packet is no larger than 128 bytes, it is considered a candidate for Minimize-Delay and elevated accordingly. Strange TCP packets with the ACK flag set, like those generated by p2p applications, generally fall into the category of being larger than 128 bytes and are flagged Maximize-Throughput accordingly. The chain is only applied to TCP packets with the ACK flag set.

8.1.2. Prioritizing TCP Handshake Packets

To prevent connection setup and teardown from encountering potentially lengthy delays, it's useful to assign these packets a higher priority. It's not strictly necessary to elevate these packets, as they will be properly classified for any specific flows you classify and treated the same as unclassified traffic otherwise. Reclassifying these packets is more a matter of personal taste.
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp -m tcp --tcp-flags ! SYN,RST,ACK ACK \
        -j CLASSIFY --set-class 4:1
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp -m tcp --tcp-flags SYN,RST,ACK ACK \
        -m length --length :128 -m tos --tos Minimize-Delay \
        -j CLASSIFY --set-class 4:1
The first rule matches TCP SYN and RST packets and classifies them using the CLASSIFY Netfilter target discussed earlier. The second rule builds on the TOS reclassification chain discussion above and again the CLASSIFY target is used on TCP packets with the ACK flag set that don't exceed 128 bytes and have a TOS flag of Minimize-Delay.

8.1.3. Handling Pervasive p2p Traffic

p2p traffic can very easily saturate a network's entire upstream bandwidth. Fortunately, with L7 Filter it is now rather easy to classify these flows and grant them priority below that of all other traffic. p2p applications are always evolving, so L7 Filter is no magic bullet. It can help pin down p2p traffic, however.
iptables -t mangle -A POSTROUTING -m layer7 --l7proto edonkey -j CLASSIFY --set-class 4:5
iptables -t mangle -A POSTROUTING -m layer7 --l7proto fasttrack -j CLASSIFY --set-class 4:5
iptables -t mangle -A POSTROUTING -m layer7 --l7proto gnutella -j CLASSIFY --set-class 4:5
iptables -t mangle -A POSTROUTING -m layer7 --l7proto audiogalaxy -j CLASSIFY --set-class 4:5
iptables -t mangle -A POSTROUTING -m layer7 --l7proto bittorrent -j CLASSIFY --set-class 4:5
There is no single pattern match for all known p2p applications, so you will need to specify a rule for each protocol that's present on your network now or that you believe may be in the future. You may have to create your own patterns for p2p traffic that does not yet have an L7 Filter pattern match. Packet analysis is beyond the scope of this document.

8.2. Guaranteeing Rate

When guaranteeing a minimum bandwidth rate is necessary, the classful htb qdisc is your friend. In this scenario our objective is to guarantee a specific rate for HTTP traffic while sharing the link with SMTP, POP3, and OpenSSH traffic. All other traffic is assigned to the default class.

8.2.1. Designing the Classful qdisc Hierarchy

A Web server networked via Ethernet has 8Mbps of bandwidth available to it. Web traffic is most important. Other traffic is secondary. Accordingly, the class hierarchy created below allocates 6000Kbps for HTTP traffic. The remaining bandwidth is split into three more classes. SMTP and POP3 are allocated 1000Kbps. OpenSSH gets 500Kbps as does the default class. All classes except the default class can use excess bandwidth up to the full speed of the line. The careful reader will note that all the rates add up to the overall rate specified in the first htb parent class, as they always should.
 
#!/bin/bash
 
RATE=8000
 
if [ x$1 = 'xstop' ]
then
        tc qdisc del dev eth0 root >/dev/null 2>&1
        exit 0
fi
 
tc qdisc add dev eth0 root handle 1: htb default 90
tc class add dev eth0 parent 1: classid 1:1 htb rate ${RATE}kbit ceil ${RATE}kbit
 
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 6000kbit ceil ${RATE}kbit
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1000kbit ceil ${RATE}kbit
tc class add dev eth0 parent 1:1 classid 1:50 htb rate 500kbit ceil ${RATE}kbit
tc class add dev eth0 parent 1:1 classid 1:90 htb rate 500kbit ceil 500kbit
 
tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10
tc qdisc add dev eth0 parent 1:50 handle 50: sfq perturb 10
tc qdisc add dev eth0 parent 1:90 handle 90: sfq perturb 10
The above shell script will create the class structure described above. It is rather simplistic and no deep nesting occurs. The parent class only has immediate children and no additional ancestors. For fairness, the sfq scheduling qdisc is attached to each leaf htb class.

8.2.2. Classifying Flows

Classification of flows is done using tc's filter combined with the u32 selector, discussed earlier.
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip sport 80 0xffff classid 1:10
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip sport 22 0xffff classid 1:20
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip sport 25 0xffff classid 1:50
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip sport 110 0xffff classid 1:50
The tc commands above classify flows using the u32 selector based on TCP source port number. HTTP, SSH, SMTP, and POP3 are classified based on their traditional source ports. Any unclassified traffic is assigned to classid 1:90 as specified earlier when the htb class hierarchy was created.
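If you would rather classify with Netfilter, as described earlier, a roughly equivalent set of CLASSIFY rules might look like the following sketch; it assumes eth0 is the egress interface:
iptables -t mangle -A POSTROUTING -o eth0 -p tcp --sport 80 -j CLASSIFY --set-class 1:10
iptables -t mangle -A POSTROUTING -o eth0 -p tcp --sport 22 -j CLASSIFY --set-class 1:20
iptables -t mangle -A POSTROUTING -o eth0 -p tcp --sport 25 -j CLASSIFY --set-class 1:50
iptables -t mangle -A POSTROUTING -o eth0 -p tcp --sport 110 -j CLASSIFY --set-class 1:50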

8.2.3. Observations

The classful htb qdisc is excellent at accurately guaranteeing rates for classified flows. Each htb class can dequeue at its assigned rate and, if allowed, exceed that in proportion to its parent's rate. It's especially useful for guaranteeing particular rates for specific services or entire ranges of network traffic.

8.3. Guaranteeing Priority

When guaranteeing flow priority is necessary, the classful prio qdisc is your friend. In this scenario our objective is to guarantee interactive applications have priority over bulk transfers and p2p applications.

8.3.1. Designing the Classful qdisc Hierarchy

The prio qdisc only knows about bands, where each band corresponds to a level of priority. While band numbering starts at zero, each band is described by major:band+1. To ensure that the priority classifications stick, the classful shaping qdisc tbf must be employed in conjunction with the prio qdisc. tbf will ensure that if link speed is exceeded a queue fills locally, where it is still controllable. Such a configuration is possible using tbf qdisc with Linux 2.6.1 and beyond.
Structurally, the class hierarchy utilizes a tbf qdisc and serves as a container for the prio qdisc, ensuring any packet queue remains local. The prio qdisc is then assigned to the only tbf class, with an extra band added. As described earlier, the prio qdisc automatically creates a class structure for as many bands as you create, with the default being three. Finally we assign the sfq scheduling qdisc as the leaf for three of the four new prio qdisc classes. The fourth, which is for p2p traffic, is assigned the tbf scheduling qdisc, with a pfifo qdisc attached to the tbf class.
It's important to note that the prio qdisc is merely a scheduler. As such, it cannot perform any shaping. Therefore, if one or more higher priority bands consume the link, lower priority bands will never have an opportunity to dequeue packets. In other words, starvation occurs. To combat this, careful planning is necessary. If starvation is not acceptable, you should look instead at guaranteeing rates as described above.
The proposed configuration is effective for residential consumer broadband, in the form of ADSL or Cable Internet services, where one must often suffer an asymmetrical connection. The example below assumes a usable rate of 160Kbps on a residential ADSL connection with an advertised rate of 256Kbps. The tricky part is guessing what your actual bandwidth rate is.  With overhead it's usually 60% to 90% of your rated connection.
tc qdisc add dev eth0 root handle 1: tbf rate 160kbit burst 1600 limit 1
tc qdisc add dev eth0 parent 1:1 handle 2: prio bands 4
tc qdisc add dev eth0 parent 2:1 handle 10: sfq perturb 20
tc qdisc add dev eth0 parent 2:2 handle 20: sfq perturb 20
tc qdisc add dev eth0 parent 2:3 handle 30: sfq perturb 20
tc qdisc add dev eth0 parent 2:4 handle 40: tbf rate 144kbit burst 1600 limit 3000
tc qdisc add dev eth0 parent 40:1 handle 41: pfifo limit 10
The above commands will create the class structure described above. The actual hierarchy is more complex than immediately obvious, due to the prio qdisc automatically creating a class for each band it manages.

8.3.2. Classifying Flows

Now we can use Netfilter and its CLASSIFY target to classify traffic. We handle packets with type of service set as described earlier. TCP packets with the ACK flag are also handled as described above. As you may recall, the prio qdisc uses the TOS flags to classify packets by default. Most importantly, Minimize-Delay is assigned priority level zero, Normal-Service priority level one, and Maximize-Throughput priority level two. Ensuring packets have a proper TOS flag is obviously of paramount importance.
# Is our TOS broken? Fix it for TCP ACK and OpenSSH.
 
iptables -t mangle -A POSTROUTING -p tcp -m tcp --tcp-flags SYN,RST,ACK ACK -j ack
iptables -t mangle -A POSTROUTING -p tcp -m tos --tos Minimize-Delay -j tosfix
 
# Here we deal with ACK, SYN, and RST packets
 
# Match SYN and RST packets
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp -m tcp --tcp-flags ! SYN,RST,ACK ACK \
        -j CLASSIFY --set-class 2:1
 
# Match ACK packets
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp -m tcp --tcp-flags SYN,RST,ACK ACK \
        -m length --length :128 -m tos --tos Minimize-Delay \
        -j CLASSIFY --set-class 2:1
 
# Match packets with TOS Minimize-Delay
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp -m tos --tos Minimize-Delay \
        -j CLASSIFY --set-class 2:1
The first packets classified are those that can delay flows if not handled expediently. All TCP flows are handled the same in that packets with handshake flags set are promoted. Later, some of these flows will be entirely demoted. The last matching rule wins: each time a rule matches, the packet is reassigned to the associated traffic control class. Classification progression with Netfilter should therefore proceed from the least to the most specific rule.
### Actual traffic shaping classifications with CLASSIFY
 
# ICMP (ping)
 
iptables -t mangle -A POSTROUTING -o $LOCALIF -p icmp -j CLASSIFY --set-class 2:1
 
# Outbound client requests for HTTP, IRC and AIM (dport matches)
 
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp --dport 80 -j CLASSIFY --set-class 2:2
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp --dport 6667 -j CLASSIFY --set-class 2:2
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp --dport 5190 -j CLASSIFY --set-class 2:2
 
# Enemy Territory (UDP, realtime gaming packets)
 
iptables -t mangle -A POSTROUTING -o $LOCALIF -p udp --dport 27960:27970 \
        -j CLASSIFY --set-class 2:2
After the earlier magic, classification of most flows is generally as easy and straightforward as using iptables matching rules. Above we assign ICMP traffic, which includes things like the packets sent in association with the ping command, to the class described by 2:1. We assign all other interactive traffic to the class described by 2:2. Notice we have classified both ICMP and UDP flows in addition to more common TCP flows.
# SSH
 
# The last matching rule wins, so list specific rules _LAST_
 
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp --dport 22 -j CLASSIFY --set-class 2:2
iptables -t mangle -A POSTROUTING -o $LOCALIF -p tcp --sport 22 -j CLASSIFY --set-class 2:2
 
iptables -t mangle -A POSTROUTING -p tcp -m tos --tos Maximize-Throughput \
        --sport ssh -j CLASSIFY --set-class 2:3
iptables -t mangle -A POSTROUTING -p tcp -m tos --tos Maximize-Throughput \
        --dport ssh -j CLASSIFY --set-class 2:3
 
# Matches for Edonkey and Overnet
 
iptables -t mangle -A POSTROUTING -m layer7 --l7proto edonkey -j CLASSIFY --set-class 2:4
Finally, we handle flows generated by OpenSSH and a p2p application. The former is assigned to the interactive class for sessions originating both within the local network and destined for the local network from the Internet. (Or perhaps from another segment on a WAN.) Earlier, packets with the TOS flag Minimize-Delay larger than 512 bytes had their TOS altered to a more reasonable Maximize-Throughput. That is taken advantage of now implicitly in the second pair of rules relating to OpenSSH. Tunnels and transfers using scp and sftp will now correctly be assigned to the class described by 2:3. The final rule uses L7-Filter to match packets sent by the p2p application eMule by applying a regular expression against each packet and matching the protocol at the application layer. The traffic is then assigned to the class represented by 2:4, the p2p class.

8.3.3. Observations

The classful prio qdisc paired with the classful tbf qdisc is an excellent way of guaranteeing priority for flows in situations where you can live with one or more bands dominating lower priority bands, possibly starving them entirely at times.

9. Graphing and Monitoring Traffic Control

Traffic control is a complex beast to fly without proper instrumentation. Fortunately, there are a number of tools at your disposal to help you formulate and monitor your configurations. Often taking measurements is as simple as polling the tc binary for statistics and generating graphs. You can certainly write your own tools, as many have. I intend to discuss a few tools I use to monitor my own configurations.

9.1. Discovering Traffic Flows

Before you can create an effective traffic control hierarchy, you may be wondering what kind of traffic exists on your network. You can use iptraf to explore your network traffic interactively. You can explore TCP flows using tcptrack and libpcap expressions. Both sport ncurses interfaces. Below is output from iptraf's Statistical breakdown -> By TCP/UDP port output.
/ Proto/Port --------- Pkts --- Bytes -- PktsTo - BytesTo  PktsFrom BytesFrom --\
| TCP/ftp-data           12       640        10       560         2        80   |
| TCP/ftp                67      5024        44      2972        23      2052   |
| TCP/ssh               119     16992        63      7832        56      9160   |
| UDP/domain             42      3859         0         0        42      3859   |
| TCP/60                  9       508         7       404         2       104   |
| TCP/gopher              4       240         4       240         0         0   |
| TCP/www                60      8399        41      6750        19      1649   |
| TCP/81                  4       240         4       240         0         0   |
| TCP/85                  4       240         4       240         0         0   |
| TCP/rtelnet             4       240         4       240         0         0   |
| TCP/pop3               30      3980        16      1552        14      2428   |
| TCP/auth               32      2652        20      1488        12      1164   |
| UDP/113                13       689        13       689         0         0   |
| TCP/140                12      7728         6      7452         6       276   |
| TCP/466                 4       240         4       240         0         0   |
| TCP/505                31      2441        16      1278        15      1163   |
| TCP/662                29      2026        15       974        14      1052   |
| TCP/moira_upda          9       540         9       540         0         0   |
| TCP/778                11       660        11       660         0         0   |

9.2. Graphing Your Traffic Control Hierarchy

It's exceedingly helpful to have a graphical representation of any complex class hierarchy you create. I suggest Andreas Klauer's tc-graph.pl for the task. You can obtain a copy from his FairNAT project page. FairNAT is a Netfilter based Linux firewall with extensive QoS features.
tc-graph.pl is easy to use. It's written in Perl. Once you've downloaded a copy, ensure it's executable. You will want to edit the script and verify the correct path for your system's tc is specified. Also, ensure the interface specified is the one you want to build a class hierarchy graph for.
Next, run the script and pipe the output to a file. The file will contain GraphViz commands. You can pass the file off to dot, part of the GraphViz suite of tools, to generate an image of your hierarchy.
jasonb@rebecca:~/src$ perl tc-graph.pl > mygraph.dot
jasonb@rebecca:~/src$ cat mygraph.dot | dot -Tgif > mygraph.gif
After running the above commands, you should have an image containing a graph of your traffic control class hierarchy. Some example graphs can be found in the section about class hierarchies, discussed earlier. I modified the copy of the script I used so a smaller font size was used and extra details were removed for simplicity's sake. Other than that, your images will look the same.

9.3. Monitoring Leaf qdisc Bandwidth Utilization

When working with configurations, I find it helpful to monitor the bandwidth utilization of each leaf qdisc, to verify the configuration is responding as anticipated. The simplest way to monitor bandwidth utilization is to parse the output of the tc binary and insert the values into a RRDTool database in the same fashion you would monitor the counters for a switch or router.
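For a quick look without any extra tooling, you can simply watch the class counters update; this assumes eth0 and a 10 second polling interval:
watch -n 10 'tc -s class show dev eth0'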
I wrote a utility in Perl to parse tc output and insert the transferred bytes value for a configuration's leaf qdiscs into a RRDTool database. The utility, polltc, can be obtained from my software Web page.
polltc can operate in either of two modes. In diagnostic mode, it will update a RRDTool database directly and generate a graph. It can also operate as a plugin for Munin, for long term trend analysis.

9.3.1. Configuring polltc

Before you can start using polltc_, you need to modify a few values near the start of the script. Specifically, you need to specify a path where you want files to be stored. I access my graphs via a Web server, so I dump all my files in a directory readable by my Web server.
You may not wish for polltc_ to create a graph when it runs every ten seconds in diagnostic mode. If such is the case, change $do_graph to 0.
You must specify the path to your tc binary. For testing purposes I use my own tc binary in my home directory, so you will need to change this or nothing will work.

9.3.2. Running polltc

Once you have changed the necessary options above, you can start using polltc_. The interface being probed for information is gathered from the name of the file itself. Create a symlink with the interface name so polltc_ knows what interface to probe (eth0, ppp1, etc.).
$ ln -s polltc_ polltc_eth0
Now, you can run polltc_eth0 to gather information about your traffic control configuration on eth0.
$ perl polltc_eth0 test &
An RRD database will be created and populated with values every 10 seconds. If you choose to enable graphing, a one hour and a twenty-four hour graph will be created and written to disk in the same location as the RRD database.
[Graph: bandwidth utilization of traffic control leaf qdiscs]

If you wish to use it as a Munin plugin, you will want to symlink into your /etc/munin/plugins directory.
# ln -s polltc_ /etc/munin/plugins/polltc_eth0
polltc_ supports Munin's 'autoconfig' and 'config' commands, and when run without any arguments it will return values for the interface it is being run against, as expected by Munin.

10. Suggested Reading for Further Study

·         Traffic Control HOWTO
·         Iptables Tutorial
·         Intermediate Queuing Device
·         Traffic Control FAQ
·         L7 Filter mailing list archives
