From an email sent to the rtp-congestion list on October 13, 2011
Personal Opinion
-----------------------
Well, here's my honest opinion, formed over the last 15 months.
We can't really make our jitter buffers big enough to deliver decent
audio/video when bufferbloat is present, unless you like talking to
someone halfway to the moon (or further). Netalyzr shows the problem
in broadband, but our OSes and home routers are often even worse.
Even one TCP connection (moving big data) can induce severe latency on
a large fraction of the existing broadband infrastructure; as Windows XP
retires and more and more applications deploy (backup, etc.), I believe
we'll hurt more and more.
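That claim is easy to check at home. Here is a minimal sketch (my
illustration, with placeholder addresses, not anyone's shipping tool)
that times a tiny request/response through the bottleneck while a
single bulk TCP upload runs; on bloated gear the loaded figure is often
10-100x the idle one. It assumes a host just past your bottleneck
running the classic echo (7) and discard (9) services; substitute
whatever request/response pair you actually have.

    import socket, threading, time

    HOST = "192.0.2.1"   # hypothetical: a host just beyond your bottleneck link
    ECHO_PORT = 7        # assumes an echo service; any request/response works
    BULK_PORT = 9        # assumes a discard service to sink the bulk upload

    def rtt_ms():
        """Time one tiny request/response round trip through the bottleneck."""
        t0 = time.time()
        s = socket.create_connection((HOST, ECHO_PORT), timeout=5)
        s.sendall(b"x")
        s.recv(1)
        s.close()
        return (time.time() - t0) * 1000

    def bulk_upload(seconds=15):
        """One ordinary TCP connection moving big data, as in the text."""
        s = socket.create_connection((HOST, BULK_PORT))
        chunk = b"\0" * 65536
        end = time.time() + seconds
        while time.time() < end:
            s.sendall(chunk)      # fills the bottleneck's (bloated) buffer
        s.close()

    print("idle RTT:   %6.1f ms" % rtt_ms())
    t = threading.Thread(target=bulk_upload)
    t.start()
    time.sleep(3)                 # let the queue fill
    print("loaded RTT: %6.1f ms" % rtt_ms())
    t.join()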
It's impossible to make any servo system respond faster than the
round-trip time, and bufferbloat sometimes causes the RTT to go insane.
We can't forklift upgrade all the TCP implementations, which will
compete with low-latency audio/video. So "fixing" TCP isn't going to
happen fast enough to be useful. That doesn't mean it shouldn't happen,
just that it's a 5-15 year project to do so.
We do have to do congestion avoidance, at least as well as TCP would
(if bufferbloat weren't endemic).
Delay-based congestion avoidance algorithms are likely to lose when
competing with loss-based ones, as far as I understand. So the same
timescale problem applies as with "fixing" TCP.
So the conclusion I came to this time last year was that bufferbloat
was a disaster for the immersive teleconferencing I'm supposed to be
working on, and I switched to working solely on bufferbloat and getting
it fixed: to make any of this work well (which means not generating
service calls), we have to fix it.
Timestamps are *really* useful for detecting bufferbloat, and detecting
when people are suffering from it is key to making them aware of it and
motivated to fix it. I'd really like to be able to tell people what's
going on in a reliable way, to motivate them to at least fix the gear
under their control and/or put pressure on those they pay to provide
service. Identifying where the bottleneck at fault is located is key to
this.
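To make that concrete, here is a minimal sketch (my illustration, not a
spec) of how sender timestamps expose bufferbloat: stamp each probe
when it leaves, and at the receiver track how far the apparent one-way
delay drifts above the minimum seen. The clocks need not be
synchronized; queueing shows up as *growth* in the relative delay. The
port number is an arbitrary assumption. Run receiver() on one end and
sender("the.receiver.ip") on the other.

    import socket, struct, time

    PROBE_PORT = 5005        # hypothetical port for the probe stream

    def sender(dest):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        seq = 0
        while True:
            s.sendto(struct.pack("!Id", seq, time.time()), (dest, PROBE_PORT))
            seq += 1
            time.sleep(0.02)                  # 50 probes/second

    def receiver():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("", PROBE_PORT))
        base = None                           # minimum offset ~ empty queue
        while True:
            data, _ = s.recvfrom(64)
            seq, sent = struct.unpack("!Id", data)
            offset = time.time() - sent       # one-way delay + clock skew
            base = offset if base is None else min(base, offset)
            queue_ms = (offset - base) * 1000 # excess delay = queueing, i.e. bloat
            if queue_ms > 100:
                print("probe %d: ~%.0f ms of queueing delay" % (seq, queue_ms))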
So we have to provide "back pressure" into the economic system to get
people to fix the network. But trying to engineer around this entirely
is, I believe, futile and counter-productive: we have to fix the
Internet. Fixing the broadband edge will cost on the order of
$100/subscriber: this isn't an insane price to pay, as even one or two
service calls cost more than that.
Does this mean we're doomed?
-----------------------------------------
I hope not. I think there is going to have to be a multi-pronged attack
on the problem.
My sense is that the worst problem is in the home and on wireless
networks. As I can't work on wireless networks other than 802.11, I've
focused there. But in the home, courtesy of Linux being commonly used
in home routers, we have the ability to do a whole lot.
In the short/immediate term, mitigations are possible. My home network
now works tremendously better than it did a year ago, and yours can
immediately too, even with many existing home routers. But doing so is
probably beyond non-network wizards today.
The CeroWrt build of OpenWrt is a place where we're working on both
mitigations and solutions for bufferbloat in home routers (along with
other things that have really annoyed us about what we can buy
commercially). See: http://www.bufferbloat.net/news/19 . Please come
help out. The immediate mitigations include just tuning the router's
buffering to something more sensible.
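One of those immediate mitigations, sketched here assuming a Linux-based
router (the numbers are examples of the idea, not CeroWrt code): Linux's
default transmit queue of 1000 packets is roughly 12 seconds of queue at
a 1 Mbps uplink with 1500-byte frames, so simply shrinking it helps
enormously.

    import subprocess

    def shrink_txqueue(dev="eth0", qlen=16):
        # Equivalent to: ip link set dev eth0 txqueuelen 16
        subprocess.check_call(["ip", "link", "set", "dev", dev,
                               "txqueuelen", str(qlen)])

    shrink_txqueue()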
Over the next several months, we hope to start testing AQM algorithms.
Note that even the traditional 100ms "rule of thumb" for buffer sizing
(http://gettys.wordpress.com/2011/07/06/rant-warning-there-is-no-single-right-answer-for-buffering-ever/)
is still too high; we really need AQM in our home routers, our broadband
gear, and our operating systems. The long-standing telephony standard
for "good enough" latency is 150ms, and leaving the gate having already
lost 100ms isn't good; if both ends are congested, you are at 200ms +
the delay in the network.
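Spelling out that arithmetic (the 30ms path delay is just a figure I've
picked for illustration):

    BUDGET_MS = 150        # long-standing telephony bound for "good enough"
    per_end_queue = 100    # the 100ms rule-of-thumb queue, at each congested end
    path_ms = 30           # example propagation/switching delay (assumption)
    total = 2 * per_end_queue + path_ms
    print("mouth-to-ear: %d ms against a %d ms budget" % (total, BUDGET_MS))
    # 230 ms >> 150 ms: even "correctly" sized buffers at both ends blow
    # the budget before the network path itself adds anything.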
Now, if you are willing to bandwidth shape your broadband service
strongly, you can already do much better than 100ms today. That
requires you to tune your home router (if it is capable). I'll be
posting a more detailed "how to" entry in the blog sometime soon; but network
geeks should be able to hack what I wrote before at:
http://gettys.wordpress.com/2010/12/13/mitigations-and-solutions-of-bufferbloat-in-home-routers-and-operating-systems/
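For the impatient, the core of that mitigation looks something like the
sketch below (a rough rendering of the idea, not the recipe from that
post; the device name and rate are examples, and you should measure your
own uplink first): shape upstream traffic to a bit below the modem's
rate, so the queue builds in the router, where you control it, rather
than in the bloated modem.

    import subprocess

    def shape_upstream(dev="eth0", uplink_kbit=1000):
        rate = int(uplink_kbit * 0.85)  # ~85% of measured uplink keeps the modem's queue empty
        # Equivalent to: tc qdisc add dev eth0 root tbf rate 850kbit burst 10kb latency 50ms
        subprocess.check_call(["tc", "qdisc", "add", "dev", dev, "root", "tbf",
                               "rate", "%dkbit" % rate,
                               "burst", "10kb", "latency", "50ms"])

    shape_upstream()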
Your other strategy, as I'll outline in my how-to-ish document, is to
ensure your wireless bandwidth is *always* higher than your broadband
bandwidth, ensuring the bottleneck is at a point in the network you can
control. This still doesn't solve downstream transient bufferbloat from
badly behaved web sites, but I think it will stop upstream/downstream
elephant flows from killing you.
Another form of mitigation is getting the broadband buffering back
under control. That will get us back to the vicinity of the traditional
"rule of thumb" (see
http://gettys.wordpress.com/2011/07/13/progress-on-the-cable-front/),
which is a lot better than where we are now.
Since I wrote that, I've confirmed (most recently last week) that the
cable modem and CMTS changes are well under way; it appears deployment
will start sometime mid/late 2012. You may need to buy a new cable
modem when the time comes (though ones with the upgrade will probably
start shipping this year). I have no clue whether older existing cable
modems will ever see firmware upgrades, though I predict DOCSIS 2 modems
almost certainly will not. I am hopeful that the cable industry's
movement will eventually force mitigation into DSL and fiber as well.
But this just gets us back to the 100ms range (maybe worse, given
PowerBoost).
Obviously, if your network operator doesn't run AQM, then they should,
and you should help educate them.
Solutions
======
Solutions come in a number of forms.
We need AQM that works and is self-tuning. And we need it even in our
operating systems. The challenge here is that classic RED (1993) and
similar algorithms won't work in the face of highly variable bandwidth.
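To see why, here's a toy rendering of classic RED's drop decision (my
simplification, with made-up numbers): the thresholds are set in queue
occupancy, but the *delay* a given occupancy represents depends on the
instantaneous link rate, which on wireless can swing by two orders of
magnitude.

    def red_drop_prob(avg_q_bytes, min_th=5000, max_th=15000, max_p=0.02):
        """Classic RED: drop probability rises linearly between two thresholds."""
        if avg_q_bytes < min_th:
            return 0.0
        if avg_q_bytes >= max_th:
            return 1.0
        return max_p * (avg_q_bytes - min_th) / (max_th - min_th)

    # The same 15KB threshold is 1.2 ms of delay at 100 Mbps but 120 ms
    # at 1 Mbps; no static tuning is right across that range.
    for rate_mbps in (100, 10, 1):
        delay_ms = 15000 * 8 / (rate_mbps * 1e6) * 1000
        print("max_th=15KB is %.1f ms at %d Mbps" % (delay_ms, rate_mbps))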
Traffic classification can at best move who suffers when, but doesn't
fix the problem. I still want it, however. Doing it for real at the
broadband edge will be "interesting", as far as who classifies traffic
and how; today, you typically get only one queue (though the underlying
technologies will often support multiple queues).
Classification would also be really nice: but today, most broadband
systems have exactly one queue that you have access to. Carriers' VoIP
is generally provisioned separately; they have an (unintended, I
believe) fundamental advantage right now. It turns out that diffserv
has been discovered by (part of) the gaming industry, which noticed
that Linux's PFIFO-FAST queue discipline implements diffserv. So you
can get some help by marking traffic. Andrew McGregor had the
interesting idea that maybe the broadband headends could observe how
traffic is being marked and classify similarly in the downstream
direction.
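Marking from an application is just a socket option; here's a minimal
sketch (standard sockets API; whether anything beyond your own Linux box
honors the mark is exactly the open question):

    import socket

    EF = 46 << 2   # DSCP "Expedited Forwarding" in the upper six bits of TOS

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF)  # mark this socket's packets
    # On Linux, PFIFO-FAST maps these bits to its priority bands via its
    # priomap, so marked traffic can jump ahead of bulk flows on the host.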
Even with only one queue, at least we can control what happens in the
upstream direction (at least if we can keep the buffers from filling in
the broadband gear). In the short term, bandwidth shaping is our best
tool, and I'm working on other ideas as well.
Getting all the queues lined up is still going to take some effort,
between diffserv marking, 802.11 queues, ethernet queues, etc.
I also believe that we need the congestion exposure stuff going on in
the IETF in the long term, to provide disincentives for abuse of the
network, as well as proper accounting of congestion.
What should this group do?
================
I have not seen a way to really engineer around bufferbloat at the
application layer, nor even in the network stack. It's why I'm working
on bufferbloat rather than teleconferencing, which I was hired to work
on; if we don't fix that, we can't really succeed properly on the
teleconferencing front.
I believe therefore:
o work on the real-time applications problem should not stop in the
meanwhile; it is the compelling set of applications to motivate fixing
the Internet.
o exposing the bloat problem so that blame can be apportioned is
*really* important. Timestamps in RTP would help greatly in doing so.
Modern TCPs may have the TCP timestamp option turned on (I know modern
Linux systems do; see the quick check in the P.S. below), so I don't
know of anything needed there beyond ensuring the TCP information is
made available somehow, if it isn't already. Being able to reliably
tell people "the network is broken; you need to fix your OS/your
router/your broadband gear" is productive. And to deploy IPv6 we're
looking at deploying new home kit anyway.
o designing good congestion avoidance that will work in an unbroken,
unbloated network is clearly needed. But I don't think heroic
engineering around bufferbloat is worthwhile right now for RTP; that
effort is better put into the solutions outlined above, I think. Trying
to do so when we've already lost the war (teleconferencing isn't
interesting when talking halfway to the moon) is not productive, and
getting stable servo systems to work not just at the 100ms level but at
the multi-second level, when multi-second latency isn't even usable for
the application, is a waste. RTP == Real-Time Transport Protocol; when
the network is no longer real time, the name is an oxymoron.
o worrying about how to get diffserv actually usable (so that we can
classify at the broadband head end) seems worthwhile to me. I'd like to
get the web mice (transient bufferbloat) to not interfere with
audio/video traffic. I like Andrew McGregor's idea, but don't know if
it will hold water. That we can already expect diffserv to sort of work
in the upstream direction is good news; but we also need downstream to
work.
o come help on the home router problem; if you want teleconferencing
to really work well, it needs lots of TLC. And we have the ability to
not just write specs, but to demonstrate working code here.
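P.S. As promised above, a quick check that the TCP timestamp option
(RFC 1323) is enabled; this reads the standard Linux sysctl, so it's a
Linux-only sketch:

    def tcp_timestamps_enabled():
        # net.ipv4.tcp_timestamps: 0 = off, nonzero = on
        with open("/proc/sys/net/ipv4/tcp_timestamps") as f:
            return f.read().strip() != "0"

    print("TCP timestamps: %s" % ("on" if tcp_timestamps_enabled() else "off"))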