《TCP/IP 卷一》笔记、ping和traceroute 的实现思路


Normally TCP does not send an ACK the instant it receives data. Instead, it delays the ACK, hoping to have data going in the same direction as the ACK, so the ACK can be sent along with the data. (This is sometimes called having the ACK piggyback with the data.) Most implementations use a 200-ms delay-that is, TCP will delay an ACK up to 200 ms to see if there is data to send with the ACK.

The Nagle algorithm  says that a TCP connection can have only one outstanding small segment that has not yet been acknowledged. No additional small segments can be sent until the acknowledgment is received. Instead, small amounts of data are collected by TCP and sent in a single segment when the acknowledgment arrives. The beauty of this algorithm is that it is self-clocking: the faster the ACKs come back, the faster the data is sent. But on a slow WAN, where it is desired to reduce the number of tinygrams, fewer segments are sent.

This shows that TCP can acknowledge received data before the application has read and processed that data. The TCP

acknowledgment just means TCP has correctly received the data. We also have an indication that the server process has not read these 3 bytes of data because the advertised window in the final segment is 8189, not 8192.

There are times when the Nagle algorithm needs to be turned off. The classic example is the X Window System server : small messages (mouse movements) must be delivered without delay to provide real-time feedback for interactive users doing certain operations.

The sockets API uses the TCP_NODELAY socket option to disable the Nagle algorithm.

Here we'll show another example that's easier to demonstrate-typing one of the terminal's special function keys during an interactive login. The function keys normally generate multiple bytes of data, often beginning with the ASCII escape character. If TCP gets the data 1 byte at a time, it's possible for it to send the first byte (the ASCII ESC) and then hold the remaining bytes of the sequence waiting for the ACK of this byte. But when the server receives this first byte it doesn't generate an echo until the remaining bytes are received. This often triggers the delayed ACK algorithm on the server, meaning that the remaining bytes aren't sent for up to 200 ms. This can lead to noticeable delays to the interactive user.


With TCP's sliding-window protocol the receiver does not have to acknowledge every received packet. With TCP, the ACKs are cumulative-they acknowledge that the receiver has correctly received all bytes up through the acknowledged sequence number minus one.

The ordering of the packets that we see on the wire depends on many factors, most of which we have no control over: the sending TCP implementation, the receiving TCP implementation, the reading of data by the receiving process (which depends on the process scheduling by the operating system), and the dynamics of the network (i.e., Ethernet collisions and backoffs). There is no single correct way for two TCPs to exchange a given amount of data.

The sockets API allows a process to set the sizes of the send buffer and the receive buffer. The size of the receive buffer is the maximum size of the advertised window (也就是Tcp头部的滑动窗口大小)for that connection. Some applications change the socket buffer sizes to increase performance.


In all the examples we've seen so far in this chapter, the sender starts off by injecting multiple segments into the network, up to the window size advertised by the receiver. While this is OK when the two hosts are on the same LAN, if there are routers and slower links between the sender and the receiver, problems can arise. Some intermediate router must queue the packets, and it's possible for that router to run out of space.

TCP is now required to support an algorithm called slow start. It operates by observing that the rate at which new packets should be injected into the network is the rate at which the acknowledgments are returned by the other end.

Slow start adds another window to the sender's TCP: the congestion window, called cwnd. When a new connection is established with a host on another network, the congestion window is initialized to one segment (i.e., the segment size announced by the other end).Each time an ACK is received, the congestion window is increased by one segment, (cwnd is maintained in bytes, but slow start always increments it by the segment size.) The sender can transmit up to the minimum of the congestion window and the advertised window. The congestion window is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver.

The sender starts by transmitting one segment and waiting for its ACK. When that ACK is received, the congestion window is incremented from one to two, and two segments can be sent. When each of those two segments is acknowledged, the congestion window is increased to four. This provides an exponential increase. At some point the capacity of the internet can be reached, and an intermediate router will start discarding packets. This tells the sender that its congestion window , we'll see how this is handled, and what happens to the congestion window. For now, let's watch slow start in action.


Congestion can occur when data arrives on a big pipe (a fast LAN) and gets sent out a smaller pipe (a slower WAN). Congestion can also occur when multiple input streams arrive at a router whose output capacity is less than the sum of the inputs.


TCP itself says little more about urgent data. There is no way to specify where the urgent data starts in the data stream. The only information sent across the connection by TCP is that urgent mode has begun (the URG bit in the TCP header) and the pointer to the last byte of urgent data. Everything else is left to the application.

Unfortunately many implementations incorrectly call TCP's urgent mode out-of-band data. If an application really wants a separate out-of-band channel, a second TCP connection is the easiest way to accomplish this. (Some transport layers do provide what most people consider true out-of-band data: a logically separate data path using the same connection as the normal data path. This is not what TCP provides.)

The confusion between TCP's urgent mode and out-of-band data is also because the predominant programming interface, the sockets API, maps TCP's urgent mode into what sockets calls out-of-band data.


TCP manages four different timers for each connection.

1. A retransmission timer is used when expecting an acknowledgment from the other end. 

2. A persist timer keeps window size information flowing even if the other end closes its receive window.

3. A keepalive timer detects when the other end on an otherwise idle connection crashes or reboots. 

4. A 2MSL timer measures the time a connection has been in the TIME_WAIT state. 


TCP calculates the round-trip time and then uses these measurements to keep track of a smoothed RTT estimator and a smoothed mean deviation estimator. These two estimators are then used to calculate the next retransmission timeout value. Many  implementations only measure a single RTT per window. Karn's algorithm removes the retransmission ambiguity problem by preventing us from measuring the RTT when a packet is lost.


Indeed, Berkeley-derived implementations count the number of duplicate ACKs received, and when the third one is received, assume that a segment has been lost and retransmit only one segment, starting with that sequence number. This is Jacobson's fast retransmit algorithm, which is followed by his fast recovery algorithm.


Slow start,  is the way to initiate data flow across a connection. But at some point we'll reach the limit of an intervening router, and packets can be dropped. Congestion avoidance is a way to deal with lost packets. 

The assumption of the algorithm is that packet loss caused by damage is very small (much less than 1%), therefore the loss of a packet signals congestion somewhere in the network between the source and destination. There are two indications of packet loss: a timeout occurring and the receipt of duplicate ACKs. ( If we are using a timeout as an indication of congestion, we can see the need for a good RTT algorithm(retransmission timeout (RTO)).)

Congestion avoidance and slow start are independent algorithms with different objectives. But when congestion occurs we want to slow down the transmission rate of packets into the network, and then invoke slow start to get things going again. In practice they are implemented together.

Congestion avoidance and slow start require that two variables be maintained for each connection: a congestion window, cwnd, and a slow start threshold size, ssthresh. The combined algorithm operates as follows:

1. Initialization for a given connection sets cwnd to one segment and ssthresh to 65535 bytes.

2. The TCP output routine never sends more than the minimum of cwnd and the receiver's advertised window.

Congestion avoidance is flow control imposed by the sender, while the advertised window is flow control imposed by the receiver. The former is based on the sender's assessment of perceived network congestion; the latter is related to the amount of available buffer space at the receiver for this connection.

3. When congestion occurs (indicated by a timeout or the reception of duplicate ACKs), one-half of the current window size (the minimum of cwnd and the receiver's advertised window, but at least two segments) is saved in ssthresh. Additionally, if the congestion is indicated by a timeout, cwnd is set to one segment (i.e., slow start).

4. When new data is acknowledged by the other end, we increase cwnd, but the way it increases depends on whether we're performing slow start or congestion avoidance. If cwnd is less than or equal to ssthresh, we're doing slow start; otherwise we're doing congestion avoidance. Slow start continues until we're halfway to where we were when congestion occurred (since we recorded half of the window size that got us into trouble in step 2), and then congestion avoidance takes over.

Slow start has cwnd start at one segment, and be incremented by one segment every time an ACK is received.This opens the window exponentially: send one segment, then two, then four, and so on.

Congestion avoidance dictates that cwnd be incremented by 1/ cwnd each time an ACK is received. This is an additive increase, compared to slow start's exponential increase. We want to increase cwnd by at most one segment each round-trip time (regardless how many ACKs are received in that RTT), whereas slow start will increment cwnd by the number of ACKs received in a round-trip time.

Before describing the change, realize that TCP is required to generate an immediate acknowledgment (a duplicate ACK) when an out-of-order segment is received. This duplicate ACK should not be delayed. The purpose of this duplicate ACK is to let the other end know that a segment was received out of order, and to tell it what sequence number is expected.


Since we don't know whether a duplicate ACK is caused by a lost segment or just a reordering of segments, we wait for a small number of duplicate ACKs to be received. It is assumed that if there is just a reordering of the segments, there will be only one or two duplicate ACKs before the reordered segment is processed, which will then generate a new ACK. If three or more duplicate ACKs are received in a row, it is a strong indication that a segment has been lost.  We then perform a retransmission of what appears to be the missing segment, without waiting for a retransmission timer to expire. This is the fast retransmit algorithm. Next, congestion avoidance, but not slow start is performed. This is the fast recovery algorithm.

1. When the third duplicate ACK is received, set ssthresh to one-half the current congestion window, cwnd. Retransmit the missing segment. Set cwnd to ssthresh plus 3 times the segment size.

2. Each time another duplicate ACK arrives, increment cwnd by the segment size and transmit a packet (if allowed by the new value of cwnd).

3. When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the value set in step 1). This should be the ACK of the retransmission from step 1, one round-trip time after the retransmission. Additionally, this ACK should acknowledge all the intermediate segments sent between the lost packet and the receipt of the first duplicate ACK. This step is congestion avoidance, since we're slowing down to onehalf the rate we were at when the packet was lost.


If an acknowledgment is lost, we could end up with both sides waiting for the other: the receiver waiting to receive data (since it provided the sender with a nonzero window) and the sender waiting to receive the window update allowing it to send. To prevent this form of deadlock from occurring the sender uses a persist timer that causes it to query the receiver periodically, to find out if the window has been increased. These segments from the sender are called window probes.

TCP's persist timer is set by one end of a connection when it has data to send, but has been stopped because the other end has advertised a zero-sized window. The sender keeps probing the closed window using a retransmission interval. This probing of the closed window continues indefinitely.


There are times, however, when a server wants to know if the client's host has either crashed and is down, or crashed and rebooted. The keepalive timer, a feature of many implementations, provides this capability.

If there is no activity on a given connection for 2 hours, the server sends a probe segmentto the client.

The client host must be in one of four states.

1. The client host is still up and running and reachable from the server. The client's TCP responds normally and the server knows that the other end is still up. The server's TCP will reset the keepalive timer for 2 hours in the future. If there is application traffic across the connection before the next 2-hour timer expires, the timer is reset for 2 hours in the future, following the exchange of data.

2. The client's host has crashed and is either down or in the process of rebooting. In either case, its TCP is not responding. The server will not receive a response to its probe and it times out after 75 seconds. The server sends a total of 10 of these probes, 75 seconds apart, and if it doesn't receive a response, the server considers the client's host as down and terminates the connection.

3. The client's host has crashed and rebooted. Here the server will receive a response to its keepalive probe, but the response will be a reset, causing the server to terminate the connection.

4. The client's host is up and running, but unreachable from the server. This is the same as scenario 2, because TCP can't distinguish between the two. All it can tell is that no replies are received to its probes.



(一)、ping 的实现

1. 相关ICMP协议概述   这里只讲解与ping有关的ICMP消息类型,主机发送回送消息(Type =8),被请求主机回送响应消息(Type= 0),基本格式如下:



其中•Code = 0, •CheckSum为校验和,重点注意从ICMP的头部(即Type开始),到data结束(即到整个数据包结束),具体计算见代码。 •Identifier为标识符,由主机设定,一般设置为进程号,回送响应消息与回送消息中identifier保持一致 •Sequence Number为序列号,由主机设定,一般设为由0递增的序列,回送响应消息与回送消息中Sequence Number保持一致 •data为数据,由主机设定,回送响应消息与回送消息中data保持一致

2. Ping流程 Ping实际上利用的就是ICMP ECHO和ICMP ECHOREPLY包来探测主机是否存在,所以Ping程序的流程十分简单:发送ICMP ECHO包---->接收ICMP ECHOREPLY包

发送ICMP ECHO包时填充Identifier为进程ID,Sequence Number为从0递增计数,data填充为发送时间 接收ICMP ECHOREPLY包时检查Identifier,Sequence Number是否正确,通过IP报头的源地址字段获得回送报文的主机地址是否正确

3. 模拟Ping实现pingy Ping的基本流程已讲解完

由于要自己构造ICMP包,因此创建需要创建原始套接字(即需要自己填充报头): sockfd= socket(AF_INET, SOCK_RAW, IPPROTO_ICMP) SOCK_RAW用于直接访问网络层,应用程序负责构造自己的协议首部;IPPROTO_ICMP表示ICMP报头由程序构造 构造ICMP报头,注意各个字段的填充,特别是校验和(可以参照icmp的结构定义) icmp->icmp_type icmp->icmp_code icmp->icmp_cksum icmp->icmp_id icmp->icmp_seq icmp->icmp_data;


程序实现主要参考 http://blog.csdn.net/qy532846454/article/details/5429700,修改了部分代码,现在可以ping 域名。



1. 相关ICMP协议概述 这里只讲解与tracert有关的ICMP消息类型,网关发送超时报文(type = 11),主机发送目标不可达报文(type = 3),基本格式如下: 超时报文

其中code = 0,表示由网关发送

这里提一下,当code = 1时表示是IP层组装分片超时。

The one we've been describing is generated when the TTL reaches 0, and is specified by a code of 0.

It's also possible for a host to send an ICMP "time exceeded during reassembly" when it timesout during the reassembly of a fragmented datagram. This error is specified by a code of 1.


其中code = 3,表示在目的主机,端口不可用

2. Traceroute流程 首先明确TTL是IP报头中的字段,TTL表示了数据包的time to live,即还能经由多少跳,所以TTL = 1表示数据包将在下个路由或主机被丢弃,并发送超时报文; 其次为了明确已到达主机,发送时目的端口设为非法端口(如58127),这样主机收到报文后会发送目标不可达报文。 下面是tracert的流程:      1> 构造UDP数据包,设置TTL = 1      2> 发送UDP数据包,记录发送时间t1      3> 接收ICMP差错包,如果是超时报文,则是经过的中间路由,记录路由信息,记录接收时间t2,计算时间(t2 - t1);如果是目标不可达报文,则抵达目的主机,记录接收时间t2,打印信息,退出      4> 构造UDP数据包,设置TTL += 1,返回第二步 其中,TTL的每个数值(如TTL = 1)发送3次UDP包,即重复2~3步3次;    接收超时,打印"*"表示报文丢失

程序实现主要参考http://blog.csdn.net/qy532846454/article/details/5443718, 但测试发现有点问题,故自己修改了部分代码,对返回的ICMP报文增加了一些类型判断,比如type为3的不可达报文实际上有16种情况,即使不是目的端口不可达code=3,其他不可达情况也得退出循环打印提示,否则就卡在那边了。如果不是ICMP类型报文则继续接收直到recvfrom超时,打印'*'。



《TCP/IP Illustrated Volume 1: The protocols》





0 条评论
登录 后参与评论


来自专栏Play & Scala 技术分享

PlayScala 2.5.x - Filter开发指南



继上两期的ActiveMQ的概念原理和经典示例之后,本期将隆重推出ActiveMQ是如何与SpringMVC进行集成使用的。 老规矩,在开始之前,先来点与之相关...





本文主要讲述如何解决HDFS运行Balancer提示“Another Balancer is running..”导致Balancer失败的问题。






当Linux服务器的TIME_WAIT过多时, 通常会想到去修改参数降低TIME_WAIT时长, 以减少TIME_WAIT数量,但Linux并没有提供这样...


WCF中的Binding模型之一: Binding模型简介

一、 信道层与服务模型层(Channel Layer and Service Mode Layer) 对于一个分布式应用的开发与设计来说,通信问题是不得不考虑,...



NSDI 2014年中,由斯坦福大学”SDN之父”,Nick Mckeown教授带领的实验室发表了题为《I Know What Yout Packet Did ...


Flash Actionscript 多线程Worker 压缩图片