linux多线程下载工具 axel 下载中止的问题
linux多线程下载工具 axel 下载中止的问题
1 axel是什么
axel是一个linux下的多线程下载工具, 官网 http://axel.alioth.debian.org/
2 我遇到的问题
$> axel -a -n 10 -s 409600 "myurl"
下载一段时间再无进度. 而且这种现象很难重现.
3 axel 源码的逻辑
-> 新分配的线程都停在了 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | v -> 显然是死锁了. | 为什么死锁? v -> 查看源码, 并没有使用锁机制, 整个代码中调用的线程函数有: | pthread_create() | pthread_join() | pthread_cancel() | pthread_setcancelstate() | pthread_setcanceltype() | v -> 说明futex()不是代码显示调用的, 可能是某个函数内部调用了 futex() | 究竟是哪个函数调用了 futex() 呢? v -> 跟踪执行成功的axel的线程: | 10363 16:28:41 <... clone resumed> child_stack=0xb6b4c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6b4cbd8, {entry_number:6, base_addr:0xb6b4cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6b4cbd8) = 10370 | 10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...> | 10370 16:28:41 <... set_robust_list resumed> ) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 0 | 10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...> | 10370 16:28:41 <... open resumed> ) = 4 | 10370 16:28:41 fstat64(4, <unfinished ...> | 10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000 | 10370 16:28:41 read(4, "nameserver 219.141.136.10\nnamese"..., 4096) = 52 | 10370 16:28:41 read(4, "", 4096) = 0 | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 munmap(0xb7711000, 4096) = 0 | 10370 16:28:41 uname({sys="Linux", node="201221021JM93X", ...}) = 0 | 10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4 | 10370 16:28:41 fstat64(4, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0 | 10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000 | 10370 16:28:41 read(4, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382 | 10370 16:28:41 read(4, "", 4096) = 0 | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 munmap(0xb7711000, 4096) = 0 | 10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4 | 10370 16:28:41 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, 16) = 0 | 10370 16:28:41 gettimeofday({1345624121, 428914}, NULL) = 0 | 10370 16:28:41 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}]) | 10370 16:28:41 send(4, "#\301\1\0\0\1\0\0\0\0\0\0\5cacti\6bokecc\3com\0\0\1"..., 34, MSG_NOSIGNAL) = 34 | 10370 16:28:41 poll([{fd=4, events=POLLIN}], 1, 5000 <unfinished ...> | 10370 16:28:41 <... poll resumed> ) = 1 ([{fd=4, revents=POLLIN}]) | 10370 16:28:41 ioctl(4, FIONREAD, [188]) = 0 | 10370 16:28:41 recvfrom(4, "#\301\201\200\0\1\0\1\0\2\0\6\5cacti\6bokecc\3com\0\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, [16]) = 188 | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 1 | 10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...> | 10370 16:28:41 <... socket resumed> ) = 6 | 10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("114.113.152.135")}, 16 <unfinished ...> | 10370 16:28:41 <... connect resumed> ) = 0 | 10370 16:28:41 gettimeofday( <unfinished ...> | 10370 16:28:41 <... gettimeofday resumed> {1345624121, 435907}, NULL) = 0 | 10370 16:28:41 write(6, "GET /test/test.flv HTTP/1.0\r\nHos"..., 116 <unfinished ...> v -> 发现在进入线程执行的第二个系统调用就是 futex() | 10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...> | 10370 16:28:41 <... set_robust_list resumed> ) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 0 | 10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...> | 10370 16:28:41 <... open resumed> ) = 4 | 10370 16:28:41 fstat64(4, <unfinished ...> | 10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000 | ... | 10370 16:28:41 close(4) = 0 | ... | 10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4 | ... | 10370 16:28:41 close(4) = 0 | ... | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 1 | 10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...> | 10370 16:28:41 <... socket resumed> ) = 6 | 10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("114.113.152.135")}, 16 <unfinished ...> v -> 很明显是 gethostbyname() 函数内部调用了 futex() | v -> 我记得老师讲课时说过 gethostbyname() 不是线程安全的. 会不会跟这有关系,试用 gethostbyname_r() 或 getaddrinfo() 替换再次测试 | 32559 16:39:39 clone(child_stack=0xb7357494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7357bd8, {entry_number:6, base_addr:0xb7357b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7357bd8) = 32619 | 32619 16:39:39 set_robust_list(0xb7357be0, 0xc <unfinished ...> | 32619 16:39:39 <... set_robust_list resumed> ) = 0 | 32619 16:39:39 open("/etc/resolv.conf", O_RDONLY <unfinished ...> | 32619 16:39:39 <... open resumed> ) = 4 | 32619 16:39:39 fstat64(4, <unfinished ...> | 32619 16:39:39 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...> | 32619 16:39:39 <... mmap2 resumed> ) = 0xb6355000 | 32619 16:39:39 read(4, <unfinished ...> | 32619 16:39:39 <... read resumed> "nameserver 219.141.136.10\nnamese"..., 4096) = 52 | 32619 16:39:39 read(4, <unfinished ...> | 32619 16:39:39 <... read resumed> "", 4096) = 0 | 32619 16:39:39 close(4 <unfinished ...> | 32619 16:39:39 <... close resumed> ) = 0 | 32619 16:39:39 munmap(0xb6355000, 4096 <unfinished ...> | 32619 16:39:39 <... munmap resumed> ) = 0 | 32619 16:39:39 uname( <unfinished ...> | 32619 16:39:39 <... uname resumed> {sys="Linux", node="201221021JM93X", ...}) = 0 | 32619 16:39:39 futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | 32619 16:39:39 <... futex resumed> ) = 0 | 32619 16:39:39 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 6 | 32619 16:39:39 fstat64(6, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0 | 32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7713000 | 32619 16:39:39 read(6, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382 | 32619 16:39:39 read(6, "", 4096) = 0 | 32619 16:39:39 close(6) = 0 | 32619 16:39:39 munmap(0xb7713000, 4096) = 0 | 32619 16:39:39 futex(0x98e1e8, FUTEX_WAKE_PRIVATE, 1) = 1 v -> 发现gethostbyname_r也会调用futex() | | 可能不是 gethostbyname() 或 gethostbyname_r() 的问题, 验证方法 <= 把 gethostbyname_r() 单独写一个程序然后strace跟踪不会产生futex()调用 | 之所以会有futex() 调用是因为 编译时连接了 pthread 库,但是死锁肯定和这里有关系, 也就是 线程执行到 gethostbyname() 内, 刚刚执行完 | futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> (加锁) | 还没来得及解锁就被 cancel 掉了, 其他线程执行到gethostbyname() 同样要加锁,由于没人解锁,所以一个个线程就阻塞到这儿直到20s超时,被主线程cancel掉, | 主线程会再次创建新的线程, 线程还是死锁在那儿, 再次被cancel,这样周而复始就产生了我们开头描述的axel下载一段时间再无进度的情况. v -> 由此可以推断导致开头描述现象的原因是: | 线程被cancel的不是时候,也就是在线程加锁后和解锁前不能被cancel, 显然这是由于线程函数中将cancel模式 | "pthread_setcanceltype( PTHREAD_CANCEL_DEFERRED, &oldstate );" | 设置为了 异步取消模式 导致的, 在<<UNIX环境高级编程>>第2版 12.7 取消选项(P333)一节中写到: | "异步取消和延迟取消不同, 试用异步取消时, 线程可以在任意时候被取消,而不是非得遇到取消点才能被取消" v -> 作者为什么这么做呢? | 我想是为了控制 线程处理中的 gethostbyname() 和 connect() 超时时间. 我们直到 connect()可以设置超时,但是 gethostbyname() 是没办法设置超时的. v -> 应该做怎样的修改呢 | 1 将 gethostbyname() 改为线程安全的函数 gethostbyname_r() | 2 将 线程处理函数中设置cancel模式 由 异步取消模式 改为 延迟取消模式 | 3 使用 非阻塞IO 和 select 实现 connect() 超时 10s 的设置. v -> gethostbyname_r() 函数的超时不用设置吗? | dns协议有自己的默认超时时间, 可通过 man 5 resolv.conf, 查看: | "the default is RES_TIMEOUT (currently 5, see <resolv.h>)" | 所以这里可以不考虑 gethostbyname_r() 的超时. v 结束
------
GS