| author | Eric Dumazet <[email protected]> | 2014-09-26 06:04:56 +0000 |
|---|---|---|
| committer | David S. Miller <[email protected]> | 2014-09-29 04:04:55 +0000 |
| commit | 3d9a0d2f8212879407e58d67f460d8920eb6543d (patch) | |
| tree | 2f21d1f2173c017fabddec9009de8aba855b7c22 /net/ipv4/tcp_output.c | |
| parent | net_sched: fix another regression in cls_tcindex (diff) | |
dql: dql_queued() should write first to reduce bus transactions
While running a high-throughput test on a BQL-enabled NIC,
I found a very high cost in ndo_start_xmit() when accessing BQL data.
It turned out the problem was caused by the compiler trying to be
smart, which resulted in a bad MESI transaction:

     0.05 │  mov    0xc0(%rax),%edi    // LOAD dql->num_queued
     0.48 │  mov    %edx,0xc8(%rax)    // STORE dql->last_obj_cnt = count
    58.23 │  add    %edx,%edi
     0.58 │  cmp    %edi,0xc4(%rax)
     0.76 │  mov    %edi,0xc0(%rax)    // STORE dql->num_queued += count
     0.72 │  js     bd8

I got an incredible 10% gain [1] by making sure the cpu does not attempt
to get the cache line in Shared mode, but directly requests
ownership.
New code:

    mov    %edx,0xc8(%rax)    // STORE dql->last_obj_cnt = count
    add    %edx,0xc0(%rax)    // RMW dql->num_queued += count
    mov    0xc4(%rax),%ecx    // LOAD dql->adj_limit
    mov    0xc0(%rax),%edx    // LOAD dql->num_queued
    cmp    %edx,%ecx

The TX completion was running on another cpu, with a high interrupt
rate.
Note that I am using barrier() as a soft hint, as mb() here could be
too costly.
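
The store-first ordering described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel patch itself: `struct dql_sketch`, `dql_sketch_queued()`, `dql_sketch_over_limit()` and `compiler_barrier()` are hypothetical names, the three fields mirror those named in the listings, the barrier uses GCC/Clang inline-asm syntax, and the limit check is inferred from the `cmp`/`js` pair in the listing. Whether a given compiler emits the same instruction sequence is not guaranteed.

```c
#include <assert.h>
#include <stdbool.h>

/* Compiler-only barrier (GCC/Clang syntax): prevents the compiler from
 * moving memory accesses across this point; emits no CPU fence. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in holding the three fields named in the listings,
 * laid out so that they would normally share one cache line. */
struct dql_sketch {
	unsigned int num_queued;   /* total objects queued so far */
	unsigned int adj_limit;    /* limit compared against num_queued */
	unsigned int last_obj_cnt; /* size of the most recent enqueue */
};

/* Store to the cache line first, so the first access is a write rather
 * than a load. The barrier is only a hint that keeps the compiler from
 * hoisting the load of num_queued above the store to last_obj_cnt. */
static inline void dql_sketch_queued(struct dql_sketch *dql, unsigned int count)
{
	dql->last_obj_cnt = count;
	compiler_barrier();
	dql->num_queued += count;
}

/* Assumed limit check: true when num_queued has passed adj_limit. */
static inline bool dql_sketch_over_limit(const struct dql_sketch *dql)
{
	return (int)(dql->adj_limit - dql->num_queued) < 0;
}
```

A caller would invoke `dql_sketch_queued()` on the transmit path and then `dql_sketch_over_limit()` to decide whether to stop the queue; the completion path running on another cpu is what makes the initial state of the cache line matter.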
[1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
