You can subscribe to this list here.
1999 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
(8) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2000 |
Jan
(19) |
Feb
(11) |
Mar
(56) |
Apr
(31) |
May
(37) |
Jun
(21) |
Jul
(30) |
Aug
(31) |
Sep
(25) |
Oct
(60) |
Nov
(28) |
Dec
(57) |
2001 |
Jan
(47) |
Feb
(119) |
Mar
(279) |
Apr
(198) |
May
(336) |
Jun
(201) |
Jul
(136) |
Aug
(123) |
Sep
(123) |
Oct
(185) |
Nov
(66) |
Dec
(97) |
2002 |
Jan
(318) |
Feb
(101) |
Mar
(167) |
Apr
(233) |
May
(249) |
Jun
(134) |
Jul
(195) |
Aug
(99) |
Sep
(278) |
Oct
(435) |
Nov
(326) |
Dec
(325) |
2003 |
Jan
(214) |
Feb
(309) |
Mar
(142) |
Apr
(141) |
May
(210) |
Jun
(86) |
Jul
(133) |
Aug
(218) |
Sep
(315) |
Oct
(152) |
Nov
(162) |
Dec
(288) |
2004 |
Jan
(277) |
Feb
(267) |
Mar
(182) |
Apr
(168) |
May
(254) |
Jun
(131) |
Jul
(168) |
Aug
(177) |
Sep
(262) |
Oct
(309) |
Nov
(262) |
Dec
(255) |
2005 |
Jan
(258) |
Feb
(169) |
Mar
(282) |
Apr
(208) |
May
(262) |
Jun
(187) |
Jul
(207) |
Aug
(171) |
Sep
(283) |
Oct
(216) |
Nov
(307) |
Dec
(107) |
2006 |
Jan
(207) |
Feb
(82) |
Mar
(192) |
Apr
(165) |
May
(121) |
Jun
(108) |
Jul
(120) |
Aug
(126) |
Sep
(101) |
Oct
(216) |
Nov
(95) |
Dec
(125) |
2007 |
Jan
(176) |
Feb
(117) |
Mar
(240) |
Apr
(120) |
May
(81) |
Jun
(82) |
Jul
(62) |
Aug
(120) |
Sep
(103) |
Oct
(109) |
Nov
(181) |
Dec
(87) |
2008 |
Jan
(145) |
Feb
(69) |
Mar
(31) |
Apr
(98) |
May
(91) |
Jun
(43) |
Jul
(68) |
Aug
(135) |
Sep
(48) |
Oct
(18) |
Nov
(29) |
Dec
(16) |
2009 |
Jan
(26) |
Feb
(15) |
Mar
(83) |
Apr
(39) |
May
(23) |
Jun
(35) |
Jul
(11) |
Aug
(3) |
Sep
(11) |
Oct
(2) |
Nov
(28) |
Dec
(8) |
2010 |
Jan
(4) |
Feb
(40) |
Mar
(4) |
Apr
(46) |
May
(35) |
Jun
(46) |
Jul
(10) |
Aug
(4) |
Sep
(50) |
Oct
(70) |
Nov
(31) |
Dec
(24) |
2011 |
Jan
(17) |
Feb
(8) |
Mar
(35) |
Apr
(50) |
May
(75) |
Jun
(55) |
Jul
(72) |
Aug
(272) |
Sep
(10) |
Oct
(9) |
Nov
(11) |
Dec
(15) |
2012 |
Jan
(36) |
Feb
(49) |
Mar
(54) |
Apr
(47) |
May
(8) |
Jun
(82) |
Jul
(20) |
Aug
(50) |
Sep
(51) |
Oct
(20) |
Nov
(10) |
Dec
(25) |
2013 |
Jan
(34) |
Feb
(4) |
Mar
(24) |
Apr
(40) |
May
(101) |
Jun
(30) |
Jul
(55) |
Aug
(84) |
Sep
(53) |
Oct
(49) |
Nov
(61) |
Dec
(36) |
2014 |
Jan
(26) |
Feb
(22) |
Mar
(30) |
Apr
(4) |
May
(43) |
Jun
(33) |
Jul
(44) |
Aug
(61) |
Sep
(46) |
Oct
(154) |
Nov
(16) |
Dec
(12) |
2015 |
Jan
(18) |
Feb
(2) |
Mar
(122) |
Apr
(23) |
May
(56) |
Jun
(29) |
Jul
(35) |
Aug
(15) |
Sep
|
Oct
(45) |
Nov
(94) |
Dec
(38) |
2016 |
Jan
(50) |
Feb
(39) |
Mar
(39) |
Apr
(1) |
May
(14) |
Jun
(12) |
Jul
(19) |
Aug
(12) |
Sep
(9) |
Oct
(1) |
Nov
(13) |
Dec
(7) |
2017 |
Jan
(6) |
Feb
(1) |
Mar
(16) |
Apr
(5) |
May
(61) |
Jun
(18) |
Jul
(43) |
Aug
(1) |
Sep
(8) |
Oct
(25) |
Nov
(30) |
Dec
(6) |
2018 |
Jan
(5) |
Feb
(2) |
Mar
(25) |
Apr
(15) |
May
(2) |
Jun
(1) |
Jul
|
Aug
|
Sep
|
Oct
(1) |
Nov
|
Dec
|
2019 |
Jan
|
Feb
(2) |
Mar
|
Apr
(1) |
May
|
Jun
(1) |
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
From: Geert U. <ge...@li...> - 2017-10-09 20:35:00
|
On Mon, Oct 9, 2017 at 9:19 PM, Geert Uytterhoeven <ge...@li...> wrote: > JFYI, when comparing v4.14-rc4[1] to v4.14-rc3[3], the summaries are: > - build errors: +1/-2 + error: "devm_ioremap_resource" [drivers/auxdisplay/img-ascii-lcd.ko] undefined!: => N/A um-x86_64/um-all{yes,mod}config (patch available) > - build warnings: +817/-384 Thanks to the addition of um-x86_64 build coverage, a lot of warnings appeared about missing CRCs for memcpy in modules, e.g. + warning: "memcpy" [crypto/af_alg.ko] has no CRC!: => N/A > [1] http://kisskb.ellerman.id.au/kisskb/head/8a5776a5f49812d29fe4b2d0a2d71675c3facf3f/ (all 269 configs) > [3] http://kisskb.ellerman.id.au/kisskb/head/9e66317d3c92ddaab330c125dfe9d06eee268aff/ (267 out of 269 configs) Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@li... In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds |
From: Richard W. <ri...@no...> - 2017-10-08 10:43:50
|
Am Sonntag, 8. Oktober 2017, 12:31:58 CEST schrieb Thomas Meyer: > UMLs current_thread_info() unconditionally assumes that the top of the stack > contains the thread_info structure. But on UML the __sanitizer_cov_trace_pc > function is called for *all* functions! This results in an early crash: > > Prevent kcov from using invalid curent_thread_info() data by checking > the system_state. > > Signed-off-by: Thomas Meyer <th...@m3...> > --- > kernel/kcov.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/kernel/kcov.c b/kernel/kcov.c > index 3f693a0f6f3e..d601c0e956f6 100644 > --- a/kernel/kcov.c > +++ b/kernel/kcov.c > @@ -56,6 +56,12 @@ void notrace __sanitizer_cov_trace_pc(void) > struct task_struct *t; > enum kcov_mode mode; > > +#ifdef CONFIG_UML > + if(!(system_state == SYSTEM_SCHEDULING || > + system_state == SYSTEM_RUNNING)) > + return; > +#endif Hmm, and why does it work on all other archs then? Thanks, //richard |
From: <ant...@ko...> - 2017-10-06 21:34:02
|
From: Anton Ivanov <ant...@ca...> 1. TSO/GSO support where applicable or available RX - raw and tapraw TX - tap only (raw appears to be hitting a bug in the af_packet family in the kernel resulting in it being stuck in a -ENOBUFS loop. This results in TX/RX TCP performance ~ 2-3 times higher than qemu on same hardware (measured with iperf). 2. Cleanup and unification of the RX/TX code to use the same skb and msg prep routines. Adds two new transport arguments applicable to all transports gro - enable/disable GRO in driver vec - enable/disable multi-message vector IO 3. Adds change/set device features support. Gro,gso,gso,sg,etc can now be adjusted via ethtool. Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/drivers/vector_kern.c | 167 ++++++++++++++++++++++++++---------- arch/um/drivers/vector_kern.h | 1 + arch/um/drivers/vector_transports.c | 15 ++-- arch/um/drivers/vector_user.c | 5 +- 4 files changed, 135 insertions(+), 53 deletions(-) diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c index f0ea7f98b86c..268862d8f915 100644 --- a/arch/um/drivers/vector_kern.c +++ b/arch/um/drivers/vector_kern.c @@ -75,7 +75,7 @@ static void vector_eth_configure(int n, struct arglist *def); #define SAFETY_MARGIN 32 #define DEFAULT_VECTOR_SIZE 64 #define TX_SMALL_PACKET 128 -#define MAX_IOV_SIZE 8 +#define MAX_IOV_SIZE (MAX_SKB_FRAGS + 1) static const struct { const char string[ETH_GSTRING_LEN]; @@ -162,15 +162,45 @@ static int get_headroom(struct arglist *def) return DEFAULT_HEADROOM; } +static int get_req_size(struct arglist *def) +{ + char *gro = uml_vector_fetch_arg(def, "gro"); + long result; + + if (gro != NULL) { + if (kstrtoul(gro, 10, &result) == 0) { + if (result > 0) + return 65536; + } + } + return get_mtu(def) + ETH_HEADER_OTHER + get_headroom(def) + SAFETY_MARGIN; +} + + static int get_transport_options(struct arglist *def) { char *transport = uml_vector_fetch_arg(def, "transport"); + char *vector = uml_vector_fetch_arg(def, "vec"); + + int vec_rx = VECTOR_RX; + int vec_tx = VECTOR_TX; + long parsed; + + if (vector != NULL) { + if (kstrtoul(vector, 10, &parsed) == 0) { + if (parsed == 0) { + vec_rx = 0; + vec_tx = 0; + } + } + } + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) - return (VECTOR_RX | VECTOR_BPF); + return (vec_rx | VECTOR_BPF); if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) - return (VECTOR_TX | VECTOR_RX | VECTOR_BPF); - return (VECTOR_TX | VECTOR_RX); + return (vec_rx | vec_tx | VECTOR_BPF); + return (vec_rx | vec_tx); } @@ -547,13 +577,59 @@ static struct vector_queue *create_queue( * just read into a prepared queue filled with skbuffs. */ +static struct sk_buff *prep_skb(struct vector_private *vp, struct user_msghdr *msg) +{ + int linear = vp->max_packet + vp->headroom + SAFETY_MARGIN; + struct sk_buff *result; + int iov_index = 0, len; + struct iovec *iov = msg->msg_iov; + int err, nr_frags, frag; + skb_frag_t *skb_frag; + + if (vp->req_size <= linear) + len = linear; + else + len = vp->req_size; + result = alloc_skb_with_frags(linear, len - vp->max_packet, 3, &err, GFP_ATOMIC); + if (vp->header_size > 0) + iov_index++; + if (result == NULL) { + iov[iov_index].iov_base = NULL; + iov[iov_index].iov_len = 0; + goto done; + } + skb_reserve(result, vp->headroom); + result->dev = vp->dev; + skb_put(result, vp->max_packet); + result->data_len = len - vp->max_packet; + result->len += len - vp->max_packet; + skb_reset_mac_header(result); + result->ip_summed = CHECKSUM_NONE; + iov[iov_index].iov_base = result->data; + iov[iov_index].iov_len = vp->max_packet; + iov_index++; + + nr_frags = skb_shinfo(result)->nr_frags; + for (frag = 0; frag < nr_frags; frag++) { + skb_frag = &skb_shinfo(result)->frags[frag]; + iov[iov_index].iov_base = skb_frag_address_safe(skb_frag); + if (iov[iov_index].iov_base != NULL) + iov[iov_index].iov_len = skb_frag_size(skb_frag); + else + iov[iov_index].iov_len = 0; + iov_index++; + } +done: + msg->msg_iovlen = iov_index; + return result; +} + + /* Prepare queue for recvmmsg one-shot rx - fill with fresh sk_buffs*/ static void prep_queue_for_rx(struct vector_queue *qi) { struct vector_private *vp = netdev_priv(qi->dev); - struct sk_buff *skb; - struct iovec *iov; struct mmsghdr *mmsg_vector = qi->mmsg_vector; void **skbuff_vector = qi->skbuff_vector; int i; @@ -566,26 +642,7 @@ static void prep_queue_for_rx(struct vector_queue *qi) * This allows us stop faffing around with a "drop buffer" */ - skb = netdev_alloc_skb( - vp->dev, - vp->max_packet + vp->headroom + SAFETY_MARGIN); - iov = mmsg_vector->msg_hdr.msg_iov; - mmsg_vector->msg_len = 0; - if (vp->header_size > 0) - iov++; - if (skb != NULL) { - skb_reserve(skb, vp->headroom); - skb->dev = qi->dev; - skb_put(skb, vp->max_packet); - skb_reset_mac_header(skb); - skb->ip_summed = CHECKSUM_NONE; - iov->iov_base = skb->data; - iov->iov_len = vp->max_packet; - } else { - iov->iov_base = NULL; - iov->iov_len = 0; - } - *skbuff_vector = skb; + *skbuff_vector = prep_skb(vp, &mmsg_vector->msg_hdr); skbuff_vector++; mmsg_vector++; } @@ -738,7 +795,7 @@ static int vector_legacy_rx(struct vector_private *vp) { int pkt_len; struct user_msghdr hdr; - struct iovec iov[2]; /* header + data use case only */ + struct iovec iov[2 + MAX_IOV_SIZE]; /* header + data use case only */ int iovpos = 0; struct sk_buff *skb; int header_check; @@ -746,34 +803,25 @@ static int vector_legacy_rx(struct vector_private *vp) hdr.msg_name = NULL; hdr.msg_namelen = 0; hdr.msg_iov = (struct iovec *) &iov; - hdr.msg_iovlen = 1; hdr.msg_control = NULL; hdr.msg_controllen = 0; hdr.msg_flags = 0; if (vp->header_size > 0) { - iov[iovpos].iov_base = vp->header_rxbuffer; - iov[iovpos].iov_len = vp->rx_header_size; - hdr.msg_iovlen++; - iovpos++; + iov[0].iov_base = vp->header_rxbuffer; + iov[0].iov_len = vp->header_size; } - skb = netdev_alloc_skb(vp->dev, - vp->max_packet + vp->headroom + SAFETY_MARGIN); + skb = prep_skb(vp, &hdr); + if (skb == NULL) { /* Read a packet into drop_buffer and don't do * anything with it. */ iov[iovpos].iov_base = drop_buffer; iov[iovpos].iov_len = DROP_BUFFER_SIZE; + hdr.msg_iovlen = 1; vp->dev->stats.rx_dropped++; - } else { - skb_reserve(skb, vp->headroom); - skb->dev = vp->dev; - skb_put(skb, vp->max_packet); - skb_reset_mac_header(skb); - iov[iovpos].iov_base = skb->data; - iov[iovpos].iov_len = vp->max_packet; } pkt_len = uml_vector_recvmsg(vp->fds->rx_fd, &hdr, 0); @@ -794,7 +842,7 @@ static int vector_legacy_rx(struct vector_private *vp) skb->ip_summed = CHECKSUM_UNNECESSARY; } } - skb_trim(skb, pkt_len - vp->rx_header_size); + pskb_trim(skb, pkt_len - vp->rx_header_size); skb->protocol = eth_type_trans(skb, skb->dev); vp->dev->stats.rx_bytes += skb->len; vp->dev->stats.rx_packets++; @@ -898,7 +946,7 @@ static int vector_mmsg_rx(struct vector_private *vp) skb->ip_summed = CHECKSUM_UNNECESSARY; } } - skb_trim(skb, + pskb_trim(skb, mmsg_vector->msg_len - vp->rx_header_size); skb->protocol = eth_type_trans(skb, skb->dev); /* @@ -1109,7 +1157,7 @@ static int vector_net_open(struct net_device *dev) if ((vp->options & VECTOR_RX) > 0) { vp->rx_queue = create_queue( - vp, get_depth(vp->parsed), vp->rx_header_size, 0); + vp, get_depth(vp->parsed), vp->rx_header_size, MAX_IOV_SIZE); vp->rx_queue->queue_depth = get_depth(vp->parsed); } else { vp->header_rxbuffer = kmalloc(vp->rx_header_size, GFP_KERNEL); @@ -1200,6 +1248,30 @@ static void vector_net_tx_timeout(struct net_device *dev) schedule_work(&vp->reset_tx); } +static netdev_features_t vector_fix_features(struct net_device *dev, + netdev_features_t features) +{ + features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM); + return features; +} + +static int vector_set_features(struct net_device *dev, + netdev_features_t features) +{ + struct vector_private *vp = netdev_priv(dev); + /* Adjust buffer sizes for GSO/GRO. Unfortunately, there is + * no way to negotiate it on raw sockets, so we can change + * only our side. + */ + if (features & NETIF_F_GRO) + /* All new frame buffers will be GRO-sized */ + vp->req_size = 65536; + else + /* All new frame buffers will be normal sized */ + vp->req_size = vp->max_packet + vp->headroom + SAFETY_MARGIN; + return 0; +} + #ifdef CONFIG_NET_POLL_CONTROLLER static void vector_net_poll_controller(struct net_device *dev) { @@ -1303,6 +1375,8 @@ static const struct net_device_ops vector_netdev_ops = { .ndo_tx_timeout = vector_net_tx_timeout, .ndo_set_mac_address = eth_mac_addr, .ndo_validate_addr = eth_validate_addr, + .ndo_fix_features = vector_fix_features, + .ndo_set_features = vector_set_features, #ifdef CONFIG_NET_POLL_CONTROLLER .ndo_poll_controller = vector_net_poll_controller, #endif @@ -1394,10 +1468,11 @@ static void vector_eth_configure( .opened = false, .transport_data = NULL, .in_write_poll = false, - .coalesce = 2 + .coalesce = 2, + .req_size = get_req_size(def) }); - dev->features = NETIF_F_SG; + dev->features = dev->hw_features = (NETIF_F_SG | NETIF_F_FRAGLIST); tasklet_init(&vp->tx_poll, vector_tx_poll, (unsigned long)vp); INIT_WORK(&vp->reset_tx, vector_reset_tx); diff --git a/arch/um/drivers/vector_kern.h b/arch/um/drivers/vector_kern.h index a9ade0851fda..699696deb396 100644 --- a/arch/um/drivers/vector_kern.h +++ b/arch/um/drivers/vector_kern.h @@ -90,6 +90,7 @@ struct vector_private { void *transport_data; /* transport specific params if needed */ int max_packet; + int req_size; /* different from max packet - used for TSO */ int headroom; int options; diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c index 9f07d585f71b..57aa9cb5434c 100644 --- a/arch/um/drivers/vector_transports.c +++ b/arch/um/drivers/vector_transports.c @@ -187,9 +187,8 @@ static int raw_verify_header ( { struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; - if (vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) { - printk(KERN_ERR "raw: GSO enabled on interface, please turn off"); - return -1; /* GSO, we cannot process this */ + if ((vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) && (vp->req_size != 65536)) { + printk(KERN_INFO "Incoming GSO frames and GRO disabled on the interface"); } if ((vheader->flags & VIRTIO_NET_HDR_F_DATA_VALID) > 0) return 1; @@ -389,8 +388,9 @@ static int build_raw_transport_data(struct vector_private *vp) vp->verify_header = &raw_verify_header; vp->header_size = sizeof(struct virtio_net_hdr); vp->rx_header_size = sizeof(struct virtio_net_hdr); - vp->dev->features |= NETIF_F_HW_CSUM; /* TSO does not work on RAW */ - printk(KERN_INFO "raw: using vnet headers to offload checksum"); + vp->dev->hw_features |= (NETIF_F_GRO); /* TSO does not work on RAW */ + vp->dev->features |= (NETIF_F_RXCSUM | NETIF_F_HW_CSUM | NETIF_F_GRO); + printk(KERN_INFO "raw: using vnet headers for tso and tx/rx checksum"); } return 0; } @@ -402,7 +402,10 @@ static int build_tap_transport_data(struct vector_private *vp) vp->verify_header = &raw_verify_header; vp->header_size = sizeof(struct virtio_net_hdr); vp->rx_header_size = sizeof(struct virtio_net_hdr); - vp->dev->features |= (NETIF_F_HW_CSUM | NETIF_F_TSO); + vp->dev->hw_features |= (NETIF_F_TSO | NETIF_F_GSO | NETIF_F_GRO); + vp->dev->features |= + (NETIF_F_RXCSUM | NETIF_F_HW_CSUM | + NETIF_F_TSO | NETIF_F_GSO | NETIF_F_GRO_BIT); printk(KERN_INFO "tap/raw: using vnet headers for tso and tx/rx checksum"); } else { return 0; /* do not try to enable tap too if raw failed */ diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c index 9210bf2db569..259c3c639eab 100644 --- a/arch/um/drivers/vector_user.c +++ b/arch/um/drivers/vector_user.c @@ -115,7 +115,7 @@ static struct vector_fds *user_init_tap_fds(struct arglist *ifspec) struct ifreq ifr; int fd = -1; struct sockaddr_ll sock; - int err = -ENOMEM; + int err = -ENOMEM, offload; char *iface; struct vector_fds *result = NULL; @@ -153,6 +153,9 @@ static struct vector_fds *user_init_tap_fds(struct arglist *ifspec) goto tap_cleanup; } + offload = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6; + ioctl(fd, TUNSETOFFLOAD, offload); + /* RAW */ fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); -- 2.11.0 |
From: <ant...@ko...> - 2017-10-06 21:34:02
|
From: Anton Ivanov <ant...@ca...> 1. Provides infrastructure for vector IO using recvmmsg/sendmmsg. 1.1. Multi-message read. 1.2. Multi-message write. 1.3. Optimized queue support for multi-packet enqueue/dequeue. 1.4. BQL/DQL support. 2. Implements transports for several transports as well support for direct wiring of PWEs to NIC. Allows direct connection of VMs to host, other VMs and network devices with no switch in use. 2.1. Raw socket >4 times faster than existing pcap based transport 2.2. Tap transport using socket RX and tap xmit. >3 times faster RX than existing tap, 10+ times faster on TX. 2.3. GRE transport - direct wiring to GRE PWE, >3 times faster RX than any existing transport. 2.4. L2TPv3 transport - direct wiring to L2TPv3 PWE, > 3 times faster RX than any existing transport. 2.5. TX in all cases shows only minor improvement, but consumes significantly less CPU under load. 3. Tuning and performance related information via ethtool. 4. Rudimentary BPF support - used in tap only to avoid software looping 5. Scatter Gather support. 6. VNET and checksum offload support for raw socket transport. Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/Kconfig.net | 11 + arch/um/drivers/Makefile | 4 +- arch/um/drivers/net_kern.c | 4 +- arch/um/drivers/vector_kern.c | 1531 +++++++++++++++++++++++++++++++++++ arch/um/drivers/vector_kern.h | 128 +++ arch/um/drivers/vector_transports.c | 430 ++++++++++ arch/um/drivers/vector_user.c | 528 ++++++++++++ arch/um/drivers/vector_user.h | 90 ++ arch/um/include/asm/irq.h | 12 + arch/um/include/shared/net_kern.h | 2 + 10 files changed, 2737 insertions(+), 3 deletions(-) create mode 100644 arch/um/drivers/vector_kern.c create mode 100644 arch/um/drivers/vector_kern.h create mode 100644 arch/um/drivers/vector_transports.c create mode 100644 arch/um/drivers/vector_user.c create mode 100644 arch/um/drivers/vector_user.h diff --git a/arch/um/Kconfig.net b/arch/um/Kconfig.net index 820a56f00332..0516dc76e6aa 100644 --- a/arch/um/Kconfig.net +++ b/arch/um/Kconfig.net @@ -108,6 +108,17 @@ config UML_NET_DAEMON more than one without conflict. If you don't need UML networking, say N. +config UML_NET_VECTOR + bool "Vector I/O high performance network devices" + depends on UML_NET + help + This User-Mode Linux network driver uses multi-message send + and receive functions. The host running the UML guest must have + a linux kernel version above 3.0 and a libc version > 2.13. + This driver provides tap, raw, gre and l2tpv3 network transports + with up to 4 times higher network throughput than the UML network + drivers. + config UML_NET_VDE bool "VDE transport" depends on UML_NET diff --git a/arch/um/drivers/Makefile b/arch/um/drivers/Makefile index e7582e1d248c..16b3cebddafb 100644 --- a/arch/um/drivers/Makefile +++ b/arch/um/drivers/Makefile @@ -9,6 +9,7 @@ slip-objs := slip_kern.o slip_user.o slirp-objs := slirp_kern.o slirp_user.o daemon-objs := daemon_kern.o daemon_user.o +vector-objs := vector_kern.o vector_user.o vector_transports.o umcast-objs := umcast_kern.o umcast_user.o net-objs := net_kern.o net_user.o mconsole-objs := mconsole_kern.o mconsole_user.o @@ -43,6 +44,7 @@ obj-$(CONFIG_STDERR_CONSOLE) += stderr_console.o obj-$(CONFIG_UML_NET_SLIP) += slip.o slip_common.o obj-$(CONFIG_UML_NET_SLIRP) += slirp.o slip_common.o obj-$(CONFIG_UML_NET_DAEMON) += daemon.o +obj-$(CONFIG_UML_NET_VECTOR) += vector.o obj-$(CONFIG_UML_NET_VDE) += vde.o obj-$(CONFIG_UML_NET_MCAST) += umcast.o obj-$(CONFIG_UML_NET_PCAP) += pcap.o @@ -61,7 +63,7 @@ obj-$(CONFIG_BLK_DEV_COW_COMMON) += cow_user.o obj-$(CONFIG_UML_RANDOM) += random.o # pcap_user.o must be added explicitly. -USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o +USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o vector_user.o CFLAGS_null.o = -DDEV_NULL=$(DEV_NULL_PATH) include arch/um/scripts/Makefile.rules diff --git a/arch/um/drivers/net_kern.c b/arch/um/drivers/net_kern.c index 1669240c7a25..03cc5857a3b6 100644 --- a/arch/um/drivers/net_kern.c +++ b/arch/um/drivers/net_kern.c @@ -288,7 +288,7 @@ static void uml_net_user_timer_expire(unsigned long _conn) #endif } -static void setup_etheraddr(struct net_device *dev, char *str) +void uml_net_setup_etheraddr(struct net_device *dev, char *str) { unsigned char *addr = dev->dev_addr; char *end; @@ -412,7 +412,7 @@ static void eth_configure(int n, void *init, char *mac, */ snprintf(dev->name, sizeof(dev->name), "eth%d", n); - setup_etheraddr(dev, mac); + uml_net_setup_etheraddr(dev, mac); printk(KERN_INFO "Netdevice %d (%pM) : ", n, dev->dev_addr); diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c new file mode 100644 index 000000000000..f0ea7f98b86c --- /dev/null +++ b/arch/um/drivers/vector_kern.c @@ -0,0 +1,1531 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Copyright (C) 2001 Lennert Buytenhek (bu...@gn...) and + * James Leu (jl...@mi...). + * Copyright (C) 2001 by various other people who didn't put their name here. + * Licensed under the GPL. + */ + +#include <linux/version.h> +#include <linux/bootmem.h> +#include <linux/etherdevice.h> +#include <linux/ethtool.h> +#include <linux/inetdevice.h> +#include <linux/init.h> +#include <linux/list.h> +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/rtnetlink.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <linux/interrupt.h> +#include <init.h> +#include <irq_kern.h> +#include <irq_user.h> +#include <net_kern.h> +#include <os.h> +#include "mconsole_kern.h" +#include "vector_user.h" +#include "vector_kern.h" + +/* + * Adapted from network devices with the following major changes: + * All transports are static - simplifies the code significantly + * Multiple FDs/IRQs per device + * Vector IO optionally used for read/write, falling back to legacy + * based on configuration and/or availability + * Configuration is no longer positional - L2TPv3 and GRE require up to + * 10 parameters, passing this as positional is not fit for purpose. + * Only socket transports are supported + */ + + +#define DRIVER_NAME "uml-vector" +#define DRIVER_VERSION "01" +struct vector_cmd_line_arg { + struct list_head list; + int unit; + char *arguments; +}; + +struct vector_device { + struct list_head list; + struct net_device *dev; + struct platform_device pdev; + int unit; + int opened; +}; + +static LIST_HEAD(vec_cmd_line); + +static DEFINE_SPINLOCK(vector_devices_lock); +static LIST_HEAD(vector_devices); + +static int driver_registered; + +static void vector_eth_configure(int n, struct arglist *def); + +/* Argument accessors to set variables (and/or set default values) + * mtu, buffer sizing, default headroom, etc + */ + +#define DEFAULT_HEADROOM 2 +#define SAFETY_MARGIN 32 +#define DEFAULT_VECTOR_SIZE 64 +#define TX_SMALL_PACKET 128 +#define MAX_IOV_SIZE 8 + +static const struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { "rx_queue_max" }, + { "rx_queue_running_average" }, + { "tx_queue_max" }, + { "tx_queue_running_average" }, + { "rx_encaps_errors" }, + { "tx_timeout_count" }, + { "tx_restart_queue" }, + { "tx_kicks" }, + { "tx_flow_control_xon" }, + { "tx_flow_control_xoff" }, + { "rx_csum_offload_good" }, + { "rx_csum_offload_errors"}, + { "sg_ok"}, + { "sg_linearized"}, +}; + +#define VECTOR_NUM_STATS ARRAY_SIZE(ethtool_stats_keys) + +/* Used in the 4.4 OpenWRT backport */ + +#if LINUX_VERSION_CODE <= KERNEL_VERSION(4, 7, 0) +static inline void netif_trans_update(struct net_device *dev) +{ + struct netdev_queue *txq = netdev_get_tx_queue(dev, 0); + + if (txq->trans_start != jiffies) + txq->trans_start = jiffies; +} +#endif + +static void vector_reset_stats(struct vector_private *vp) +{ + vp->estats.rx_queue_max = 0; + vp->estats.rx_queue_running_average = 0; + vp->estats.tx_queue_max = 0; + vp->estats.tx_queue_running_average = 0; + vp->estats.rx_encaps_errors = 0; + vp->estats.tx_timeout_count = 0; + vp->estats.tx_restart_queue = 0; + vp->estats.tx_kicks = 0; + vp->estats.tx_flow_control_xon = 0; + vp->estats.tx_flow_control_xoff = 0; + vp->estats.sg_ok = 0; + vp->estats.sg_linearized = 0; +} + +static int get_mtu(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "mtu"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return ETH_MAX_PACKET; +} + +static int get_depth(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "depth"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_VECTOR_SIZE; +} + +static int get_headroom(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "headroom"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_HEADROOM; +} + +static int get_transport_options(struct arglist *def) +{ + char *transport = uml_vector_fetch_arg(def, "transport"); + + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return (VECTOR_RX | VECTOR_BPF); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return (VECTOR_TX | VECTOR_RX | VECTOR_BPF); + return (VECTOR_TX | VECTOR_RX); +} + + +/* A mini-buffer for packet drop read + * All of our supported transports are datagram oriented and we always + * read using recvmsg or recvmmsg. If we pass a buffer which is smaller + * than the packet size it still counts as full packet read and will + * clean the incoming stream to keep sigio/epoll happy + */ + +#define DROP_BUFFER_SIZE 32 + +static char *drop_buffer; + +/* Array backed queues optimized for bulk enqueue/dequeue and + * 1:N (small values of N) or 1:1 enqueuer/dequeuer ratios. + * For more details and full design rationale see + * http://foswiki.cambridgegreys.com/Main/EatYourTailAndEnjoyIt + */ + + +/* + * Advance the mmsg queue head by n = advance. Resets the queue to + * maximum enqueue/dequeue-at-once capacity if possible. Called by + * dequeuers. Caller must hold the head_lock! + */ + +static int vector_advancehead(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->head = + (qi->head + advance) + % qi->max_depth; + + + spin_lock(&qi->tail_lock); + qi->queue_depth -= advance; + + /* we are at 0, use this to + * reset head and tail so we can use max size vectors + */ + + if (qi->queue_depth == 0) { + qi->head = 0; + qi->tail = 0; + } + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +/* Advance the queue tail by n = advance. + * This is called by enqueuers which should hold the + * head lock already + */ + +static int vector_advancetail(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->tail = + (qi->tail + advance) + % qi->max_depth; + spin_lock(&qi->head_lock); + qi->queue_depth += advance; + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + return queue_depth; +} + +static int prep_msg(struct vector_private *vp, struct sk_buff *skb, struct iovec *iov) +{ + int iov_index = 0; + int nr_frags, frag; + skb_frag_t *skb_frag; + + nr_frags = skb_shinfo(skb)->nr_frags; + if (nr_frags > MAX_IOV_SIZE) { + if (skb_linearize(skb) != 0) + goto drop; + } + if (vp->header_size > 0) { + iov[iov_index].iov_len = vp->header_size; + vp->form_header(iov[iov_index].iov_base, skb, vp); + iov_index++; + } + iov[iov_index].iov_base = skb->data; + if (nr_frags > 0) { + iov[iov_index].iov_len = skb->len - skb->data_len; + vp->estats.sg_ok++; + } else + iov[iov_index].iov_len = skb->len; + iov_index++; + for (frag = 0; frag < nr_frags; frag++) { + skb_frag = &skb_shinfo(skb)->frags[frag]; + iov[iov_index].iov_base = skb_frag_address_safe(skb_frag); + iov[iov_index].iov_len = skb_frag_size(skb_frag); + iov_index++; + } + return iov_index; +drop: + return -1; +} +/* + * Generic vector enqueue with support for forming headers using transport + * specific callback. Allows GRE, L2TPv3, RAW and other transports + * to use a common enqueue procedure in vector mode + */ + +static int vector_enqueue(struct vector_queue *qi, struct sk_buff *skb) +{ + struct vector_private *vp = netdev_priv(qi->dev); + int queue_depth; + int packet_len; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + int iov_count; + + spin_lock(&qi->tail_lock); + spin_lock(&qi->head_lock); + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + + if (skb) + packet_len = skb->len; + + if (queue_depth < qi->max_depth) { + + *(qi->skbuff_vector + qi->tail) = skb; + mmsg_vector += qi->tail; + iov_count = prep_msg( + vp, + skb, + mmsg_vector->msg_hdr.msg_iov + ); + if (iov_count < 1) + goto drop; + mmsg_vector->msg_hdr.msg_iovlen = iov_count; + mmsg_vector->msg_hdr.msg_name = vp->fds->remote_addr; + mmsg_vector->msg_hdr.msg_namelen = vp->fds->remote_addr_size; + queue_depth = vector_advancetail(qi, 1); + } else + goto drop; + spin_unlock(&qi->tail_lock); + return queue_depth; +drop: + qi->dev->stats.tx_dropped++; + if (skb != NULL) { + packet_len = skb->len; + dev_consume_skb_any(skb); + netdev_completed_queue(qi->dev, 1, packet_len); + } + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +static int consume_vector_skbs(struct vector_queue *qi, int count) +{ + struct sk_buff *skb; + int skb_index; + int bytes_compl = 0; + + for (skb_index = qi->head; skb_index < qi->head + count; skb_index++) { + skb = *(qi->skbuff_vector + skb_index); + /* mark as empty to ensure correct destruction if + * needed + */ + bytes_compl += skb->len; + *(qi->skbuff_vector + skb_index) = NULL; + dev_consume_skb_any(skb); + } + qi->dev->stats.tx_bytes += bytes_compl; + qi->dev->stats.tx_packets += count; + netdev_completed_queue(qi->dev, count, bytes_compl); + return vector_advancehead(qi, count); +} + +/* + * Generic vector deque via sendmmsg with support for forming headers + * using transport specific callback. Allows GRE, L2TPv3, RAW and + * other transports to use a common dequeue procedure in vector mode + */ + + +static int vector_send(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *send_from; + int result = 0, send_len, queue_depth = qi->max_depth; + + if (spin_trylock(&qi->head_lock)) { + if (spin_trylock(&qi->tail_lock)) { + /* update queue_depth to current value */ + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + while (queue_depth > 0) { + /* Calculate the start of the vector */ + send_len = queue_depth; + send_from = qi->mmsg_vector; + send_from += qi->head; + /* Adjust vector size if wraparound */ + if (send_len + qi->head > qi->max_depth) + send_len = qi->max_depth - qi->head; + /* Try to TX as many packets as possible */ + if (send_len > 0) { + result = uml_vector_sendmmsg( + vp->fds->tx_fd, + send_from, + send_len, + 0 + ); + vp->in_write_poll = (result != send_len); + } + /* For some of the sendmmsg error scenarios + * we may end being unsure in the TX success + * for all packets. It is safer to declare + * them all TX-ed and blame the network. + */ + if (result < 0) { + printk(KERN_ERR "sendmmsg err=%i\n", + result); + result = send_len; + } + if (result > 0) { + queue_depth = consume_vector_skbs(qi, result); + /* This is equivalent to an TX IRQ. + * Restart the upper layers to feed us + * more packets. + */ + if (result > vp->estats.tx_queue_max) + vp->estats.tx_queue_max = result; + vp->estats.tx_queue_running_average = + (vp->estats.tx_queue_running_average + result) >> 1; + } + netif_trans_update(qi->dev); + netif_wake_queue(qi->dev); + /* if TX is busy, break out of the send loop, poll + * write IRQ will reschedule xmit for us + */ + if (result != send_len) { + vp->estats.tx_restart_queue++; + break; + } + } + } + spin_unlock(&qi->head_lock); + } else { + tasklet_schedule(&vp->tx_poll); + } + return queue_depth; +} + +/* Queue destructor. Deliberately stateless so we can use + * it in queue cleanup if initialization fails. + */ + +static void destroy_queue(struct vector_queue *qi) +{ + int i; + struct iovec *iov; + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *mmsg_vector; + + if (qi == NULL) + return; + /* deallocate any skbuffs - we rely on any unused to be + * set to NULL. + */ + if (qi->skbuff_vector != NULL) { + for (i = 0; i < qi->max_depth; i++) { + if (*(qi->skbuff_vector + i) != NULL) + dev_kfree_skb_any(*(qi->skbuff_vector + i)); + } + kfree(qi->skbuff_vector); + } + /* deallocate matching IOV structures including header buffs */ + if (qi->mmsg_vector != NULL) { + mmsg_vector = qi->mmsg_vector; + for (i = 0; i < qi->max_depth; i++) { + iov = mmsg_vector->msg_hdr.msg_iov; + if (iov != NULL) { + if ((vp->header_size > 0) && + (iov->iov_base != NULL)) + kfree(iov->iov_base); + kfree(iov); + } + mmsg_vector++; + } + kfree(qi->mmsg_vector); + } + kfree(qi); +} + +/* + * Queue constructor. Create a queue with a given side. + */ +static struct vector_queue *create_queue( + struct vector_private *vp, + int max_size, + int header_size, + int num_extra_frags) +{ + struct vector_queue *result; + int i; + struct iovec *iov; + struct mmsghdr *mmsg_vector; + + result = kmalloc(sizeof(struct vector_queue), GFP_KERNEL); + if (result == NULL) + goto out_fail; + result->max_depth = max_size; + result->dev = vp->dev; + result->mmsg_vector = kmalloc( + (sizeof(struct mmsghdr) * max_size), GFP_KERNEL); + result->skbuff_vector = kmalloc( + (sizeof(void *) * max_size), GFP_KERNEL); + if (result->mmsg_vector == NULL || result->skbuff_vector == NULL) + goto out_fail; + + mmsg_vector = result->mmsg_vector; + for (i = 0; i < max_size; i++) { + /* Clear all pointers - we use non-NULL as marking on + * what to free on destruction + */ + *(result->skbuff_vector + i) = NULL; + mmsg_vector->msg_hdr.msg_iov = NULL; + mmsg_vector++; + } + mmsg_vector = result->mmsg_vector; + result->max_iov_frags = num_extra_frags; + for (i = 0; i < max_size; i++) { + if (vp->header_size > 0) + iov = kmalloc(sizeof(struct iovec) * (3 + num_extra_frags), GFP_KERNEL); + else + iov = kmalloc(sizeof(struct iovec) * (2 + num_extra_frags), GFP_KERNEL); + if (iov == NULL) + goto out_fail; + mmsg_vector->msg_hdr.msg_iov = iov; + mmsg_vector->msg_hdr.msg_iovlen = 1; + mmsg_vector->msg_hdr.msg_control = NULL; + mmsg_vector->msg_hdr.msg_controllen = 0; + mmsg_vector->msg_hdr.msg_flags = MSG_DONTWAIT; + mmsg_vector->msg_hdr.msg_name = NULL; + mmsg_vector->msg_hdr.msg_namelen = 0; + if (vp->header_size > 0) { + iov->iov_base = kmalloc(header_size, GFP_KERNEL); + if (iov->iov_base == NULL) + goto out_fail; + iov->iov_len = header_size; + mmsg_vector->msg_hdr.msg_iovlen = 2; + iov++; + } + iov->iov_base = NULL; + iov->iov_len = 0; + mmsg_vector++; + } + spin_lock_init(&result->head_lock); + spin_lock_init(&result->tail_lock); + result->queue_depth = 0; + result->head = 0; + result->tail = 0; + return result; +out_fail: + destroy_queue(result); + return NULL; +} + +/* + * We do not use the RX queue as a proper wraparound queue for now + * This is not necessary because the consumption via netif_rx() + * happens in-line. While we can try using the return code of + * netif_rx() for flow control there are no drivers doing this today. + * For this RX specific use we ignore the tail/head locks and + * just read into a prepared queue filled with skbuffs. + */ + +/* Prepare queue for recvmmsg one-shot rx - fill with fresh sk_buffs*/ + +static void prep_queue_for_rx(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct sk_buff *skb; + struct iovec *iov; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int i; + + if (qi->queue_depth == 0) + return; + for (i = 0; i < qi->queue_depth; i++) { + /* it is OK if allocation fails - recvmmsg with NULL data in + * iov argument still performs an RX, just drops the packet + * This allows us stop faffing around with a "drop buffer" + */ + + skb = netdev_alloc_skb( + vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + iov = mmsg_vector->msg_hdr.msg_iov; + mmsg_vector->msg_len = 0; + if (vp->header_size > 0) + iov++; + if (skb != NULL) { + skb_reserve(skb, vp->headroom); + skb->dev = qi->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + skb->ip_summed = CHECKSUM_NONE; + iov->iov_base = skb->data; + iov->iov_len = vp->max_packet; + } else { + iov->iov_base = NULL; + iov->iov_len = 0; + } + *skbuff_vector = skb; + skbuff_vector++; + mmsg_vector++; + } + qi->queue_depth = 0; +} + +static struct vector_device *find_device(int n) +{ + struct vector_device *device; + struct list_head *ele; + + spin_lock(&vector_devices_lock); + list_for_each(ele, &vector_devices) { + device = list_entry(ele, struct vector_device, list); + if (device->unit == n) + goto out; + } + device = NULL; + out: + spin_unlock(&vector_devices_lock); + return device; +} + +static int vector_parse(char *str, int *index_out, char **str_out, + char **error_out) +{ + int n, len, err = -EINVAL; + char *start = str; + + len = strlen(str); + + while ((*str != ':') && (strlen(str) > 1)) + str++; + if (*str != ':') { + *error_out = "Expected ':' after device number"; + return err; + } else + *str = '\0'; + + err = kstrtouint(start, 0, &n); + if (err < 0) { + *error_out = "Bad device number"; + return err; + } + + str++; + if (find_device(n)) { + *error_out = "Device already configured"; + return err; + } + + *index_out = n; + *str_out = str; + return 0; +} + +static int vector_config(char *str, char **error_out) +{ + int err, n; + char *params; + struct arglist *parsed; + + err = vector_parse(str, &n, ¶ms, error_out); + if (err != 0) + return err; + + /* This string is broken up and the pieces used by the underlying + * driver. We should copy it to make sure things do not go wrong + * later. + */ + + params = kstrdup(params, GFP_KERNEL); + if (str == NULL) { + *error_out = "vector_config failed to strdup string"; + return -ENOMEM; + } + + parsed = uml_parse_vector_ifspec(params); + + if (parsed == NULL) { + *error_out = "vector_config failed to parse parameters"; + return -EINVAL; + } + + vector_eth_configure(n, parsed); + return 0; +} + +static int vector_id(char **str, int *start_out, int *end_out) +{ + char *end; + int n; + + n = simple_strtoul(*str, &end, 0); + if ((*end != '\0') || (end == *str)) + return -1; + + *start_out = n; + *end_out = n; + *str = end; + return n; +} + +static int vector_remove(int n, char **error_out) +{ + struct vector_device *vec_d; + struct net_device *dev; + struct vector_private *vp; + + vec_d = find_device(n); + if (vec_d == NULL) + return -ENODEV; + dev = vec_d->dev; + vp = netdev_priv(dev); + if (vp->fds != NULL) + return -EBUSY; + unregister_netdev(dev); + platform_device_unregister(&vec_d->pdev); + return 0; +} + +/* + * There is no shared per-transport initialization code, so + * we will just initialize each interface one by one and + * add them to a list + */ + +static struct platform_driver uml_net_driver = { + .driver = { + .name = DRIVER_NAME, + }, +}; + + +static void vector_device_release(struct device *dev) +{ + struct vector_device *device = dev_get_drvdata(dev); + struct net_device *netdev = device->dev; + + list_del(&device->list); + kfree(device); + free_netdev(netdev); +} + +/* Bog standard recv using recvmsg - not used normally unless the user + * explicitly specifies not to use recvmmsg vector RX. + */ + +static int vector_legacy_rx(struct vector_private *vp) +{ + int pkt_len; + struct user_msghdr hdr; + struct iovec iov[2]; /* header + data use case only */ + int iovpos = 0; + struct sk_buff *skb; + int header_check; + + hdr.msg_name = NULL; + hdr.msg_namelen = 0; + hdr.msg_iov = (struct iovec *) &iov; + hdr.msg_iovlen = 1; + hdr.msg_control = NULL; + hdr.msg_controllen = 0; + hdr.msg_flags = 0; + + if (vp->header_size > 0) { + iov[iovpos].iov_base = vp->header_rxbuffer; + iov[iovpos].iov_len = vp->rx_header_size; + hdr.msg_iovlen++; + iovpos++; + } + + skb = netdev_alloc_skb(vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + if (skb == NULL) { + /* Read a packet into drop_buffer and don't do + * anything with it. + */ + iov[iovpos].iov_base = drop_buffer; + iov[iovpos].iov_len = DROP_BUFFER_SIZE; + vp->dev->stats.rx_dropped++; + } else { + skb_reserve(skb, vp->headroom); + skb->dev = vp->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + iov[iovpos].iov_base = skb->data; + iov[iovpos].iov_len = vp->max_packet; + } + + pkt_len = uml_vector_recvmsg(vp->fds->rx_fd, &hdr, 0); + + if (skb != NULL) { + if (pkt_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + vp->header_rxbuffer, skb, vp); + if (header_check < 0) { + dev_kfree_skb_irq(skb); + vp->dev->stats.rx_dropped++; + vp->estats.rx_encaps_errors++; + return 0; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, pkt_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + dev_kfree_skb_irq(skb); + } + } + return pkt_len; +} + +/* + * Packet at a time TX which falls back to vector TX if the + * underlying transport is busy. + */ + + + +static int writev_tx(struct vector_private *vp, struct sk_buff *skb) +{ + struct iovec iov[3 + MAX_IOV_SIZE]; + int iov_count, pkt_len = 0; + + iov[0].iov_base = vp->header_txbuffer; + iov_count = prep_msg(vp, skb, (struct iovec *) &iov); + + if (iov_count < 1) + goto drop; + pkt_len = uml_vector_writev(vp->fds->tx_fd, (struct iovec *) &iov, iov_count); + + netif_trans_update(vp->dev); + netif_wake_queue(vp->dev); + + if (pkt_len > 0) { + vp->dev->stats.tx_bytes += skb->len; + vp->dev->stats.tx_packets++; + } else { + vp->dev->stats.tx_dropped++; + } + consume_skb(skb); + return pkt_len; +drop: + vp->dev->stats.tx_dropped++; + consume_skb(skb); + return pkt_len; +} + +/* + * Receive as many messages as we can in one call using the special + * mmsg vector matched to an skb vector which we prepared earlier. + */ + +static int vector_mmsg_rx(struct vector_private *vp) +{ + int packet_count, i; + struct vector_queue *qi = vp->rx_queue; + struct sk_buff *skb; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int header_check; + + /* Refresh the vector and make sure it is with new skbs and the + * iovs are updated to point to them. + */ + + prep_queue_for_rx(qi); + + /* Fire the Lazy Gun - get as many packets as we can in one go. */ + + packet_count = uml_vector_recvmmsg( + vp->fds->rx_fd, qi->mmsg_vector, qi->max_depth, 0); + + if (packet_count <= 0) + return packet_count; + + /* We treat packet processing as enqueue, buffer refresh as dequeue + * The queue_depth tells us how many buffers have been used and how + * many do we need to prep the next time prep_queue_for_rx() is called. + */ + + qi->queue_depth = packet_count; + + for (i = 0; i < packet_count; i++) { + skb = (*skbuff_vector); + if (mmsg_vector->msg_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + mmsg_vector->msg_hdr.msg_iov->iov_base, skb, vp); + if (header_check < 0) { + /* Overlay header failed to verify - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too + * complex. + */ + dev_kfree_skb_irq(skb); + vp->estats.rx_encaps_errors++; + continue; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, + mmsg_vector->msg_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + /* + * We do not need to lock on updating stats here + * The interrupt loop is non-reentrant. + */ + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + /* Overlay header too short to do anything - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too complex. + */ + if (skb != NULL) + dev_kfree_skb_irq(skb); + } + (*skbuff_vector) = NULL; + /* Move to the next buffer element */ + mmsg_vector++; + skbuff_vector++; + } + if (packet_count > 0) { + if (vp->estats.rx_queue_max < packet_count) + vp->estats.rx_queue_max = packet_count; + vp->estats.rx_queue_running_average = + (vp->estats.rx_queue_running_average + packet_count) >> 1; + } + return packet_count; +} + +static void vector_rx(struct vector_private *vp) +{ + int err; + + if ((vp->options & VECTOR_RX) > 0) + while ((err = vector_mmsg_rx(vp)) > 0) + ; + else + while ((err = vector_legacy_rx(vp)) > 0) + ; + if (err != 0) + printk(KERN_ERR "vector_rx: error(%d)\n", err); +} + +static int vector_net_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + int queue_depth = 0; + + if ((vp->options & VECTOR_TX) == 0) { + writev_tx(vp, skb); + return NETDEV_TX_OK; + } + + /* We do BQL only in the vector path, no point doing it in + * packet at a time mode as there is no device queue + */ + + netdev_sent_queue(vp->dev, skb->len); + queue_depth = vector_enqueue(vp->tx_queue, skb); + + /* if the device queue is full, stop the upper layers and + * flush it. + */ + + if (queue_depth >= vp->tx_queue->max_depth - 1) { + vp->estats.tx_kicks++; + netif_stop_queue(dev); + vector_send(vp->tx_queue); + return NETDEV_TX_OK; + } + if (skb->xmit_more) { + mod_timer(&vp->tl, vp->coalesce); + return NETDEV_TX_OK; + } + if (skb->len < TX_SMALL_PACKET) { + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); + } else + tasklet_schedule(&vp->tx_poll); + return NETDEV_TX_OK; +} + +static irqreturn_t vector_rx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + vector_rx(vp); + return IRQ_HANDLED; + +} + +static irqreturn_t vector_tx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + /* We need to pay attention to it only if we got + * -EAGAIN or -ENOBUFFS from sendmmsg. Otherwise + * we ignore it. In the future, it may be worth + * it to improve the IRQ controller a bit to make + * tweaking the IRQ mask less costly + */ + + if (vp->in_write_poll) { + tasklet_schedule(&vp->tx_poll); + } + return IRQ_HANDLED; + +} + +static int irq_rr; + +static int vector_net_close(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + + netif_stop_queue(dev); + del_timer(&vp->tl); + + if (vp->fds == NULL) + return 0; + + /* Disable and free all IRQS */ + if (vp->rx_irq > 0) { + um_free_irq(vp->rx_irq, dev); + vp->rx_irq = 0; + } + if (vp->tx_irq > 0) { + um_free_irq(vp->tx_irq, dev); + vp->tx_irq = 0; + } + tasklet_kill(&vp->tx_poll); + if (vp->fds->rx_fd > 0) { + os_close_file(vp->fds->rx_fd); + vp->fds->rx_fd = -1; + } + if (vp->fds->tx_fd > 0) { + os_close_file(vp->fds->tx_fd); + vp->fds->tx_fd = -1; + } + if (vp->bpf != NULL) + kfree(vp->bpf); + if (vp->fds->remote_addr != NULL) + kfree(vp->fds->remote_addr); + if (vp->transport_data != NULL) + kfree(vp->transport_data); + if (vp->header_rxbuffer != NULL) + kfree(vp->header_rxbuffer); + if (vp->header_txbuffer != NULL) + kfree(vp->header_txbuffer); + if (vp->rx_queue != NULL) + destroy_queue(vp->rx_queue); + if (vp->tx_queue != NULL) + destroy_queue(vp->tx_queue); + kfree(vp->fds); + vp->fds = NULL; + spin_lock_irqsave(&vp->lock, flags); + vp->opened = false; + spin_unlock_irqrestore(&vp->lock, flags); + return 0; +} + +/* TX tasklet */ + +static void vector_tx_poll(unsigned long data) +{ + struct vector_private *vp = (struct vector_private *)data; + + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} +static void vector_reset_tx(struct work_struct *work) +{ + struct vector_private *vp = + container_of(work, struct vector_private, reset_tx); + netdev_reset_queue(vp->dev); + netif_start_queue(vp->dev); + netif_wake_queue(vp->dev); +} +static int vector_net_open(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + int err = -EINVAL; + struct vector_device *vdevice; + + spin_lock_irqsave(&vp->lock, flags); + if (vp->opened) + return -ENXIO; + vp->opened = true; + spin_unlock_irqrestore(&vp->lock, flags); + + vp->fds = uml_vector_user_open(vp->unit, vp->parsed); + + if (vp->fds == NULL) + goto out_close; + + if (build_transport_data(vp) < 0) + goto out_close; + + if ((vp->options & VECTOR_RX) > 0) { + vp->rx_queue = create_queue( + vp, get_depth(vp->parsed), vp->rx_header_size, 0); + vp->rx_queue->queue_depth = get_depth(vp->parsed); + } else { + vp->header_rxbuffer = kmalloc(vp->rx_header_size, GFP_KERNEL); + if (vp->header_rxbuffer == NULL) + goto out_close; + } + if ((vp->options & VECTOR_TX) > 0) { + vp->tx_queue = create_queue( + vp, get_depth(vp->parsed), vp->header_size, MAX_IOV_SIZE); + } else { + vp->header_txbuffer = kmalloc(vp->header_size, GFP_KERNEL); + if (vp->header_txbuffer == NULL) + goto out_close; + } + + /* READ IRQ */ + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->rx_fd, + IRQ_READ, vector_rx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR "vector_open: failed to get rx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->rx_irq = irq_rr + VECTOR_BASE_IRQ; + dev->irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + + /* WRITE IRQ - we need it only if we have vector TX */ + if ((vp->options & VECTOR_TX) > 0) { + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->tx_fd, + IRQ_WRITE, vector_tx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR + "vector_open: failed to get tx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->tx_irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + } + + if ((vp->options & VECTOR_BPF) != 0) { + vp->bpf = uml_vector_default_bpf(vp->fds->rx_fd, dev->dev_addr); + } + + + /* Write Timeout Timer */ + + vp->tl.data = (unsigned long) vp; + netif_start_queue(dev); + + /* clear buffer - it can happen that the host side of the interface + * is full when we get here. In this case, new data is never queued, + * SIGIOs never arrive, and the net never works. + */ + + vector_rx(vp); + + vector_reset_stats(vp); + vdevice = find_device(vp->unit); + vdevice->opened = 1; + + if ((vp->options & VECTOR_TX) != 0) { + add_timer(&vp->tl); + } + return 0; +out_close: + vector_net_close(dev); + return err; +} + + +static void vector_net_set_multicast_list(struct net_device *dev) +{ + /* TODO: - we can do some BPF games here */ + return; +} + +static void vector_net_tx_timeout(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + vp->estats.tx_timeout_count++; + netif_trans_update(dev); + schedule_work(&vp->reset_tx); +} + +#ifdef CONFIG_NET_POLL_CONTROLLER +static void vector_net_poll_controller(struct net_device *dev) +{ + disable_irq(dev->irq); + uml_net_interrupt(dev->irq, dev); + enable_irq(dev->irq); +} +#endif + +static void vector_net_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *info) +{ + strlcpy(info->driver, DRIVER_NAME, sizeof(info->driver)); + strlcpy(info->version, DRIVER_VERSION, sizeof(info->version)); +} + +static void vector_get_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct vector_private *vp = netdev_priv(netdev); + + ring->rx_max_pending = vp->rx_queue->max_depth; + ring->tx_max_pending = vp->tx_queue->max_depth; + ring->rx_pending = vp->rx_queue->max_depth; + ring->tx_pending = vp->tx_queue->max_depth; +} + +static void vector_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch (stringset) { + case ETH_SS_TEST: + *buf = '\0'; + break; + case ETH_SS_STATS: + memcpy(buf, ðtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + default: + WARN_ON(1); + break; + } +} + +static int vector_get_sset_count(struct net_device *dev, int sset) +{ + switch (sset) { + case ETH_SS_TEST: + return 0; + case ETH_SS_STATS: + return VECTOR_NUM_STATS; + default: + return -EOPNOTSUPP; + } +} + +static void vector_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *estats, u64 *tmp_stats) +{ + struct vector_private *vp = netdev_priv(dev); + + memcpy(tmp_stats, &vp->estats, sizeof(struct vector_estats)); +} + +static int vector_get_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + ec->tx_coalesce_usecs = (vp->coalesce * 1000000) / HZ; + return 0; +} + +static int vector_set_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + vp->coalesce = (ec->tx_coalesce_usecs * HZ) / 1000000; + if (vp->coalesce == 0) + vp->coalesce = 1; + return 0; +} + +static const struct ethtool_ops vector_net_ethtool_ops = { + .get_drvinfo = vector_net_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_ts_info = ethtool_op_get_ts_info, + .get_ringparam = vector_get_ringparam, + .get_strings = vector_get_strings, + .get_sset_count = vector_get_sset_count, + .get_ethtool_stats = vector_get_ethtool_stats, + .get_coalesce = vector_get_coalesce, + .set_coalesce = vector_set_coalesce, +}; + + +static const struct net_device_ops vector_netdev_ops = { + .ndo_open = vector_net_open, + .ndo_stop = vector_net_close, + .ndo_start_xmit = vector_net_start_xmit, + .ndo_set_rx_mode = vector_net_set_multicast_list, + .ndo_tx_timeout = vector_net_tx_timeout, + .ndo_set_mac_address = eth_mac_addr, + .ndo_validate_addr = eth_validate_addr, +#ifdef CONFIG_NET_POLL_CONTROLLER + .ndo_poll_controller = vector_net_poll_controller, +#endif +}; + + +static void vector_timer_expire(unsigned long _conn) +{ + struct vector_private *vp = (struct vector_private *)_conn; + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} + +static void vector_eth_configure( + int n, + struct arglist *def + ) +{ + struct vector_device *device; + struct net_device *dev; + struct vector_private *vp; + int err; + + printk(KERN_INFO "Vector Netdevice %d\n", n); + + device = kzalloc(sizeof(*device), GFP_KERNEL); + if (device == NULL) { + printk(KERN_ERR "eth_configure failed to allocate struct " + "vector_device\n"); + return; + } + printk(KERN_INFO "vector etherdevice start %d\n", n); + dev = alloc_etherdev(sizeof(struct vector_private)); + if (dev == NULL) { + printk(KERN_ERR "eth_configure: failed to allocate struct " + "net_device for vec%d\n", n); + goto out_free_device; + } + + dev->mtu = get_mtu(def); + + INIT_LIST_HEAD(&device->list); + device->unit = n; + + /* If this name ends up conflicting with an existing registered + * netdevice, that is OK, register_netdev{,ice}() will notice this + * and fail. + */ + snprintf(dev->name, sizeof(dev->name), "vec%d", n); + uml_net_setup_etheraddr(dev, uml_vector_fetch_arg(def, "mac")); + vp = netdev_priv(dev); + + /* sysfs register */ + if (!driver_registered) { + platform_driver_register(¨_net_driver); + driver_registered = 1; + } + device->pdev.id = n; + device->pdev.name = DRIVER_NAME; + device->pdev.dev.release = vector_device_release; + dev_set_drvdata(&device->pdev.dev, device); + if (platform_device_register(&device->pdev)) + goto out_free_netdev; + SET_NETDEV_DEV(dev, &device->pdev.dev); + + device->dev = dev; + + *vp = ((struct vector_private) + { + .list = LIST_HEAD_INIT(vp->list), + .dev = dev, + .unit = n, + .options = get_transport_options(def), + .rx_irq = 0, + .tx_irq = 0, + .parsed = def, + .max_packet = get_mtu(def) + ETH_HEADER_OTHER, + /* TODO - we need to calculate headroom so that ip header + * is 16 byte aligned all the time + */ + .headroom = get_headroom(def), + .form_header = NULL, + .verify_header = NULL, + .header_rxbuffer = NULL, + .header_txbuffer = NULL, + .header_size = 0, + .rx_header_size = 0, + .rexmit_scheduled = false, + .opened = false, + .transport_data = NULL, + .in_write_poll = false, + .coalesce = 2 + }); + + dev->features = NETIF_F_SG; + tasklet_init(&vp->tx_poll, vector_tx_poll, (unsigned long)vp); + INIT_WORK(&vp->reset_tx, vector_reset_tx); + + init_timer(&vp->tl); + spin_lock_init(&vp->lock); + vp->tl.function = vector_timer_expire; + + /* FIXME */ + dev->netdev_ops = &vector_netdev_ops; + dev->ethtool_ops = &vector_net_ethtool_ops; + dev->watchdog_timeo = (HZ >> 1); + /* primary IRQ - fixme */ + dev->irq = 0; /* we will adjust this once opened */ + + rtnl_lock(); + err = register_netdevice(dev); + rtnl_unlock(); + if (err) + goto out_undo_user_init; + + spin_lock(&vector_devices_lock); + list_add(&device->list, &vector_devices); + spin_unlock(&vector_devices_lock); + + return; + +out_undo_user_init: + return; +out_free_netdev: + free_netdev(dev); +out_free_device: + kfree(device); +} + + + + +/* + * Invoked late in the init + */ + +static int __init vector_init(void) +{ + struct list_head *ele; + struct vector_cmd_line_arg *def; + struct arglist *parsed; + + list_for_each(ele, &vec_cmd_line) { + def = list_entry(ele, struct vector_cmd_line_arg, list); + parsed = uml_parse_vector_ifspec(def->arguments); + if (parsed != NULL) + vector_eth_configure(def->unit, parsed); + } + return 0; +} + + +/* Invoked at initial argument parsing, only stores + * arguments until a proper vector_init is called + * later + */ + +static int __init vector_setup(char *str) +{ + char *error; + int n, err; + struct vector_cmd_line_arg *new; + + err = vector_parse(str, &n, &str, &error); + if (err) { + printk(KERN_ERR "vector_setup - Couldn't parse '%s' : %s\n", + str, error); + return 1; + } + new = alloc_bootmem(sizeof(*new)); + INIT_LIST_HEAD(&new->list); + new->unit = n; + new->arguments = str; + list_add_tail(&new->list, &vec_cmd_line); + return 1; +} + +__setup("vec", vector_setup); +__uml_help(vector_setup, +"vec[0-9]+:<option>=<value>,<option>=<value>\n" +" Configure a vector io network device.\n\n" +); + +late_initcall(vector_init); + +static struct mc_device vector_mc = { + .list = LIST_HEAD_INIT(vector_mc.list), + .name = "vec", + .config = vector_config, + .get_config = NULL, + .id = vector_id, + .remove = vector_remove, +}; + +#ifdef CONFIG_INET +static int vector_inetaddr_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + return NOTIFY_DONE; +} + +static struct notifier_block vector_inetaddr_notifier = { + .notifier_call = vector_inetaddr_event, +}; + +static void inet_register(void) +{ + register_inetaddr_notifier(&vector_inetaddr_notifier); +} +#else +static inline void inet_register(void) +{ +} +#endif + +static int vector_net_init(void) +{ + mconsole_register_dev(&vector_mc); + inet_register(); + return 0; +} + +__initcall(vector_net_init); + + + diff --git a/arch/um/drivers/vector_kern.h b/arch/um/drivers/vector_kern.h new file mode 100644 index 000000000000..a9ade0851fda --- /dev/null +++ b/arch/um/drivers/vector_kern.h @@ -0,0 +1,128 @@ +/* + * Copyright (C) 2002 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#ifndef __UM_VECTOR_KERN_H +#define __UM_VECTOR_KERN_H + +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/skbuff.h> +#include <linux/socket.h> +#include <linux/list.h> +#include <linux/ctype.h> +#include <linux/workqueue.h> +#include <linux/interrupt.h> +#include "vector_user.h" + +/* Queue structure specially adapted for multiple enqueue/dequeue + * in a mmsgrecv/mmsgsend context + */ + +/* Dequeue method */ + +#define QUEUE_SENDMSG 0 +#define QUEUE_SENDMMSG 1 + +#define VECTOR_RX 1 +#define VECTOR_TX (1 << 1) +#define VECTOR_BPF (1 << 2) + +#define ETH_MAX_PACKET 1500 +#define ETH_HEADER_OTHER 32 /* just in case someone decides to go mad on QnQ */ + +struct vector_queue { + struct mmsghdr *mmsg_vector; + void **skbuff_vector; + /* backlink to device which owns us */ + struct net_device *dev; + spinlock_t head_lock; + spinlock_t tail_lock; + int queue_depth, head, tail, max_depth, max_iov_frags; + short options; +}; + +struct vector_estats { + uint64_t rx_queue_max; + uint64_t rx_queue_running_average; + uint64_t tx_queue_max; + uint64_t tx_queue_running_average; + uint64_t rx_encaps_errors; + uint64_t tx_timeout_count; + uint64_t tx_restart_queue; + uint64_t tx_kicks; + uint64_t tx_flow_control_xon; + uint64_t tx_flow_control_xoff; + uint64_t rx_csum_offload_good; + uint64_t rx_csum_offload_errors; + uint64_t sg_ok; + uint64_t sg_linearized; +}; + +#define VERIFY_HEADER_NOK -1 +#define VERIFY_HEADER_OK 0 +#define VERIFY_CSUM_OK 1 + +struct vector_private { + struct list_head list; + spinlock_t lock; + struct net_device *dev; + + int unit; + + /* Timeout timer in TX */ + + struct timer_list tl; + + /* Scheduled "remove device" work */ + struct work_struct reset_tx; + struct vector_fds *fds; + + struct vector_queue *rx_queue; + struct vector_queue *tx_queue; + + int rx_irq; + int tx_irq; + + struct arglist *parsed; + + void *transport_data; /* transport specific params if needed */ + + int max_packet; + int headroom; + + int options; + + /* remote address if any - some transports will leave this as null */ + + int header_size; + int rx_header_size; + int coalesce; + + void *header_rxbuffer; + void *header_txbuffer; + + int (*form_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + int (*verify_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + + spinlock_t stats_lock; + + struct tasklet_struct tx_poll; + bool rexmit_scheduled; + bool opened; + bool in_write_poll; + + /* ethtool stats */ + + struct vector_estats estats; + void *bpf; + + char user[0]; +}; + +extern int build_transport_data(struct vector_private *vp); + +#endif diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c new file mode 100644 index 000000000000..9f07d585f71b --- /dev/null +++ b/arch/um/drivers/vector_transports.c @@ -0,0 +1,430 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Licensed under the GPL. + */ + +#include <linux/etherdevice.h> +#include <linux/netdevice.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <asm/byteorder.h> +#include <uapi/linux/ip.h> +#include <uapi/linux/virtio_net.h> +#include <linux/virtio_net.h> +#include <linux/virtio_byteorder.h> +#include <linux/netdev_features.h> +#include "vector_user.h" +#include "vector_kern.h" + +#define GOOD_LINEAR 512 + +struct gre_minimal_header { + uint16_t header; + uint16_t arptype; +}; + + +struct uml_gre_data { + uint32_t rx_key; + uint32_t tx_key; + uint32_t sequence; + + bool ipv6; + bool has_sequence; + bool pin_sequence; + bool checksum; + bool key; + struct gre_minimal_header expected_header; + + uint32_t checksum_offset; + uint32_t key_offset; + uint32_t sequence_offset; + +}; + +struct uml_l2tpv3_data { + uint64_t rx_cookie; + uint64_t tx_cookie; + uint64_t rx_session; + uint64_t tx_session; + uint32_t counter; + + bool udp; + bool ipv6; + bool has_counter; + bool pin_counter; + bool cookie; + bool cookie_is_64; + + uint32_t cookie_offset; + uint32_t session_offset; + uint32_t counter_offset; +}; + +static int l2tpv3_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *counter; + + if (td->udp) + *(uint32_t *) header = cpu_to_be32(L2TPV3_DATA_PACKET); + (*(uint32_t *) (header + td->session_offset)) = td->tx_session; + + if (td->cookie) { + if (td->cookie_is_64) + (*(uint64_t *)(header + td->cookie_offset)) = + td->tx_cookie; + else + (*(uint32_t *)(header + td->cookie_offset)) = + td->tx_cookie; + } + if (td->has_counter) { + counter = (uint32_t *)(header + td->counter_offset); + if (td->pin_counter) { + *counter = 0; + } else { + td->counter++; + *counter = cpu_to_be32(td->counter); + } + } + return 0; +} + +static int gre_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_gre_data *td = vp->transport_data; + uint32_t *sequence; + *((uint32_t *) header) = *((uint32_t *) &td->expected_header); + if (td->key) + (*(uint32_t *) (header + td->key_offset)) = td->tx_key; + if (td->has_sequence) { + sequence = (uint32_t *)(header + td->sequence_offset); + if (td->pin_sequence) + *sequence = 0; + else + *sequence = cpu_to_be32(++td->sequence); + } + return 0; +} + +static int raw_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + virtio_net_hdr_from_skb(skb, vheader, virtio_legacy_is_little_endian(), false); + + return 0; +} + +static int l2tpv3_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *session; + uint64_t cookie; + + if ((!td->udp) && (!td->ipv6)) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + /* we do not do a strict check for "data" packets as per + * the RFC spec because the pure IP spec does not have + * that anyway. + */ + + if (td->cookie) { + if (td->cookie_is_64) + cookie = *(uint64_t *)(header + td->cookie_offset); + else + cookie = *(uint32_t *)(header + td->cookie_offset); + if (cookie != td->rx_cookie) { + printk(KERN_ERR "uml_l2tpv3: unknown cookie id"); + return -1; + } + } + session = (uint32_t *) (header + td->session_offset); + if (*session != td->rx_session) { + printk(KERN_ERR "uml_l2tpv3: session mismatch"); + return -1; + } + return 0; +} + +static int gre_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + + uint32_t key; + struct uml_gre_data *td = vp->transport_data; + + if (!td->ipv6) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + if (*((uint32_t *) header) != *((uint32_t *) &td->expected_header)) { + printk(KERN_ERR "header type disagreement, expecting %0x, got %0x", + *((uint32_t *) &td->expected_header), + *((uint32_t *) header) + ); + return -1; + } + + if (td->key) { + key = (*(uint32_t *)(header + td->key_offset)); + if (key != td->rx_key) { + printk(KERN_ERR "unknown key id %0x, expecting %0x", + key, td->rx_key); + return -1; + } + } + return 0; +} + +static int raw_verify_header ( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + if (vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) { + printk(KERN_ERR "raw: GSO enabled on interface, please turn off"); + return -1; /* GSO, we cannot process this */ + } + if ((vheader->flags & VIRTIO_NET_HDR_F_DATA_VALID) > 0) + return 1; + else + virtio_net_hdr_to_skb(skb, vheader, virtio_legacy_is_little_endian()); + return 0; +} + +static bool get_uint_param( + struct arglist *def, char *param, unsigned int *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoint(arg, 0, result) == 0) + return true; + } + return false; +} + +static bool get_ulong_param( + struct arglist *def, char *param, unsigned long *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoul(arg, 0, result) == 0) + return true; + return true; + } + return false; +} + +static int build_gre_transport_data(struct vector_private *vp) +{ + struct uml_gre_data *td; + int temp_int; + int temp_rx; + int temp_tx; + + vp->transport_data = kmalloc(sizeof(struct uml_gre_data), GFP_KERNEL); + if (vp->transport_data == NULL) + return -ENOMEM; + td = vp->transport_data; + td->sequence = 0; + + td->expected_header.arptype = GRE_IRB; + td->expected_header.header = 0; + + vp->form_header = &gre_form_header; + vp->verify_header = &gre_verify_header; + vp->header_size = 4; + td->key_offset = 4; + td->sequence_offset = 4; + td->checksum_offset = 4; + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + td->key = false; + if (get_uint_param(vp->parsed, "rx_key", &temp_rx)) { + if (get_uint_param(vp->parsed, "tx_key", &temp_tx)) { + td->key = true; + td->expected_header.header |= GRE_MODE_KEY; + td->rx_key = cpu_to_be32(temp_rx); + td->tx_key = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->sequence_offset += 4; + } else { + return -EINVAL; + } + } + + td->sequence = false; + if (get_uint_param(vp->parsed, "sequence", &temp_int)) { + if (temp_int > 0) { + vp->header_size += 4; + td->has_sequence = true; + td->expected_header.header |= GRE_MODE_SEQUENCE; + if (get_uint_param( + vp->parsed, "pin_sequence", &temp_int)) { + if (temp_int > 0) + td->pin_sequence = true; + } + } + } + vp->rx_header_size = vp->header_size; + if (!td->ipv6) + vp->rx_header_size += sizeof(struct iphdr); + return 0; +} + +static int build_l2tpv3_transport_data(struct vector_private *vp) +{ + + struct uml_l2tpv3_data *td; + int temp_int, temp_rxs, temp_txs; + unsigned long temp_rx; + unsigned long temp_tx; + + vp->transport_data = kmalloc( + sizeof(struct uml_l2tpv3_data), GFP_KERNEL); + + if (vp->transport_data == NULL) + return -ENOMEM; + + td = vp->transport_data; + + vp->form_header = &l2tpv3_form_header; + vp->verify_header = &l2tpv3_verify_header; + td->counter = 0; + + vp->header_size = 4; + td->session_offset = 0; + td->cookie_offset = 4; + td->counter_offset = 4; + + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + + if (get_uint_param(vp->parsed, "rx_session", &temp_rxs)) { + if (get_uint_param(vp->parsed, "tx_session", &temp_txs)) { + td->tx_session = cpu_to_be32(temp_txs); + td->rx_session = cpu_to_be32(temp_rxs); + } else { + return -EINVAL; + } + } else { + return -EINVAL; + } + + td->cookie_is_64 = false; + if (get_uint_param(vp->parsed, "cookie64", &temp_int)) { + if (temp_int > 0) + td->cookie_is_64 = true; + } + td->cookie = false; + if (get_ulong_param(vp->parsed, "rx_cookie", &temp_rx)) { + if (get_ulong_param(vp->parsed, "tx_cookie", &temp_tx)) { + td->cookie = true; + if (td->cookie_is_64) { + td->rx_cookie = cpu_to_be64(temp_rx); + td->tx_cookie = cpu_to_be64(temp_tx); + vp->header_size += 8; + td->counter_offset += 8; + } else { + td->rx_cookie = cpu_to_be32(temp_rx); + td->tx_cookie = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->counter_offset += 4; + } + } else { + return -EINVAL; + } + } + + td->has_counter = false; + if (get_uint_param(vp->parsed, "counter", &temp_int)) { + if (temp_int > 0) { + td->has_counter = true; + vp->header_size += 4; + if (get_uint_param( + vp->parsed, "pin_counter", &temp_int)) { + if (temp_int > 0) + td->pin_counter = true; + } + } + } + + if (get_uint_param(vp->parsed, "udp", &temp_int)) { + if (temp_int > 0) { + td->udp = true; + vp->header_size += 4; + td->counter_offset += 4; + td->session_offset += 4; + td->cookie_offset += 4; + } + } + + vp->rx_header_size = vp->header_size; + if ((!td->ipv6) && (!td->udp)) + vp->rx_header_size += sizeof(struct iphdr); + + return 0; +} + +static int build_raw_transport_data(struct vector_private *vp) +{ + if (uml_raw_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= NETIF_F_HW_CSUM; /* TSO does not work on RAW */ + printk(KERN_INFO "raw: using vnet headers to offload checksum"); + } + return 0; +} + +static int build_tap_transport_data(struct vector_private *vp) +{ + if (uml_raw_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= (NETIF_F_HW_CSUM | NETIF_F_TSO); + printk(KERN_INFO "tap/raw: using vnet headers for tso and tx/rx checksum"); + } else { + return 0; /* do not try to enable tap too if raw failed */ + } + if (uml_tap_enable_vnet_headers(vp->fds->tx_fd)) { + return 0; + } + return -1; +} + +int build_transport_data(struct vector_private *vp) +{ + char *transport = uml_vector_fetch_arg(vp->parsed, "transport"); + + if (strncmp(transport, TRANS_GRE, TRANS_GRE_LEN) == 0) + return build_gre_transport_data(vp); + if (strncmp(transport, TRANS_L2TPV3, TRANS_L2TPV3_LEN) == 0) + return build_l2tpv3_transport_data(vp); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return build_raw_transport_data(vp); + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return build_tap_transport_data(vp); + return 0; +} + diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c new file mode 100644 index 000000000000..9210bf2db569 --- /dev/null +++ b/arch/um/drivers/vector_user.c @@ -0,0 +1,528 @@ +/* + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#include <stdio.h> +#include <unistd.h> +#include <stdarg.h> +#include <errno.h> +#include <stddef.h> +#include <string.h> +#include <sys/ioctl.h> +#include <net/if.h> +#include <linux/if_tun.h> +#include <arpa/inet.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <net/ethernet.h> +#include <netinet/ip.h> +#include <netinet/ether.h> +#include <linux/if_ether.h> +#include <linux/if_packet.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <linux/virtio_net.h> +#include <netdb.h> +#include <stdlib.h> +#include <os.h> +#include <um_malloc.h> +#include "vector_user.h" + +#define ID_GRE 0 +#define ID_L2TPV3 1 +#define ID_MAX 1 + +#define TOKEN_IFNAME "ifname" + +#define TRANS_RAW "raw" +#define TRANS_RAW_LEN strlen(TRANS_RAW) + +/* This is very ugly and brute force lookup, but it is done + * only once at initialization so not worth doing hashes or + * anything more intelligent + */ + +char *uml_vector_fetch_arg(struct arglist *ifspec, char *token) +{ + int i; + + for (i = 0; i < ifspec->numargs; i++) { + if (strcmp(ifspec->tokens[i], token) == 0) + return ifspec->values[i]; + } + return NULL; + +} + +struct arglist *uml_parse_vector_ifspec(char *arg) +{ + struct arglist *result; + int pos, len; + bool parsing_token = true, next_starts = true; + + if (arg == NULL) + return NULL; + result = uml_kmalloc(sizeof(struct arglist), UM_GFP_KERNEL); + if (result == NULL) + return NULL; + result->numargs = 0; + len = strlen(arg); + for (pos = 0; pos < len; pos++) { + if (next_starts) { + if (parsing_token) { + result->tokens[result->numargs] = arg + pos; + } else { + result->values[result->numargs] = arg + pos; + result->numargs++; + } + next_starts = false; + } + if (*(arg + pos) == '=') { + if (parsing_token) + parsing_token = false; + else + goto cleanup; + next_starts = true; + (*(arg + pos)) = '\0'; + } + if (*(arg + pos) == ',') { + parsing_token = true; + next_starts = true; + (*(arg + pos)) = '\0'; + } + } + return result; +cleanup: + printk(UM_KERN_ERR "vector_setup - Couldn't parse '%s'\n", arg); + kfree(result); + return NULL; +} + +/* + * Socket/FD configuration functions. These return an structure + * of rx and tx descriptors to cover cases where these are not + * the same (f.e. read via raw socket and write via tap). + */ + +#define PATH_NET_TUN "/dev/net... [truncated message content] |
From: <ant...@ko...> - 2017-10-06 21:33:58
|
From: Anton Ivanov <ant...@ca...> 1. Removes the need to walk the IRQ/Device list to determine who triggered the IRQ. 2. Improves scalability (up to several times performance improvement for cases with 10s of devices). 3. Improves UML baseline IO performance for one disk + one NIC use case by up to 10%. 4. Introduces write poll triggered IRQs. 5. Prerequisite for introducing high performance mmesg family of functions in network IO. 6. Fixes RNG shutdown which was leaking a file descriptor Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/drivers/chan_kern.c | 53 +---- arch/um/drivers/line.c | 2 +- arch/um/drivers/random.c | 11 +- arch/um/drivers/ubd_kern.c | 4 +- arch/um/include/shared/irq_user.h | 12 +- arch/um/include/shared/os.h | 17 +- arch/um/kernel/irq.c | 460 ++++++++++++++++++++++++-------------- arch/um/os-Linux/irq.c | 202 +++++++++-------- 8 files changed, 444 insertions(+), 317 deletions(-) diff --git a/arch/um/drivers/chan_kern.c b/arch/um/drivers/chan_kern.c index acbe6c67afba..05588f9466c7 100644 --- a/arch/um/drivers/chan_kern.c +++ b/arch/um/drivers/chan_kern.c @@ -171,56 +171,19 @@ int enable_chan(struct line *line) return err; } -/* Items are added in IRQ context, when free_irq can't be called, and - * removed in process context, when it can. - * This handles interrupt sources which disappear, and which need to - * be permanently disabled. This is discovered in IRQ context, but - * the freeing of the IRQ must be done later. - */ -static DEFINE_SPINLOCK(irqs_to_free_lock); -static LIST_HEAD(irqs_to_free); - -void free_irqs(void) -{ - struct chan *chan; - LIST_HEAD(list); - struct list_head *ele; - unsigned long flags; - - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_splice_init(&irqs_to_free, &list); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - - list_for_each(ele, &list) { - chan = list_entry(ele, struct chan, free_list); - - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } -} - static void close_one_chan(struct chan *chan, int delay_free_irq) { - unsigned long flags; - if (!chan->opened) return; - if (delay_free_irq) { - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_add(&chan->free_list, &irqs_to_free); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - } - else { - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } + /* we can safely call free now - it will be marked + * as free and freed once the IRQ stopped processing + */ + if (chan->input && chan->enabled) + um_free_irq(chan->line->driver->read_irq, chan); + if (chan->output && chan->enabled) + um_free_irq(chan->line->driver->write_irq, chan); + chan->enabled = 0; if (chan->ops->close != NULL) (*chan->ops->close)(chan->fd, chan->data); diff --git a/arch/um/drivers/line.c b/arch/um/drivers/line.c index 366e57f5e8d6..8d80b27502e6 100644 --- a/arch/um/drivers/line.c +++ b/arch/um/drivers/line.c @@ -284,7 +284,7 @@ int line_setup_irq(int fd, int input, int output, struct line *line, void *data) if (err) return err; if (output) - err = um_request_irq(driver->write_irq, fd, IRQ_WRITE, + err = um_request_irq(driver->write_irq, fd, IRQ_NONE, line_write_interrupt, IRQF_SHARED, driver->write_irq_name, data); return err; diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c index 37c51a6be690..778a0e52d5a5 100644 --- a/arch/um/drivers/random.c +++ b/arch/um/drivers/random.c @@ -13,6 +13,7 @@ #include <linux/miscdevice.h> #include <linux/delay.h> #include <linux/uaccess.h> +#include <init.h> #include <irq_kern.h> #include <os.h> @@ -154,7 +155,14 @@ static int __init rng_init (void) /* * rng_cleanup - shutdown RNG module */ -static void __exit rng_cleanup (void) + +static void cleanup(void) +{ + free_irq_by_fd(random_fd); + os_close_file(random_fd); +} + +static void __exit rng_cleanup(void) { os_close_file(random_fd); misc_deregister (&rng_miscdev); @@ -162,6 +170,7 @@ static void __exit rng_cleanup (void) module_init (rng_init); module_exit (rng_cleanup); +__uml_exitcall(cleanup); MODULE_DESCRIPTION("UML Host Random Number Generator (RNG) driver"); MODULE_LICENSE("GPL"); diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c index b55fe9bf5d3e..d4e8c497ae86 100644 --- a/arch/um/drivers/ubd_kern.c +++ b/arch/um/drivers/ubd_kern.c @@ -1587,11 +1587,11 @@ int io_thread(void *arg) do { res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n); - if (res > 0) { + if (res >= 0) { written += res; } else { if (res != -EAGAIN) { - printk("io_thread - read failed, fd = %d, " + printk("io_thread - write failed, fd = %d, " "err = %d\n", kernel_fd, -n); } } diff --git a/arch/um/include/shared/irq_user.h b/arch/um/include/shared/irq_user.h index df5633053957..a7a6120f19d5 100644 --- a/arch/um/include/shared/irq_user.h +++ b/arch/um/include/shared/irq_user.h @@ -7,6 +7,7 @@ #define __IRQ_USER_H__ #include <sysdep/ptrace.h> +#include <stdbool.h> struct irq_fd { struct irq_fd *next; @@ -15,10 +16,17 @@ struct irq_fd { int type; int irq; int events; - int current_events; + bool active; + bool pending; + bool purge; }; -enum { IRQ_READ, IRQ_WRITE }; +#define IRQ_READ 0 +#define IRQ_WRITE 1 +#define IRQ_NONE 2 +#define MAX_IRQ_TYPE (IRQ_NONE + 1) + + struct siginfo; extern void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs); diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index 574e03fc7ba2..edbe6563d5c5 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -290,15 +290,16 @@ extern void halt_skas(void); extern void reboot_skas(void); /* irq.c */ -extern int os_waiting_for_events(struct irq_fd *active_fds); -extern int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds); -extern void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2); -extern void os_free_irq_later(struct irq_fd *active_fds, - int irq, void *dev_id); -extern int os_get_pollfd(int i); -extern void os_set_pollfd(int i, int fd); +extern int os_waiting_for_events_epoll(void); +extern void *os_epoll_get_data_pointer(int index); +extern int os_epoll_triggered(int index, int events); +extern int os_event_mask(int irq_type); +extern int os_setup_epoll(void); +extern int os_add_epoll_fd(int events, int fd, void *data); +extern int os_mod_epoll_fd(int events, int fd, void *data); +extern int os_del_epoll_fd(int fd); extern void os_set_ioignore(void); +extern void os_close_epoll_fd(void); /* sigio.c */ extern int add_sigio_fd(int fd); diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c index 23cb9350d47e..980148d56537 100644 --- a/arch/um/kernel/irq.c +++ b/arch/um/kernel/irq.c @@ -1,4 +1,6 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL * Derived (i.e. mostly copied) from arch/i386/kernel/irq.c: @@ -16,243 +18,361 @@ #include <as-layout.h> #include <kern_util.h> #include <os.h> +#include <irq_user.h> -/* - * This list is accessed under irq_lock, except in sigio_handler, - * where it is safe from being modified. IRQ handlers won't change it - - * if an IRQ source has vanished, it will be freed by free_irqs just - * before returning from sigio_handler. That will process a separate - * list of irqs to free, with its own locking, coming back here to - * remove list elements, taking the irq_lock to do so. + +/* When epoll triggers we do not know why it did so + * we can also have different IRQs for read and write. + * This is why we keep a small irq_fd array for each fd - + * one entry per IRQ type */ -static struct irq_fd *active_fds = NULL; -static struct irq_fd **last_irq_ptr = &active_fds; -extern void free_irqs(void); +struct irq_entry { + struct irq_entry *next; + int fd; + struct irq_fd *irq_array[MAX_IRQ_TYPE + 1]; +}; + +static struct irq_entry *active_fds; + +static DEFINE_SPINLOCK(irq_lock); + +static void irq_io_loop(struct irq_fd *irq, struct uml_pt_regs *regs) +{ +/* + * irq->active guards against reentry + * irq->pending accumulates pending requests + * if pending is raised the irq_handler is re-run + * until pending is cleared + */ + if (irq->active) { + irq->active = false; + do { + irq->pending = false; + do_IRQ(irq->irq, regs); + } while (irq->pending && (!irq->purge)); + if (!irq->purge) + irq->active = true; + } else { + irq->pending = true; + } +} void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs) { - struct irq_fd *irq_fd; - int n; + struct irq_entry *irq_entry; + struct irq_fd *irq; + + int n, i, j; while (1) { - n = os_waiting_for_events(active_fds); + /* This is now lockless - epoll keeps back-referencesto the irqs + * which have trigger it so there is no need to walk the irq + * list and lock it every time. We avoid locking by turning off + * IO for a specific fd by executing os_del_epoll_fd(fd) before + * we do any changes to the actual data structures + */ + n = os_waiting_for_events_epoll(); + if (n <= 0) { if (n == -EINTR) continue; - else break; + else + break; } - for (irq_fd = active_fds; irq_fd != NULL; - irq_fd = irq_fd->next) { - if (irq_fd->current_events != 0) { - irq_fd->current_events = 0; - do_IRQ(irq_fd->irq, regs); + for (i = 0; i < n ; i++) { + /* Epoll back reference is the entry with 3 irq_fd + * leaves - one for each irq type. + */ + irq_entry = (struct irq_entry *) + os_epoll_get_data_pointer(i); + for (j = 0; j < MAX_IRQ_TYPE ; j++) { + irq = irq_entry->irq_array[j]; + if (irq == NULL) + continue; + if (os_epoll_triggered(i, irq->events) > 0) + irq_io_loop(irq, regs); + if (irq->purge) { + irq_entry->irq_array[j] = NULL; + kfree(irq); + } } } } +} + +static int assign_epoll_events_to_irq(struct irq_entry *irq_entry) +{ + int i; + int events = 0; + struct irq_fd *irq; - free_irqs(); + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + irq = irq_entry->irq_array[i]; + if (irq != NULL) + events = irq->events | events; + } + if (events > 0) { + /* os_add_epoll will call os_mod_epoll if this already exists */ + return os_add_epoll_fd(events, irq_entry->fd, irq_entry); + } + /* No events - delete */ + return os_del_epoll_fd(irq_entry->fd); } -static DEFINE_SPINLOCK(irq_lock); + static int activate_fd(int irq, int fd, int type, void *dev_id) { - struct pollfd *tmp_pfd; - struct irq_fd *new_fd, *irq_fd; + struct irq_fd *new_fd; + struct irq_entry *irq_entry; + int i, err, events; unsigned long flags; - int events, err, n; err = os_set_fd_async(fd); if (err < 0) goto out; - err = -ENOMEM; - new_fd = kmalloc(sizeof(struct irq_fd), GFP_KERNEL); - if (new_fd == NULL) - goto out; + spin_lock_irqsave(&irq_lock, flags); - if (type == IRQ_READ) - events = UM_POLLIN | UM_POLLPRI; - else events = UM_POLLOUT; - *new_fd = ((struct irq_fd) { .next = NULL, - .id = dev_id, - .fd = fd, - .type = type, - .irq = irq, - .events = events, - .current_events = 0 } ); + /* Check if we have an entry for this fd */ err = -EBUSY; - spin_lock_irqsave(&irq_lock, flags); - for (irq_fd = active_fds; irq_fd != NULL; irq_fd = irq_fd->next) { - if ((irq_fd->fd == fd) && (irq_fd->type == type)) { - printk(KERN_ERR "Registering fd %d twice\n", fd); - printk(KERN_ERR "Irqs : %d, %d\n", irq_fd->irq, irq); - printk(KERN_ERR "Ids : 0x%p, 0x%p\n", irq_fd->id, - dev_id); + for (irq_entry = active_fds; + irq_entry != NULL; irq_entry = irq_entry->next) { + if (irq_entry->fd == fd) + break; + } + + if (irq_entry == NULL) { + /* This needs to be atomic as it may be called from an + * IRQ context. + */ + irq_entry = kmalloc(sizeof(struct irq_entry), GFP_ATOMIC); + if (irq_entry == NULL) { + printk(KERN_ERR + "Failed to allocate new IRQ entry\n"); goto out_unlock; } + irq_entry->fd = fd; + for (i = 0; i < MAX_IRQ_TYPE; i++) + irq_entry->irq_array[i] = NULL; + irq_entry->next = active_fds; + active_fds = irq_entry; } - if (type == IRQ_WRITE) - fd = -1; - - tmp_pfd = NULL; - n = 0; + /* Check if we are trying to re-register an interrupt for a + * particular fd + */ - while (1) { - n = os_create_pollfd(fd, events, tmp_pfd, n); - if (n == 0) - break; + if (irq_entry->irq_array[type] != NULL) { + printk(KERN_ERR + "Trying to reregister IRQ %d FD %d TYPE %d ID %p\n", + irq, fd, type, dev_id + ); + goto out_unlock; + } else { + /* New entry for this fd */ + + err = -ENOMEM; + new_fd = kmalloc(sizeof(struct irq_fd), GFP_ATOMIC); + if (new_fd == NULL) + goto out_unlock; - /* - * n > 0 - * It means we couldn't put new pollfd to current pollfds - * and tmp_fds is NULL or too small for new pollfds array. - * Needed size is equal to n as minimum. - * - * Here we have to drop the lock in order to call - * kmalloc, which might sleep. - * If something else came in and changed the pollfds array - * so we will not be able to put new pollfd struct to pollfds - * then we free the buffer tmp_fds and try again. + events = os_event_mask(type); + + *new_fd = ((struct irq_fd) { + .id = dev_id, + .irq = irq, + .type = type, + .events = events, + .active = true, + .pending = false, + .purge = false + }); + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop */ - spin_unlock_irqrestore(&irq_lock, flags); - kfree(tmp_pfd); - - tmp_pfd = kmalloc(n, GFP_KERNEL); - if (tmp_pfd == NULL) - goto out_kfree; - - spin_lock_irqsave(&irq_lock, flags); + os_del_epoll_fd(irq_entry->fd); + irq_entry->irq_array[type] = new_fd; } - *last_irq_ptr = new_fd; - last_irq_ptr = &new_fd->next; - + /* Turn back IO on with the correct (new) IO event mask */ + assign_epoll_events_to_irq(irq_entry); spin_unlock_irqrestore(&irq_lock, flags); - - /* - * This calls activate_fd, so it has to be outside the critical - * section. - */ - maybe_sigio_broken(fd, (type == IRQ_READ)); + maybe_sigio_broken(fd, (type != IRQ_NONE)); return 0; - - out_unlock: +out_unlock: spin_unlock_irqrestore(&irq_lock, flags); - out_kfree: - kfree(new_fd); - out: +out: return err; } -static void free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg) +/* + * Walk the IRQ list and dispose of any unused entries. + * Should be done under irq_lock. + */ + +static void garbage_collect_irq_entries(void) { - unsigned long flags; + int i; + bool reap; + struct irq_entry *walk; + struct irq_entry *previous = NULL; + struct irq_entry *to_free; - spin_lock_irqsave(&irq_lock, flags); - os_free_irq_by_cb(test, arg, active_fds, &last_irq_ptr); - spin_unlock_irqrestore(&irq_lock, flags); + if (active_fds == NULL) + return; + walk = active_fds; + while (walk != NULL) { + reap = true; + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (walk->irq_array[i] != NULL) { + reap = false; + break; + } + } + if (reap) { + if (previous == NULL) + active_fds = walk->next; + else + previous->next = walk->next; + to_free = walk; + } else { + to_free = NULL; + } + walk = walk->next; + if (to_free != NULL) + kfree(to_free); + } } -struct irq_and_dev { - int irq; - void *dev; -}; +/* + * Walk the IRQ list and get the descriptor for our FD + */ -static int same_irq_and_dev(struct irq_fd *irq, void *d) +static struct irq_entry *get_irq_entry_by_fd(int fd) { - struct irq_and_dev *data = d; + struct irq_entry *walk = active_fds; - return ((irq->irq == data->irq) && (irq->id == data->dev)); + while (walk != NULL) { + if (walk->fd == fd) + return walk; + walk = walk->next; + } + return NULL; } -static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) -{ - struct irq_and_dev data = ((struct irq_and_dev) { .irq = irq, - .dev = dev }); - free_irq_by_cb(same_irq_and_dev, &data); -} +/* + * Walk the IRQ list and dispose of an entry for a specific + * device, fd and number. Note - if sharing an IRQ for read + * and writefor the same FD it will be disposed in either case. + * If this behaviour is undesirable use different IRQ ids. + */ -static int same_fd(struct irq_fd *irq, void *fd) -{ - return (irq->fd == *((int *)fd)); -} +#define IGNORE_IRQ 1 +#define IGNORE_DEV (1<<1) -void free_irq_by_fd(int fd) +static void do_free_by_irq_and_dev( + struct irq_entry *irq_entry, + unsigned int irq, + void *dev, + int flags +) { - free_irq_by_cb(same_fd, &fd); + int i; + struct irq_fd *to_free; + + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (irq_entry->irq_array[i] != NULL) { + if ( + ((flags & IGNORE_IRQ) || + (irq_entry->irq_array[i]->irq == irq)) && + ((flags & IGNORE_DEV) || + (irq_entry->irq_array[i]->id == dev)) + ) { + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop + */ + os_del_epoll_fd(irq_entry->fd); + to_free = irq_entry->irq_array[i]; + irq_entry->irq_array[i] = NULL; + assign_epoll_events_to_irq(irq_entry); + if (to_free->active) + to_free->purge = true; + else + kfree(to_free); + } + } + } } -/* Must be called with irq_lock held */ -static struct irq_fd *find_irq_by_fd(int fd, int irqnum, int *index_out) +void free_irq_by_fd(int fd) { - struct irq_fd *irq; - int i = 0; - int fdi; + struct irq_entry *to_free; + unsigned long flags; - for (irq = active_fds; irq != NULL; irq = irq->next) { - if ((irq->fd == fd) && (irq->irq == irqnum)) - break; - i++; - } - if (irq == NULL) { - printk(KERN_ERR "find_irq_by_fd doesn't have descriptor %d\n", - fd); - goto out; - } - fdi = os_get_pollfd(i); - if ((fdi != -1) && (fdi != fd)) { - printk(KERN_ERR "find_irq_by_fd - mismatch between active_fds " - "and pollfds, fd %d vs %d, need %d\n", irq->fd, - fdi, fd); - irq = NULL; - goto out; + spin_lock_irqsave(&irq_lock, flags); + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); } - *index_out = i; - out: - return irq; + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); } -void reactivate_fd(int fd, int irqnum) +static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irq, + dev, + 0 + ); + to_free = to_free->next; } - os_set_pollfd(i, irq->fd); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); +} - add_sigio_fd(fd); + +void reactivate_fd(int fd, int irqnum) +{ + /** NOP - we do auto-EOI now **/ } void deactivate_fd(int fd, int irqnum) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; + os_del_epoll_fd(fd); spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irqnum, + NULL, + IGNORE_DEV + ); } - - os_set_pollfd(i, -1); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); - ignore_sigio_fd(fd); } EXPORT_SYMBOL(deactivate_fd); @@ -265,17 +385,28 @@ EXPORT_SYMBOL(deactivate_fd); */ int deactivate_all_fds(void) { - struct irq_fd *irq; - int err; + unsigned long flags; + struct irq_entry *to_free; - for (irq = active_fds; irq != NULL; irq = irq->next) { - err = os_clear_fd_async(irq->fd); - if (err) - return err; - } - /* If there is a signal already queued, after unblocking ignore it */ + spin_lock_irqsave(&irq_lock, flags); + /* Stop IO. The IRQ loop has no lock so this is our + * only way of making sure we are safe to dispose + * of all IRQ handlers + */ os_set_ioignore(); - + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); + to_free = to_free->next; + } + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); + os_close_epoll_fd(); return 0; } @@ -353,8 +484,11 @@ void __init init_IRQ(void) irq_set_chip_and_handler(TIMER_IRQ, &SIGVTALRM_irq_type, handle_edge_irq); + for (i = 1; i < NR_IRQS; i++) irq_set_chip_and_handler(i, &normal_irq_type, handle_edge_irq); + /* Initialize EPOLL Loop */ + os_setup_epoll(); } /* diff --git a/arch/um/os-Linux/irq.c b/arch/um/os-Linux/irq.c index b9afb74b79ad..365823010346 100644 --- a/arch/um/os-Linux/irq.c +++ b/arch/um/os-Linux/irq.c @@ -1,135 +1,147 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL */ #include <stdlib.h> #include <errno.h> -#include <poll.h> +#include <sys/epoll.h> #include <signal.h> #include <string.h> #include <irq_user.h> #include <os.h> #include <um_malloc.h> +/* Epoll support */ + +static int epollfd = -1; + +#define MAX_EPOLL_EVENTS 64 + +static struct epoll_event epoll_events[MAX_EPOLL_EVENTS]; + +/* Helper to return an Epoll data pointer from an epoll event structure. + * We need to keep this one on the userspace side to keep includes separate + */ + +void *os_epoll_get_data_pointer(int index) +{ + return epoll_events[index].data.ptr; +} + +/* Helper to compare events versus the events in the epoll structure. + * Same as above - needs to be on the userspace side + */ + + +int os_epoll_triggered(int index, int events) +{ + return epoll_events[index].events & events; +} +/* Helper to set the event mask. + * The event mask is opaque to the kernel side, because it does not have + * access to the right includes/defines for EPOLL constants. + */ + +int os_event_mask(int irq_type) +{ + if (irq_type == IRQ_READ) + return EPOLLIN | EPOLLPRI; + if (irq_type == IRQ_WRITE) + return EPOLLOUT; + return 0; +} + /* - * Locked by irq_lock in arch/um/kernel/irq.c. Changed by os_create_pollfd - * and os_free_irq_by_cb, which are called under irq_lock. + * Initial Epoll Setup */ -static struct pollfd *pollfds = NULL; -static int pollfds_num = 0; -static int pollfds_size = 0; +int os_setup_epoll(void) +{ + epollfd = epoll_create(MAX_EPOLL_EVENTS); + return epollfd; +} -int os_waiting_for_events(struct irq_fd *active_fds) +/* + * Helper to run the actual epoll_wait + */ +int os_waiting_for_events_epoll(void) { - struct irq_fd *irq_fd; - int i, n, err; + int n, err; - n = poll(pollfds, pollfds_num, 0); + n = epoll_wait(epollfd, + (struct epoll_event *) &epoll_events, MAX_EPOLL_EVENTS, 0); if (n < 0) { err = -errno; if (errno != EINTR) - printk(UM_KERN_ERR "os_waiting_for_events:" - " poll returned %d, errno = %d\n", n, errno); + printk( + UM_KERN_ERR "os_waiting_for_events:" + " epoll returned %d, error = %s\n", n, + strerror(errno) + ); return err; } - - if (n == 0) - return 0; - - irq_fd = active_fds; - - for (i = 0; i < pollfds_num; i++) { - if (pollfds[i].revents != 0) { - irq_fd->current_events = pollfds[i].revents; - pollfds[i].fd = -1; - } - irq_fd = irq_fd->next; - } return n; } -int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds) -{ - if (pollfds_num == pollfds_size) { - if (size_tmpfds <= pollfds_size * sizeof(pollfds[0])) { - /* return min size needed for new pollfds area */ - return (pollfds_size + 1) * sizeof(pollfds[0]); - } - - if (pollfds != NULL) { - memcpy(tmp_pfd, pollfds, - sizeof(pollfds[0]) * pollfds_size); - /* remove old pollfds */ - kfree(pollfds); - } - pollfds = tmp_pfd; - pollfds_size++; - } else - kfree(tmp_pfd); /* remove not used tmp_pfd */ - - pollfds[pollfds_num] = ((struct pollfd) { .fd = fd, - .events = events, - .revents = 0 }); - pollfds_num++; - - return 0; -} -void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2) +/* + * Helper to add a fd to epoll + */ +int os_add_epoll_fd(int events, int fd, void *data) { - struct irq_fd **prev; - int i = 0; - - prev = &active_fds; - while (*prev != NULL) { - if ((*test)(*prev, arg)) { - struct irq_fd *old_fd = *prev; - if ((pollfds[i].fd != -1) && - (pollfds[i].fd != (*prev)->fd)) { - printk(UM_KERN_ERR "os_free_irq_by_cb - " - "mismatch between active_fds and " - "pollfds, fd %d vs %d\n", - (*prev)->fd, pollfds[i].fd); - goto out; - } - - pollfds_num--; - - /* - * This moves the *whole* array after pollfds[i] - * (though it doesn't spot as such)! - */ - memmove(&pollfds[i], &pollfds[i + 1], - (pollfds_num - i) * sizeof(pollfds[0])); - if (*last_irq_ptr2 == &old_fd->next) - *last_irq_ptr2 = prev; - - *prev = (*prev)->next; - if (old_fd->type == IRQ_WRITE) - ignore_sigio_fd(old_fd->fd); - kfree(old_fd); - continue; - } - prev = &(*prev)->next; - i++; - } - out: - return; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events | EPOLLET; + result = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event); + if ((result) && (errno == EEXIST)) + result = os_mod_epoll_fd(events, fd, data); + if (result) + printk("epollctl add err fd %d, %s\n", fd, strerror(errno)); + return result; } -int os_get_pollfd(int i) +/* + * Helper to mod the fd event mask and/or data backreference + */ +int os_mod_epoll_fd(int events, int fd, void *data) { - return pollfds[i].fd; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events; + result = epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &event); + if (result) + printk(UM_KERN_ERR + "epollctl mod err fd %d, %s\n", fd, strerror(errno)); + return result; } -void os_set_pollfd(int i, int fd) +/* + * Helper to delete the epoll fd + */ +int os_del_epoll_fd(int fd) { - pollfds[i].fd = fd; + struct epoll_event event; + int result; + /* This is quiet as we use this as IO ON/OFF - so it is often + * invoked on a non-existent fd + */ + result = epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, &event); + return result; } void os_set_ioignore(void) { signal(SIGIO, SIG_IGN); } + +void os_close_epoll_fd(void) +{ + /* Needed so we do not leak an fd when rebooting */ + os_close_file(epollfd); +} -- 2.11.0 |
From: <ant...@ko...> - 2017-10-06 21:33:51
|
This is a resubmit of the patch from earlier today. I had forgotten to include all files - apologies. A. |
From: Anton I. <ant...@ko...> - 2017-10-06 10:49:20
|
Sorry all, I fat fingered this one, will resubmit once I get back home and redo it (I am traveling all day). A. |
From: <ant...@ko...> - 2017-10-06 07:35:21
|
From: Anton Ivanov <ant...@ca...> 1. TSO/GSO support where applicable or available RX - raw and tapraw TX - tap only (raw appears to be hitting a bug in the af_packet family in the kernel resulting in it being stuck in a -ENOBUFS loop. This results in TX/RX TCP performance ~ 2-3 times higher than qemu on same hardware (measured with iperf). 2. Cleanup and unification of the RX/TX code to use the same skb and msg prep routines. Adds two new transport arguments applicable to all transports gro - enable/disable GRO in driver vec - enable/disable multi-message vector IO 3. Adds change/set device features support. Gro,gso,gso,sg,etc can now be adjusted via ethtool. Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/drivers/vector_kern.c | 167 ++++++++++++++++++++++++++---------- arch/um/drivers/vector_kern.h | 1 + arch/um/drivers/vector_transports.c | 15 ++-- arch/um/drivers/vector_user.c | 5 +- 4 files changed, 135 insertions(+), 53 deletions(-) diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c index f0ea7f98b86c..268862d8f915 100644 --- a/arch/um/drivers/vector_kern.c +++ b/arch/um/drivers/vector_kern.c @@ -75,7 +75,7 @@ static void vector_eth_configure(int n, struct arglist *def); #define SAFETY_MARGIN 32 #define DEFAULT_VECTOR_SIZE 64 #define TX_SMALL_PACKET 128 -#define MAX_IOV_SIZE 8 +#define MAX_IOV_SIZE (MAX_SKB_FRAGS + 1) static const struct { const char string[ETH_GSTRING_LEN]; @@ -162,15 +162,45 @@ static int get_headroom(struct arglist *def) return DEFAULT_HEADROOM; } +static int get_req_size(struct arglist *def) +{ + char *gro = uml_vector_fetch_arg(def, "gro"); + long result; + + if (gro != NULL) { + if (kstrtoul(gro, 10, &result) == 0) { + if (result > 0) + return 65536; + } + } + return get_mtu(def) + ETH_HEADER_OTHER + get_headroom(def) + SAFETY_MARGIN; +} + + static int get_transport_options(struct arglist *def) { char *transport = uml_vector_fetch_arg(def, "transport"); + char *vector = uml_vector_fetch_arg(def, "vec"); + + int vec_rx = VECTOR_RX; + int vec_tx = VECTOR_TX; + long parsed; + + if (vector != NULL) { + if (kstrtoul(vector, 10, &parsed) == 0) { + if (parsed == 0) { + vec_rx = 0; + vec_tx = 0; + } + } + } + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) - return (VECTOR_RX | VECTOR_BPF); + return (vec_rx | VECTOR_BPF); if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) - return (VECTOR_TX | VECTOR_RX | VECTOR_BPF); - return (VECTOR_TX | VECTOR_RX); + return (vec_rx | vec_tx | VECTOR_BPF); + return (vec_rx | vec_tx); } @@ -547,13 +577,59 @@ static struct vector_queue *create_queue( * just read into a prepared queue filled with skbuffs. */ +static struct sk_buff *prep_skb(struct vector_private *vp, struct user_msghdr *msg) +{ + int linear = vp->max_packet + vp->headroom + SAFETY_MARGIN; + struct sk_buff *result; + int iov_index = 0, len; + struct iovec *iov = msg->msg_iov; + int err, nr_frags, frag; + skb_frag_t *skb_frag; + + if (vp->req_size <= linear) + len = linear; + else + len = vp->req_size; + result = alloc_skb_with_frags(linear, len - vp->max_packet, 3, &err, GFP_ATOMIC); + if (vp->header_size > 0) + iov_index++; + if (result == NULL) { + iov[iov_index].iov_base = NULL; + iov[iov_index].iov_len = 0; + goto done; + } + skb_reserve(result, vp->headroom); + result->dev = vp->dev; + skb_put(result, vp->max_packet); + result->data_len = len - vp->max_packet; + result->len += len - vp->max_packet; + skb_reset_mac_header(result); + result->ip_summed = CHECKSUM_NONE; + iov[iov_index].iov_base = result->data; + iov[iov_index].iov_len = vp->max_packet; + iov_index++; + + nr_frags = skb_shinfo(result)->nr_frags; + for (frag = 0; frag < nr_frags; frag++) { + skb_frag = &skb_shinfo(result)->frags[frag]; + iov[iov_index].iov_base = skb_frag_address_safe(skb_frag); + if (iov[iov_index].iov_base != NULL) + iov[iov_index].iov_len = skb_frag_size(skb_frag); + else + iov[iov_index].iov_len = 0; + iov_index++; + } +done: + msg->msg_iovlen = iov_index; + return result; +} + + /* Prepare queue for recvmmsg one-shot rx - fill with fresh sk_buffs*/ static void prep_queue_for_rx(struct vector_queue *qi) { struct vector_private *vp = netdev_priv(qi->dev); - struct sk_buff *skb; - struct iovec *iov; struct mmsghdr *mmsg_vector = qi->mmsg_vector; void **skbuff_vector = qi->skbuff_vector; int i; @@ -566,26 +642,7 @@ static void prep_queue_for_rx(struct vector_queue *qi) * This allows us stop faffing around with a "drop buffer" */ - skb = netdev_alloc_skb( - vp->dev, - vp->max_packet + vp->headroom + SAFETY_MARGIN); - iov = mmsg_vector->msg_hdr.msg_iov; - mmsg_vector->msg_len = 0; - if (vp->header_size > 0) - iov++; - if (skb != NULL) { - skb_reserve(skb, vp->headroom); - skb->dev = qi->dev; - skb_put(skb, vp->max_packet); - skb_reset_mac_header(skb); - skb->ip_summed = CHECKSUM_NONE; - iov->iov_base = skb->data; - iov->iov_len = vp->max_packet; - } else { - iov->iov_base = NULL; - iov->iov_len = 0; - } - *skbuff_vector = skb; + *skbuff_vector = prep_skb(vp, &mmsg_vector->msg_hdr); skbuff_vector++; mmsg_vector++; } @@ -738,7 +795,7 @@ static int vector_legacy_rx(struct vector_private *vp) { int pkt_len; struct user_msghdr hdr; - struct iovec iov[2]; /* header + data use case only */ + struct iovec iov[2 + MAX_IOV_SIZE]; /* header + data use case only */ int iovpos = 0; struct sk_buff *skb; int header_check; @@ -746,34 +803,25 @@ static int vector_legacy_rx(struct vector_private *vp) hdr.msg_name = NULL; hdr.msg_namelen = 0; hdr.msg_iov = (struct iovec *) &iov; - hdr.msg_iovlen = 1; hdr.msg_control = NULL; hdr.msg_controllen = 0; hdr.msg_flags = 0; if (vp->header_size > 0) { - iov[iovpos].iov_base = vp->header_rxbuffer; - iov[iovpos].iov_len = vp->rx_header_size; - hdr.msg_iovlen++; - iovpos++; + iov[0].iov_base = vp->header_rxbuffer; + iov[0].iov_len = vp->header_size; } - skb = netdev_alloc_skb(vp->dev, - vp->max_packet + vp->headroom + SAFETY_MARGIN); + skb = prep_skb(vp, &hdr); + if (skb == NULL) { /* Read a packet into drop_buffer and don't do * anything with it. */ iov[iovpos].iov_base = drop_buffer; iov[iovpos].iov_len = DROP_BUFFER_SIZE; + hdr.msg_iovlen = 1; vp->dev->stats.rx_dropped++; - } else { - skb_reserve(skb, vp->headroom); - skb->dev = vp->dev; - skb_put(skb, vp->max_packet); - skb_reset_mac_header(skb); - iov[iovpos].iov_base = skb->data; - iov[iovpos].iov_len = vp->max_packet; } pkt_len = uml_vector_recvmsg(vp->fds->rx_fd, &hdr, 0); @@ -794,7 +842,7 @@ static int vector_legacy_rx(struct vector_private *vp) skb->ip_summed = CHECKSUM_UNNECESSARY; } } - skb_trim(skb, pkt_len - vp->rx_header_size); + pskb_trim(skb, pkt_len - vp->rx_header_size); skb->protocol = eth_type_trans(skb, skb->dev); vp->dev->stats.rx_bytes += skb->len; vp->dev->stats.rx_packets++; @@ -898,7 +946,7 @@ static int vector_mmsg_rx(struct vector_private *vp) skb->ip_summed = CHECKSUM_UNNECESSARY; } } - skb_trim(skb, + pskb_trim(skb, mmsg_vector->msg_len - vp->rx_header_size); skb->protocol = eth_type_trans(skb, skb->dev); /* @@ -1109,7 +1157,7 @@ static int vector_net_open(struct net_device *dev) if ((vp->options & VECTOR_RX) > 0) { vp->rx_queue = create_queue( - vp, get_depth(vp->parsed), vp->rx_header_size, 0); + vp, get_depth(vp->parsed), vp->rx_header_size, MAX_IOV_SIZE); vp->rx_queue->queue_depth = get_depth(vp->parsed); } else { vp->header_rxbuffer = kmalloc(vp->rx_header_size, GFP_KERNEL); @@ -1200,6 +1248,30 @@ static void vector_net_tx_timeout(struct net_device *dev) schedule_work(&vp->reset_tx); } +static netdev_features_t vector_fix_features(struct net_device *dev, + netdev_features_t features) +{ + features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM); + return features; +} + +static int vector_set_features(struct net_device *dev, + netdev_features_t features) +{ + struct vector_private *vp = netdev_priv(dev); + /* Adjust buffer sizes for GSO/GRO. Unfortunately, there is + * no way to negotiate it on raw sockets, so we can change + * only our side. + */ + if (features & NETIF_F_GRO) + /* All new frame buffers will be GRO-sized */ + vp->req_size = 65536; + else + /* All new frame buffers will be normal sized */ + vp->req_size = vp->max_packet + vp->headroom + SAFETY_MARGIN; + return 0; +} + #ifdef CONFIG_NET_POLL_CONTROLLER static void vector_net_poll_controller(struct net_device *dev) { @@ -1303,6 +1375,8 @@ static const struct net_device_ops vector_netdev_ops = { .ndo_tx_timeout = vector_net_tx_timeout, .ndo_set_mac_address = eth_mac_addr, .ndo_validate_addr = eth_validate_addr, + .ndo_fix_features = vector_fix_features, + .ndo_set_features = vector_set_features, #ifdef CONFIG_NET_POLL_CONTROLLER .ndo_poll_controller = vector_net_poll_controller, #endif @@ -1394,10 +1468,11 @@ static void vector_eth_configure( .opened = false, .transport_data = NULL, .in_write_poll = false, - .coalesce = 2 + .coalesce = 2, + .req_size = get_req_size(def) }); - dev->features = NETIF_F_SG; + dev->features = dev->hw_features = (NETIF_F_SG | NETIF_F_FRAGLIST); tasklet_init(&vp->tx_poll, vector_tx_poll, (unsigned long)vp); INIT_WORK(&vp->reset_tx, vector_reset_tx); diff --git a/arch/um/drivers/vector_kern.h b/arch/um/drivers/vector_kern.h index a9ade0851fda..699696deb396 100644 --- a/arch/um/drivers/vector_kern.h +++ b/arch/um/drivers/vector_kern.h @@ -90,6 +90,7 @@ struct vector_private { void *transport_data; /* transport specific params if needed */ int max_packet; + int req_size; /* different from max packet - used for TSO */ int headroom; int options; diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c index 9f07d585f71b..57aa9cb5434c 100644 --- a/arch/um/drivers/vector_transports.c +++ b/arch/um/drivers/vector_transports.c @@ -187,9 +187,8 @@ static int raw_verify_header ( { struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; - if (vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) { - printk(KERN_ERR "raw: GSO enabled on interface, please turn off"); - return -1; /* GSO, we cannot process this */ + if ((vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) && (vp->req_size != 65536)) { + printk(KERN_INFO "Incoming GSO frames and GRO disabled on the interface"); } if ((vheader->flags & VIRTIO_NET_HDR_F_DATA_VALID) > 0) return 1; @@ -389,8 +388,9 @@ static int build_raw_transport_data(struct vector_private *vp) vp->verify_header = &raw_verify_header; vp->header_size = sizeof(struct virtio_net_hdr); vp->rx_header_size = sizeof(struct virtio_net_hdr); - vp->dev->features |= NETIF_F_HW_CSUM; /* TSO does not work on RAW */ - printk(KERN_INFO "raw: using vnet headers to offload checksum"); + vp->dev->hw_features |= (NETIF_F_GRO); /* TSO does not work on RAW */ + vp->dev->features |= (NETIF_F_RXCSUM | NETIF_F_HW_CSUM | NETIF_F_GRO); + printk(KERN_INFO "raw: using vnet headers for tso and tx/rx checksum"); } return 0; } @@ -402,7 +402,10 @@ static int build_tap_transport_data(struct vector_private *vp) vp->verify_header = &raw_verify_header; vp->header_size = sizeof(struct virtio_net_hdr); vp->rx_header_size = sizeof(struct virtio_net_hdr); - vp->dev->features |= (NETIF_F_HW_CSUM | NETIF_F_TSO); + vp->dev->hw_features |= (NETIF_F_TSO | NETIF_F_GSO | NETIF_F_GRO); + vp->dev->features |= + (NETIF_F_RXCSUM | NETIF_F_HW_CSUM | + NETIF_F_TSO | NETIF_F_GSO | NETIF_F_GRO_BIT); printk(KERN_INFO "tap/raw: using vnet headers for tso and tx/rx checksum"); } else { return 0; /* do not try to enable tap too if raw failed */ diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c index 9210bf2db569..259c3c639eab 100644 --- a/arch/um/drivers/vector_user.c +++ b/arch/um/drivers/vector_user.c @@ -115,7 +115,7 @@ static struct vector_fds *user_init_tap_fds(struct arglist *ifspec) struct ifreq ifr; int fd = -1; struct sockaddr_ll sock; - int err = -ENOMEM; + int err = -ENOMEM, offload; char *iface; struct vector_fds *result = NULL; @@ -153,6 +153,9 @@ static struct vector_fds *user_init_tap_fds(struct arglist *ifspec) goto tap_cleanup; } + offload = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6; + ioctl(fd, TUNSETOFFLOAD, offload); + /* RAW */ fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); -- 2.11.0 |
From: <ant...@ko...> - 2017-10-06 07:35:18
|
From: Anton Ivanov <ant...@ca...> 1. Removes the need to walk the IRQ/Device list to determine who triggered the IRQ. 2. Improves scalability (up to several times performance improvement for cases with 10s of devices). 3. Improves UML baseline IO performance for one disk + one NIC use case by up to 10%. 4. Introduces write poll triggered IRQs. 5. Prerequisite for introducing high performance mmesg family of functions in network IO. 6. Fixes RNG shutdown which was leaking a file descriptor Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/drivers/chan_kern.c | 53 +---- arch/um/drivers/line.c | 2 +- arch/um/drivers/random.c | 11 +- arch/um/drivers/ubd_kern.c | 4 +- arch/um/include/shared/irq_user.h | 12 +- arch/um/include/shared/os.h | 17 +- arch/um/kernel/irq.c | 460 ++++++++++++++++++++++++-------------- arch/um/os-Linux/irq.c | 202 +++++++++-------- 8 files changed, 444 insertions(+), 317 deletions(-) diff --git a/arch/um/drivers/chan_kern.c b/arch/um/drivers/chan_kern.c index acbe6c67afba..05588f9466c7 100644 --- a/arch/um/drivers/chan_kern.c +++ b/arch/um/drivers/chan_kern.c @@ -171,56 +171,19 @@ int enable_chan(struct line *line) return err; } -/* Items are added in IRQ context, when free_irq can't be called, and - * removed in process context, when it can. - * This handles interrupt sources which disappear, and which need to - * be permanently disabled. This is discovered in IRQ context, but - * the freeing of the IRQ must be done later. - */ -static DEFINE_SPINLOCK(irqs_to_free_lock); -static LIST_HEAD(irqs_to_free); - -void free_irqs(void) -{ - struct chan *chan; - LIST_HEAD(list); - struct list_head *ele; - unsigned long flags; - - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_splice_init(&irqs_to_free, &list); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - - list_for_each(ele, &list) { - chan = list_entry(ele, struct chan, free_list); - - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } -} - static void close_one_chan(struct chan *chan, int delay_free_irq) { - unsigned long flags; - if (!chan->opened) return; - if (delay_free_irq) { - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_add(&chan->free_list, &irqs_to_free); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - } - else { - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } + /* we can safely call free now - it will be marked + * as free and freed once the IRQ stopped processing + */ + if (chan->input && chan->enabled) + um_free_irq(chan->line->driver->read_irq, chan); + if (chan->output && chan->enabled) + um_free_irq(chan->line->driver->write_irq, chan); + chan->enabled = 0; if (chan->ops->close != NULL) (*chan->ops->close)(chan->fd, chan->data); diff --git a/arch/um/drivers/line.c b/arch/um/drivers/line.c index 366e57f5e8d6..8d80b27502e6 100644 --- a/arch/um/drivers/line.c +++ b/arch/um/drivers/line.c @@ -284,7 +284,7 @@ int line_setup_irq(int fd, int input, int output, struct line *line, void *data) if (err) return err; if (output) - err = um_request_irq(driver->write_irq, fd, IRQ_WRITE, + err = um_request_irq(driver->write_irq, fd, IRQ_NONE, line_write_interrupt, IRQF_SHARED, driver->write_irq_name, data); return err; diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c index 37c51a6be690..778a0e52d5a5 100644 --- a/arch/um/drivers/random.c +++ b/arch/um/drivers/random.c @@ -13,6 +13,7 @@ #include <linux/miscdevice.h> #include <linux/delay.h> #include <linux/uaccess.h> +#include <init.h> #include <irq_kern.h> #include <os.h> @@ -154,7 +155,14 @@ static int __init rng_init (void) /* * rng_cleanup - shutdown RNG module */ -static void __exit rng_cleanup (void) + +static void cleanup(void) +{ + free_irq_by_fd(random_fd); + os_close_file(random_fd); +} + +static void __exit rng_cleanup(void) { os_close_file(random_fd); misc_deregister (&rng_miscdev); @@ -162,6 +170,7 @@ static void __exit rng_cleanup (void) module_init (rng_init); module_exit (rng_cleanup); +__uml_exitcall(cleanup); MODULE_DESCRIPTION("UML Host Random Number Generator (RNG) driver"); MODULE_LICENSE("GPL"); diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c index b55fe9bf5d3e..d4e8c497ae86 100644 --- a/arch/um/drivers/ubd_kern.c +++ b/arch/um/drivers/ubd_kern.c @@ -1587,11 +1587,11 @@ int io_thread(void *arg) do { res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n); - if (res > 0) { + if (res >= 0) { written += res; } else { if (res != -EAGAIN) { - printk("io_thread - read failed, fd = %d, " + printk("io_thread - write failed, fd = %d, " "err = %d\n", kernel_fd, -n); } } diff --git a/arch/um/include/shared/irq_user.h b/arch/um/include/shared/irq_user.h index df5633053957..a7a6120f19d5 100644 --- a/arch/um/include/shared/irq_user.h +++ b/arch/um/include/shared/irq_user.h @@ -7,6 +7,7 @@ #define __IRQ_USER_H__ #include <sysdep/ptrace.h> +#include <stdbool.h> struct irq_fd { struct irq_fd *next; @@ -15,10 +16,17 @@ struct irq_fd { int type; int irq; int events; - int current_events; + bool active; + bool pending; + bool purge; }; -enum { IRQ_READ, IRQ_WRITE }; +#define IRQ_READ 0 +#define IRQ_WRITE 1 +#define IRQ_NONE 2 +#define MAX_IRQ_TYPE (IRQ_NONE + 1) + + struct siginfo; extern void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs); diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index 574e03fc7ba2..edbe6563d5c5 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -290,15 +290,16 @@ extern void halt_skas(void); extern void reboot_skas(void); /* irq.c */ -extern int os_waiting_for_events(struct irq_fd *active_fds); -extern int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds); -extern void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2); -extern void os_free_irq_later(struct irq_fd *active_fds, - int irq, void *dev_id); -extern int os_get_pollfd(int i); -extern void os_set_pollfd(int i, int fd); +extern int os_waiting_for_events_epoll(void); +extern void *os_epoll_get_data_pointer(int index); +extern int os_epoll_triggered(int index, int events); +extern int os_event_mask(int irq_type); +extern int os_setup_epoll(void); +extern int os_add_epoll_fd(int events, int fd, void *data); +extern int os_mod_epoll_fd(int events, int fd, void *data); +extern int os_del_epoll_fd(int fd); extern void os_set_ioignore(void); +extern void os_close_epoll_fd(void); /* sigio.c */ extern int add_sigio_fd(int fd); diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c index 23cb9350d47e..980148d56537 100644 --- a/arch/um/kernel/irq.c +++ b/arch/um/kernel/irq.c @@ -1,4 +1,6 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL * Derived (i.e. mostly copied) from arch/i386/kernel/irq.c: @@ -16,243 +18,361 @@ #include <as-layout.h> #include <kern_util.h> #include <os.h> +#include <irq_user.h> -/* - * This list is accessed under irq_lock, except in sigio_handler, - * where it is safe from being modified. IRQ handlers won't change it - - * if an IRQ source has vanished, it will be freed by free_irqs just - * before returning from sigio_handler. That will process a separate - * list of irqs to free, with its own locking, coming back here to - * remove list elements, taking the irq_lock to do so. + +/* When epoll triggers we do not know why it did so + * we can also have different IRQs for read and write. + * This is why we keep a small irq_fd array for each fd - + * one entry per IRQ type */ -static struct irq_fd *active_fds = NULL; -static struct irq_fd **last_irq_ptr = &active_fds; -extern void free_irqs(void); +struct irq_entry { + struct irq_entry *next; + int fd; + struct irq_fd *irq_array[MAX_IRQ_TYPE + 1]; +}; + +static struct irq_entry *active_fds; + +static DEFINE_SPINLOCK(irq_lock); + +static void irq_io_loop(struct irq_fd *irq, struct uml_pt_regs *regs) +{ +/* + * irq->active guards against reentry + * irq->pending accumulates pending requests + * if pending is raised the irq_handler is re-run + * until pending is cleared + */ + if (irq->active) { + irq->active = false; + do { + irq->pending = false; + do_IRQ(irq->irq, regs); + } while (irq->pending && (!irq->purge)); + if (!irq->purge) + irq->active = true; + } else { + irq->pending = true; + } +} void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs) { - struct irq_fd *irq_fd; - int n; + struct irq_entry *irq_entry; + struct irq_fd *irq; + + int n, i, j; while (1) { - n = os_waiting_for_events(active_fds); + /* This is now lockless - epoll keeps back-referencesto the irqs + * which have trigger it so there is no need to walk the irq + * list and lock it every time. We avoid locking by turning off + * IO for a specific fd by executing os_del_epoll_fd(fd) before + * we do any changes to the actual data structures + */ + n = os_waiting_for_events_epoll(); + if (n <= 0) { if (n == -EINTR) continue; - else break; + else + break; } - for (irq_fd = active_fds; irq_fd != NULL; - irq_fd = irq_fd->next) { - if (irq_fd->current_events != 0) { - irq_fd->current_events = 0; - do_IRQ(irq_fd->irq, regs); + for (i = 0; i < n ; i++) { + /* Epoll back reference is the entry with 3 irq_fd + * leaves - one for each irq type. + */ + irq_entry = (struct irq_entry *) + os_epoll_get_data_pointer(i); + for (j = 0; j < MAX_IRQ_TYPE ; j++) { + irq = irq_entry->irq_array[j]; + if (irq == NULL) + continue; + if (os_epoll_triggered(i, irq->events) > 0) + irq_io_loop(irq, regs); + if (irq->purge) { + irq_entry->irq_array[j] = NULL; + kfree(irq); + } } } } +} + +static int assign_epoll_events_to_irq(struct irq_entry *irq_entry) +{ + int i; + int events = 0; + struct irq_fd *irq; - free_irqs(); + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + irq = irq_entry->irq_array[i]; + if (irq != NULL) + events = irq->events | events; + } + if (events > 0) { + /* os_add_epoll will call os_mod_epoll if this already exists */ + return os_add_epoll_fd(events, irq_entry->fd, irq_entry); + } + /* No events - delete */ + return os_del_epoll_fd(irq_entry->fd); } -static DEFINE_SPINLOCK(irq_lock); + static int activate_fd(int irq, int fd, int type, void *dev_id) { - struct pollfd *tmp_pfd; - struct irq_fd *new_fd, *irq_fd; + struct irq_fd *new_fd; + struct irq_entry *irq_entry; + int i, err, events; unsigned long flags; - int events, err, n; err = os_set_fd_async(fd); if (err < 0) goto out; - err = -ENOMEM; - new_fd = kmalloc(sizeof(struct irq_fd), GFP_KERNEL); - if (new_fd == NULL) - goto out; + spin_lock_irqsave(&irq_lock, flags); - if (type == IRQ_READ) - events = UM_POLLIN | UM_POLLPRI; - else events = UM_POLLOUT; - *new_fd = ((struct irq_fd) { .next = NULL, - .id = dev_id, - .fd = fd, - .type = type, - .irq = irq, - .events = events, - .current_events = 0 } ); + /* Check if we have an entry for this fd */ err = -EBUSY; - spin_lock_irqsave(&irq_lock, flags); - for (irq_fd = active_fds; irq_fd != NULL; irq_fd = irq_fd->next) { - if ((irq_fd->fd == fd) && (irq_fd->type == type)) { - printk(KERN_ERR "Registering fd %d twice\n", fd); - printk(KERN_ERR "Irqs : %d, %d\n", irq_fd->irq, irq); - printk(KERN_ERR "Ids : 0x%p, 0x%p\n", irq_fd->id, - dev_id); + for (irq_entry = active_fds; + irq_entry != NULL; irq_entry = irq_entry->next) { + if (irq_entry->fd == fd) + break; + } + + if (irq_entry == NULL) { + /* This needs to be atomic as it may be called from an + * IRQ context. + */ + irq_entry = kmalloc(sizeof(struct irq_entry), GFP_ATOMIC); + if (irq_entry == NULL) { + printk(KERN_ERR + "Failed to allocate new IRQ entry\n"); goto out_unlock; } + irq_entry->fd = fd; + for (i = 0; i < MAX_IRQ_TYPE; i++) + irq_entry->irq_array[i] = NULL; + irq_entry->next = active_fds; + active_fds = irq_entry; } - if (type == IRQ_WRITE) - fd = -1; - - tmp_pfd = NULL; - n = 0; + /* Check if we are trying to re-register an interrupt for a + * particular fd + */ - while (1) { - n = os_create_pollfd(fd, events, tmp_pfd, n); - if (n == 0) - break; + if (irq_entry->irq_array[type] != NULL) { + printk(KERN_ERR + "Trying to reregister IRQ %d FD %d TYPE %d ID %p\n", + irq, fd, type, dev_id + ); + goto out_unlock; + } else { + /* New entry for this fd */ + + err = -ENOMEM; + new_fd = kmalloc(sizeof(struct irq_fd), GFP_ATOMIC); + if (new_fd == NULL) + goto out_unlock; - /* - * n > 0 - * It means we couldn't put new pollfd to current pollfds - * and tmp_fds is NULL or too small for new pollfds array. - * Needed size is equal to n as minimum. - * - * Here we have to drop the lock in order to call - * kmalloc, which might sleep. - * If something else came in and changed the pollfds array - * so we will not be able to put new pollfd struct to pollfds - * then we free the buffer tmp_fds and try again. + events = os_event_mask(type); + + *new_fd = ((struct irq_fd) { + .id = dev_id, + .irq = irq, + .type = type, + .events = events, + .active = true, + .pending = false, + .purge = false + }); + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop */ - spin_unlock_irqrestore(&irq_lock, flags); - kfree(tmp_pfd); - - tmp_pfd = kmalloc(n, GFP_KERNEL); - if (tmp_pfd == NULL) - goto out_kfree; - - spin_lock_irqsave(&irq_lock, flags); + os_del_epoll_fd(irq_entry->fd); + irq_entry->irq_array[type] = new_fd; } - *last_irq_ptr = new_fd; - last_irq_ptr = &new_fd->next; - + /* Turn back IO on with the correct (new) IO event mask */ + assign_epoll_events_to_irq(irq_entry); spin_unlock_irqrestore(&irq_lock, flags); - - /* - * This calls activate_fd, so it has to be outside the critical - * section. - */ - maybe_sigio_broken(fd, (type == IRQ_READ)); + maybe_sigio_broken(fd, (type != IRQ_NONE)); return 0; - - out_unlock: +out_unlock: spin_unlock_irqrestore(&irq_lock, flags); - out_kfree: - kfree(new_fd); - out: +out: return err; } -static void free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg) +/* + * Walk the IRQ list and dispose of any unused entries. + * Should be done under irq_lock. + */ + +static void garbage_collect_irq_entries(void) { - unsigned long flags; + int i; + bool reap; + struct irq_entry *walk; + struct irq_entry *previous = NULL; + struct irq_entry *to_free; - spin_lock_irqsave(&irq_lock, flags); - os_free_irq_by_cb(test, arg, active_fds, &last_irq_ptr); - spin_unlock_irqrestore(&irq_lock, flags); + if (active_fds == NULL) + return; + walk = active_fds; + while (walk != NULL) { + reap = true; + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (walk->irq_array[i] != NULL) { + reap = false; + break; + } + } + if (reap) { + if (previous == NULL) + active_fds = walk->next; + else + previous->next = walk->next; + to_free = walk; + } else { + to_free = NULL; + } + walk = walk->next; + if (to_free != NULL) + kfree(to_free); + } } -struct irq_and_dev { - int irq; - void *dev; -}; +/* + * Walk the IRQ list and get the descriptor for our FD + */ -static int same_irq_and_dev(struct irq_fd *irq, void *d) +static struct irq_entry *get_irq_entry_by_fd(int fd) { - struct irq_and_dev *data = d; + struct irq_entry *walk = active_fds; - return ((irq->irq == data->irq) && (irq->id == data->dev)); + while (walk != NULL) { + if (walk->fd == fd) + return walk; + walk = walk->next; + } + return NULL; } -static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) -{ - struct irq_and_dev data = ((struct irq_and_dev) { .irq = irq, - .dev = dev }); - free_irq_by_cb(same_irq_and_dev, &data); -} +/* + * Walk the IRQ list and dispose of an entry for a specific + * device, fd and number. Note - if sharing an IRQ for read + * and writefor the same FD it will be disposed in either case. + * If this behaviour is undesirable use different IRQ ids. + */ -static int same_fd(struct irq_fd *irq, void *fd) -{ - return (irq->fd == *((int *)fd)); -} +#define IGNORE_IRQ 1 +#define IGNORE_DEV (1<<1) -void free_irq_by_fd(int fd) +static void do_free_by_irq_and_dev( + struct irq_entry *irq_entry, + unsigned int irq, + void *dev, + int flags +) { - free_irq_by_cb(same_fd, &fd); + int i; + struct irq_fd *to_free; + + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (irq_entry->irq_array[i] != NULL) { + if ( + ((flags & IGNORE_IRQ) || + (irq_entry->irq_array[i]->irq == irq)) && + ((flags & IGNORE_DEV) || + (irq_entry->irq_array[i]->id == dev)) + ) { + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop + */ + os_del_epoll_fd(irq_entry->fd); + to_free = irq_entry->irq_array[i]; + irq_entry->irq_array[i] = NULL; + assign_epoll_events_to_irq(irq_entry); + if (to_free->active) + to_free->purge = true; + else + kfree(to_free); + } + } + } } -/* Must be called with irq_lock held */ -static struct irq_fd *find_irq_by_fd(int fd, int irqnum, int *index_out) +void free_irq_by_fd(int fd) { - struct irq_fd *irq; - int i = 0; - int fdi; + struct irq_entry *to_free; + unsigned long flags; - for (irq = active_fds; irq != NULL; irq = irq->next) { - if ((irq->fd == fd) && (irq->irq == irqnum)) - break; - i++; - } - if (irq == NULL) { - printk(KERN_ERR "find_irq_by_fd doesn't have descriptor %d\n", - fd); - goto out; - } - fdi = os_get_pollfd(i); - if ((fdi != -1) && (fdi != fd)) { - printk(KERN_ERR "find_irq_by_fd - mismatch between active_fds " - "and pollfds, fd %d vs %d, need %d\n", irq->fd, - fdi, fd); - irq = NULL; - goto out; + spin_lock_irqsave(&irq_lock, flags); + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); } - *index_out = i; - out: - return irq; + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); } -void reactivate_fd(int fd, int irqnum) +static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irq, + dev, + 0 + ); + to_free = to_free->next; } - os_set_pollfd(i, irq->fd); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); +} - add_sigio_fd(fd); + +void reactivate_fd(int fd, int irqnum) +{ + /** NOP - we do auto-EOI now **/ } void deactivate_fd(int fd, int irqnum) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; + os_del_epoll_fd(fd); spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irqnum, + NULL, + IGNORE_DEV + ); } - - os_set_pollfd(i, -1); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); - ignore_sigio_fd(fd); } EXPORT_SYMBOL(deactivate_fd); @@ -265,17 +385,28 @@ EXPORT_SYMBOL(deactivate_fd); */ int deactivate_all_fds(void) { - struct irq_fd *irq; - int err; + unsigned long flags; + struct irq_entry *to_free; - for (irq = active_fds; irq != NULL; irq = irq->next) { - err = os_clear_fd_async(irq->fd); - if (err) - return err; - } - /* If there is a signal already queued, after unblocking ignore it */ + spin_lock_irqsave(&irq_lock, flags); + /* Stop IO. The IRQ loop has no lock so this is our + * only way of making sure we are safe to dispose + * of all IRQ handlers + */ os_set_ioignore(); - + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); + to_free = to_free->next; + } + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); + os_close_epoll_fd(); return 0; } @@ -353,8 +484,11 @@ void __init init_IRQ(void) irq_set_chip_and_handler(TIMER_IRQ, &SIGVTALRM_irq_type, handle_edge_irq); + for (i = 1; i < NR_IRQS; i++) irq_set_chip_and_handler(i, &normal_irq_type, handle_edge_irq); + /* Initialize EPOLL Loop */ + os_setup_epoll(); } /* diff --git a/arch/um/os-Linux/irq.c b/arch/um/os-Linux/irq.c index b9afb74b79ad..365823010346 100644 --- a/arch/um/os-Linux/irq.c +++ b/arch/um/os-Linux/irq.c @@ -1,135 +1,147 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL */ #include <stdlib.h> #include <errno.h> -#include <poll.h> +#include <sys/epoll.h> #include <signal.h> #include <string.h> #include <irq_user.h> #include <os.h> #include <um_malloc.h> +/* Epoll support */ + +static int epollfd = -1; + +#define MAX_EPOLL_EVENTS 64 + +static struct epoll_event epoll_events[MAX_EPOLL_EVENTS]; + +/* Helper to return an Epoll data pointer from an epoll event structure. + * We need to keep this one on the userspace side to keep includes separate + */ + +void *os_epoll_get_data_pointer(int index) +{ + return epoll_events[index].data.ptr; +} + +/* Helper to compare events versus the events in the epoll structure. + * Same as above - needs to be on the userspace side + */ + + +int os_epoll_triggered(int index, int events) +{ + return epoll_events[index].events & events; +} +/* Helper to set the event mask. + * The event mask is opaque to the kernel side, because it does not have + * access to the right includes/defines for EPOLL constants. + */ + +int os_event_mask(int irq_type) +{ + if (irq_type == IRQ_READ) + return EPOLLIN | EPOLLPRI; + if (irq_type == IRQ_WRITE) + return EPOLLOUT; + return 0; +} + /* - * Locked by irq_lock in arch/um/kernel/irq.c. Changed by os_create_pollfd - * and os_free_irq_by_cb, which are called under irq_lock. + * Initial Epoll Setup */ -static struct pollfd *pollfds = NULL; -static int pollfds_num = 0; -static int pollfds_size = 0; +int os_setup_epoll(void) +{ + epollfd = epoll_create(MAX_EPOLL_EVENTS); + return epollfd; +} -int os_waiting_for_events(struct irq_fd *active_fds) +/* + * Helper to run the actual epoll_wait + */ +int os_waiting_for_events_epoll(void) { - struct irq_fd *irq_fd; - int i, n, err; + int n, err; - n = poll(pollfds, pollfds_num, 0); + n = epoll_wait(epollfd, + (struct epoll_event *) &epoll_events, MAX_EPOLL_EVENTS, 0); if (n < 0) { err = -errno; if (errno != EINTR) - printk(UM_KERN_ERR "os_waiting_for_events:" - " poll returned %d, errno = %d\n", n, errno); + printk( + UM_KERN_ERR "os_waiting_for_events:" + " epoll returned %d, error = %s\n", n, + strerror(errno) + ); return err; } - - if (n == 0) - return 0; - - irq_fd = active_fds; - - for (i = 0; i < pollfds_num; i++) { - if (pollfds[i].revents != 0) { - irq_fd->current_events = pollfds[i].revents; - pollfds[i].fd = -1; - } - irq_fd = irq_fd->next; - } return n; } -int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds) -{ - if (pollfds_num == pollfds_size) { - if (size_tmpfds <= pollfds_size * sizeof(pollfds[0])) { - /* return min size needed for new pollfds area */ - return (pollfds_size + 1) * sizeof(pollfds[0]); - } - - if (pollfds != NULL) { - memcpy(tmp_pfd, pollfds, - sizeof(pollfds[0]) * pollfds_size); - /* remove old pollfds */ - kfree(pollfds); - } - pollfds = tmp_pfd; - pollfds_size++; - } else - kfree(tmp_pfd); /* remove not used tmp_pfd */ - - pollfds[pollfds_num] = ((struct pollfd) { .fd = fd, - .events = events, - .revents = 0 }); - pollfds_num++; - - return 0; -} -void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2) +/* + * Helper to add a fd to epoll + */ +int os_add_epoll_fd(int events, int fd, void *data) { - struct irq_fd **prev; - int i = 0; - - prev = &active_fds; - while (*prev != NULL) { - if ((*test)(*prev, arg)) { - struct irq_fd *old_fd = *prev; - if ((pollfds[i].fd != -1) && - (pollfds[i].fd != (*prev)->fd)) { - printk(UM_KERN_ERR "os_free_irq_by_cb - " - "mismatch between active_fds and " - "pollfds, fd %d vs %d\n", - (*prev)->fd, pollfds[i].fd); - goto out; - } - - pollfds_num--; - - /* - * This moves the *whole* array after pollfds[i] - * (though it doesn't spot as such)! - */ - memmove(&pollfds[i], &pollfds[i + 1], - (pollfds_num - i) * sizeof(pollfds[0])); - if (*last_irq_ptr2 == &old_fd->next) - *last_irq_ptr2 = prev; - - *prev = (*prev)->next; - if (old_fd->type == IRQ_WRITE) - ignore_sigio_fd(old_fd->fd); - kfree(old_fd); - continue; - } - prev = &(*prev)->next; - i++; - } - out: - return; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events | EPOLLET; + result = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event); + if ((result) && (errno == EEXIST)) + result = os_mod_epoll_fd(events, fd, data); + if (result) + printk("epollctl add err fd %d, %s\n", fd, strerror(errno)); + return result; } -int os_get_pollfd(int i) +/* + * Helper to mod the fd event mask and/or data backreference + */ +int os_mod_epoll_fd(int events, int fd, void *data) { - return pollfds[i].fd; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events; + result = epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &event); + if (result) + printk(UM_KERN_ERR + "epollctl mod err fd %d, %s\n", fd, strerror(errno)); + return result; } -void os_set_pollfd(int i, int fd) +/* + * Helper to delete the epoll fd + */ +int os_del_epoll_fd(int fd) { - pollfds[i].fd = fd; + struct epoll_event event; + int result; + /* This is quiet as we use this as IO ON/OFF - so it is often + * invoked on a non-existent fd + */ + result = epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, &event); + return result; } void os_set_ioignore(void) { signal(SIGIO, SIG_IGN); } + +void os_close_epoll_fd(void) +{ + /* Needed so we do not leak an fd when rebooting */ + os_close_file(epollfd); +} -- 2.11.0 |
From: <ant...@ko...> - 2017-10-06 07:35:18
|
From: Anton Ivanov <ant...@ca...> 1. Provides infrastructure for vector IO using recvmmsg/sendmmsg. 1.1. Multi-message read. 1.2. Multi-message write. 1.3. Optimized queue support for multi-packet enqueue/dequeue. 1.4. BQL/DQL support. 2. Implements transports for several transports as well support for direct wiring of PWEs to NIC. Allows direct connection of VMs to host, other VMs and network devices with no switch in use. 2.1. Raw socket >4 times faster than existing pcap based transport 2.2. Tap transport using socket RX and tap xmit. >3 times faster RX than existing tap, 10+ times faster on TX. 2.3. GRE transport - direct wiring to GRE PWE, >3 times faster RX than any existing transport. 2.4. L2TPv3 transport - direct wiring to L2TPv3 PWE, > 3 times faster RX than any existing transport. 2.5. TX in all cases shows only minor improvement, but consumes significantly less CPU under load. 3. Tuning and performance related information via ethtool. 4. Rudimentary BPF support - used in tap only to avoid software looping 5. Scatter Gather support. 6. VNET and checksum offload support for raw socket transport. Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/Kconfig.net | 11 + arch/um/drivers/Makefile | 4 +- arch/um/drivers/net_kern.c | 4 +- arch/um/drivers/vector_kern.c | 1531 +++++++++++++++++++++++++++++++++++ arch/um/drivers/vector_kern.h | 128 +++ arch/um/drivers/vector_transports.c | 430 ++++++++++ arch/um/drivers/vector_user.c | 528 ++++++++++++ arch/um/drivers/vector_user.h | 90 ++ arch/um/include/asm/irq.h | 12 + arch/um/include/shared/net_kern.h | 2 + 10 files changed, 2737 insertions(+), 3 deletions(-) create mode 100644 arch/um/drivers/vector_kern.c create mode 100644 arch/um/drivers/vector_kern.h create mode 100644 arch/um/drivers/vector_transports.c create mode 100644 arch/um/drivers/vector_user.c create mode 100644 arch/um/drivers/vector_user.h diff --git a/arch/um/Kconfig.net b/arch/um/Kconfig.net index 820a56f00332..0516dc76e6aa 100644 --- a/arch/um/Kconfig.net +++ b/arch/um/Kconfig.net @@ -108,6 +108,17 @@ config UML_NET_DAEMON more than one without conflict. If you don't need UML networking, say N. +config UML_NET_VECTOR + bool "Vector I/O high performance network devices" + depends on UML_NET + help + This User-Mode Linux network driver uses multi-message send + and receive functions. The host running the UML guest must have + a linux kernel version above 3.0 and a libc version > 2.13. + This driver provides tap, raw, gre and l2tpv3 network transports + with up to 4 times higher network throughput than the UML network + drivers. + config UML_NET_VDE bool "VDE transport" depends on UML_NET diff --git a/arch/um/drivers/Makefile b/arch/um/drivers/Makefile index e7582e1d248c..16b3cebddafb 100644 --- a/arch/um/drivers/Makefile +++ b/arch/um/drivers/Makefile @@ -9,6 +9,7 @@ slip-objs := slip_kern.o slip_user.o slirp-objs := slirp_kern.o slirp_user.o daemon-objs := daemon_kern.o daemon_user.o +vector-objs := vector_kern.o vector_user.o vector_transports.o umcast-objs := umcast_kern.o umcast_user.o net-objs := net_kern.o net_user.o mconsole-objs := mconsole_kern.o mconsole_user.o @@ -43,6 +44,7 @@ obj-$(CONFIG_STDERR_CONSOLE) += stderr_console.o obj-$(CONFIG_UML_NET_SLIP) += slip.o slip_common.o obj-$(CONFIG_UML_NET_SLIRP) += slirp.o slip_common.o obj-$(CONFIG_UML_NET_DAEMON) += daemon.o +obj-$(CONFIG_UML_NET_VECTOR) += vector.o obj-$(CONFIG_UML_NET_VDE) += vde.o obj-$(CONFIG_UML_NET_MCAST) += umcast.o obj-$(CONFIG_UML_NET_PCAP) += pcap.o @@ -61,7 +63,7 @@ obj-$(CONFIG_BLK_DEV_COW_COMMON) += cow_user.o obj-$(CONFIG_UML_RANDOM) += random.o # pcap_user.o must be added explicitly. -USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o +USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o vector_user.o CFLAGS_null.o = -DDEV_NULL=$(DEV_NULL_PATH) include arch/um/scripts/Makefile.rules diff --git a/arch/um/drivers/net_kern.c b/arch/um/drivers/net_kern.c index 1669240c7a25..03cc5857a3b6 100644 --- a/arch/um/drivers/net_kern.c +++ b/arch/um/drivers/net_kern.c @@ -288,7 +288,7 @@ static void uml_net_user_timer_expire(unsigned long _conn) #endif } -static void setup_etheraddr(struct net_device *dev, char *str) +void uml_net_setup_etheraddr(struct net_device *dev, char *str) { unsigned char *addr = dev->dev_addr; char *end; @@ -412,7 +412,7 @@ static void eth_configure(int n, void *init, char *mac, */ snprintf(dev->name, sizeof(dev->name), "eth%d", n); - setup_etheraddr(dev, mac); + uml_net_setup_etheraddr(dev, mac); printk(KERN_INFO "Netdevice %d (%pM) : ", n, dev->dev_addr); diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c new file mode 100644 index 000000000000..f0ea7f98b86c --- /dev/null +++ b/arch/um/drivers/vector_kern.c @@ -0,0 +1,1531 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Copyright (C) 2001 Lennert Buytenhek (bu...@gn...) and + * James Leu (jl...@mi...). + * Copyright (C) 2001 by various other people who didn't put their name here. + * Licensed under the GPL. + */ + +#include <linux/version.h> +#include <linux/bootmem.h> +#include <linux/etherdevice.h> +#include <linux/ethtool.h> +#include <linux/inetdevice.h> +#include <linux/init.h> +#include <linux/list.h> +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/rtnetlink.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <linux/interrupt.h> +#include <init.h> +#include <irq_kern.h> +#include <irq_user.h> +#include <net_kern.h> +#include <os.h> +#include "mconsole_kern.h" +#include "vector_user.h" +#include "vector_kern.h" + +/* + * Adapted from network devices with the following major changes: + * All transports are static - simplifies the code significantly + * Multiple FDs/IRQs per device + * Vector IO optionally used for read/write, falling back to legacy + * based on configuration and/or availability + * Configuration is no longer positional - L2TPv3 and GRE require up to + * 10 parameters, passing this as positional is not fit for purpose. + * Only socket transports are supported + */ + + +#define DRIVER_NAME "uml-vector" +#define DRIVER_VERSION "01" +struct vector_cmd_line_arg { + struct list_head list; + int unit; + char *arguments; +}; + +struct vector_device { + struct list_head list; + struct net_device *dev; + struct platform_device pdev; + int unit; + int opened; +}; + +static LIST_HEAD(vec_cmd_line); + +static DEFINE_SPINLOCK(vector_devices_lock); +static LIST_HEAD(vector_devices); + +static int driver_registered; + +static void vector_eth_configure(int n, struct arglist *def); + +/* Argument accessors to set variables (and/or set default values) + * mtu, buffer sizing, default headroom, etc + */ + +#define DEFAULT_HEADROOM 2 +#define SAFETY_MARGIN 32 +#define DEFAULT_VECTOR_SIZE 64 +#define TX_SMALL_PACKET 128 +#define MAX_IOV_SIZE 8 + +static const struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { "rx_queue_max" }, + { "rx_queue_running_average" }, + { "tx_queue_max" }, + { "tx_queue_running_average" }, + { "rx_encaps_errors" }, + { "tx_timeout_count" }, + { "tx_restart_queue" }, + { "tx_kicks" }, + { "tx_flow_control_xon" }, + { "tx_flow_control_xoff" }, + { "rx_csum_offload_good" }, + { "rx_csum_offload_errors"}, + { "sg_ok"}, + { "sg_linearized"}, +}; + +#define VECTOR_NUM_STATS ARRAY_SIZE(ethtool_stats_keys) + +/* Used in the 4.4 OpenWRT backport */ + +#if LINUX_VERSION_CODE <= KERNEL_VERSION(4, 7, 0) +static inline void netif_trans_update(struct net_device *dev) +{ + struct netdev_queue *txq = netdev_get_tx_queue(dev, 0); + + if (txq->trans_start != jiffies) + txq->trans_start = jiffies; +} +#endif + +static void vector_reset_stats(struct vector_private *vp) +{ + vp->estats.rx_queue_max = 0; + vp->estats.rx_queue_running_average = 0; + vp->estats.tx_queue_max = 0; + vp->estats.tx_queue_running_average = 0; + vp->estats.rx_encaps_errors = 0; + vp->estats.tx_timeout_count = 0; + vp->estats.tx_restart_queue = 0; + vp->estats.tx_kicks = 0; + vp->estats.tx_flow_control_xon = 0; + vp->estats.tx_flow_control_xoff = 0; + vp->estats.sg_ok = 0; + vp->estats.sg_linearized = 0; +} + +static int get_mtu(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "mtu"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return ETH_MAX_PACKET; +} + +static int get_depth(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "depth"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_VECTOR_SIZE; +} + +static int get_headroom(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "headroom"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_HEADROOM; +} + +static int get_transport_options(struct arglist *def) +{ + char *transport = uml_vector_fetch_arg(def, "transport"); + + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return (VECTOR_RX | VECTOR_BPF); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return (VECTOR_TX | VECTOR_RX | VECTOR_BPF); + return (VECTOR_TX | VECTOR_RX); +} + + +/* A mini-buffer for packet drop read + * All of our supported transports are datagram oriented and we always + * read using recvmsg or recvmmsg. If we pass a buffer which is smaller + * than the packet size it still counts as full packet read and will + * clean the incoming stream to keep sigio/epoll happy + */ + +#define DROP_BUFFER_SIZE 32 + +static char *drop_buffer; + +/* Array backed queues optimized for bulk enqueue/dequeue and + * 1:N (small values of N) or 1:1 enqueuer/dequeuer ratios. + * For more details and full design rationale see + * http://foswiki.cambridgegreys.com/Main/EatYourTailAndEnjoyIt + */ + + +/* + * Advance the mmsg queue head by n = advance. Resets the queue to + * maximum enqueue/dequeue-at-once capacity if possible. Called by + * dequeuers. Caller must hold the head_lock! + */ + +static int vector_advancehead(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->head = + (qi->head + advance) + % qi->max_depth; + + + spin_lock(&qi->tail_lock); + qi->queue_depth -= advance; + + /* we are at 0, use this to + * reset head and tail so we can use max size vectors + */ + + if (qi->queue_depth == 0) { + qi->head = 0; + qi->tail = 0; + } + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +/* Advance the queue tail by n = advance. + * This is called by enqueuers which should hold the + * head lock already + */ + +static int vector_advancetail(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->tail = + (qi->tail + advance) + % qi->max_depth; + spin_lock(&qi->head_lock); + qi->queue_depth += advance; + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + return queue_depth; +} + +static int prep_msg(struct vector_private *vp, struct sk_buff *skb, struct iovec *iov) +{ + int iov_index = 0; + int nr_frags, frag; + skb_frag_t *skb_frag; + + nr_frags = skb_shinfo(skb)->nr_frags; + if (nr_frags > MAX_IOV_SIZE) { + if (skb_linearize(skb) != 0) + goto drop; + } + if (vp->header_size > 0) { + iov[iov_index].iov_len = vp->header_size; + vp->form_header(iov[iov_index].iov_base, skb, vp); + iov_index++; + } + iov[iov_index].iov_base = skb->data; + if (nr_frags > 0) { + iov[iov_index].iov_len = skb->len - skb->data_len; + vp->estats.sg_ok++; + } else + iov[iov_index].iov_len = skb->len; + iov_index++; + for (frag = 0; frag < nr_frags; frag++) { + skb_frag = &skb_shinfo(skb)->frags[frag]; + iov[iov_index].iov_base = skb_frag_address_safe(skb_frag); + iov[iov_index].iov_len = skb_frag_size(skb_frag); + iov_index++; + } + return iov_index; +drop: + return -1; +} +/* + * Generic vector enqueue with support for forming headers using transport + * specific callback. Allows GRE, L2TPv3, RAW and other transports + * to use a common enqueue procedure in vector mode + */ + +static int vector_enqueue(struct vector_queue *qi, struct sk_buff *skb) +{ + struct vector_private *vp = netdev_priv(qi->dev); + int queue_depth; + int packet_len; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + int iov_count; + + spin_lock(&qi->tail_lock); + spin_lock(&qi->head_lock); + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + + if (skb) + packet_len = skb->len; + + if (queue_depth < qi->max_depth) { + + *(qi->skbuff_vector + qi->tail) = skb; + mmsg_vector += qi->tail; + iov_count = prep_msg( + vp, + skb, + mmsg_vector->msg_hdr.msg_iov + ); + if (iov_count < 1) + goto drop; + mmsg_vector->msg_hdr.msg_iovlen = iov_count; + mmsg_vector->msg_hdr.msg_name = vp->fds->remote_addr; + mmsg_vector->msg_hdr.msg_namelen = vp->fds->remote_addr_size; + queue_depth = vector_advancetail(qi, 1); + } else + goto drop; + spin_unlock(&qi->tail_lock); + return queue_depth; +drop: + qi->dev->stats.tx_dropped++; + if (skb != NULL) { + packet_len = skb->len; + dev_consume_skb_any(skb); + netdev_completed_queue(qi->dev, 1, packet_len); + } + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +static int consume_vector_skbs(struct vector_queue *qi, int count) +{ + struct sk_buff *skb; + int skb_index; + int bytes_compl = 0; + + for (skb_index = qi->head; skb_index < qi->head + count; skb_index++) { + skb = *(qi->skbuff_vector + skb_index); + /* mark as empty to ensure correct destruction if + * needed + */ + bytes_compl += skb->len; + *(qi->skbuff_vector + skb_index) = NULL; + dev_consume_skb_any(skb); + } + qi->dev->stats.tx_bytes += bytes_compl; + qi->dev->stats.tx_packets += count; + netdev_completed_queue(qi->dev, count, bytes_compl); + return vector_advancehead(qi, count); +} + +/* + * Generic vector deque via sendmmsg with support for forming headers + * using transport specific callback. Allows GRE, L2TPv3, RAW and + * other transports to use a common dequeue procedure in vector mode + */ + + +static int vector_send(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *send_from; + int result = 0, send_len, queue_depth = qi->max_depth; + + if (spin_trylock(&qi->head_lock)) { + if (spin_trylock(&qi->tail_lock)) { + /* update queue_depth to current value */ + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + while (queue_depth > 0) { + /* Calculate the start of the vector */ + send_len = queue_depth; + send_from = qi->mmsg_vector; + send_from += qi->head; + /* Adjust vector size if wraparound */ + if (send_len + qi->head > qi->max_depth) + send_len = qi->max_depth - qi->head; + /* Try to TX as many packets as possible */ + if (send_len > 0) { + result = uml_vector_sendmmsg( + vp->fds->tx_fd, + send_from, + send_len, + 0 + ); + vp->in_write_poll = (result != send_len); + } + /* For some of the sendmmsg error scenarios + * we may end being unsure in the TX success + * for all packets. It is safer to declare + * them all TX-ed and blame the network. + */ + if (result < 0) { + printk(KERN_ERR "sendmmsg err=%i\n", + result); + result = send_len; + } + if (result > 0) { + queue_depth = consume_vector_skbs(qi, result); + /* This is equivalent to an TX IRQ. + * Restart the upper layers to feed us + * more packets. + */ + if (result > vp->estats.tx_queue_max) + vp->estats.tx_queue_max = result; + vp->estats.tx_queue_running_average = + (vp->estats.tx_queue_running_average + result) >> 1; + } + netif_trans_update(qi->dev); + netif_wake_queue(qi->dev); + /* if TX is busy, break out of the send loop, poll + * write IRQ will reschedule xmit for us + */ + if (result != send_len) { + vp->estats.tx_restart_queue++; + break; + } + } + } + spin_unlock(&qi->head_lock); + } else { + tasklet_schedule(&vp->tx_poll); + } + return queue_depth; +} + +/* Queue destructor. Deliberately stateless so we can use + * it in queue cleanup if initialization fails. + */ + +static void destroy_queue(struct vector_queue *qi) +{ + int i; + struct iovec *iov; + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *mmsg_vector; + + if (qi == NULL) + return; + /* deallocate any skbuffs - we rely on any unused to be + * set to NULL. + */ + if (qi->skbuff_vector != NULL) { + for (i = 0; i < qi->max_depth; i++) { + if (*(qi->skbuff_vector + i) != NULL) + dev_kfree_skb_any(*(qi->skbuff_vector + i)); + } + kfree(qi->skbuff_vector); + } + /* deallocate matching IOV structures including header buffs */ + if (qi->mmsg_vector != NULL) { + mmsg_vector = qi->mmsg_vector; + for (i = 0; i < qi->max_depth; i++) { + iov = mmsg_vector->msg_hdr.msg_iov; + if (iov != NULL) { + if ((vp->header_size > 0) && + (iov->iov_base != NULL)) + kfree(iov->iov_base); + kfree(iov); + } + mmsg_vector++; + } + kfree(qi->mmsg_vector); + } + kfree(qi); +} + +/* + * Queue constructor. Create a queue with a given side. + */ +static struct vector_queue *create_queue( + struct vector_private *vp, + int max_size, + int header_size, + int num_extra_frags) +{ + struct vector_queue *result; + int i; + struct iovec *iov; + struct mmsghdr *mmsg_vector; + + result = kmalloc(sizeof(struct vector_queue), GFP_KERNEL); + if (result == NULL) + goto out_fail; + result->max_depth = max_size; + result->dev = vp->dev; + result->mmsg_vector = kmalloc( + (sizeof(struct mmsghdr) * max_size), GFP_KERNEL); + result->skbuff_vector = kmalloc( + (sizeof(void *) * max_size), GFP_KERNEL); + if (result->mmsg_vector == NULL || result->skbuff_vector == NULL) + goto out_fail; + + mmsg_vector = result->mmsg_vector; + for (i = 0; i < max_size; i++) { + /* Clear all pointers - we use non-NULL as marking on + * what to free on destruction + */ + *(result->skbuff_vector + i) = NULL; + mmsg_vector->msg_hdr.msg_iov = NULL; + mmsg_vector++; + } + mmsg_vector = result->mmsg_vector; + result->max_iov_frags = num_extra_frags; + for (i = 0; i < max_size; i++) { + if (vp->header_size > 0) + iov = kmalloc(sizeof(struct iovec) * (3 + num_extra_frags), GFP_KERNEL); + else + iov = kmalloc(sizeof(struct iovec) * (2 + num_extra_frags), GFP_KERNEL); + if (iov == NULL) + goto out_fail; + mmsg_vector->msg_hdr.msg_iov = iov; + mmsg_vector->msg_hdr.msg_iovlen = 1; + mmsg_vector->msg_hdr.msg_control = NULL; + mmsg_vector->msg_hdr.msg_controllen = 0; + mmsg_vector->msg_hdr.msg_flags = MSG_DONTWAIT; + mmsg_vector->msg_hdr.msg_name = NULL; + mmsg_vector->msg_hdr.msg_namelen = 0; + if (vp->header_size > 0) { + iov->iov_base = kmalloc(header_size, GFP_KERNEL); + if (iov->iov_base == NULL) + goto out_fail; + iov->iov_len = header_size; + mmsg_vector->msg_hdr.msg_iovlen = 2; + iov++; + } + iov->iov_base = NULL; + iov->iov_len = 0; + mmsg_vector++; + } + spin_lock_init(&result->head_lock); + spin_lock_init(&result->tail_lock); + result->queue_depth = 0; + result->head = 0; + result->tail = 0; + return result; +out_fail: + destroy_queue(result); + return NULL; +} + +/* + * We do not use the RX queue as a proper wraparound queue for now + * This is not necessary because the consumption via netif_rx() + * happens in-line. While we can try using the return code of + * netif_rx() for flow control there are no drivers doing this today. + * For this RX specific use we ignore the tail/head locks and + * just read into a prepared queue filled with skbuffs. + */ + +/* Prepare queue for recvmmsg one-shot rx - fill with fresh sk_buffs*/ + +static void prep_queue_for_rx(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct sk_buff *skb; + struct iovec *iov; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int i; + + if (qi->queue_depth == 0) + return; + for (i = 0; i < qi->queue_depth; i++) { + /* it is OK if allocation fails - recvmmsg with NULL data in + * iov argument still performs an RX, just drops the packet + * This allows us stop faffing around with a "drop buffer" + */ + + skb = netdev_alloc_skb( + vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + iov = mmsg_vector->msg_hdr.msg_iov; + mmsg_vector->msg_len = 0; + if (vp->header_size > 0) + iov++; + if (skb != NULL) { + skb_reserve(skb, vp->headroom); + skb->dev = qi->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + skb->ip_summed = CHECKSUM_NONE; + iov->iov_base = skb->data; + iov->iov_len = vp->max_packet; + } else { + iov->iov_base = NULL; + iov->iov_len = 0; + } + *skbuff_vector = skb; + skbuff_vector++; + mmsg_vector++; + } + qi->queue_depth = 0; +} + +static struct vector_device *find_device(int n) +{ + struct vector_device *device; + struct list_head *ele; + + spin_lock(&vector_devices_lock); + list_for_each(ele, &vector_devices) { + device = list_entry(ele, struct vector_device, list); + if (device->unit == n) + goto out; + } + device = NULL; + out: + spin_unlock(&vector_devices_lock); + return device; +} + +static int vector_parse(char *str, int *index_out, char **str_out, + char **error_out) +{ + int n, len, err = -EINVAL; + char *start = str; + + len = strlen(str); + + while ((*str != ':') && (strlen(str) > 1)) + str++; + if (*str != ':') { + *error_out = "Expected ':' after device number"; + return err; + } else + *str = '\0'; + + err = kstrtouint(start, 0, &n); + if (err < 0) { + *error_out = "Bad device number"; + return err; + } + + str++; + if (find_device(n)) { + *error_out = "Device already configured"; + return err; + } + + *index_out = n; + *str_out = str; + return 0; +} + +static int vector_config(char *str, char **error_out) +{ + int err, n; + char *params; + struct arglist *parsed; + + err = vector_parse(str, &n, ¶ms, error_out); + if (err != 0) + return err; + + /* This string is broken up and the pieces used by the underlying + * driver. We should copy it to make sure things do not go wrong + * later. + */ + + params = kstrdup(params, GFP_KERNEL); + if (str == NULL) { + *error_out = "vector_config failed to strdup string"; + return -ENOMEM; + } + + parsed = uml_parse_vector_ifspec(params); + + if (parsed == NULL) { + *error_out = "vector_config failed to parse parameters"; + return -EINVAL; + } + + vector_eth_configure(n, parsed); + return 0; +} + +static int vector_id(char **str, int *start_out, int *end_out) +{ + char *end; + int n; + + n = simple_strtoul(*str, &end, 0); + if ((*end != '\0') || (end == *str)) + return -1; + + *start_out = n; + *end_out = n; + *str = end; + return n; +} + +static int vector_remove(int n, char **error_out) +{ + struct vector_device *vec_d; + struct net_device *dev; + struct vector_private *vp; + + vec_d = find_device(n); + if (vec_d == NULL) + return -ENODEV; + dev = vec_d->dev; + vp = netdev_priv(dev); + if (vp->fds != NULL) + return -EBUSY; + unregister_netdev(dev); + platform_device_unregister(&vec_d->pdev); + return 0; +} + +/* + * There is no shared per-transport initialization code, so + * we will just initialize each interface one by one and + * add them to a list + */ + +static struct platform_driver uml_net_driver = { + .driver = { + .name = DRIVER_NAME, + }, +}; + + +static void vector_device_release(struct device *dev) +{ + struct vector_device *device = dev_get_drvdata(dev); + struct net_device *netdev = device->dev; + + list_del(&device->list); + kfree(device); + free_netdev(netdev); +} + +/* Bog standard recv using recvmsg - not used normally unless the user + * explicitly specifies not to use recvmmsg vector RX. + */ + +static int vector_legacy_rx(struct vector_private *vp) +{ + int pkt_len; + struct user_msghdr hdr; + struct iovec iov[2]; /* header + data use case only */ + int iovpos = 0; + struct sk_buff *skb; + int header_check; + + hdr.msg_name = NULL; + hdr.msg_namelen = 0; + hdr.msg_iov = (struct iovec *) &iov; + hdr.msg_iovlen = 1; + hdr.msg_control = NULL; + hdr.msg_controllen = 0; + hdr.msg_flags = 0; + + if (vp->header_size > 0) { + iov[iovpos].iov_base = vp->header_rxbuffer; + iov[iovpos].iov_len = vp->rx_header_size; + hdr.msg_iovlen++; + iovpos++; + } + + skb = netdev_alloc_skb(vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + if (skb == NULL) { + /* Read a packet into drop_buffer and don't do + * anything with it. + */ + iov[iovpos].iov_base = drop_buffer; + iov[iovpos].iov_len = DROP_BUFFER_SIZE; + vp->dev->stats.rx_dropped++; + } else { + skb_reserve(skb, vp->headroom); + skb->dev = vp->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + iov[iovpos].iov_base = skb->data; + iov[iovpos].iov_len = vp->max_packet; + } + + pkt_len = uml_vector_recvmsg(vp->fds->rx_fd, &hdr, 0); + + if (skb != NULL) { + if (pkt_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + vp->header_rxbuffer, skb, vp); + if (header_check < 0) { + dev_kfree_skb_irq(skb); + vp->dev->stats.rx_dropped++; + vp->estats.rx_encaps_errors++; + return 0; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, pkt_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + dev_kfree_skb_irq(skb); + } + } + return pkt_len; +} + +/* + * Packet at a time TX which falls back to vector TX if the + * underlying transport is busy. + */ + + + +static int writev_tx(struct vector_private *vp, struct sk_buff *skb) +{ + struct iovec iov[3 + MAX_IOV_SIZE]; + int iov_count, pkt_len = 0; + + iov[0].iov_base = vp->header_txbuffer; + iov_count = prep_msg(vp, skb, (struct iovec *) &iov); + + if (iov_count < 1) + goto drop; + pkt_len = uml_vector_writev(vp->fds->tx_fd, (struct iovec *) &iov, iov_count); + + netif_trans_update(vp->dev); + netif_wake_queue(vp->dev); + + if (pkt_len > 0) { + vp->dev->stats.tx_bytes += skb->len; + vp->dev->stats.tx_packets++; + } else { + vp->dev->stats.tx_dropped++; + } + consume_skb(skb); + return pkt_len; +drop: + vp->dev->stats.tx_dropped++; + consume_skb(skb); + return pkt_len; +} + +/* + * Receive as many messages as we can in one call using the special + * mmsg vector matched to an skb vector which we prepared earlier. + */ + +static int vector_mmsg_rx(struct vector_private *vp) +{ + int packet_count, i; + struct vector_queue *qi = vp->rx_queue; + struct sk_buff *skb; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int header_check; + + /* Refresh the vector and make sure it is with new skbs and the + * iovs are updated to point to them. + */ + + prep_queue_for_rx(qi); + + /* Fire the Lazy Gun - get as many packets as we can in one go. */ + + packet_count = uml_vector_recvmmsg( + vp->fds->rx_fd, qi->mmsg_vector, qi->max_depth, 0); + + if (packet_count <= 0) + return packet_count; + + /* We treat packet processing as enqueue, buffer refresh as dequeue + * The queue_depth tells us how many buffers have been used and how + * many do we need to prep the next time prep_queue_for_rx() is called. + */ + + qi->queue_depth = packet_count; + + for (i = 0; i < packet_count; i++) { + skb = (*skbuff_vector); + if (mmsg_vector->msg_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + mmsg_vector->msg_hdr.msg_iov->iov_base, skb, vp); + if (header_check < 0) { + /* Overlay header failed to verify - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too + * complex. + */ + dev_kfree_skb_irq(skb); + vp->estats.rx_encaps_errors++; + continue; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, + mmsg_vector->msg_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + /* + * We do not need to lock on updating stats here + * The interrupt loop is non-reentrant. + */ + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + /* Overlay header too short to do anything - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too complex. + */ + if (skb != NULL) + dev_kfree_skb_irq(skb); + } + (*skbuff_vector) = NULL; + /* Move to the next buffer element */ + mmsg_vector++; + skbuff_vector++; + } + if (packet_count > 0) { + if (vp->estats.rx_queue_max < packet_count) + vp->estats.rx_queue_max = packet_count; + vp->estats.rx_queue_running_average = + (vp->estats.rx_queue_running_average + packet_count) >> 1; + } + return packet_count; +} + +static void vector_rx(struct vector_private *vp) +{ + int err; + + if ((vp->options & VECTOR_RX) > 0) + while ((err = vector_mmsg_rx(vp)) > 0) + ; + else + while ((err = vector_legacy_rx(vp)) > 0) + ; + if (err != 0) + printk(KERN_ERR "vector_rx: error(%d)\n", err); +} + +static int vector_net_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + int queue_depth = 0; + + if ((vp->options & VECTOR_TX) == 0) { + writev_tx(vp, skb); + return NETDEV_TX_OK; + } + + /* We do BQL only in the vector path, no point doing it in + * packet at a time mode as there is no device queue + */ + + netdev_sent_queue(vp->dev, skb->len); + queue_depth = vector_enqueue(vp->tx_queue, skb); + + /* if the device queue is full, stop the upper layers and + * flush it. + */ + + if (queue_depth >= vp->tx_queue->max_depth - 1) { + vp->estats.tx_kicks++; + netif_stop_queue(dev); + vector_send(vp->tx_queue); + return NETDEV_TX_OK; + } + if (skb->xmit_more) { + mod_timer(&vp->tl, vp->coalesce); + return NETDEV_TX_OK; + } + if (skb->len < TX_SMALL_PACKET) { + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); + } else + tasklet_schedule(&vp->tx_poll); + return NETDEV_TX_OK; +} + +static irqreturn_t vector_rx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + vector_rx(vp); + return IRQ_HANDLED; + +} + +static irqreturn_t vector_tx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + /* We need to pay attention to it only if we got + * -EAGAIN or -ENOBUFFS from sendmmsg. Otherwise + * we ignore it. In the future, it may be worth + * it to improve the IRQ controller a bit to make + * tweaking the IRQ mask less costly + */ + + if (vp->in_write_poll) { + tasklet_schedule(&vp->tx_poll); + } + return IRQ_HANDLED; + +} + +static int irq_rr; + +static int vector_net_close(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + + netif_stop_queue(dev); + del_timer(&vp->tl); + + if (vp->fds == NULL) + return 0; + + /* Disable and free all IRQS */ + if (vp->rx_irq > 0) { + um_free_irq(vp->rx_irq, dev); + vp->rx_irq = 0; + } + if (vp->tx_irq > 0) { + um_free_irq(vp->tx_irq, dev); + vp->tx_irq = 0; + } + tasklet_kill(&vp->tx_poll); + if (vp->fds->rx_fd > 0) { + os_close_file(vp->fds->rx_fd); + vp->fds->rx_fd = -1; + } + if (vp->fds->tx_fd > 0) { + os_close_file(vp->fds->tx_fd); + vp->fds->tx_fd = -1; + } + if (vp->bpf != NULL) + kfree(vp->bpf); + if (vp->fds->remote_addr != NULL) + kfree(vp->fds->remote_addr); + if (vp->transport_data != NULL) + kfree(vp->transport_data); + if (vp->header_rxbuffer != NULL) + kfree(vp->header_rxbuffer); + if (vp->header_txbuffer != NULL) + kfree(vp->header_txbuffer); + if (vp->rx_queue != NULL) + destroy_queue(vp->rx_queue); + if (vp->tx_queue != NULL) + destroy_queue(vp->tx_queue); + kfree(vp->fds); + vp->fds = NULL; + spin_lock_irqsave(&vp->lock, flags); + vp->opened = false; + spin_unlock_irqrestore(&vp->lock, flags); + return 0; +} + +/* TX tasklet */ + +static void vector_tx_poll(unsigned long data) +{ + struct vector_private *vp = (struct vector_private *)data; + + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} +static void vector_reset_tx(struct work_struct *work) +{ + struct vector_private *vp = + container_of(work, struct vector_private, reset_tx); + netdev_reset_queue(vp->dev); + netif_start_queue(vp->dev); + netif_wake_queue(vp->dev); +} +static int vector_net_open(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + int err = -EINVAL; + struct vector_device *vdevice; + + spin_lock_irqsave(&vp->lock, flags); + if (vp->opened) + return -ENXIO; + vp->opened = true; + spin_unlock_irqrestore(&vp->lock, flags); + + vp->fds = uml_vector_user_open(vp->unit, vp->parsed); + + if (vp->fds == NULL) + goto out_close; + + if (build_transport_data(vp) < 0) + goto out_close; + + if ((vp->options & VECTOR_RX) > 0) { + vp->rx_queue = create_queue( + vp, get_depth(vp->parsed), vp->rx_header_size, 0); + vp->rx_queue->queue_depth = get_depth(vp->parsed); + } else { + vp->header_rxbuffer = kmalloc(vp->rx_header_size, GFP_KERNEL); + if (vp->header_rxbuffer == NULL) + goto out_close; + } + if ((vp->options & VECTOR_TX) > 0) { + vp->tx_queue = create_queue( + vp, get_depth(vp->parsed), vp->header_size, MAX_IOV_SIZE); + } else { + vp->header_txbuffer = kmalloc(vp->header_size, GFP_KERNEL); + if (vp->header_txbuffer == NULL) + goto out_close; + } + + /* READ IRQ */ + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->rx_fd, + IRQ_READ, vector_rx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR "vector_open: failed to get rx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->rx_irq = irq_rr + VECTOR_BASE_IRQ; + dev->irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + + /* WRITE IRQ - we need it only if we have vector TX */ + if ((vp->options & VECTOR_TX) > 0) { + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->tx_fd, + IRQ_WRITE, vector_tx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR + "vector_open: failed to get tx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->tx_irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + } + + if ((vp->options & VECTOR_BPF) != 0) { + vp->bpf = uml_vector_default_bpf(vp->fds->rx_fd, dev->dev_addr); + } + + + /* Write Timeout Timer */ + + vp->tl.data = (unsigned long) vp; + netif_start_queue(dev); + + /* clear buffer - it can happen that the host side of the interface + * is full when we get here. In this case, new data is never queued, + * SIGIOs never arrive, and the net never works. + */ + + vector_rx(vp); + + vector_reset_stats(vp); + vdevice = find_device(vp->unit); + vdevice->opened = 1; + + if ((vp->options & VECTOR_TX) != 0) { + add_timer(&vp->tl); + } + return 0; +out_close: + vector_net_close(dev); + return err; +} + + +static void vector_net_set_multicast_list(struct net_device *dev) +{ + /* TODO: - we can do some BPF games here */ + return; +} + +static void vector_net_tx_timeout(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + vp->estats.tx_timeout_count++; + netif_trans_update(dev); + schedule_work(&vp->reset_tx); +} + +#ifdef CONFIG_NET_POLL_CONTROLLER +static void vector_net_poll_controller(struct net_device *dev) +{ + disable_irq(dev->irq); + uml_net_interrupt(dev->irq, dev); + enable_irq(dev->irq); +} +#endif + +static void vector_net_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *info) +{ + strlcpy(info->driver, DRIVER_NAME, sizeof(info->driver)); + strlcpy(info->version, DRIVER_VERSION, sizeof(info->version)); +} + +static void vector_get_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct vector_private *vp = netdev_priv(netdev); + + ring->rx_max_pending = vp->rx_queue->max_depth; + ring->tx_max_pending = vp->tx_queue->max_depth; + ring->rx_pending = vp->rx_queue->max_depth; + ring->tx_pending = vp->tx_queue->max_depth; +} + +static void vector_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch (stringset) { + case ETH_SS_TEST: + *buf = '\0'; + break; + case ETH_SS_STATS: + memcpy(buf, ðtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + default: + WARN_ON(1); + break; + } +} + +static int vector_get_sset_count(struct net_device *dev, int sset) +{ + switch (sset) { + case ETH_SS_TEST: + return 0; + case ETH_SS_STATS: + return VECTOR_NUM_STATS; + default: + return -EOPNOTSUPP; + } +} + +static void vector_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *estats, u64 *tmp_stats) +{ + struct vector_private *vp = netdev_priv(dev); + + memcpy(tmp_stats, &vp->estats, sizeof(struct vector_estats)); +} + +static int vector_get_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + ec->tx_coalesce_usecs = (vp->coalesce * 1000000) / HZ; + return 0; +} + +static int vector_set_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + vp->coalesce = (ec->tx_coalesce_usecs * HZ) / 1000000; + if (vp->coalesce == 0) + vp->coalesce = 1; + return 0; +} + +static const struct ethtool_ops vector_net_ethtool_ops = { + .get_drvinfo = vector_net_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_ts_info = ethtool_op_get_ts_info, + .get_ringparam = vector_get_ringparam, + .get_strings = vector_get_strings, + .get_sset_count = vector_get_sset_count, + .get_ethtool_stats = vector_get_ethtool_stats, + .get_coalesce = vector_get_coalesce, + .set_coalesce = vector_set_coalesce, +}; + + +static const struct net_device_ops vector_netdev_ops = { + .ndo_open = vector_net_open, + .ndo_stop = vector_net_close, + .ndo_start_xmit = vector_net_start_xmit, + .ndo_set_rx_mode = vector_net_set_multicast_list, + .ndo_tx_timeout = vector_net_tx_timeout, + .ndo_set_mac_address = eth_mac_addr, + .ndo_validate_addr = eth_validate_addr, +#ifdef CONFIG_NET_POLL_CONTROLLER + .ndo_poll_controller = vector_net_poll_controller, +#endif +}; + + +static void vector_timer_expire(unsigned long _conn) +{ + struct vector_private *vp = (struct vector_private *)_conn; + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} + +static void vector_eth_configure( + int n, + struct arglist *def + ) +{ + struct vector_device *device; + struct net_device *dev; + struct vector_private *vp; + int err; + + printk(KERN_INFO "Vector Netdevice %d\n", n); + + device = kzalloc(sizeof(*device), GFP_KERNEL); + if (device == NULL) { + printk(KERN_ERR "eth_configure failed to allocate struct " + "vector_device\n"); + return; + } + printk(KERN_INFO "vector etherdevice start %d\n", n); + dev = alloc_etherdev(sizeof(struct vector_private)); + if (dev == NULL) { + printk(KERN_ERR "eth_configure: failed to allocate struct " + "net_device for vec%d\n", n); + goto out_free_device; + } + + dev->mtu = get_mtu(def); + + INIT_LIST_HEAD(&device->list); + device->unit = n; + + /* If this name ends up conflicting with an existing registered + * netdevice, that is OK, register_netdev{,ice}() will notice this + * and fail. + */ + snprintf(dev->name, sizeof(dev->name), "vec%d", n); + uml_net_setup_etheraddr(dev, uml_vector_fetch_arg(def, "mac")); + vp = netdev_priv(dev); + + /* sysfs register */ + if (!driver_registered) { + platform_driver_register(¨_net_driver); + driver_registered = 1; + } + device->pdev.id = n; + device->pdev.name = DRIVER_NAME; + device->pdev.dev.release = vector_device_release; + dev_set_drvdata(&device->pdev.dev, device); + if (platform_device_register(&device->pdev)) + goto out_free_netdev; + SET_NETDEV_DEV(dev, &device->pdev.dev); + + device->dev = dev; + + *vp = ((struct vector_private) + { + .list = LIST_HEAD_INIT(vp->list), + .dev = dev, + .unit = n, + .options = get_transport_options(def), + .rx_irq = 0, + .tx_irq = 0, + .parsed = def, + .max_packet = get_mtu(def) + ETH_HEADER_OTHER, + /* TODO - we need to calculate headroom so that ip header + * is 16 byte aligned all the time + */ + .headroom = get_headroom(def), + .form_header = NULL, + .verify_header = NULL, + .header_rxbuffer = NULL, + .header_txbuffer = NULL, + .header_size = 0, + .rx_header_size = 0, + .rexmit_scheduled = false, + .opened = false, + .transport_data = NULL, + .in_write_poll = false, + .coalesce = 2 + }); + + dev->features = NETIF_F_SG; + tasklet_init(&vp->tx_poll, vector_tx_poll, (unsigned long)vp); + INIT_WORK(&vp->reset_tx, vector_reset_tx); + + init_timer(&vp->tl); + spin_lock_init(&vp->lock); + vp->tl.function = vector_timer_expire; + + /* FIXME */ + dev->netdev_ops = &vector_netdev_ops; + dev->ethtool_ops = &vector_net_ethtool_ops; + dev->watchdog_timeo = (HZ >> 1); + /* primary IRQ - fixme */ + dev->irq = 0; /* we will adjust this once opened */ + + rtnl_lock(); + err = register_netdevice(dev); + rtnl_unlock(); + if (err) + goto out_undo_user_init; + + spin_lock(&vector_devices_lock); + list_add(&device->list, &vector_devices); + spin_unlock(&vector_devices_lock); + + return; + +out_undo_user_init: + return; +out_free_netdev: + free_netdev(dev); +out_free_device: + kfree(device); +} + + + + +/* + * Invoked late in the init + */ + +static int __init vector_init(void) +{ + struct list_head *ele; + struct vector_cmd_line_arg *def; + struct arglist *parsed; + + list_for_each(ele, &vec_cmd_line) { + def = list_entry(ele, struct vector_cmd_line_arg, list); + parsed = uml_parse_vector_ifspec(def->arguments); + if (parsed != NULL) + vector_eth_configure(def->unit, parsed); + } + return 0; +} + + +/* Invoked at initial argument parsing, only stores + * arguments until a proper vector_init is called + * later + */ + +static int __init vector_setup(char *str) +{ + char *error; + int n, err; + struct vector_cmd_line_arg *new; + + err = vector_parse(str, &n, &str, &error); + if (err) { + printk(KERN_ERR "vector_setup - Couldn't parse '%s' : %s\n", + str, error); + return 1; + } + new = alloc_bootmem(sizeof(*new)); + INIT_LIST_HEAD(&new->list); + new->unit = n; + new->arguments = str; + list_add_tail(&new->list, &vec_cmd_line); + return 1; +} + +__setup("vec", vector_setup); +__uml_help(vector_setup, +"vec[0-9]+:<option>=<value>,<option>=<value>\n" +" Configure a vector io network device.\n\n" +); + +late_initcall(vector_init); + +static struct mc_device vector_mc = { + .list = LIST_HEAD_INIT(vector_mc.list), + .name = "vec", + .config = vector_config, + .get_config = NULL, + .id = vector_id, + .remove = vector_remove, +}; + +#ifdef CONFIG_INET +static int vector_inetaddr_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + return NOTIFY_DONE; +} + +static struct notifier_block vector_inetaddr_notifier = { + .notifier_call = vector_inetaddr_event, +}; + +static void inet_register(void) +{ + register_inetaddr_notifier(&vector_inetaddr_notifier); +} +#else +static inline void inet_register(void) +{ +} +#endif + +static int vector_net_init(void) +{ + mconsole_register_dev(&vector_mc); + inet_register(); + return 0; +} + +__initcall(vector_net_init); + + + diff --git a/arch/um/drivers/vector_kern.h b/arch/um/drivers/vector_kern.h new file mode 100644 index 000000000000..a9ade0851fda --- /dev/null +++ b/arch/um/drivers/vector_kern.h @@ -0,0 +1,128 @@ +/* + * Copyright (C) 2002 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#ifndef __UM_VECTOR_KERN_H +#define __UM_VECTOR_KERN_H + +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/skbuff.h> +#include <linux/socket.h> +#include <linux/list.h> +#include <linux/ctype.h> +#include <linux/workqueue.h> +#include <linux/interrupt.h> +#include "vector_user.h" + +/* Queue structure specially adapted for multiple enqueue/dequeue + * in a mmsgrecv/mmsgsend context + */ + +/* Dequeue method */ + +#define QUEUE_SENDMSG 0 +#define QUEUE_SENDMMSG 1 + +#define VECTOR_RX 1 +#define VECTOR_TX (1 << 1) +#define VECTOR_BPF (1 << 2) + +#define ETH_MAX_PACKET 1500 +#define ETH_HEADER_OTHER 32 /* just in case someone decides to go mad on QnQ */ + +struct vector_queue { + struct mmsghdr *mmsg_vector; + void **skbuff_vector; + /* backlink to device which owns us */ + struct net_device *dev; + spinlock_t head_lock; + spinlock_t tail_lock; + int queue_depth, head, tail, max_depth, max_iov_frags; + short options; +}; + +struct vector_estats { + uint64_t rx_queue_max; + uint64_t rx_queue_running_average; + uint64_t tx_queue_max; + uint64_t tx_queue_running_average; + uint64_t rx_encaps_errors; + uint64_t tx_timeout_count; + uint64_t tx_restart_queue; + uint64_t tx_kicks; + uint64_t tx_flow_control_xon; + uint64_t tx_flow_control_xoff; + uint64_t rx_csum_offload_good; + uint64_t rx_csum_offload_errors; + uint64_t sg_ok; + uint64_t sg_linearized; +}; + +#define VERIFY_HEADER_NOK -1 +#define VERIFY_HEADER_OK 0 +#define VERIFY_CSUM_OK 1 + +struct vector_private { + struct list_head list; + spinlock_t lock; + struct net_device *dev; + + int unit; + + /* Timeout timer in TX */ + + struct timer_list tl; + + /* Scheduled "remove device" work */ + struct work_struct reset_tx; + struct vector_fds *fds; + + struct vector_queue *rx_queue; + struct vector_queue *tx_queue; + + int rx_irq; + int tx_irq; + + struct arglist *parsed; + + void *transport_data; /* transport specific params if needed */ + + int max_packet; + int headroom; + + int options; + + /* remote address if any - some transports will leave this as null */ + + int header_size; + int rx_header_size; + int coalesce; + + void *header_rxbuffer; + void *header_txbuffer; + + int (*form_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + int (*verify_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + + spinlock_t stats_lock; + + struct tasklet_struct tx_poll; + bool rexmit_scheduled; + bool opened; + bool in_write_poll; + + /* ethtool stats */ + + struct vector_estats estats; + void *bpf; + + char user[0]; +}; + +extern int build_transport_data(struct vector_private *vp); + +#endif diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c new file mode 100644 index 000000000000..9f07d585f71b --- /dev/null +++ b/arch/um/drivers/vector_transports.c @@ -0,0 +1,430 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Licensed under the GPL. + */ + +#include <linux/etherdevice.h> +#include <linux/netdevice.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <asm/byteorder.h> +#include <uapi/linux/ip.h> +#include <uapi/linux/virtio_net.h> +#include <linux/virtio_net.h> +#include <linux/virtio_byteorder.h> +#include <linux/netdev_features.h> +#include "vector_user.h" +#include "vector_kern.h" + +#define GOOD_LINEAR 512 + +struct gre_minimal_header { + uint16_t header; + uint16_t arptype; +}; + + +struct uml_gre_data { + uint32_t rx_key; + uint32_t tx_key; + uint32_t sequence; + + bool ipv6; + bool has_sequence; + bool pin_sequence; + bool checksum; + bool key; + struct gre_minimal_header expected_header; + + uint32_t checksum_offset; + uint32_t key_offset; + uint32_t sequence_offset; + +}; + +struct uml_l2tpv3_data { + uint64_t rx_cookie; + uint64_t tx_cookie; + uint64_t rx_session; + uint64_t tx_session; + uint32_t counter; + + bool udp; + bool ipv6; + bool has_counter; + bool pin_counter; + bool cookie; + bool cookie_is_64; + + uint32_t cookie_offset; + uint32_t session_offset; + uint32_t counter_offset; +}; + +static int l2tpv3_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *counter; + + if (td->udp) + *(uint32_t *) header = cpu_to_be32(L2TPV3_DATA_PACKET); + (*(uint32_t *) (header + td->session_offset)) = td->tx_session; + + if (td->cookie) { + if (td->cookie_is_64) + (*(uint64_t *)(header + td->cookie_offset)) = + td->tx_cookie; + else + (*(uint32_t *)(header + td->cookie_offset)) = + td->tx_cookie; + } + if (td->has_counter) { + counter = (uint32_t *)(header + td->counter_offset); + if (td->pin_counter) { + *counter = 0; + } else { + td->counter++; + *counter = cpu_to_be32(td->counter); + } + } + return 0; +} + +static int gre_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_gre_data *td = vp->transport_data; + uint32_t *sequence; + *((uint32_t *) header) = *((uint32_t *) &td->expected_header); + if (td->key) + (*(uint32_t *) (header + td->key_offset)) = td->tx_key; + if (td->has_sequence) { + sequence = (uint32_t *)(header + td->sequence_offset); + if (td->pin_sequence) + *sequence = 0; + else + *sequence = cpu_to_be32(++td->sequence); + } + return 0; +} + +static int raw_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + virtio_net_hdr_from_skb(skb, vheader, virtio_legacy_is_little_endian(), false); + + return 0; +} + +static int l2tpv3_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *session; + uint64_t cookie; + + if ((!td->udp) && (!td->ipv6)) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + /* we do not do a strict check for "data" packets as per + * the RFC spec because the pure IP spec does not have + * that anyway. + */ + + if (td->cookie) { + if (td->cookie_is_64) + cookie = *(uint64_t *)(header + td->cookie_offset); + else + cookie = *(uint32_t *)(header + td->cookie_offset); + if (cookie != td->rx_cookie) { + printk(KERN_ERR "uml_l2tpv3: unknown cookie id"); + return -1; + } + } + session = (uint32_t *) (header + td->session_offset); + if (*session != td->rx_session) { + printk(KERN_ERR "uml_l2tpv3: session mismatch"); + return -1; + } + return 0; +} + +static int gre_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + + uint32_t key; + struct uml_gre_data *td = vp->transport_data; + + if (!td->ipv6) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + if (*((uint32_t *) header) != *((uint32_t *) &td->expected_header)) { + printk(KERN_ERR "header type disagreement, expecting %0x, got %0x", + *((uint32_t *) &td->expected_header), + *((uint32_t *) header) + ); + return -1; + } + + if (td->key) { + key = (*(uint32_t *)(header + td->key_offset)); + if (key != td->rx_key) { + printk(KERN_ERR "unknown key id %0x, expecting %0x", + key, td->rx_key); + return -1; + } + } + return 0; +} + +static int raw_verify_header ( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + if (vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) { + printk(KERN_ERR "raw: GSO enabled on interface, please turn off"); + return -1; /* GSO, we cannot process this */ + } + if ((vheader->flags & VIRTIO_NET_HDR_F_DATA_VALID) > 0) + return 1; + else + virtio_net_hdr_to_skb(skb, vheader, virtio_legacy_is_little_endian()); + return 0; +} + +static bool get_uint_param( + struct arglist *def, char *param, unsigned int *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoint(arg, 0, result) == 0) + return true; + } + return false; +} + +static bool get_ulong_param( + struct arglist *def, char *param, unsigned long *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoul(arg, 0, result) == 0) + return true; + return true; + } + return false; +} + +static int build_gre_transport_data(struct vector_private *vp) +{ + struct uml_gre_data *td; + int temp_int; + int temp_rx; + int temp_tx; + + vp->transport_data = kmalloc(sizeof(struct uml_gre_data), GFP_KERNEL); + if (vp->transport_data == NULL) + return -ENOMEM; + td = vp->transport_data; + td->sequence = 0; + + td->expected_header.arptype = GRE_IRB; + td->expected_header.header = 0; + + vp->form_header = &gre_form_header; + vp->verify_header = &gre_verify_header; + vp->header_size = 4; + td->key_offset = 4; + td->sequence_offset = 4; + td->checksum_offset = 4; + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + td->key = false; + if (get_uint_param(vp->parsed, "rx_key", &temp_rx)) { + if (get_uint_param(vp->parsed, "tx_key", &temp_tx)) { + td->key = true; + td->expected_header.header |= GRE_MODE_KEY; + td->rx_key = cpu_to_be32(temp_rx); + td->tx_key = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->sequence_offset += 4; + } else { + return -EINVAL; + } + } + + td->sequence = false; + if (get_uint_param(vp->parsed, "sequence", &temp_int)) { + if (temp_int > 0) { + vp->header_size += 4; + td->has_sequence = true; + td->expected_header.header |= GRE_MODE_SEQUENCE; + if (get_uint_param( + vp->parsed, "pin_sequence", &temp_int)) { + if (temp_int > 0) + td->pin_sequence = true; + } + } + } + vp->rx_header_size = vp->header_size; + if (!td->ipv6) + vp->rx_header_size += sizeof(struct iphdr); + return 0; +} + +static int build_l2tpv3_transport_data(struct vector_private *vp) +{ + + struct uml_l2tpv3_data *td; + int temp_int, temp_rxs, temp_txs; + unsigned long temp_rx; + unsigned long temp_tx; + + vp->transport_data = kmalloc( + sizeof(struct uml_l2tpv3_data), GFP_KERNEL); + + if (vp->transport_data == NULL) + return -ENOMEM; + + td = vp->transport_data; + + vp->form_header = &l2tpv3_form_header; + vp->verify_header = &l2tpv3_verify_header; + td->counter = 0; + + vp->header_size = 4; + td->session_offset = 0; + td->cookie_offset = 4; + td->counter_offset = 4; + + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + + if (get_uint_param(vp->parsed, "rx_session", &temp_rxs)) { + if (get_uint_param(vp->parsed, "tx_session", &temp_txs)) { + td->tx_session = cpu_to_be32(temp_txs); + td->rx_session = cpu_to_be32(temp_rxs); + } else { + return -EINVAL; + } + } else { + return -EINVAL; + } + + td->cookie_is_64 = false; + if (get_uint_param(vp->parsed, "cookie64", &temp_int)) { + if (temp_int > 0) + td->cookie_is_64 = true; + } + td->cookie = false; + if (get_ulong_param(vp->parsed, "rx_cookie", &temp_rx)) { + if (get_ulong_param(vp->parsed, "tx_cookie", &temp_tx)) { + td->cookie = true; + if (td->cookie_is_64) { + td->rx_cookie = cpu_to_be64(temp_rx); + td->tx_cookie = cpu_to_be64(temp_tx); + vp->header_size += 8; + td->counter_offset += 8; + } else { + td->rx_cookie = cpu_to_be32(temp_rx); + td->tx_cookie = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->counter_offset += 4; + } + } else { + return -EINVAL; + } + } + + td->has_counter = false; + if (get_uint_param(vp->parsed, "counter", &temp_int)) { + if (temp_int > 0) { + td->has_counter = true; + vp->header_size += 4; + if (get_uint_param( + vp->parsed, "pin_counter", &temp_int)) { + if (temp_int > 0) + td->pin_counter = true; + } + } + } + + if (get_uint_param(vp->parsed, "udp", &temp_int)) { + if (temp_int > 0) { + td->udp = true; + vp->header_size += 4; + td->counter_offset += 4; + td->session_offset += 4; + td->cookie_offset += 4; + } + } + + vp->rx_header_size = vp->header_size; + if ((!td->ipv6) && (!td->udp)) + vp->rx_header_size += sizeof(struct iphdr); + + return 0; +} + +static int build_raw_transport_data(struct vector_private *vp) +{ + if (uml_raw_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= NETIF_F_HW_CSUM; /* TSO does not work on RAW */ + printk(KERN_INFO "raw: using vnet headers to offload checksum"); + } + return 0; +} + +static int build_tap_transport_data(struct vector_private *vp) +{ + if (uml_raw_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= (NETIF_F_HW_CSUM | NETIF_F_TSO); + printk(KERN_INFO "tap/raw: using vnet headers for tso and tx/rx checksum"); + } else { + return 0; /* do not try to enable tap too if raw failed */ + } + if (uml_tap_enable_vnet_headers(vp->fds->tx_fd)) { + return 0; + } + return -1; +} + +int build_transport_data(struct vector_private *vp) +{ + char *transport = uml_vector_fetch_arg(vp->parsed, "transport"); + + if (strncmp(transport, TRANS_GRE, TRANS_GRE_LEN) == 0) + return build_gre_transport_data(vp); + if (strncmp(transport, TRANS_L2TPV3, TRANS_L2TPV3_LEN) == 0) + return build_l2tpv3_transport_data(vp); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return build_raw_transport_data(vp); + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return build_tap_transport_data(vp); + return 0; +} + diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c new file mode 100644 index 000000000000..9210bf2db569 --- /dev/null +++ b/arch/um/drivers/vector_user.c @@ -0,0 +1,528 @@ +/* + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#include <stdio.h> +#include <unistd.h> +#include <stdarg.h> +#include <errno.h> +#include <stddef.h> +#include <string.h> +#include <sys/ioctl.h> +#include <net/if.h> +#include <linux/if_tun.h> +#include <arpa/inet.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <net/ethernet.h> +#include <netinet/ip.h> +#include <netinet/ether.h> +#include <linux/if_ether.h> +#include <linux/if_packet.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <linux/virtio_net.h> +#include <netdb.h> +#include <stdlib.h> +#include <os.h> +#include <um_malloc.h> +#include "vector_user.h" + +#define ID_GRE 0 +#define ID_L2TPV3 1 +#define ID_MAX 1 + +#define TOKEN_IFNAME "ifname" + +#define TRANS_RAW "raw" +#define TRANS_RAW_LEN strlen(TRANS_RAW) + +/* This is very ugly and brute force lookup, but it is done + * only once at initialization so not worth doing hashes or + * anything more intelligent + */ + +char *uml_vector_fetch_arg(struct arglist *ifspec, char *token) +{ + int i; + + for (i = 0; i < ifspec->numargs; i++) { + if (strcmp(ifspec->tokens[i], token) == 0) + return ifspec->values[i]; + } + return NULL; + +} + +struct arglist *uml_parse_vector_ifspec(char *arg) +{ + struct arglist *result; + int pos, len; + bool parsing_token = true, next_starts = true; + + if (arg == NULL) + return NULL; + result = uml_kmalloc(sizeof(struct arglist), UM_GFP_KERNEL); + if (result == NULL) + return NULL; + result->numargs = 0; + len = strlen(arg); + for (pos = 0; pos < len; pos++) { + if (next_starts) { + if (parsing_token) { + result->tokens[result->numargs] = arg + pos; + } else { + result->values[result->numargs] = arg + pos; + result->numargs++; + } + next_starts = false; + } + if (*(arg + pos) == '=') { + if (parsing_token) + parsing_token = false; + else + goto cleanup; + next_starts = true; + (*(arg + pos)) = '\0'; + } + if (*(arg + pos) == ',') { + parsing_token = true; + next_starts = true; + (*(arg + pos)) = '\0'; + } + } + return result; +cleanup: + printk(UM_KERN_ERR "vector_setup - Couldn't parse '%s'\n", arg); + kfree(result); + return NULL; +} + +/* + * Socket/FD configuration functions. These return an structure + * of rx and tx descriptors to cover cases where these are not + * the same (f.e. read via raw socket and write via tap). + */ + +#define PATH_NET_TUN "/dev/net... [truncated message content] |
From: <ant...@ko...> - 2017-10-06 07:35:09
|
Patches 1 and 2 are the same as before, patch 3 adds TSO, GRO and full scatter gather support as well as ability to on/off all features from commandline and/or ethtool where applicable. The result from patch 3 is 5.6Gbit on tcp throughput using the tap/raw driver on a 3.5GHz AMD A4 desktop. I am parking this now for the time being as I needed it mostly as a test harness for another project which I now have to go and beat into shape :) Brgds, A. |
From: Anton I. <ant...@ko...> - 2017-10-05 13:50:23
|
Hi List, hi Richard, Slight bump - we are now at 5.6Gbit. Same host hardware - 3.5GHz Athlon. This is an incremental on top of the last patchset from this week which tidies up further the SG and TSO/GSO code and enables it both way to the extent possible. Richard, I would like, if possible, to start working on cleaning this up. IMHO it will be better to work off the last submitted patch version and add TSO/GSO in a next revision after (and of course if) we have it merged. TSO/GSO will drag a huge blob of ethtool (or mconsole) related code to control it, it will be better if it comes as an incremental and not with the core driver code. Example: aivanov@MonstrousNightmare:~$ iperf -c 192.168.125.118 ------------------------------------------------------------ Client connecting to 192.168.125.118, TCP port 5001 TCP window size: 85.0 KByte (default) ------------------------------------------------------------ [ 3] local 192.168.125.1 port 34304 connected with 192.168.125.118 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-10.0 sec 5.70 GBytes 4.89 Gbits/sec aivanov@MonstrousNightmare:~$ iperf -s ------------------------------------------------------------ Server listening on TCP port 5001 TCP window size: 85.3 KByte (default) ------------------------------------------------------------ [ 4] local 192.168.3.80 port 5001 connected with 192.168.125.118 port 52306 [ ID] Interval Transfer Bandwidth [ 4] 0.0-10.0 sec 6.12 GBytes 5.26 Gbits/sec A. On 10/04/17 08:29, Anton Ivanov wrote: > I got the TSO and some offloads working for some (not all) cases of my > patchsets. > > Preliminary speeds and feeds on iperf: > > [ 4] local 192.168.98.1 port 5001 connected with 192.168.98.132 port > 54838 > [ 4] 0.0-10.0 sec 4.78 GBytes 4.11 Gbits/sec > > As a comparison - kvm/qemu on same machine does about a Gbit and vEth > to container does 20Gbit - so somewhere in between. > > I am having some trouble with TSO and raw sockets. Once I figure out > what is going on (and hopefully fix it) there will be an updated > version which should take us into 4x what kvm/QEMU does on one CPU > networkwise. > > A. > > > On 09/26/17 11:26, ant...@ko... wrote: >> Hi List, hi Richard, >> >> I found a logical error in the SG code. This revision fixes that. No >> other changes - the rest appears to work as it should in my testing. >> >> Brgds, >> >> A. >> >> >> ------------------------------------------------------------------------------ >> >> Check out the vibrant tech community on one of the world's most >> engaging tech sites, Slashdot.org! http://sdm.link/slashdot >> _______________________________________________ >> User-mode-linux-devel mailing list >> Use...@li... >> https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel >> > |
From: <ant...@ko...> - 2017-10-04 14:42:23
|
From: Anton Ivanov <ant...@ca...> 1. Provides infrastructure for vector IO using recvmmsg/sendmmsg. 1.1. Multi-message read. 1.2. Multi-message write. 1.3. Optimized queue support for multi-packet enqueue/dequeue. 1.4. BQL/DQL support. 2. Implements transports for several transports as well support for direct wiring of PWEs to NIC. Allows direct connection of VMs to host, other VMs and network devices with no switch in use. 2.1. Raw socket >4 times faster than existing pcap based transport 2.2. Tap transport using socket RX and tap xmit. >3 times faster RX than existing tap, 10+ times faster on TX. 2.3. GRE transport - direct wiring to GRE PWE, >3 times faster RX than any existing transport. 2.4. L2TPv3 transport - direct wiring to L2TPv3 PWE, > 3 times faster RX than any existing transport. 2.5. TX in all cases shows only minor improvement, but consumes significantly less CPU under load. 3. Tuning and performance related information via ethtool. 4. Rudimentary BPF support - used in tap only to avoid software looping 5. Scatter Gather support. 6. VNET and checksum offload support for raw socket transport. Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/Kconfig.net | 11 + arch/um/drivers/Makefile | 4 +- arch/um/drivers/net_kern.c | 4 +- arch/um/drivers/vector_kern.c | 1531 +++++++++++++++++++++++++++++++++++ arch/um/drivers/vector_kern.h | 128 +++ arch/um/drivers/vector_transports.c | 430 ++++++++++ arch/um/drivers/vector_user.c | 528 ++++++++++++ arch/um/drivers/vector_user.h | 90 ++ arch/um/include/asm/irq.h | 12 + arch/um/include/shared/net_kern.h | 2 + 10 files changed, 2737 insertions(+), 3 deletions(-) create mode 100644 arch/um/drivers/vector_kern.c create mode 100644 arch/um/drivers/vector_kern.h create mode 100644 arch/um/drivers/vector_transports.c create mode 100644 arch/um/drivers/vector_user.c create mode 100644 arch/um/drivers/vector_user.h diff --git a/arch/um/Kconfig.net b/arch/um/Kconfig.net index 820a56f00332..0516dc76e6aa 100644 --- a/arch/um/Kconfig.net +++ b/arch/um/Kconfig.net @@ -108,6 +108,17 @@ config UML_NET_DAEMON more than one without conflict. If you don't need UML networking, say N. +config UML_NET_VECTOR + bool "Vector I/O high performance network devices" + depends on UML_NET + help + This User-Mode Linux network driver uses multi-message send + and receive functions. The host running the UML guest must have + a linux kernel version above 3.0 and a libc version > 2.13. + This driver provides tap, raw, gre and l2tpv3 network transports + with up to 4 times higher network throughput than the UML network + drivers. + config UML_NET_VDE bool "VDE transport" depends on UML_NET diff --git a/arch/um/drivers/Makefile b/arch/um/drivers/Makefile index e7582e1d248c..16b3cebddafb 100644 --- a/arch/um/drivers/Makefile +++ b/arch/um/drivers/Makefile @@ -9,6 +9,7 @@ slip-objs := slip_kern.o slip_user.o slirp-objs := slirp_kern.o slirp_user.o daemon-objs := daemon_kern.o daemon_user.o +vector-objs := vector_kern.o vector_user.o vector_transports.o umcast-objs := umcast_kern.o umcast_user.o net-objs := net_kern.o net_user.o mconsole-objs := mconsole_kern.o mconsole_user.o @@ -43,6 +44,7 @@ obj-$(CONFIG_STDERR_CONSOLE) += stderr_console.o obj-$(CONFIG_UML_NET_SLIP) += slip.o slip_common.o obj-$(CONFIG_UML_NET_SLIRP) += slirp.o slip_common.o obj-$(CONFIG_UML_NET_DAEMON) += daemon.o +obj-$(CONFIG_UML_NET_VECTOR) += vector.o obj-$(CONFIG_UML_NET_VDE) += vde.o obj-$(CONFIG_UML_NET_MCAST) += umcast.o obj-$(CONFIG_UML_NET_PCAP) += pcap.o @@ -61,7 +63,7 @@ obj-$(CONFIG_BLK_DEV_COW_COMMON) += cow_user.o obj-$(CONFIG_UML_RANDOM) += random.o # pcap_user.o must be added explicitly. -USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o +USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o vector_user.o CFLAGS_null.o = -DDEV_NULL=$(DEV_NULL_PATH) include arch/um/scripts/Makefile.rules diff --git a/arch/um/drivers/net_kern.c b/arch/um/drivers/net_kern.c index 1669240c7a25..03cc5857a3b6 100644 --- a/arch/um/drivers/net_kern.c +++ b/arch/um/drivers/net_kern.c @@ -288,7 +288,7 @@ static void uml_net_user_timer_expire(unsigned long _conn) #endif } -static void setup_etheraddr(struct net_device *dev, char *str) +void uml_net_setup_etheraddr(struct net_device *dev, char *str) { unsigned char *addr = dev->dev_addr; char *end; @@ -412,7 +412,7 @@ static void eth_configure(int n, void *init, char *mac, */ snprintf(dev->name, sizeof(dev->name), "eth%d", n); - setup_etheraddr(dev, mac); + uml_net_setup_etheraddr(dev, mac); printk(KERN_INFO "Netdevice %d (%pM) : ", n, dev->dev_addr); diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c new file mode 100644 index 000000000000..f0ea7f98b86c --- /dev/null +++ b/arch/um/drivers/vector_kern.c @@ -0,0 +1,1531 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Copyright (C) 2001 Lennert Buytenhek (bu...@gn...) and + * James Leu (jl...@mi...). + * Copyright (C) 2001 by various other people who didn't put their name here. + * Licensed under the GPL. + */ + +#include <linux/version.h> +#include <linux/bootmem.h> +#include <linux/etherdevice.h> +#include <linux/ethtool.h> +#include <linux/inetdevice.h> +#include <linux/init.h> +#include <linux/list.h> +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/rtnetlink.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <linux/interrupt.h> +#include <init.h> +#include <irq_kern.h> +#include <irq_user.h> +#include <net_kern.h> +#include <os.h> +#include "mconsole_kern.h" +#include "vector_user.h" +#include "vector_kern.h" + +/* + * Adapted from network devices with the following major changes: + * All transports are static - simplifies the code significantly + * Multiple FDs/IRQs per device + * Vector IO optionally used for read/write, falling back to legacy + * based on configuration and/or availability + * Configuration is no longer positional - L2TPv3 and GRE require up to + * 10 parameters, passing this as positional is not fit for purpose. + * Only socket transports are supported + */ + + +#define DRIVER_NAME "uml-vector" +#define DRIVER_VERSION "01" +struct vector_cmd_line_arg { + struct list_head list; + int unit; + char *arguments; +}; + +struct vector_device { + struct list_head list; + struct net_device *dev; + struct platform_device pdev; + int unit; + int opened; +}; + +static LIST_HEAD(vec_cmd_line); + +static DEFINE_SPINLOCK(vector_devices_lock); +static LIST_HEAD(vector_devices); + +static int driver_registered; + +static void vector_eth_configure(int n, struct arglist *def); + +/* Argument accessors to set variables (and/or set default values) + * mtu, buffer sizing, default headroom, etc + */ + +#define DEFAULT_HEADROOM 2 +#define SAFETY_MARGIN 32 +#define DEFAULT_VECTOR_SIZE 64 +#define TX_SMALL_PACKET 128 +#define MAX_IOV_SIZE 8 + +static const struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { "rx_queue_max" }, + { "rx_queue_running_average" }, + { "tx_queue_max" }, + { "tx_queue_running_average" }, + { "rx_encaps_errors" }, + { "tx_timeout_count" }, + { "tx_restart_queue" }, + { "tx_kicks" }, + { "tx_flow_control_xon" }, + { "tx_flow_control_xoff" }, + { "rx_csum_offload_good" }, + { "rx_csum_offload_errors"}, + { "sg_ok"}, + { "sg_linearized"}, +}; + +#define VECTOR_NUM_STATS ARRAY_SIZE(ethtool_stats_keys) + +/* Used in the 4.4 OpenWRT backport */ + +#if LINUX_VERSION_CODE <= KERNEL_VERSION(4, 7, 0) +static inline void netif_trans_update(struct net_device *dev) +{ + struct netdev_queue *txq = netdev_get_tx_queue(dev, 0); + + if (txq->trans_start != jiffies) + txq->trans_start = jiffies; +} +#endif + +static void vector_reset_stats(struct vector_private *vp) +{ + vp->estats.rx_queue_max = 0; + vp->estats.rx_queue_running_average = 0; + vp->estats.tx_queue_max = 0; + vp->estats.tx_queue_running_average = 0; + vp->estats.rx_encaps_errors = 0; + vp->estats.tx_timeout_count = 0; + vp->estats.tx_restart_queue = 0; + vp->estats.tx_kicks = 0; + vp->estats.tx_flow_control_xon = 0; + vp->estats.tx_flow_control_xoff = 0; + vp->estats.sg_ok = 0; + vp->estats.sg_linearized = 0; +} + +static int get_mtu(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "mtu"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return ETH_MAX_PACKET; +} + +static int get_depth(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "depth"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_VECTOR_SIZE; +} + +static int get_headroom(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "headroom"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_HEADROOM; +} + +static int get_transport_options(struct arglist *def) +{ + char *transport = uml_vector_fetch_arg(def, "transport"); + + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return (VECTOR_RX | VECTOR_BPF); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return (VECTOR_TX | VECTOR_RX | VECTOR_BPF); + return (VECTOR_TX | VECTOR_RX); +} + + +/* A mini-buffer for packet drop read + * All of our supported transports are datagram oriented and we always + * read using recvmsg or recvmmsg. If we pass a buffer which is smaller + * than the packet size it still counts as full packet read and will + * clean the incoming stream to keep sigio/epoll happy + */ + +#define DROP_BUFFER_SIZE 32 + +static char *drop_buffer; + +/* Array backed queues optimized for bulk enqueue/dequeue and + * 1:N (small values of N) or 1:1 enqueuer/dequeuer ratios. + * For more details and full design rationale see + * http://foswiki.cambridgegreys.com/Main/EatYourTailAndEnjoyIt + */ + + +/* + * Advance the mmsg queue head by n = advance. Resets the queue to + * maximum enqueue/dequeue-at-once capacity if possible. Called by + * dequeuers. Caller must hold the head_lock! + */ + +static int vector_advancehead(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->head = + (qi->head + advance) + % qi->max_depth; + + + spin_lock(&qi->tail_lock); + qi->queue_depth -= advance; + + /* we are at 0, use this to + * reset head and tail so we can use max size vectors + */ + + if (qi->queue_depth == 0) { + qi->head = 0; + qi->tail = 0; + } + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +/* Advance the queue tail by n = advance. + * This is called by enqueuers which should hold the + * head lock already + */ + +static int vector_advancetail(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->tail = + (qi->tail + advance) + % qi->max_depth; + spin_lock(&qi->head_lock); + qi->queue_depth += advance; + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + return queue_depth; +} + +static int prep_msg(struct vector_private *vp, struct sk_buff *skb, struct iovec *iov) +{ + int iov_index = 0; + int nr_frags, frag; + skb_frag_t *skb_frag; + + nr_frags = skb_shinfo(skb)->nr_frags; + if (nr_frags > MAX_IOV_SIZE) { + if (skb_linearize(skb) != 0) + goto drop; + } + if (vp->header_size > 0) { + iov[iov_index].iov_len = vp->header_size; + vp->form_header(iov[iov_index].iov_base, skb, vp); + iov_index++; + } + iov[iov_index].iov_base = skb->data; + if (nr_frags > 0) { + iov[iov_index].iov_len = skb->len - skb->data_len; + vp->estats.sg_ok++; + } else + iov[iov_index].iov_len = skb->len; + iov_index++; + for (frag = 0; frag < nr_frags; frag++) { + skb_frag = &skb_shinfo(skb)->frags[frag]; + iov[iov_index].iov_base = skb_frag_address_safe(skb_frag); + iov[iov_index].iov_len = skb_frag_size(skb_frag); + iov_index++; + } + return iov_index; +drop: + return -1; +} +/* + * Generic vector enqueue with support for forming headers using transport + * specific callback. Allows GRE, L2TPv3, RAW and other transports + * to use a common enqueue procedure in vector mode + */ + +static int vector_enqueue(struct vector_queue *qi, struct sk_buff *skb) +{ + struct vector_private *vp = netdev_priv(qi->dev); + int queue_depth; + int packet_len; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + int iov_count; + + spin_lock(&qi->tail_lock); + spin_lock(&qi->head_lock); + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + + if (skb) + packet_len = skb->len; + + if (queue_depth < qi->max_depth) { + + *(qi->skbuff_vector + qi->tail) = skb; + mmsg_vector += qi->tail; + iov_count = prep_msg( + vp, + skb, + mmsg_vector->msg_hdr.msg_iov + ); + if (iov_count < 1) + goto drop; + mmsg_vector->msg_hdr.msg_iovlen = iov_count; + mmsg_vector->msg_hdr.msg_name = vp->fds->remote_addr; + mmsg_vector->msg_hdr.msg_namelen = vp->fds->remote_addr_size; + queue_depth = vector_advancetail(qi, 1); + } else + goto drop; + spin_unlock(&qi->tail_lock); + return queue_depth; +drop: + qi->dev->stats.tx_dropped++; + if (skb != NULL) { + packet_len = skb->len; + dev_consume_skb_any(skb); + netdev_completed_queue(qi->dev, 1, packet_len); + } + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +static int consume_vector_skbs(struct vector_queue *qi, int count) +{ + struct sk_buff *skb; + int skb_index; + int bytes_compl = 0; + + for (skb_index = qi->head; skb_index < qi->head + count; skb_index++) { + skb = *(qi->skbuff_vector + skb_index); + /* mark as empty to ensure correct destruction if + * needed + */ + bytes_compl += skb->len; + *(qi->skbuff_vector + skb_index) = NULL; + dev_consume_skb_any(skb); + } + qi->dev->stats.tx_bytes += bytes_compl; + qi->dev->stats.tx_packets += count; + netdev_completed_queue(qi->dev, count, bytes_compl); + return vector_advancehead(qi, count); +} + +/* + * Generic vector deque via sendmmsg with support for forming headers + * using transport specific callback. Allows GRE, L2TPv3, RAW and + * other transports to use a common dequeue procedure in vector mode + */ + + +static int vector_send(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *send_from; + int result = 0, send_len, queue_depth = qi->max_depth; + + if (spin_trylock(&qi->head_lock)) { + if (spin_trylock(&qi->tail_lock)) { + /* update queue_depth to current value */ + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + while (queue_depth > 0) { + /* Calculate the start of the vector */ + send_len = queue_depth; + send_from = qi->mmsg_vector; + send_from += qi->head; + /* Adjust vector size if wraparound */ + if (send_len + qi->head > qi->max_depth) + send_len = qi->max_depth - qi->head; + /* Try to TX as many packets as possible */ + if (send_len > 0) { + result = uml_vector_sendmmsg( + vp->fds->tx_fd, + send_from, + send_len, + 0 + ); + vp->in_write_poll = (result != send_len); + } + /* For some of the sendmmsg error scenarios + * we may end being unsure in the TX success + * for all packets. It is safer to declare + * them all TX-ed and blame the network. + */ + if (result < 0) { + printk(KERN_ERR "sendmmsg err=%i\n", + result); + result = send_len; + } + if (result > 0) { + queue_depth = consume_vector_skbs(qi, result); + /* This is equivalent to an TX IRQ. + * Restart the upper layers to feed us + * more packets. + */ + if (result > vp->estats.tx_queue_max) + vp->estats.tx_queue_max = result; + vp->estats.tx_queue_running_average = + (vp->estats.tx_queue_running_average + result) >> 1; + } + netif_trans_update(qi->dev); + netif_wake_queue(qi->dev); + /* if TX is busy, break out of the send loop, poll + * write IRQ will reschedule xmit for us + */ + if (result != send_len) { + vp->estats.tx_restart_queue++; + break; + } + } + } + spin_unlock(&qi->head_lock); + } else { + tasklet_schedule(&vp->tx_poll); + } + return queue_depth; +} + +/* Queue destructor. Deliberately stateless so we can use + * it in queue cleanup if initialization fails. + */ + +static void destroy_queue(struct vector_queue *qi) +{ + int i; + struct iovec *iov; + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *mmsg_vector; + + if (qi == NULL) + return; + /* deallocate any skbuffs - we rely on any unused to be + * set to NULL. + */ + if (qi->skbuff_vector != NULL) { + for (i = 0; i < qi->max_depth; i++) { + if (*(qi->skbuff_vector + i) != NULL) + dev_kfree_skb_any(*(qi->skbuff_vector + i)); + } + kfree(qi->skbuff_vector); + } + /* deallocate matching IOV structures including header buffs */ + if (qi->mmsg_vector != NULL) { + mmsg_vector = qi->mmsg_vector; + for (i = 0; i < qi->max_depth; i++) { + iov = mmsg_vector->msg_hdr.msg_iov; + if (iov != NULL) { + if ((vp->header_size > 0) && + (iov->iov_base != NULL)) + kfree(iov->iov_base); + kfree(iov); + } + mmsg_vector++; + } + kfree(qi->mmsg_vector); + } + kfree(qi); +} + +/* + * Queue constructor. Create a queue with a given side. + */ +static struct vector_queue *create_queue( + struct vector_private *vp, + int max_size, + int header_size, + int num_extra_frags) +{ + struct vector_queue *result; + int i; + struct iovec *iov; + struct mmsghdr *mmsg_vector; + + result = kmalloc(sizeof(struct vector_queue), GFP_KERNEL); + if (result == NULL) + goto out_fail; + result->max_depth = max_size; + result->dev = vp->dev; + result->mmsg_vector = kmalloc( + (sizeof(struct mmsghdr) * max_size), GFP_KERNEL); + result->skbuff_vector = kmalloc( + (sizeof(void *) * max_size), GFP_KERNEL); + if (result->mmsg_vector == NULL || result->skbuff_vector == NULL) + goto out_fail; + + mmsg_vector = result->mmsg_vector; + for (i = 0; i < max_size; i++) { + /* Clear all pointers - we use non-NULL as marking on + * what to free on destruction + */ + *(result->skbuff_vector + i) = NULL; + mmsg_vector->msg_hdr.msg_iov = NULL; + mmsg_vector++; + } + mmsg_vector = result->mmsg_vector; + result->max_iov_frags = num_extra_frags; + for (i = 0; i < max_size; i++) { + if (vp->header_size > 0) + iov = kmalloc(sizeof(struct iovec) * (3 + num_extra_frags), GFP_KERNEL); + else + iov = kmalloc(sizeof(struct iovec) * (2 + num_extra_frags), GFP_KERNEL); + if (iov == NULL) + goto out_fail; + mmsg_vector->msg_hdr.msg_iov = iov; + mmsg_vector->msg_hdr.msg_iovlen = 1; + mmsg_vector->msg_hdr.msg_control = NULL; + mmsg_vector->msg_hdr.msg_controllen = 0; + mmsg_vector->msg_hdr.msg_flags = MSG_DONTWAIT; + mmsg_vector->msg_hdr.msg_name = NULL; + mmsg_vector->msg_hdr.msg_namelen = 0; + if (vp->header_size > 0) { + iov->iov_base = kmalloc(header_size, GFP_KERNEL); + if (iov->iov_base == NULL) + goto out_fail; + iov->iov_len = header_size; + mmsg_vector->msg_hdr.msg_iovlen = 2; + iov++; + } + iov->iov_base = NULL; + iov->iov_len = 0; + mmsg_vector++; + } + spin_lock_init(&result->head_lock); + spin_lock_init(&result->tail_lock); + result->queue_depth = 0; + result->head = 0; + result->tail = 0; + return result; +out_fail: + destroy_queue(result); + return NULL; +} + +/* + * We do not use the RX queue as a proper wraparound queue for now + * This is not necessary because the consumption via netif_rx() + * happens in-line. While we can try using the return code of + * netif_rx() for flow control there are no drivers doing this today. + * For this RX specific use we ignore the tail/head locks and + * just read into a prepared queue filled with skbuffs. + */ + +/* Prepare queue for recvmmsg one-shot rx - fill with fresh sk_buffs*/ + +static void prep_queue_for_rx(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct sk_buff *skb; + struct iovec *iov; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int i; + + if (qi->queue_depth == 0) + return; + for (i = 0; i < qi->queue_depth; i++) { + /* it is OK if allocation fails - recvmmsg with NULL data in + * iov argument still performs an RX, just drops the packet + * This allows us stop faffing around with a "drop buffer" + */ + + skb = netdev_alloc_skb( + vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + iov = mmsg_vector->msg_hdr.msg_iov; + mmsg_vector->msg_len = 0; + if (vp->header_size > 0) + iov++; + if (skb != NULL) { + skb_reserve(skb, vp->headroom); + skb->dev = qi->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + skb->ip_summed = CHECKSUM_NONE; + iov->iov_base = skb->data; + iov->iov_len = vp->max_packet; + } else { + iov->iov_base = NULL; + iov->iov_len = 0; + } + *skbuff_vector = skb; + skbuff_vector++; + mmsg_vector++; + } + qi->queue_depth = 0; +} + +static struct vector_device *find_device(int n) +{ + struct vector_device *device; + struct list_head *ele; + + spin_lock(&vector_devices_lock); + list_for_each(ele, &vector_devices) { + device = list_entry(ele, struct vector_device, list); + if (device->unit == n) + goto out; + } + device = NULL; + out: + spin_unlock(&vector_devices_lock); + return device; +} + +static int vector_parse(char *str, int *index_out, char **str_out, + char **error_out) +{ + int n, len, err = -EINVAL; + char *start = str; + + len = strlen(str); + + while ((*str != ':') && (strlen(str) > 1)) + str++; + if (*str != ':') { + *error_out = "Expected ':' after device number"; + return err; + } else + *str = '\0'; + + err = kstrtouint(start, 0, &n); + if (err < 0) { + *error_out = "Bad device number"; + return err; + } + + str++; + if (find_device(n)) { + *error_out = "Device already configured"; + return err; + } + + *index_out = n; + *str_out = str; + return 0; +} + +static int vector_config(char *str, char **error_out) +{ + int err, n; + char *params; + struct arglist *parsed; + + err = vector_parse(str, &n, ¶ms, error_out); + if (err != 0) + return err; + + /* This string is broken up and the pieces used by the underlying + * driver. We should copy it to make sure things do not go wrong + * later. + */ + + params = kstrdup(params, GFP_KERNEL); + if (str == NULL) { + *error_out = "vector_config failed to strdup string"; + return -ENOMEM; + } + + parsed = uml_parse_vector_ifspec(params); + + if (parsed == NULL) { + *error_out = "vector_config failed to parse parameters"; + return -EINVAL; + } + + vector_eth_configure(n, parsed); + return 0; +} + +static int vector_id(char **str, int *start_out, int *end_out) +{ + char *end; + int n; + + n = simple_strtoul(*str, &end, 0); + if ((*end != '\0') || (end == *str)) + return -1; + + *start_out = n; + *end_out = n; + *str = end; + return n; +} + +static int vector_remove(int n, char **error_out) +{ + struct vector_device *vec_d; + struct net_device *dev; + struct vector_private *vp; + + vec_d = find_device(n); + if (vec_d == NULL) + return -ENODEV; + dev = vec_d->dev; + vp = netdev_priv(dev); + if (vp->fds != NULL) + return -EBUSY; + unregister_netdev(dev); + platform_device_unregister(&vec_d->pdev); + return 0; +} + +/* + * There is no shared per-transport initialization code, so + * we will just initialize each interface one by one and + * add them to a list + */ + +static struct platform_driver uml_net_driver = { + .driver = { + .name = DRIVER_NAME, + }, +}; + + +static void vector_device_release(struct device *dev) +{ + struct vector_device *device = dev_get_drvdata(dev); + struct net_device *netdev = device->dev; + + list_del(&device->list); + kfree(device); + free_netdev(netdev); +} + +/* Bog standard recv using recvmsg - not used normally unless the user + * explicitly specifies not to use recvmmsg vector RX. + */ + +static int vector_legacy_rx(struct vector_private *vp) +{ + int pkt_len; + struct user_msghdr hdr; + struct iovec iov[2]; /* header + data use case only */ + int iovpos = 0; + struct sk_buff *skb; + int header_check; + + hdr.msg_name = NULL; + hdr.msg_namelen = 0; + hdr.msg_iov = (struct iovec *) &iov; + hdr.msg_iovlen = 1; + hdr.msg_control = NULL; + hdr.msg_controllen = 0; + hdr.msg_flags = 0; + + if (vp->header_size > 0) { + iov[iovpos].iov_base = vp->header_rxbuffer; + iov[iovpos].iov_len = vp->rx_header_size; + hdr.msg_iovlen++; + iovpos++; + } + + skb = netdev_alloc_skb(vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + if (skb == NULL) { + /* Read a packet into drop_buffer and don't do + * anything with it. + */ + iov[iovpos].iov_base = drop_buffer; + iov[iovpos].iov_len = DROP_BUFFER_SIZE; + vp->dev->stats.rx_dropped++; + } else { + skb_reserve(skb, vp->headroom); + skb->dev = vp->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + iov[iovpos].iov_base = skb->data; + iov[iovpos].iov_len = vp->max_packet; + } + + pkt_len = uml_vector_recvmsg(vp->fds->rx_fd, &hdr, 0); + + if (skb != NULL) { + if (pkt_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + vp->header_rxbuffer, skb, vp); + if (header_check < 0) { + dev_kfree_skb_irq(skb); + vp->dev->stats.rx_dropped++; + vp->estats.rx_encaps_errors++; + return 0; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, pkt_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + dev_kfree_skb_irq(skb); + } + } + return pkt_len; +} + +/* + * Packet at a time TX which falls back to vector TX if the + * underlying transport is busy. + */ + + + +static int writev_tx(struct vector_private *vp, struct sk_buff *skb) +{ + struct iovec iov[3 + MAX_IOV_SIZE]; + int iov_count, pkt_len = 0; + + iov[0].iov_base = vp->header_txbuffer; + iov_count = prep_msg(vp, skb, (struct iovec *) &iov); + + if (iov_count < 1) + goto drop; + pkt_len = uml_vector_writev(vp->fds->tx_fd, (struct iovec *) &iov, iov_count); + + netif_trans_update(vp->dev); + netif_wake_queue(vp->dev); + + if (pkt_len > 0) { + vp->dev->stats.tx_bytes += skb->len; + vp->dev->stats.tx_packets++; + } else { + vp->dev->stats.tx_dropped++; + } + consume_skb(skb); + return pkt_len; +drop: + vp->dev->stats.tx_dropped++; + consume_skb(skb); + return pkt_len; +} + +/* + * Receive as many messages as we can in one call using the special + * mmsg vector matched to an skb vector which we prepared earlier. + */ + +static int vector_mmsg_rx(struct vector_private *vp) +{ + int packet_count, i; + struct vector_queue *qi = vp->rx_queue; + struct sk_buff *skb; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int header_check; + + /* Refresh the vector and make sure it is with new skbs and the + * iovs are updated to point to them. + */ + + prep_queue_for_rx(qi); + + /* Fire the Lazy Gun - get as many packets as we can in one go. */ + + packet_count = uml_vector_recvmmsg( + vp->fds->rx_fd, qi->mmsg_vector, qi->max_depth, 0); + + if (packet_count <= 0) + return packet_count; + + /* We treat packet processing as enqueue, buffer refresh as dequeue + * The queue_depth tells us how many buffers have been used and how + * many do we need to prep the next time prep_queue_for_rx() is called. + */ + + qi->queue_depth = packet_count; + + for (i = 0; i < packet_count; i++) { + skb = (*skbuff_vector); + if (mmsg_vector->msg_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + mmsg_vector->msg_hdr.msg_iov->iov_base, skb, vp); + if (header_check < 0) { + /* Overlay header failed to verify - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too + * complex. + */ + dev_kfree_skb_irq(skb); + vp->estats.rx_encaps_errors++; + continue; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, + mmsg_vector->msg_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + /* + * We do not need to lock on updating stats here + * The interrupt loop is non-reentrant. + */ + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + /* Overlay header too short to do anything - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too complex. + */ + if (skb != NULL) + dev_kfree_skb_irq(skb); + } + (*skbuff_vector) = NULL; + /* Move to the next buffer element */ + mmsg_vector++; + skbuff_vector++; + } + if (packet_count > 0) { + if (vp->estats.rx_queue_max < packet_count) + vp->estats.rx_queue_max = packet_count; + vp->estats.rx_queue_running_average = + (vp->estats.rx_queue_running_average + packet_count) >> 1; + } + return packet_count; +} + +static void vector_rx(struct vector_private *vp) +{ + int err; + + if ((vp->options & VECTOR_RX) > 0) + while ((err = vector_mmsg_rx(vp)) > 0) + ; + else + while ((err = vector_legacy_rx(vp)) > 0) + ; + if (err != 0) + printk(KERN_ERR "vector_rx: error(%d)\n", err); +} + +static int vector_net_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + int queue_depth = 0; + + if ((vp->options & VECTOR_TX) == 0) { + writev_tx(vp, skb); + return NETDEV_TX_OK; + } + + /* We do BQL only in the vector path, no point doing it in + * packet at a time mode as there is no device queue + */ + + netdev_sent_queue(vp->dev, skb->len); + queue_depth = vector_enqueue(vp->tx_queue, skb); + + /* if the device queue is full, stop the upper layers and + * flush it. + */ + + if (queue_depth >= vp->tx_queue->max_depth - 1) { + vp->estats.tx_kicks++; + netif_stop_queue(dev); + vector_send(vp->tx_queue); + return NETDEV_TX_OK; + } + if (skb->xmit_more) { + mod_timer(&vp->tl, vp->coalesce); + return NETDEV_TX_OK; + } + if (skb->len < TX_SMALL_PACKET) { + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); + } else + tasklet_schedule(&vp->tx_poll); + return NETDEV_TX_OK; +} + +static irqreturn_t vector_rx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + vector_rx(vp); + return IRQ_HANDLED; + +} + +static irqreturn_t vector_tx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + /* We need to pay attention to it only if we got + * -EAGAIN or -ENOBUFFS from sendmmsg. Otherwise + * we ignore it. In the future, it may be worth + * it to improve the IRQ controller a bit to make + * tweaking the IRQ mask less costly + */ + + if (vp->in_write_poll) { + tasklet_schedule(&vp->tx_poll); + } + return IRQ_HANDLED; + +} + +static int irq_rr; + +static int vector_net_close(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + + netif_stop_queue(dev); + del_timer(&vp->tl); + + if (vp->fds == NULL) + return 0; + + /* Disable and free all IRQS */ + if (vp->rx_irq > 0) { + um_free_irq(vp->rx_irq, dev); + vp->rx_irq = 0; + } + if (vp->tx_irq > 0) { + um_free_irq(vp->tx_irq, dev); + vp->tx_irq = 0; + } + tasklet_kill(&vp->tx_poll); + if (vp->fds->rx_fd > 0) { + os_close_file(vp->fds->rx_fd); + vp->fds->rx_fd = -1; + } + if (vp->fds->tx_fd > 0) { + os_close_file(vp->fds->tx_fd); + vp->fds->tx_fd = -1; + } + if (vp->bpf != NULL) + kfree(vp->bpf); + if (vp->fds->remote_addr != NULL) + kfree(vp->fds->remote_addr); + if (vp->transport_data != NULL) + kfree(vp->transport_data); + if (vp->header_rxbuffer != NULL) + kfree(vp->header_rxbuffer); + if (vp->header_txbuffer != NULL) + kfree(vp->header_txbuffer); + if (vp->rx_queue != NULL) + destroy_queue(vp->rx_queue); + if (vp->tx_queue != NULL) + destroy_queue(vp->tx_queue); + kfree(vp->fds); + vp->fds = NULL; + spin_lock_irqsave(&vp->lock, flags); + vp->opened = false; + spin_unlock_irqrestore(&vp->lock, flags); + return 0; +} + +/* TX tasklet */ + +static void vector_tx_poll(unsigned long data) +{ + struct vector_private *vp = (struct vector_private *)data; + + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} +static void vector_reset_tx(struct work_struct *work) +{ + struct vector_private *vp = + container_of(work, struct vector_private, reset_tx); + netdev_reset_queue(vp->dev); + netif_start_queue(vp->dev); + netif_wake_queue(vp->dev); +} +static int vector_net_open(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + int err = -EINVAL; + struct vector_device *vdevice; + + spin_lock_irqsave(&vp->lock, flags); + if (vp->opened) + return -ENXIO; + vp->opened = true; + spin_unlock_irqrestore(&vp->lock, flags); + + vp->fds = uml_vector_user_open(vp->unit, vp->parsed); + + if (vp->fds == NULL) + goto out_close; + + if (build_transport_data(vp) < 0) + goto out_close; + + if ((vp->options & VECTOR_RX) > 0) { + vp->rx_queue = create_queue( + vp, get_depth(vp->parsed), vp->rx_header_size, 0); + vp->rx_queue->queue_depth = get_depth(vp->parsed); + } else { + vp->header_rxbuffer = kmalloc(vp->rx_header_size, GFP_KERNEL); + if (vp->header_rxbuffer == NULL) + goto out_close; + } + if ((vp->options & VECTOR_TX) > 0) { + vp->tx_queue = create_queue( + vp, get_depth(vp->parsed), vp->header_size, MAX_IOV_SIZE); + } else { + vp->header_txbuffer = kmalloc(vp->header_size, GFP_KERNEL); + if (vp->header_txbuffer == NULL) + goto out_close; + } + + /* READ IRQ */ + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->rx_fd, + IRQ_READ, vector_rx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR "vector_open: failed to get rx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->rx_irq = irq_rr + VECTOR_BASE_IRQ; + dev->irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + + /* WRITE IRQ - we need it only if we have vector TX */ + if ((vp->options & VECTOR_TX) > 0) { + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->tx_fd, + IRQ_WRITE, vector_tx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR + "vector_open: failed to get tx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->tx_irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + } + + if ((vp->options & VECTOR_BPF) != 0) { + vp->bpf = uml_vector_default_bpf(vp->fds->rx_fd, dev->dev_addr); + } + + + /* Write Timeout Timer */ + + vp->tl.data = (unsigned long) vp; + netif_start_queue(dev); + + /* clear buffer - it can happen that the host side of the interface + * is full when we get here. In this case, new data is never queued, + * SIGIOs never arrive, and the net never works. + */ + + vector_rx(vp); + + vector_reset_stats(vp); + vdevice = find_device(vp->unit); + vdevice->opened = 1; + + if ((vp->options & VECTOR_TX) != 0) { + add_timer(&vp->tl); + } + return 0; +out_close: + vector_net_close(dev); + return err; +} + + +static void vector_net_set_multicast_list(struct net_device *dev) +{ + /* TODO: - we can do some BPF games here */ + return; +} + +static void vector_net_tx_timeout(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + vp->estats.tx_timeout_count++; + netif_trans_update(dev); + schedule_work(&vp->reset_tx); +} + +#ifdef CONFIG_NET_POLL_CONTROLLER +static void vector_net_poll_controller(struct net_device *dev) +{ + disable_irq(dev->irq); + uml_net_interrupt(dev->irq, dev); + enable_irq(dev->irq); +} +#endif + +static void vector_net_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *info) +{ + strlcpy(info->driver, DRIVER_NAME, sizeof(info->driver)); + strlcpy(info->version, DRIVER_VERSION, sizeof(info->version)); +} + +static void vector_get_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct vector_private *vp = netdev_priv(netdev); + + ring->rx_max_pending = vp->rx_queue->max_depth; + ring->tx_max_pending = vp->tx_queue->max_depth; + ring->rx_pending = vp->rx_queue->max_depth; + ring->tx_pending = vp->tx_queue->max_depth; +} + +static void vector_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch (stringset) { + case ETH_SS_TEST: + *buf = '\0'; + break; + case ETH_SS_STATS: + memcpy(buf, ðtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + default: + WARN_ON(1); + break; + } +} + +static int vector_get_sset_count(struct net_device *dev, int sset) +{ + switch (sset) { + case ETH_SS_TEST: + return 0; + case ETH_SS_STATS: + return VECTOR_NUM_STATS; + default: + return -EOPNOTSUPP; + } +} + +static void vector_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *estats, u64 *tmp_stats) +{ + struct vector_private *vp = netdev_priv(dev); + + memcpy(tmp_stats, &vp->estats, sizeof(struct vector_estats)); +} + +static int vector_get_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + ec->tx_coalesce_usecs = (vp->coalesce * 1000000) / HZ; + return 0; +} + +static int vector_set_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + vp->coalesce = (ec->tx_coalesce_usecs * HZ) / 1000000; + if (vp->coalesce == 0) + vp->coalesce = 1; + return 0; +} + +static const struct ethtool_ops vector_net_ethtool_ops = { + .get_drvinfo = vector_net_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_ts_info = ethtool_op_get_ts_info, + .get_ringparam = vector_get_ringparam, + .get_strings = vector_get_strings, + .get_sset_count = vector_get_sset_count, + .get_ethtool_stats = vector_get_ethtool_stats, + .get_coalesce = vector_get_coalesce, + .set_coalesce = vector_set_coalesce, +}; + + +static const struct net_device_ops vector_netdev_ops = { + .ndo_open = vector_net_open, + .ndo_stop = vector_net_close, + .ndo_start_xmit = vector_net_start_xmit, + .ndo_set_rx_mode = vector_net_set_multicast_list, + .ndo_tx_timeout = vector_net_tx_timeout, + .ndo_set_mac_address = eth_mac_addr, + .ndo_validate_addr = eth_validate_addr, +#ifdef CONFIG_NET_POLL_CONTROLLER + .ndo_poll_controller = vector_net_poll_controller, +#endif +}; + + +static void vector_timer_expire(unsigned long _conn) +{ + struct vector_private *vp = (struct vector_private *)_conn; + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} + +static void vector_eth_configure( + int n, + struct arglist *def + ) +{ + struct vector_device *device; + struct net_device *dev; + struct vector_private *vp; + int err; + + printk(KERN_INFO "Vector Netdevice %d\n", n); + + device = kzalloc(sizeof(*device), GFP_KERNEL); + if (device == NULL) { + printk(KERN_ERR "eth_configure failed to allocate struct " + "vector_device\n"); + return; + } + printk(KERN_INFO "vector etherdevice start %d\n", n); + dev = alloc_etherdev(sizeof(struct vector_private)); + if (dev == NULL) { + printk(KERN_ERR "eth_configure: failed to allocate struct " + "net_device for vec%d\n", n); + goto out_free_device; + } + + dev->mtu = get_mtu(def); + + INIT_LIST_HEAD(&device->list); + device->unit = n; + + /* If this name ends up conflicting with an existing registered + * netdevice, that is OK, register_netdev{,ice}() will notice this + * and fail. + */ + snprintf(dev->name, sizeof(dev->name), "vec%d", n); + uml_net_setup_etheraddr(dev, uml_vector_fetch_arg(def, "mac")); + vp = netdev_priv(dev); + + /* sysfs register */ + if (!driver_registered) { + platform_driver_register(¨_net_driver); + driver_registered = 1; + } + device->pdev.id = n; + device->pdev.name = DRIVER_NAME; + device->pdev.dev.release = vector_device_release; + dev_set_drvdata(&device->pdev.dev, device); + if (platform_device_register(&device->pdev)) + goto out_free_netdev; + SET_NETDEV_DEV(dev, &device->pdev.dev); + + device->dev = dev; + + *vp = ((struct vector_private) + { + .list = LIST_HEAD_INIT(vp->list), + .dev = dev, + .unit = n, + .options = get_transport_options(def), + .rx_irq = 0, + .tx_irq = 0, + .parsed = def, + .max_packet = get_mtu(def) + ETH_HEADER_OTHER, + /* TODO - we need to calculate headroom so that ip header + * is 16 byte aligned all the time + */ + .headroom = get_headroom(def), + .form_header = NULL, + .verify_header = NULL, + .header_rxbuffer = NULL, + .header_txbuffer = NULL, + .header_size = 0, + .rx_header_size = 0, + .rexmit_scheduled = false, + .opened = false, + .transport_data = NULL, + .in_write_poll = false, + .coalesce = 2 + }); + + dev->features = NETIF_F_SG; + tasklet_init(&vp->tx_poll, vector_tx_poll, (unsigned long)vp); + INIT_WORK(&vp->reset_tx, vector_reset_tx); + + init_timer(&vp->tl); + spin_lock_init(&vp->lock); + vp->tl.function = vector_timer_expire; + + /* FIXME */ + dev->netdev_ops = &vector_netdev_ops; + dev->ethtool_ops = &vector_net_ethtool_ops; + dev->watchdog_timeo = (HZ >> 1); + /* primary IRQ - fixme */ + dev->irq = 0; /* we will adjust this once opened */ + + rtnl_lock(); + err = register_netdevice(dev); + rtnl_unlock(); + if (err) + goto out_undo_user_init; + + spin_lock(&vector_devices_lock); + list_add(&device->list, &vector_devices); + spin_unlock(&vector_devices_lock); + + return; + +out_undo_user_init: + return; +out_free_netdev: + free_netdev(dev); +out_free_device: + kfree(device); +} + + + + +/* + * Invoked late in the init + */ + +static int __init vector_init(void) +{ + struct list_head *ele; + struct vector_cmd_line_arg *def; + struct arglist *parsed; + + list_for_each(ele, &vec_cmd_line) { + def = list_entry(ele, struct vector_cmd_line_arg, list); + parsed = uml_parse_vector_ifspec(def->arguments); + if (parsed != NULL) + vector_eth_configure(def->unit, parsed); + } + return 0; +} + + +/* Invoked at initial argument parsing, only stores + * arguments until a proper vector_init is called + * later + */ + +static int __init vector_setup(char *str) +{ + char *error; + int n, err; + struct vector_cmd_line_arg *new; + + err = vector_parse(str, &n, &str, &error); + if (err) { + printk(KERN_ERR "vector_setup - Couldn't parse '%s' : %s\n", + str, error); + return 1; + } + new = alloc_bootmem(sizeof(*new)); + INIT_LIST_HEAD(&new->list); + new->unit = n; + new->arguments = str; + list_add_tail(&new->list, &vec_cmd_line); + return 1; +} + +__setup("vec", vector_setup); +__uml_help(vector_setup, +"vec[0-9]+:<option>=<value>,<option>=<value>\n" +" Configure a vector io network device.\n\n" +); + +late_initcall(vector_init); + +static struct mc_device vector_mc = { + .list = LIST_HEAD_INIT(vector_mc.list), + .name = "vec", + .config = vector_config, + .get_config = NULL, + .id = vector_id, + .remove = vector_remove, +}; + +#ifdef CONFIG_INET +static int vector_inetaddr_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + return NOTIFY_DONE; +} + +static struct notifier_block vector_inetaddr_notifier = { + .notifier_call = vector_inetaddr_event, +}; + +static void inet_register(void) +{ + register_inetaddr_notifier(&vector_inetaddr_notifier); +} +#else +static inline void inet_register(void) +{ +} +#endif + +static int vector_net_init(void) +{ + mconsole_register_dev(&vector_mc); + inet_register(); + return 0; +} + +__initcall(vector_net_init); + + + diff --git a/arch/um/drivers/vector_kern.h b/arch/um/drivers/vector_kern.h new file mode 100644 index 000000000000..a9ade0851fda --- /dev/null +++ b/arch/um/drivers/vector_kern.h @@ -0,0 +1,128 @@ +/* + * Copyright (C) 2002 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#ifndef __UM_VECTOR_KERN_H +#define __UM_VECTOR_KERN_H + +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/skbuff.h> +#include <linux/socket.h> +#include <linux/list.h> +#include <linux/ctype.h> +#include <linux/workqueue.h> +#include <linux/interrupt.h> +#include "vector_user.h" + +/* Queue structure specially adapted for multiple enqueue/dequeue + * in a mmsgrecv/mmsgsend context + */ + +/* Dequeue method */ + +#define QUEUE_SENDMSG 0 +#define QUEUE_SENDMMSG 1 + +#define VECTOR_RX 1 +#define VECTOR_TX (1 << 1) +#define VECTOR_BPF (1 << 2) + +#define ETH_MAX_PACKET 1500 +#define ETH_HEADER_OTHER 32 /* just in case someone decides to go mad on QnQ */ + +struct vector_queue { + struct mmsghdr *mmsg_vector; + void **skbuff_vector; + /* backlink to device which owns us */ + struct net_device *dev; + spinlock_t head_lock; + spinlock_t tail_lock; + int queue_depth, head, tail, max_depth, max_iov_frags; + short options; +}; + +struct vector_estats { + uint64_t rx_queue_max; + uint64_t rx_queue_running_average; + uint64_t tx_queue_max; + uint64_t tx_queue_running_average; + uint64_t rx_encaps_errors; + uint64_t tx_timeout_count; + uint64_t tx_restart_queue; + uint64_t tx_kicks; + uint64_t tx_flow_control_xon; + uint64_t tx_flow_control_xoff; + uint64_t rx_csum_offload_good; + uint64_t rx_csum_offload_errors; + uint64_t sg_ok; + uint64_t sg_linearized; +}; + +#define VERIFY_HEADER_NOK -1 +#define VERIFY_HEADER_OK 0 +#define VERIFY_CSUM_OK 1 + +struct vector_private { + struct list_head list; + spinlock_t lock; + struct net_device *dev; + + int unit; + + /* Timeout timer in TX */ + + struct timer_list tl; + + /* Scheduled "remove device" work */ + struct work_struct reset_tx; + struct vector_fds *fds; + + struct vector_queue *rx_queue; + struct vector_queue *tx_queue; + + int rx_irq; + int tx_irq; + + struct arglist *parsed; + + void *transport_data; /* transport specific params if needed */ + + int max_packet; + int headroom; + + int options; + + /* remote address if any - some transports will leave this as null */ + + int header_size; + int rx_header_size; + int coalesce; + + void *header_rxbuffer; + void *header_txbuffer; + + int (*form_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + int (*verify_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + + spinlock_t stats_lock; + + struct tasklet_struct tx_poll; + bool rexmit_scheduled; + bool opened; + bool in_write_poll; + + /* ethtool stats */ + + struct vector_estats estats; + void *bpf; + + char user[0]; +}; + +extern int build_transport_data(struct vector_private *vp); + +#endif diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c new file mode 100644 index 000000000000..9f07d585f71b --- /dev/null +++ b/arch/um/drivers/vector_transports.c @@ -0,0 +1,430 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Licensed under the GPL. + */ + +#include <linux/etherdevice.h> +#include <linux/netdevice.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <asm/byteorder.h> +#include <uapi/linux/ip.h> +#include <uapi/linux/virtio_net.h> +#include <linux/virtio_net.h> +#include <linux/virtio_byteorder.h> +#include <linux/netdev_features.h> +#include "vector_user.h" +#include "vector_kern.h" + +#define GOOD_LINEAR 512 + +struct gre_minimal_header { + uint16_t header; + uint16_t arptype; +}; + + +struct uml_gre_data { + uint32_t rx_key; + uint32_t tx_key; + uint32_t sequence; + + bool ipv6; + bool has_sequence; + bool pin_sequence; + bool checksum; + bool key; + struct gre_minimal_header expected_header; + + uint32_t checksum_offset; + uint32_t key_offset; + uint32_t sequence_offset; + +}; + +struct uml_l2tpv3_data { + uint64_t rx_cookie; + uint64_t tx_cookie; + uint64_t rx_session; + uint64_t tx_session; + uint32_t counter; + + bool udp; + bool ipv6; + bool has_counter; + bool pin_counter; + bool cookie; + bool cookie_is_64; + + uint32_t cookie_offset; + uint32_t session_offset; + uint32_t counter_offset; +}; + +static int l2tpv3_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *counter; + + if (td->udp) + *(uint32_t *) header = cpu_to_be32(L2TPV3_DATA_PACKET); + (*(uint32_t *) (header + td->session_offset)) = td->tx_session; + + if (td->cookie) { + if (td->cookie_is_64) + (*(uint64_t *)(header + td->cookie_offset)) = + td->tx_cookie; + else + (*(uint32_t *)(header + td->cookie_offset)) = + td->tx_cookie; + } + if (td->has_counter) { + counter = (uint32_t *)(header + td->counter_offset); + if (td->pin_counter) { + *counter = 0; + } else { + td->counter++; + *counter = cpu_to_be32(td->counter); + } + } + return 0; +} + +static int gre_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_gre_data *td = vp->transport_data; + uint32_t *sequence; + *((uint32_t *) header) = *((uint32_t *) &td->expected_header); + if (td->key) + (*(uint32_t *) (header + td->key_offset)) = td->tx_key; + if (td->has_sequence) { + sequence = (uint32_t *)(header + td->sequence_offset); + if (td->pin_sequence) + *sequence = 0; + else + *sequence = cpu_to_be32(++td->sequence); + } + return 0; +} + +static int raw_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + virtio_net_hdr_from_skb(skb, vheader, virtio_legacy_is_little_endian(), false); + + return 0; +} + +static int l2tpv3_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *session; + uint64_t cookie; + + if ((!td->udp) && (!td->ipv6)) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + /* we do not do a strict check for "data" packets as per + * the RFC spec because the pure IP spec does not have + * that anyway. + */ + + if (td->cookie) { + if (td->cookie_is_64) + cookie = *(uint64_t *)(header + td->cookie_offset); + else + cookie = *(uint32_t *)(header + td->cookie_offset); + if (cookie != td->rx_cookie) { + printk(KERN_ERR "uml_l2tpv3: unknown cookie id"); + return -1; + } + } + session = (uint32_t *) (header + td->session_offset); + if (*session != td->rx_session) { + printk(KERN_ERR "uml_l2tpv3: session mismatch"); + return -1; + } + return 0; +} + +static int gre_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + + uint32_t key; + struct uml_gre_data *td = vp->transport_data; + + if (!td->ipv6) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + if (*((uint32_t *) header) != *((uint32_t *) &td->expected_header)) { + printk(KERN_ERR "header type disagreement, expecting %0x, got %0x", + *((uint32_t *) &td->expected_header), + *((uint32_t *) header) + ); + return -1; + } + + if (td->key) { + key = (*(uint32_t *)(header + td->key_offset)); + if (key != td->rx_key) { + printk(KERN_ERR "unknown key id %0x, expecting %0x", + key, td->rx_key); + return -1; + } + } + return 0; +} + +static int raw_verify_header ( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + if (vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) { + printk(KERN_ERR "raw: GSO enabled on interface, please turn off"); + return -1; /* GSO, we cannot process this */ + } + if ((vheader->flags & VIRTIO_NET_HDR_F_DATA_VALID) > 0) + return 1; + else + virtio_net_hdr_to_skb(skb, vheader, virtio_legacy_is_little_endian()); + return 0; +} + +static bool get_uint_param( + struct arglist *def, char *param, unsigned int *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoint(arg, 0, result) == 0) + return true; + } + return false; +} + +static bool get_ulong_param( + struct arglist *def, char *param, unsigned long *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoul(arg, 0, result) == 0) + return true; + return true; + } + return false; +} + +static int build_gre_transport_data(struct vector_private *vp) +{ + struct uml_gre_data *td; + int temp_int; + int temp_rx; + int temp_tx; + + vp->transport_data = kmalloc(sizeof(struct uml_gre_data), GFP_KERNEL); + if (vp->transport_data == NULL) + return -ENOMEM; + td = vp->transport_data; + td->sequence = 0; + + td->expected_header.arptype = GRE_IRB; + td->expected_header.header = 0; + + vp->form_header = &gre_form_header; + vp->verify_header = &gre_verify_header; + vp->header_size = 4; + td->key_offset = 4; + td->sequence_offset = 4; + td->checksum_offset = 4; + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + td->key = false; + if (get_uint_param(vp->parsed, "rx_key", &temp_rx)) { + if (get_uint_param(vp->parsed, "tx_key", &temp_tx)) { + td->key = true; + td->expected_header.header |= GRE_MODE_KEY; + td->rx_key = cpu_to_be32(temp_rx); + td->tx_key = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->sequence_offset += 4; + } else { + return -EINVAL; + } + } + + td->sequence = false; + if (get_uint_param(vp->parsed, "sequence", &temp_int)) { + if (temp_int > 0) { + vp->header_size += 4; + td->has_sequence = true; + td->expected_header.header |= GRE_MODE_SEQUENCE; + if (get_uint_param( + vp->parsed, "pin_sequence", &temp_int)) { + if (temp_int > 0) + td->pin_sequence = true; + } + } + } + vp->rx_header_size = vp->header_size; + if (!td->ipv6) + vp->rx_header_size += sizeof(struct iphdr); + return 0; +} + +static int build_l2tpv3_transport_data(struct vector_private *vp) +{ + + struct uml_l2tpv3_data *td; + int temp_int, temp_rxs, temp_txs; + unsigned long temp_rx; + unsigned long temp_tx; + + vp->transport_data = kmalloc( + sizeof(struct uml_l2tpv3_data), GFP_KERNEL); + + if (vp->transport_data == NULL) + return -ENOMEM; + + td = vp->transport_data; + + vp->form_header = &l2tpv3_form_header; + vp->verify_header = &l2tpv3_verify_header; + td->counter = 0; + + vp->header_size = 4; + td->session_offset = 0; + td->cookie_offset = 4; + td->counter_offset = 4; + + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + + if (get_uint_param(vp->parsed, "rx_session", &temp_rxs)) { + if (get_uint_param(vp->parsed, "tx_session", &temp_txs)) { + td->tx_session = cpu_to_be32(temp_txs); + td->rx_session = cpu_to_be32(temp_rxs); + } else { + return -EINVAL; + } + } else { + return -EINVAL; + } + + td->cookie_is_64 = false; + if (get_uint_param(vp->parsed, "cookie64", &temp_int)) { + if (temp_int > 0) + td->cookie_is_64 = true; + } + td->cookie = false; + if (get_ulong_param(vp->parsed, "rx_cookie", &temp_rx)) { + if (get_ulong_param(vp->parsed, "tx_cookie", &temp_tx)) { + td->cookie = true; + if (td->cookie_is_64) { + td->rx_cookie = cpu_to_be64(temp_rx); + td->tx_cookie = cpu_to_be64(temp_tx); + vp->header_size += 8; + td->counter_offset += 8; + } else { + td->rx_cookie = cpu_to_be32(temp_rx); + td->tx_cookie = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->counter_offset += 4; + } + } else { + return -EINVAL; + } + } + + td->has_counter = false; + if (get_uint_param(vp->parsed, "counter", &temp_int)) { + if (temp_int > 0) { + td->has_counter = true; + vp->header_size += 4; + if (get_uint_param( + vp->parsed, "pin_counter", &temp_int)) { + if (temp_int > 0) + td->pin_counter = true; + } + } + } + + if (get_uint_param(vp->parsed, "udp", &temp_int)) { + if (temp_int > 0) { + td->udp = true; + vp->header_size += 4; + td->counter_offset += 4; + td->session_offset += 4; + td->cookie_offset += 4; + } + } + + vp->rx_header_size = vp->header_size; + if ((!td->ipv6) && (!td->udp)) + vp->rx_header_size += sizeof(struct iphdr); + + return 0; +} + +static int build_raw_transport_data(struct vector_private *vp) +{ + if (uml_raw_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= NETIF_F_HW_CSUM; /* TSO does not work on RAW */ + printk(KERN_INFO "raw: using vnet headers to offload checksum"); + } + return 0; +} + +static int build_tap_transport_data(struct vector_private *vp) +{ + if (uml_raw_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= (NETIF_F_HW_CSUM | NETIF_F_TSO); + printk(KERN_INFO "tap/raw: using vnet headers for tso and tx/rx checksum"); + } else { + return 0; /* do not try to enable tap too if raw failed */ + } + if (uml_tap_enable_vnet_headers(vp->fds->tx_fd)) { + return 0; + } + return -1; +} + +int build_transport_data(struct vector_private *vp) +{ + char *transport = uml_vector_fetch_arg(vp->parsed, "transport"); + + if (strncmp(transport, TRANS_GRE, TRANS_GRE_LEN) == 0) + return build_gre_transport_data(vp); + if (strncmp(transport, TRANS_L2TPV3, TRANS_L2TPV3_LEN) == 0) + return build_l2tpv3_transport_data(vp); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return build_raw_transport_data(vp); + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return build_tap_transport_data(vp); + return 0; +} + diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c new file mode 100644 index 000000000000..9210bf2db569 --- /dev/null +++ b/arch/um/drivers/vector_user.c @@ -0,0 +1,528 @@ +/* + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#include <stdio.h> +#include <unistd.h> +#include <stdarg.h> +#include <errno.h> +#include <stddef.h> +#include <string.h> +#include <sys/ioctl.h> +#include <net/if.h> +#include <linux/if_tun.h> +#include <arpa/inet.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <net/ethernet.h> +#include <netinet/ip.h> +#include <netinet/ether.h> +#include <linux/if_ether.h> +#include <linux/if_packet.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <linux/virtio_net.h> +#include <netdb.h> +#include <stdlib.h> +#include <os.h> +#include <um_malloc.h> +#include "vector_user.h" + +#define ID_GRE 0 +#define ID_L2TPV3 1 +#define ID_MAX 1 + +#define TOKEN_IFNAME "ifname" + +#define TRANS_RAW "raw" +#define TRANS_RAW_LEN strlen(TRANS_RAW) + +/* This is very ugly and brute force lookup, but it is done + * only once at initialization so not worth doing hashes or + * anything more intelligent + */ + +char *uml_vector_fetch_arg(struct arglist *ifspec, char *token) +{ + int i; + + for (i = 0; i < ifspec->numargs; i++) { + if (strcmp(ifspec->tokens[i], token) == 0) + return ifspec->values[i]; + } + return NULL; + +} + +struct arglist *uml_parse_vector_ifspec(char *arg) +{ + struct arglist *result; + int pos, len; + bool parsing_token = true, next_starts = true; + + if (arg == NULL) + return NULL; + result = uml_kmalloc(sizeof(struct arglist), UM_GFP_KERNEL); + if (result == NULL) + return NULL; + result->numargs = 0; + len = strlen(arg); + for (pos = 0; pos < len; pos++) { + if (next_starts) { + if (parsing_token) { + result->tokens[result->numargs] = arg + pos; + } else { + result->values[result->numargs] = arg + pos; + result->numargs++; + } + next_starts = false; + } + if (*(arg + pos) == '=') { + if (parsing_token) + parsing_token = false; + else + goto cleanup; + next_starts = true; + (*(arg + pos)) = '\0'; + } + if (*(arg + pos) == ',') { + parsing_token = true; + next_starts = true; + (*(arg + pos)) = '\0'; + } + } + return result; +cleanup: + printk(UM_KERN_ERR "vector_setup - Couldn't parse '%s'\n", arg); + kfree(result); + return NULL; +} + +/* + * Socket/FD configuration functions. These return an structure + * of rx and tx descriptors to cover cases where these are not + * the same (f.e. read via raw socket and write via tap). + */ + +#define PATH_NET_TUN "/dev/net... [truncated message content] |
From: <ant...@ko...> - 2017-10-04 14:42:22
|
From: Anton Ivanov <ant...@ca...> 1. Removes the need to walk the IRQ/Device list to determine who triggered the IRQ. 2. Improves scalability (up to several times performance improvement for cases with 10s of devices). 3. Improves UML baseline IO performance for one disk + one NIC use case by up to 10%. 4. Introduces write poll triggered IRQs. 5. Prerequisite for introducing high performance mmesg family of functions in network IO. 6. Fixes RNG shutdown which was leaking a file descriptor Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/drivers/chan_kern.c | 53 +---- arch/um/drivers/line.c | 2 +- arch/um/drivers/random.c | 11 +- arch/um/drivers/ubd_kern.c | 4 +- arch/um/include/shared/irq_user.h | 12 +- arch/um/include/shared/os.h | 17 +- arch/um/kernel/irq.c | 460 ++++++++++++++++++++++++-------------- arch/um/os-Linux/irq.c | 202 +++++++++-------- 8 files changed, 444 insertions(+), 317 deletions(-) diff --git a/arch/um/drivers/chan_kern.c b/arch/um/drivers/chan_kern.c index acbe6c67afba..05588f9466c7 100644 --- a/arch/um/drivers/chan_kern.c +++ b/arch/um/drivers/chan_kern.c @@ -171,56 +171,19 @@ int enable_chan(struct line *line) return err; } -/* Items are added in IRQ context, when free_irq can't be called, and - * removed in process context, when it can. - * This handles interrupt sources which disappear, and which need to - * be permanently disabled. This is discovered in IRQ context, but - * the freeing of the IRQ must be done later. - */ -static DEFINE_SPINLOCK(irqs_to_free_lock); -static LIST_HEAD(irqs_to_free); - -void free_irqs(void) -{ - struct chan *chan; - LIST_HEAD(list); - struct list_head *ele; - unsigned long flags; - - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_splice_init(&irqs_to_free, &list); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - - list_for_each(ele, &list) { - chan = list_entry(ele, struct chan, free_list); - - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } -} - static void close_one_chan(struct chan *chan, int delay_free_irq) { - unsigned long flags; - if (!chan->opened) return; - if (delay_free_irq) { - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_add(&chan->free_list, &irqs_to_free); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - } - else { - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } + /* we can safely call free now - it will be marked + * as free and freed once the IRQ stopped processing + */ + if (chan->input && chan->enabled) + um_free_irq(chan->line->driver->read_irq, chan); + if (chan->output && chan->enabled) + um_free_irq(chan->line->driver->write_irq, chan); + chan->enabled = 0; if (chan->ops->close != NULL) (*chan->ops->close)(chan->fd, chan->data); diff --git a/arch/um/drivers/line.c b/arch/um/drivers/line.c index 366e57f5e8d6..8d80b27502e6 100644 --- a/arch/um/drivers/line.c +++ b/arch/um/drivers/line.c @@ -284,7 +284,7 @@ int line_setup_irq(int fd, int input, int output, struct line *line, void *data) if (err) return err; if (output) - err = um_request_irq(driver->write_irq, fd, IRQ_WRITE, + err = um_request_irq(driver->write_irq, fd, IRQ_NONE, line_write_interrupt, IRQF_SHARED, driver->write_irq_name, data); return err; diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c index 37c51a6be690..778a0e52d5a5 100644 --- a/arch/um/drivers/random.c +++ b/arch/um/drivers/random.c @@ -13,6 +13,7 @@ #include <linux/miscdevice.h> #include <linux/delay.h> #include <linux/uaccess.h> +#include <init.h> #include <irq_kern.h> #include <os.h> @@ -154,7 +155,14 @@ static int __init rng_init (void) /* * rng_cleanup - shutdown RNG module */ -static void __exit rng_cleanup (void) + +static void cleanup(void) +{ + free_irq_by_fd(random_fd); + os_close_file(random_fd); +} + +static void __exit rng_cleanup(void) { os_close_file(random_fd); misc_deregister (&rng_miscdev); @@ -162,6 +170,7 @@ static void __exit rng_cleanup (void) module_init (rng_init); module_exit (rng_cleanup); +__uml_exitcall(cleanup); MODULE_DESCRIPTION("UML Host Random Number Generator (RNG) driver"); MODULE_LICENSE("GPL"); diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c index b55fe9bf5d3e..d4e8c497ae86 100644 --- a/arch/um/drivers/ubd_kern.c +++ b/arch/um/drivers/ubd_kern.c @@ -1587,11 +1587,11 @@ int io_thread(void *arg) do { res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n); - if (res > 0) { + if (res >= 0) { written += res; } else { if (res != -EAGAIN) { - printk("io_thread - read failed, fd = %d, " + printk("io_thread - write failed, fd = %d, " "err = %d\n", kernel_fd, -n); } } diff --git a/arch/um/include/shared/irq_user.h b/arch/um/include/shared/irq_user.h index df5633053957..a7a6120f19d5 100644 --- a/arch/um/include/shared/irq_user.h +++ b/arch/um/include/shared/irq_user.h @@ -7,6 +7,7 @@ #define __IRQ_USER_H__ #include <sysdep/ptrace.h> +#include <stdbool.h> struct irq_fd { struct irq_fd *next; @@ -15,10 +16,17 @@ struct irq_fd { int type; int irq; int events; - int current_events; + bool active; + bool pending; + bool purge; }; -enum { IRQ_READ, IRQ_WRITE }; +#define IRQ_READ 0 +#define IRQ_WRITE 1 +#define IRQ_NONE 2 +#define MAX_IRQ_TYPE (IRQ_NONE + 1) + + struct siginfo; extern void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs); diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index 574e03fc7ba2..edbe6563d5c5 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -290,15 +290,16 @@ extern void halt_skas(void); extern void reboot_skas(void); /* irq.c */ -extern int os_waiting_for_events(struct irq_fd *active_fds); -extern int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds); -extern void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2); -extern void os_free_irq_later(struct irq_fd *active_fds, - int irq, void *dev_id); -extern int os_get_pollfd(int i); -extern void os_set_pollfd(int i, int fd); +extern int os_waiting_for_events_epoll(void); +extern void *os_epoll_get_data_pointer(int index); +extern int os_epoll_triggered(int index, int events); +extern int os_event_mask(int irq_type); +extern int os_setup_epoll(void); +extern int os_add_epoll_fd(int events, int fd, void *data); +extern int os_mod_epoll_fd(int events, int fd, void *data); +extern int os_del_epoll_fd(int fd); extern void os_set_ioignore(void); +extern void os_close_epoll_fd(void); /* sigio.c */ extern int add_sigio_fd(int fd); diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c index 23cb9350d47e..980148d56537 100644 --- a/arch/um/kernel/irq.c +++ b/arch/um/kernel/irq.c @@ -1,4 +1,6 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL * Derived (i.e. mostly copied) from arch/i386/kernel/irq.c: @@ -16,243 +18,361 @@ #include <as-layout.h> #include <kern_util.h> #include <os.h> +#include <irq_user.h> -/* - * This list is accessed under irq_lock, except in sigio_handler, - * where it is safe from being modified. IRQ handlers won't change it - - * if an IRQ source has vanished, it will be freed by free_irqs just - * before returning from sigio_handler. That will process a separate - * list of irqs to free, with its own locking, coming back here to - * remove list elements, taking the irq_lock to do so. + +/* When epoll triggers we do not know why it did so + * we can also have different IRQs for read and write. + * This is why we keep a small irq_fd array for each fd - + * one entry per IRQ type */ -static struct irq_fd *active_fds = NULL; -static struct irq_fd **last_irq_ptr = &active_fds; -extern void free_irqs(void); +struct irq_entry { + struct irq_entry *next; + int fd; + struct irq_fd *irq_array[MAX_IRQ_TYPE + 1]; +}; + +static struct irq_entry *active_fds; + +static DEFINE_SPINLOCK(irq_lock); + +static void irq_io_loop(struct irq_fd *irq, struct uml_pt_regs *regs) +{ +/* + * irq->active guards against reentry + * irq->pending accumulates pending requests + * if pending is raised the irq_handler is re-run + * until pending is cleared + */ + if (irq->active) { + irq->active = false; + do { + irq->pending = false; + do_IRQ(irq->irq, regs); + } while (irq->pending && (!irq->purge)); + if (!irq->purge) + irq->active = true; + } else { + irq->pending = true; + } +} void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs) { - struct irq_fd *irq_fd; - int n; + struct irq_entry *irq_entry; + struct irq_fd *irq; + + int n, i, j; while (1) { - n = os_waiting_for_events(active_fds); + /* This is now lockless - epoll keeps back-referencesto the irqs + * which have trigger it so there is no need to walk the irq + * list and lock it every time. We avoid locking by turning off + * IO for a specific fd by executing os_del_epoll_fd(fd) before + * we do any changes to the actual data structures + */ + n = os_waiting_for_events_epoll(); + if (n <= 0) { if (n == -EINTR) continue; - else break; + else + break; } - for (irq_fd = active_fds; irq_fd != NULL; - irq_fd = irq_fd->next) { - if (irq_fd->current_events != 0) { - irq_fd->current_events = 0; - do_IRQ(irq_fd->irq, regs); + for (i = 0; i < n ; i++) { + /* Epoll back reference is the entry with 3 irq_fd + * leaves - one for each irq type. + */ + irq_entry = (struct irq_entry *) + os_epoll_get_data_pointer(i); + for (j = 0; j < MAX_IRQ_TYPE ; j++) { + irq = irq_entry->irq_array[j]; + if (irq == NULL) + continue; + if (os_epoll_triggered(i, irq->events) > 0) + irq_io_loop(irq, regs); + if (irq->purge) { + irq_entry->irq_array[j] = NULL; + kfree(irq); + } } } } +} + +static int assign_epoll_events_to_irq(struct irq_entry *irq_entry) +{ + int i; + int events = 0; + struct irq_fd *irq; - free_irqs(); + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + irq = irq_entry->irq_array[i]; + if (irq != NULL) + events = irq->events | events; + } + if (events > 0) { + /* os_add_epoll will call os_mod_epoll if this already exists */ + return os_add_epoll_fd(events, irq_entry->fd, irq_entry); + } + /* No events - delete */ + return os_del_epoll_fd(irq_entry->fd); } -static DEFINE_SPINLOCK(irq_lock); + static int activate_fd(int irq, int fd, int type, void *dev_id) { - struct pollfd *tmp_pfd; - struct irq_fd *new_fd, *irq_fd; + struct irq_fd *new_fd; + struct irq_entry *irq_entry; + int i, err, events; unsigned long flags; - int events, err, n; err = os_set_fd_async(fd); if (err < 0) goto out; - err = -ENOMEM; - new_fd = kmalloc(sizeof(struct irq_fd), GFP_KERNEL); - if (new_fd == NULL) - goto out; + spin_lock_irqsave(&irq_lock, flags); - if (type == IRQ_READ) - events = UM_POLLIN | UM_POLLPRI; - else events = UM_POLLOUT; - *new_fd = ((struct irq_fd) { .next = NULL, - .id = dev_id, - .fd = fd, - .type = type, - .irq = irq, - .events = events, - .current_events = 0 } ); + /* Check if we have an entry for this fd */ err = -EBUSY; - spin_lock_irqsave(&irq_lock, flags); - for (irq_fd = active_fds; irq_fd != NULL; irq_fd = irq_fd->next) { - if ((irq_fd->fd == fd) && (irq_fd->type == type)) { - printk(KERN_ERR "Registering fd %d twice\n", fd); - printk(KERN_ERR "Irqs : %d, %d\n", irq_fd->irq, irq); - printk(KERN_ERR "Ids : 0x%p, 0x%p\n", irq_fd->id, - dev_id); + for (irq_entry = active_fds; + irq_entry != NULL; irq_entry = irq_entry->next) { + if (irq_entry->fd == fd) + break; + } + + if (irq_entry == NULL) { + /* This needs to be atomic as it may be called from an + * IRQ context. + */ + irq_entry = kmalloc(sizeof(struct irq_entry), GFP_ATOMIC); + if (irq_entry == NULL) { + printk(KERN_ERR + "Failed to allocate new IRQ entry\n"); goto out_unlock; } + irq_entry->fd = fd; + for (i = 0; i < MAX_IRQ_TYPE; i++) + irq_entry->irq_array[i] = NULL; + irq_entry->next = active_fds; + active_fds = irq_entry; } - if (type == IRQ_WRITE) - fd = -1; - - tmp_pfd = NULL; - n = 0; + /* Check if we are trying to re-register an interrupt for a + * particular fd + */ - while (1) { - n = os_create_pollfd(fd, events, tmp_pfd, n); - if (n == 0) - break; + if (irq_entry->irq_array[type] != NULL) { + printk(KERN_ERR + "Trying to reregister IRQ %d FD %d TYPE %d ID %p\n", + irq, fd, type, dev_id + ); + goto out_unlock; + } else { + /* New entry for this fd */ + + err = -ENOMEM; + new_fd = kmalloc(sizeof(struct irq_fd), GFP_ATOMIC); + if (new_fd == NULL) + goto out_unlock; - /* - * n > 0 - * It means we couldn't put new pollfd to current pollfds - * and tmp_fds is NULL or too small for new pollfds array. - * Needed size is equal to n as minimum. - * - * Here we have to drop the lock in order to call - * kmalloc, which might sleep. - * If something else came in and changed the pollfds array - * so we will not be able to put new pollfd struct to pollfds - * then we free the buffer tmp_fds and try again. + events = os_event_mask(type); + + *new_fd = ((struct irq_fd) { + .id = dev_id, + .irq = irq, + .type = type, + .events = events, + .active = true, + .pending = false, + .purge = false + }); + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop */ - spin_unlock_irqrestore(&irq_lock, flags); - kfree(tmp_pfd); - - tmp_pfd = kmalloc(n, GFP_KERNEL); - if (tmp_pfd == NULL) - goto out_kfree; - - spin_lock_irqsave(&irq_lock, flags); + os_del_epoll_fd(irq_entry->fd); + irq_entry->irq_array[type] = new_fd; } - *last_irq_ptr = new_fd; - last_irq_ptr = &new_fd->next; - + /* Turn back IO on with the correct (new) IO event mask */ + assign_epoll_events_to_irq(irq_entry); spin_unlock_irqrestore(&irq_lock, flags); - - /* - * This calls activate_fd, so it has to be outside the critical - * section. - */ - maybe_sigio_broken(fd, (type == IRQ_READ)); + maybe_sigio_broken(fd, (type != IRQ_NONE)); return 0; - - out_unlock: +out_unlock: spin_unlock_irqrestore(&irq_lock, flags); - out_kfree: - kfree(new_fd); - out: +out: return err; } -static void free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg) +/* + * Walk the IRQ list and dispose of any unused entries. + * Should be done under irq_lock. + */ + +static void garbage_collect_irq_entries(void) { - unsigned long flags; + int i; + bool reap; + struct irq_entry *walk; + struct irq_entry *previous = NULL; + struct irq_entry *to_free; - spin_lock_irqsave(&irq_lock, flags); - os_free_irq_by_cb(test, arg, active_fds, &last_irq_ptr); - spin_unlock_irqrestore(&irq_lock, flags); + if (active_fds == NULL) + return; + walk = active_fds; + while (walk != NULL) { + reap = true; + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (walk->irq_array[i] != NULL) { + reap = false; + break; + } + } + if (reap) { + if (previous == NULL) + active_fds = walk->next; + else + previous->next = walk->next; + to_free = walk; + } else { + to_free = NULL; + } + walk = walk->next; + if (to_free != NULL) + kfree(to_free); + } } -struct irq_and_dev { - int irq; - void *dev; -}; +/* + * Walk the IRQ list and get the descriptor for our FD + */ -static int same_irq_and_dev(struct irq_fd *irq, void *d) +static struct irq_entry *get_irq_entry_by_fd(int fd) { - struct irq_and_dev *data = d; + struct irq_entry *walk = active_fds; - return ((irq->irq == data->irq) && (irq->id == data->dev)); + while (walk != NULL) { + if (walk->fd == fd) + return walk; + walk = walk->next; + } + return NULL; } -static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) -{ - struct irq_and_dev data = ((struct irq_and_dev) { .irq = irq, - .dev = dev }); - free_irq_by_cb(same_irq_and_dev, &data); -} +/* + * Walk the IRQ list and dispose of an entry for a specific + * device, fd and number. Note - if sharing an IRQ for read + * and writefor the same FD it will be disposed in either case. + * If this behaviour is undesirable use different IRQ ids. + */ -static int same_fd(struct irq_fd *irq, void *fd) -{ - return (irq->fd == *((int *)fd)); -} +#define IGNORE_IRQ 1 +#define IGNORE_DEV (1<<1) -void free_irq_by_fd(int fd) +static void do_free_by_irq_and_dev( + struct irq_entry *irq_entry, + unsigned int irq, + void *dev, + int flags +) { - free_irq_by_cb(same_fd, &fd); + int i; + struct irq_fd *to_free; + + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (irq_entry->irq_array[i] != NULL) { + if ( + ((flags & IGNORE_IRQ) || + (irq_entry->irq_array[i]->irq == irq)) && + ((flags & IGNORE_DEV) || + (irq_entry->irq_array[i]->id == dev)) + ) { + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop + */ + os_del_epoll_fd(irq_entry->fd); + to_free = irq_entry->irq_array[i]; + irq_entry->irq_array[i] = NULL; + assign_epoll_events_to_irq(irq_entry); + if (to_free->active) + to_free->purge = true; + else + kfree(to_free); + } + } + } } -/* Must be called with irq_lock held */ -static struct irq_fd *find_irq_by_fd(int fd, int irqnum, int *index_out) +void free_irq_by_fd(int fd) { - struct irq_fd *irq; - int i = 0; - int fdi; + struct irq_entry *to_free; + unsigned long flags; - for (irq = active_fds; irq != NULL; irq = irq->next) { - if ((irq->fd == fd) && (irq->irq == irqnum)) - break; - i++; - } - if (irq == NULL) { - printk(KERN_ERR "find_irq_by_fd doesn't have descriptor %d\n", - fd); - goto out; - } - fdi = os_get_pollfd(i); - if ((fdi != -1) && (fdi != fd)) { - printk(KERN_ERR "find_irq_by_fd - mismatch between active_fds " - "and pollfds, fd %d vs %d, need %d\n", irq->fd, - fdi, fd); - irq = NULL; - goto out; + spin_lock_irqsave(&irq_lock, flags); + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); } - *index_out = i; - out: - return irq; + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); } -void reactivate_fd(int fd, int irqnum) +static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irq, + dev, + 0 + ); + to_free = to_free->next; } - os_set_pollfd(i, irq->fd); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); +} - add_sigio_fd(fd); + +void reactivate_fd(int fd, int irqnum) +{ + /** NOP - we do auto-EOI now **/ } void deactivate_fd(int fd, int irqnum) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; + os_del_epoll_fd(fd); spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irqnum, + NULL, + IGNORE_DEV + ); } - - os_set_pollfd(i, -1); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); - ignore_sigio_fd(fd); } EXPORT_SYMBOL(deactivate_fd); @@ -265,17 +385,28 @@ EXPORT_SYMBOL(deactivate_fd); */ int deactivate_all_fds(void) { - struct irq_fd *irq; - int err; + unsigned long flags; + struct irq_entry *to_free; - for (irq = active_fds; irq != NULL; irq = irq->next) { - err = os_clear_fd_async(irq->fd); - if (err) - return err; - } - /* If there is a signal already queued, after unblocking ignore it */ + spin_lock_irqsave(&irq_lock, flags); + /* Stop IO. The IRQ loop has no lock so this is our + * only way of making sure we are safe to dispose + * of all IRQ handlers + */ os_set_ioignore(); - + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); + to_free = to_free->next; + } + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); + os_close_epoll_fd(); return 0; } @@ -353,8 +484,11 @@ void __init init_IRQ(void) irq_set_chip_and_handler(TIMER_IRQ, &SIGVTALRM_irq_type, handle_edge_irq); + for (i = 1; i < NR_IRQS; i++) irq_set_chip_and_handler(i, &normal_irq_type, handle_edge_irq); + /* Initialize EPOLL Loop */ + os_setup_epoll(); } /* diff --git a/arch/um/os-Linux/irq.c b/arch/um/os-Linux/irq.c index b9afb74b79ad..365823010346 100644 --- a/arch/um/os-Linux/irq.c +++ b/arch/um/os-Linux/irq.c @@ -1,135 +1,147 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL */ #include <stdlib.h> #include <errno.h> -#include <poll.h> +#include <sys/epoll.h> #include <signal.h> #include <string.h> #include <irq_user.h> #include <os.h> #include <um_malloc.h> +/* Epoll support */ + +static int epollfd = -1; + +#define MAX_EPOLL_EVENTS 64 + +static struct epoll_event epoll_events[MAX_EPOLL_EVENTS]; + +/* Helper to return an Epoll data pointer from an epoll event structure. + * We need to keep this one on the userspace side to keep includes separate + */ + +void *os_epoll_get_data_pointer(int index) +{ + return epoll_events[index].data.ptr; +} + +/* Helper to compare events versus the events in the epoll structure. + * Same as above - needs to be on the userspace side + */ + + +int os_epoll_triggered(int index, int events) +{ + return epoll_events[index].events & events; +} +/* Helper to set the event mask. + * The event mask is opaque to the kernel side, because it does not have + * access to the right includes/defines for EPOLL constants. + */ + +int os_event_mask(int irq_type) +{ + if (irq_type == IRQ_READ) + return EPOLLIN | EPOLLPRI; + if (irq_type == IRQ_WRITE) + return EPOLLOUT; + return 0; +} + /* - * Locked by irq_lock in arch/um/kernel/irq.c. Changed by os_create_pollfd - * and os_free_irq_by_cb, which are called under irq_lock. + * Initial Epoll Setup */ -static struct pollfd *pollfds = NULL; -static int pollfds_num = 0; -static int pollfds_size = 0; +int os_setup_epoll(void) +{ + epollfd = epoll_create(MAX_EPOLL_EVENTS); + return epollfd; +} -int os_waiting_for_events(struct irq_fd *active_fds) +/* + * Helper to run the actual epoll_wait + */ +int os_waiting_for_events_epoll(void) { - struct irq_fd *irq_fd; - int i, n, err; + int n, err; - n = poll(pollfds, pollfds_num, 0); + n = epoll_wait(epollfd, + (struct epoll_event *) &epoll_events, MAX_EPOLL_EVENTS, 0); if (n < 0) { err = -errno; if (errno != EINTR) - printk(UM_KERN_ERR "os_waiting_for_events:" - " poll returned %d, errno = %d\n", n, errno); + printk( + UM_KERN_ERR "os_waiting_for_events:" + " epoll returned %d, error = %s\n", n, + strerror(errno) + ); return err; } - - if (n == 0) - return 0; - - irq_fd = active_fds; - - for (i = 0; i < pollfds_num; i++) { - if (pollfds[i].revents != 0) { - irq_fd->current_events = pollfds[i].revents; - pollfds[i].fd = -1; - } - irq_fd = irq_fd->next; - } return n; } -int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds) -{ - if (pollfds_num == pollfds_size) { - if (size_tmpfds <= pollfds_size * sizeof(pollfds[0])) { - /* return min size needed for new pollfds area */ - return (pollfds_size + 1) * sizeof(pollfds[0]); - } - - if (pollfds != NULL) { - memcpy(tmp_pfd, pollfds, - sizeof(pollfds[0]) * pollfds_size); - /* remove old pollfds */ - kfree(pollfds); - } - pollfds = tmp_pfd; - pollfds_size++; - } else - kfree(tmp_pfd); /* remove not used tmp_pfd */ - - pollfds[pollfds_num] = ((struct pollfd) { .fd = fd, - .events = events, - .revents = 0 }); - pollfds_num++; - - return 0; -} -void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2) +/* + * Helper to add a fd to epoll + */ +int os_add_epoll_fd(int events, int fd, void *data) { - struct irq_fd **prev; - int i = 0; - - prev = &active_fds; - while (*prev != NULL) { - if ((*test)(*prev, arg)) { - struct irq_fd *old_fd = *prev; - if ((pollfds[i].fd != -1) && - (pollfds[i].fd != (*prev)->fd)) { - printk(UM_KERN_ERR "os_free_irq_by_cb - " - "mismatch between active_fds and " - "pollfds, fd %d vs %d\n", - (*prev)->fd, pollfds[i].fd); - goto out; - } - - pollfds_num--; - - /* - * This moves the *whole* array after pollfds[i] - * (though it doesn't spot as such)! - */ - memmove(&pollfds[i], &pollfds[i + 1], - (pollfds_num - i) * sizeof(pollfds[0])); - if (*last_irq_ptr2 == &old_fd->next) - *last_irq_ptr2 = prev; - - *prev = (*prev)->next; - if (old_fd->type == IRQ_WRITE) - ignore_sigio_fd(old_fd->fd); - kfree(old_fd); - continue; - } - prev = &(*prev)->next; - i++; - } - out: - return; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events | EPOLLET; + result = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event); + if ((result) && (errno == EEXIST)) + result = os_mod_epoll_fd(events, fd, data); + if (result) + printk("epollctl add err fd %d, %s\n", fd, strerror(errno)); + return result; } -int os_get_pollfd(int i) +/* + * Helper to mod the fd event mask and/or data backreference + */ +int os_mod_epoll_fd(int events, int fd, void *data) { - return pollfds[i].fd; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events; + result = epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &event); + if (result) + printk(UM_KERN_ERR + "epollctl mod err fd %d, %s\n", fd, strerror(errno)); + return result; } -void os_set_pollfd(int i, int fd) +/* + * Helper to delete the epoll fd + */ +int os_del_epoll_fd(int fd) { - pollfds[i].fd = fd; + struct epoll_event event; + int result; + /* This is quiet as we use this as IO ON/OFF - so it is often + * invoked on a non-existent fd + */ + result = epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, &event); + return result; } void os_set_ioignore(void) { signal(SIGIO, SIG_IGN); } + +void os_close_epoll_fd(void) +{ + /* Needed so we do not leak an fd when rebooting */ + os_close_file(epollfd); +} -- 2.11.0 |
From: <ant...@ko...> - 2017-10-04 14:42:13
|
This update cleans up the code and simplifies the xmit routines to share some of the code. It also brings tap up to date to do scatter gather and TSO on transmit. The result is ... drum roll... 4.5Gbit+ on iperf on my laptop. As a reference QEMU+kvm with all bells and whistles does under 1GBit and LXC with vEth does ~ 20GBit. TSO on RX is still not implemented - this is XMIT only. The same code should also work for raw, but it does not. It looks like a BUG in the actual linux af_packet.c raw socket implementation. I am probably going to implement TSO on RX so I have a full frame of reference to see if the RX TSO in raw sockets is broken (or not) next. So for the time being raw does only checksums. I will try to find some time to add Geneve and VXLAN to this. It will be after I figure out RAW and TSO on RX for RAW and TAP. Brgds, A. |
From: Anton I. <ant...@ko...> - 2017-10-04 07:29:18
|
I got the TSO and some offloads working for some (not all) cases of my patchsets. Preliminary speeds and feeds on iperf: [ 4] local 192.168.98.1 port 5001 connected with 192.168.98.132 port 54838 [ 4] 0.0-10.0 sec 4.78 GBytes 4.11 Gbits/sec As a comparison - kvm/qemu on same machine does about a Gbit and vEth to container does 20Gbit - so somewhere in between. I am having some trouble with TSO and raw sockets. Once I figure out what is going on (and hopefully fix it) there will be an updated version which should take us into 4x what kvm/QEMU does on one CPU networkwise. A. On 09/26/17 11:26, ant...@ko... wrote: > Hi List, hi Richard, > > I found a logical error in the SG code. This revision fixes that. No > other changes - the rest appears to work as it should in my testing. > > Brgds, > > A. > > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > _______________________________________________ > User-mode-linux-devel mailing list > Use...@li... > https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel > |
From: <ant...@ko...> - 2017-09-26 10:27:20
|
From: Anton Ivanov <ant...@ca...> 1. Provides infrastructure for vector IO using recvmmsg/sendmmsg. 1.1. Multi-message read. 1.2. Multi-message write. 1.3. Optimized queue support for multi-packet enqueue/dequeue. 1.4. BQL/DQL support. 2. Implements transports for several transports as well support for direct wiring of PWEs to NIC. Allows direct connection of VMs to host, other VMs and network devices with no switch in use. 2.1. Raw socket >4 times faster than existing pcap based transport 2.2. Tap transport using socket RX and tap xmit. >3 times faster RX than existing tap. 2.3. GRE transport - direct wiring to GRE PWE, >3 times faster RX than any existing transport. 2.4. L2TPv3 transport - direct wiring to L2TPv3 PWE, > 3 times faster RX than any existing transport. 2.5. TX in all cases shows only minor improvement, but consumes significantly less CPU under load. 3. Tuning and performance related information via ethtool. 4. Rudimentary BPF support - used in tap only to avoid software looping 5. Scatter Gather support. 6. VNET and checksum offload support for raw socket transport. Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/Kconfig.net | 11 + arch/um/drivers/Makefile | 4 +- arch/um/drivers/net_kern.c | 4 +- arch/um/drivers/vector_kern.c | 1517 +++++++++++++++++++++++++++++++++++ arch/um/drivers/vector_kern.h | 128 +++ arch/um/drivers/vector_transports.c | 407 ++++++++++ arch/um/drivers/vector_user.c | 512 ++++++++++++ arch/um/drivers/vector_user.h | 89 ++ arch/um/include/asm/irq.h | 12 + arch/um/include/shared/net_kern.h | 2 + 10 files changed, 2683 insertions(+), 3 deletions(-) create mode 100644 arch/um/drivers/vector_kern.c create mode 100644 arch/um/drivers/vector_kern.h create mode 100644 arch/um/drivers/vector_transports.c create mode 100644 arch/um/drivers/vector_user.c create mode 100644 arch/um/drivers/vector_user.h diff --git a/arch/um/Kconfig.net b/arch/um/Kconfig.net index 820a56f00332..0516dc76e6aa 100644 --- a/arch/um/Kconfig.net +++ b/arch/um/Kconfig.net @@ -108,6 +108,17 @@ config UML_NET_DAEMON more than one without conflict. If you don't need UML networking, say N. +config UML_NET_VECTOR + bool "Vector I/O high performance network devices" + depends on UML_NET + help + This User-Mode Linux network driver uses multi-message send + and receive functions. The host running the UML guest must have + a linux kernel version above 3.0 and a libc version > 2.13. + This driver provides tap, raw, gre and l2tpv3 network transports + with up to 4 times higher network throughput than the UML network + drivers. + config UML_NET_VDE bool "VDE transport" depends on UML_NET diff --git a/arch/um/drivers/Makefile b/arch/um/drivers/Makefile index e7582e1d248c..16b3cebddafb 100644 --- a/arch/um/drivers/Makefile +++ b/arch/um/drivers/Makefile @@ -9,6 +9,7 @@ slip-objs := slip_kern.o slip_user.o slirp-objs := slirp_kern.o slirp_user.o daemon-objs := daemon_kern.o daemon_user.o +vector-objs := vector_kern.o vector_user.o vector_transports.o umcast-objs := umcast_kern.o umcast_user.o net-objs := net_kern.o net_user.o mconsole-objs := mconsole_kern.o mconsole_user.o @@ -43,6 +44,7 @@ obj-$(CONFIG_STDERR_CONSOLE) += stderr_console.o obj-$(CONFIG_UML_NET_SLIP) += slip.o slip_common.o obj-$(CONFIG_UML_NET_SLIRP) += slirp.o slip_common.o obj-$(CONFIG_UML_NET_DAEMON) += daemon.o +obj-$(CONFIG_UML_NET_VECTOR) += vector.o obj-$(CONFIG_UML_NET_VDE) += vde.o obj-$(CONFIG_UML_NET_MCAST) += umcast.o obj-$(CONFIG_UML_NET_PCAP) += pcap.o @@ -61,7 +63,7 @@ obj-$(CONFIG_BLK_DEV_COW_COMMON) += cow_user.o obj-$(CONFIG_UML_RANDOM) += random.o # pcap_user.o must be added explicitly. -USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o +USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o vector_user.o CFLAGS_null.o = -DDEV_NULL=$(DEV_NULL_PATH) include arch/um/scripts/Makefile.rules diff --git a/arch/um/drivers/net_kern.c b/arch/um/drivers/net_kern.c index 1669240c7a25..03cc5857a3b6 100644 --- a/arch/um/drivers/net_kern.c +++ b/arch/um/drivers/net_kern.c @@ -288,7 +288,7 @@ static void uml_net_user_timer_expire(unsigned long _conn) #endif } -static void setup_etheraddr(struct net_device *dev, char *str) +void uml_net_setup_etheraddr(struct net_device *dev, char *str) { unsigned char *addr = dev->dev_addr; char *end; @@ -412,7 +412,7 @@ static void eth_configure(int n, void *init, char *mac, */ snprintf(dev->name, sizeof(dev->name), "eth%d", n); - setup_etheraddr(dev, mac); + uml_net_setup_etheraddr(dev, mac); printk(KERN_INFO "Netdevice %d (%pM) : ", n, dev->dev_addr); diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c new file mode 100644 index 000000000000..5fe5393ff25e --- /dev/null +++ b/arch/um/drivers/vector_kern.c @@ -0,0 +1,1517 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Copyright (C) 2001 Lennert Buytenhek (bu...@gn...) and + * James Leu (jl...@mi...). + * Copyright (C) 2001 by various other people who didn't put their name here. + * Licensed under the GPL. + */ + +#include <linux/version.h> +#include <linux/bootmem.h> +#include <linux/etherdevice.h> +#include <linux/ethtool.h> +#include <linux/inetdevice.h> +#include <linux/init.h> +#include <linux/list.h> +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/rtnetlink.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <linux/interrupt.h> +#include <init.h> +#include <irq_kern.h> +#include <irq_user.h> +#include <net_kern.h> +#include <os.h> +#include "mconsole_kern.h" +#include "vector_user.h" +#include "vector_kern.h" + +/* + * Adapted from network devices with the following major changes: + * All transports are static - simplifies the code significantly + * Multiple FDs/IRQs per device + * Vector IO optionally used for read/write, falling back to legacy + * based on configuration and/or availability + * Configuration is no longer positional - L2TPv3 and GRE require up to + * 10 parameters, passing this as positional is not fit for purpose. + * Only socket transports are supported + */ + + +#define DRIVER_NAME "uml-vector" +#define DRIVER_VERSION "01" +struct vector_cmd_line_arg { + struct list_head list; + int unit; + char *arguments; +}; + +struct vector_device { + struct list_head list; + struct net_device *dev; + struct platform_device pdev; + int unit; + int opened; +}; + +static LIST_HEAD(vec_cmd_line); + +static DEFINE_SPINLOCK(vector_devices_lock); +static LIST_HEAD(vector_devices); + +static int driver_registered; + +static void vector_eth_configure(int n, struct arglist *def); + +/* Argument accessors to set variables (and/or set default values) + * mtu, buffer sizing, default headroom, etc + */ + +#define DEFAULT_HEADROOM 2 +#define SAFETY_MARGIN 32 +#define DEFAULT_VECTOR_SIZE 64 +#define TX_SMALL_PACKET 128 +#define MAX_IOV_SIZE 8 + +static const struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { "rx_queue_max" }, + { "rx_queue_running_average" }, + { "tx_queue_max" }, + { "tx_queue_running_average" }, + { "rx_encaps_errors" }, + { "tx_timeout_count" }, + { "tx_restart_queue" }, + { "tx_kicks" }, + { "tx_flow_control_xon" }, + { "tx_flow_control_xoff" }, + { "rx_csum_offload_good" }, + { "rx_csum_offload_errors"}, + { "sg_ok"}, + { "sg_linearized"}, +}; + +#define VECTOR_NUM_STATS ARRAY_SIZE(ethtool_stats_keys) + +/* Used in the 4.4 OpenWRT backport */ + +#if LINUX_VERSION_CODE <= KERNEL_VERSION(4, 7, 0) +static inline void netif_trans_update(struct net_device *dev) +{ + struct netdev_queue *txq = netdev_get_tx_queue(dev, 0); + + if (txq->trans_start != jiffies) + txq->trans_start = jiffies; +} +#endif + +static void vector_reset_stats(struct vector_private *vp) +{ + vp->estats.rx_queue_max = 0; + vp->estats.rx_queue_running_average = 0; + vp->estats.tx_queue_max = 0; + vp->estats.tx_queue_running_average = 0; + vp->estats.rx_encaps_errors = 0; + vp->estats.tx_timeout_count = 0; + vp->estats.tx_restart_queue = 0; + vp->estats.tx_kicks = 0; + vp->estats.tx_flow_control_xon = 0; + vp->estats.tx_flow_control_xoff = 0; + vp->estats.sg_ok = 0; + vp->estats.sg_linearized = 0; +} + +static int get_mtu(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "mtu"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return ETH_MAX_PACKET; +} + +static int get_depth(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "depth"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_VECTOR_SIZE; +} + +static int get_headroom(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "headroom"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_HEADROOM; +} + +static int get_transport_options(struct arglist *def) +{ + char *transport = uml_vector_fetch_arg(def, "transport"); + + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return (VECTOR_RX | VECTOR_BPF); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return (VECTOR_TX | VECTOR_RX | VECTOR_BPF); + return (VECTOR_TX | VECTOR_RX); +} + + +/* A mini-buffer for packet drop read + * All of our supported transports are datagram oriented and we always + * read using recvmsg or recvmmsg. If we pass a buffer which is smaller + * than the packet size it still counts as full packet read and will + * clean the incoming stream to keep sigio/epoll happy + */ + +#define DROP_BUFFER_SIZE 32 + +static char *drop_buffer; + +/* Array backed queues optimized for bulk enqueue/dequeue and + * 1:N (small values of N) or 1:1 enqueuer/dequeuer ratios. + * For more details and full design rationale see + * http://foswiki.cambridgegreys.com/Main/EatYourTailAndEnjoyIt + */ + + +/* + * Advance the mmsg queue head by n = advance. Resets the queue to + * maximum enqueue/dequeue-at-once capacity if possible. Called by + * dequeuers. Caller must hold the head_lock! + */ + +static int vector_advancehead(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->head = + (qi->head + advance) + % qi->max_depth; + + + spin_lock(&qi->tail_lock); + qi->queue_depth -= advance; + + /* we are at 0, use this to + * reset head and tail so we can use max size vectors + */ + + if (qi->queue_depth == 0) { + qi->head = 0; + qi->tail = 0; + } + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +/* Advance the queue tail by n = advance. + * This is called by enqueuers which should hold the + * head lock already + */ + +static int vector_advancetail(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->tail = + (qi->tail + advance) + % qi->max_depth; + spin_lock(&qi->head_lock); + qi->queue_depth += advance; + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + return queue_depth; +} + +/* + * Generic vector enqueue with support for forming headers using transport + * specific callback. Allows GRE, L2TPv3, RAW and other transports + * to use a common enqueue procedure in vector mode + */ + +static int vector_enqueue(struct vector_queue *qi, struct sk_buff *skb) +{ + struct vector_private *vp = netdev_priv(qi->dev); + int queue_depth; + int nr_frags, frag, packet_len; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + struct iovec *iov; + skb_frag_t *skb_frag; + + spin_lock(&qi->tail_lock); + spin_lock(&qi->head_lock); + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + + if (skb) + packet_len = skb->len; + + if (queue_depth < qi->max_depth) { + + *(qi->skbuff_vector + qi->tail) = skb; + mmsg_vector += qi->tail; + iov = mmsg_vector->msg_hdr.msg_iov; + + mmsg_vector->msg_hdr.msg_name = vp->fds->remote_addr; + mmsg_vector->msg_hdr.msg_namelen = vp->fds->remote_addr_size; + nr_frags = skb_shinfo(skb)->nr_frags; + if (nr_frags > qi->max_iov_frags) { + if (skb_linearize(skb) != 0) + goto drop; + else + vp->estats.sg_linearized++; + } + if (vp->header_size > 0) { + vp->form_header(iov->iov_base, skb, vp); + iov++; + mmsg_vector->msg_hdr.msg_iovlen = 2 + nr_frags; + } else + mmsg_vector->msg_hdr.msg_iovlen = 1 + nr_frags; + iov->iov_base = skb->data; + if (nr_frags > 0) { + iov->iov_len = skb->len - skb->data_len; + vp->estats.sg_ok++; + } else + iov->iov_len = skb->len; + for (frag = 0; frag < nr_frags; frag++) { + iov++; + skb_frag = &skb_shinfo(skb)->frags[frag]; + iov->iov_base = skb_frag_address_safe(skb_frag); + iov->iov_len = skb_frag_size(skb_frag); + } + queue_depth = vector_advancetail(qi, 1); + } else + goto drop; + spin_unlock(&qi->tail_lock); + return queue_depth; +drop: + qi->dev->stats.tx_dropped++; + if (skb != NULL) { + dev_consume_skb_any(skb); + netdev_completed_queue(qi->dev, 1, packet_len); + } + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +static int consume_vector_skbs(struct vector_queue *qi, int count) +{ + struct sk_buff *skb; + int skb_index; + int bytes_compl = 0; + + for (skb_index = qi->head; skb_index < qi->head + count; skb_index++) { + skb = *(qi->skbuff_vector + skb_index); + /* mark as empty to ensure correct destruction if + * needed + */ + bytes_compl += skb->len; + *(qi->skbuff_vector + skb_index) = NULL; + dev_consume_skb_any(skb); + } + qi->dev->stats.tx_bytes += bytes_compl; + qi->dev->stats.tx_packets += count; + netdev_completed_queue(qi->dev, count, bytes_compl); + return vector_advancehead(qi, count); +} + +/* + * Generic vector deque via sendmmsg with support for forming headers + * using transport specific callback. Allows GRE, L2TPv3, RAW and + * other transports to use a common dequeue procedure in vector mode + */ + + +static int vector_send(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *send_from; + int result = 0, send_len, queue_depth = qi->max_depth; + + if (spin_trylock(&qi->head_lock)) { + if (spin_trylock(&qi->tail_lock)) { + /* update queue_depth to current value */ + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + while (queue_depth > 0) { + /* Calculate the start of the vector */ + send_len = queue_depth; + send_from = qi->mmsg_vector; + send_from += qi->head; + /* Adjust vector size if wraparound */ + if (send_len + qi->head > qi->max_depth) + send_len = qi->max_depth - qi->head; + /* Try to TX as many packets as possible */ + if (send_len > 0) { + result = uml_vector_sendmmsg( + vp->fds->tx_fd, + send_from, + send_len, + 0 + ); + vp->in_write_poll = (result != send_len); + } + /* For some of the sendmmsg error scenarios + * we may end being unsure in the TX success + * for all packets. It is safer to declare + * them all TX-ed and blame the network. + */ + if (result < 0) { + printk(KERN_ERR "sendmmsg err=%i\n", + result); + result = send_len; + } + if (result > 0) { + queue_depth = consume_vector_skbs(qi, result); + /* This is equivalent to an TX IRQ. + * Restart the upper layers to feed us + * more packets. + */ + if (result > vp->estats.tx_queue_max) + vp->estats.tx_queue_max = result; + vp->estats.tx_queue_running_average = + (vp->estats.tx_queue_running_average + result) >> 1; + } + netif_trans_update(qi->dev); + netif_wake_queue(qi->dev); + /* if TX is busy, break out of the send loop, poll + * write IRQ will reschedule xmit for us + */ + if (result != send_len) { + vp->estats.tx_restart_queue++; + break; + } + } + } + spin_unlock(&qi->head_lock); + } else { + tasklet_schedule(&vp->tx_poll); + } + return queue_depth; +} + +/* Queue destructor. Deliberately stateless so we can use + * it in queue cleanup if initialization fails. + */ + +static void destroy_queue(struct vector_queue *qi) +{ + int i; + struct iovec *iov; + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *mmsg_vector; + + if (qi == NULL) + return; + /* deallocate any skbuffs - we rely on any unused to be + * set to NULL. + */ + if (qi->skbuff_vector != NULL) { + for (i = 0; i < qi->max_depth; i++) { + if (*(qi->skbuff_vector + i) != NULL) + dev_kfree_skb_any(*(qi->skbuff_vector + i)); + } + kfree(qi->skbuff_vector); + } + /* deallocate matching IOV structures including header buffs */ + if (qi->mmsg_vector != NULL) { + mmsg_vector = qi->mmsg_vector; + for (i = 0; i < qi->max_depth; i++) { + iov = mmsg_vector->msg_hdr.msg_iov; + if (iov != NULL) { + if ((vp->header_size > 0) && + (iov->iov_base != NULL)) + kfree(iov->iov_base); + kfree(iov); + } + mmsg_vector++; + } + kfree(qi->mmsg_vector); + } + kfree(qi); +} + +/* + * Queue constructor. Create a queue with a given side. + */ +static struct vector_queue *create_queue( + struct vector_private *vp, + int max_size, + int header_size, + int num_extra_frags) +{ + struct vector_queue *result; + int i; + struct iovec *iov; + struct mmsghdr *mmsg_vector; + + result = kmalloc(sizeof(struct vector_queue), GFP_KERNEL); + if (result == NULL) + goto out_fail; + result->max_depth = max_size; + result->dev = vp->dev; + result->mmsg_vector = kmalloc( + (sizeof(struct mmsghdr) * max_size), GFP_KERNEL); + result->skbuff_vector = kmalloc( + (sizeof(void *) * max_size), GFP_KERNEL); + if (result->mmsg_vector == NULL || result->skbuff_vector == NULL) + goto out_fail; + + mmsg_vector = result->mmsg_vector; + for (i = 0; i < max_size; i++) { + /* Clear all pointers - we use non-NULL as marking on + * what to free on destruction + */ + *(result->skbuff_vector + i) = NULL; + mmsg_vector->msg_hdr.msg_iov = NULL; + mmsg_vector++; + } + mmsg_vector = result->mmsg_vector; + result->max_iov_frags = num_extra_frags; + for (i = 0; i < max_size; i++) { + if (vp->header_size > 0) + iov = kmalloc(sizeof(struct iovec) * (2 + num_extra_frags), GFP_KERNEL); + else + iov = kmalloc(sizeof(struct iovec) * (1 + num_extra_frags), GFP_KERNEL); + if (iov == NULL) + goto out_fail; + mmsg_vector->msg_hdr.msg_iov = iov; + mmsg_vector->msg_hdr.msg_iovlen = 1; + mmsg_vector->msg_hdr.msg_control = NULL; + mmsg_vector->msg_hdr.msg_controllen = 0; + mmsg_vector->msg_hdr.msg_flags = 0; + mmsg_vector->msg_hdr.msg_name = NULL; + mmsg_vector->msg_hdr.msg_namelen = 0; + if (vp->header_size > 0) { + iov->iov_base = kmalloc(header_size, GFP_KERNEL); + if (iov->iov_base == NULL) + goto out_fail; + iov->iov_len = header_size; + mmsg_vector->msg_hdr.msg_iovlen = 2; + iov++; + } + iov->iov_base = NULL; + iov->iov_len = 0; + mmsg_vector++; + } + spin_lock_init(&result->head_lock); + spin_lock_init(&result->tail_lock); + result->queue_depth = 0; + result->head = 0; + result->tail = 0; + return result; +out_fail: + destroy_queue(result); + return NULL; +} + +/* + * We do not use the RX queue as a proper wraparound queue for now + * This is not necessary because the consumption via netif_rx() + * happens in-line. While we can try using the return code of + * netif_rx() for flow control there are no drivers doing this today. + * For this RX specific use we ignore the tail/head locks and + * just read into a prepared queue filled with skbuffs. + */ + +/* Prepare queue for recvmmsg one-shot rx - fill with fresh sk_buffs*/ + +static void prep_queue_for_rx(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct sk_buff *skb; + struct iovec *iov; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int i; + + if (qi->queue_depth == 0) + return; + for (i = 0; i < qi->queue_depth; i++) { + /* it is OK if allocation fails - recvmmsg with NULL data in + * iov argument still performs an RX, just drops the packet + * This allows us stop faffing around with a "drop buffer" + */ + + skb = netdev_alloc_skb( + vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + iov = mmsg_vector->msg_hdr.msg_iov; + mmsg_vector->msg_len = 0; + if (vp->header_size > 0) + iov++; + if (skb != NULL) { + skb_reserve(skb, vp->headroom); + skb->dev = qi->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + skb->ip_summed = CHECKSUM_NONE; + iov->iov_base = skb->data; + iov->iov_len = vp->max_packet; + } else { + iov->iov_base = NULL; + iov->iov_len = 0; + } + *skbuff_vector = skb; + skbuff_vector++; + mmsg_vector++; + } + qi->queue_depth = 0; +} + +static struct vector_device *find_device(int n) +{ + struct vector_device *device; + struct list_head *ele; + + spin_lock(&vector_devices_lock); + list_for_each(ele, &vector_devices) { + device = list_entry(ele, struct vector_device, list); + if (device->unit == n) + goto out; + } + device = NULL; + out: + spin_unlock(&vector_devices_lock); + return device; +} + +static int vector_parse(char *str, int *index_out, char **str_out, + char **error_out) +{ + int n, len, err = -EINVAL; + char *start = str; + + len = strlen(str); + + while ((*str != ':') && (strlen(str) > 1)) + str++; + if (*str != ':') { + *error_out = "Expected ':' after device number"; + return err; + } else + *str = '\0'; + + err = kstrtouint(start, 0, &n); + if (err < 0) { + *error_out = "Bad device number"; + return err; + } + + str++; + if (find_device(n)) { + *error_out = "Device already configured"; + return err; + } + + *index_out = n; + *str_out = str; + return 0; +} + +static int vector_config(char *str, char **error_out) +{ + int err, n; + char *params; + struct arglist *parsed; + + err = vector_parse(str, &n, ¶ms, error_out); + if (err != 0) + return err; + + /* This string is broken up and the pieces used by the underlying + * driver. We should copy it to make sure things do not go wrong + * later. + */ + + params = kstrdup(params, GFP_KERNEL); + if (str == NULL) { + *error_out = "vector_config failed to strdup string"; + return -ENOMEM; + } + + parsed = uml_parse_vector_ifspec(params); + + if (parsed == NULL) { + *error_out = "vector_config failed to parse parameters"; + return -EINVAL; + } + + vector_eth_configure(n, parsed); + return 0; +} + +static int vector_id(char **str, int *start_out, int *end_out) +{ + char *end; + int n; + + n = simple_strtoul(*str, &end, 0); + if ((*end != '\0') || (end == *str)) + return -1; + + *start_out = n; + *end_out = n; + *str = end; + return n; +} + +static int vector_remove(int n, char **error_out) +{ + struct vector_device *vec_d; + struct net_device *dev; + struct vector_private *vp; + + vec_d = find_device(n); + if (vec_d == NULL) + return -ENODEV; + dev = vec_d->dev; + vp = netdev_priv(dev); + if (vp->fds != NULL) + return -EBUSY; + unregister_netdev(dev); + platform_device_unregister(&vec_d->pdev); + return 0; +} + +/* + * There is no shared per-transport initialization code, so + * we will just initialize each interface one by one and + * add them to a list + */ + +static struct platform_driver uml_net_driver = { + .driver = { + .name = DRIVER_NAME, + }, +}; + + +static void vector_device_release(struct device *dev) +{ + struct vector_device *device = dev_get_drvdata(dev); + struct net_device *netdev = device->dev; + + list_del(&device->list); + kfree(device); + free_netdev(netdev); +} + +/* Bog standard recv using recvmsg - not used normally unless the user + * explicitly specifies not to use recvmmsg vector RX. + */ + +static int vector_legacy_rx(struct vector_private *vp) +{ + int pkt_len; + struct user_msghdr hdr; + struct iovec iov[2]; /* header + data use case only */ + int iovpos = 0; + struct sk_buff *skb; + int header_check; + + hdr.msg_name = NULL; + hdr.msg_namelen = 0; + hdr.msg_iov = (struct iovec *) &iov; + hdr.msg_iovlen = 1; + hdr.msg_control = NULL; + hdr.msg_controllen = 0; + hdr.msg_flags = 0; + + if (vp->header_size > 0) { + iov[iovpos].iov_base = vp->header_rxbuffer; + iov[iovpos].iov_len = vp->rx_header_size; + hdr.msg_iovlen++; + iovpos++; + } + + skb = netdev_alloc_skb(vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + if (skb == NULL) { + /* Read a packet into drop_buffer and don't do + * anything with it. + */ + iov[iovpos].iov_base = drop_buffer; + iov[iovpos].iov_len = DROP_BUFFER_SIZE; + vp->dev->stats.rx_dropped++; + } else { + skb_reserve(skb, vp->headroom); + skb->dev = vp->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + iov[iovpos].iov_base = skb->data; + iov[iovpos].iov_len = vp->max_packet; + } + + pkt_len = uml_vector_recvmsg(vp->fds->rx_fd, &hdr, 0); + + if (skb != NULL) { + if (pkt_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + vp->header_rxbuffer, skb, vp); + if (header_check < 0) { + dev_kfree_skb_irq(skb); + vp->dev->stats.rx_dropped++; + vp->estats.rx_encaps_errors++; + return 0; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, pkt_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + dev_kfree_skb_irq(skb); + } + } + return pkt_len; +} + +/* + * Packet at a time TX which falls back to vector TX if the + * underlying transport is busy. + */ + +static int writev_tx(struct vector_private *vp, struct sk_buff *skb) +{ + int pkt_len; + + if (vp->header_size == 0) { + pkt_len = os_write_file(vp->fds->tx_fd, skb->data, skb->len); + } else { + struct iovec iov[2]; + + iov[0].iov_base = vp->header_txbuffer; + iov[0].iov_len = vp->header_size; + vp->form_header(iov->iov_base, skb, vp); + iov[1].iov_base = skb->data; + iov[1].iov_len = skb->len; + pkt_len = uml_vector_writev(vp->fds->tx_fd, &iov, 2); + } + netif_trans_update(vp->dev); + netif_wake_queue(vp->dev); + + if (pkt_len > 0) { + vp->dev->stats.tx_bytes += skb->len; + vp->dev->stats.tx_packets++; + } else { + vp->dev->stats.tx_dropped++; + } + consume_skb(skb); + return pkt_len; +} + +/* + * Receive as many messages as we can in one call using the special + * mmsg vector matched to an skb vector which we prepared earlier. + */ + +static int vector_mmsg_rx(struct vector_private *vp) +{ + int packet_count, i; + struct vector_queue *qi = vp->rx_queue; + struct sk_buff *skb; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int header_check; + + /* Refresh the vector and make sure it is with new skbs and the + * iovs are updated to point to them. + */ + + prep_queue_for_rx(qi); + + /* Fire the Lazy Gun - get as many packets as we can in one go. */ + + packet_count = uml_vector_recvmmsg( + vp->fds->rx_fd, qi->mmsg_vector, qi->max_depth, 0); + + if (packet_count <= 0) + return packet_count; + + /* We treat packet processing as enqueue, buffer refresh as dequeue + * The queue_depth tells us how many buffers have been used and how + * many do we need to prep the next time prep_queue_for_rx() is called. + */ + + qi->queue_depth = packet_count; + + for (i = 0; i < packet_count; i++) { + skb = (*skbuff_vector); + if (mmsg_vector->msg_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + mmsg_vector->msg_hdr.msg_iov->iov_base, skb, vp); + if (header_check < 0) { + /* Overlay header failed to verify - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too + * complex. + */ + dev_kfree_skb_irq(skb); + vp->estats.rx_encaps_errors++; + continue; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, + mmsg_vector->msg_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + /* + * We do not need to lock on updating stats here + * The interrupt loop is non-reentrant. + */ + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + /* Overlay header too short to do anything - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too complex. + */ + if (skb != NULL) + dev_kfree_skb_irq(skb); + } + (*skbuff_vector) = NULL; + /* Move to the next buffer element */ + mmsg_vector++; + skbuff_vector++; + } + if (packet_count > 0) { + if (vp->estats.rx_queue_max < packet_count) + vp->estats.rx_queue_max = packet_count; + vp->estats.rx_queue_running_average = + (vp->estats.rx_queue_running_average + packet_count) >> 1; + } + return packet_count; +} + +static void vector_rx(struct vector_private *vp) +{ + int err; + + if ((vp->options & VECTOR_RX) > 0) + while ((err = vector_mmsg_rx(vp)) > 0) + ; + else + while ((err = vector_legacy_rx(vp)) > 0) + ; + if (err != 0) + printk(KERN_ERR "vector_rx: error(%d)\n", err); +} + +static int vector_net_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + int queue_depth = 0; + + if ((vp->options & VECTOR_TX) == 0) { + writev_tx(vp, skb); + return NETDEV_TX_OK; + } + + /* We do BQL only in the vector path, no point doing it in + * packet at a time mode as there is no device queue + */ + + netdev_sent_queue(vp->dev, skb->len); + queue_depth = vector_enqueue(vp->tx_queue, skb); + + /* if the device queue is full, stop the upper layers and + * flush it. + */ + + if (queue_depth >= vp->tx_queue->max_depth - 1) { + vp->estats.tx_kicks++; + netif_stop_queue(dev); + vector_send(vp->tx_queue); + return NETDEV_TX_OK; + } + if (skb->xmit_more) { + mod_timer(&vp->tl, vp->coalesce); + return NETDEV_TX_OK; + } + if (skb->len < TX_SMALL_PACKET) { + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); + } else + tasklet_schedule(&vp->tx_poll); + return NETDEV_TX_OK; +} + +static irqreturn_t vector_rx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + vector_rx(vp); + return IRQ_HANDLED; + +} + +static irqreturn_t vector_tx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + /* We need to pay attention to it only if we got + * -EAGAIN or -ENOBUFFS from sendmms. Otherwise + * we ignore it. In the future, it may be worth + * it to improve the IRQ controller a bit to make + * tweaking the IRQ mask less costly + */ + + if (vp->in_write_poll) { + tasklet_schedule(&vp->tx_poll); + } + return IRQ_HANDLED; + +} + +static int irq_rr; + +static int vector_net_close(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + + netif_stop_queue(dev); + del_timer(&vp->tl); + + if (vp->fds == NULL) + return 0; + + /* Disable and free all IRQS */ + if (vp->rx_irq > 0) { + um_free_irq(vp->rx_irq, dev); + vp->rx_irq = 0; + } + if (vp->tx_irq > 0) { + um_free_irq(vp->tx_irq, dev); + vp->tx_irq = 0; + } + tasklet_kill(&vp->tx_poll); + if (vp->fds->rx_fd > 0) { + os_close_file(vp->fds->rx_fd); + vp->fds->rx_fd = -1; + } + if (vp->fds->tx_fd > 0) { + os_close_file(vp->fds->tx_fd); + vp->fds->tx_fd = -1; + } + if (vp->bpf != NULL) + kfree(vp->bpf); + if (vp->fds->remote_addr != NULL) + kfree(vp->fds->remote_addr); + if (vp->transport_data != NULL) + kfree(vp->transport_data); + if (vp->header_rxbuffer != NULL) + kfree(vp->header_rxbuffer); + if (vp->header_txbuffer != NULL) + kfree(vp->header_txbuffer); + if (vp->rx_queue != NULL) + destroy_queue(vp->rx_queue); + if (vp->tx_queue != NULL) + destroy_queue(vp->tx_queue); + kfree(vp->fds); + vp->fds = NULL; + spin_lock_irqsave(&vp->lock, flags); + vp->opened = false; + spin_unlock_irqrestore(&vp->lock, flags); + return 0; +} + +/* TX tasklet */ + +static void vector_tx_poll(unsigned long data) +{ + struct vector_private *vp = (struct vector_private *)data; + + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} +static void vector_reset_tx(struct work_struct *work) +{ + struct vector_private *vp = + container_of(work, struct vector_private, reset_tx); + netdev_reset_queue(vp->dev); + netif_start_queue(vp->dev); + netif_wake_queue(vp->dev); +} +static int vector_net_open(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + int err = -EINVAL; + struct vector_device *vdevice; + + spin_lock_irqsave(&vp->lock, flags); + if (vp->opened) + return -ENXIO; + vp->opened = true; + spin_unlock_irqrestore(&vp->lock, flags); + + vp->fds = uml_vector_user_open(vp->unit, vp->parsed); + + if (vp->fds == NULL) + goto out_close; + + if (build_transport_data(vp) < 0) + goto out_close; + + if ((vp->options & VECTOR_RX) > 0) { + vp->rx_queue = create_queue( + vp, get_depth(vp->parsed), vp->rx_header_size, 0); + vp->rx_queue->queue_depth = get_depth(vp->parsed); + } else { + vp->header_rxbuffer = kmalloc(vp->rx_header_size, GFP_KERNEL); + if (vp->header_rxbuffer == NULL) + goto out_close; + } + if ((vp->options & VECTOR_TX) > 0) { + vp->tx_queue = create_queue( + vp, get_depth(vp->parsed), vp->header_size, MAX_IOV_SIZE); + } else { + vp->header_txbuffer = kmalloc(vp->header_size, GFP_KERNEL); + if (vp->header_txbuffer == NULL) + goto out_close; + } + + /* READ IRQ */ + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->rx_fd, + IRQ_READ, vector_rx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR "vector_open: failed to get rx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->rx_irq = irq_rr + VECTOR_BASE_IRQ; + dev->irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + + /* WRITE IRQ - we need it only if we have vector TX */ + if ((vp->options & VECTOR_TX) > 0) { + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->tx_fd, + IRQ_WRITE, vector_tx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR + "vector_open: failed to get tx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->tx_irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + } + + if ((vp->options & VECTOR_BPF) != 0) { + vp->bpf = uml_vector_default_bpf(vp->fds->rx_fd, dev->dev_addr); + } + + + /* Write Timeout Timer */ + + vp->tl.data = (unsigned long) vp; + netif_start_queue(dev); + + /* clear buffer - it can happen that the host side of the interface + * is full when we get here. In this case, new data is never queued, + * SIGIOs never arrive, and the net never works. + */ + + vector_rx(vp); + + vector_reset_stats(vp); + vdevice = find_device(vp->unit); + vdevice->opened = 1; + + if ((vp->options & VECTOR_TX) != 0) { + add_timer(&vp->tl); + } + return 0; +out_close: + vector_net_close(dev); + return err; +} + + +static void vector_net_set_multicast_list(struct net_device *dev) +{ + /* TODO: - we can do some BPF games here */ + return; +} + +static void vector_net_tx_timeout(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + vp->estats.tx_timeout_count++; + netif_trans_update(dev); + schedule_work(&vp->reset_tx); +} + +#ifdef CONFIG_NET_POLL_CONTROLLER +static void vector_net_poll_controller(struct net_device *dev) +{ + disable_irq(dev->irq); + uml_net_interrupt(dev->irq, dev); + enable_irq(dev->irq); +} +#endif + +static void vector_net_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *info) +{ + strlcpy(info->driver, DRIVER_NAME, sizeof(info->driver)); + strlcpy(info->version, DRIVER_VERSION, sizeof(info->version)); +} + +static void vector_get_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct vector_private *vp = netdev_priv(netdev); + + ring->rx_max_pending = vp->rx_queue->max_depth; + ring->tx_max_pending = vp->tx_queue->max_depth; + ring->rx_pending = vp->rx_queue->max_depth; + ring->tx_pending = vp->tx_queue->max_depth; +} + +static void vector_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch (stringset) { + case ETH_SS_TEST: + *buf = '\0'; + break; + case ETH_SS_STATS: + memcpy(buf, ðtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + default: + WARN_ON(1); + break; + } +} + +static int vector_get_sset_count(struct net_device *dev, int sset) +{ + switch (sset) { + case ETH_SS_TEST: + return 0; + case ETH_SS_STATS: + return VECTOR_NUM_STATS; + default: + return -EOPNOTSUPP; + } +} + +static void vector_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *estats, u64 *tmp_stats) +{ + struct vector_private *vp = netdev_priv(dev); + + memcpy(tmp_stats, &vp->estats, sizeof(struct vector_estats)); +} + +static int vector_get_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + ec->tx_coalesce_usecs = (vp->coalesce * 1000000) / HZ; + return 0; +} + +static int vector_set_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + vp->coalesce = (ec->tx_coalesce_usecs * HZ) / 1000000; + if (vp->coalesce == 0) + vp->coalesce = 1; + return 0; +} + +static const struct ethtool_ops vector_net_ethtool_ops = { + .get_drvinfo = vector_net_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_ts_info = ethtool_op_get_ts_info, + .get_ringparam = vector_get_ringparam, + .get_strings = vector_get_strings, + .get_sset_count = vector_get_sset_count, + .get_ethtool_stats = vector_get_ethtool_stats, + .get_coalesce = vector_get_coalesce, + .set_coalesce = vector_set_coalesce, +}; + + +static const struct net_device_ops vector_netdev_ops = { + .ndo_open = vector_net_open, + .ndo_stop = vector_net_close, + .ndo_start_xmit = vector_net_start_xmit, + .ndo_set_rx_mode = vector_net_set_multicast_list, + .ndo_tx_timeout = vector_net_tx_timeout, + .ndo_set_mac_address = eth_mac_addr, + .ndo_validate_addr = eth_validate_addr, +#ifdef CONFIG_NET_POLL_CONTROLLER + .ndo_poll_controller = vector_net_poll_controller, +#endif +}; + + +static void vector_timer_expire(unsigned long _conn) +{ + struct vector_private *vp = (struct vector_private *)_conn; + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} + +static void vector_eth_configure( + int n, + struct arglist *def + ) +{ + struct vector_device *device; + struct net_device *dev; + struct vector_private *vp; + int err; + + printk(KERN_INFO "Vector Netdevice %d\n", n); + + device = kzalloc(sizeof(*device), GFP_KERNEL); + if (device == NULL) { + printk(KERN_ERR "eth_configure failed to allocate struct " + "vector_device\n"); + return; + } + printk(KERN_INFO "vector etherdevice start %d\n", n); + dev = alloc_etherdev(sizeof(struct vector_private)); + if (dev == NULL) { + printk(KERN_ERR "eth_configure: failed to allocate struct " + "net_device for vec%d\n", n); + goto out_free_device; + } + + dev->mtu = get_mtu(def); + + INIT_LIST_HEAD(&device->list); + device->unit = n; + + /* If this name ends up conflicting with an existing registered + * netdevice, that is OK, register_netdev{,ice}() will notice this + * and fail. + */ + snprintf(dev->name, sizeof(dev->name), "vec%d", n); + uml_net_setup_etheraddr(dev, uml_vector_fetch_arg(def, "mac")); + vp = netdev_priv(dev); + + /* sysfs register */ + if (!driver_registered) { + platform_driver_register(¨_net_driver); + driver_registered = 1; + } + device->pdev.id = n; + device->pdev.name = DRIVER_NAME; + device->pdev.dev.release = vector_device_release; + dev_set_drvdata(&device->pdev.dev, device); + if (platform_device_register(&device->pdev)) + goto out_free_netdev; + SET_NETDEV_DEV(dev, &device->pdev.dev); + + device->dev = dev; + + *vp = ((struct vector_private) + { + .list = LIST_HEAD_INIT(vp->list), + .dev = dev, + .unit = n, + .options = get_transport_options(def), + .rx_irq = 0, + .tx_irq = 0, + .parsed = def, + .max_packet = get_mtu(def) + ETH_HEADER_OTHER, + /* TODO - we need to calculate headroom so that ip header + * is 16 byte aligned all the time + */ + .headroom = get_headroom(def), + .form_header = NULL, + .verify_header = NULL, + .header_rxbuffer = NULL, + .header_txbuffer = NULL, + .header_size = 0, + .rx_header_size = 0, + .rexmit_scheduled = false, + .opened = false, + .transport_data = NULL, + .in_write_poll = false, + .coalesce = 2 + }); + + /* if we can do vector TX, we can do scatter/gather too */ + if ((vp->options & VECTOR_TX) > 0) + dev->features = NETIF_F_SG; + tasklet_init(&vp->tx_poll, vector_tx_poll, (unsigned long)vp); + INIT_WORK(&vp->reset_tx, vector_reset_tx); + + init_timer(&vp->tl); + spin_lock_init(&vp->lock); + vp->tl.function = vector_timer_expire; + + /* FIXME */ + dev->netdev_ops = &vector_netdev_ops; + dev->ethtool_ops = &vector_net_ethtool_ops; + dev->watchdog_timeo = (HZ >> 1); + /* primary IRQ - fixme */ + dev->irq = 0; /* we will adjust this once opened */ + + rtnl_lock(); + err = register_netdevice(dev); + rtnl_unlock(); + if (err) + goto out_undo_user_init; + + spin_lock(&vector_devices_lock); + list_add(&device->list, &vector_devices); + spin_unlock(&vector_devices_lock); + + return; + +out_undo_user_init: + return; +out_free_netdev: + free_netdev(dev); +out_free_device: + kfree(device); +} + + + + +/* + * Invoked late in the init + */ + +static int __init vector_init(void) +{ + struct list_head *ele; + struct vector_cmd_line_arg *def; + struct arglist *parsed; + + list_for_each(ele, &vec_cmd_line) { + def = list_entry(ele, struct vector_cmd_line_arg, list); + parsed = uml_parse_vector_ifspec(def->arguments); + if (parsed != NULL) + vector_eth_configure(def->unit, parsed); + } + return 0; +} + + +/* Invoked at initial argument parsing, only stores + * arguments until a proper vector_init is called + * later + */ + +static int __init vector_setup(char *str) +{ + char *error; + int n, err; + struct vector_cmd_line_arg *new; + + err = vector_parse(str, &n, &str, &error); + if (err) { + printk(KERN_ERR "vector_setup - Couldn't parse '%s' : %s\n", + str, error); + return 1; + } + new = alloc_bootmem(sizeof(*new)); + INIT_LIST_HEAD(&new->list); + new->unit = n; + new->arguments = str; + list_add_tail(&new->list, &vec_cmd_line); + return 1; +} + +__setup("vec", vector_setup); +__uml_help(vector_setup, +"vec[0-9]+:<option>=<value>,<option>=<value>\n" +" Configure a vector io network device.\n\n" +); + +late_initcall(vector_init); + +static struct mc_device vector_mc = { + .list = LIST_HEAD_INIT(vector_mc.list), + .name = "vec", + .config = vector_config, + .get_config = NULL, + .id = vector_id, + .remove = vector_remove, +}; + +#ifdef CONFIG_INET +static int vector_inetaddr_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + return NOTIFY_DONE; +} + +static struct notifier_block vector_inetaddr_notifier = { + .notifier_call = vector_inetaddr_event, +}; + +static void inet_register(void) +{ + register_inetaddr_notifier(&vector_inetaddr_notifier); +} +#else +static inline void inet_register(void) +{ +} +#endif + +static int vector_net_init(void) +{ + mconsole_register_dev(&vector_mc); + inet_register(); + return 0; +} + +__initcall(vector_net_init); + + + diff --git a/arch/um/drivers/vector_kern.h b/arch/um/drivers/vector_kern.h new file mode 100644 index 000000000000..fd724ef6806e --- /dev/null +++ b/arch/um/drivers/vector_kern.h @@ -0,0 +1,128 @@ +/* + * Copyright (C) 2002 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#ifndef __UM_VECTOR_KERN_H +#define __UM_VECTOR_KERN_H + +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/skbuff.h> +#include <linux/socket.h> +#include <linux/list.h> +#include <linux/ctype.h> +#include <linux/workqueue.h> +#include <linux/interrupt.h> +#include "vector_user.h" + +/* Queue structure specially adapted for multiple enqueue/dequeue + * in a mmsgrecv/mmsgsend context + */ + +/* Dequeue method */ + +#define QUEUE_SENDMSG 0 +#define QUEUE_SENDMMSG 1 + +#define VECTOR_RX 1 +#define VECTOR_TX (1 << 1) +#define VECTOR_BPF (1 << 2) + +#define ETH_MAX_PACKET 1500 +#define ETH_HEADER_OTHER 32 /* just in case someone decides to go mad on QnQ */ + +struct vector_queue { + struct mmsghdr *mmsg_vector; + void **skbuff_vector; + /* backlink to device which owns us */ + struct net_device *dev; + spinlock_t head_lock; + spinlock_t tail_lock; + int queue_depth, head, tail, max_depth, max_iov_frags; + short options; +}; + +struct vector_estats { + uint64_t rx_queue_max; + uint64_t rx_queue_running_average; + uint64_t tx_queue_max; + uint64_t tx_queue_running_average; + uint64_t rx_encaps_errors; + uint64_t tx_timeout_count; + uint64_t tx_restart_queue; + uint64_t tx_kicks; + uint64_t tx_flow_control_xon; + uint64_t tx_flow_control_xoff; + uint64_t rx_csum_offload_good; + uint64_t rx_csum_offload_errors; + uint64_t sg_ok; + uint64_t sg_linearized; +}; + +#define VERIFY_HEADER_NOK -1 +#define VERIFY_HEADER_OK 0 +#define VERIFY_CSUM_OK 1 + +struct vector_private { + struct list_head list; + spinlock_t lock; + struct net_device *dev; + + int unit; + + /* Timeout timer in TX */ + + struct timer_list tl; + + /* Scheduled "remove device" work */ + struct work_struct reset_tx; + struct vector_fds *fds; + + struct vector_queue *rx_queue; + struct vector_queue *tx_queue; + + int rx_irq; + int tx_irq; + + struct arglist *parsed; + + void *transport_data; /* transport specific params if needed */ + + int max_packet; + int headroom; + + int options; + + /* remote address if any - some transports will leave this as null */ + + int header_size; + int rx_header_size; + int coalesce; + + void *header_rxbuffer; + void *header_txbuffer; + + void (*form_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + int (*verify_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + + spinlock_t stats_lock; + + struct tasklet_struct tx_poll; + bool rexmit_scheduled; + bool opened; + bool in_write_poll; + + /* ethtool stats */ + + struct vector_estats estats; + void *bpf; + + char user[0]; +}; + +extern int build_transport_data(struct vector_private *vp); + +#endif diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c new file mode 100644 index 000000000000..d33830d1f74d --- /dev/null +++ b/arch/um/drivers/vector_transports.c @@ -0,0 +1,407 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Licensed under the GPL. + */ + +#include <linux/etherdevice.h> +#include <linux/netdevice.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <asm/byteorder.h> +#include <uapi/linux/ip.h> +#include <uapi/linux/virtio_net.h> +#include <linux/virtio_net.h> +#include <linux/virtio_byteorder.h> +#include <linux/netdev_features.h> +#include "vector_user.h" +#include "vector_kern.h" + + +struct gre_minimal_header { + uint16_t header; + uint16_t arptype; +}; + + +struct uml_gre_data { + uint32_t rx_key; + uint32_t tx_key; + uint32_t sequence; + + bool ipv6; + bool has_sequence; + bool pin_sequence; + bool checksum; + bool key; + struct gre_minimal_header expected_header; + + uint32_t checksum_offset; + uint32_t key_offset; + uint32_t sequence_offset; + +}; + +struct uml_l2tpv3_data { + uint64_t rx_cookie; + uint64_t tx_cookie; + uint64_t rx_session; + uint64_t tx_session; + uint32_t counter; + + bool udp; + bool ipv6; + bool has_counter; + bool pin_counter; + bool cookie; + bool cookie_is_64; + + uint32_t cookie_offset; + uint32_t session_offset; + uint32_t counter_offset; +}; + +static void l2tpv3_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *counter; + + if (td->udp) + *(uint32_t *) header = cpu_to_be32(L2TPV3_DATA_PACKET); + (*(uint32_t *) (header + td->session_offset)) = td->tx_session; + + if (td->cookie) { + if (td->cookie_is_64) + (*(uint64_t *)(header + td->cookie_offset)) = + td->tx_cookie; + else + (*(uint32_t *)(header + td->cookie_offset)) = + td->tx_cookie; + } + if (td->has_counter) { + counter = (uint32_t *)(header + td->counter_offset); + if (td->pin_counter) { + *counter = 0; + } else { + td->counter++; + *counter = cpu_to_be32(td->counter); + } + } +} + +static void gre_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_gre_data *td = vp->transport_data; + uint32_t *sequence; + *((uint32_t *) header) = *((uint32_t *) &td->expected_header); + if (td->key) + (*(uint32_t *) (header + td->key_offset)) = td->tx_key; + if (td->has_sequence) { + sequence = (uint32_t *)(header + td->sequence_offset); + if (td->pin_sequence) + *sequence = 0; + else + *sequence = cpu_to_be32(++td->sequence); + } +} + +static void raw_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + virtio_net_hdr_from_skb(skb, vheader, virtio_legacy_is_little_endian(), false); +} + +static int l2tpv3_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *session; + uint64_t cookie; + + if ((!td->udp) && (!td->ipv6)) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + /* we do not do a strict check for "data" packets as per + * the RFC spec because the pure IP spec does not have + * that anyway. + */ + + if (td->cookie) { + if (td->cookie_is_64) + cookie = *(uint64_t *)(header + td->cookie_offset); + else + cookie = *(uint32_t *)(header + td->cookie_offset); + if (cookie != td->rx_cookie) { + printk(KERN_ERR "uml_l2tpv3: unknown cookie id"); + return -1; + } + } + session = (uint32_t *) (header + td->session_offset); + if (*session != td->rx_session) { + printk(KERN_ERR "uml_l2tpv3: session mismatch"); + return -1; + } + return 0; +} + +static int gre_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + + uint32_t key; + struct uml_gre_data *td = vp->transport_data; + + if (!td->ipv6) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + if (*((uint32_t *) header) != *((uint32_t *) &td->expected_header)) { + printk(KERN_ERR "header type disagreement, expecting %0x, got %0x", + *((uint32_t *) &td->expected_header), + *((uint32_t *) header) + ); + return -1; + } + + if (td->key) { + key = (*(uint32_t *)(header + td->key_offset)); + if (key != td->rx_key) { + printk(KERN_ERR "unknown key id %0x, expecting %0x", + key, td->rx_key); + return -1; + } + } + return 0; +} + +static int raw_verify_header ( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + if (vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) { + printk(KERN_ERR "raw: GSO enabled on interface, please turn off"); + return -1; /* GSO, we cannot process this */ + } + if ((vheader->flags & VIRTIO_NET_HDR_F_DATA_VALID) > 0) + return 1; + else + virtio_net_hdr_to_skb(skb, vheader, virtio_legacy_is_little_endian()); + return 0; +} + +static bool get_uint_param( + struct arglist *def, char *param, unsigned int *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoint(arg, 0, result) == 0) + return true; + } + return false; +} + +static bool get_ulong_param( + struct arglist *def, char *param, unsigned long *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoul(arg, 0, result) == 0) + return true; + return true; + } + return false; +} + +static int build_gre_transport_data(struct vector_private *vp) +{ + struct uml_gre_data *td; + int temp_int; + int temp_rx; + int temp_tx; + + vp->transport_data = kmalloc(sizeof(struct uml_gre_data), GFP_KERNEL); + if (vp->transport_data == NULL) + return -ENOMEM; + td = vp->transport_data; + td->sequence = 0; + + td->expected_header.arptype = GRE_IRB; + td->expected_header.header = 0; + + vp->form_header = &gre_form_header; + vp->verify_header = &gre_verify_header; + vp->header_size = 4; + td->key_offset = 4; + td->sequence_offset = 4; + td->checksum_offset = 4; + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + td->key = false; + if (get_uint_param(vp->parsed, "rx_key", &temp_rx)) { + if (get_uint_param(vp->parsed, "tx_key", &temp_tx)) { + td->key = true; + td->expected_header.header |= GRE_MODE_KEY; + td->rx_key = cpu_to_be32(temp_rx); + td->tx_key = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->sequence_offset += 4; + } else { + return -EINVAL; + } + } + + td->sequence = false; + if (get_uint_param(vp->parsed, "sequence", &temp_int)) { + if (temp_int > 0) { + vp->header_size += 4; + td->has_sequence = true; + td->expected_header.header |= GRE_MODE_SEQUENCE; + if (get_uint_param( + vp->parsed, "pin_sequence", &temp_int)) { + if (temp_int > 0) + td->pin_sequence = true; + } + } + } + vp->rx_header_size = vp->header_size; + if (!td->ipv6) + vp->rx_header_size += sizeof(struct iphdr); + return 0; +} + +static int build_l2tpv3_transport_data(struct vector_private *vp) +{ + + struct uml_l2tpv3_data *td; + int temp_int, temp_rxs, temp_txs; + unsigned long temp_rx; + unsigned long temp_tx; + + vp->transport_data = kmalloc( + sizeof(struct uml_l2tpv3_data), GFP_KERNEL); + + if (vp->transport_data == NULL) + return -ENOMEM; + + td = vp->transport_data; + + vp->form_header = &l2tpv3_form_header; + vp->verify_header = &l2tpv3_verify_header; + td->counter = 0; + + vp->header_size = 4; + td->session_offset = 0; + td->cookie_offset = 4; + td->counter_offset = 4; + + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + + if (get_uint_param(vp->parsed, "rx_session", &temp_rxs)) { + if (get_uint_param(vp->parsed, "tx_session", &temp_txs)) { + td->tx_session = cpu_to_be32(temp_txs); + td->rx_session = cpu_to_be32(temp_rxs); + } else { + return -EINVAL; + } + } else { + return -EINVAL; + } + + td->cookie_is_64 = false; + if (get_uint_param(vp->parsed, "cookie64", &temp_int)) { + if (temp_int > 0) + td->cookie_is_64 = true; + } + td->cookie = false; + if (get_ulong_param(vp->parsed, "rx_cookie", &temp_rx)) { + if (get_ulong_param(vp->parsed, "tx_cookie", &temp_tx)) { + td->cookie = true; + if (td->cookie_is_64) { + td->rx_cookie = cpu_to_be64(temp_rx); + td->tx_cookie = cpu_to_be64(temp_tx); + vp->header_size += 8; + td->counter_offset += 8; + } else { + td->rx_cookie = cpu_to_be32(temp_rx); + td->tx_cookie = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->counter_offset += 4; + } + } else { + return -EINVAL; + } + } + + td->has_counter = false; + if (get_uint_param(vp->parsed, "counter", &temp_int)) { + if (temp_int > 0) { + td->has_counter = true; + vp->header_size += 4; + if (get_uint_param( + vp->parsed, "pin_counter", &temp_int)) { + if (temp_int > 0) + td->pin_counter = true; + } + } + } + + if (get_uint_param(vp->parsed, "udp", &temp_int)) { + if (temp_int > 0) { + td->udp = true; + vp->header_size += 4; + td->counter_offset += 4; + td->session_offset += 4; + td->cookie_offset += 4; + } + } + + vp->rx_header_size = vp->header_size; + if ((!td->ipv6) && (!td->udp)) + vp->rx_header_size += sizeof(struct iphdr); + + return 0; +} + +static int build_raw_transport_data(struct vector_private *vp) +{ + int temp_int; + if (uml_vector_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= NETIF_F_HW_CSUM; + printk(KERN_INFO "raw: using vnet headers to offload checksum"); + } + return 0; +} + + +int build_transport_data(struct vector_private *vp) +{ + char *transport = uml_vector_fetch_arg(vp->parsed, "transport"); + + if (strncmp(transport, TRANS_GRE, TRANS_GRE_LEN) == 0) + return build_gre_transport_data(vp); + if (strncmp(transport, TRANS_L2TPV3, TRANS_L2TPV3_LEN) == 0) + return build_l2tpv3_transport_data(vp); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return build_raw_transport_data(vp); + return 0; +} + diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c new file mode 100644 index 000000000000..6d51289cb9cb --- /dev/null +++ b/arch/um/drivers/vector_user.c @@ -0,0 +1,512 @@ +/* + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#include <stdio.h> +#include <unistd.h> +#include <stdarg.h> +#include <errno.h> +#include <stddef.h> +#include <string.h> +#include <sys/ioctl.h> +#include <net/if.h> +#include <linux/if_tun.h> +#include <arpa/inet.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <net/ethernet.h> +#include <netinet/ip.h> +#include <netinet/ether.h> +#include <linux/if_ether.h> +#include <linux/if_packet.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <netdb.h> +#include <stdlib.h> +#include <os.h> +#include <um_malloc.h> +#include "vector_user.h" + +#define ID_GRE 0 +#define ID_L2TPV3 1 +#define ID_MAX 1 + +#define TOKEN_IFNAME "ifname" + +#define TRANS_RAW "raw" +#define TRANS_RAW_LEN strlen(TRANS_RAW) + +/* This is very ugly and brute force lookup, but it is done + * only once at initialization so not worth doing hashes or + * anything more intelligent + */ + +char *uml_vector_fetch_arg(struct arglist *ifspec, char *token) +{ + int i; + + for (i = 0; i < ifspec->numargs; i++) { + if (strcmp(ifspec->tokens[i], token) == 0) + return ifspec->values[i]; + } + return NULL; + +} + +struct arglist *uml_parse_vector_ifspec(char *arg) +{ + struct arglist *result; + int pos, len; + bool parsing_token = true, next_starts = true; + + if (arg == NULL) + return NULL; + result = uml_kmalloc(sizeof(struct arglist), UM_GFP_KERNEL); + if (result == NULL) + return NULL; + result->numargs = 0; + len = strlen(arg); + for (pos = 0; pos < len; pos++) { + if (next_starts) { + if (parsing_token) { + result->tokens[result->numargs] = arg + pos; + } else { + result->values[result->numargs] = arg + pos; + result->numargs++; + } + next_starts = false; + } + if (*(arg + pos) == '=') { + if (parsing_token) + parsing_token = false; + else + goto cleanup; + next_starts = true; + (*(arg + pos)) = '\0'; + } + if (*(arg + pos) == ',') { + parsing_token = true; + next_starts = true; + (*(arg + pos)) = '\0'; + } + } + return result; +cleanup: + printk(UM_KERN_ERR "vector_setup - Couldn't parse '%s'\n", arg); + kfree(result); + return NULL; +} + +/* + * Socket/FD configuration functions. These return an structure + * of rx and tx descriptors to cover cases where these are not + * the same (f.e. read via raw socket and write via tap). + */ + +#define PATH_NET_TUN "/dev/net/tun" + +static struct vector_fds *user_init_tap_fds(struct arglist *ifspec) +{ + struct ifreq ifr; + int fd = -1; + struct sockaddr_ll sock; + int err = -ENOMEM; + char *iface; + struct vector_fds *result = NULL; + + iface = uml_vector_fetch_arg(ifspec, TOKEN_IFNAME); + if (iface == NULL) { + printk(UM_KERN_ERR "uml_tap: failed to parse interface spec\n"); + goto tap_cleanup; + } + + result = uml_kmalloc(sizeof(struct vector_fds), UM_GFP_KERNEL); + if (result == NULL) { + printk(UM_KERN_ERR "uml_tap: failed to allocate file descriptors\n"); + goto tap_cleanup; + } + result->rx_fd = -1; + result->tx_fd = -1; + result->remote_addr = NULL; + result->remote_addr_size = 0; + + /* TAP */ + + fd = open(PATH_NET_TUN, O_RDWR); + if (fd < 0) { + printk(UM_KERN_ERR "uml_tap: failed to open tun device\n"); + goto tap_cleanup; + } + result->tx_fd = fd; + memset(&ifr, 0, sizeof(ifr)); + ifr.ifr_flags = IFF_TAP | IFF_NO_PI; + strncpy((char *)&ifr.ifr_name, iface, sizeof(ifr.... [truncated message content] |
From: <ant...@ko...> - 2017-09-26 10:27:16
|
From: Anton Ivanov <ant...@ca...> 1. Removes the need to walk the IRQ/Device list to determine who triggered the IRQ. 2. Improves scalability (up to several times performance improvement for cases with 10s of devices). 3. Improves UML baseline IO performance for one disk + one NIC use case by up to 10%. 4. Introduces write poll triggered IRQs. 5. Prerequisite for introducing high performance mmesg family of functions in network IO. 6. Fixes RNG shutdown which was leaking a file descriptor Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/drivers/chan_kern.c | 53 +---- arch/um/drivers/line.c | 2 +- arch/um/drivers/random.c | 11 +- arch/um/drivers/ubd_kern.c | 4 +- arch/um/include/shared/irq_user.h | 12 +- arch/um/include/shared/os.h | 17 +- arch/um/kernel/irq.c | 460 ++++++++++++++++++++++++-------------- arch/um/os-Linux/irq.c | 202 +++++++++-------- 8 files changed, 444 insertions(+), 317 deletions(-) diff --git a/arch/um/drivers/chan_kern.c b/arch/um/drivers/chan_kern.c index acbe6c67afba..05588f9466c7 100644 --- a/arch/um/drivers/chan_kern.c +++ b/arch/um/drivers/chan_kern.c @@ -171,56 +171,19 @@ int enable_chan(struct line *line) return err; } -/* Items are added in IRQ context, when free_irq can't be called, and - * removed in process context, when it can. - * This handles interrupt sources which disappear, and which need to - * be permanently disabled. This is discovered in IRQ context, but - * the freeing of the IRQ must be done later. - */ -static DEFINE_SPINLOCK(irqs_to_free_lock); -static LIST_HEAD(irqs_to_free); - -void free_irqs(void) -{ - struct chan *chan; - LIST_HEAD(list); - struct list_head *ele; - unsigned long flags; - - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_splice_init(&irqs_to_free, &list); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - - list_for_each(ele, &list) { - chan = list_entry(ele, struct chan, free_list); - - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } -} - static void close_one_chan(struct chan *chan, int delay_free_irq) { - unsigned long flags; - if (!chan->opened) return; - if (delay_free_irq) { - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_add(&chan->free_list, &irqs_to_free); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - } - else { - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } + /* we can safely call free now - it will be marked + * as free and freed once the IRQ stopped processing + */ + if (chan->input && chan->enabled) + um_free_irq(chan->line->driver->read_irq, chan); + if (chan->output && chan->enabled) + um_free_irq(chan->line->driver->write_irq, chan); + chan->enabled = 0; if (chan->ops->close != NULL) (*chan->ops->close)(chan->fd, chan->data); diff --git a/arch/um/drivers/line.c b/arch/um/drivers/line.c index 366e57f5e8d6..8d80b27502e6 100644 --- a/arch/um/drivers/line.c +++ b/arch/um/drivers/line.c @@ -284,7 +284,7 @@ int line_setup_irq(int fd, int input, int output, struct line *line, void *data) if (err) return err; if (output) - err = um_request_irq(driver->write_irq, fd, IRQ_WRITE, + err = um_request_irq(driver->write_irq, fd, IRQ_NONE, line_write_interrupt, IRQF_SHARED, driver->write_irq_name, data); return err; diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c index 37c51a6be690..778a0e52d5a5 100644 --- a/arch/um/drivers/random.c +++ b/arch/um/drivers/random.c @@ -13,6 +13,7 @@ #include <linux/miscdevice.h> #include <linux/delay.h> #include <linux/uaccess.h> +#include <init.h> #include <irq_kern.h> #include <os.h> @@ -154,7 +155,14 @@ static int __init rng_init (void) /* * rng_cleanup - shutdown RNG module */ -static void __exit rng_cleanup (void) + +static void cleanup(void) +{ + free_irq_by_fd(random_fd); + os_close_file(random_fd); +} + +static void __exit rng_cleanup(void) { os_close_file(random_fd); misc_deregister (&rng_miscdev); @@ -162,6 +170,7 @@ static void __exit rng_cleanup (void) module_init (rng_init); module_exit (rng_cleanup); +__uml_exitcall(cleanup); MODULE_DESCRIPTION("UML Host Random Number Generator (RNG) driver"); MODULE_LICENSE("GPL"); diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c index b55fe9bf5d3e..d4e8c497ae86 100644 --- a/arch/um/drivers/ubd_kern.c +++ b/arch/um/drivers/ubd_kern.c @@ -1587,11 +1587,11 @@ int io_thread(void *arg) do { res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n); - if (res > 0) { + if (res >= 0) { written += res; } else { if (res != -EAGAIN) { - printk("io_thread - read failed, fd = %d, " + printk("io_thread - write failed, fd = %d, " "err = %d\n", kernel_fd, -n); } } diff --git a/arch/um/include/shared/irq_user.h b/arch/um/include/shared/irq_user.h index df5633053957..a7a6120f19d5 100644 --- a/arch/um/include/shared/irq_user.h +++ b/arch/um/include/shared/irq_user.h @@ -7,6 +7,7 @@ #define __IRQ_USER_H__ #include <sysdep/ptrace.h> +#include <stdbool.h> struct irq_fd { struct irq_fd *next; @@ -15,10 +16,17 @@ struct irq_fd { int type; int irq; int events; - int current_events; + bool active; + bool pending; + bool purge; }; -enum { IRQ_READ, IRQ_WRITE }; +#define IRQ_READ 0 +#define IRQ_WRITE 1 +#define IRQ_NONE 2 +#define MAX_IRQ_TYPE (IRQ_NONE + 1) + + struct siginfo; extern void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs); diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index 574e03fc7ba2..edbe6563d5c5 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -290,15 +290,16 @@ extern void halt_skas(void); extern void reboot_skas(void); /* irq.c */ -extern int os_waiting_for_events(struct irq_fd *active_fds); -extern int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds); -extern void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2); -extern void os_free_irq_later(struct irq_fd *active_fds, - int irq, void *dev_id); -extern int os_get_pollfd(int i); -extern void os_set_pollfd(int i, int fd); +extern int os_waiting_for_events_epoll(void); +extern void *os_epoll_get_data_pointer(int index); +extern int os_epoll_triggered(int index, int events); +extern int os_event_mask(int irq_type); +extern int os_setup_epoll(void); +extern int os_add_epoll_fd(int events, int fd, void *data); +extern int os_mod_epoll_fd(int events, int fd, void *data); +extern int os_del_epoll_fd(int fd); extern void os_set_ioignore(void); +extern void os_close_epoll_fd(void); /* sigio.c */ extern int add_sigio_fd(int fd); diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c index 23cb9350d47e..980148d56537 100644 --- a/arch/um/kernel/irq.c +++ b/arch/um/kernel/irq.c @@ -1,4 +1,6 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL * Derived (i.e. mostly copied) from arch/i386/kernel/irq.c: @@ -16,243 +18,361 @@ #include <as-layout.h> #include <kern_util.h> #include <os.h> +#include <irq_user.h> -/* - * This list is accessed under irq_lock, except in sigio_handler, - * where it is safe from being modified. IRQ handlers won't change it - - * if an IRQ source has vanished, it will be freed by free_irqs just - * before returning from sigio_handler. That will process a separate - * list of irqs to free, with its own locking, coming back here to - * remove list elements, taking the irq_lock to do so. + +/* When epoll triggers we do not know why it did so + * we can also have different IRQs for read and write. + * This is why we keep a small irq_fd array for each fd - + * one entry per IRQ type */ -static struct irq_fd *active_fds = NULL; -static struct irq_fd **last_irq_ptr = &active_fds; -extern void free_irqs(void); +struct irq_entry { + struct irq_entry *next; + int fd; + struct irq_fd *irq_array[MAX_IRQ_TYPE + 1]; +}; + +static struct irq_entry *active_fds; + +static DEFINE_SPINLOCK(irq_lock); + +static void irq_io_loop(struct irq_fd *irq, struct uml_pt_regs *regs) +{ +/* + * irq->active guards against reentry + * irq->pending accumulates pending requests + * if pending is raised the irq_handler is re-run + * until pending is cleared + */ + if (irq->active) { + irq->active = false; + do { + irq->pending = false; + do_IRQ(irq->irq, regs); + } while (irq->pending && (!irq->purge)); + if (!irq->purge) + irq->active = true; + } else { + irq->pending = true; + } +} void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs) { - struct irq_fd *irq_fd; - int n; + struct irq_entry *irq_entry; + struct irq_fd *irq; + + int n, i, j; while (1) { - n = os_waiting_for_events(active_fds); + /* This is now lockless - epoll keeps back-referencesto the irqs + * which have trigger it so there is no need to walk the irq + * list and lock it every time. We avoid locking by turning off + * IO for a specific fd by executing os_del_epoll_fd(fd) before + * we do any changes to the actual data structures + */ + n = os_waiting_for_events_epoll(); + if (n <= 0) { if (n == -EINTR) continue; - else break; + else + break; } - for (irq_fd = active_fds; irq_fd != NULL; - irq_fd = irq_fd->next) { - if (irq_fd->current_events != 0) { - irq_fd->current_events = 0; - do_IRQ(irq_fd->irq, regs); + for (i = 0; i < n ; i++) { + /* Epoll back reference is the entry with 3 irq_fd + * leaves - one for each irq type. + */ + irq_entry = (struct irq_entry *) + os_epoll_get_data_pointer(i); + for (j = 0; j < MAX_IRQ_TYPE ; j++) { + irq = irq_entry->irq_array[j]; + if (irq == NULL) + continue; + if (os_epoll_triggered(i, irq->events) > 0) + irq_io_loop(irq, regs); + if (irq->purge) { + irq_entry->irq_array[j] = NULL; + kfree(irq); + } } } } +} + +static int assign_epoll_events_to_irq(struct irq_entry *irq_entry) +{ + int i; + int events = 0; + struct irq_fd *irq; - free_irqs(); + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + irq = irq_entry->irq_array[i]; + if (irq != NULL) + events = irq->events | events; + } + if (events > 0) { + /* os_add_epoll will call os_mod_epoll if this already exists */ + return os_add_epoll_fd(events, irq_entry->fd, irq_entry); + } + /* No events - delete */ + return os_del_epoll_fd(irq_entry->fd); } -static DEFINE_SPINLOCK(irq_lock); + static int activate_fd(int irq, int fd, int type, void *dev_id) { - struct pollfd *tmp_pfd; - struct irq_fd *new_fd, *irq_fd; + struct irq_fd *new_fd; + struct irq_entry *irq_entry; + int i, err, events; unsigned long flags; - int events, err, n; err = os_set_fd_async(fd); if (err < 0) goto out; - err = -ENOMEM; - new_fd = kmalloc(sizeof(struct irq_fd), GFP_KERNEL); - if (new_fd == NULL) - goto out; + spin_lock_irqsave(&irq_lock, flags); - if (type == IRQ_READ) - events = UM_POLLIN | UM_POLLPRI; - else events = UM_POLLOUT; - *new_fd = ((struct irq_fd) { .next = NULL, - .id = dev_id, - .fd = fd, - .type = type, - .irq = irq, - .events = events, - .current_events = 0 } ); + /* Check if we have an entry for this fd */ err = -EBUSY; - spin_lock_irqsave(&irq_lock, flags); - for (irq_fd = active_fds; irq_fd != NULL; irq_fd = irq_fd->next) { - if ((irq_fd->fd == fd) && (irq_fd->type == type)) { - printk(KERN_ERR "Registering fd %d twice\n", fd); - printk(KERN_ERR "Irqs : %d, %d\n", irq_fd->irq, irq); - printk(KERN_ERR "Ids : 0x%p, 0x%p\n", irq_fd->id, - dev_id); + for (irq_entry = active_fds; + irq_entry != NULL; irq_entry = irq_entry->next) { + if (irq_entry->fd == fd) + break; + } + + if (irq_entry == NULL) { + /* This needs to be atomic as it may be called from an + * IRQ context. + */ + irq_entry = kmalloc(sizeof(struct irq_entry), GFP_ATOMIC); + if (irq_entry == NULL) { + printk(KERN_ERR + "Failed to allocate new IRQ entry\n"); goto out_unlock; } + irq_entry->fd = fd; + for (i = 0; i < MAX_IRQ_TYPE; i++) + irq_entry->irq_array[i] = NULL; + irq_entry->next = active_fds; + active_fds = irq_entry; } - if (type == IRQ_WRITE) - fd = -1; - - tmp_pfd = NULL; - n = 0; + /* Check if we are trying to re-register an interrupt for a + * particular fd + */ - while (1) { - n = os_create_pollfd(fd, events, tmp_pfd, n); - if (n == 0) - break; + if (irq_entry->irq_array[type] != NULL) { + printk(KERN_ERR + "Trying to reregister IRQ %d FD %d TYPE %d ID %p\n", + irq, fd, type, dev_id + ); + goto out_unlock; + } else { + /* New entry for this fd */ + + err = -ENOMEM; + new_fd = kmalloc(sizeof(struct irq_fd), GFP_ATOMIC); + if (new_fd == NULL) + goto out_unlock; - /* - * n > 0 - * It means we couldn't put new pollfd to current pollfds - * and tmp_fds is NULL or too small for new pollfds array. - * Needed size is equal to n as minimum. - * - * Here we have to drop the lock in order to call - * kmalloc, which might sleep. - * If something else came in and changed the pollfds array - * so we will not be able to put new pollfd struct to pollfds - * then we free the buffer tmp_fds and try again. + events = os_event_mask(type); + + *new_fd = ((struct irq_fd) { + .id = dev_id, + .irq = irq, + .type = type, + .events = events, + .active = true, + .pending = false, + .purge = false + }); + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop */ - spin_unlock_irqrestore(&irq_lock, flags); - kfree(tmp_pfd); - - tmp_pfd = kmalloc(n, GFP_KERNEL); - if (tmp_pfd == NULL) - goto out_kfree; - - spin_lock_irqsave(&irq_lock, flags); + os_del_epoll_fd(irq_entry->fd); + irq_entry->irq_array[type] = new_fd; } - *last_irq_ptr = new_fd; - last_irq_ptr = &new_fd->next; - + /* Turn back IO on with the correct (new) IO event mask */ + assign_epoll_events_to_irq(irq_entry); spin_unlock_irqrestore(&irq_lock, flags); - - /* - * This calls activate_fd, so it has to be outside the critical - * section. - */ - maybe_sigio_broken(fd, (type == IRQ_READ)); + maybe_sigio_broken(fd, (type != IRQ_NONE)); return 0; - - out_unlock: +out_unlock: spin_unlock_irqrestore(&irq_lock, flags); - out_kfree: - kfree(new_fd); - out: +out: return err; } -static void free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg) +/* + * Walk the IRQ list and dispose of any unused entries. + * Should be done under irq_lock. + */ + +static void garbage_collect_irq_entries(void) { - unsigned long flags; + int i; + bool reap; + struct irq_entry *walk; + struct irq_entry *previous = NULL; + struct irq_entry *to_free; - spin_lock_irqsave(&irq_lock, flags); - os_free_irq_by_cb(test, arg, active_fds, &last_irq_ptr); - spin_unlock_irqrestore(&irq_lock, flags); + if (active_fds == NULL) + return; + walk = active_fds; + while (walk != NULL) { + reap = true; + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (walk->irq_array[i] != NULL) { + reap = false; + break; + } + } + if (reap) { + if (previous == NULL) + active_fds = walk->next; + else + previous->next = walk->next; + to_free = walk; + } else { + to_free = NULL; + } + walk = walk->next; + if (to_free != NULL) + kfree(to_free); + } } -struct irq_and_dev { - int irq; - void *dev; -}; +/* + * Walk the IRQ list and get the descriptor for our FD + */ -static int same_irq_and_dev(struct irq_fd *irq, void *d) +static struct irq_entry *get_irq_entry_by_fd(int fd) { - struct irq_and_dev *data = d; + struct irq_entry *walk = active_fds; - return ((irq->irq == data->irq) && (irq->id == data->dev)); + while (walk != NULL) { + if (walk->fd == fd) + return walk; + walk = walk->next; + } + return NULL; } -static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) -{ - struct irq_and_dev data = ((struct irq_and_dev) { .irq = irq, - .dev = dev }); - free_irq_by_cb(same_irq_and_dev, &data); -} +/* + * Walk the IRQ list and dispose of an entry for a specific + * device, fd and number. Note - if sharing an IRQ for read + * and writefor the same FD it will be disposed in either case. + * If this behaviour is undesirable use different IRQ ids. + */ -static int same_fd(struct irq_fd *irq, void *fd) -{ - return (irq->fd == *((int *)fd)); -} +#define IGNORE_IRQ 1 +#define IGNORE_DEV (1<<1) -void free_irq_by_fd(int fd) +static void do_free_by_irq_and_dev( + struct irq_entry *irq_entry, + unsigned int irq, + void *dev, + int flags +) { - free_irq_by_cb(same_fd, &fd); + int i; + struct irq_fd *to_free; + + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (irq_entry->irq_array[i] != NULL) { + if ( + ((flags & IGNORE_IRQ) || + (irq_entry->irq_array[i]->irq == irq)) && + ((flags & IGNORE_DEV) || + (irq_entry->irq_array[i]->id == dev)) + ) { + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop + */ + os_del_epoll_fd(irq_entry->fd); + to_free = irq_entry->irq_array[i]; + irq_entry->irq_array[i] = NULL; + assign_epoll_events_to_irq(irq_entry); + if (to_free->active) + to_free->purge = true; + else + kfree(to_free); + } + } + } } -/* Must be called with irq_lock held */ -static struct irq_fd *find_irq_by_fd(int fd, int irqnum, int *index_out) +void free_irq_by_fd(int fd) { - struct irq_fd *irq; - int i = 0; - int fdi; + struct irq_entry *to_free; + unsigned long flags; - for (irq = active_fds; irq != NULL; irq = irq->next) { - if ((irq->fd == fd) && (irq->irq == irqnum)) - break; - i++; - } - if (irq == NULL) { - printk(KERN_ERR "find_irq_by_fd doesn't have descriptor %d\n", - fd); - goto out; - } - fdi = os_get_pollfd(i); - if ((fdi != -1) && (fdi != fd)) { - printk(KERN_ERR "find_irq_by_fd - mismatch between active_fds " - "and pollfds, fd %d vs %d, need %d\n", irq->fd, - fdi, fd); - irq = NULL; - goto out; + spin_lock_irqsave(&irq_lock, flags); + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); } - *index_out = i; - out: - return irq; + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); } -void reactivate_fd(int fd, int irqnum) +static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irq, + dev, + 0 + ); + to_free = to_free->next; } - os_set_pollfd(i, irq->fd); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); +} - add_sigio_fd(fd); + +void reactivate_fd(int fd, int irqnum) +{ + /** NOP - we do auto-EOI now **/ } void deactivate_fd(int fd, int irqnum) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; + os_del_epoll_fd(fd); spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irqnum, + NULL, + IGNORE_DEV + ); } - - os_set_pollfd(i, -1); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); - ignore_sigio_fd(fd); } EXPORT_SYMBOL(deactivate_fd); @@ -265,17 +385,28 @@ EXPORT_SYMBOL(deactivate_fd); */ int deactivate_all_fds(void) { - struct irq_fd *irq; - int err; + unsigned long flags; + struct irq_entry *to_free; - for (irq = active_fds; irq != NULL; irq = irq->next) { - err = os_clear_fd_async(irq->fd); - if (err) - return err; - } - /* If there is a signal already queued, after unblocking ignore it */ + spin_lock_irqsave(&irq_lock, flags); + /* Stop IO. The IRQ loop has no lock so this is our + * only way of making sure we are safe to dispose + * of all IRQ handlers + */ os_set_ioignore(); - + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); + to_free = to_free->next; + } + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); + os_close_epoll_fd(); return 0; } @@ -353,8 +484,11 @@ void __init init_IRQ(void) irq_set_chip_and_handler(TIMER_IRQ, &SIGVTALRM_irq_type, handle_edge_irq); + for (i = 1; i < NR_IRQS; i++) irq_set_chip_and_handler(i, &normal_irq_type, handle_edge_irq); + /* Initialize EPOLL Loop */ + os_setup_epoll(); } /* diff --git a/arch/um/os-Linux/irq.c b/arch/um/os-Linux/irq.c index b9afb74b79ad..365823010346 100644 --- a/arch/um/os-Linux/irq.c +++ b/arch/um/os-Linux/irq.c @@ -1,135 +1,147 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL */ #include <stdlib.h> #include <errno.h> -#include <poll.h> +#include <sys/epoll.h> #include <signal.h> #include <string.h> #include <irq_user.h> #include <os.h> #include <um_malloc.h> +/* Epoll support */ + +static int epollfd = -1; + +#define MAX_EPOLL_EVENTS 64 + +static struct epoll_event epoll_events[MAX_EPOLL_EVENTS]; + +/* Helper to return an Epoll data pointer from an epoll event structure. + * We need to keep this one on the userspace side to keep includes separate + */ + +void *os_epoll_get_data_pointer(int index) +{ + return epoll_events[index].data.ptr; +} + +/* Helper to compare events versus the events in the epoll structure. + * Same as above - needs to be on the userspace side + */ + + +int os_epoll_triggered(int index, int events) +{ + return epoll_events[index].events & events; +} +/* Helper to set the event mask. + * The event mask is opaque to the kernel side, because it does not have + * access to the right includes/defines for EPOLL constants. + */ + +int os_event_mask(int irq_type) +{ + if (irq_type == IRQ_READ) + return EPOLLIN | EPOLLPRI; + if (irq_type == IRQ_WRITE) + return EPOLLOUT; + return 0; +} + /* - * Locked by irq_lock in arch/um/kernel/irq.c. Changed by os_create_pollfd - * and os_free_irq_by_cb, which are called under irq_lock. + * Initial Epoll Setup */ -static struct pollfd *pollfds = NULL; -static int pollfds_num = 0; -static int pollfds_size = 0; +int os_setup_epoll(void) +{ + epollfd = epoll_create(MAX_EPOLL_EVENTS); + return epollfd; +} -int os_waiting_for_events(struct irq_fd *active_fds) +/* + * Helper to run the actual epoll_wait + */ +int os_waiting_for_events_epoll(void) { - struct irq_fd *irq_fd; - int i, n, err; + int n, err; - n = poll(pollfds, pollfds_num, 0); + n = epoll_wait(epollfd, + (struct epoll_event *) &epoll_events, MAX_EPOLL_EVENTS, 0); if (n < 0) { err = -errno; if (errno != EINTR) - printk(UM_KERN_ERR "os_waiting_for_events:" - " poll returned %d, errno = %d\n", n, errno); + printk( + UM_KERN_ERR "os_waiting_for_events:" + " epoll returned %d, error = %s\n", n, + strerror(errno) + ); return err; } - - if (n == 0) - return 0; - - irq_fd = active_fds; - - for (i = 0; i < pollfds_num; i++) { - if (pollfds[i].revents != 0) { - irq_fd->current_events = pollfds[i].revents; - pollfds[i].fd = -1; - } - irq_fd = irq_fd->next; - } return n; } -int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds) -{ - if (pollfds_num == pollfds_size) { - if (size_tmpfds <= pollfds_size * sizeof(pollfds[0])) { - /* return min size needed for new pollfds area */ - return (pollfds_size + 1) * sizeof(pollfds[0]); - } - - if (pollfds != NULL) { - memcpy(tmp_pfd, pollfds, - sizeof(pollfds[0]) * pollfds_size); - /* remove old pollfds */ - kfree(pollfds); - } - pollfds = tmp_pfd; - pollfds_size++; - } else - kfree(tmp_pfd); /* remove not used tmp_pfd */ - - pollfds[pollfds_num] = ((struct pollfd) { .fd = fd, - .events = events, - .revents = 0 }); - pollfds_num++; - - return 0; -} -void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2) +/* + * Helper to add a fd to epoll + */ +int os_add_epoll_fd(int events, int fd, void *data) { - struct irq_fd **prev; - int i = 0; - - prev = &active_fds; - while (*prev != NULL) { - if ((*test)(*prev, arg)) { - struct irq_fd *old_fd = *prev; - if ((pollfds[i].fd != -1) && - (pollfds[i].fd != (*prev)->fd)) { - printk(UM_KERN_ERR "os_free_irq_by_cb - " - "mismatch between active_fds and " - "pollfds, fd %d vs %d\n", - (*prev)->fd, pollfds[i].fd); - goto out; - } - - pollfds_num--; - - /* - * This moves the *whole* array after pollfds[i] - * (though it doesn't spot as such)! - */ - memmove(&pollfds[i], &pollfds[i + 1], - (pollfds_num - i) * sizeof(pollfds[0])); - if (*last_irq_ptr2 == &old_fd->next) - *last_irq_ptr2 = prev; - - *prev = (*prev)->next; - if (old_fd->type == IRQ_WRITE) - ignore_sigio_fd(old_fd->fd); - kfree(old_fd); - continue; - } - prev = &(*prev)->next; - i++; - } - out: - return; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events | EPOLLET; + result = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event); + if ((result) && (errno == EEXIST)) + result = os_mod_epoll_fd(events, fd, data); + if (result) + printk("epollctl add err fd %d, %s\n", fd, strerror(errno)); + return result; } -int os_get_pollfd(int i) +/* + * Helper to mod the fd event mask and/or data backreference + */ +int os_mod_epoll_fd(int events, int fd, void *data) { - return pollfds[i].fd; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events; + result = epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &event); + if (result) + printk(UM_KERN_ERR + "epollctl mod err fd %d, %s\n", fd, strerror(errno)); + return result; } -void os_set_pollfd(int i, int fd) +/* + * Helper to delete the epoll fd + */ +int os_del_epoll_fd(int fd) { - pollfds[i].fd = fd; + struct epoll_event event; + int result; + /* This is quiet as we use this as IO ON/OFF - so it is often + * invoked on a non-existent fd + */ + result = epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, &event); + return result; } void os_set_ioignore(void) { signal(SIGIO, SIG_IGN); } + +void os_close_epoll_fd(void) +{ + /* Needed so we do not leak an fd when rebooting */ + os_close_file(epollfd); +} -- 2.11.0 |
From: <ant...@ko...> - 2017-09-26 10:27:10
|
Hi List, hi Richard, I found a logical error in the SG code. This revision fixes that. No other changes - the rest appears to work as it should in my testing. Brgds, A. |
From: Richard W. <ric...@gm...> - 2017-09-21 20:13:37
|
On Tue, Sep 12, 2017 at 2:10 PM, Petr Mladek <pm...@su...> wrote: > On Wed 2017-09-06 22:27:49, Helge Deller wrote: >> Use the %pS printk format for printing symbols from direct addresses. >> In usermode-linux there is actually no difference between %pS and %pF, but for >> consistency throughout the kernel fix the wrong usage here too. >> >> Signed-off-by: Helge Deller <de...@gm...> >> Cc: Jeff Dike <jd...@ad...> >> Cc: Richard Weinberger <ri...@no...> >> Cc: use...@li... >> --- >> arch/um/kernel/sysrq.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/arch/um/kernel/sysrq.c b/arch/um/kernel/sysrq.c >> index 6b995e8..05585ee 100644 >> --- a/arch/um/kernel/sysrq.c >> +++ b/arch/um/kernel/sysrq.c >> @@ -20,7 +20,7 @@ >> >> static void _print_addr(void *data, unsigned long address, int reliable) >> { >> - pr_info(" [<%08lx>] %s%pF\n", address, reliable ? "" : "? ", >> + pr_info(" [<%08lx>] %s%pS\n", address, reliable ? "" : "? ", >> (void *)address); > > This seems to be used to print addresses from the stack. > IMHO, we should use %pB here. %pWTF? ;) Agreed, let's use %pB. -- Thanks, //richard |
From: Richard W. <ri...@no...> - 2017-09-16 18:59:14
|
Linus, The following changes since commit 569dbb88e80deb68974ef6fdd6a13edb9d686261: Linux 4.13 (2017-09-03 13:56:17 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml.git for-linus-4.14-rc1 for you to fetch changes up to 6d20e6b235aad0be463b23588093b079362bb2e4: um: return negative in tuntap_open_tramp() (2017-09-13 22:36:50 +0200) ---------------------------------------------------------------- This pull request contains updates for UML: - Minor improvements - Fixes for Debian's new gcc defaults (pie enabled by default) - Fixes for XSTATE/XSAVE to make UML work again on modern systems ---------------------------------------------------------------- Dan Carpenter (2): um: remove a stray tab um: return negative in tuntap_open_tramp() James Pack (1): Fix minor typos and grammar in UML start_up help Krzysztof Kozlowski (1): um: defconfig: Cleanup from old Kconfig options Thomas Meyer (4): um: Fix FP register size for XSTATE/XSAVE um: Fix CONFIG_GCOV for modules. um: link vmlinux with -no-pie um: Use relative modversions with LD_SCRIPT_DYN arch/um/Kconfig.um | 1 + arch/um/Makefile | 2 +- arch/um/configs/i386_defconfig | 1 - arch/um/configs/x86_64_defconfig | 1 - arch/um/include/asm/thread_info.h | 3 +++ arch/um/include/shared/os.h | 2 +- arch/um/kernel/gmon_syms.c | 7 +++++++ arch/um/kernel/process.c | 4 ++-- arch/um/os-Linux/drivers/tuntap_user.c | 2 +- arch/um/os-Linux/skas/process.c | 17 ++++++++--------- arch/um/os-Linux/start_up.c | 6 +++--- arch/x86/um/os-Linux/registers.c | 18 ++++++++++++------ arch/x86/um/os-Linux/tls.c | 2 +- arch/x86/um/user-offsets.c | 2 +- 14 files changed, 41 insertions(+), 27 deletions(-) |
From: <ant...@ko...> - 2017-09-10 07:20:35
|
From: Anton Ivanov <ant...@ca...> 1. Removes the need to walk the IRQ/Device list to determine who triggered the IRQ. 2. Improves scalability (up to several times performance improvement for cases with 10s of devices). 3. Improves UML baseline IO performance for one disk + one NIC use case by up to 10%. 4. Introduces write poll triggered IRQs. 5. Prerequisite for introducing high performance mmesg family of functions in network IO. 6. Fixes RNG shutdown which was leaking a file descriptor Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/drivers/chan_kern.c | 53 +---- arch/um/drivers/line.c | 2 +- arch/um/drivers/random.c | 11 +- arch/um/drivers/ubd_kern.c | 4 +- arch/um/include/shared/irq_user.h | 12 +- arch/um/include/shared/os.h | 17 +- arch/um/kernel/irq.c | 460 ++++++++++++++++++++++++-------------- arch/um/os-Linux/irq.c | 202 +++++++++-------- 8 files changed, 444 insertions(+), 317 deletions(-) diff --git a/arch/um/drivers/chan_kern.c b/arch/um/drivers/chan_kern.c index acbe6c67afba..05588f9466c7 100644 --- a/arch/um/drivers/chan_kern.c +++ b/arch/um/drivers/chan_kern.c @@ -171,56 +171,19 @@ int enable_chan(struct line *line) return err; } -/* Items are added in IRQ context, when free_irq can't be called, and - * removed in process context, when it can. - * This handles interrupt sources which disappear, and which need to - * be permanently disabled. This is discovered in IRQ context, but - * the freeing of the IRQ must be done later. - */ -static DEFINE_SPINLOCK(irqs_to_free_lock); -static LIST_HEAD(irqs_to_free); - -void free_irqs(void) -{ - struct chan *chan; - LIST_HEAD(list); - struct list_head *ele; - unsigned long flags; - - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_splice_init(&irqs_to_free, &list); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - - list_for_each(ele, &list) { - chan = list_entry(ele, struct chan, free_list); - - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } -} - static void close_one_chan(struct chan *chan, int delay_free_irq) { - unsigned long flags; - if (!chan->opened) return; - if (delay_free_irq) { - spin_lock_irqsave(&irqs_to_free_lock, flags); - list_add(&chan->free_list, &irqs_to_free); - spin_unlock_irqrestore(&irqs_to_free_lock, flags); - } - else { - if (chan->input && chan->enabled) - um_free_irq(chan->line->driver->read_irq, chan); - if (chan->output && chan->enabled) - um_free_irq(chan->line->driver->write_irq, chan); - chan->enabled = 0; - } + /* we can safely call free now - it will be marked + * as free and freed once the IRQ stopped processing + */ + if (chan->input && chan->enabled) + um_free_irq(chan->line->driver->read_irq, chan); + if (chan->output && chan->enabled) + um_free_irq(chan->line->driver->write_irq, chan); + chan->enabled = 0; if (chan->ops->close != NULL) (*chan->ops->close)(chan->fd, chan->data); diff --git a/arch/um/drivers/line.c b/arch/um/drivers/line.c index 366e57f5e8d6..8d80b27502e6 100644 --- a/arch/um/drivers/line.c +++ b/arch/um/drivers/line.c @@ -284,7 +284,7 @@ int line_setup_irq(int fd, int input, int output, struct line *line, void *data) if (err) return err; if (output) - err = um_request_irq(driver->write_irq, fd, IRQ_WRITE, + err = um_request_irq(driver->write_irq, fd, IRQ_NONE, line_write_interrupt, IRQF_SHARED, driver->write_irq_name, data); return err; diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c index 37c51a6be690..778a0e52d5a5 100644 --- a/arch/um/drivers/random.c +++ b/arch/um/drivers/random.c @@ -13,6 +13,7 @@ #include <linux/miscdevice.h> #include <linux/delay.h> #include <linux/uaccess.h> +#include <init.h> #include <irq_kern.h> #include <os.h> @@ -154,7 +155,14 @@ static int __init rng_init (void) /* * rng_cleanup - shutdown RNG module */ -static void __exit rng_cleanup (void) + +static void cleanup(void) +{ + free_irq_by_fd(random_fd); + os_close_file(random_fd); +} + +static void __exit rng_cleanup(void) { os_close_file(random_fd); misc_deregister (&rng_miscdev); @@ -162,6 +170,7 @@ static void __exit rng_cleanup (void) module_init (rng_init); module_exit (rng_cleanup); +__uml_exitcall(cleanup); MODULE_DESCRIPTION("UML Host Random Number Generator (RNG) driver"); MODULE_LICENSE("GPL"); diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c index b55fe9bf5d3e..d4e8c497ae86 100644 --- a/arch/um/drivers/ubd_kern.c +++ b/arch/um/drivers/ubd_kern.c @@ -1587,11 +1587,11 @@ int io_thread(void *arg) do { res = os_write_file(kernel_fd, ((char *) io_req_buffer) + written, n); - if (res > 0) { + if (res >= 0) { written += res; } else { if (res != -EAGAIN) { - printk("io_thread - read failed, fd = %d, " + printk("io_thread - write failed, fd = %d, " "err = %d\n", kernel_fd, -n); } } diff --git a/arch/um/include/shared/irq_user.h b/arch/um/include/shared/irq_user.h index df5633053957..a7a6120f19d5 100644 --- a/arch/um/include/shared/irq_user.h +++ b/arch/um/include/shared/irq_user.h @@ -7,6 +7,7 @@ #define __IRQ_USER_H__ #include <sysdep/ptrace.h> +#include <stdbool.h> struct irq_fd { struct irq_fd *next; @@ -15,10 +16,17 @@ struct irq_fd { int type; int irq; int events; - int current_events; + bool active; + bool pending; + bool purge; }; -enum { IRQ_READ, IRQ_WRITE }; +#define IRQ_READ 0 +#define IRQ_WRITE 1 +#define IRQ_NONE 2 +#define MAX_IRQ_TYPE (IRQ_NONE + 1) + + struct siginfo; extern void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs); diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index 574e03fc7ba2..edbe6563d5c5 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -290,15 +290,16 @@ extern void halt_skas(void); extern void reboot_skas(void); /* irq.c */ -extern int os_waiting_for_events(struct irq_fd *active_fds); -extern int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds); -extern void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2); -extern void os_free_irq_later(struct irq_fd *active_fds, - int irq, void *dev_id); -extern int os_get_pollfd(int i); -extern void os_set_pollfd(int i, int fd); +extern int os_waiting_for_events_epoll(void); +extern void *os_epoll_get_data_pointer(int index); +extern int os_epoll_triggered(int index, int events); +extern int os_event_mask(int irq_type); +extern int os_setup_epoll(void); +extern int os_add_epoll_fd(int events, int fd, void *data); +extern int os_mod_epoll_fd(int events, int fd, void *data); +extern int os_del_epoll_fd(int fd); extern void os_set_ioignore(void); +extern void os_close_epoll_fd(void); /* sigio.c */ extern int add_sigio_fd(int fd); diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c index 23cb9350d47e..980148d56537 100644 --- a/arch/um/kernel/irq.c +++ b/arch/um/kernel/irq.c @@ -1,4 +1,6 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL * Derived (i.e. mostly copied) from arch/i386/kernel/irq.c: @@ -16,243 +18,361 @@ #include <as-layout.h> #include <kern_util.h> #include <os.h> +#include <irq_user.h> -/* - * This list is accessed under irq_lock, except in sigio_handler, - * where it is safe from being modified. IRQ handlers won't change it - - * if an IRQ source has vanished, it will be freed by free_irqs just - * before returning from sigio_handler. That will process a separate - * list of irqs to free, with its own locking, coming back here to - * remove list elements, taking the irq_lock to do so. + +/* When epoll triggers we do not know why it did so + * we can also have different IRQs for read and write. + * This is why we keep a small irq_fd array for each fd - + * one entry per IRQ type */ -static struct irq_fd *active_fds = NULL; -static struct irq_fd **last_irq_ptr = &active_fds; -extern void free_irqs(void); +struct irq_entry { + struct irq_entry *next; + int fd; + struct irq_fd *irq_array[MAX_IRQ_TYPE + 1]; +}; + +static struct irq_entry *active_fds; + +static DEFINE_SPINLOCK(irq_lock); + +static void irq_io_loop(struct irq_fd *irq, struct uml_pt_regs *regs) +{ +/* + * irq->active guards against reentry + * irq->pending accumulates pending requests + * if pending is raised the irq_handler is re-run + * until pending is cleared + */ + if (irq->active) { + irq->active = false; + do { + irq->pending = false; + do_IRQ(irq->irq, regs); + } while (irq->pending && (!irq->purge)); + if (!irq->purge) + irq->active = true; + } else { + irq->pending = true; + } +} void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs) { - struct irq_fd *irq_fd; - int n; + struct irq_entry *irq_entry; + struct irq_fd *irq; + + int n, i, j; while (1) { - n = os_waiting_for_events(active_fds); + /* This is now lockless - epoll keeps back-referencesto the irqs + * which have trigger it so there is no need to walk the irq + * list and lock it every time. We avoid locking by turning off + * IO for a specific fd by executing os_del_epoll_fd(fd) before + * we do any changes to the actual data structures + */ + n = os_waiting_for_events_epoll(); + if (n <= 0) { if (n == -EINTR) continue; - else break; + else + break; } - for (irq_fd = active_fds; irq_fd != NULL; - irq_fd = irq_fd->next) { - if (irq_fd->current_events != 0) { - irq_fd->current_events = 0; - do_IRQ(irq_fd->irq, regs); + for (i = 0; i < n ; i++) { + /* Epoll back reference is the entry with 3 irq_fd + * leaves - one for each irq type. + */ + irq_entry = (struct irq_entry *) + os_epoll_get_data_pointer(i); + for (j = 0; j < MAX_IRQ_TYPE ; j++) { + irq = irq_entry->irq_array[j]; + if (irq == NULL) + continue; + if (os_epoll_triggered(i, irq->events) > 0) + irq_io_loop(irq, regs); + if (irq->purge) { + irq_entry->irq_array[j] = NULL; + kfree(irq); + } } } } +} + +static int assign_epoll_events_to_irq(struct irq_entry *irq_entry) +{ + int i; + int events = 0; + struct irq_fd *irq; - free_irqs(); + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + irq = irq_entry->irq_array[i]; + if (irq != NULL) + events = irq->events | events; + } + if (events > 0) { + /* os_add_epoll will call os_mod_epoll if this already exists */ + return os_add_epoll_fd(events, irq_entry->fd, irq_entry); + } + /* No events - delete */ + return os_del_epoll_fd(irq_entry->fd); } -static DEFINE_SPINLOCK(irq_lock); + static int activate_fd(int irq, int fd, int type, void *dev_id) { - struct pollfd *tmp_pfd; - struct irq_fd *new_fd, *irq_fd; + struct irq_fd *new_fd; + struct irq_entry *irq_entry; + int i, err, events; unsigned long flags; - int events, err, n; err = os_set_fd_async(fd); if (err < 0) goto out; - err = -ENOMEM; - new_fd = kmalloc(sizeof(struct irq_fd), GFP_KERNEL); - if (new_fd == NULL) - goto out; + spin_lock_irqsave(&irq_lock, flags); - if (type == IRQ_READ) - events = UM_POLLIN | UM_POLLPRI; - else events = UM_POLLOUT; - *new_fd = ((struct irq_fd) { .next = NULL, - .id = dev_id, - .fd = fd, - .type = type, - .irq = irq, - .events = events, - .current_events = 0 } ); + /* Check if we have an entry for this fd */ err = -EBUSY; - spin_lock_irqsave(&irq_lock, flags); - for (irq_fd = active_fds; irq_fd != NULL; irq_fd = irq_fd->next) { - if ((irq_fd->fd == fd) && (irq_fd->type == type)) { - printk(KERN_ERR "Registering fd %d twice\n", fd); - printk(KERN_ERR "Irqs : %d, %d\n", irq_fd->irq, irq); - printk(KERN_ERR "Ids : 0x%p, 0x%p\n", irq_fd->id, - dev_id); + for (irq_entry = active_fds; + irq_entry != NULL; irq_entry = irq_entry->next) { + if (irq_entry->fd == fd) + break; + } + + if (irq_entry == NULL) { + /* This needs to be atomic as it may be called from an + * IRQ context. + */ + irq_entry = kmalloc(sizeof(struct irq_entry), GFP_ATOMIC); + if (irq_entry == NULL) { + printk(KERN_ERR + "Failed to allocate new IRQ entry\n"); goto out_unlock; } + irq_entry->fd = fd; + for (i = 0; i < MAX_IRQ_TYPE; i++) + irq_entry->irq_array[i] = NULL; + irq_entry->next = active_fds; + active_fds = irq_entry; } - if (type == IRQ_WRITE) - fd = -1; - - tmp_pfd = NULL; - n = 0; + /* Check if we are trying to re-register an interrupt for a + * particular fd + */ - while (1) { - n = os_create_pollfd(fd, events, tmp_pfd, n); - if (n == 0) - break; + if (irq_entry->irq_array[type] != NULL) { + printk(KERN_ERR + "Trying to reregister IRQ %d FD %d TYPE %d ID %p\n", + irq, fd, type, dev_id + ); + goto out_unlock; + } else { + /* New entry for this fd */ + + err = -ENOMEM; + new_fd = kmalloc(sizeof(struct irq_fd), GFP_ATOMIC); + if (new_fd == NULL) + goto out_unlock; - /* - * n > 0 - * It means we couldn't put new pollfd to current pollfds - * and tmp_fds is NULL or too small for new pollfds array. - * Needed size is equal to n as minimum. - * - * Here we have to drop the lock in order to call - * kmalloc, which might sleep. - * If something else came in and changed the pollfds array - * so we will not be able to put new pollfd struct to pollfds - * then we free the buffer tmp_fds and try again. + events = os_event_mask(type); + + *new_fd = ((struct irq_fd) { + .id = dev_id, + .irq = irq, + .type = type, + .events = events, + .active = true, + .pending = false, + .purge = false + }); + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop */ - spin_unlock_irqrestore(&irq_lock, flags); - kfree(tmp_pfd); - - tmp_pfd = kmalloc(n, GFP_KERNEL); - if (tmp_pfd == NULL) - goto out_kfree; - - spin_lock_irqsave(&irq_lock, flags); + os_del_epoll_fd(irq_entry->fd); + irq_entry->irq_array[type] = new_fd; } - *last_irq_ptr = new_fd; - last_irq_ptr = &new_fd->next; - + /* Turn back IO on with the correct (new) IO event mask */ + assign_epoll_events_to_irq(irq_entry); spin_unlock_irqrestore(&irq_lock, flags); - - /* - * This calls activate_fd, so it has to be outside the critical - * section. - */ - maybe_sigio_broken(fd, (type == IRQ_READ)); + maybe_sigio_broken(fd, (type != IRQ_NONE)); return 0; - - out_unlock: +out_unlock: spin_unlock_irqrestore(&irq_lock, flags); - out_kfree: - kfree(new_fd); - out: +out: return err; } -static void free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg) +/* + * Walk the IRQ list and dispose of any unused entries. + * Should be done under irq_lock. + */ + +static void garbage_collect_irq_entries(void) { - unsigned long flags; + int i; + bool reap; + struct irq_entry *walk; + struct irq_entry *previous = NULL; + struct irq_entry *to_free; - spin_lock_irqsave(&irq_lock, flags); - os_free_irq_by_cb(test, arg, active_fds, &last_irq_ptr); - spin_unlock_irqrestore(&irq_lock, flags); + if (active_fds == NULL) + return; + walk = active_fds; + while (walk != NULL) { + reap = true; + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (walk->irq_array[i] != NULL) { + reap = false; + break; + } + } + if (reap) { + if (previous == NULL) + active_fds = walk->next; + else + previous->next = walk->next; + to_free = walk; + } else { + to_free = NULL; + } + walk = walk->next; + if (to_free != NULL) + kfree(to_free); + } } -struct irq_and_dev { - int irq; - void *dev; -}; +/* + * Walk the IRQ list and get the descriptor for our FD + */ -static int same_irq_and_dev(struct irq_fd *irq, void *d) +static struct irq_entry *get_irq_entry_by_fd(int fd) { - struct irq_and_dev *data = d; + struct irq_entry *walk = active_fds; - return ((irq->irq == data->irq) && (irq->id == data->dev)); + while (walk != NULL) { + if (walk->fd == fd) + return walk; + walk = walk->next; + } + return NULL; } -static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) -{ - struct irq_and_dev data = ((struct irq_and_dev) { .irq = irq, - .dev = dev }); - free_irq_by_cb(same_irq_and_dev, &data); -} +/* + * Walk the IRQ list and dispose of an entry for a specific + * device, fd and number. Note - if sharing an IRQ for read + * and writefor the same FD it will be disposed in either case. + * If this behaviour is undesirable use different IRQ ids. + */ -static int same_fd(struct irq_fd *irq, void *fd) -{ - return (irq->fd == *((int *)fd)); -} +#define IGNORE_IRQ 1 +#define IGNORE_DEV (1<<1) -void free_irq_by_fd(int fd) +static void do_free_by_irq_and_dev( + struct irq_entry *irq_entry, + unsigned int irq, + void *dev, + int flags +) { - free_irq_by_cb(same_fd, &fd); + int i; + struct irq_fd *to_free; + + for (i = 0; i < MAX_IRQ_TYPE ; i++) { + if (irq_entry->irq_array[i] != NULL) { + if ( + ((flags & IGNORE_IRQ) || + (irq_entry->irq_array[i]->irq == irq)) && + ((flags & IGNORE_DEV) || + (irq_entry->irq_array[i]->id == dev)) + ) { + /* Turn off any IO on this fd - allows us to + * avoid locking the IRQ loop + */ + os_del_epoll_fd(irq_entry->fd); + to_free = irq_entry->irq_array[i]; + irq_entry->irq_array[i] = NULL; + assign_epoll_events_to_irq(irq_entry); + if (to_free->active) + to_free->purge = true; + else + kfree(to_free); + } + } + } } -/* Must be called with irq_lock held */ -static struct irq_fd *find_irq_by_fd(int fd, int irqnum, int *index_out) +void free_irq_by_fd(int fd) { - struct irq_fd *irq; - int i = 0; - int fdi; + struct irq_entry *to_free; + unsigned long flags; - for (irq = active_fds; irq != NULL; irq = irq->next) { - if ((irq->fd == fd) && (irq->irq == irqnum)) - break; - i++; - } - if (irq == NULL) { - printk(KERN_ERR "find_irq_by_fd doesn't have descriptor %d\n", - fd); - goto out; - } - fdi = os_get_pollfd(i); - if ((fdi != -1) && (fdi != fd)) { - printk(KERN_ERR "find_irq_by_fd - mismatch between active_fds " - "and pollfds, fd %d vs %d, need %d\n", irq->fd, - fdi, fd); - irq = NULL; - goto out; + spin_lock_irqsave(&irq_lock, flags); + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); } - *index_out = i; - out: - return irq; + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); } -void reactivate_fd(int fd, int irqnum) +static void free_irq_by_irq_and_dev(unsigned int irq, void *dev) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irq, + dev, + 0 + ); + to_free = to_free->next; } - os_set_pollfd(i, irq->fd); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); +} - add_sigio_fd(fd); + +void reactivate_fd(int fd, int irqnum) +{ + /** NOP - we do auto-EOI now **/ } void deactivate_fd(int fd, int irqnum) { - struct irq_fd *irq; + struct irq_entry *to_free; unsigned long flags; - int i; + os_del_epoll_fd(fd); spin_lock_irqsave(&irq_lock, flags); - irq = find_irq_by_fd(fd, irqnum, &i); - if (irq == NULL) { - spin_unlock_irqrestore(&irq_lock, flags); - return; + to_free = get_irq_entry_by_fd(fd); + if (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + irqnum, + NULL, + IGNORE_DEV + ); } - - os_set_pollfd(i, -1); + garbage_collect_irq_entries(); spin_unlock_irqrestore(&irq_lock, flags); - ignore_sigio_fd(fd); } EXPORT_SYMBOL(deactivate_fd); @@ -265,17 +385,28 @@ EXPORT_SYMBOL(deactivate_fd); */ int deactivate_all_fds(void) { - struct irq_fd *irq; - int err; + unsigned long flags; + struct irq_entry *to_free; - for (irq = active_fds; irq != NULL; irq = irq->next) { - err = os_clear_fd_async(irq->fd); - if (err) - return err; - } - /* If there is a signal already queued, after unblocking ignore it */ + spin_lock_irqsave(&irq_lock, flags); + /* Stop IO. The IRQ loop has no lock so this is our + * only way of making sure we are safe to dispose + * of all IRQ handlers + */ os_set_ioignore(); - + to_free = active_fds; + while (to_free != NULL) { + do_free_by_irq_and_dev( + to_free, + -1, + NULL, + IGNORE_IRQ | IGNORE_DEV + ); + to_free = to_free->next; + } + garbage_collect_irq_entries(); + spin_unlock_irqrestore(&irq_lock, flags); + os_close_epoll_fd(); return 0; } @@ -353,8 +484,11 @@ void __init init_IRQ(void) irq_set_chip_and_handler(TIMER_IRQ, &SIGVTALRM_irq_type, handle_edge_irq); + for (i = 1; i < NR_IRQS; i++) irq_set_chip_and_handler(i, &normal_irq_type, handle_edge_irq); + /* Initialize EPOLL Loop */ + os_setup_epoll(); } /* diff --git a/arch/um/os-Linux/irq.c b/arch/um/os-Linux/irq.c index b9afb74b79ad..365823010346 100644 --- a/arch/um/os-Linux/irq.c +++ b/arch/um/os-Linux/irq.c @@ -1,135 +1,147 @@ /* + * Copyright (C) 2017 - Cambridge Greys Ltd + * Copyright (C) 2011 - 2014 Cisco Systems Inc * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL */ #include <stdlib.h> #include <errno.h> -#include <poll.h> +#include <sys/epoll.h> #include <signal.h> #include <string.h> #include <irq_user.h> #include <os.h> #include <um_malloc.h> +/* Epoll support */ + +static int epollfd = -1; + +#define MAX_EPOLL_EVENTS 64 + +static struct epoll_event epoll_events[MAX_EPOLL_EVENTS]; + +/* Helper to return an Epoll data pointer from an epoll event structure. + * We need to keep this one on the userspace side to keep includes separate + */ + +void *os_epoll_get_data_pointer(int index) +{ + return epoll_events[index].data.ptr; +} + +/* Helper to compare events versus the events in the epoll structure. + * Same as above - needs to be on the userspace side + */ + + +int os_epoll_triggered(int index, int events) +{ + return epoll_events[index].events & events; +} +/* Helper to set the event mask. + * The event mask is opaque to the kernel side, because it does not have + * access to the right includes/defines for EPOLL constants. + */ + +int os_event_mask(int irq_type) +{ + if (irq_type == IRQ_READ) + return EPOLLIN | EPOLLPRI; + if (irq_type == IRQ_WRITE) + return EPOLLOUT; + return 0; +} + /* - * Locked by irq_lock in arch/um/kernel/irq.c. Changed by os_create_pollfd - * and os_free_irq_by_cb, which are called under irq_lock. + * Initial Epoll Setup */ -static struct pollfd *pollfds = NULL; -static int pollfds_num = 0; -static int pollfds_size = 0; +int os_setup_epoll(void) +{ + epollfd = epoll_create(MAX_EPOLL_EVENTS); + return epollfd; +} -int os_waiting_for_events(struct irq_fd *active_fds) +/* + * Helper to run the actual epoll_wait + */ +int os_waiting_for_events_epoll(void) { - struct irq_fd *irq_fd; - int i, n, err; + int n, err; - n = poll(pollfds, pollfds_num, 0); + n = epoll_wait(epollfd, + (struct epoll_event *) &epoll_events, MAX_EPOLL_EVENTS, 0); if (n < 0) { err = -errno; if (errno != EINTR) - printk(UM_KERN_ERR "os_waiting_for_events:" - " poll returned %d, errno = %d\n", n, errno); + printk( + UM_KERN_ERR "os_waiting_for_events:" + " epoll returned %d, error = %s\n", n, + strerror(errno) + ); return err; } - - if (n == 0) - return 0; - - irq_fd = active_fds; - - for (i = 0; i < pollfds_num; i++) { - if (pollfds[i].revents != 0) { - irq_fd->current_events = pollfds[i].revents; - pollfds[i].fd = -1; - } - irq_fd = irq_fd->next; - } return n; } -int os_create_pollfd(int fd, int events, void *tmp_pfd, int size_tmpfds) -{ - if (pollfds_num == pollfds_size) { - if (size_tmpfds <= pollfds_size * sizeof(pollfds[0])) { - /* return min size needed for new pollfds area */ - return (pollfds_size + 1) * sizeof(pollfds[0]); - } - - if (pollfds != NULL) { - memcpy(tmp_pfd, pollfds, - sizeof(pollfds[0]) * pollfds_size); - /* remove old pollfds */ - kfree(pollfds); - } - pollfds = tmp_pfd; - pollfds_size++; - } else - kfree(tmp_pfd); /* remove not used tmp_pfd */ - - pollfds[pollfds_num] = ((struct pollfd) { .fd = fd, - .events = events, - .revents = 0 }); - pollfds_num++; - - return 0; -} -void os_free_irq_by_cb(int (*test)(struct irq_fd *, void *), void *arg, - struct irq_fd *active_fds, struct irq_fd ***last_irq_ptr2) +/* + * Helper to add a fd to epoll + */ +int os_add_epoll_fd(int events, int fd, void *data) { - struct irq_fd **prev; - int i = 0; - - prev = &active_fds; - while (*prev != NULL) { - if ((*test)(*prev, arg)) { - struct irq_fd *old_fd = *prev; - if ((pollfds[i].fd != -1) && - (pollfds[i].fd != (*prev)->fd)) { - printk(UM_KERN_ERR "os_free_irq_by_cb - " - "mismatch between active_fds and " - "pollfds, fd %d vs %d\n", - (*prev)->fd, pollfds[i].fd); - goto out; - } - - pollfds_num--; - - /* - * This moves the *whole* array after pollfds[i] - * (though it doesn't spot as such)! - */ - memmove(&pollfds[i], &pollfds[i + 1], - (pollfds_num - i) * sizeof(pollfds[0])); - if (*last_irq_ptr2 == &old_fd->next) - *last_irq_ptr2 = prev; - - *prev = (*prev)->next; - if (old_fd->type == IRQ_WRITE) - ignore_sigio_fd(old_fd->fd); - kfree(old_fd); - continue; - } - prev = &(*prev)->next; - i++; - } - out: - return; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events | EPOLLET; + result = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &event); + if ((result) && (errno == EEXIST)) + result = os_mod_epoll_fd(events, fd, data); + if (result) + printk("epollctl add err fd %d, %s\n", fd, strerror(errno)); + return result; } -int os_get_pollfd(int i) +/* + * Helper to mod the fd event mask and/or data backreference + */ +int os_mod_epoll_fd(int events, int fd, void *data) { - return pollfds[i].fd; + struct epoll_event event; + int result; + + event.data.ptr = data; + event.events = events; + result = epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &event); + if (result) + printk(UM_KERN_ERR + "epollctl mod err fd %d, %s\n", fd, strerror(errno)); + return result; } -void os_set_pollfd(int i, int fd) +/* + * Helper to delete the epoll fd + */ +int os_del_epoll_fd(int fd) { - pollfds[i].fd = fd; + struct epoll_event event; + int result; + /* This is quiet as we use this as IO ON/OFF - so it is often + * invoked on a non-existent fd + */ + result = epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, &event); + return result; } void os_set_ioignore(void) { signal(SIGIO, SIG_IGN); } + +void os_close_epoll_fd(void) +{ + /* Needed so we do not leak an fd when rebooting */ + os_close_file(epollfd); +} -- 2.11.0 |
From: <ant...@ko...> - 2017-09-10 07:20:27
|
From: Anton Ivanov <ant...@ca...> 1. Provides infrastructure for vector IO using recvmmsg/sendmmsg. 1.1. Multi-message read. 1.2. Multi-message write. 1.3. Optimized queue support for multi-packet enqueue/dequeue. 1.4. BQL/DQL support. 2. Implements transports for several transports as well support for direct wiring of PWEs to NIC. Allows direct connection of VMs to host, other VMs and network devices with no switch in use. 2.1. Raw socket >4 times faster than existing pcap based transport 2.2. Tap transport using socket RX and tap xmit. >3 times faster RX than existing tap. 2.3. GRE transport - direct wiring to GRE PWE, >3 times faster RX than any existing transport. 2.4. L2TPv3 transport - direct wiring to L2TPv3 PWE, > 3 times faster RX than any existing transport. 2.5. TX in all cases shows only minor improvement, but consumes significantly less CPU under load. 3. Tuning and performance related information via ethtool. 4. Rudimentary BPF support - used in tap only to avoid software looping 5. Scatter Gather support. 6. VNET and checksum offload support for raw socket transport. Signed-off-by: Anton Ivanov <ant...@ca...> --- arch/um/Kconfig.net | 11 + arch/um/drivers/Makefile | 4 +- arch/um/drivers/net_kern.c | 4 +- arch/um/drivers/vector_kern.c | 1516 +++++++++++++++++++++++++++++++++++ arch/um/drivers/vector_kern.h | 128 +++ arch/um/drivers/vector_transports.c | 406 ++++++++++ arch/um/drivers/vector_user.c | 512 ++++++++++++ arch/um/drivers/vector_user.h | 89 ++ arch/um/include/asm/irq.h | 12 + arch/um/include/shared/net_kern.h | 2 + 10 files changed, 2681 insertions(+), 3 deletions(-) create mode 100644 arch/um/drivers/vector_kern.c create mode 100644 arch/um/drivers/vector_kern.h create mode 100644 arch/um/drivers/vector_transports.c create mode 100644 arch/um/drivers/vector_user.c create mode 100644 arch/um/drivers/vector_user.h diff --git a/arch/um/Kconfig.net b/arch/um/Kconfig.net index 820a56f00332..0516dc76e6aa 100644 --- a/arch/um/Kconfig.net +++ b/arch/um/Kconfig.net @@ -108,6 +108,17 @@ config UML_NET_DAEMON more than one without conflict. If you don't need UML networking, say N. +config UML_NET_VECTOR + bool "Vector I/O high performance network devices" + depends on UML_NET + help + This User-Mode Linux network driver uses multi-message send + and receive functions. The host running the UML guest must have + a linux kernel version above 3.0 and a libc version > 2.13. + This driver provides tap, raw, gre and l2tpv3 network transports + with up to 4 times higher network throughput than the UML network + drivers. + config UML_NET_VDE bool "VDE transport" depends on UML_NET diff --git a/arch/um/drivers/Makefile b/arch/um/drivers/Makefile index e7582e1d248c..16b3cebddafb 100644 --- a/arch/um/drivers/Makefile +++ b/arch/um/drivers/Makefile @@ -9,6 +9,7 @@ slip-objs := slip_kern.o slip_user.o slirp-objs := slirp_kern.o slirp_user.o daemon-objs := daemon_kern.o daemon_user.o +vector-objs := vector_kern.o vector_user.o vector_transports.o umcast-objs := umcast_kern.o umcast_user.o net-objs := net_kern.o net_user.o mconsole-objs := mconsole_kern.o mconsole_user.o @@ -43,6 +44,7 @@ obj-$(CONFIG_STDERR_CONSOLE) += stderr_console.o obj-$(CONFIG_UML_NET_SLIP) += slip.o slip_common.o obj-$(CONFIG_UML_NET_SLIRP) += slirp.o slip_common.o obj-$(CONFIG_UML_NET_DAEMON) += daemon.o +obj-$(CONFIG_UML_NET_VECTOR) += vector.o obj-$(CONFIG_UML_NET_VDE) += vde.o obj-$(CONFIG_UML_NET_MCAST) += umcast.o obj-$(CONFIG_UML_NET_PCAP) += pcap.o @@ -61,7 +63,7 @@ obj-$(CONFIG_BLK_DEV_COW_COMMON) += cow_user.o obj-$(CONFIG_UML_RANDOM) += random.o # pcap_user.o must be added explicitly. -USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o +USER_OBJS := fd.o null.o pty.o tty.o xterm.o slip_common.o pcap_user.o vde_user.o vector_user.o CFLAGS_null.o = -DDEV_NULL=$(DEV_NULL_PATH) include arch/um/scripts/Makefile.rules diff --git a/arch/um/drivers/net_kern.c b/arch/um/drivers/net_kern.c index 1669240c7a25..03cc5857a3b6 100644 --- a/arch/um/drivers/net_kern.c +++ b/arch/um/drivers/net_kern.c @@ -288,7 +288,7 @@ static void uml_net_user_timer_expire(unsigned long _conn) #endif } -static void setup_etheraddr(struct net_device *dev, char *str) +void uml_net_setup_etheraddr(struct net_device *dev, char *str) { unsigned char *addr = dev->dev_addr; char *end; @@ -412,7 +412,7 @@ static void eth_configure(int n, void *init, char *mac, */ snprintf(dev->name, sizeof(dev->name), "eth%d", n); - setup_etheraddr(dev, mac); + uml_net_setup_etheraddr(dev, mac); printk(KERN_INFO "Netdevice %d (%pM) : ", n, dev->dev_addr); diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c new file mode 100644 index 000000000000..0899189a9e61 --- /dev/null +++ b/arch/um/drivers/vector_kern.c @@ -0,0 +1,1516 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Copyright (C) 2001 Lennert Buytenhek (bu...@gn...) and + * James Leu (jl...@mi...). + * Copyright (C) 2001 by various other people who didn't put their name here. + * Licensed under the GPL. + */ + +#include <linux/version.h> +#include <linux/bootmem.h> +#include <linux/etherdevice.h> +#include <linux/ethtool.h> +#include <linux/inetdevice.h> +#include <linux/init.h> +#include <linux/list.h> +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/rtnetlink.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <linux/interrupt.h> +#include <init.h> +#include <irq_kern.h> +#include <irq_user.h> +#include <net_kern.h> +#include <os.h> +#include "mconsole_kern.h" +#include "vector_user.h" +#include "vector_kern.h" + +/* + * Adapted from network devices with the following major changes: + * All transports are static - simplifies the code significantly + * Multiple FDs/IRQs per device + * Vector IO optionally used for read/write, falling back to legacy + * based on configuration and/or availability + * Configuration is no longer positional - L2TPv3 and GRE require up to + * 10 parameters, passing this as positional is not fit for purpose. + * Only socket transports are supported + */ + + +#define DRIVER_NAME "uml-vector" +#define DRIVER_VERSION "01" +struct vector_cmd_line_arg { + struct list_head list; + int unit; + char *arguments; +}; + +struct vector_device { + struct list_head list; + struct net_device *dev; + struct platform_device pdev; + int unit; + int opened; +}; + +static LIST_HEAD(vec_cmd_line); + +static DEFINE_SPINLOCK(vector_devices_lock); +static LIST_HEAD(vector_devices); + +static int driver_registered; + +static void vector_eth_configure(int n, struct arglist *def); + +/* Argument accessors to set variables (and/or set default values) + * mtu, buffer sizing, default headroom, etc + */ + +#define DEFAULT_HEADROOM 2 +#define SAFETY_MARGIN 32 +#define DEFAULT_VECTOR_SIZE 64 +#define TX_SMALL_PACKET 128 +#define MAX_IOV_SIZE 8 + +static const struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { "rx_queue_max" }, + { "rx_queue_running_average" }, + { "tx_queue_max" }, + { "tx_queue_running_average" }, + { "rx_encaps_errors" }, + { "tx_timeout_count" }, + { "tx_restart_queue" }, + { "tx_kicks" }, + { "tx_flow_control_xon" }, + { "tx_flow_control_xoff" }, + { "rx_csum_offload_good" }, + { "rx_csum_offload_errors"}, + { "sg_ok"}, + { "sg_linearized"}, +}; + +#define VECTOR_NUM_STATS ARRAY_SIZE(ethtool_stats_keys) + +/* Used in the 4.4 OpenWRT backport */ + +#if LINUX_VERSION_CODE <= KERNEL_VERSION(4, 7, 0) +static inline void netif_trans_update(struct net_device *dev) +{ + struct netdev_queue *txq = netdev_get_tx_queue(dev, 0); + + if (txq->trans_start != jiffies) + txq->trans_start = jiffies; +} +#endif + +static void vector_reset_stats(struct vector_private *vp) +{ + vp->estats.rx_queue_max = 0; + vp->estats.rx_queue_running_average = 0; + vp->estats.tx_queue_max = 0; + vp->estats.tx_queue_running_average = 0; + vp->estats.rx_encaps_errors = 0; + vp->estats.tx_timeout_count = 0; + vp->estats.tx_restart_queue = 0; + vp->estats.tx_kicks = 0; + vp->estats.tx_flow_control_xon = 0; + vp->estats.tx_flow_control_xoff = 0; + vp->estats.sg_ok = 0; + vp->estats.sg_linearized = 0; +} + +static int get_mtu(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "mtu"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return ETH_MAX_PACKET; +} + +static int get_depth(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "depth"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_VECTOR_SIZE; +} + +static int get_headroom(struct arglist *def) +{ + char *mtu = uml_vector_fetch_arg(def, "headroom"); + long result; + + if (mtu != NULL) { + if (kstrtoul(mtu, 10, &result) == 0) + return result; + } + return DEFAULT_HEADROOM; +} + +static int get_transport_options(struct arglist *def) +{ + char *transport = uml_vector_fetch_arg(def, "transport"); + + if (strncmp(transport, TRANS_TAP, TRANS_TAP_LEN) == 0) + return (VECTOR_RX | VECTOR_BPF); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return (VECTOR_TX | VECTOR_RX | VECTOR_BPF); + return (VECTOR_TX | VECTOR_RX); +} + + +/* A mini-buffer for packet drop read + * All of our supported transports are datagram oriented and we always + * read using recvmsg or recvmmsg. If we pass a buffer which is smaller + * than the packet size it still counts as full packet read and will + * clean the incoming stream to keep sigio/epoll happy + */ + +#define DROP_BUFFER_SIZE 32 + +static char *drop_buffer; + +/* Array backed queues optimized for bulk enqueue/dequeue and + * 1:N (small values of N) or 1:1 enqueuer/dequeuer ratios. + * For more details and full design rationale see + * http://foswiki.cambridgegreys.com/Main/EatYourTailAndEnjoyIt + */ + + +/* + * Advance the mmsg queue head by n = advance. Resets the queue to + * maximum enqueue/dequeue-at-once capacity if possible. Called by + * dequeuers. Caller must hold the head_lock! + */ + +static int vector_advancehead(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->head = + (qi->head + advance) + % qi->max_depth; + + + spin_lock(&qi->tail_lock); + qi->queue_depth -= advance; + + /* we are at 0, use this to + * reset head and tail so we can use max size vectors + */ + + if (qi->queue_depth == 0) { + qi->head = 0; + qi->tail = 0; + } + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +/* Advance the queue tail by n = advance. + * This is called by enqueuers which should hold the + * head lock already + */ + +static int vector_advancetail(struct vector_queue *qi, int advance) +{ + int queue_depth; + + qi->tail = + (qi->tail + advance) + % qi->max_depth; + spin_lock(&qi->head_lock); + qi->queue_depth += advance; + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + return queue_depth; +} + +/* + * Generic vector enqueue with support for forming headers using transport + * specific callback. Allows GRE, L2TPv3, RAW and other transports + * to use a common enqueue procedure in vector mode + */ + +static int vector_enqueue(struct vector_queue *qi, struct sk_buff *skb) +{ + struct vector_private *vp = netdev_priv(qi->dev); + int queue_depth; + int nr_frags, frag, packet_len; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + struct iovec *iov; + skb_frag_t *skb_frag; + + spin_lock(&qi->tail_lock); + spin_lock(&qi->head_lock); + queue_depth = qi->queue_depth; + spin_unlock(&qi->head_lock); + + if (skb) + packet_len = skb->len; + + if (queue_depth < qi->max_depth) { + + *(qi->skbuff_vector + qi->tail) = skb; + mmsg_vector += qi->tail; + iov = mmsg_vector->msg_hdr.msg_iov; + + mmsg_vector->msg_hdr.msg_name = vp->fds->remote_addr; + mmsg_vector->msg_hdr.msg_namelen = vp->fds->remote_addr_size; + nr_frags = skb_shinfo(skb)->nr_frags; + if (vp->header_size > 0) { + vp->form_header(iov->iov_base, skb, vp); + iov++; + mmsg_vector->msg_hdr.msg_iovlen = 2 + nr_frags; + } else + mmsg_vector->msg_hdr.msg_iovlen = 1 + nr_frags; + if (nr_frags > qi->max_iov_frags) { + if (skb_linearize(skb) != 0) + goto drop; + else + vp->estats.sg_linearized++; + } else + vp->estats.sg_ok++; + iov->iov_base = skb->data; + if (nr_frags > 0) + iov->iov_len = skb->len - skb->data_len; + else + iov->iov_len = skb->len; + for (frag = 0; frag < nr_frags; frag++) { + iov++; + skb_frag = &skb_shinfo(skb)->frags[frag]; + iov->iov_base = skb_frag_address_safe(skb_frag); + iov->iov_len = skb_frag_size(skb_frag); + } + queue_depth = vector_advancetail(qi, 1); + } else + goto drop; + spin_unlock(&qi->tail_lock); + return queue_depth; +drop: + qi->dev->stats.tx_dropped++; + if (skb != NULL) { + dev_consume_skb_any(skb); + netdev_completed_queue(qi->dev, 1, packet_len); + } + spin_unlock(&qi->tail_lock); + return queue_depth; +} + +static int consume_vector_skbs(struct vector_queue *qi, int count) +{ + struct sk_buff *skb; + int skb_index; + int bytes_compl = 0; + + for (skb_index = qi->head; skb_index < qi->head + count; skb_index++) { + skb = *(qi->skbuff_vector + skb_index); + /* mark as empty to ensure correct destruction if + * needed + */ + bytes_compl += skb->len; + *(qi->skbuff_vector + skb_index) = NULL; + dev_consume_skb_any(skb); + } + qi->dev->stats.tx_bytes += bytes_compl; + qi->dev->stats.tx_packets += count; + netdev_completed_queue(qi->dev, count, bytes_compl); + return vector_advancehead(qi, count); +} + +/* + * Generic vector deque via sendmmsg with support for forming headers + * using transport specific callback. Allows GRE, L2TPv3, RAW and + * other transports to use a common dequeue procedure in vector mode + */ + + +static int vector_send(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *send_from; + int result = 0, send_len, queue_depth = qi->max_depth; + + if (spin_trylock(&qi->head_lock)) { + if (spin_trylock(&qi->tail_lock)) { + /* update queue_depth to current value */ + queue_depth = qi->queue_depth; + spin_unlock(&qi->tail_lock); + while (queue_depth > 0) { + /* Calculate the start of the vector */ + send_len = queue_depth; + send_from = qi->mmsg_vector; + send_from += qi->head; + /* Adjust vector size if wraparound */ + if (send_len + qi->head > qi->max_depth) + send_len = qi->max_depth - qi->head; + /* Try to TX as many packets as possible */ + if (send_len > 0) { + result = uml_vector_sendmmsg( + vp->fds->tx_fd, + send_from, + send_len, + 0 + ); + vp->in_write_poll = (result != send_len); + } + /* For some of the sendmmsg error scenarios + * we may end being unsure in the TX success + * for all packets. It is safer to declare + * them all TX-ed and blame the network. + */ + if (result < 0) { + printk(KERN_ERR "sendmmsg err=%i\n", + result); + result = send_len; + } + if (result > 0) { + queue_depth = consume_vector_skbs(qi, result); + /* This is equivalent to an TX IRQ. + * Restart the upper layers to feed us + * more packets. + */ + if (result > vp->estats.tx_queue_max) + vp->estats.tx_queue_max = result; + vp->estats.tx_queue_running_average = + (vp->estats.tx_queue_running_average + result) >> 1; + } + netif_trans_update(qi->dev); + netif_wake_queue(qi->dev); + /* if TX is busy, break out of the send loop, poll + * write IRQ will reschedule xmit for us + */ + if (result != send_len) { + vp->estats.tx_restart_queue++; + break; + } + } + } + spin_unlock(&qi->head_lock); + } else { + tasklet_schedule(&vp->tx_poll); + } + return queue_depth; +} + +/* Queue destructor. Deliberately stateless so we can use + * it in queue cleanup if initialization fails. + */ + +static void destroy_queue(struct vector_queue *qi) +{ + int i; + struct iovec *iov; + struct vector_private *vp = netdev_priv(qi->dev); + struct mmsghdr *mmsg_vector; + + if (qi == NULL) + return; + /* deallocate any skbuffs - we rely on any unused to be + * set to NULL. + */ + if (qi->skbuff_vector != NULL) { + for (i = 0; i < qi->max_depth; i++) { + if (*(qi->skbuff_vector + i) != NULL) + dev_kfree_skb_any(*(qi->skbuff_vector + i)); + } + kfree(qi->skbuff_vector); + } + /* deallocate matching IOV structures including header buffs */ + if (qi->mmsg_vector != NULL) { + mmsg_vector = qi->mmsg_vector; + for (i = 0; i < qi->max_depth; i++) { + iov = mmsg_vector->msg_hdr.msg_iov; + if (iov != NULL) { + if ((vp->header_size > 0) && + (iov->iov_base != NULL)) + kfree(iov->iov_base); + kfree(iov); + } + mmsg_vector++; + } + kfree(qi->mmsg_vector); + } + kfree(qi); +} + +/* + * Queue constructor. Create a queue with a given side. + */ +static struct vector_queue *create_queue( + struct vector_private *vp, + int max_size, + int header_size, + int num_extra_frags) +{ + struct vector_queue *result; + int i; + struct iovec *iov; + struct mmsghdr *mmsg_vector; + + result = kmalloc(sizeof(struct vector_queue), GFP_KERNEL); + if (result == NULL) + goto out_fail; + result->max_depth = max_size; + result->dev = vp->dev; + result->mmsg_vector = kmalloc( + (sizeof(struct mmsghdr) * max_size), GFP_KERNEL); + result->skbuff_vector = kmalloc( + (sizeof(void *) * max_size), GFP_KERNEL); + if (result->mmsg_vector == NULL || result->skbuff_vector == NULL) + goto out_fail; + + mmsg_vector = result->mmsg_vector; + for (i = 0; i < max_size; i++) { + /* Clear all pointers - we use non-NULL as marking on + * what to free on destruction + */ + *(result->skbuff_vector + i) = NULL; + mmsg_vector->msg_hdr.msg_iov = NULL; + mmsg_vector++; + } + mmsg_vector = result->mmsg_vector; + for (i = 0; i < max_size; i++) { + if (vp->header_size > 0) + iov = kmalloc(sizeof(struct iovec) * (2 + num_extra_frags), GFP_KERNEL); + else + iov = kmalloc(sizeof(struct iovec) * (1 + num_extra_frags), GFP_KERNEL); + if (iov == NULL) + goto out_fail; + mmsg_vector->msg_hdr.msg_iov = iov; + mmsg_vector->msg_hdr.msg_iovlen = 1; + mmsg_vector->msg_hdr.msg_control = NULL; + mmsg_vector->msg_hdr.msg_controllen = 0; + mmsg_vector->msg_hdr.msg_flags = 0; + mmsg_vector->msg_hdr.msg_name = NULL; + mmsg_vector->msg_hdr.msg_namelen = 0; + if (vp->header_size > 0) { + iov->iov_base = kmalloc(header_size, GFP_KERNEL); + if (iov->iov_base == NULL) + goto out_fail; + iov->iov_len = header_size; + mmsg_vector->msg_hdr.msg_iovlen = 2; + iov++; + } + iov->iov_base = NULL; + iov->iov_len = 0; + mmsg_vector++; + } + spin_lock_init(&result->head_lock); + spin_lock_init(&result->tail_lock); + result->queue_depth = 0; + result->head = 0; + result->tail = 0; + return result; +out_fail: + destroy_queue(result); + return NULL; +} + +/* + * We do not use the RX queue as a proper wraparound queue for now + * This is not necessary because the consumption via netif_rx() + * happens in-line. While we can try using the return code of + * netif_rx() for flow control there are no drivers doing this today. + * For this RX specific use we ignore the tail/head locks and + * just read into a prepared queue filled with skbuffs. + */ + +/* Prepare queue for recvmmsg one-shot rx - fill with fresh sk_buffs*/ + +static void prep_queue_for_rx(struct vector_queue *qi) +{ + struct vector_private *vp = netdev_priv(qi->dev); + struct sk_buff *skb; + struct iovec *iov; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int i; + + if (qi->queue_depth == 0) + return; + for (i = 0; i < qi->queue_depth; i++) { + /* it is OK if allocation fails - recvmmsg with NULL data in + * iov argument still performs an RX, just drops the packet + * This allows us stop faffing around with a "drop buffer" + */ + + skb = netdev_alloc_skb( + vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + iov = mmsg_vector->msg_hdr.msg_iov; + mmsg_vector->msg_len = 0; + if (vp->header_size > 0) + iov++; + if (skb != NULL) { + skb_reserve(skb, vp->headroom); + skb->dev = qi->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + skb->ip_summed = CHECKSUM_NONE; + iov->iov_base = skb->data; + iov->iov_len = vp->max_packet; + } else { + iov->iov_base = NULL; + iov->iov_len = 0; + } + *skbuff_vector = skb; + skbuff_vector++; + mmsg_vector++; + } + qi->queue_depth = 0; +} + +static struct vector_device *find_device(int n) +{ + struct vector_device *device; + struct list_head *ele; + + spin_lock(&vector_devices_lock); + list_for_each(ele, &vector_devices) { + device = list_entry(ele, struct vector_device, list); + if (device->unit == n) + goto out; + } + device = NULL; + out: + spin_unlock(&vector_devices_lock); + return device; +} + +static int vector_parse(char *str, int *index_out, char **str_out, + char **error_out) +{ + int n, len, err = -EINVAL; + char *start = str; + + len = strlen(str); + + while ((*str != ':') && (strlen(str) > 1)) + str++; + if (*str != ':') { + *error_out = "Expected ':' after device number"; + return err; + } else + *str = '\0'; + + err = kstrtouint(start, 0, &n); + if (err < 0) { + *error_out = "Bad device number"; + return err; + } + + str++; + if (find_device(n)) { + *error_out = "Device already configured"; + return err; + } + + *index_out = n; + *str_out = str; + return 0; +} + +static int vector_config(char *str, char **error_out) +{ + int err, n; + char *params; + struct arglist *parsed; + + err = vector_parse(str, &n, ¶ms, error_out); + if (err != 0) + return err; + + /* This string is broken up and the pieces used by the underlying + * driver. We should copy it to make sure things do not go wrong + * later. + */ + + params = kstrdup(params, GFP_KERNEL); + if (str == NULL) { + *error_out = "vector_config failed to strdup string"; + return -ENOMEM; + } + + parsed = uml_parse_vector_ifspec(params); + + if (parsed == NULL) { + *error_out = "vector_config failed to parse parameters"; + return -EINVAL; + } + + vector_eth_configure(n, parsed); + return 0; +} + +static int vector_id(char **str, int *start_out, int *end_out) +{ + char *end; + int n; + + n = simple_strtoul(*str, &end, 0); + if ((*end != '\0') || (end == *str)) + return -1; + + *start_out = n; + *end_out = n; + *str = end; + return n; +} + +static int vector_remove(int n, char **error_out) +{ + struct vector_device *vec_d; + struct net_device *dev; + struct vector_private *vp; + + vec_d = find_device(n); + if (vec_d == NULL) + return -ENODEV; + dev = vec_d->dev; + vp = netdev_priv(dev); + if (vp->fds != NULL) + return -EBUSY; + unregister_netdev(dev); + platform_device_unregister(&vec_d->pdev); + return 0; +} + +/* + * There is no shared per-transport initialization code, so + * we will just initialize each interface one by one and + * add them to a list + */ + +static struct platform_driver uml_net_driver = { + .driver = { + .name = DRIVER_NAME, + }, +}; + + +static void vector_device_release(struct device *dev) +{ + struct vector_device *device = dev_get_drvdata(dev); + struct net_device *netdev = device->dev; + + list_del(&device->list); + kfree(device); + free_netdev(netdev); +} + +/* Bog standard recv using recvmsg - not used normally unless the user + * explicitly specifies not to use recvmmsg vector RX. + */ + +static int vector_legacy_rx(struct vector_private *vp) +{ + int pkt_len; + struct user_msghdr hdr; + struct iovec iov[2]; /* header + data use case only */ + int iovpos = 0; + struct sk_buff *skb; + int header_check; + + hdr.msg_name = NULL; + hdr.msg_namelen = 0; + hdr.msg_iov = (struct iovec *) &iov; + hdr.msg_iovlen = 1; + hdr.msg_control = NULL; + hdr.msg_controllen = 0; + hdr.msg_flags = 0; + + if (vp->header_size > 0) { + iov[iovpos].iov_base = vp->header_rxbuffer; + iov[iovpos].iov_len = vp->rx_header_size; + hdr.msg_iovlen++; + iovpos++; + } + + skb = netdev_alloc_skb(vp->dev, + vp->max_packet + vp->headroom + SAFETY_MARGIN); + if (skb == NULL) { + /* Read a packet into drop_buffer and don't do + * anything with it. + */ + iov[iovpos].iov_base = drop_buffer; + iov[iovpos].iov_len = DROP_BUFFER_SIZE; + vp->dev->stats.rx_dropped++; + } else { + skb_reserve(skb, vp->headroom); + skb->dev = vp->dev; + skb_put(skb, vp->max_packet); + skb_reset_mac_header(skb); + iov[iovpos].iov_base = skb->data; + iov[iovpos].iov_len = vp->max_packet; + } + + pkt_len = uml_vector_recvmsg(vp->fds->rx_fd, &hdr, 0); + + if (skb != NULL) { + if (pkt_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + vp->header_rxbuffer, skb, vp); + if (header_check < 0) { + dev_kfree_skb_irq(skb); + vp->dev->stats.rx_dropped++; + vp->estats.rx_encaps_errors++; + return 0; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, pkt_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + dev_kfree_skb_irq(skb); + } + } + return pkt_len; +} + +/* + * Packet at a time TX which falls back to vector TX if the + * underlying transport is busy. + */ + +static int writev_tx(struct vector_private *vp, struct sk_buff *skb) +{ + int pkt_len; + + if (vp->header_size == 0) { + pkt_len = os_write_file(vp->fds->tx_fd, skb->data, skb->len); + } else { + struct iovec iov[2]; + + iov[0].iov_base = vp->header_txbuffer; + iov[0].iov_len = vp->header_size; + vp->form_header(iov->iov_base, skb, vp); + iov[1].iov_base = skb->data; + iov[1].iov_len = skb->len; + pkt_len = uml_vector_writev(vp->fds->tx_fd, &iov, 2); + } + netif_trans_update(vp->dev); + netif_wake_queue(vp->dev); + + if (pkt_len > 0) { + vp->dev->stats.tx_bytes += skb->len; + vp->dev->stats.tx_packets++; + } else { + vp->dev->stats.tx_dropped++; + } + consume_skb(skb); + return pkt_len; +} + +/* + * Receive as many messages as we can in one call using the special + * mmsg vector matched to an skb vector which we prepared earlier. + */ + +static int vector_mmsg_rx(struct vector_private *vp) +{ + int packet_count, i; + struct vector_queue *qi = vp->rx_queue; + struct sk_buff *skb; + struct mmsghdr *mmsg_vector = qi->mmsg_vector; + void **skbuff_vector = qi->skbuff_vector; + int header_check; + + /* Refresh the vector and make sure it is with new skbs and the + * iovs are updated to point to them. + */ + + prep_queue_for_rx(qi); + + /* Fire the Lazy Gun - get as many packets as we can in one go. */ + + packet_count = uml_vector_recvmmsg( + vp->fds->rx_fd, qi->mmsg_vector, qi->max_depth, 0); + + if (packet_count <= 0) + return packet_count; + + /* We treat packet processing as enqueue, buffer refresh as dequeue + * The queue_depth tells us how many buffers have been used and how + * many do we need to prep the next time prep_queue_for_rx() is called. + */ + + qi->queue_depth = packet_count; + + for (i = 0; i < packet_count; i++) { + skb = (*skbuff_vector); + if (mmsg_vector->msg_len > vp->header_size) { + if (vp->header_size > 0) { + header_check = vp->verify_header( + mmsg_vector->msg_hdr.msg_iov->iov_base, skb, vp); + if (header_check < 0) { + /* Overlay header failed to verify - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too + * complex. + */ + dev_kfree_skb_irq(skb); + vp->estats.rx_encaps_errors++; + continue; + } + if (header_check > 0) { + vp->estats.rx_csum_offload_good++; + skb->ip_summed = CHECKSUM_UNNECESSARY; + } + } + skb_trim(skb, + mmsg_vector->msg_len - vp->rx_header_size); + skb->protocol = eth_type_trans(skb, skb->dev); + /* + * We do not need to lock on updating stats here + * The interrupt loop is non-reentrant. + */ + vp->dev->stats.rx_bytes += skb->len; + vp->dev->stats.rx_packets++; + netif_rx(skb); + } else { + /* Overlay header too short to do anything - discard. + * We can actually keep this skb and reuse it, + * but that will make the prep logic too complex. + */ + if (skb != NULL) + dev_kfree_skb_irq(skb); + } + (*skbuff_vector) = NULL; + /* Move to the next buffer element */ + mmsg_vector++; + skbuff_vector++; + } + if (packet_count > 0) { + if (vp->estats.rx_queue_max < packet_count) + vp->estats.rx_queue_max = packet_count; + vp->estats.rx_queue_running_average = + (vp->estats.rx_queue_running_average + packet_count) >> 1; + } + return packet_count; +} + +static void vector_rx(struct vector_private *vp) +{ + int err; + + if ((vp->options & VECTOR_RX) > 0) + while ((err = vector_mmsg_rx(vp)) > 0) + ; + else + while ((err = vector_legacy_rx(vp)) > 0) + ; + if (err != 0) + printk(KERN_ERR "vector_rx: error(%d)\n", err); +} + +static int vector_net_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + int queue_depth = 0; + + if ((vp->options & VECTOR_TX) == 0) { + writev_tx(vp, skb); + return NETDEV_TX_OK; + } + + /* We do BQL only in the vector path, no point doing it in + * packet at a time mode as there is no device queue + */ + + netdev_sent_queue(vp->dev, skb->len); + queue_depth = vector_enqueue(vp->tx_queue, skb); + + /* if the device queue is full, stop the upper layers and + * flush it. + */ + + if (queue_depth >= vp->tx_queue->max_depth - 1) { + vp->estats.tx_kicks++; + netif_stop_queue(dev); + vector_send(vp->tx_queue); + return NETDEV_TX_OK; + } + if (skb->xmit_more) { + mod_timer(&vp->tl, vp->coalesce); + return NETDEV_TX_OK; + } + if (skb->len < TX_SMALL_PACKET) { + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); + } else + tasklet_schedule(&vp->tx_poll); + return NETDEV_TX_OK; +} + +static irqreturn_t vector_rx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + vector_rx(vp); + return IRQ_HANDLED; + +} + +static irqreturn_t vector_tx_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct vector_private *vp = netdev_priv(dev); + + if (!netif_running(dev)) + return IRQ_NONE; + /* We need to pay attention to it only if we got + * -EAGAIN or -ENOBUFFS from sendmms. Otherwise + * we ignore it. In the future, it may be worth + * it to improve the IRQ controller a bit to make + * tweaking the IRQ mask less costly + */ + + if (vp->in_write_poll) { + tasklet_schedule(&vp->tx_poll); + } + return IRQ_HANDLED; + +} + +static int irq_rr; + +static int vector_net_close(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + + netif_stop_queue(dev); + del_timer(&vp->tl); + + if (vp->fds == NULL) + return 0; + + /* Disable and free all IRQS */ + if (vp->rx_irq > 0) { + um_free_irq(vp->rx_irq, dev); + vp->rx_irq = 0; + } + if (vp->tx_irq > 0) { + um_free_irq(vp->tx_irq, dev); + vp->tx_irq = 0; + } + tasklet_kill(&vp->tx_poll); + if (vp->fds->rx_fd > 0) { + os_close_file(vp->fds->rx_fd); + vp->fds->rx_fd = -1; + } + if (vp->fds->tx_fd > 0) { + os_close_file(vp->fds->tx_fd); + vp->fds->tx_fd = -1; + } + if (vp->bpf != NULL) + kfree(vp->bpf); + if (vp->fds->remote_addr != NULL) + kfree(vp->fds->remote_addr); + if (vp->transport_data != NULL) + kfree(vp->transport_data); + if (vp->header_rxbuffer != NULL) + kfree(vp->header_rxbuffer); + if (vp->header_txbuffer != NULL) + kfree(vp->header_txbuffer); + if (vp->rx_queue != NULL) + destroy_queue(vp->rx_queue); + if (vp->tx_queue != NULL) + destroy_queue(vp->tx_queue); + kfree(vp->fds); + vp->fds = NULL; + spin_lock_irqsave(&vp->lock, flags); + vp->opened = false; + spin_unlock_irqrestore(&vp->lock, flags); + return 0; +} + +/* TX tasklet */ + +static void vector_tx_poll(unsigned long data) +{ + struct vector_private *vp = (struct vector_private *)data; + + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} +static void vector_reset_tx(struct work_struct *work) +{ + struct vector_private *vp = + container_of(work, struct vector_private, reset_tx); + netdev_reset_queue(vp->dev); + netif_start_queue(vp->dev); + netif_wake_queue(vp->dev); +} +static int vector_net_open(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + unsigned long flags; + int err = -EINVAL; + struct vector_device *vdevice; + + spin_lock_irqsave(&vp->lock, flags); + if (vp->opened) + return -ENXIO; + vp->opened = true; + spin_unlock_irqrestore(&vp->lock, flags); + + vp->fds = uml_vector_user_open(vp->unit, vp->parsed); + + if (vp->fds == NULL) + goto out_close; + + if (build_transport_data(vp) < 0) + goto out_close; + + if ((vp->options & VECTOR_RX) > 0) { + vp->rx_queue = create_queue( + vp, get_depth(vp->parsed), vp->rx_header_size, 0); + vp->rx_queue->queue_depth = get_depth(vp->parsed); + } else { + vp->header_rxbuffer = kmalloc(vp->rx_header_size, GFP_KERNEL); + if (vp->header_rxbuffer == NULL) + goto out_close; + } + if ((vp->options & VECTOR_TX) > 0) { + vp->tx_queue = create_queue( + vp, get_depth(vp->parsed), vp->header_size, MAX_IOV_SIZE); + } else { + vp->header_txbuffer = kmalloc(vp->header_size, GFP_KERNEL); + if (vp->header_txbuffer == NULL) + goto out_close; + } + + /* READ IRQ */ + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->rx_fd, + IRQ_READ, vector_rx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR "vector_open: failed to get rx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->rx_irq = irq_rr + VECTOR_BASE_IRQ; + dev->irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + + /* WRITE IRQ - we need it only if we have vector TX */ + if ((vp->options & VECTOR_TX) > 0) { + err = um_request_irq( + irq_rr + VECTOR_BASE_IRQ, vp->fds->tx_fd, + IRQ_WRITE, vector_tx_interrupt, + IRQF_SHARED, dev->name, dev); + if (err != 0) { + printk(KERN_ERR + "vector_open: failed to get tx irq(%d)\n", err); + err = -ENETUNREACH; + goto out_close; + } + vp->tx_irq = irq_rr + VECTOR_BASE_IRQ; + irq_rr = (irq_rr + 1) % VECTOR_IRQ_SPACE; + } + + if ((vp->options & VECTOR_BPF) != 0) { + vp->bpf = uml_vector_default_bpf(vp->fds->rx_fd, dev->dev_addr); + } + + + /* Write Timeout Timer */ + + vp->tl.data = (unsigned long) vp; + netif_start_queue(dev); + + /* clear buffer - it can happen that the host side of the interface + * is full when we get here. In this case, new data is never queued, + * SIGIOs never arrive, and the net never works. + */ + + vector_rx(vp); + + vector_reset_stats(vp); + vdevice = find_device(vp->unit); + vdevice->opened = 1; + + if ((vp->options & VECTOR_TX) != 0) { + add_timer(&vp->tl); + } + return 0; +out_close: + vector_net_close(dev); + return err; +} + + +static void vector_net_set_multicast_list(struct net_device *dev) +{ + /* TODO: - we can do some BPF games here */ + return; +} + +static void vector_net_tx_timeout(struct net_device *dev) +{ + struct vector_private *vp = netdev_priv(dev); + vp->estats.tx_timeout_count++; + netif_trans_update(dev); + schedule_work(&vp->reset_tx); +} + +#ifdef CONFIG_NET_POLL_CONTROLLER +static void vector_net_poll_controller(struct net_device *dev) +{ + disable_irq(dev->irq); + uml_net_interrupt(dev->irq, dev); + enable_irq(dev->irq); +} +#endif + +static void vector_net_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *info) +{ + strlcpy(info->driver, DRIVER_NAME, sizeof(info->driver)); + strlcpy(info->version, DRIVER_VERSION, sizeof(info->version)); +} + +static void vector_get_ringparam(struct net_device *netdev, + struct ethtool_ringparam *ring) +{ + struct vector_private *vp = netdev_priv(netdev); + + ring->rx_max_pending = vp->rx_queue->max_depth; + ring->tx_max_pending = vp->tx_queue->max_depth; + ring->rx_pending = vp->rx_queue->max_depth; + ring->tx_pending = vp->tx_queue->max_depth; +} + +static void vector_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch (stringset) { + case ETH_SS_TEST: + *buf = '\0'; + break; + case ETH_SS_STATS: + memcpy(buf, ðtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + default: + WARN_ON(1); + break; + } +} + +static int vector_get_sset_count(struct net_device *dev, int sset) +{ + switch (sset) { + case ETH_SS_TEST: + return 0; + case ETH_SS_STATS: + return VECTOR_NUM_STATS; + default: + return -EOPNOTSUPP; + } +} + +static void vector_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *estats, u64 *tmp_stats) +{ + struct vector_private *vp = netdev_priv(dev); + + memcpy(tmp_stats, &vp->estats, sizeof(struct vector_estats)); +} + +static int vector_get_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + ec->tx_coalesce_usecs = (vp->coalesce * 1000000) / HZ; + return 0; +} + +static int vector_set_coalesce(struct net_device *netdev, + struct ethtool_coalesce *ec) +{ + struct vector_private *vp = netdev_priv(netdev); + + vp->coalesce = (ec->tx_coalesce_usecs * HZ) / 1000000; + if (vp->coalesce == 0) + vp->coalesce = 1; + return 0; +} + +static const struct ethtool_ops vector_net_ethtool_ops = { + .get_drvinfo = vector_net_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_ts_info = ethtool_op_get_ts_info, + .get_ringparam = vector_get_ringparam, + .get_strings = vector_get_strings, + .get_sset_count = vector_get_sset_count, + .get_ethtool_stats = vector_get_ethtool_stats, + .get_coalesce = vector_get_coalesce, + .set_coalesce = vector_set_coalesce, +}; + + +static const struct net_device_ops vector_netdev_ops = { + .ndo_open = vector_net_open, + .ndo_stop = vector_net_close, + .ndo_start_xmit = vector_net_start_xmit, + .ndo_set_rx_mode = vector_net_set_multicast_list, + .ndo_tx_timeout = vector_net_tx_timeout, + .ndo_set_mac_address = eth_mac_addr, + .ndo_validate_addr = eth_validate_addr, +#ifdef CONFIG_NET_POLL_CONTROLLER + .ndo_poll_controller = vector_net_poll_controller, +#endif +}; + + +static void vector_timer_expire(unsigned long _conn) +{ + struct vector_private *vp = (struct vector_private *)_conn; + vp->estats.tx_kicks++; + vector_send(vp->tx_queue); +} + +static void vector_eth_configure( + int n, + struct arglist *def + ) +{ + struct vector_device *device; + struct net_device *dev; + struct vector_private *vp; + int err; + + printk(KERN_INFO "Vector Netdevice %d\n", n); + + device = kzalloc(sizeof(*device), GFP_KERNEL); + if (device == NULL) { + printk(KERN_ERR "eth_configure failed to allocate struct " + "vector_device\n"); + return; + } + printk(KERN_INFO "vector etherdevice start %d\n", n); + dev = alloc_etherdev(sizeof(struct vector_private)); + if (dev == NULL) { + printk(KERN_ERR "eth_configure: failed to allocate struct " + "net_device for vec%d\n", n); + goto out_free_device; + } + + dev->mtu = get_mtu(def); + + INIT_LIST_HEAD(&device->list); + device->unit = n; + + /* If this name ends up conflicting with an existing registered + * netdevice, that is OK, register_netdev{,ice}() will notice this + * and fail. + */ + snprintf(dev->name, sizeof(dev->name), "vec%d", n); + uml_net_setup_etheraddr(dev, uml_vector_fetch_arg(def, "mac")); + vp = netdev_priv(dev); + + /* sysfs register */ + if (!driver_registered) { + platform_driver_register(¨_net_driver); + driver_registered = 1; + } + device->pdev.id = n; + device->pdev.name = DRIVER_NAME; + device->pdev.dev.release = vector_device_release; + dev_set_drvdata(&device->pdev.dev, device); + if (platform_device_register(&device->pdev)) + goto out_free_netdev; + SET_NETDEV_DEV(dev, &device->pdev.dev); + + device->dev = dev; + + *vp = ((struct vector_private) + { + .list = LIST_HEAD_INIT(vp->list), + .dev = dev, + .unit = n, + .options = get_transport_options(def), + .rx_irq = 0, + .tx_irq = 0, + .parsed = def, + .max_packet = get_mtu(def) + ETH_HEADER_OTHER, + /* TODO - we need to calculate headroom so that ip header + * is 16 byte aligned all the time + */ + .headroom = get_headroom(def), + .form_header = NULL, + .verify_header = NULL, + .header_rxbuffer = NULL, + .header_txbuffer = NULL, + .header_size = 0, + .rx_header_size = 0, + .rexmit_scheduled = false, + .opened = false, + .transport_data = NULL, + .in_write_poll = false, + .coalesce = 2 + }); + + /* if we can do vector TX, we can do scatter/gather too */ + if ((vp->options & VECTOR_TX) > 0) + dev->features = NETIF_F_SG; + tasklet_init(&vp->tx_poll, vector_tx_poll, (unsigned long)vp); + INIT_WORK(&vp->reset_tx, vector_reset_tx); + + init_timer(&vp->tl); + spin_lock_init(&vp->lock); + vp->tl.function = vector_timer_expire; + + /* FIXME */ + dev->netdev_ops = &vector_netdev_ops; + dev->ethtool_ops = &vector_net_ethtool_ops; + dev->watchdog_timeo = (HZ >> 1); + /* primary IRQ - fixme */ + dev->irq = 0; /* we will adjust this once opened */ + + rtnl_lock(); + err = register_netdevice(dev); + rtnl_unlock(); + if (err) + goto out_undo_user_init; + + spin_lock(&vector_devices_lock); + list_add(&device->list, &vector_devices); + spin_unlock(&vector_devices_lock); + + return; + +out_undo_user_init: + return; +out_free_netdev: + free_netdev(dev); +out_free_device: + kfree(device); +} + + + + +/* + * Invoked late in the init + */ + +static int __init vector_init(void) +{ + struct list_head *ele; + struct vector_cmd_line_arg *def; + struct arglist *parsed; + + list_for_each(ele, &vec_cmd_line) { + def = list_entry(ele, struct vector_cmd_line_arg, list); + parsed = uml_parse_vector_ifspec(def->arguments); + if (parsed != NULL) + vector_eth_configure(def->unit, parsed); + } + return 0; +} + + +/* Invoked at initial argument parsing, only stores + * arguments until a proper vector_init is called + * later + */ + +static int __init vector_setup(char *str) +{ + char *error; + int n, err; + struct vector_cmd_line_arg *new; + + err = vector_parse(str, &n, &str, &error); + if (err) { + printk(KERN_ERR "vector_setup - Couldn't parse '%s' : %s\n", + str, error); + return 1; + } + new = alloc_bootmem(sizeof(*new)); + INIT_LIST_HEAD(&new->list); + new->unit = n; + new->arguments = str; + list_add_tail(&new->list, &vec_cmd_line); + return 1; +} + +__setup("vec", vector_setup); +__uml_help(vector_setup, +"vec[0-9]+:<option>=<value>,<option>=<value>\n" +" Configure a vector io network device.\n\n" +); + +late_initcall(vector_init); + +static struct mc_device vector_mc = { + .list = LIST_HEAD_INIT(vector_mc.list), + .name = "vec", + .config = vector_config, + .get_config = NULL, + .id = vector_id, + .remove = vector_remove, +}; + +#ifdef CONFIG_INET +static int vector_inetaddr_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + return NOTIFY_DONE; +} + +static struct notifier_block vector_inetaddr_notifier = { + .notifier_call = vector_inetaddr_event, +}; + +static void inet_register(void) +{ + register_inetaddr_notifier(&vector_inetaddr_notifier); +} +#else +static inline void inet_register(void) +{ +} +#endif + +static int vector_net_init(void) +{ + mconsole_register_dev(&vector_mc); + inet_register(); + return 0; +} + +__initcall(vector_net_init); + + + diff --git a/arch/um/drivers/vector_kern.h b/arch/um/drivers/vector_kern.h new file mode 100644 index 000000000000..fd724ef6806e --- /dev/null +++ b/arch/um/drivers/vector_kern.h @@ -0,0 +1,128 @@ +/* + * Copyright (C) 2002 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#ifndef __UM_VECTOR_KERN_H +#define __UM_VECTOR_KERN_H + +#include <linux/netdevice.h> +#include <linux/platform_device.h> +#include <linux/skbuff.h> +#include <linux/socket.h> +#include <linux/list.h> +#include <linux/ctype.h> +#include <linux/workqueue.h> +#include <linux/interrupt.h> +#include "vector_user.h" + +/* Queue structure specially adapted for multiple enqueue/dequeue + * in a mmsgrecv/mmsgsend context + */ + +/* Dequeue method */ + +#define QUEUE_SENDMSG 0 +#define QUEUE_SENDMMSG 1 + +#define VECTOR_RX 1 +#define VECTOR_TX (1 << 1) +#define VECTOR_BPF (1 << 2) + +#define ETH_MAX_PACKET 1500 +#define ETH_HEADER_OTHER 32 /* just in case someone decides to go mad on QnQ */ + +struct vector_queue { + struct mmsghdr *mmsg_vector; + void **skbuff_vector; + /* backlink to device which owns us */ + struct net_device *dev; + spinlock_t head_lock; + spinlock_t tail_lock; + int queue_depth, head, tail, max_depth, max_iov_frags; + short options; +}; + +struct vector_estats { + uint64_t rx_queue_max; + uint64_t rx_queue_running_average; + uint64_t tx_queue_max; + uint64_t tx_queue_running_average; + uint64_t rx_encaps_errors; + uint64_t tx_timeout_count; + uint64_t tx_restart_queue; + uint64_t tx_kicks; + uint64_t tx_flow_control_xon; + uint64_t tx_flow_control_xoff; + uint64_t rx_csum_offload_good; + uint64_t rx_csum_offload_errors; + uint64_t sg_ok; + uint64_t sg_linearized; +}; + +#define VERIFY_HEADER_NOK -1 +#define VERIFY_HEADER_OK 0 +#define VERIFY_CSUM_OK 1 + +struct vector_private { + struct list_head list; + spinlock_t lock; + struct net_device *dev; + + int unit; + + /* Timeout timer in TX */ + + struct timer_list tl; + + /* Scheduled "remove device" work */ + struct work_struct reset_tx; + struct vector_fds *fds; + + struct vector_queue *rx_queue; + struct vector_queue *tx_queue; + + int rx_irq; + int tx_irq; + + struct arglist *parsed; + + void *transport_data; /* transport specific params if needed */ + + int max_packet; + int headroom; + + int options; + + /* remote address if any - some transports will leave this as null */ + + int header_size; + int rx_header_size; + int coalesce; + + void *header_rxbuffer; + void *header_txbuffer; + + void (*form_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + int (*verify_header)(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp); + + spinlock_t stats_lock; + + struct tasklet_struct tx_poll; + bool rexmit_scheduled; + bool opened; + bool in_write_poll; + + /* ethtool stats */ + + struct vector_estats estats; + void *bpf; + + char user[0]; +}; + +extern int build_transport_data(struct vector_private *vp); + +#endif diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c new file mode 100644 index 000000000000..44724c783e6d --- /dev/null +++ b/arch/um/drivers/vector_transports.c @@ -0,0 +1,406 @@ +/* + * Copyright (C) 2017 - Cambridge Greys Limited + * Copyright (C) 2011 - 2014 Cisco Systems Inc + * Licensed under the GPL. + */ + +#include <linux/etherdevice.h> +#include <linux/netdevice.h> +#include <linux/skbuff.h> +#include <linux/slab.h> +#include <asm/byteorder.h> +#include <uapi/linux/ip.h> +#include <uapi/linux/virtio_net.h> +#include <linux/virtio_net.h> +#include <linux/virtio_byteorder.h> +#include <linux/netdev_features.h> +#include "vector_user.h" +#include "vector_kern.h" + + +struct gre_minimal_header { + uint16_t header; + uint16_t arptype; +}; + + +struct uml_gre_data { + uint32_t rx_key; + uint32_t tx_key; + uint32_t sequence; + + bool ipv6; + bool has_sequence; + bool pin_sequence; + bool checksum; + bool key; + struct gre_minimal_header expected_header; + + uint32_t checksum_offset; + uint32_t key_offset; + uint32_t sequence_offset; + +}; + +struct uml_l2tpv3_data { + uint64_t rx_cookie; + uint64_t tx_cookie; + uint64_t rx_session; + uint64_t tx_session; + uint32_t counter; + + bool udp; + bool ipv6; + bool has_counter; + bool pin_counter; + bool cookie; + bool cookie_is_64; + + uint32_t cookie_offset; + uint32_t session_offset; + uint32_t counter_offset; +}; + +static void l2tpv3_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *counter; + + if (td->udp) + *(uint32_t *) header = cpu_to_be32(L2TPV3_DATA_PACKET); + (*(uint32_t *) (header + td->session_offset)) = td->tx_session; + + if (td->cookie) { + if (td->cookie_is_64) + (*(uint64_t *)(header + td->cookie_offset)) = + td->tx_cookie; + else + (*(uint32_t *)(header + td->cookie_offset)) = + td->tx_cookie; + } + if (td->has_counter) { + counter = (uint32_t *)(header + td->counter_offset); + if (td->pin_counter) { + *counter = 0; + } else { + td->counter++; + *counter = cpu_to_be32(td->counter); + } + } +} + +static void gre_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_gre_data *td = vp->transport_data; + uint32_t *sequence; + *((uint32_t *) header) = *((uint32_t *) &td->expected_header); + if (td->key) + (*(uint32_t *) (header + td->key_offset)) = td->tx_key; + if (td->has_sequence) { + sequence = (uint32_t *)(header + td->sequence_offset); + if (td->pin_sequence) + *sequence = 0; + else + *sequence = cpu_to_be32(++td->sequence); + } +} + +static void raw_form_header(uint8_t *header, + struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + virtio_net_hdr_from_skb(skb, vheader, virtio_legacy_is_little_endian(), false); +} + +static int l2tpv3_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct uml_l2tpv3_data *td = vp->transport_data; + uint32_t *session; + uint64_t cookie; + + if ((!td->udp) && (!td->ipv6)) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + /* we do not do a strict check for "data" packets as per + * the RFC spec because the pure IP spec does not have + * that anyway. + */ + + if (td->cookie) { + if (td->cookie_is_64) + cookie = *(uint64_t *)(header + td->cookie_offset); + else + cookie = *(uint32_t *)(header + td->cookie_offset); + if (cookie != td->rx_cookie) { + printk(KERN_ERR "uml_l2tpv3: unknown cookie id"); + return -1; + } + } + session = (uint32_t *) (header + td->session_offset); + if (*session != td->rx_session) { + printk(KERN_ERR "uml_l2tpv3: session mismatch"); + return -1; + } + return 0; +} + +static int gre_verify_header( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + + uint32_t key; + struct uml_gre_data *td = vp->transport_data; + + if (!td->ipv6) + header += sizeof(struct iphdr) /* fix for ipv4 raw */; + + if (*((uint32_t *) header) != *((uint32_t *) &td->expected_header)) { + printk(KERN_ERR "header type disagreement, expecting %0x, got %0x", + *((uint32_t *) &td->expected_header), + *((uint32_t *) header) + ); + return -1; + } + + if (td->key) { + key = (*(uint32_t *)(header + td->key_offset)); + if (key != td->rx_key) { + printk(KERN_ERR "unknown key id %0x, expecting %0x", + key, td->rx_key); + return -1; + } + } + return 0; +} + +static int raw_verify_header ( + uint8_t *header, struct sk_buff *skb, struct vector_private *vp) +{ + struct virtio_net_hdr *vheader = (struct virtio_net_hdr *) header; + + if (vheader->gso_type != VIRTIO_NET_HDR_GSO_NONE) { + printk(KERN_ERR "raw: GSO enabled on interface, please turn off"); + return -1; /* GSO, we cannot process this */ + } + if ((vheader->flags & VIRTIO_NET_HDR_F_DATA_VALID) > 0) + return 1; + else + virtio_net_hdr_to_skb(skb, vheader, virtio_legacy_is_little_endian()); + return 0; +} + +static bool get_uint_param( + struct arglist *def, char *param, unsigned int *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoint(arg, 0, result) == 0) + return true; + } + return false; +} + +static bool get_ulong_param( + struct arglist *def, char *param, unsigned long *result) +{ + char *arg = uml_vector_fetch_arg(def, param); + + if (arg != NULL) { + if (kstrtoul(arg, 0, result) == 0) + return true; + return true; + } + return false; +} + +static int build_gre_transport_data(struct vector_private *vp) +{ + struct uml_gre_data *td; + int temp_int; + int temp_rx; + int temp_tx; + + vp->transport_data = kmalloc(sizeof(struct uml_gre_data), GFP_KERNEL); + if (vp->transport_data == NULL) + return -ENOMEM; + td = vp->transport_data; + td->sequence = 0; + + td->expected_header.arptype = GRE_IRB; + td->expected_header.header = 0; + + vp->form_header = &gre_form_header; + vp->verify_header = &gre_verify_header; + vp->header_size = 4; + td->key_offset = 4; + td->sequence_offset = 4; + td->checksum_offset = 4; + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + td->key = false; + if (get_uint_param(vp->parsed, "rx_key", &temp_rx)) { + if (get_uint_param(vp->parsed, "tx_key", &temp_tx)) { + td->key = true; + td->expected_header.header |= GRE_MODE_KEY; + td->rx_key = cpu_to_be32(temp_rx); + td->tx_key = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->sequence_offset += 4; + } else { + return -EINVAL; + } + } + + td->sequence = false; + if (get_uint_param(vp->parsed, "sequence", &temp_int)) { + if (temp_int > 0) { + vp->header_size += 4; + td->has_sequence = true; + td->expected_header.header |= GRE_MODE_SEQUENCE; + if (get_uint_param( + vp->parsed, "pin_sequence", &temp_int)) { + if (temp_int > 0) + td->pin_sequence = true; + } + } + } + vp->rx_header_size = vp->header_size; + if (!td->ipv6) + vp->rx_header_size += sizeof(struct iphdr); + return 0; +} + +static int build_l2tpv3_transport_data(struct vector_private *vp) +{ + + struct uml_l2tpv3_data *td; + int temp_int, temp_rxs, temp_txs; + unsigned long temp_rx; + unsigned long temp_tx; + + vp->transport_data = kmalloc( + sizeof(struct uml_l2tpv3_data), GFP_KERNEL); + + if (vp->transport_data == NULL) + return -ENOMEM; + + td = vp->transport_data; + + vp->form_header = &l2tpv3_form_header; + vp->verify_header = &l2tpv3_verify_header; + td->counter = 0; + + vp->header_size = 4; + td->session_offset = 0; + td->cookie_offset = 4; + td->counter_offset = 4; + + + td->ipv6 = false; + if (get_uint_param(vp->parsed, "v6", &temp_int)) { + if (temp_int > 0) + td->ipv6 = true; + } + + if (get_uint_param(vp->parsed, "rx_session", &temp_rxs)) { + if (get_uint_param(vp->parsed, "tx_session", &temp_txs)) { + td->tx_session = cpu_to_be32(temp_txs); + td->rx_session = cpu_to_be32(temp_rxs); + } else { + return -EINVAL; + } + } else { + return -EINVAL; + } + + td->cookie_is_64 = false; + if (get_uint_param(vp->parsed, "cookie64", &temp_int)) { + if (temp_int > 0) + td->cookie_is_64 = true; + } + td->cookie = false; + if (get_ulong_param(vp->parsed, "rx_cookie", &temp_rx)) { + if (get_ulong_param(vp->parsed, "tx_cookie", &temp_tx)) { + td->cookie = true; + if (td->cookie_is_64) { + td->rx_cookie = cpu_to_be64(temp_rx); + td->tx_cookie = cpu_to_be64(temp_tx); + vp->header_size += 8; + td->counter_offset += 8; + } else { + td->rx_cookie = cpu_to_be32(temp_rx); + td->tx_cookie = cpu_to_be32(temp_tx); + vp->header_size += 4; + td->counter_offset += 4; + } + } else { + return -EINVAL; + } + } + + td->has_counter = false; + if (get_uint_param(vp->parsed, "counter", &temp_int)) { + if (temp_int > 0) { + td->has_counter = true; + vp->header_size += 4; + if (get_uint_param( + vp->parsed, "pin_counter", &temp_int)) { + if (temp_int > 0) + td->pin_counter = true; + } + } + } + + if (get_uint_param(vp->parsed, "udp", &temp_int)) { + if (temp_int > 0) { + td->udp = true; + vp->header_size += 4; + td->counter_offset += 4; + td->session_offset += 4; + td->cookie_offset += 4; + } + } + + vp->rx_header_size = vp->header_size; + if ((!td->ipv6) && (!td->udp)) + vp->rx_header_size += sizeof(struct iphdr); + + return 0; +} + +static int build_raw_transport_data(struct vector_private *vp) +{ + if (uml_vector_enable_vnet_headers(vp->fds->rx_fd)) { + vp->form_header = &raw_form_header; + vp->verify_header = &raw_verify_header; + vp->header_size = sizeof(struct virtio_net_hdr); + vp->rx_header_size = sizeof(struct virtio_net_hdr); + vp->dev->features |= NETIF_F_HW_CSUM; + printk(KERN_INFO "raw: using vnet headers to offload checksum, header size=%d", vp->header_size); + } + return 0; +} + + +int build_transport_data(struct vector_private *vp) +{ + char *transport = uml_vector_fetch_arg(vp->parsed, "transport"); + + if (strncmp(transport, TRANS_GRE, TRANS_GRE_LEN) == 0) + return build_gre_transport_data(vp); + if (strncmp(transport, TRANS_L2TPV3, TRANS_L2TPV3_LEN) == 0) + return build_l2tpv3_transport_data(vp); + if (strncmp(transport, TRANS_RAW, TRANS_RAW_LEN) == 0) + return build_raw_transport_data(vp); + return 0; +} + diff --git a/arch/um/drivers/vector_user.c b/arch/um/drivers/vector_user.c new file mode 100644 index 000000000000..6d51289cb9cb --- /dev/null +++ b/arch/um/drivers/vector_user.c @@ -0,0 +1,512 @@ +/* + * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Licensed under the GPL + */ + +#include <stdio.h> +#include <unistd.h> +#include <stdarg.h> +#include <errno.h> +#include <stddef.h> +#include <string.h> +#include <sys/ioctl.h> +#include <net/if.h> +#include <linux/if_tun.h> +#include <arpa/inet.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <net/ethernet.h> +#include <netinet/ip.h> +#include <netinet/ether.h> +#include <linux/if_ether.h> +#include <linux/if_packet.h> +#include <sys/socket.h> +#include <sys/wait.h> +#include <netdb.h> +#include <stdlib.h> +#include <os.h> +#include <um_malloc.h> +#include "vector_user.h" + +#define ID_GRE 0 +#define ID_L2TPV3 1 +#define ID_MAX 1 + +#define TOKEN_IFNAME "ifname" + +#define TRANS_RAW "raw" +#define TRANS_RAW_LEN strlen(TRANS_RAW) + +/* This is very ugly and brute force lookup, but it is done + * only once at initialization so not worth doing hashes or + * anything more intelligent + */ + +char *uml_vector_fetch_arg(struct arglist *ifspec, char *token) +{ + int i; + + for (i = 0; i < ifspec->numargs; i++) { + if (strcmp(ifspec->tokens[i], token) == 0) + return ifspec->values[i]; + } + return NULL; + +} + +struct arglist *uml_parse_vector_ifspec(char *arg) +{ + struct arglist *result; + int pos, len; + bool parsing_token = true, next_starts = true; + + if (arg == NULL) + return NULL; + result = uml_kmalloc(sizeof(struct arglist), UM_GFP_KERNEL); + if (result == NULL) + return NULL; + result->numargs = 0; + len = strlen(arg); + for (pos = 0; pos < len; pos++) { + if (next_starts) { + if (parsing_token) { + result->tokens[result->numargs] = arg + pos; + } else { + result->values[result->numargs] = arg + pos; + result->numargs++; + } + next_starts = false; + } + if (*(arg + pos) == '=') { + if (parsing_token) + parsing_token = false; + else + goto cleanup; + next_starts = true; + (*(arg + pos)) = '\0'; + } + if (*(arg + pos) == ',') { + parsing_token = true; + next_starts = true; + (*(arg + pos)) = '\0'; + } + } + return result; +cleanup: + printk(UM_KERN_ERR "vector_setup - Couldn't parse '%s'\n", arg); + kfree(result); + return NULL; +} + +/* + * Socket/FD configuration functions. These return an structure + * of rx and tx descriptors to cover cases where these are not + * the same (f.e. read via raw socket and write via tap). + */ + +#define PATH_NET_TUN "/dev/net/tun" + +static struct vector_fds *user_init_tap_fds(struct arglist *ifspec) +{ + struct ifreq ifr; + int fd = -1; + struct sockaddr_ll sock; + int err = -ENOMEM; + char *iface; + struct vector_fds *result = NULL; + + iface = uml_vector_fetch_arg(ifspec, TOKEN_IFNAME); + if (iface == NULL) { + printk(UM_KERN_ERR "uml_tap: failed to parse interface spec\n"); + goto tap_cleanup; + } + + result = uml_kmalloc(sizeof(struct vector_fds), UM_GFP_KERNEL); + if (result == NULL) { + printk(UM_KERN_ERR "uml_tap: failed to allocate file descriptors\n"); + goto tap_cleanup; + } + result->rx_fd = -1; + result->tx_fd = -1; + result->remote_addr = NULL; + result->remote_addr_size = 0; + + /* TAP */ + + fd = open(PATH_NET_TUN, O_RDWR); + if (fd < 0) { + printk(UM_KERN_ERR "uml_tap: failed to open tun device\n"); + goto tap_cleanup; + } + result->tx_fd = fd; + memset(&ifr, 0, sizeof(ifr)); + ifr.ifr_flags = IFF_TAP | IFF_NO_PI; + strncpy((char *)&ifr.ifr_name, iface, sizeof(ifr.ifr_name) - 1); + + err =... [truncated message content] |
From: <ant...@ko...> - 2017-09-10 07:20:15
|
Hi Richard, hi list, Attached is an updated and rebased versus 4.13 patchset for the Epoll based IRQ controller and Vector Network driver. I have been running a backport to OpenWRT/kernel 4.4 since around May and I have not found any bugs so far. It appears stable and performs up to 3-4 times better than the existing UML network and IO. New in this patchset: 1. VNET header support and checksum offloads in the raw socket transport. 2. Scatter/Gather has been folded into the main vector patchset. |
From: Richard W. <ri...@no...> - 2017-08-29 20:59:52
|
Linus, The following changes since commit 14ccee78fc82f5512908f4424f541549a5705b89: Linux 4.13-rc6 (2017-08-20 14:13:52 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml.git for-linus-4.13-rc7 for you to fetch changes up to 2fb44600fe784449404c6639de26af8361999ec7: um: Fix check for _xstate for older hosts (2017-08-24 21:52:28 +0200) ---------------------------------------------------------------- This pull request contains a single fix for a regression which was introduced while the merge window. ---------------------------------------------------------------- Florian Fainelli (1): um: Fix check for _xstate for older hosts arch/x86/um/user-offsets.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) |