Regarding STM32, I see that we aren't using the BSRR register to its full potential. A single write to BSRR can set and clear bits at the same time, which I find handy when driving 4- or 8-bit buses for LCDs and the like.

So, the lower 16 bits of BSRR will set the respective 16 outputs:

void gpio_set(uint32_t gpioport, uint16_t gpios)
{
    GPIO_BSRR(gpioport) = gpios;
}

And the upper 16 bits will clear the same outputs:

void gpio_clear(uint32_t gpioport, uint16_t gpios)
{
    GPIO_BSRR(gpioport) = (gpios << 16);
}

(the above are the current libopencm3 functions).

If we are using, say, the lower 8 bits of a port as a byte-wide bus, we would do something like this:
#define BYTE_MASK  (GPIO7 | GPIO6 [...] GPIO1 | GPIO0)

gpio_clear(port, BYTE_MASK);
gpio_set(port, new_data_byte);

Which is OK, although the pins briefly read all zeros between the two writes. But with this new function that I sometimes use:

void gpio_atomic(uint32_t gpioport, uint16_t gpioset, uint16_t gpioclr)
{
    GPIO_BSRR(gpioport) = gpioset | ((uint32_t)gpioclr << 16); /* cast avoids a signed shift */
}

We can do it in a single register write, since a set bit takes priority over a clear bit for the same pin:
gpio_atomic(port, new_data_byte, BYTE_MASK);
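
To check the semantics without hardware, here is a rough host-side model of what one BSRR write does to the output latch. The names bsrr_word and bsrr_apply are made up for this sketch and are not libopencm3 functions, and the set-over-clear priority is my reading of the reference manual, so treat it as an illustration only:

```c
#include <stdint.h>

/* Hypothetical helper: build the 32-bit word that gpio_atomic() writes to BSRR. */
static uint32_t bsrr_word(uint16_t gpioset, uint16_t gpioclr)
{
    return (uint32_t)gpioset | ((uint32_t)gpioclr << 16);
}

/* Hypothetical software model of the port: return the new ODR value after
 * one BSRR write. Clear bits are applied first, then set bits, so a set
 * bit wins when both are requested for the same pin. */
static uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr)
{
    uint16_t clr = (uint16_t)(bsrr >> 16);
    uint16_t set = (uint16_t)(bsrr & 0xFFFFu);
    return (uint16_t)((odr & (uint16_t)~clr) | set);
}
```

For example, starting from an ODR of 0xA5FF, bsrr_apply(0xA5FF, bsrr_word(0x3C, 0x00FF)) gives 0xA53C: the low byte becomes the new data and the upper eight pins are untouched.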

This of course works with 2 or 3 bits too, so I'd think it could be faster for bit-banged interfaces if needed. Plus, unlike writing the ODR directly, it can't clobber other bits on the port, and I would expect it to be faster than a read-modify-write on the ODR, although I haven't actually tested that.
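
For narrower fields that don't start at pin 0 (say a 4-bit LCD nibble on pins 4 to 7), something like the following could compute the BSRR word. This is only a sketch: bsrr_field is a hypothetical name, not part of libopencm3, and it relies on set bits taking priority over clear bits for the same pin.

```c
#include <stdint.h>

/* Hypothetical helper: BSRR word that writes `value` into the pins selected
 * by `mask`, where the field's lowest pin is bit `shift` of the port.
 * Every pin in the field is requested cleared; the pins that should be 1
 * are also requested set, and set wins. Bits outside `mask` are dropped. */
static uint32_t bsrr_field(uint16_t mask, unsigned shift, uint16_t value)
{
    uint16_t set = (uint16_t)(((uint32_t)value << shift) & mask);
    return (uint32_t)set | ((uint32_t)mask << 16);
}
```

On the target the result would be written directly, e.g. GPIO_BSRR(port) = bsrr_field(0x00F0, 4, 0x9); which works out to the word 0x00F00090.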