What is pte-highmem

pte-highmem is a feature that allows pagetables to be allocated in highmem (above 800M physical). This is a critical requirement for highmem servers that map plenty of shared memory, such as databases, which would otherwise run out of lowmem when lots of tasks map the SHM.

The pte-highmem feature is included in the SuSE kernels starting from 2.4.18-SuSE in SuSE Linux 8.0.

Why some device drivers need changes to operate correctly with pte-highmem

In theory no device driver should be required to touch (read/write) pagetables by hand. Proper functions like remap_page_range()/ioremap()/vmalloc()/vfree() are meant to deal with pagetables in a manner that is transparent to device drivers.

In practice a few device drivers currently walk kernel pagetables by hand to find the physical pages that back a vmalloc virtual memory area. Those drivers need to find the physical pages in order to set up DMA SG entries, so that devices can transfer data to and from those vmalloc areas.

The reason those drivers were walking pagetables by hand is that common code did not provide a way to resolve a virtual vmalloc page to a physical page, and so all the details of the pagetable handling were exposed to the lowlevel drivers. This has been fixed in the stable >=2.4.19 kernels and in the 2.5.x branch.

Long term solution: vmalloc_to_page()

The new functionality available since 2.4.19 is called vmalloc_to_page(). vmalloc_to_page() takes care internally of all the pagetable handling details, and in turn it hides those details completely from the lowlevel device drivers.

Example of conversion

This is an example of old code from before vmalloc_to_page() was made available (taken from drivers/media/video/meye.c):

[code listing missing]

This is the same code converted to use vmalloc_to_page():

[code listing missing]

NOTE: the above page_address(vmalloc_to_page((void *)adr)) is valid only because the vmalloc space has been allocated with vmalloc_32; otherwise kvirt_to_pa() itself would be broken (it can only return a physical address in the 0-4G range). With pci64 devices it would be better to allocate the vmalloc memory with a plain vmalloc() and to work with 'page structures' instead of physical addresses: that would save some lowmem RAM, and working with page structures is cleaner anyway.
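To make the conversion described in this section more concrete, here is a minimal user-space sketch of what a driver helper built on vmalloc_to_page() can look like. It is not the meye.c code. Everything above the "driver-side" marker is a hypothetical stand-in written only so that the sketch compiles outside the kernel: the constants, the tiny mem_map, the backing_pfn table and the bodies of vmalloc_to_page(), page_address() and __pa() are all assumptions that merely imitate the shape of the kernel helpers.

```c
#include <assert.h>
#include <stddef.h>

/* ---- User-space stand-ins (hypothetical; NOT the kernel definitions) ---- */
#define PAGE_SHIFT   12
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define PAGE_OFFSET  0xC0000000UL   /* model of the lowmem direct mapping base */
#define VMALLOC_BASE 0xE0000000UL   /* model of where the vmalloc area starts */
#define NR_PAGES     4

struct page { int unused; };        /* one descriptor per physical page */
static struct page mem_map[NR_PAGES];

/* vmalloc page i is backed by physical page backing_pfn[i] (scattered on purpose) */
static const unsigned long backing_pfn[NR_PAGES] = { 2, 0, 3, 1 };

/* Stand-in with the same shape as the kernel helper: vmalloc address -> page. */
static struct page *vmalloc_to_page(void *addr)
{
    unsigned long idx = ((unsigned long)addr - VMALLOC_BASE) >> PAGE_SHIFT;

    return idx < NR_PAGES ? &mem_map[backing_pfn[idx]] : NULL;
}

/* Stand-in: kernel virtual address of a lowmem page. */
static void *page_address(struct page *page)
{
    return (void *)(PAGE_OFFSET + ((unsigned long)(page - mem_map) << PAGE_SHIFT));
}

/* Stand-in: lowmem kernel virtual address -> physical address. */
#define __pa(kva) ((unsigned long)(kva) - PAGE_OFFSET)

/* ---- Driver-side code: no pagetable access is left in the driver ---- */
static unsigned long kvirt_to_pa(unsigned long adr)
{
    struct page *page = vmalloc_to_page((void *)adr);
    unsigned long kva;

    assert(page != NULL);             /* adr must lie inside the modelled area */
    kva = (unsigned long)page_address(page);
    kva |= adr & (PAGE_SIZE - 1);     /* restore the offset within the page */
    return __pa(kva);
}
```

In this model, kvirt_to_pa(VMALLOC_BASE + 0x10) resolves to offset 0x10 inside physical page 2. The point of the sketch is only that the driver-side function contains no pgd/pmd/pte handling at all.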

Lack of vmalloc_to_page() in the <= 2.4.18 kernels

While the above is the right way to update those drivers for kernels >=2.4.19 and 2.5 so that they work fine with the new pte-highmem feature, the 2.4.18 kernel did not yet provide the vmalloc_to_page() functionality to modules.

In turn the 2.4.18-SuSE kernel (based on 2.4.18) in SuSE Linux 8.0 doesn't yet export a vmalloc_to_page() either.

So for the 2.4.18-SuSE kernel the pagetable handling function in the drivers must be updated to handle the pte-highmem feature. Either that or the function vmalloc_to_page() should be cut and pasted from the 2.4.19/mm/memory.c file to the device driver .c file.

Example of device driver update for the 2.4.18-SuSE kernel

Old non pte-highmem capable code (again from meye.c):

[code listing missing]

Updated pte-highmem capable code (not yet using vmalloc_to_page() because not yet available under 2.4.18):

[code listing missing]

Unified diff:

[diff missing]

Some drivers may also need a:

[line missing]

at the top of the file, in order to compile correctly.

pte_offset_atomic() won't block (no spinning and no scheduler calls inside) and it can be called from any normal kernel context (not from irq/bh context though: touching pagetables from irq/bh would be a bug in the first place). It can be called with spinlocks held because it doesn't block and it doesn't acquire any other lock internally.

pte_offset_atomic() opens a critical section that must be closed by pte_kunmap(). This is the big difference introduced by the pte-highmem design: it requires the closure of the critical section with pte_kunmap().

The driver is not allowed to sleep within the critical section (it is an atomic kmap). For this reason it is recommended to read the contents of the pagetables right after pte_offset_atomic(), and to call pte_kunmap() right after the read. Here is another example covering this case:

[code listing missing]
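The recommended ordering (map, copy the entry, unmap, and only then look at the copy) can be sketched in user space as follows. This is a model under stated assumptions, not the kernel implementation: the pte_t/pmd_t layouts, MODEL_PRESENT, model_pte_index(), model_pte_kva() and the open_kmaps counter are hypothetical stand-ins, and the pte_offset_atomic()/pte_kunmap() macros here only count how many mappings are open instead of performing a real atomic kmap.

```c
#include <assert.h>
#include <stddef.h>

/* ---- User-space stand-ins (hypothetical; NOT the kernel definitions) ---- */
#define PAGE_SHIFT    12
#define PAGE_SIZE     (1UL << PAGE_SHIFT)
#define PAGE_MASK     (~(PAGE_SIZE - 1))
#define PAGE_OFFSET   0xC0000000UL
#define PTRS_PER_PTE  1024
#define MODEL_PRESENT 0x1UL                  /* "present" bit of a model pte */

typedef struct { unsigned long pte; } pte_t;
typedef struct { pte_t *table; } pmd_t;      /* points straight at one pte page */

static int open_kmaps;                       /* atomic mappings currently open */

#define pmd_none(pmd)        ((pmd).table == NULL)
#define pte_present(pte)     (((pte).pte & MODEL_PRESENT) != 0)
#define model_pte_index(adr) (((adr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
/* Model of the pte-highmem pair: map the pte page, then unmap it. */
#define pte_offset_atomic(pmd, adr) \
    (open_kmaps++, (pmd)->table + model_pte_index(adr))
#define pte_kunmap(ptep) do { (void)(ptep); open_kmaps--; } while (0)
/* Stands in for page_address(pte_page(pte)) on a lowmem page. */
#define model_pte_kva(pte)   (PAGE_OFFSET + ((pte).pte & PAGE_MASK))

/* ---- Driver-side walk ---- */
static unsigned long uvirt_to_kva(pmd_t *pmd, unsigned long adr)
{
    unsigned long ret = 0UL;
    pte_t *ptep, pte;

    if (!pmd_none(*pmd)) {
        ptep = pte_offset_atomic(pmd, adr);  /* critical section opens */
        pte = *ptep;                         /* copy the entry immediately */
        pte_kunmap(ptep);                    /* critical section closes */

        /* From here on only the private copy is used. */
        if (pte_present(pte)) {
            ret = model_pte_kva(pte);
            ret |= adr & (PAGE_SIZE - 1);
        }
    }
    assert(open_kmaps == 0);                 /* model-only check: section balanced */
    return ret;
}
```

Because the entry is copied into a local pte_t before pte_kunmap(), nothing after the unmap dereferences ptep, so any later work on the result happens outside the critical section.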

How to write sources that can be compiled cleanly with all kernels out there

With some #ifdef trickery it is possible to write code that compiles cleanly under all kernels out there:

Here is an example:

[code listing missing]

This relies on the fact that pte_offset_atomic() is a preprocessor macro, and that's not going to change.
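One possible form of such a guard is a small compatibility shim keyed on whether pte_offset_atomic is defined as a macro. The sketch below is a user-space model of that idea and is not necessarily the listing this document originally showed. The types, model_pte_index() and the MODEL_PTE_HIGHMEM switch are hypothetical stand-ins: compiling with -DMODEL_PTE_HIGHMEM imitates a pte-highmem kernel, compiling without it imitates an older kernel that only has pte_offset().

```c
#include <assert.h>
#include <stddef.h>

/* ---- User-space stand-ins (hypothetical; NOT the kernel definitions) ---- */
#define PAGE_SHIFT   12
#define PTRS_PER_PTE 1024
typedef struct { unsigned long pte; } pte_t;
typedef struct { pte_t *table; } pmd_t;
#define model_pte_index(adr) (((adr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))

#ifdef MODEL_PTE_HIGHMEM
/* Model of a pte-highmem kernel: the atomic pair exists, as macros. */
static int open_kmaps;
#define pte_offset_atomic(pmd, adr) \
    (open_kmaps++, (pmd)->table + model_pte_index(adr))
#define pte_kunmap(ptep) do { (void)(ptep); open_kmaps--; } while (0)
#else
/* Model of an older kernel: only the plain pte_offset() exists. */
#define pte_offset(pmd, adr) ((pmd)->table + model_pte_index(adr))
#endif

/* ---- Driver-side compatibility shim ---- */
#ifndef pte_offset_atomic
/* No pte-highmem: fall back to pte_offset() and make the unmap a no-op. */
#define pte_offset_atomic(pmd, adr) pte_offset(pmd, adr)
#define pte_kunmap(ptep) do { (void)(ptep); } while (0)
#endif

/* The driver body is written once, in the pte-highmem style. */
static pte_t read_pte(pmd_t *pmd, unsigned long adr)
{
    pte_t *ptep, pte;

    assert(pmd->table != NULL);
    ptep = pte_offset_atomic(pmd, adr);
    pte = *ptep;
    pte_kunmap(ptep);
    return pte;
}
```

The #ifndef test works only because the name is a macro; if it were a real function the preprocessor could not see it and the shim would always be compiled in.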

NOTE: in the long term the #ifdef trickery should go away for code clarity, and only the vmalloc_to_page() way should be retained.

Drivers walking pagetables for reasons other than finding the vmalloc() physical pages

In the short term they too can be adapted using pte_offset_atomic()/pte_kunmap(), in the same way as drivers dealing with vmalloc areas on a <= 2.4.18 kernel (as outlined in the previous section).

However, in the long term those drivers would do better to use another common code API provided by the kernel, like map_user_kiobuf() or get_user_pages() (the latter is not yet exported to modules, but it should be ok to export it too if necessary). Then they would no longer depend on the lowlevel details of the memory management, and in turn they would not break so easily in the long term.

vmalloc_to_page() and EXPORT_SYMBOL_GPL()

Originally vmalloc_to_page() was exported to modules using EXPORT_SYMBOL_GPL (this means only GPL modules could use it).

That happened for no good reason and it will be fixed. It doesn't make sense to require non-GPL modules to walk pagetables by hand. A discussion covering this topic extensively can be found on the l-k mailing list in April 2002.

Conclusions

No driver should touch pagetables by hand. Drivers for new >=2.4.19 and 2.5.x kernels will use vmalloc_to_page()/map_user_kiobuf() if necessary. vmalloc_to_page() internally handles the highmem pagetables just fine, and so the pagetable handling becomes completely transparent to the device drivers.

Kernels <= 2.4.18 shipping with the pte-highmem feature (like the 2.4.18-SuSE kernel in SuSE Linux 8.0) don't yet provide the universal vmalloc_to_page() functionality, and so any driver outside the kernel tree that needs to walk pagetables by hand will need a patch of a few lines (pte_offset_atomic/pte_kunmap) to handle the highmem pagetables correctly.

8 April 2002 - Andrea Arcangeli - SuSE