On Windows, VG consists of 1) the launcher, an EXE file, 2) the tools, which are also EXE files, and 3) a core preload DLL that is always loaded into the tool process, plus, for non-trivial tools, a tool-specific preload DLL. This is quite like on Linux, in fact, but the details are very different.
The VG launcher reads the headers of the client program and determines its image base and size. It then starts the tool process in a suspended state. It makes a reservation in the tool process's address space for the client image at its specified base address. (If the tool's image overlaps with the client's, too bad. But that is unlikely. Like on Linux, the tools are linked to have a non-standard image base address.)
After that the tool is allowed to resume, and everything interesting happens in the tool process from then on.
The tool sets up its address space manager, etc. The interesting bits begin when the client program is loaded. The client executable is loaded with LoadLibrary(). Note that, unlike the normal case of loading DLLs with LoadLibrary(), when loading an .exe file the dependent DLLs are not loaded. Read on.
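As a rough illustration of the header reading step, here is a minimal sketch that pulls ImageBase and SizeOfImage out of a PE file held in memory, using the documented PE/COFF header offsets. The function names are made up for this example and this is not the launcher's actual code; the overlap helper only shows the check that the reservation implicitly depends on.

```python
import struct

def read_image_base_and_size(data):
    """Return (ImageBase, SizeOfImage) from the bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("not a DOS/PE image")
    # e_lfanew at offset 0x3C gives the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    opt = e_lfanew + 4 + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic == 0x10B:      # PE32: 32-bit ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", data, opt + 28)
    elif magic == 0x20B:    # PE32+: 64-bit ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", data, opt + 24)
    else:
        raise ValueError("unknown optional header magic")
    # SizeOfImage sits at offset 56 in both header variants.
    (size_of_image,) = struct.unpack_from("<I", data, opt + 56)
    return image_base, size_of_image

def overlaps(base_a, size_a, base_b, size_b):
    """True if the two address ranges share at least one byte."""
    return base_a < base_b + size_b and base_b < base_a + size_a
```

The launcher would then reserve the range starting at the returned base in the suspended tool process.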
The pointer to the main executable base in the PEB (reached through the TEB) is updated to point to the client executable. Also, the command lines (ANSI and Unicode) are updated to strip out the tool name and options for the tool.
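The command line adjustment can be pictured with a small sketch. It assumes the command line has already been split into arguments, that tool options start with "-", and that the first argument not starting with "-" is the client program. Both the function name and the tool name used in the usage note are hypothetical. The real code has to rewrite the ANSI and Unicode command line strings in place, which this does not show.

```python
def strip_tool_args(argv):
    """Drop the tool name and its leading options, keep the client part."""
    i = 1  # argv[0] is the tool itself
    while i < len(argv) and argv[i].startswith("-"):
        i += 1  # skip options meant for the tool
    return argv[i:]
```

For example, a hypothetical invocation `["memcheck.exe", "--leak-check=yes", "client.exe", "-v"]` would leave the client seeing only `["client.exe", "-v"]`.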
The core preload DLL is loaded, and the simulated CPU is started to execute the load_and_run_client() function in it, whose job is to load the dependents of the client executable and then transfer control to it.
Loading an EXE with LoadLibrary() doesn't take care of loading its imported DLLs. (Unlike the normal case, i.e. loading a DLL with LoadLibrary().) But import handling code is relatively trivial, and it is also done in load_and_run_client(). Thus it is executed on the simulated CPU, and the plain LoadLibrary() calls in it eventually turn up as system calls that VG notices. VG code does not have to do any PE rebasing, call DLL entry points in the right order, etc. itself at all.
Win32k system calls are those numbered 0x1000 and up, which are handled by the Win32 subsystem, win32k.sys, and not by the kernel proper, ntoskrnl.exe. (The actual kernel image file name differs depending on the hardware capabilities of the machine detected when installing Windows.)
The kernel system calls are easy to enumerate, as all the Nt* entries exported from ntdll.dll are simple system call wrappers.
For Win32 calls something more devious is needed, and even then we can never be sure to enumerate them all. Fortunately it should be possible, by running some typical client programs, to find out which Win32k system calls we need to be interested in. Knock on wood...
For instance, GetDC in user32.dll is a stub that just does (WIN32 example):
    0x7e4186c7 <USER32!GetDC>:    mov $0x1191,%eax
    0x7e4186cc <USER32!GetDC+5>:  mov $0x7ffe0300,%edx
    0x7e4186d1 <USER32!GetDC+10>: call *(%edx)
    0x7e4186d3 <USER32!GetDC+12>: ret $0x4
Note that 0x7ffe0300 is the same memory location used by the ntdll.dll system call wrappers. It holds a pointer to KiFastSystemCall:
    0x7c90e4f0 <ntdll!KiFastSystemCall>:   mov %esp,%edx
    0x7c90e4f2 <ntdll!KiFastSystemCall+2>: sysenter
So it is relatively easy to get the numbers for those Win32k system calls that are simply wrapped in this manner: just look at the instructions of the exported wrapper function, GetDC in this case. But there are many more Win32k system calls.
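Looking at the instructions can be automated. The sketch below matches the first bytes of an exported function against the GetDC stub pattern shown above (`mov $imm32,%eax` is encoded as B8 followed by the immediate, `mov $imm32,%edx` as BA, `call *(%edx)` as FF 12) and returns the system call number. The function names are invented for this example, and it only recognizes this exact 32-bit stub shape.

```python
import struct

SYSCALL_POINTER = 0x7FFE0300  # location loaded into %edx in the GetDC stub

def syscall_number_from_stub(code):
    """Return the system call number if code starts with the stub pattern, else None."""
    if len(code) < 12:
        return None
    if code[0] != 0xB8 or code[5] != 0xBA:      # mov $imm32,%eax ; mov $imm32,%edx
        return None
    (number,) = struct.unpack_from("<I", code, 1)
    (pointer,) = struct.unpack_from("<I", code, 6)
    if pointer != SYSCALL_POINTER:
        return None
    if code[10:12] != b"\xff\x12":              # call *(%edx)
        return None
    return number

def is_win32k_syscall(number):
    """Win32k system calls are those numbered 0x1000 and up."""
    return number >= 0x1000
```

Fed the fifteen bytes of the GetDC stub, this yields 0x1191, which falls in the Win32k range.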
To properly enumerate these we need to use a different approach. Microsoft makes symbol files for win32k.sys publicly available, for use in debugging. Using the dbghelp API it is possible to locate the initialized system call table of win32k.sys in the PE file, and to look up the names of the internal, but systematically named, system call handlers that its entries point to. (It's not possible to peek into the running kernel, of course.)
If a symbol file for win32k.sys is not available, VG can't properly recognize Win32 system calls, and does a best effort only.
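The naming step could look roughly like the sketch below. It assumes the table entries have already been read from the file as a list of handler addresses, that the symbol lookup (done through dbghelp in reality) is available as a plain address-to-name mapping, and that entry i of the table corresponds to system call number 0x1000 + i. The function name, the addresses, and the handler name in the usage note are illustrative only.

```python
def name_win32k_syscalls(table, symbols, first_number=0x1000):
    """Map system call numbers to handler names.

    table   -- handler addresses in table order
    symbols -- address -> symbol name (stand-in for a dbghelp lookup)
    Entries without a symbol map to None, mirroring the best-effort case.
    """
    return {first_number + i: symbols.get(addr) for i, addr in enumerate(table)}
```

With a made-up table `[0xBF800100, 0xBF800200]` and symbols `{0xBF800100: "NtGdiExample"}`, the result maps 0x1000 to "NtGdiExample" and 0x1001 to None.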
Callbacks from kernel to user code are used for lots of things, like exception handling and "events" from the windowing system. It is essential to handle them. Note that there is an important difference from Unix here: on Windows a system call can cause a callback to user code, and that user code can perform a system call, which causes another callback, etc. When these callbacks return to kernel code, the system call continues, and so on. I.e. there is no concept of "interrupted" system calls that don't finish. Clearly some kind of stack of system call contexts will be needed.
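One way to picture such a stack is the sketch below. The class and method names are invented for illustration, and what exactly a saved context needs to contain is left open; a plain value stands in for it here.

```python
class SyscallContextStack:
    """Tracks system calls that are in progress, innermost last."""

    def __init__(self):
        self._stack = []

    def enter_syscall(self, number, saved_state):
        # Called when client code (possibly inside a callback) makes a system call.
        self._stack.append((number, saved_state))

    def leave_syscall(self):
        # Called when the innermost system call finally returns.
        return self._stack.pop()

    def depth(self):
        return len(self._stack)
```

A system call made from inside a callback is pushed on top of the one that triggered the callback, and contexts are popped in reverse order as the calls complete.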
To avoid linking the tools to various complex DLLs that drag in even more DLLs, the launcher process is used as a helper. The tool and launcher communicate using a simple protocol that uses a couple of events and a shared memory area.
The launcher creates two events and a shared memory area, as inheritable. The handles to the events are stored at the beginning of the shared memory. When the tool is running, the launcher is waiting for the first event to fire. When it is fired by the tool, the shared memory contains a request for the launcher. When the launcher has fulfilled the request, the result is stored in the shared memory and the second event is fired. Simple. The start address of the shared memory area is passed to the tool in the environment variable VALGRIND_IPC_SHMEM.
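The handshake can be modelled in a few lines, with threads standing in for the two processes, `threading.Event` for the Win32 events and a dictionary for the shared memory area. All names here, including the "quit" request, are hypothetical. The sketch only illustrates the ordering of the two events, not the actual request format.

```python
import threading

class Channel:
    def __init__(self):
        self.request_ready = threading.Event()  # first event: fired by the tool
        self.reply_ready = threading.Event()    # second event: fired by the launcher
        self.shared = {}                        # stand-in for the shared memory area

def launcher_loop(ch, handlers):
    """Serve requests until a 'quit' request arrives."""
    while True:
        ch.request_ready.wait()
        ch.request_ready.clear()
        name, arg = ch.shared["request"]
        if name == "quit":
            ch.shared["reply"] = None
            ch.reply_ready.set()
            return
        ch.shared["reply"] = handlers[name](arg)  # fulfil the request
        ch.reply_ready.set()

def tool_call(ch, name, arg=None):
    """Store a request, fire the first event, wait for the second."""
    ch.shared["request"] = (name, arg)
    ch.request_ready.set()
    ch.reply_ready.wait()
    ch.reply_ready.clear()
    return ch.shared["reply"]
```

Because the tool always waits for the reply before issuing the next request, the two events are enough and no further locking is needed in this single-requester model.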