From: Jan-Benedict G. <jb...@he...> - 2006-04-09 20:15:45
 Documentation/x86_64/boot-options.txt      |    5
 arch/i386/kernel/acpi/boot.c               |    5
 arch/i386/kernel/apic.c                    |   23 ++-
 arch/i386/kernel/cpu/cpufreq/powernow-k8.c |   18 ++-
 arch/i386/kernel/mpparse.c                 |   21 ---
 arch/i386/kernel/reboot_fixups.c           |    2
 arch/i386/kernel/setup.c                   |   36 +++++-
 arch/i386/mm/init.c                        |    2
 arch/i386/pci/direct.c                     |    9 +
 arch/i386/pci/mmconfig.c                   |   67 +++++++----
 arch/x86_64/Kconfig                        |    5
 arch/x86_64/Makefile                       |   24 ++--
 arch/x86_64/defconfig                      |   42 ++++---
 arch/x86_64/ia32/ia32entry.S               |   23 ++-
 arch/x86_64/kernel/aperture.c              |    2
 arch/x86_64/kernel/e820.c                  |   36 +++++-
 arch/x86_64/kernel/entry.S                 |   28 +---
 arch/x86_64/kernel/mce.c                   |    8 +
 arch/x86_64/kernel/nmi.c                   |    7 +
 arch/x86_64/kernel/pci-dma.c               |    2
 arch/x86_64/kernel/process.c               |   10 +
 arch/x86_64/kernel/setup.c                 |    4
 arch/x86_64/kernel/time.c                  |    4
 arch/x86_64/kernel/vmlinux.lds.S           |    2
 arch/x86_64/kernel/x8664_ksyms.c           |    2
 arch/x86_64/mm/init.c                      |   37 +++++-
 arch/x86_64/mm/numa.c                      |   46 ++++++-
 arch/x86_64/mm/srat.c                      |  170 +++++++++++++++++++++++++++--
 arch/x86_64/pci/mmconfig.c                 |   53 ++++++---
 drivers/acpi/Kconfig                       |    2
 include/asm-i386/apic.h                    |    2
 include/asm-i386/e820.h                    |    4
 include/asm-i386/hpet.h                    |    1
 include/asm-x86_64/e820.h                  |    3
 include/asm-x86_64/hpet.h                  |    2
 include/asm-x86_64/ia32_unistd.h           |    2
 include/asm-x86_64/mce.h                   |    7 +
 include/asm-x86_64/numa.h                  |    2
 include/asm-x86_64/numnodes.h              |    2
 include/linux/bootmem.h                    |    1
 include/linux/init.h                       |    3
 include/linux/jiffies.h                    |    6 +
 include/linux/memory_hotplug.h             |   14 +-
 kernel/timer.c                             |    2
 mm/bootmem.c                               |    9 +
 security/selinux/xfrm.c                    |    4
 46 files changed, 575 insertions(+), 184 deletions(-)

New commits:
commit ebe9c645f0738c1254e7b6208e3dec8a1b9be2ba
Merge: 6b61940... 6764472...
Author: Jan-Benedict Glaw <jb...@d2...>
Date:   Sun Apr 9 22:12:03 2006 +0200

    Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

commit 67644726317a8274be4a3d0ef85b9ccebaa90304
Author: Dave Jones <da...@re...>
Date:   Sun Apr 2 23:34:19 2006 -0700

    [SELINUX] Fix build after ipsec decap state changes.

    security/selinux/xfrm.c: In function 'selinux_socket_getpeer_dgram':
    security/selinux/xfrm.c:284: error: 'struct sec_path' has no member named 'x'
    security/selinux/xfrm.c: In function 'selinux_xfrm_sock_rcv_skb':
    security/selinux/xfrm.c:317: error: 'struct sec_path' has no member named 'x'

    Signed-off-by: Dave Jones <da...@re...>
    Signed-off-by: David S. Miller <da...@da...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 66004a6ca23f2a2408b32cbe27fda0389fb8f9dc
Author: Linus Torvalds <tor...@g5...>
Date:   Sun Apr 9 12:14:02 2006 -0700

    Move request_standard_resources() back to before PCI probing

    This effectively undoes the PCI resource allocation changes done in
    commit b408cbc704352eccee301e1103b23203ba1c3a0e, but leaves the
    cleanups of that commit in place.

    We're going back to marking the resources reported by e820 busy
    _before_ doing PCI probing, so that any PCI resource that clashes with
    the BIOS-reported memory map will be relocated to a non-clashing area.

    The reason? Larry Finger reports that his laptop has the cardbus
    controller set up by the BIOS so that it conflicts with the e820
    memory map, and needs to be relocated. See

        http://bugzilla.kernel.org/show_bug.cgi?id=6337

    for more details.

    We'll have to work out how to handle the fbcon problem that caused
    that commit in the first place in some other way.

    Cc: Ivan Kokshaysky <in...@ju...>
    Cc: Greg Kroah-Hartman <gr...@su...>
    Cc: Antonino A.
    Daplas <ad...@po...>
    Cc: <bj...@lu...>
    Tested-by: Larry Finger <Lar...@lw...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit b8feb47f992d314c956add15c1118430120635bb
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:34 2006 +0200

    [PATCH] x86_64: Update 32-bit system call table

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 67d53ea5a3d42aadeb1584e757ca4660c0e8a810
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:31 2006 +0200

    [PATCH] x86_64: Eliminate IA32_NR_syscalls define

    Or rather compute it based on the table length automatically.

    This also has the intended side effect of not warning for new system
    calls anymore.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit bbd3aff89d4b34ef17a748e4c001ecc5b43e3e55
Author: Sam Ravnborg <sa...@ra...>
Date:   Fri Apr 7 19:50:28 2006 +0200

    [PATCH] x86_64: fix CONFIG_REORDER

    Fix CONFIG_REORDER.
    The value of cflags-y was assigned to CFLAGS before cflags-y was
    assigned the value used for CONFIG_REORDER.

    Use cflags-y for all CFLAGS options in the Makefile to avoid this
    happening again.

    Signed-off-by: Sam Ravnborg <sa...@ra...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 97c2803c9c694cafbd9f5e43a25903e0abf25188
Author: John Blackwood <joh...@cc...>
Date:   Fri Apr 7 19:50:25 2006 +0200

    [PATCH] x86_64: Plug GS leak in arch_prctl()

    In linux-2.6.16, we have noticed a problem where the gs base value
    returned from an arch_prctl(ARCH_GET_GS, ...) call will be incorrect
    if:

    - the current/calling task has NOT set its own gs base yet to a
      non-zero value,

    - some other task that ran on the same processor previously set their
      own gs base to a non-zero value.

    In this situation, the ARCH_GET_GS code will read and return the
    MSR_KERNEL_GS_BASE msr register.
    However, since the __switch_to() code does NOT load/zero the
    MSR_KERNEL_GS_BASE register when the task that is switched IN has a
    zero next->gs value, the caller of arch_prctl(ARCH_GET_GS, ...) will
    get back the value of some previous task's gs base value instead of 0.

    Change the arch_prctl() ARCH_GET_GS code to only read and return the
    MSR_KERNEL_GS_BASE msr register if the 'gs' register of the calling
    task is non-zero.

    Side note: Since in addition to using arch_prctl(ARCH_SET_GS, ...), a
    task can also setup a gs base value by using modify_ldt() and write an
    index value into 'gs' from user space, the patch below reads 'gs'
    instead of using thread.gs, since in the modify_ldt() case, the
    thread.gs value will be 0, and an incorrect value would be returned
    (the task->thread.gs value).

    When the user has not set its own gs base value and the 'gs' register
    is zero, then the MSR_KERNEL_GS_BASE register will not be read and a
    value of zero will be returned by reading and returning
    'task->thread.gs'.

    The first patch shown below is an attempt at implementing this
    approach.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit e48c4729d23a026f3711d5e36add5cce894b4913
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:21 2006 +0200

    [PATCH] i386: Remove printk about reboot fixups at reboot

    Printk doesn't have any value

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit b20367a6c2a0cd937cb1f0a8cf848f1402fef99c
Author: Jordan Hargrave <jor...@de...>
Date:   Fri Apr 7 19:50:18 2006 +0200

    [PATCH] x86_64: Fix drift with HPET timer enabled

    If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
    This is due to the HPET timer not being initialized with the correct
    setting (still using PIT count).

    If HZ changes, this drift can become even more pronounced.

    HPET patch initializes tick_nsec with correct tick_nsec settings for
    HPET timer.

    Vojtech comments: "It's not entirely correct (it assumes the HPET
    ticks totally exactly), but it's significantly better than assuming
    the PIT error there."

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 49c93e84d8b2d602a07c302c7e3cd4fa09095fbb
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:15 2006 +0200

    [PATCH] i386/x86-64: Return defined error value for bad PCI config space accesses

    Mostly to get better handling when an extended config space access has
    to fall back to Type1.

    Cc: gr...@su...
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 8c30b1a74aed4041f183e183a149b7dfbdc6c20e
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:12 2006 +0200

    [PATCH] i386/x86_64: Check if MCFG works for the first 16 busses

    Previously only the first bus would be checked against Type 1.

    Why 16? Checking all would need too much memory and we can assume that
    systems with more than 16 busses have better than average quality
    BIOS.

    This is an additional defense against bad MCFG tables.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit e405d067298b2b960bf20318e91ed842157c65bc
Author: Ravikiran G Thirumalai <ki...@sc...>
Date:   Fri Apr 7 19:50:09 2006 +0200

    [PATCH] x86_64: Fixup read_mostly section on internode cache line size for vSMP

    Fixup the read mostly section to start at internode cacheline boundary.

    Signed-off-by: Ravikiran Thirumalai <ki...@sc...>
    Signed-off-by: Shai Fultheim <sh...@sc...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 3d34ee6891e274dfb6a22930546d37738cdbe9c4
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:06 2006 +0200

    [PATCH] x86_64: Don't return error for HPET initialization in initcall

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit ac04dcaf6f567307fbeef9c3c1fff35280e53f02
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:03 2006 +0200

    [PATCH] x86_64: Don't export strlen twice

    Fix

        WARNING: vmlinux: 'strlen' exported twice. Previous export was in vmlinux

    Reported by Mats Johannesson

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 7bf36bbc5e0c09271f9efe22162f8cc3f8ebd3d2
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:50:00 2006 +0200

    [PATCH] x86_64: When user could have changed RIP always force IRET

    Intel EM64T CPUs handle uncanonical return addresses differently from
    AMD CPUs.

    The exception is reported in the SYSRET, not the next instruction.
    This leads to the kernel exception handler running on the user stack
    with the wrong GS because the kernel didn't expect exceptions on this
    instruction.

    This version of the patch has the teething problems that plagued an
    earlier version fixed.

    This is CVE-2006-0744

    Thanks to Ernie Petrides and Asit B. Mallick for analysis and initial
    patches.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 553f265fe883a23502ee351845f09334790f18b8
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:57 2006 +0200

    [PATCH] x86_64: Don't run NMI watchdog during machine checks

    Machine checks can stall the machine for a long time and it's not good
    to trigger the nmi watchdog during that.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit be56db6186999a8571ae480cf2b929578f6dfd68
Author: Dave Hansen <hav...@us...>
Date:   Fri Apr 7 19:49:54 2006 +0200

    [PATCH] x86_64: extra NODES_SHIFT definition

    The generic linux/numa.h file defines NODES_SHIFT to 0 in case the
    architecture did not.

    Every architecture which has a NUMA config option defines NODES_SHIFT
    in its asm-$ARCH headers, but only if NUMA is enabled, except for
    x86_64.

    This should make it like all the rest.

    Signed-off-by: Dave Hansen <hav...@us...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 4211a30349e8d2b724cfb4ce2584604f5e59c299
Author: Jacob Shin <jac...@am...>
Date:   Fri Apr 7 19:49:51 2006 +0200

    [PATCH] x86_64: Proper null pointer check in powernow_k8_get

    This prevents crashes on dual core system when enough ticks are lost.

    Replaces earlier patch by me.

    Cc: Dave Jones <da...@re...>
    Signed-off-by: Thomas Renninger <tr...@su...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit d7fa706ce2c29cb751c15ca00f3aa7b223e3c9f0
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:48 2006 +0200

    [PATCH] x86_64: Revert earlier powernow-k8 change

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 95d769aaf47abfc77b600631403ff5af6c990cff
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:45 2006 +0200

    [PATCH] i386: Consolidate modern APIC handling

    AMD systems have a modern APIC that supports 8 bit IDs, but don't have
    a XAPIC version number. Add a new "modern_apic" subfunction that
    handles this correctly and use it (nearly) everywhere where XAPIC is
    tested for.

    I removed one wart: the code specified that external APICs would use
    an 8bit APIC ID. But I checked a real 82093 data sheet and it says
    clearly that they only use 4bit. So I removed this special case since
    it would be a bit awkward to implement now.

    I removed the valid APIC tests in mptable parsing completely. On any
    modern system they only check against the full field width (8bit)
    anyways and are no-ops. This also fixes them doing the wrong thing on
    >8 core Opterons.

    This makes i386 boot again on 16 core Opterons.

    Cc: Ingo Molnar <mi...@el...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit d1530d82e02fd96d4634a6d6f6538c8b778c43af
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:42 2006 +0200

    [PATCH] x86_64: Clear APIC feature bit when local APIC is disabled

    Needed for other checks later in ACPI.

    Pointed out by Len Brown

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit d3b6a349d233aecf2c52f7f4c150ca09f684f2d8
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:39 2006 +0200

    [PATCH] x86-64/i386: Don't process APICs/IO-APICs in ACPI when APIC is disabled.

    When nolapic was passed or the local APIC was disabled for another
    reason ACPI would still parse the IO-APICs until these were explicitly
    disabled with noapic. Usually this resulted in a non booting
    configuration unless "nolapic noapic" was used.

    I also disabled the local APIC parsing in this case, although that's
    only cosmetic (suppresses a few printks)

    This hopefully makes nolapic work in all cases.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit ec0f08eeea6ac1d8c925f47e3677e4c985fd8f63
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:36 2006 +0200

    [PATCH] x86_64: Don't sanity check Type 1 PCI bus access on newer systems

    Horus systems don't have anything on bus 0 which makes the Type 1
    sanity checks fail. Use the DMI BIOS year to check for newer systems
    and always assume Type 1 works on them.

    I used 2001 as a pretty arbitrary cutoff year.

    Cc: gr...@su...
    Cc: Navin Boppuri <nav...@ne...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit fa47dd0ba303599f8adf8d8336ed2fb74efc47c5
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:33 2006 +0200

    [PATCH] x86_64: Fix compilation with CONFIG_PCI=n / allnoconfig

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 946f2ee5c7312e8acac4f3ab6629e7e2d36a3646
Author: Arjan van de Ven <ar...@li...>
Date:   Fri Apr 7 19:49:30 2006 +0200

    [PATCH] i386/x86-64: Check that MCFG points to an e820 reserved area

    This patch introduces a user for the e820_all_mapped function: There
    have been several machines that don't have a working MMCONFIG, often
    because of a buggy MCFG table in the ACPI bios. This patch adds a
    simple sanity check that detects a whole bunch of these cases, and
    when it detects it, linux now boots rather than crash-and-burns.

    The accuracy of this detection can in principle be improved if there
    was a "is this entire range in e820 with THIS attribute", but no such
    function exists and the complexity needed for this is not really worth
    it; this simple check already catches most cases anyway.

    Signed-off-by: Arjan van de Ven <ar...@li...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 952223683ec989e86328c24808fdb962c4dbeb0a
Author: Arjan van de Ven <ar...@li...>
Date:   Fri Apr 7 19:49:27 2006 +0200

    [PATCH] x86_64: Introduce e820_all_mapped

    Introduce a e820_all_mapped() function which checks if the entire
    range <start,end> is mapped with type.

    This is done by moving the local start variable to the end of each
    known-good region; if at the end of the function the start address is
    still before end, there must be a part that's not of the correct type;
    otherwise it's a good region.
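
The coverage walk described in the e820_all_mapped commit message can be
sketched as a small user-space C program. This is only an illustration
under stated assumptions: struct sketch_entry, sketch_all_mapped(), the
SKETCH_* type values and the sample map are hypothetical stand-ins, not
the kernel's e820 definitions, and the addresses are invented to resemble
a 256 MB window at 0xe0000000.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's e820 types and entries. */
enum { SKETCH_RAM = 1, SKETCH_RESERVED = 2 };

struct sketch_entry {
	uint64_t addr;   /* start of the region */
	uint64_t size;   /* length of the region in bytes */
	unsigned type;   /* SKETCH_RAM or SKETCH_RESERVED */
};

/*
 * Return 1 if [start, end) is completely covered by entries of the
 * given type (type 0 matches any type), otherwise 0.
 * Assumes the map is sorted by address and its entries do not overlap.
 */
static int sketch_all_mapped(const struct sketch_entry *map, size_t n,
			     uint64_t start, uint64_t end, unsigned type)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct sketch_entry *ei = &map[i];

		if (type && ei->type != type)
			continue;
		/* Skip entries that do not intersect [start, end). */
		if (ei->addr >= end || ei->addr + ei->size <= start)
			continue;
		/* The entry covers the current start: advance past it. */
		if (ei->addr <= start)
			start = ei->addr + ei->size;
		/* Nothing left uncovered. */
		if (start >= end)
			return 1;
	}
	return 0;
}

/* Invented sample map, sorted and non-overlapping. */
static const struct sketch_entry sketch_map[] = {
	{ 0x00000000, 0x0009f000, SKETCH_RAM },
	{ 0x000f0000, 0x00010000, SKETCH_RESERVED },
	{ 0x00100000, 0x3ff00000, SKETCH_RAM },
	{ 0xe0000000, 0x10000000, SKETCH_RESERVED },
};
```

With this sample map, a query for [0xe0000000, 0xf0000000) with
SKETCH_RESERVED is fully covered, while extending the end by one byte is
not, because no entry of that type follows the last one. Passing type 0
lets adjacent entries of different types combine to cover a range.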

    Signed-off-by: Arjan van de Ven <ar...@li...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit eee5a9fa63c97366cdea6ab3aa2ed9e3601812d0
Author: Arjan van de Ven <ar...@li...>
Date:   Fri Apr 7 19:49:24 2006 +0200

    [PATCH] x86_64: Rename e820_mapped to e820_any_mapped

    Rename e820_mapped to e820_any_mapped since it tests if any part of
    the range is mapped according to the type.

    Later steps will introduce e820_all_mapped which will check if the
    entire range is mapped with the type. Both have their merit.

    Signed-off-by: Arjan van de Ven <ar...@li...>
    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit a8062231d80239cf3405982858c02aea21a6066a
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:21 2006 +0200

    [PATCH] x86_64: Handle empty PXMs that only contain hotplug memory

    The node setup code would try to allocate the node metadata in the
    node itself, but that fails if there is no memory in there.

    This can happen with memory hotplug when the hotplug area defines a so
    far empty node.

    Now use bootmem to try to allocate the mem_map in other nodes.

    And if it fails don't panic, but just ignore the node.

    To make this work I added a new __alloc_bootmem_nopanic function that
    does what its name implies.

    TBD should try to use nearby nodes here. Currently we just use any.
    It's hard to do it better because bootmem doesn't have proper fallback
    lists yet.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 68a3a7feb08f960095072f28ec20f7900793c506
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:18 2006 +0200

    [PATCH] x86_64: Reserve SRAT hotadd memory on x86-64

    From: Keith Mannthey, Andi Kleen

    Implement memory hotadd without sparsemem. The memory in the SRAT
    hotadd area is just preserved instead and can be activated later.

    There are a few restrictions:
    - Only one continuous hotadd area allowed per node

    The main problem is dealing with the many buggy SRAT tables that are
    out there. The strategy here is to reject anything suspicious.

    Originally from Keith Mannthey, with several hacks and changes by AK
    and also contributions from Andrew Morton

    [ TBD: Problems pointed out by KAMEZAWA Hiroyuki <kam...@jp...>:

    1) Goto's rebuild_zonelist patch will not work if
       CONFIG_MEMORY_HOTPLUG=n. Rebuilding zonelist is necessary when the
       system has just memory < 4G at boot, and hot add memory > 4G.
       because x86_64 has DMA32, ZONE_NORMAL is not included into zonelist
       at boot time if system doesn't have memory >4G at boot.
       [AK: should just force the higher zones at boot time when SRAT
       tells us]

    2) zone and node's spanned_pages and present_pages are not
       incremented. They should be.

    For example, our server (ia64/Fujitsu PrimeQuest) can equip memory
    from 4G to 1T(maybe 2T in future), and SRAT will *always* say we have
    possible 1T memory. (Microsoft requires "write all possible memory in
    SRAT") When we reserve memmap for possible 1T memory, Linux will not
    work well in minimum 4G configuration ;)
    [AK: needs limiting to 5-10% of max memory]
    ]

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 9d99aaa31f5994d1923c3713ce9144c4c42332e1
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:15 2006 +0200

    [PATCH] x86_64: Support memory hotadd without sparsemem

    Memory hotadd doesn't need SPARSEMEM, but can be handled by just
    preallocating mem_maps. This only needs some untangling of ifdefs to
    enable the necessary code even without SPARSEMEM.

    Originally from Keith Mannthey, hacked by AK.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 805e8c03c9ea9bdb402a36341e02ec24825d5417
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:12 2006 +0200

    [PATCH] x86_64: Clean up execve path

    Just call IRET always, no need for any special cases.

    Needed for the next bug fix.

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

commit 903fcc608e9f531749024172277dc2fd15d5a587
Author: Andi Kleen <ak...@su...>
Date:   Fri Apr 7 19:49:09 2006 +0200

    [PATCH] x86_64: Update defconfig

    Signed-off-by: Andi Kleen <ak...@su...>
    Signed-off-by: Linus Torvalds <tor...@os...>

diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt
index 1921353..f2cd6ef 100644
--- a/Documentation/x86_64/boot-options.txt
+++ b/Documentation/x86_64/boot-options.txt
@@ -151,6 +151,11 @@ NUMA
   numa=fake=X
 		Fake X nodes and ignore NUMA setup of the actual machine.
 
+  numa=hotadd=percent
+		Only allow hotadd memory to preallocate page structures upto
+		percent of already available memory.
+		numa=hotadd=0 will disable hotadd memory.
+
 ACPI
 
   acpi=off	Don't enable ACPI
diff --git a/arch/i386/kernel/acpi/boot.c b/arch/i386/kernel/acpi/boot.c
index 0330661..8dab352 100644
--- a/arch/i386/kernel/acpi/boot.c
+++ b/arch/i386/kernel/acpi/boot.c
@@ -215,7 +215,7 @@ static int __init acpi_parse_madt(unsign
 {
 	struct acpi_table_madt *madt = NULL;
 
-	if (!phys_addr || !size)
+	if (!phys_addr || !size || !cpu_has_apic)
 		return -EINVAL;
 
 	madt = (struct acpi_table_madt *)__acpi_map_table(phys_addr, size);
@@ -751,6 +751,9 @@ static int __init acpi_parse_madt_ioapic
 		return -ENODEV;
 	}
 
+	if (!cpu_has_apic)
+		return -ENODEV;
+
 	/*
 	 * if "noapic" boot option, don't look for IO-APICs
 	 */
diff --git a/arch/i386/kernel/apic.c b/arch/i386/kernel/apic.c
index 6273bf7..254cee9 100644
--- a/arch/i386/kernel/apic.c
+++ b/arch/i386/kernel/apic.c
@@ -62,6 +62,18 @@ int apic_verbosity;
 
 static void apic_pm_activate(void);
 
+int modern_apic(void)
+{
+	unsigned int lvr, version;
+	/* AMD systems use old APIC versions, so check the CPU */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+		boot_cpu_data.x86 >= 0xf)
+		return 1;
+	lvr = apic_read(APIC_LVR);
+	version = GET_APIC_VERSION(lvr);
+	return version >= 0x14;
+}
+
 /*
  *
'what should we do if we get a hw irq event on an illegal vector'.
  * each architecture has to answer this themselves.
@@ -119,10 +131,7 @@ void enable_NMI_through_LVT0 (void * dum
 
 int get_physical_broadcast(void)
 {
-	unsigned int lvr, version;
-	lvr = apic_read(APIC_LVR);
-	version = GET_APIC_VERSION(lvr);
-	if (!APIC_INTEGRATED(version) || version >= 0x14)
+	if (modern_apic())
 		return 0xff;
 	else
 		return 0xf;
@@ -349,9 +358,9 @@ int __init verify_local_APIC(void)
 
 void __init sync_Arb_IDs(void)
 {
-	/* Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1 */
-	unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR));
-	if (ver >= 0x14)	/* P4 or higher */
+	/* Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 8.6.1
+	   And not needed on AMD */
+	if (modern_apic())
 		return;
 	/*
 	 * Wait for idle.
diff --git a/arch/i386/kernel/cpu/cpufreq/powernow-k8.c b/arch/i386/kernel/cpu/cpufreq/powernow-k8.c
index 712a26b..7c0e160 100644
--- a/arch/i386/kernel/cpu/cpufreq/powernow-k8.c
+++ b/arch/i386/kernel/cpu/cpufreq/powernow-k8.c
@@ -46,7 +46,7 @@
 
 #define PFX "powernow-k8: "
 #define BFX PFX "BIOS error: "
-#define VERSION "version 1.60.1"
+#define VERSION "version 1.60.2"
 #include "powernow-k8.h"
 
 /* serialize freq changes */
@@ -55,7 +55,7 @@ static DEFINE_MUTEX(fidvid_mutex);
 static struct powernow_k8_data *powernow_data[NR_CPUS];
 
 #ifndef CONFIG_SMP
-static cpumask_t cpu_core_map[1] = { CPU_MASK_ALL };
+static cpumask_t cpu_core_map[1];
 #endif
 
 /* Return a frequency in MHz, given an input fid */
@@ -910,6 +910,9 @@ static int powernowk8_target(struct cpuf
 	unsigned int newstate;
 	int ret = -EIO;
 
+	if (!data)
+		return -EINVAL;
+
 	/* only run on specific CPU from here on */
 	oldmask = current->cpus_allowed;
 	set_cpus_allowed(current, cpumask_of_cpu(pol->cpu));
@@ -969,6 +972,9 @@ static int powernowk8_verify(struct cpuf
 {
 	struct powernow_k8_data *data = powernow_data[pol->cpu];
 
+	if (!data)
+		return -EINVAL;
+
 	return cpufreq_frequency_table_verify(pol, data->powernow_table);
 }
 
@@
-977,7 +983,7 @@ static int __cpuinit powernowk8_cpu_init
 {
 	struct powernow_k8_data *data;
 	cpumask_t oldmask = CPU_MASK_ALL;
-	int rc, i;
+	int rc;
 
 	if (!cpu_online(pol->cpu))
 		return -ENODEV;
@@ -1063,8 +1069,7 @@ static int __cpuinit powernowk8_cpu_init
 	printk("cpu_init done, current fid 0x%x, vid 0x%x\n",
 	       data->currfid, data->currvid);
 
-	for_each_cpu_mask(i, cpu_core_map[pol->cpu])
-		powernow_data[i] = data;
+	powernow_data[pol->cpu] = data;
 
 	return 0;
 
@@ -1104,6 +1109,9 @@ static unsigned int powernowk8_get (unsi
 	if (!data)
 		return -EINVAL;
 
+	if (!data)
+		return -EINVAL;
+
 	set_cpus_allowed(current, cpumask_of_cpu(cpu));
 	if (smp_processor_id() != cpu) {
 		printk(KERN_ERR PFX "limiting to CPU %d failed in powernowk8_get\n", cpu);
diff --git a/arch/i386/kernel/mpparse.c b/arch/i386/kernel/mpparse.c
index 8d8aa9d..db12017 100644
--- a/arch/i386/kernel/mpparse.c
+++ b/arch/i386/kernel/mpparse.c
@@ -110,21 +110,6 @@ static int __init mpf_checksum(unsigned
 static int mpc_record;
 static struct mpc_config_translation *translation_table[MAX_MPC_ENTRY] __initdata;
 
-#ifdef CONFIG_X86_NUMAQ
-static int MP_valid_apicid(int apicid, int version)
-{
-	return hweight_long(apicid & 0xf) == 1 && (apicid >> 4) != 0xf;
-}
-#else
-static int MP_valid_apicid(int apicid, int version)
-{
-	if (version >= 0x14)
-		return apicid < 0xff;
-	else
-		return apicid < 0xf;
-}
-#endif
-
 static void __devinit MP_processor_info (struct mpc_config_processor *m)
 {
 	int ver, apicid;
@@ -190,12 +175,6 @@ static void __devinit MP_processor_info
 
 	ver = m->mpc_apicver;
 
-	if (!MP_valid_apicid(apicid, ver)) {
-		printk(KERN_WARNING "Processor #%d INVALID.
(Max ID: %d).\n",
-			m->mpc_apicid, MAX_APICS);
-		return;
-	}
-
 	/*
 	 * Validate version
 	 */
diff --git a/arch/i386/kernel/reboot_fixups.c b/arch/i386/kernel/reboot_fixups.c
index 10e21a4..99aab41 100644
--- a/arch/i386/kernel/reboot_fixups.c
+++ b/arch/i386/kernel/reboot_fixups.c
@@ -51,7 +51,5 @@ void mach_reboot_fixups(void)
 
 		cur->reboot_fixup(dev);
 	}
-
-	printk(KERN_WARNING "No reboot fixup found for your hardware\n");
 }
 
diff --git a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
index eacc3f0..80cb3b2 100644
--- a/arch/i386/kernel/setup.c
+++ b/arch/i386/kernel/setup.c
@@ -963,6 +963,36 @@ efi_memory_present_wrapper(unsigned long
 	return 0;
 }
 
+ /*
+  * This function checks if the entire range <start,end> is mapped with type.
+  *
+  * Note: this function only works correct if the e820 table is sorted and
+  * not-overlapping, which is the case
+  */
+int __init
+e820_all_mapped(unsigned long start, unsigned long end, unsigned type)
+{
+	int i;
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		if (type && ei->type != type)
+			continue;
+		/* is the region (part) in overlap with the current region ?*/
+		if (ei->addr >= end || ei->addr + ei->size <= start)
+			continue;
+		/* if the region is at the beginning of <start,end> we move
+		 * start to the end of the region since it's ok until there
+		 */
+		if (ei->addr <= start)
+			start = ei->addr + ei->size;
+		/* if start is now at or beyond end, we're done, full
+		 * coverage */
+		if (start >= end)
+			return 1; /* we're done */
+	}
+	return 0;
+}
+
 /*
  * Find the highest page frame number we have available
  */
@@ -1317,8 +1347,8 @@ legacy_init_iomem_resources(struct resou
 /*
  * Request address space for all standard resources
  *
- * This is called just before pcibios_assign_resources(), which is also
- * an fs_initcall, but is linked in later (in arch/i386/pci/i386.c).
+ * This is called just before pcibios_init(), which is also a
+ * subsys_initcall, but is linked in later (in arch/i386/pci/common.c).
  */
 static int __init request_standard_resources(void)
 {
@@ -1339,7 +1369,7 @@ static int __init request_standard_resou
 	return 0;
 }
 
-fs_initcall(request_standard_resources);
+subsys_initcall(request_standard_resources);
 
 static void __init register_memory(void)
 {
diff --git a/arch/i386/mm/init.c b/arch/i386/mm/init.c
index 9f66ac5..ae6534a 100644
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -651,6 +651,7 @@ void __init mem_init(void)
  * Specifically, in the case of x86, we will always add
  * memory to the highmem for now.
  */
+#ifdef CONFIG_HOTPLUG_MEMORY
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 int add_memory(u64 start, u64 size)
 {
@@ -667,6 +668,7 @@ int remove_memory(u64 start, u64 size)
 	return -EINVAL;
 }
 #endif
+#endif
 
 kmem_cache_t *pgd_cache;
 kmem_cache_t *pmd_cache;
diff --git a/arch/i386/pci/direct.c b/arch/i386/pci/direct.c
index 99012b9..0659ced 100644
--- a/arch/i386/pci/direct.c
+++ b/arch/i386/pci/direct.c
@@ -4,6 +4,7 @@
 
 #include <linux/pci.h>
 #include <linux/init.h>
+#include <linux/dmi.h>
 #include "pci.h"
 
 /*
@@ -18,8 +19,10 @@ int pci_conf1_read(unsigned int seg, uns
 {
 	unsigned long flags;
 
-	if (!value || (bus > 255) || (devfn > 255) || (reg > 255))
+	if (!value || (bus > 255) || (devfn > 255) || (reg > 255)) {
+		*value = -1;
 		return -EINVAL;
+	}
 
 	spin_lock_irqsave(&pci_config_lock, flags);
 
@@ -188,6 +191,10 @@ static int __init pci_sanity_check(struc
 
 	if (pci_probe & PCI_NO_CHECKS)
 		return 1;
+	/* Assume Type 1 works for newer systems.
+	   This handles machines that don't have anything on PCI Bus 0.
*/
+	if (dmi_get_year(DMI_BIOS_DATE) >= 2001)
+		return 1;
 
 	for (devfn = 0; devfn < 0x100; devfn++) {
 		if (o->read(0, 0, devfn, PCI_CLASS_DEVICE, 2, &x))
diff --git a/arch/i386/pci/mmconfig.c b/arch/i386/pci/mmconfig.c
index 6137890..f77d7f8 100644
--- a/arch/i386/pci/mmconfig.c
+++ b/arch/i386/pci/mmconfig.c
@@ -12,14 +12,20 @@
 #include <linux/pci.h>
 #include <linux/init.h>
 #include <linux/acpi.h>
+#include <asm/e820.h>
 #include "pci.h"
 
+#define MMCONFIG_APER_SIZE (256*1024*1024)
+
+/* Assume systems with more busses have correct MCFG */
+#define MAX_CHECK_BUS 16
+
 #define mmcfg_virt_addr ((void __iomem *) fix_to_virt(FIX_PCIE_MCFG))
 
 /* The base address of the last MMCONFIG device accessed */
 static u32 mmcfg_last_accessed_device;
 
-static DECLARE_BITMAP(fallback_slots, 32);
+static DECLARE_BITMAP(fallback_slots, MAX_CHECK_BUS*32);
 
 /*
  * Functions for accessing PCI configuration space with MMCONFIG accesses
@@ -29,8 +35,8 @@ static u32 get_base_addr(unsigned int se
 	int cfg_num = -1;
 	struct acpi_table_mcfg_config *cfg;
 
-	if (seg == 0 && bus == 0 &&
-	    test_bit(PCI_SLOT(devfn), fallback_slots))
+	if (seg == 0 && bus < MAX_CHECK_BUS &&
+	    test_bit(PCI_SLOT(devfn) + 32*bus, fallback_slots))
 		return 0;
 
 	while (1) {
@@ -74,8 +80,10 @@ static int pci_mmcfg_read(unsigned int s
 	unsigned long flags;
 	u32 base;
 
-	if (!value || (bus > 255) || (devfn > 255) || (reg > 4095))
+	if (!value || (bus > 255) || (devfn > 255) || (reg > 4095)) {
+		*value = -1;
 		return -EINVAL;
+	}
 
 	base = get_base_addr(seg, bus, devfn);
 	if (!base)
@@ -146,29 +154,34 @@ static struct pci_raw_ops pci_mmcfg = {
    Normally this can be expressed in the MCFG by not listing them
    and assigning suitable _SEGs, but this isn't implemented in some BIOS.
    Instead try to discover all devices on bus 0 that are unreachable using MM
-   and fallback for them.
-   We only do this for bus 0/seg 0 */
+   and fallback for them.
*/
 static __init void unreachable_devices(void)
 {
-	int i;
+	int i, k;
 	unsigned long flags;
 
-	for (i = 0; i < 32; i++) {
-		u32 val1;
-		u32 addr;
-
-		pci_conf1_read(0, 0, PCI_DEVFN(i, 0), 0, 4, &val1);
-		if (val1 == 0xffffffff)
-			continue;
-
-		/* Locking probably not needed, but safer */
-		spin_lock_irqsave(&pci_config_lock, flags);
-		addr = get_base_addr(0, 0, PCI_DEVFN(i, 0));
-		if (addr != 0)
-			pci_exp_set_dev_base(addr, 0, PCI_DEVFN(i, 0));
-		if (addr == 0 || readl((u32 __iomem *)mmcfg_virt_addr) != val1)
-			set_bit(i, fallback_slots);
-		spin_unlock_irqrestore(&pci_config_lock, flags);
+	for (k = 0; k < MAX_CHECK_BUS; k++) {
+		for (i = 0; i < 32; i++) {
+			u32 val1;
+			u32 addr;
+
+			pci_conf1_read(0, k, PCI_DEVFN(i, 0), 0, 4, &val1);
+			if (val1 == 0xffffffff)
+				continue;
+
+			/* Locking probably not needed, but safer */
+			spin_lock_irqsave(&pci_config_lock, flags);
+			addr = get_base_addr(0, k, PCI_DEVFN(i, 0));
+			if (addr != 0)
+				pci_exp_set_dev_base(addr, k, PCI_DEVFN(i, 0));
+			if (addr == 0 ||
+			    readl((u32 __iomem *)mmcfg_virt_addr) != val1) {
+				set_bit(i, fallback_slots);
+				printk(KERN_NOTICE
+			"PCI: No mmconfig possible on %x:%x\n", k, i);
+			}
+			spin_unlock_irqrestore(&pci_config_lock, flags);
+		}
 	}
 }
 
@@ -183,6 +196,14 @@ void __init pci_mmcfg_init(void)
 	    (pci_mmcfg_config[0].base_address == 0))
 		return;
 
+	if (!e820_all_mapped(pci_mmcfg_config[0].base_address,
+			pci_mmcfg_config[0].base_address + MMCONFIG_APER_SIZE,
+			E820_RESERVED)) {
+		printk(KERN_ERR "PCI: BIOS Bug: MCFG area is not E820-reserved\n");
+		printk(KERN_ERR "PCI: Not using MMCONFIG.\n");
+		return;
+	}
+
 	printk(KERN_INFO "PCI: Using MMCONFIG\n");
 	raw_pci_ops = &pci_mmcfg;
 	pci_probe = (pci_probe & ~PCI_PROBE_MASK) | PCI_PROBE_MMCONF;
diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 4310b4a..7df2fe1 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -136,6 +136,11 @@ config X86_L1_CACHE_SHIFT
 	default "7" if GENERIC_CPU || MPSC
 	default "6" if MK8
 
+config X86_INTERNODE_CACHE_BYTES
+	int
+	default "4096" if X86_VSMP
+	default X86_L1_CACHE_BYTES if !X86_VSMP
+
 config X86_TSC
 	bool
 	default y
diff --git a/arch/x86_64/Makefile b/arch/x86_64/Makefile
index 585fd4a..e573e2a 100644
--- a/arch/x86_64/Makefile
+++ b/arch/x86_64/Makefile
@@ -24,37 +24,37 @@ LDFLAGS := -m elf_x86_64
 OBJCOPYFLAGS := -O binary -R .note -R .comment -S
 LDFLAGS_vmlinux :=
-
 CHECKFLAGS += -D__x86_64__ -m64
+cflags-y :=
 cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8)
 cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
 cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)
-CFLAGS += $(cflags-y)
-CFLAGS += -m64
-CFLAGS += -mno-red-zone
-CFLAGS += -mcmodel=kernel
-CFLAGS += -pipe
+cflags-y += -m64
+cflags-y += -mno-red-zone
+cflags-y += -mcmodel=kernel
+cflags-y += -pipe
 cflags-$(CONFIG_REORDER) += -ffunction-sections
 # this makes reading assembly source easier, but produces worse code
 # actually it makes the kernel smaller too.
-CFLAGS += -fno-reorder-blocks
-CFLAGS += -Wno-sign-compare
+cflags-y += -fno-reorder-blocks
+cflags-y += -Wno-sign-compare
 ifneq ($(CONFIG_UNWIND_INFO),y)
-CFLAGS += -fno-asynchronous-unwind-tables
+cflags-y += -fno-asynchronous-unwind-tables
 endif
 ifneq ($(CONFIG_DEBUG_INFO),y)
 # -fweb shrinks the kernel a bit, but the difference is very small
 # it also messes up debugging, so don't use it for now.
-#CFLAGS += $(call cc-option,-fweb)
+#cflags-y += $(call cc-option,-fweb)
 endif
 # -funit-at-a-time shrinks the kernel .text considerably
 # unfortunately it makes reading oopses harder.
-CFLAGS += $(call cc-option,-funit-at-a-time)
+cflags-y += $(call cc-option,-funit-at-a-time)
 # prevent gcc from generating any FP code by mistake
-CFLAGS += $(call cc-option,-mno-sse -mno-mmx -mno-sse2 -mno-3dnow,)
+cflags-y += $(call cc-option,-mno-sse -mno-mmx -mno-sse2 -mno-3dnow,)
+CFLAGS += $(cflags-y)
 AFLAGS += -m64
 head-y := arch/x86_64/kernel/head.o arch/x86_64/kernel/head64.o arch/x86_64/kernel/init_task.o
diff --git a/arch/x86_64/defconfig b/arch/x86_64/defconfig
index 566ecc9..3c45ec2 100644
--- a/arch/x86_64/defconfig
+++ b/arch/x86_64/defconfig
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
-# Linux kernel version: 2.6.16-git9
-# Sat Mar 25 15:18:40 2006
+# Linux kernel version: 2.6.17-rc1
+# Mon Apr 3 16:11:14 2006
 #
 CONFIG_X86_64=y
 CONFIG_64BIT=y
@@ -9,6 +9,7 @@ CONFIG_X86=y
 CONFIG_SEMAPHORE_SLEEPERS=y
 CONFIG_MMU=y
 CONFIG_RWSEM_GENERIC_SPINLOCK=y
+CONFIG_GENERIC_HWEIGHT=y
 CONFIG_GENERIC_CALIBRATE_DELAY=y
 CONFIG_X86_CMPXCHG=y
 CONFIG_EARLY_PRINTK=y
@@ -55,10 +56,6 @@ CONFIG_BASE_FULL=y
 CONFIG_FUTEX=y
 CONFIG_EPOLL=y
 CONFIG_SHMEM=y
-CONFIG_CC_ALIGN_FUNCTIONS=0
-CONFIG_CC_ALIGN_LABELS=0
-CONFIG_CC_ALIGN_LOOPS=0
-CONFIG_CC_ALIGN_JUMPS=0
 CONFIG_SLAB=y
 # CONFIG_TINY_SHMEM is not set
 CONFIG_BASE_SMALL=0
@@ -70,7 +67,6 @@ CONFIG_BASE_SMALL=0
 CONFIG_MODULES=y
 CONFIG_MODULE_UNLOAD=y
 CONFIG_MODULE_FORCE_UNLOAD=y
-CONFIG_OBSOLETE_MODPARM=y
 # CONFIG_MODVERSIONS is not set
 # CONFIG_MODULE_SRCVERSION_ALL is not set
 # CONFIG_KMOD is not set
@@ -81,6 +77,7 @@ CONFIG_STOP_MACHINE=y
 #
 CONFIG_LBD=y
 # CONFIG_BLK_DEV_IO_TRACE is not set
+# CONFIG_LSF is not set
 #
 # IO Schedulers
@@ -105,6 +102,7 @@ CONFIG_X86_PC=y
 CONFIG_GENERIC_CPU=y
 CONFIG_X86_L1_CACHE_BYTES=128
 CONFIG_X86_L1_CACHE_SHIFT=7
+CONFIG_X86_INTERNODE_CACHE_BYTES=128
 CONFIG_X86_TSC=y
 CONFIG_X86_GOOD_APIC=y
 # CONFIG_MICROCODE is not set
@@ -116,6 +114,7 @@ CONFIG_X86_LOCAL_APIC=y
 CONFIG_MTRR=y
 CONFIG_SMP=y
 CONFIG_SCHED_SMT=y
+CONFIG_SCHED_MC=y
 # CONFIG_PREEMPT_NONE is not set
 CONFIG_PREEMPT_VOLUNTARY=y
 # CONFIG_PREEMPT is not set
@@ -138,6 +137,7 @@ CONFIG_NEED_MULTIPLE_NODES=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
 CONFIG_MIGRATION=y
 CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
+CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y
 CONFIG_NR_CPUS=32
 CONFIG_HOTPLUG_CPU=y
 CONFIG_HPET_TIMER=y
@@ -289,6 +289,7 @@ CONFIG_IP_PNP_DHCP=y
 # CONFIG_INET_AH is not set
 # CONFIG_INET_ESP is not set
 # CONFIG_INET_IPCOMP is not set
+# CONFIG_INET_XFRM_TUNNEL is not set
 # CONFIG_INET_TUNNEL is not set
 CONFIG_INET_DIAG=y
 CONFIG_INET_TCP_DIAG=y
@@ -300,6 +301,7 @@ CONFIG_IPV6=y
 # CONFIG_INET6_AH is not set
 # CONFIG_INET6_ESP is not set
 # CONFIG_INET6_IPCOMP is not set
+# CONFIG_INET6_XFRM_TUNNEL is not set
 # CONFIG_INET6_TUNNEL is not set
 # CONFIG_IPV6_TUNNEL is not set
 # CONFIG_NETFILTER is not set
@@ -704,7 +706,6 @@ CONFIG_S2IO=m
 # Wireless LAN (non-hamradio)
 #
 # CONFIG_NET_RADIO is not set
-# CONFIG_NET_WIRELESS_RTNETLINK is not set
 #
 # Wan interfaces
@@ -791,7 +792,7 @@ CONFIG_HW_CONSOLE=y
 #
 CONFIG_SERIAL_8250=y
 CONFIG_SERIAL_8250_CONSOLE=y
-# CONFIG_SERIAL_8250_ACPI is not set
+CONFIG_SERIAL_8250_PCI=y
 CONFIG_SERIAL_8250_NR_UARTS=4
 CONFIG_SERIAL_8250_RUNTIME_UARTS=4
 # CONFIG_SERIAL_8250_EXTENDED is not set
@@ -921,6 +922,7 @@ CONFIG_HWMON=y
 # Digital Video Broadcasting Devices
 #
 # CONFIG_DVB is not set
+# CONFIG_USB_DABUSB is not set
 #
 # Graphics support
@@ -932,6 +934,8 @@ CONFIG_VIDEO_SELECT=y
 # Console display driver support
 #
 CONFIG_VGA_CONSOLE=y
+CONFIG_VGACON_SOFT_SCROLLBACK=y
+CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=256
 CONFIG_DUMMY_CONSOLE=y
 #
@@ -1058,15 +1062,6 @@ CONFIG_USB_HIDINPUT=y
 # CONFIG_USB_MICROTEK is not set
 #
-# USB Multimedia devices
-#
-# CONFIG_USB_DABUSB is not set
-
-#
-# Video4Linux support is needed for USB Multimedia device support
-#
-
-#
 # USB Network Adapters
 #
 # CONFIG_USB_CATC is not set
@@ -1118,9 +1113,15 @@ CONFIG_USB_MON=y
 # CONFIG_MMC is not set
 #
+# LED devices
+#
+# CONFIG_NEW_LEDS is not set
+
+#
 # InfiniBand support
 #
 # CONFIG_INFINIBAND is not set
+# CONFIG_IPATH_CORE is not set
 #
 # EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
@@ -1128,6 +1129,11 @@ CONFIG_USB_MON=y
 # CONFIG_EDAC is not set
 #
+# Real Time Clock
+#
+# CONFIG_RTC_CLASS is not set
+
+#
 # Firmware Drivers
 #
 # CONFIG_EDD is not set
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 35b2fac..5a98026 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -15,6 +15,8 @@
 #include <asm/vsyscall32.h>
 #include <linux/linkage.h>
+#define IA32_NR_syscalls ((ia32_syscall_end - ia32_sys_call_table)/8)
+
 	.macro IA32_ARG_FIXUP noebp=0
 	movl %edi,%r8d
 	.if \noebp
@@ -109,8 +111,8 @@ ENTRY(ia32_sysenter_target)
 	CFI_REMEMBER_STATE
 	jnz sysenter_tracesys
 sysenter_do_call:
-	cmpl $(IA32_NR_syscalls),%eax
-	jae ia32_badsys
+	cmpl $(IA32_NR_syscalls-1),%eax
+	ja ia32_badsys
 	IA32_ARG_FIXUP 1
 	call *ia32_sys_call_table(,%rax,8)
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -210,8 +212,8 @@ ENTRY(ia32_cstar_target)
 	CFI_REMEMBER_STATE
 	jnz cstar_tracesys
 cstar_do_call:
-	cmpl $IA32_NR_syscalls,%eax
-	jae ia32_badsys
+	cmpl $IA32_NR_syscalls-1,%eax
+	ja ia32_badsys
 	IA32_ARG_FIXUP 1
 	call *ia32_sys_call_table(,%rax,8)
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -296,8 +298,8 @@ ENTRY(ia32_syscall)
 	testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),threadinfo_flags(%r10)
 	jnz ia32_tracesys
 ia32_do_syscall:
-	cmpl $(IA32_NR_syscalls),%eax
-	jae ia32_badsys
+	cmpl $(IA32_NR_syscalls-1),%eax
+	ja ia32_badsys
 	IA32_ARG_FIXUP
 	call *ia32_sys_call_table(,%rax,8) # xxx: rip relative
 ia32_sysret:
@@ -685,12 +687,11 @@ ia32_sys_call_table:
 	.quad sys_readlinkat /* 305 */
 	.quad sys_fchmodat
 	.quad sys_faccessat
-	.quad sys_ni_syscall /* pselect6 for now */
-	.quad sys_ni_syscall /* ppoll for now */
+	.quad quiet_ni_syscall /* pselect6 for now */
+	.quad quiet_ni_syscall /* ppoll for now */
 	.quad sys_unshare /* 310 */
 	.quad compat_sys_set_robust_list
 	.quad compat_sys_get_robust_list
+	.quad sys_splice
+	.quad sys_sync_file_range
 ia32_syscall_end:
-	.rept IA32_NR_syscalls-(ia32_syscall_end-ia32_sys_call_table)/8
-		.quad ni_syscall
-	.endr
diff --git a/arch/x86_64/kernel/aperture.c b/arch/x86_64/kernel/aperture.c
index fffd6b0..70b9d21 100644
--- a/arch/x86_64/kernel/aperture.c
+++ b/arch/x86_64/kernel/aperture.c
@@ -80,7 +80,7 @@ static int __init aperture_valid(char *n
 		printk("Aperture from %s beyond 4GB. Ignoring.\n",name);
 		return 0;
 	}
-	if (e820_mapped(aper_base, aper_base + aper_size, E820_RAM)) {
+	if (e820_any_mapped(aper_base, aper_base + aper_size, E820_RAM)) {
 		printk("Aperture from %s pointing to e820 RAM. Ignoring.\n",name);
 		return 0;
 	}
diff --git a/arch/x86_64/kernel/e820.c b/arch/x86_64/kernel/e820.c
index 293cd71..62776c0 100644
--- a/arch/x86_64/kernel/e820.c
+++ b/arch/x86_64/kernel/e820.c
@@ -80,7 +80,12 @@ static inline int bad_addr(unsigned long
 	return 0;
 }
-int __init e820_mapped(unsigned long start, unsigned long end, unsigned type)
+/*
+ * This function checks if any part of the range <start,end> is mapped
+ * with type.
+ */
+int __meminit
+e820_any_mapped(unsigned long start, unsigned long end, unsigned type)
 {
 	int i;
 	for (i = 0; i < e820.nr_map; i++) {
@@ -94,6 +99,35 @@ int __init e820_mapped(unsigned long sta
 	return 0;
 }
+/*
+ * This function checks if the entire range <start,end> is mapped with type.
+ *
+ * Note: this function only works correct if the e820 table is sorted and
+ * not-overlapping, which is the case
+ */
+int __init e820_all_mapped(unsigned long start, unsigned long end, unsigned type)
+{
+	int i;
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		if (type && ei->type != type)
+			continue;
+		/* is the region (part) in overlap with the current region ?*/
+		if (ei->addr >= end || ei->addr + ei->size <= start)
+			continue;
+
+		/* if the region is at the beginning of <start,end> we move
+		 * start to the end of the region since it's ok until there
+		 */
+		if (ei->addr <= start)
+			start = ei->addr + ei->size;
+		/* if start is now at or beyond end, we're done, full coverage */
+		if (start >= end)
+			return 1; /* we're done */
+	}
+	return 0;
+}
+
 /*
  * Find a free area in a specific range.
  */
diff --git a/arch/x86_64/kernel/entry.S b/arch/x86_64/kernel/entry.S
index 8538bfe..c946e4f 100644
--- a/arch/x86_64/kernel/entry.S
+++ b/arch/x86_64/kernel/entry.S
@@ -180,6 +180,10 @@ rff_trace:
  *
  * XXX if we had a free scratch register we could save the RSP into the stack frame
  * and report it properly in ps. Unfortunately we haven't.
+ *
+ * When user can change the frames always force IRET. That is because
+ * it deals with uncanonical addresses better. SYSRET has trouble
+ * with them due to bugs in both AMD and Intel CPUs.
  */
 ENTRY(system_call)
@@ -254,7 +258,10 @@ sysret_signal:
 	xorl %esi,%esi # oldset -> arg2
 	call ptregscall_common
 1:	movl $_TIF_NEED_RESCHED,%edi
-	jmp sysret_check
+	/* Use IRET because user could have changed frame. This
+	   works because ptregscall_common has called FIXUP_TOP_OF_STACK. */
+	cli
+	jmp int_with_check
 badsys:
 	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
@@ -280,7 +287,8 @@ tracesys:
 	call syscall_trace_leave
 	RESTORE_TOP_OF_STACK %rbx
 	RESTORE_REST
-	jmp ret_from_sys_call
+	/* Use IRET because user could have changed frame */
+	jmp int_ret_from_sys_call
 	CFI_ENDPROC
 /*
@@ -408,25 +416,9 @@ ENTRY(stub_execve)
 	CFI_ADJUST_CFA_OFFSET -8
 	CFI_REGISTER rip, r11
 	SAVE_REST
-	movq %r11, %r15
-	CFI_REGISTER rip, r15
 	FIXUP_TOP_OF_STACK %r11
 	call sys_execve
-	GET_THREAD_INFO(%rcx)
-	bt $TIF_IA32,threadinfo_flags(%rcx)
-	CFI_REMEMBER_STATE
-	jc exec_32bit
 	RESTORE_TOP_OF_STACK %r11
-	movq %r15, %r11
-	CFI_REGISTER rip, r11
-	RESTORE_REST
-	pushq %r11
-	CFI_ADJUST_CFA_OFFSET 8
-	CFI_REL_OFFSET rip, 0
-	ret
-
-exec_32bit:
-	CFI_RESTORE_STATE
 	movq %rax,RAX(%rsp)
 	RESTORE_REST
 	jmp int_ret_from_sys_call
diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index 10b3e34..6f0790e 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -29,6 +29,8 @@
 #define MISC_MCELOG_MINOR 227
 #define NR_BANKS 6
+atomic_t mce_entry;
+
 static int mce_dont_init;
 /* 0: always panic, 1: panic if deadlock possible, 2: try to avoid panic,
@@ -172,10 +174,12 @@ void do_machine_check(struct pt_regs * r
 	int i;
 	int panicm_found = 0;
+	atomic_inc(&mce_entry);
+
 	if (regs)
 		notify_die(DIE_NMI, "machine check", regs, error_code, 18, SIGKILL);
 	if (!banks)
-		return;
+		goto out2;
 	memset(&m, 0, sizeof(struct mce));
 	m.cpu = safe_smp_processor_id();
@@ -266,6 +270,8 @@ void do_machine_check(struct pt_regs * r
  out:
 	/* Last thing done in the machine check exception to clear state. */
 	wrmsrl(MSR_IA32_MCG_STATUS, 0);
+ out2:
+	atomic_dec(&mce_entry);
 }
 /*
diff --git a/arch/x86_64/kernel/nmi.c b/arch/x86_64/kernel/nmi.c
index d9e4067..4e6357f 100644
--- a/arch/x86_64/kernel/nmi.c
+++ b/arch/x86_64/kernel/nmi.c
@@ -34,6 +34,7 @@
 #include <asm/proto.h>
 #include <asm/kdebug.h>
 #include <asm/local.h>
+#include <asm/mce.h>
 /*
  * lapic_nmi_owner tracks the ownership of the lapic NMI hardware:
@@ -480,6 +481,12 @@ void __kprobes nmi_watchdog_tick(struct
 		__get_cpu_var(nmi_touch) = 0;
 		touched = 1;
 	}
+#ifdef CONFIG_X86_MCE
+	/* Could check oops_in_progress here too, but it's safer
+	   not too */
+	if (atomic_read(&mce_entry) > 0)
+		touched = 1;
+#endif
 	if (!touched && __get_cpu_var(last_irq_sum) == sum) {
 		/*
 		 * Ayiee, looks like this CPU is stuck ...
diff --git a/arch/x86_64/kernel/pci-dma.c b/arch/x86_64/kernel/pci-dma.c
index 03c9eee..af035ed 100644
--- a/arch/x86_64/kernel/pci-dma.c
+++ b/arch/x86_64/kernel/pci-dma.c
@@ -48,9 +48,11 @@ dma_alloc_pages(struct device *dev, gfp_
 {
 	struct page *page;
 	int node;
+#ifdef CONFIG_PCI
 	if (dev->bus == &pci_bus_type)
 		node = pcibus_to_node(to_pci_dev(dev)->bus);
 	else
+#endif
 		node = numa_node_id();
 	page = alloc_pages_node(node, gfp, order);
 	return page ? page_address(page) : NULL;
diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c
index 70dd8e5..1c44b53 100644
--- a/arch/x86_64/kernel/process.c
+++ b/arch/x86_64/kernel/process.c
@@ -781,10 +781,16 @@ long do_arch_prctl(struct task_struct *t
 	}
 	case ARCH_GET_GS: {
 		unsigned long base;
+		unsigned gsindex;
 		if (task->thread.gsindex == GS_TLS_SEL)
 			base = read_32bit_tls(task, GS_TLS);
-		else if (doit)
-			rdmsrl(MSR_KERNEL_GS_BASE, base);
+		else if (doit) {
+			asm("movl %%gs,%0" : "=r" (gsindex));
+			if (gsindex)
+				rdmsrl(MSR_KERNEL_GS_BASE, base);
+			else
+				base = task->thread.gs;
+		}
 		else
 			base = task->thread.gs;
 		ret = put_user(base, (unsigned long __user *)addr);
diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index 0856ad4..c50b067 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -353,8 +353,10 @@ static __init void parse_cmdline_early (
 		if (fullarg(from, "enable_timer_pin_1"))
 			disable_timer_pin_1 = -1;
-		if (fullarg(from, "nolapic") || fullarg(from, "disableapic"))
+		if (fullarg(from, "nolapic") || fullarg(from, "disableapic")) {
+			clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
 			disable_apic = 1;
+		}
 		if (fullarg(from, "noapic"))
 			skip_ioapic_setup = 1;
diff --git a/arch/x86_64/kernel/time.c b/arch/x86_64/kernel/time.c
index ef8bc46..7392570 100644
--- a/arch/x86_64/kernel/time.c
+++ b/arch/x86_64/kernel/time.c
@@ -726,7 +726,7 @@ static __init int late_hpet_init(void)
 	unsigned int ntimer;
 	if (!vxtime.hpet_address)
-		return -1;
+		return 0;
 	memset(&hd, 0, sizeof (hd));
@@ -917,6 +917,8 @@ void __init time_init(void)
 		vxtime.hpet_address = 0;
 	if (hpet_use_timer) {
+		/* set tick_nsec to use the proper rate for HPET */
+		tick_nsec = TICK_NSEC_HPET;
 		cpu_khz = hpet_calibrate_tsc();
 		timename = "HPET";
 #ifdef CONFIG_X86_PM_TIMER
diff --git a/arch/x86_64/kernel/vmlinux.lds.S b/arch/x86_64/kernel/vmlinux.lds.S
index 39ff070..b81f473 100644
--- a/arch/x86_64/kernel/vmlinux.lds.S
+++ b/arch/x86_64/kernel/vmlinux.lds.S
@@ -65,7 +65,7 @@ SECTIONS
 	.data.cacheline_aligned : AT(ADDR(.data.cacheline_aligned) - LOAD_OFFSET) {
 		*(.data.cacheline_aligned)
 	}
-	. = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
+	. = ALIGN(CONFIG_X86_INTERNODE_CACHE_BYTES);
 	.data.read_mostly : AT(ADDR(.data.read_mostly) - LOAD_OFFSET) {
 		*(.data.read_mostly)
 	}
diff --git a/arch/x86_64/kernel/x8664_ksyms.c b/arch/x86_64/kernel/x8664_ksyms.c
index d78f460..fec4e52 100644
--- a/arch/x86_64/kernel/x8664_ksyms.c
+++ b/arch/x86_64/kernel/x8664_ksyms.c
@@ -112,7 +112,6 @@ EXPORT_SYMBOL_GPL(unset_nmi_callback);
 #undef memcpy
 #undef memset
 #undef memmove
-#undef strlen
 extern void * memset(void *,int,__kernel_size_t);
 extern size_t strlen(const char *);
@@ -121,7 +120,6 @@ extern void * memcpy(void *,const void *
 extern void * __memcpy(void *,const void *,__kernel_size_t);
 EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(strlen);
 EXPORT_SYMBOL(strpbrk);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(memcpy);
diff --git a/arch/x86_64/mm/init.c b/arch/x86_64/mm/init.c
index e5f7f1c..4ba34e9 100644
--- a/arch/x86_64/mm/init.c
+++ b/arch/x86_64/mm/init.c
@@ -305,7 +305,7 @@ static void __meminit phys_pud_init(pud_
 		if (paddr >= end)
 			break;
-		if (!after_bootmem && !e820_mapped(paddr, paddr+PUD_SIZE, 0)) {
+		if (!after_bootmem && !e820_any_mapped(paddr, paddr+PUD_SIZE, 0)) {
 			set_pud(pud, __pud(0));
 			continue;
 		}
@@ -507,9 +507,8 @@ void __init clear_kernel_mapping(unsigne
 /*
  * Memory hotplug specific functions
- * These are only for non-NUMA machines right now.
  */
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_ACPI_HOTPLUG_MEMORY) || defined(CONFIG_ACPI_HOTPLUG_MEMORY_MODULE)
 void online_page(struct page *page)
 {
@@ -520,6 +519,38 @@ void online_page(struct page *page)
 	num_physpages++;
 }
+#ifndef CONFIG_MEMORY_HOTPLUG
+/*
+ * Memory Hotadd without sparsemem. The mem_maps have been allocated in advance,
+ * just online the pages.
+ */
+int __add_pages(struct zone *z, unsigned long start_pfn, unsigned long nr_pages)
+{
+	int err = -EIO;
+	unsigned long pfn;
+	unsigned long total = 0, mem = 0;
+	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
+		if (pfn_valid(pfn)) {
+			online_page(pfn_to_page(pfn));
+			err = 0;
+			mem++;
+		}
+		total++;
+	}
+	if (!err) {
+		z->spanned_pages += total;
+		z->present_pages += mem;
+		z->zone_pgdat->node_spanned_pages += total;
+		z->zone_pgdat->node_present_pages += mem;
+	}
+	return err;
+}
+#endif
+
+/*
+ * Memory is added always to NORMAL zone. This means you will never get
+ * additional DMA/DMA32 memory.
+ */
 int add_memory(u64 start, u64 size)
 {
 	struct pglist_data *pgdat = NODE_DATA(0);
diff --git a/arch/x86_64/mm/numa.c b/arch/x86_64/mm/numa.c
index 4be82d6..cc02573 100644
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -100,11 +100,30 @@ int early_pfn_to_nid(unsigned long pfn)
 }
 #endif
+static void * __init
+early_node_mem(int nodeid, unsigned long start, unsigned long end,
+		unsigned long size)
+{
+	unsigned long mem = find_e820_area(start, end, size);
+	void *ptr;
+	if (mem != -1L)
+		return __va(mem);
+	ptr = __alloc_bootmem_nopanic(size,
+				SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS));
+	if (ptr == 0) {
+		printk(KERN_ERR "Cannot find %lu bytes in node %d\n",
+			size, nodeid);
+		return NULL;
+	}
+	return ptr;
+}
+
 /* Initialize bootmem allocator for a node */
 void __init setup_node_bootmem(int nodeid, unsigned long start, unsigned long end)
 {
 	unsigned long start_pfn, end_pfn, bootmap_pages, bootmap_size, bootmap_start;
 	unsigned long nodedata_phys;
+	void *bootmap;
 	const int pgdat_size = round_up(sizeof(pg_data_t), PAGE_SIZE);
 	start = round_up(start, ZONE_ALIGN);
@@ -114,13 +133,11 @@ void __init setup_node_bootmem(int nodei
 	start_pfn = start >> PAGE_SHIFT;
 	end_pfn = end >> PAGE_SHIFT;
-	nodedata_phys = find_e820_area(start, end, pgdat_size);
-	if (nodedata_phys == -1L)
-		panic("Cannot find memory pgdat in node %d\n", nodeid);
-
-	Dprintk("nodedata_phys %lx\n", nodedata_phys);
+	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size);
+	if (node_data[nodeid] == NULL)
+		return;
+	nodedata_phys = __pa(node_data[nodeid]);
-	node_data[nodeid] = phys_to_virt(nodedata_phys);
 	memset(NODE_DATA(nodeid), 0, sizeof(pg_data_t));
 	NODE_DATA(nodeid)->bdata = &plat_node_bdata[nodeid];
 	NODE_DATA(nodeid)->node_start_pfn = start_pfn;
@@ -129,9 +146,15 @@ void __init setup_node_bootmem(int nodei
 	/* Find a place for the bootmem map */
 	bootmap_pages = bootmem_bootmap_pages(end_pfn - start_pfn);
 	bootmap_start = round_up(nodedata_phys + pgdat_size, PAGE_SIZE);
-	bootmap_start = find_e820_area(bootmap_start, end, bootmap_pages<<PAGE_SHIFT);
-	if (bootmap_start == -1L)
-		panic("Not enough continuous space for bootmap on node %d", nodeid);
+	bootmap = early_node_mem(nodeid, bootmap_start, end,
+					bootmap_pages<<PAGE_SHIFT);
+	if (bootmap == NULL) {
+		if (nodedata_phys < start || nodedata_phys >= end)
+			free_bootmem((unsigned long)node_data[nodeid],pgdat_size);
+		node_data[nodeid] = NULL;
+		return;
+	}
+	bootmap_start = __pa(bootmap);
 	Dprintk("bootmap start %lu pages %lu\n", bootmap_start, bootmap_pages);
 	bootmap_size = init_bootmem_node(NODE_DATA(nodeid),
@@ -142,6 +165,9 @@ void __init setup_node_bootmem(int nodei
 	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size);
 	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
+#ifdef CONFIG_ACPI_NUMA
+	srat_reserve_add_area(nodeid);
+#endif
 	node_set_online(nodeid);
 }
@@ -335,6 +361,8 @@ __init int numa_setup(char *opt)
 #ifdef CONFIG_ACPI_NUMA
 	if (!strncmp(opt,"noacpi",6))
 		acpi_numa = -1;
+	if (!strncmp(opt,"hotadd=", 7))
+		hotadd_percent = simple_strtoul(opt+7, NULL, 10);
 #endif
 	return 1;
 }
diff --git a/arch/x86_64/mm/srat.c b/arch/x86_64/mm/srat.c
index 2eb8795..15ae9fc 100644
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -15,15 +15,26 @@
 #include <linux/bitmap.h>
 #include <linux/module.h>
 #include <linux/topology.h>
+#include <linux/bootmem.h>
+#include <linux/mm.h>
 #include <asm/proto.h>
 #include <asm/numa.h>
 #include <asm/e820.h>
+#if (defined(CONFIG_ACPI_HOTPLUG_MEMORY) || \
+	defined(CONFIG_ACPI_HOTPLUG_MEMORY_MODULE)) \
+		&& !defined(CONFIG_MEMORY_HOTPLUG)
+#define RESERVE_HOTADD 1
+#endif
+
 static struct acpi_table_slit *acpi_slit;
 static nodemask_t nodes_parsed __initdata;
 static nodemask_t nodes_found __initdata;
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
+static struct bootnode nodes_add[MAX_NUMNODES] __initdata;
+static int found_add_area __initdata;
+int hotadd_percent __initdata = 10;
 static u8 pxm2node[256] = { [0 ... 255] = 0xff };
 /* Too small nodes confuse the VM badly. Usually they result
@@ -71,6 +82,10 @@ static __init int conflicting_nodes(unsi
 static __init void cutoff_node(int i, unsigned long start, unsigned long end)
 {
 	struct bootnode *nd = &nodes[i];
+
+	if (found_add_area)
+		return;
+
 	if (nd->start < start) {
 		nd->start = start;
 		if (nd->end < nd->start)
@@ -90,6 +105,8 @@ static __init void bad_srat(void)
 	acpi_numa = -1;
 	for (i = 0; i < MAX_LOCAL_APIC; i++)
 		apicid_to_node[i] = NUMA_NO_NODE;
+	for (i = 0; i < MAX_NUMNODES; i++)
+		nodes_add[i].start = nodes[i].end = 0;
 }
 static __init inline int srat_disabled(void)
@@ -155,11 +172,114 @@ acpi_numa_processor_affinity_init(struct
 		pxm, pa->apic_id, node);
 }
+#ifdef RESERVE_HOTADD
+/*
+ * Protect against too large hotadd areas that would fill up memory.
+ */
+static int hotadd_enough_memory(struct bootnode *nd)
+{
+	static unsigned long allocated;
+	static unsigned long last_area_end;
+	unsigned long pages = (nd->end - nd->start) >> PAGE_SHIFT;
+	long mem = pages * sizeof(struct page);
+	unsigned long addr;
+	unsigned long allowed;
+	unsigned long oldpages = pages;
+
+	if (mem <... [truncated message content]
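
Reader's note on the e820 hunks in this patch: the comments added to arch/x86_64/kernel/e820.c distinguish "any part of the range is mapped" from "the entire range is mapped". The difference can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not the kernel code: struct region, sample_map, any_mapped, all_mapped and self_check are hypothetical names, and the body of any_mapped is assumed, since the patch shows only the start of e820_any_mapped. As the patch comment says, the all-mapped walk relies on the table being sorted and non-overlapping.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's struct e820entry. */
struct region {
	unsigned long addr;
	unsigned long size;
	unsigned type;
};

enum { TYPE_RAM = 1, TYPE_RESERVED = 2 };

/* Sample table: sorted, non-overlapping, with a hole at 0x3000-0x4000. */
static const struct region sample_map[] = {
	{ 0x0000, 0x1000, TYPE_RAM },
	{ 0x1000, 0x1000, TYPE_RESERVED },
	{ 0x2000, 0x1000, TYPE_RESERVED },
	{ 0x4000, 0x1000, TYPE_RESERVED },
};
#define SAMPLE_LEN ((int)(sizeof(sample_map) / sizeof(sample_map[0])))

/* Return 1 if any part of [start, end) overlaps an entry of the given type.
 * A type of 0 matches every entry. (Assumed behaviour, see the note above.) */
static int any_mapped(const struct region *map, int n,
		      unsigned long start, unsigned long end, unsigned type)
{
	int i;

	for (i = 0; i < n; i++) {
		if (type && map[i].type != type)
			continue;
		if (map[i].addr >= end || map[i].addr + map[i].size <= start)
			continue;
		return 1;
	}
	return 0;
}

/* Return 1 only if all of [start, end) is covered by entries of the given
 * type. Mirrors the walk in the patch: each matching entry that begins at or
 * before start pushes start forward to the end of that entry. */
static int all_mapped(const struct region *map, int n,
		      unsigned long start, unsigned long end, unsigned type)
{
	int i;

	for (i = 0; i < n; i++) {
		if (type && map[i].type != type)
			continue;
		if (map[i].addr >= end || map[i].addr + map[i].size <= start)
			continue;
		if (map[i].addr <= start)
			start = map[i].addr + map[i].size;
		if (start >= end)
			return 1;
	}
	return 0;
}

/* Small self-check over the sample table. */
static void self_check(void)
{
	/* Two adjacent reserved entries cover 0x1000-0x3000 completely. */
	assert(all_mapped(sample_map, SAMPLE_LEN, 0x1000, 0x3000, TYPE_RESERVED) == 1);
	/* The hole at 0x3000-0x4000 breaks full coverage of 0x1000-0x5000 ... */
	assert(all_mapped(sample_map, SAMPLE_LEN, 0x1000, 0x5000, TYPE_RESERVED) == 0);
	/* ... although part of that range is still reserved. */
	assert(any_mapped(sample_map, SAMPLE_LEN, 0x1000, 0x5000, TYPE_RESERVED) == 1);
	/* Nothing of any type lies inside the hole. */
	assert(any_mapped(sample_map, SAMPLE_LEN, 0x3000, 0x4000, 0) == 0);
}
```

Calling self_check() from a small main() exercises the four cases. The MMCONFIG check in the patch wants the stricter all-mapped variant because the whole aperture has to be E820-reserved, while the aperture and phys_pud_init callers only need to know whether the range touches a matching entry at all.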