From: Fergal D. <fe...@go...> - 2014-02-05 01:48:56
On Wed, Feb 5, 2014 at 3:45 AM, Jason Gunthorpe <jgu...@ob...> wrote:
> On Tue, Feb 04, 2014 at 06:06:37PM +0100, Michael Schaller wrote:
> > Hi everyone,
> >
> > I've had a serious look at our logs and machine inventory data today and
> > the numbers don't add up for a hardware-only issue.
>
> Wow, that is great data..
>
> The most direct way to figure out what is going on is to bisect the
> kernel git history. Narrowing down a problem to a single
> commit gives the best chance to get it fixed for you.. But I fully
> realize how that could be incredibly time consuming.
>
> > As a first step I've checked all the collected syslog data across
> > the fleet and we saw the "tpm_transmit: tpm_{send,recv}: error
> > -5" error messages for the first time on 2013-11-07 and since
> > then these errors appeared every day in the fleet. According to the
> > syslog data this affects 333 unique machines but during a normal
> > work day only 20-35 machines are affected.
>
> However, perhaps with this data you can already identify the first bad
> kernel version you saw?
>
> > Then I cross-referenced the syslog data with our machine inventory data
> > and I can confirm the following facts:
> > * All affected laptops use TPMs manufactured by STM
> > * About 10% of our laptop fleet with STM TPMs is affected
>
> So, non-STM models never see this problem and only some STM chips see
> it?
>
> I have debugged crazy TPM problems related to startup that showed
> this kind of distribution, but that was for embedded..

I have managed to reproduce the problem by repeatedly running
wpa_supplicant in a way that hits the TPM. It took fewer than 100 invocations.
I can't send it to you until I make sure it's OK to send outside Google.

In the course of this I also found that trousers stops working after 20-30
TPM invocations:
Feb 5 10:31:32 notpot.roam.corp.google.com TCSD TDDL[15292]: TrouSerS ERROR: write to device /dev/tpm0 failed: Timer expired
I had to restart trousers 3 or 4 times before I got to the point where
my TPM is unusable. That seems like a separate bug.

Right now, my TPM looks like this:
$@ sudo python3 tpm.py
[sudo] password for fergal:
Loc 0: Access: 0xff
Loc 0: STS: 0xff
Loc 0: Burst: 0xff 0xff
$@ tpm_version
Tspi_Context_Connect failed: 0x00003011 - layer=tsp, code=0011 (17), Communication failure
F
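
For reference, here is a minimal sketch of how the register values above could be interpreted. It is not the tpm.py used above, whose contents are not shown in this thread. The bit masks follow the TPM_ACCESS_* and TPM_STS_* constants in the Linux tpm_tis driver; the helper names (decode, looks_absent) are hypothetical. It assumes the two Burst bytes are the low and high bytes of the burst count, and that every register reading 0xff means nothing is answering at the TPM's address, which is a common interpretation but not a certain one.

```python
# Bit masks as used by the Linux tpm_tis driver (TPM_ACCESS_* / TPM_STS_*).
ACCESS_BITS = {
    0x80: "VALID",
    0x20: "ACTIVE_LOCALITY",
    0x04: "REQUEST_PENDING",
    0x02: "REQUEST_USE",
}
STS_BITS = {
    0x80: "VALID",
    0x40: "COMMAND_READY",
    0x20: "GO",
    0x10: "DATA_AVAIL",
    0x08: "DATA_EXPECT",
}


def decode(value, bits):
    """Return the names of the flags set in value, highest bit first."""
    return [name for mask, name in sorted(bits.items(), reverse=True)
            if value & mask]


def looks_absent(access, sts, burst_lo, burst_hi):
    """Heuristic (an assumption): all-0xff reads suggest no device responded."""
    return all(v == 0xFF for v in (access, sts, burst_lo, burst_hi))


if __name__ == "__main__":
    # Values copied from the "Loc 0" output above.
    access, sts, burst = 0xFF, 0xFF, (0xFF, 0xFF)
    print("Access flags:", decode(access, ACCESS_BITS))
    print("STS flags:   ", decode(sts, STS_BITS))
    # Assumed byte order: first byte low, second byte high.
    print("Burst count: ", burst[0] | (burst[1] << 8))
    print("Looks absent:", looks_absent(access, sts, *burst))
```

With every flag set at once, including combinations that should not normally occur together (such as COMMAND_READY together with DATA_AVAIL and DATA_EXPECT), the output is consistent with the chip not responding on the bus rather than reporting a real state.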
> > I've had a look at the tpm_tis source code but didn't see anything obvious
> > that changed between 3.5 and 3.8. The only thing that might be related is
> > this commit:
> > https://github.com/torvalds/linux/commit/b633f0507e19224f1527921644722bfb36db9bb0
>
> This seems harmless, but it suggests some changes were going on
> elsewhere in the PM subsystem...
>
> One thought: if the kernel fails to suspend the TPM the BIOS might
> disable it on restore, so adding debugging to verify suspend would be
> a good step.
>
> After you are done testing 3.5, I would try and test a kernel built
> from b633f0507e19224f1527921644722bfb36db9bb0 - this falls somewhere
> in-between 3.5 and 3.6. If that fails then you can probably do a
> bisection search with only 4-5 steps. If it passes then you know the
> problem isn't contained within the TPM subsystem.
>
> Do you know if the TPM has to be in use to cause the problem? E.g., can
> you have the issue if you don't start trousers, and don't use any TPM
> functions from user space? You'd have to read the sysfs files on
> resume to see if the TPM is still there..
>
> Jason
>