flickertcb-devel Mailing List for Flicker: Minimal TCB Code Execution (Page 10)


Hi Justin,

On 120314 at 08:45, Justin King-Lacroix wrote:
> I've found that remounting the root filesystem read-only before the
> late-launch, and then remounting it read-write again afterwards, works
> around the filesystem corruption issue.

Can you do that in a standard installation? Maybe from within the
kernel? Here it aborts with "mount: / is busy", and other attempts like
xfs_freeze or hdparm -S did not help.
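
For reference, the workaround as Justin describes it amounts to the
sequence below. This is only a minimal sketch: `launch` is a hypothetical
callable standing in for whatever triggers the late launch, and the
`run` parameter exists only so the sequence can be dry-run without
touching the real mount table. The first remount is the step that
aborts with "mount: / is busy" here.

```python
import subprocess
from contextlib import contextmanager


def run_cmd(argv):
    """Run a command; raises CalledProcessError if it exits non-zero."""
    subprocess.run(argv, check=True)


@contextmanager
def root_readonly(run=run_cmd):
    """Hold / read-only for the duration of the block, then restore rw."""
    run(["sync"])                            # flush dirty pages first
    run(["mount", "-o", "remount,ro", "/"])  # raises if / is busy
    try:
        yield
    finally:
        # Always try to go back to read-write, even if the session failed.
        run(["mount", "-o", "remount,rw", "/"])


def flicker_with_ro_root(launch, run=run_cmd):
    """`launch` is a hypothetical callable that performs the late launch."""
    with root_readonly(run):
        return launch()
```

If the read-only remount itself fails, nothing is launched and no
read-write remount is attempted, since the filesystem state was never
changed.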

Disabling AHCI in the BIOS reduced the problem quite a bit, allowing us
to make many Flicker sessions before the system drops dead :-)

> On the network card issue, pulling the module before late-launch, and
> re-modprobing it after, achieves nothing.

Subjectively, problems with the e1000 improved a lot after upgrading to
the latest vanilla Linux. But working remotely (i.e. with load on the
NIC) is still problematic, and wifi (iwlagn) will almost certainly
break after Flicker.

> whereas the network one feels like the IOMMU is left in a crazy state
> that blocks interrupts from the ethernet hardware.  (Of course,
> that's complete guesswork, so I could be totally off-base.)

Our NIC often continued to work for a while after Flicker if there was
no load on the interface, so I don't think it's the IOMMU.

I found reports on improved libata error handling where they said that
ATA is very sensitive to synchronization problems.

Due to the load dependency and randomness of the errors, I think it's
a general problem of not handling IRQs correctly and/or not properly
finishing ongoing DMA transactions.

I thought that S2RAM support in modern drivers and hardware should
address this problem, since similar things happen: The OS stops
operating for some time and has to put devices and drivers
in a state where they can be cleanly recovered from.

Looking at kernel/power/suspend.c, it looks like the right kind of
things are going on. CPUs are disabled, kernel threads halted, IRQs disabled.
Then there is an iterator over PCI devices that puts them to sleep, and
devices can ask their drivers to do some final cleanup work before
powering down.
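
The pattern behind that flow (quiesce things in order, then bring them
back in reverse order) can be sketched as follows. This is a toy model
under stated assumptions, not kernel code: the `Step` class, the step
names, and `critical_section` are all made up for illustration, and the
real ordering in kernel/power/suspend.c may differ.

```python
from contextlib import ExitStack


class Step:
    """A hypothetical quiesce/resume pair that logs what it does."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def quiesce(self):
        self.log.append("quiesce " + self.name)

    def resume(self):
        self.log.append("resume " + self.name)


def with_quiesced(steps, critical_section):
    """Quiesce each step in order, run the critical section, then resume
    in reverse order. If a quiesce fails midway, only the steps already
    quiesced are resumed."""
    with ExitStack() as stack:
        for step in steps:
            step.quiesce()
            stack.callback(step.resume)  # callbacks run last-in, first-out
        return critical_section()
```

Registering the resume callback only after the matching quiesce has
succeeded means a failure halfway through leaves nothing half-suspended.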

However, I did not have the time to fully replicate that. Using that
PCI iterator apparently shut down the PCI devices correctly (the screen
blanked out), but after wakeup the system was still borked. And we
actually needed the graphics card to do some work.. :-)

Here is a description of the suspend flow that is still mostly correct:
http://elinux.org/Pm_Sub_System#Internal_Sequence_of_System_PM

I also just saw that you can disable individual devices using that very
same PM facility, so that may be worth trying out:
http://elinux.org/Pm_Sub_System#Internal_Sequence_of_Device_PM

/steffen
-- 
System Security Lab                            web:  www.trust.cased.de
CASED / TU Darmstadt                           phone: +49 6151 16-75565
PGP Key Fingerprint = B805 57BE E4AF 0104 CC51 77A1 CE6F 8D46 A04D 7875


flickertcb-devel — Development information