From: Igor N. <ig...@no...> - 2014-10-25 06:44:03
> You are using zero-copy FILEIO, right?

No, I don't use zero_copy; it's set to 0 on all devices.
Or do you mean the zero_copy TCP provided by the put_page_callback patch?

> Then you must have stable pages on your system,

How do I achieve that?
I've read a lot of chatter around stable pages, but found no real examples of
how to change the way the kernel handles them.

On 25.10.2014 7:36, Vladislav Bolkhovitin wrote:
> You are using zero-copy FILEIO, right? Then you must have stable pages on your system,
> otherwise you might see corruptions you are seeing, when data on pages changed under DRBD.
>
> Vlad
>
> Igor Novgorodov wrote on 10/23/2014 11:53 PM:
>> On 24.10.2014 6:46, Vladislav Bolkhovitin wrote:
>>> Igor Novgorodov, on 10/22/2014 11:51 PM wrote:
>>>> Which digests? iSCSI?
>>> Yes, iSCSI
>>>
>>>> Or DRBD?
>>>> Anyway, both iSCSI's Header & Data CRC32 digests and DRBD replication
>>>> SHA1 digests have been enabled for a long time.
>>> Did you see occasional errors in the logs?
>> SCST logs?
>> Only occasional disconnects of one initiator, but I'm not sure that's
>> related:
>>
>> [411169.783716] iscsi-scst: ***ERROR***: Connection with initiator
>> iqn.2011-04.ru.domain:krvm2 unexpectedly closed!
>> [411170.042574] scst: Using security group
>> "iqn.2011-04.ru.domain:VM_STORAGE2_1" for initiator
>> "iqn.2011-04.ru.domain:krvm2" (target iqn.2011-04.ru.domain:VM_STORAGE2_1)
>> [411170.043345] iscsi-scst: Negotiated parameters: InitialR2T No,
>> ImmediateData Yes, MaxConnections 1, MaxRecvDataSegmentLength 1048576,
>> MaxXmitDataSegmentLength 1048576,
>> [411170.043418] iscsi-scst: MaxBurstLength 1048576, FirstBurstLength
>> 524284, DefaultTime2Wait 0, DefaultTime2Retain 0,
>> [411170.043468] iscsi-scst: MaxOutstandingR2T 1, DataPDUInOrder Yes,
>> DataSequenceInOrder Yes, ErrorRecoveryLevel 0,
>> [411170.043518] iscsi-scst: HeaderDigest CRC32C, DataDigest CRC32C,
>> OFMarker No, IFMarker No, OFMarkInt 2048, IFMarkInt 2048
>> [411170.043569] iscsi-scst: Target parameters set for session
>> 4f3c00003d0200: QueuedCommands 32, Response timeout 90, Nop-In interval
>> 30, Nop-In timeout 30
>>
>>>> Concerning stable page writes - should switching to vdisk_blockio help
>>>> me?
>>> Yes, it might help.
>>>
>>>> That should avoid page cache.
>>>> And why does this issue cause problems? SCST modifies its buffer after write()?
>>> Have you checked Google as I recommended? It's really long to describe it here.
>>>
>>> Vlad
>> Yes, I've checked with http://lwn.net/Articles/442355/
>> But that does not explain whether SCST has problems with it or not.
>> As far as I understand, the problem occurs when the process issuing
>> write() requests modifies the write buffer after write()
>>
>>>> On 23.10.2014 6:38, Vladislav Bolkhovitin wrote:
>>>>> Hmm, stable pages issue (google it)? I'd suggest you to try with data digests enabled.
>>>>>
>>>>> Vlad
>>>>>
>>>>> Igor Novgorodov, on 10/21/2014 10:25 AM wrote:
>>>>>> Hello!
>>>>>>
>>>>>> I've recently upgraded one of my dual-node single-primary clusters
>>>>>> to latest SCST (3.0 branch, rev. 5843, was 2.2 rev 5319 I guess) and
>>>>>> DRBD (8.4.5 branch latest git, was 8.4.3). Kernel 3.14.22 (was 3.4.x)
>>>>>> 2 LUNs (8 and 10 Tb) are exported via iscsi and vdisk_fileio.
>>>>>>
>>>>>> It seems to work OK, but I started getting digest errors when running
>>>>>> online DRBD verification every now and then.
>>>>>>
>>>>>> Primary node:
>>>>>> [236515.107301] block drbd0: Starting Online Verify from sector 6344796192
>>>>>> [241421.058609] block drbd0: Digest mismatch, buffer modified by upper
>>>>>> layers during write: 17187211632s +4096
>>>>>> Secondary node:
>>>>>> [81647.559278] block drbd0: Online Verify start sector: 6344796192
>>>>>> [86553.382954] block drbd0: Digest integrity check FAILED: 17187211632s
>>>>>> +4096
>>>>>>
>>>>>> It then disconnects, connects, resyncs and goes on, but verify is aborted.
>>>>>>
>>>>>> I've read about this error (which is arguably an error), the idea behind
>>>>>> is that some application or kernel is modifying the data buffer while
>>>>>> it is being written to the block device before getting an ack that the
>>>>>> buffer is written.
>>>>>>
>>>>>> So, the questions:
>>>>>> 1. Does SCST really do this nasty kind of thing?
>>>>>> 2. If not, why did that start to happen?
>>>>>> 3. Why does this occur only on one of DRBD devices?
>>>>>> The other one has been verifying for 30+ hours now without a problem.
>>>>>> Maybe that's related to it having other i/o pattern, I don't know.
>>>>>>
>>>>>> Glad to hear any suggestions, thanks in advance.
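
For readers of the archive, the failure mode described in the original message (a buffer that changes while a write is still in flight) can be sketched in user space. This is only an illustration of why a digest computed at submit time stops matching the data that is eventually transferred. It is not how SCST, DRBD or the kernel implement anything; the function name and its parameters are hypothetical, and a private copy merely stands in for the guarantee that stable pages or a bounce buffer would give.

```python
import hashlib


def write_survives_digest_check(buf, mutate_in_flight=False, stable_copy=False):
    """Simulate one replicated write (hypothetical helper, not a DRBD API).

    The digest is taken when the write is submitted; the payload is read
    again later, as the peer would see it. Returns True if both digests match.
    """
    # Without a stable copy the "in-flight" payload is the caller's own buffer.
    payload = bytes(buf) if stable_copy else buf

    digest_at_submit = hashlib.sha1(payload).hexdigest()

    if mutate_in_flight:
        # An upper layer touches the page before the write is acknowledged.
        buf[0] ^= 0xFF

    digest_at_peer = hashlib.sha1(payload).hexdigest()
    return digest_at_submit == digest_at_peer


page = bytearray(4096)
print(write_survives_digest_check(page))
print(write_survives_digest_check(page, mutate_in_flight=True))
print(write_survives_digest_check(page, mutate_in_flight=True, stable_copy=True))
```

The second call models the "buffer modified by upper layers during write" case: the shared buffer changes between the two digest computations. The third call shows that the same modification is harmless once the in-flight data can no longer change underneath the writer.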