Thread: [Mailsync-list] mailsync message identification method question

[Mailsync-list] mailsync message identification method question

From: Gunter O. <G.O...@po...> - 2006-11-07 08:09:54

Hi!

I'm sorry I cannot offer any accessories for inter-human relationships, as
most other posters here do at the moment. ;)

Instead I just have a short question concerning mailsync.

I'm currently migrating my mail folder hierarchy to a new IMAP server
using mailsync. During the sync mailsync informed me that - as is also
mentioned in the man page - my drafts do not yet carry a message ID and
thus could not be synced.

* Can I just safely switch from Message-ID-based message identification to
  MD5 for all following syncs?

* Will mailsync automagically detect the change and rebuild its internal
  sync tracking information, or will I have to delete its internal
  tracking message?

The man page states that MD5 message identification will hash
the "From", "To", "Subject", "Date" and "Message-ID" header fields to a
message fingerprint which is used instead of just the "Message-ID"
header.

* Wouldn't it reduce the admittedly small probability of a hash collision
  if the "Message-ID" header were not hashed into the MD5 but just
  appended? I.e. the message identity fingerprint would consist of an MD5
  hash of the "From", "To", "Subject" and "Date" header fields,
  concatenated with the "Message-ID" header field. In this case a hash
  collision would only misclassify two messages as identical if they already
  have the same "Message-ID" header, which they normally should not have
  in the first place.

* Wouldn't it make sense to also hash the message's octet size into the
  message identification hash if MD5 is used? Or may the computable
  message size differ slightly depending on the mail store backend?
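
To make the two schemes in the questions above concrete, here is a minimal Python sketch. It is not mailsync's actual code: the function names, the newline separator between fields, the ":" joiner and the handling of missing headers are all assumptions chosen for illustration. Only the list of header fields comes from the man page as quoted above.

```python
import hashlib
from email import message_from_string

# Header fields named in the man page; Message-ID is handled separately below.
FIELDS = ("From", "To", "Subject", "Date")

def header_value(msg, name):
    # Assumption: a missing header (e.g. a draft without Message-ID)
    # simply contributes an empty string.
    return str(msg.get(name) or "")

def md5_fingerprint(raw):
    """Hash all five header fields together (the scheme the man page describes)."""
    msg = message_from_string(raw)
    joined = "\n".join(header_value(msg, f) for f in FIELDS + ("Message-ID",))
    return hashlib.md5(joined.encode("utf-8", "replace")).hexdigest()

def dual_fingerprint(raw):
    """Proposed variant: MD5 over four fields, Message-ID appended verbatim."""
    msg = message_from_string(raw)
    joined = "\n".join(header_value(msg, f) for f in FIELDS)
    digest = hashlib.md5(joined.encode("utf-8", "replace")).hexdigest()
    return digest + ":" + header_value(msg, "Message-ID")

# Hypothetical sample messages: a draft without Message-ID and a sent copy with one.
draft = (
    "From: a@example.org\nTo: b@example.org\nSubject: hi\n"
    "Date: Tue, 7 Nov 2006 08:09:54 +0100\n\nbody\n"
)
sent = draft.replace("Subject: hi", "Subject: hi\nMessage-ID: <1@example.org>")

print(md5_fingerprint(draft), md5_fingerprint(sent))
print(dual_fingerprint(draft), dual_fingerprint(sent))
```

With the appended variant, two messages compare equal only if both the 32-character digest and the literal Message-ID match; a draft without a Message-ID falls back to the digest alone.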

I hope someone can shed a bit of light on these issues for me. :-)

Greetings,

  Gunter


Re: [Mailsync-list] mailsync message identification method question

From: Holger K. <hol...@gm...> - 2006-11-07 11:57:59

Gunter Ohrner wrote:

> I'm currently migrating my mail folder hierarchy to a new IMAP server
> using mailsync. During the sync mailsync informed me that - as it is also
> mentioned in the man page - my drafts do not yet carry a message ID and
> thus could not be synced.
> 
> * Can I just safely switch from Message-ID-based message identification to
>   MD5 for all following syncs?
> 
> * Will mailsync automagically detect the change and rebuild its internal
>   sync tracking information, or will I have to delete its internal
>   tracking message?

I assume both questions get a yes as an answer. But that is only because I never noticed any error when I did so, not because of any insight into the inner workings of mailsync. It could be just coincidence.

But as you have your old emails safely stored on the old IMAP server, you should just try it, and if things fail, delete all emails on the new one and start over.

 
> * Wouldn't it reduce the admittedly small probability of a hash collision
>   if the "Message-ID" header would not be hashed into the MD5 but just
>   appended? i.e. the message identity fingerprint would consist of an MD5
>   hash of the "From", "To", "Subject" and "Date" header fields,
>   concatenated with the "Message-ID" header field. In this case a hash
>   collision would only misclassify two messages as identical which already
>   have the same "Message-ID" header, which they normally should not have
>   in the first place.

If they don't have the same Message-ID they can't get the same hash. So no collision anyway. Appending has no advantage over hashing. Just makes comparison uglier.

 
> * Wouldn't it make sense to also hash the message's octet size into the
>   message identification hash if MD5 is used?

> Or may the computable
>   message size differ slightly depending on the mail store backend?

Unfortunately yes. You would need to read the whole mail and count bytes.

Re: [Mailsync-list] mailsync message identification method question

From: Gunter O. <G.O...@po...> - 2006-11-07 16:04:46

On Tuesday, 7 November 2006 12:57, Holger Krull wrote:
> > * Can I just safely switch from Message-ID-based message
> > identification to MD5 for all following syncs?
> I assume both questions get a yes as an answer. But that is only because
> I never noticed any error when I did so, not insight in the inner
> workings of mailsync. Could be just coincidence.

OK, that's good enough for me. ;) I just wanted to know whether changing the
algorithm was guaranteed to nuke both mail stores or whether it had a chance
to work. ;)

Thanks a lot, I just started a new sync with the md5 algorithm.

> But as you have your old emails safely stored on the old IMAP server
> you should just try and if things fail: delete all emails on the new
> one and start over.

Hu, I'd rather like to avoid that - the old server is rather slow, is
attached to a thin pipe and the amount of data is pretty massive... ;)

> > * Wouldn't it reduce the admittedly small probability of a hash
> > collision if the "Message-ID" header would not be hashed into the MD5
> > but just appended? i.e. the message identity fingerprint would consist

> If they don't have the same Message-ID they can't get the same hash. So

Well, they can. As I mentioned, I admit that it's a rather unlikely event
and it probably won't ever happen, but according to Murphy's law, if it
DOES happen, it won't just be like shooting myself in the foot but rather
comparable to blowing my own leg off... And I'd rather like to avoid that
experience. ;)

> Appending has no advantage over hashing. Just makes comparison uglier.

Well, it'd make the pretty unlikely collision even more unlikely, as in
this case an md5 hash collision and an equal Message-ID must happen at
the same time to different messages in the same folder. That'd be
something even I would call "practically impossible".

> > Or may the computable message size differ slightly depending on the
> > mail store backend?
> Unfortunately yes. You would need to read the whole mail and count
> bytes.

Ah, I see. That's bad.

Greetings,

  Gunter


Re: [Mailsync-list] mailsync message identification method question

From: Holger K. <hol...@gm...> - 2006-11-07 16:17:13

Gunter Ohrner wrote:

> Hu, I'd rather like to avoid that - the old server is rather slow, is
> attached to a thin pipe and the amount of data is pretty massive... ;)

But computers do work at night without supervision...

>> Appending has no advantage over hashing. Just makes comparison uglier.
>
> Well, it'd make the pretty unlikely collision even more unlikely, as in
> this case an md5 hash collision and an equal Message-ID must happen at
> the same time to different messages in the same folder. That'd be
> something even I would call "practically impossible".

That doesn't account for a missing message ID, does it?
And an md5 hash collision on a string like a typical (existing) message ID is "practically impossible", I believe. I must have a book somewhere that has probability calculations for md5 (improbability, in that case). I suddenly feel the urge for a hot cup of tea.
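
For a rough sense of the numbers, the standard birthday approximation can be applied. This is only an estimate under stated assumptions, none of which come from mailsync itself: MD5 is treated as a uniformly random 128-bit function on ordinary (non-adversarial) header strings, and a folder is assumed to hold n distinct messages, with n = 10^6 picked purely as an example.

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{128}},
\qquad
n = 10^{6} \;\Rightarrow\;
P \approx \frac{5 \times 10^{11}}{3.4 \times 10^{38}} \approx 1.5 \times 10^{-27}
```

The estimate covers accidental collisions only; deliberately constructed MD5 collisions are a separate matter.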

Re: [Mailsync-list] mailsync message identification method question

From: Gunter O. <G.O...@po...> - 2006-11-07 18:30:27

On Tuesday, 7 November 2006 17:16, Holger Krull wrote:
> > Hu, I'd rather like to avoid that - the old server is rather slow,
> But computers do work at night without supervisor...

Yes, but the last sync ran for several days. I could have done that again if
it had been really necessary, but it would still have been a bit
annoying.

> >> Appending has no advantage over hashing. Just makes comparison
> > Well, it'd make the pretty unlikely collision even more unlikely, as
> That doesn't account for a missing message id, does it?

Mh, it does, as in this case there would still be the md5 hashes which
would distinguish the mails, so in this case there would be no difference
to the status quo.

> And md5 hash collision on a string like a typical (existing) message id
> is "practically impossible" i believe.

Yes, it probably is, and I'm also trying md5 message identification right
now. Still, I'm not sure I'd give guarantees that a collision will never
ever happen if mailsync is used a lot and on large mail stores. However,
if I really want this third "dual" message identification algorithm I can
write a patch, so I'll just shut up on this topic for now. ;)

Greetings,

  Gunter


Re: [Mailsync-list] mailsync message identification method question

From: Gunter O. <G.O...@po...> - 2006-11-07 18:22:43

On Tuesday, 7 November 2006 17:04, Gunter Ohrner wrote:
> Thanks a lot, I just started a new sync with the md5 algorithm.

OK, this one failed badly, causing mailsync to abort with a SIGABRT nearly
immediately. Details can be found below if anyone is interested; I took it
as a sign to restart with a fresh msinfo location.

With a fresh msinfo location mailsync started its work; however, it
began duplicating some mails in some folders which definitely had not
been touched except by the IMAP servers on both sides (of course) and
mailsync itself. It did not duplicate all mails in the affected
folders, just some, and not all folders were affected (or maybe simply
not all folders contain messages which trigger this behaviour).

I randomly chose one mail to which this had happened - it was now
available twice in both stores - and saved it into a text file using my
MUA. A diff on the text files displayed a changed X-UID on the remote side
as the only change:

******************************
$ for OLD in 1 2 ; do for NEW in 1 2 ; do echo "Original store #$OLD; new store #$NEW:" ; diff "msg${OLD}_origstore.txt" "msg${NEW}_newstore.txt" ; done ; done
Original store #1; new store #1:
Original store #1; new store #2:
41c41
< X-UID: 16
---
> X-UID: 55
Original store #2; new store #1:
41c41
< X-UID: 52
---
> X-UID: 16
Original store #2; new store #2:
41c41
< X-UID: 52
---
> X-UID: 55
$ diff msg1_origstore.txt msg2_origstore.txt
41c41
< X-UID: 16
---
> X-UID: 52
$
******************************

Is this more likely the problem's cause or one of its effects?
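
A line-based diff depends on header order and folding, so a per-header comparison can be a useful cross-check. The following is a minimal sketch using Python's standard email package; the helper name differing_headers and the two sample messages are made up for illustration and are not the actual mails from the stores.

```python
from email import message_from_string

def differing_headers(raw_a, raw_b):
    """Return {header name: (values in a, values in b)} for headers that differ."""
    a, b = message_from_string(raw_a), message_from_string(raw_b)
    # Header names are case-insensitive, so compare them lowercased.
    names = {k.lower() for k in a.keys()} | {k.lower() for k in b.keys()}
    diff = {}
    for name in sorted(names):
        va, vb = a.get_all(name, []), b.get_all(name, [])
        if va != vb:
            diff[name] = (va, vb)
    return diff

# Hypothetical copies of one mail that differ only in X-UID.
copy_1 = "Message-ID: <x@example.org>\nSubject: test\nX-UID: 16\n\nbody\n"
copy_2 = "Message-ID: <x@example.org>\nSubject: test\nX-UID: 52\n\nbody\n"

print(differing_headers(copy_1, copy_2))
```

If only X-UID shows up in the result, none of the fields the man page lists for the MD5 fingerprint differ between the copies.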

I aborted the run and re-ran "mailsync -s" just to see what the next
mailsync run would do. It detected the (newly created) duplicates in both
stores and removed them from both stores, just to copy them again...
Funky...

Does anyone have any clue what's happening here? So far it looks as if mailsync
does not lose mail and does not duplicate mails indefinitely (as
it detects its self-created duplicates), but I'm not sure I'd like to put
my mail in the hands of software which shows such effects...

The remote site (new store) is running a Cyrus 2.2.12 server
(v2.2.12-Debian-2.2.12-4ubuntu1), my local server is a Dovecot 1.0beta5.
The only MUA which accessed both IMAP stores in the meantime is kMail
1.9.5.

I wonder what's going on there...

Greetings,

  Gunter


PS: Mailsync-crash with "-t md5" and a "-t msgid" msinfo store:

I installed Debian Sarge's mailsync package, so the backtrace is probably
entirely useless, though I included it anyway just for the sake of
completeness.

It only crashed when called with "-t md5", including when simulating.

The output:

******************************
$ mailsync -s -t md5 do-sync
Only simulating
Synchronizing stores "schli-store" <-> "local-store"...
Authorizing against {h1050806.serverkompetenz.net/imap}
Authorizing against {Hb/imap}
Authorizing against {Hb/imap}
Authorizing against {Hb/imap}
Aborted (core dumped)
$
******************************

The backtrace:

******************************
(gdb) bt
#0  0x4032b741 in kill () from /lib/libc.so.6
#1  0x4032b4c5 in raise () from /lib/libc.so.6
#2  0x4032ca08 in abort () from /lib/libc.so.6
#3  0x4fb5f317 in __cxa_call_unexpected () from /usr/lib/libstdc++.so.5
#4  0x4fb5f354 in std::terminate () from /usr/lib/libstdc++.so.5
#5  0x4fb5f4c6 in __cxa_throw () from /usr/lib/libstdc++.so.5
#6  0x4fb14f90 in std::__throw_out_of_range () from /usr/lib/libstdc++.so.5
#7  0x0805f39c in std::operator+<char, std::char_traits<char>, std::allocator<char> > ()
#8  0x0805cff2 in std::operator+<char, std::char_traits<char>, std::allocator<char> > ()
#9  0x0804a9be in ?? ()
#10 0x40317e36 in __libc_start_main () from /lib/libc.so.6
#11 0x0804a401 in ?? ()
******************************



Re: [Mailsync-list] mailsync message identification method question

From: Holger K. <hol...@gm...> - 2006-11-09 17:02:04

Gunter Ohrner wrote:
> On Tuesday, 7 November 2006 17:04, Gunter Ohrner wrote:
>> Thanks a lot, I just started a new sync with the md5 algorithm.
> 
> Ok, this one failed badly; causing mailsync to abort with a SIGABRT nearly 
> immediately. Details can be found below if anyone is interested, took it 
> as a sign to restart with a fresh msinfo location.

Sorry to hear that. I just tested it here; it crashes too if it is working with the same msinfo file.
I got some dupes too (having a much larger mailbase than the last time I did this). With md5, any change to a stored mail makes the mail appear to be new, so the duplication is not really surprising.
After I deleted the dupes they didn't reappear. I guess if you use a program that changes the contents of a stored mail you will always have this problem when using md5. If you connect to both IMAP stores with kmail, is kmail changing the X-UID?

Re: [Mailsync-list] mailsync message identification method question

From: Gunter O. <G.O...@po...> - 2006-11-19 13:57:03

On Thursday, 9 November 2006 18:01, Holger Krull wrote:
> mailbase as the last time i did this). With md5 any change to a stored
> mail makes the mail appear to be new, so the duplication is not really
> surprising.

Ah, that is surprising after just reading the man page; I had expected mailsync
to only trigger on changes to the header fields listed there.

> stores with kmail, is kmail changing the X-UID?

I'm not sure, though I do not know why it should. I didn't even actively
open the mail folders in question; the only thing kMail did was examine
these folders for new mail using the regular mail-checking function. I
never explicitly opened the mails in question. However, I do not know what
kMail does "under the hood", so I cannot say for sure that it didn't
change these headers.

Greetings,

  Gunter


Re: [Mailsync-list] mailsync message identification method question

From: Holger K. <hol...@gm...> - 2006-11-19 15:57:30

Gunter Ohrner wrote:
> On Thursday, 9 November 2006 18:01, Holger Krull wrote:
>> mailbase as the last time i did this). With md5 any change to a stored
>> mail makes the mail appear to be new, so the duplication is not really
>> surprising.
>
> Ah, it is surprising after just reading the man page, I had expected mailsync
> to only trigger on changes to the header fields listed there.
>
> 

You're right and I was wrong.
Mailsync is supposed to look only at the header fields, which now puzzles me a lot, because I don't understand why some mails get copied or duplicated in the first place.
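
The puzzle can be illustrated with a small Python sketch: a fingerprint built only from the five header fields named in the man page does not change when X-UID changes, whereas a hash over the whole message does. This is an illustration under assumptions, not mailsync's implementation; the function names, the field separator and the sample message are hypothetical.

```python
import hashlib
from email import message_from_string

# Header fields listed in the man page for MD5 identification.
FIELDS = ("From", "To", "Subject", "Date", "Message-ID")

def header_md5(raw):
    """MD5 over the listed header fields only (assumed newline separator)."""
    msg = message_from_string(raw)
    joined = "\n".join(str(msg.get(f) or "") for f in FIELDS)
    return hashlib.md5(joined.encode("utf-8", "replace")).hexdigest()

def whole_md5(raw):
    """MD5 over the complete message text, for contrast."""
    return hashlib.md5(raw.encode("utf-8", "replace")).hexdigest()

# Hypothetical message; only the X-UID value varies between the two copies.
TEMPLATE = (
    "From: a@example.org\nTo: b@example.org\nSubject: test\n"
    "Date: Tue, 7 Nov 2006 18:22:43 +0100\nMessage-ID: <x@example.org>\n"
    "X-UID: {uid}\n\nbody\n"
)
copy_16, copy_52 = TEMPLATE.format(uid=16), TEMPLATE.format(uid=52)

print(header_md5(copy_16) == header_md5(copy_52))  # header-only fingerprints match
print(whole_md5(copy_16) == whole_md5(copy_52))    # whole-message hashes differ
```

If mailsync hashes only those fields, a changed X-UID alone should not make a mail look new, so the duplicates would have to come from something else.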