From: Agata Kruszona-Z. <ch...@mo...> - 2020-06-10 13:28:01
Hi,

If I may suggest - ask this on GitHub. The community there is much more active than here; someone might have a solution that they can share with you.

Regards,
Agata

On 05.06.2020 at 13:42, Markus Köberl wrote:
> Are there nagios/icinga scripts available for monitoring? This should be quite easy using mfscli.
> I guess somebody already did the work and is willing to share the scripts.
>
> Thank you,
> Markus Köberl
> --
--
Agata Kruszona-Zawadzka
MooseFS Team
From: Markus K. <mar...@tu...> - 2020-06-05 12:01:01
Hi!

In the FAQ section there are 3 methods mentioned for verifying the connection to the mfsmaster. I wonder what the best way is to test (inside a systemd.service) whether I can read and write files, also in case my dedicated network for communication with the chunkservers is not working. My goal would be to prevent a service (slurmd) from starting in that case.

For preventing slurmd.service from starting when the MooseFS file system is not mounted, I have something like this in mind:

  # /etc/systemd/system/slurmd.service.d/override.conf
  [Unit]
  After=remote-fs.target
  AssertPathIsMountPoint=/mnt/mfs

The question is: how can I make sure that the communication is working and I can read and write files?

Thank you,
Markus Köberl
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: mar...@tu...
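One way to answer the read/write part of the question above is a small probe script run as an ExecStartPre= step (or a separate oneshot unit) before slurmd. This is only a sketch, not from the thread; the mount point, timeout and probe-file name are placeholder assumptions:

  #!/bin/bash
  # Probe read/write access on the MooseFS mount before starting a dependent
  # service. Exits non-zero if the mount is missing, hung, or not writable.
  set -u
  MNT=/mnt/mfs                                 # assumed mount point
  PROBE="$MNT/.rw-probe-$(hostname)-$$"

  mountpoint -q "$MNT" || exit 1               # not mounted at all

  # Bound every filesystem operation so a hung mount cannot block the unit.
  echo ok | timeout 10 tee "$PROBE" >/dev/null || exit 1
  timeout 10 grep -q ok "$PROBE"               || exit 1
  timeout 10 rm -f "$PROBE"                    || exit 1
  exit 0

Hooked into the override shown above, this would sit under a [Service] section as something like ExecStartPre=/usr/local/sbin/check-mfs-rw.sh (a hypothetical path); whether it behaves well when only the chunkserver network is down still depends on the mfsmount timeouts in use.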
From: Markus K. <mar...@tu...> - 2020-06-05 12:00:35
Are there nagios/icinga scripts available for monitoring? This should be quite easy using mfscli. I guess somebody already did the work and is willing to share the scripts.

Thank you,
Markus Köberl
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: mar...@tu...
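For anyone reading the archive later: a check along these lines can be built on top of mfscli. The plugin below is only a sketch; the -H and -SIN options are taken from mfscli's built-in help, but the output text being grepped is an assumption - verify both against your MooseFS version.

  #!/bin/bash
  # Minimal nagios/icinga-style probe: query the master's info section via
  # mfscli, report CRITICAL if the master is unreachable and WARNING if the
  # output mentions missing chunks. Output parsing is best-effort only.
  MASTER="${1:-mfsmaster}"

  if ! OUT=$(mfscli -H "$MASTER" -SIN 2>&1); then
      echo "CRITICAL: cannot query master $MASTER: $OUT"
      exit 2
  fi

  MISSING=$(printf '%s\n' "$OUT" | awk -F: 'tolower($0) ~ /missing/ {gsub(/[^0-9]/,"",$2); print $2; exit}')
  if [ -n "$MISSING" ] && [ "$MISSING" != "0" ]; then
      echo "WARNING: $MISSING missing chunks reported by $MASTER"
      exit 1
  fi

  echo "OK: master $MASTER reachable, no missing chunks reported"
  exit 0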
From: Jay L. <jl...@sl...> - 2020-04-13 15:46:42
Hi,

Just to provide a quick update. I confirmed that I have enough disk space and updated MooseFS to the latest version, and the problem persists. I see nothing in the kernel logs that coincides with these errors.

The error always occurs at the top of the hour (for example 10:00:00, 3:00:00, 21:00:00 or 18:00:00) and occurs at most every 3 hours, but often less frequently than that. (The time between instances is inconsistent.)

I am still scratching my head on this one. Any other ideas?

JL

On Fri, Apr 10, 2020 at 5:43 AM Agata Kruszona-Zawadzka <ch...@mo...> wrote:
> On 10.04.2020 at 01:11, Jay Livens wrote:
> > Hi,
> >
> > I see this periodically in the logs and am not sure what it means.
> > Everything seems to be working fine, but I wanted to confirm that this
> > did not signal some other form of problem.
> >
> > I found the actual MFS code
> > <https://github.com/moosefs/moosefs/blob/master/mfsmaster/bgsaver.c>
> > generating this, but I am not familiar enough with C to understand what
> > it means.
>
> Hi,
>
> This means your master encountered a write error while trying to save
> data to a changelog file. This might mean any kind of error: during
> open, write, fsync or close. Do you have enough space in your metadata
> directory (usually "/var/lib/mfs", but check your config)? Is the hdd in
> your master server working properly? Are there any generic (kernel)
> errors surrounding this message in your logs? Are you using the latest
> version of MooseFS?
>
> Agata
>
> --
> Agata Kruszona-Zawadzka
> MooseFS Team
>
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: Agata Kruszona-Z. <ch...@mo...> - 2020-04-10 09:43:22
On 10.04.2020 at 01:11, Jay Livens wrote:
> Hi,
>
> I see this periodically in the logs and am not sure what it means.
> Everything seems to be working fine, but I wanted to confirm that this
> did not signal some other form of problem.
>
> I found the actual MFS code
> <https://github.com/moosefs/moosefs/blob/master/mfsmaster/bgsaver.c>
> generating this, but I am not familiar enough with C to understand what
> it means.

Hi,

This means your master encountered a write error while trying to save data to a changelog file. This might mean any kind of error: during open, write, fsync or close. Do you have enough space in your metadata directory (usually "/var/lib/mfs", but check your config)? Is the hdd in your master server working properly? Are there any generic (kernel) errors surrounding this message in your logs? Are you using the latest version of MooseFS?

Agata

--
Agata Kruszona-Zawadzka
MooseFS Team
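The checks Agata lists map onto a few quick commands on the master host. This is just a convenience summary, not from the thread; the device name is a placeholder and the metadata path assumes the default DATA_PATH:

  # Run on the master server; adjust paths and devices to your setup.
  df -h /var/lib/mfs                                   # free space in the metadata directory
  dmesg -T | grep -iE 'error|ata|nvme' | tail -n 20    # recent kernel-level disk errors
  smartctl -H /dev/sda                                 # SMART health of the metadata disk (placeholder device)
  mfsmaster -v                                         # confirm which MooseFS version the master runs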
From: Jay L. <jl...@sl...> - 2020-04-10 00:14:00
Hi,

I see this periodically in the logs and am not sure what it means. Everything seems to be working fine, but I wanted to confirm that this did not signal some other form of problem.

I found the actual MFS code <https://github.com/moosefs/moosefs/blob/master/mfsmaster/bgsaver.c> generating this, but I am not familiar enough with C to understand what it means.

Thank you,

Jay
----------------------------
Jay Livens
jl...@sl...
(617)875-1436
----------------------------
From: Agata Kruszona-Z. <ch...@mo...> - 2020-01-28 12:45:26
> From: Jay Livens <jl...@sl...>
> Date: Tue, Jan 21, 2020 at 2:12 AM
>
> Hi,
>
> Every now and then, I need to restart the mfsmaster service for reasons
> such as upgrading MooseFS or perhaps a reboot on the MFS node stemming
> from a kernel upgrade. My system only has one mfsmaster instance and so
> failing over is not an option. This is in a home environment and so the
> traffic on my MooseFS instance is low; however, I want to be sure that I
> restart MooseFS in the safest way. I have tried the following:
>
> 1. Stop all chunkservers (cgiservers and metaloggers) and then the
> master. Start the master service and then start all the slaves. (I
> have automated this with Ansible, but it is still a bit bothersome.)
>
> 2. Just bounce the mfsmaster processes (mfsmaster, mfscgi
> and mfschunkserver) and leave all chunkservers up. This worked, but I
> worry that doing so could cause corruption issues. (It did not seem to
> when I tried it.)
>
> Anyway, I am trying to get the perspectives of the experts here and
> understand how you folks handle this.
>
> TIA

Hi,

If you only want to restart your master, you don't need to stop everything. To be on the safe side, you should stop any client processes that write data to your MooseFS instance (via mfsmount) before you restart your master, but it's not necessary to stop chunkservers.

--
Agata Kruszona-Zawadzka
MooseFS Team
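In command form, the procedure Agata describes boils down to something like the sequence below. The unit names match the official MooseFS packages but are an assumption for other setups, and "my-writer" stands in for whatever client processes write via mfsmount:

  # On machines that write to the mount (example service name):
  systemctl stop my-writer

  # On the master host - chunkservers, metaloggers and cgiserv stay up:
  systemctl restart moosefs-master

  # Resume the writers once the master is back:
  systemctl start my-writer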
From: Jay L. <jl...@sl...> - 2020-01-21 01:11:42
Hi,

Every now and then, I need to restart the mfsmaster service for reasons such as upgrading MooseFS or perhaps a reboot on the MFS node stemming from a kernel upgrade. My system only has one mfsmaster instance, so failing over is not an option. This is in a home environment and the traffic on my MooseFS instance is low; however, I want to be sure that I restart MooseFS in the safest way. I have tried the following:

1. Stop all chunkservers (cgiservers and metaloggers) and then the master. Start the master service and then start all the slaves. (I have automated this with Ansible, but it is still a bit bothersome.)

2. Just bounce the mfsmaster processes (mfsmaster, mfscgi and mfschunkserver) and leave all chunkservers up. This worked, but I worry that doing so could cause corruption issues. (It did not seem to when I tried it.)

Anyway, I am trying to get the perspectives of the experts here and understand how you folks handle this.

TIA

JL
From: Tim G. <Ti...@ns...> - 2019-12-05 11:47:55
Sorry for the delay. I meant that I have an established mount currently outside of MooseFS, but I don't have the ability to bring that mount down (well, only with great difficulty). I wanted to create a MooseFS solution and somehow add that already established mount into it. If you can't do that - and I know it was a curveball question - can anyone think how I could do it? I guess some kind of copy from the old mount to the new MooseFS mount, pause the solution for a second and change the mount.

Tim

From: Wilson, Steven M [mailto:st...@pu...]
Sent: 15 November 2019 13:43
To: Aleksander Wieliczko; Tim Guy
Cc: moo...@li...
Subject: Re: [MooseFS-Users] New install, using established mount

I could be mistaken but it sounds like Tim wants to put up one MooseFS chunkserver and mount a MooseFS filesystem from that chunkserver with a goal of 1. The 40TB of data is then copied to this MooseFS filesystem. And then later a new MooseFS chunkserver is added and the filesystem goal will be set to 2. If that is what is being asked, then it is possible to do it this way. You can change the goal of an existing MooseFS filesystem from 1 to 2 and it will start making new copies of each file in the background onto the new chunkserver. Is that what you are asking, Tim?

Steve

________________________________
From: Aleksander Wieliczko <ale...@mo...>
Sent: Friday, November 15, 2019 3:21 AM
To: Tim Guy <Ti...@ns...>
Cc: moo...@li...
Subject: Re: [MooseFS-Users] New install, using established mount

Hi,

I believe that I misunderstood your question. You have to copy files to MooseFS. There is no other way to put files inside MooseFS.

Best regards,
Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Mon, 4 Nov 2019 at 12:30, Tim Guy <Ti...@ns...> wrote:

Hi everyone.

I have a cloud backup service that I offer that runs on a CentOS 7 VM with a /mnt mount to a separate RAID 6 storage server.

I have around 40TB of data but no replication/backup as such apart from the RAID 6. I need to rectify this but struggle to get downtime on the CentOS VM to be able to change anything.

Is it possible to introduce an established Linux mount into a MooseFS file system for that mount to be goal 1, to then duplicate through to newer hardware for the 2nd goal?

Or will I have to copy the data from an established mount to the new MooseFS file system already configured?

I hope that makes sense.

Regards

Tim

_________________________________________
moosefs-users mailing list
moo...@li...
https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: William K. <wil...@gm...> - 2019-12-05 11:03:44
The MooseFS documentation states:

"Note: If number of Chunkservers in maintenance mode is equal or greater than 20% of all Chunkserver, MooseFS treats all Chunkservers like maintenance mode wouldn’t be enabled at all."

Is there a way to change that 20% to 25%? I have a combination of disk numbers and sizes that means maintenance mode is never entered. As well, it would be nice if it was per disk rather than per server.

BillK
From: William K. <wil...@gm...> - 2019-11-30 02:39:52
Does MooseFS have the ability to run replication and master/chunkserver communications on a separate network from client communications? Is there a guide? I presume it's using the topology file?

BillK
From: Casper L. <cas...@pr...> - 2019-11-27 16:02:54
Hi Dave,

I would definitely advise goal 3. An undetected bad chunk combined with a single server outage could make a chunk unavailable.

From a performance perspective it depends on too many factors to say something sensible: write/read workload ratio, disk speeds, number of disks, file sizes, whether you do many meta-operations, whether there is a steady throughput or spikes in which a lot of data is requested. In our case there is not much performance gain or loss.

When network I/O is a bottleneck, you could reduce the number of disks per chunkserver. If IOPS is a problem, you could reduce the used size per disk (i.e. increase the number of disks).

Because increasing the goal from 2 to 3 does not actually decrease the number of read operations on the cluster as a whole, you are not really 'relieving' much stress on the disks on average. Unless there is one big file that is constantly read, we can assume that read operations are balanced throughout the cluster anyway. Obviously the total number of write operations on disks in the cluster will increase, but the copies are written after the first chunk is stored, so writing won't be slower; however, as more copies are written, writes can affect other reads.

Greetings,
Casper

On Tue, 26 Nov 2019 at 20:20, David Myer via moosefs-users <moo...@li...> wrote:
> Dear MFS users,
>
> Out of curiosity, would increasing the number of file replicas across more
> disks reduce read times by spreading the read load? I have a lot of spare
> space and I thought it might be worth using it for this reason. My replica
> goal is currently 2.
>
> Thanks,
> Dave
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: David M. <dav...@pr...> - 2019-11-26 19:19:24
Dear MFS users,

Out of curiosity, would increasing the number of file replicas across more disks reduce read times by spreading the read load? I have a lot of spare space and I thought it might be worth using it for this reason. My replica goal is currently 2.

Thanks,
Dave

Sent with ProtonMail (https://protonmail.com) Secure Email.
From: Wilson, S. M <st...@pu...> - 2019-11-15 20:17:56
I could be mistaken but it sounds like Tim wants to put up one MooseFS chunkserver and mount a MooseFS filesystem from that chunkserver with a goal of 1. The 40TB of data is then copied to this MooseFS filesystem. And then later a new MooseFS chunkserver is added and the filesystem goal will be set to 2. If that is what is being asked, then it is possible to do it this way. You can change the goal of an existing MooseFS filesystem from 1 to 2 and it will start making new copies of each file in the background onto the new chunkserver. Is that what you are asking, Tim?

Steve

________________________________
From: Aleksander Wieliczko <ale...@mo...>
Sent: Friday, November 15, 2019 3:21 AM
To: Tim Guy <Ti...@ns...>
Cc: moo...@li...
Subject: Re: [MooseFS-Users] New install, using established mount

Hi,

I believe that I misunderstood your question. You have to copy files to MooseFS. There is no other way to put files inside MooseFS.

Best regards,
Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Mon, 4 Nov 2019 at 12:30, Tim Guy <Ti...@ns...> wrote:

Hi everyone.

I have a cloud backup service that I offer that runs on a CentOS 7 VM with a /mnt mount to a separate RAID 6 storage server.

I have around 40TB of data but no replication/backup as such apart from the RAID 6. I need to rectify this but struggle to get downtime on the CentOS VM to be able to change anything.

Is it possible to introduce an established Linux mount into a MooseFS file system for that mount to be goal 1, to then duplicate through to newer hardware for the 2nd goal?

Or will I have to copy the data from an established mount to the new MooseFS file system already configured?

I hope that makes sense.

Regards

Tim

_________________________________________
moosefs-users mailing list
moo...@li...
https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: Aleksander W. <ale...@mo...> - 2019-11-15 08:22:02
Hi,

I believe that I misunderstood your question. You have to copy files to MooseFS. There is no other way to put files inside MooseFS.

Best regards,
Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Mon, 4 Nov 2019 at 12:30, Tim Guy <Ti...@ns...> wrote:
> Hi everyone.
>
> I have a cloud backup service that I offer that runs on a CentOS 7 VM with
> a /mnt mount to a separate RAID 6 storage server.
>
> I have around 40TB of data but no replication/backup as such apart from
> the RAID 6. I need to rectify this but struggle to get downtime on the
> CentOS VM to be able to change anything.
>
> Is it possible to introduce an established Linux mount into a MooseFS file
> system for that mount to be goal 1, to then duplicate through to newer
> hardware for the 2nd goal?
>
> Or will I have to copy the data from an established mount to the new
> MooseFS file system already configured?
>
> I hope that makes sense.
>
> Regards
>
> Tim
>
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: Aleksander W. <ale...@mo...> - 2019-11-15 07:59:25
Hi,

If I understand you correctly, you want to have resources in GOAL 1 on a specific chunkserver (with a RAID 6) and then move chunks to different chunkservers, but in GOAL 2? If yes, you can use LABELS and a storage class with the archive option. You can assign LABELS to specific chunkservers and then create a storage class that will keep chunks in goal 1 on one label; after some time they will be moved to a different label.

Please refer to our labels manual:
https://moosefs.com/Content/Downloads/moosefs-storage-classes-manual.pdf

Best regards,
Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Mon, 4 Nov 2019 at 12:30, Tim Guy <Ti...@ns...> wrote:
> Hi everyone.
>
> I have a cloud backup service that I offer that runs on a CentOS 7 VM with
> a /mnt mount to a separate RAID 6 storage server.
>
> I have around 40TB of data but no replication/backup as such apart from
> the RAID 6. I need to rectify this but struggle to get downtime on the
> CentOS VM to be able to change anything.
>
> Is it possible to introduce an established Linux mount into a MooseFS file
> system for that mount to be goal 1, to then duplicate through to newer
> hardware for the 2nd goal?
>
> Or will I have to copy the data from an established mount to the new
> MooseFS file system already configured?
>
> I hope that makes sense.
>
> Regards
>
> Tim
>
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
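For readers of the archive, the label/archive setup Aleksander describes would look roughly like the commands below. The label letters, the 7-day delay and the paths are made up for illustration, and the mfsscadmin flags should be double-checked against the storage classes manual linked above and the man pages of your installed version:

  # Give the existing RAID6 chunkserver label A and the new hardware label B
  # (LABELS = A / LABELS = B in their mfschunkserver.cfg files), then:
  mfsscadmin create -K A -A B -d 7 tiered_class   # keep new chunks on A, move archive copies to B after 7 days
  mfssetsclass -r tiered_class /mnt/mfs/data      # apply the class recursively (example path)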
From: Aleksander W. <ale...@mo...> - 2019-11-06 13:25:30
Hi Jay,

Basically, the "last error" tab tells us when the last IO error or CRC error occurred. More details should be available in the system log.

I would like to add that in case of using filesystems with compression, we should add a '~' mark at the beginning of the path declaration in the mfshdd.cfg file. '~' means that a significant change of the total block count will not mark this drive as damaged.

Best regards,
Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Sun, 29 Sep 2019 at 21:33, Jay Livens <jl...@sl...> wrote:
> Hi,
>
> I have a small MooseFS cluster running on four identical nodes.
> Everything was running smoothly until a week ago when one of the nodes
> started showing a value under "Last Error." The "Last Error" field updates
> every couple of days. The status is still shown as "Ok" for the drive.
>
> I have run scans on the hard drive on the "Last Error" node, and they
> passed without issues. I don't see any issues in the SMART data either.
>
> What exactly is going on and what exactly does a value in "Last Error"
> tell me? Can someone advise on what else I should check on?
>
> Thank you,
>
> Jay
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
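As a concrete illustration of the '~' mark (the paths below are only examples - use whatever your chunkserver actually exports):

  # /etc/mfs/mfshdd.cfg on the chunkserver
  /mnt/chunk1       # plain entry: large block-count changes may flag the drive as damaged
  ~/mnt/chunk2      # '~' entry: big total-block-count changes (e.g. from compression) are tolerated

After editing mfshdd.cfg the chunkserver typically needs a reload for the change to take effect.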
From: Remolina, D. J <dij...@ae...> - 2019-11-04 23:34:39
If your drives are using ZFS, are you using the following option on each of the ZFS pools?

zpool set failmode=continue {dataset}

If not, then it is possible for one disk failure to halt the whole cluster. I was told about this a while back, and I simulated a failure: indeed, I could no longer write to my MooseFS file system when I force-failed a disk. Bringing the disk back online would allow the cluster to work properly again. I later changed the setting and repeated the experiment, and upon a single disk failure the file system continued to operate properly.

Diego

________________________________
From: Aleksander Wieliczko <ale...@mo...>
Sent: Monday, November 4, 2019 4:06 AM
To: Jay Livens <jl...@sl...>
Cc: moo...@li...
Subject: Re: [MooseFS-Users] Single disk failure brings down an unrelated node

Hi, Jay,

I believe that we are talking about MooseFS 3.0.105, yes?

First of all I would like to ask about hard disks. Do you use a separate hard disk for the OS and a separate hard disk for chunks?

About question number 1: These components are independent and they are not designed to bring each other down. Is it possible that the OS and chunks are stored on the same physical disk? In such a scenario IO errors will influence the whole machine.

About the second question: That should work exactly like you described. It is extremely weird that you had some missing chunks. Goal 3 means 3 copies, so the loss of two components should not affect access to the data.

Is it possible to get some more logs from the master server?

Best regards,

Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Mon, 4 Nov 2019 at 05:17, Jay Livens <jl...@sl...> wrote:

Hi,

I just had a weird MFS problem occur and was hoping that someone could provide guidance. (Questions are at the bottom of this note.) My cluster is a simple one with 5 nodes and each node has one HDD. My goal is set to 3 for the share that I am referring to in this post.

I just had a drive go offline. Annoying but manageable; however, when it went offline, it appears that it took another unrelated node offline with it, and to make matters worse, when I looked at the info tab in MFS, it said that I was missing a number of chunks! I have no idea why this would happen.

Here is the syslog from the unrelated node:

Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks: got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3 times: [ replicator,read chunks: got status: IO error from (192.168.x.x:24CE)] <-- The IP of the failed node
Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+

After those messages, the node stopped responding and I could not ping it. A reboot brought it back online.

Here are my questions:

1. Why would a bad disk on one node bring down another so aggressively? Shouldn't they behave 100% independently of each other?
2. Since I have a goal of 3 and effectively lost 2 drives (e.g. the bad drive and the offline node) then shouldn't I still have access to all my data? Why was MFS indicating missing chunks in this scenario? Shouldn't I have 3 copies of my data and so be protected from a double disk failure?

Thank you,

JL
_________________________________________
moosefs-users mailing list
moo...@li...
https://lists.sourceforge.net/lists/listinfo/moosefs-users
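Applying Diego's setting across every pool on a chunkserver can be done in one short loop; zpool list -H -o name simply enumerates the pool names:

  # Set failmode=continue on all ZFS pools of this chunkserver
  for pool in $(zpool list -H -o name); do
      zpool set failmode=continue "$pool"
  done
  zpool get failmode        # verify the new setting on every pool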
From: Jay L. <jl...@sl...> - 2019-11-04 21:53:44
Diego,

I will review that. Thank you!

Aleksander, I responded to your earlier email directly. Let me know what I can provide.

Thank you to both of you!

Jay
----------------------------
Jay Livens
jl...@sl...
(617)875-1436
----------------------------

On Mon, Nov 4, 2019 at 3:00 PM Remolina, Diego J <dij...@ae...> wrote:
> If your drives are using ZFS, are you using the following option in each
> of the zfs pools?
>
> zpool set failmode=continue {dataset}
>
> If this is not the case, then it is possible one disk failure can halt the
> whole cluster. I was told about this a while back and I simulated a failure
> and in fact I could no longer write to my MooseFS file system when I forced
> failed a disk. Bringing the disk back online would allow the cluster to
> work again properly. I later changed the setting and repeated the experiment
> and upon a single disk failure, the file system would continue to operate
> properly.
>
> Diego
>
> ------------------------------
> From: Aleksander Wieliczko <ale...@mo...>
> Sent: Monday, November 4, 2019 4:06 AM
> To: Jay Livens <jl...@sl...>
> Cc: moo...@li...
> Subject: Re: [MooseFS-Users] Single disk failure brings down an
> unrelated node
>
> Hi, Jay,
>
> I believe that we are talking about MooseFS 3.0.105. Yes?
>
> First of all I would like to ask about hard disks.
> Do you use separate hard disk for OS and separate hard disk for chunks?
>
> About question number 1:
> These components are independent and they are not designed to bring each
> other down.
> Is it possible that OS and chunks are stored on the same physical disk?
> In such a scenario IO errors will influence the whole machine.
>
> About second question:
> That should work exactly like you described. It is extremely weird that
> you had some missing chunks.
> Goal 3 means 3 copies, so loss of two components should not affect access
> to the data.
>
> Is it possible to get some more logs from master server?
>
> Best regards,
>
> Aleksander Wieliczko
> System Engineer
> MooseFS Development & Support Team | moosefs.pro
>
> On Mon, 4 Nov 2019 at 05:17, Jay Livens <jl...@sl...> wrote:
>
> Hi,
>
> I just had a weird MFS problem occur and was hoping that someone could
> provide guidance. (Questions are at the bottom of this note.) My cluster
> is a simple one with 5 nodes and each node has one HDD. My goal is set to
> 3 for the share that I am referring to in this post.
>
> I just had a drive go offline. Annoying but manageable; however, when it
> went offline, it appears that it took another unrelated node offline with
> it and to make matters worse, when I looked at the info tab in MFS, it said
> that I was missing a number of chunks! I have no idea why this would
> happen.
>
> Here is the syslog from the unrelated node:
>
> Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks:
> got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3
> times: [ replicator,read chunks: got status: IO error from
> (192.168.x.x:24CE)] <-- The IP of the failed node
> Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+
>
> After those messages, the node stopped responding and I could not ping
> it. A reboot brought it back online.
>
> Here are my questions:
>
> 1. Why would a bad disk on one node bring down another so
> aggressively? Shouldn't they behave 100% independently of each other?
> 2. Since I have a goal of 3 and effectively lost 2 drives (e.g. the
> bad drive and the offline node) then shouldn't I still have access to all
> my data? Why was MFS indicating missing chunks in this scenario?
> Shouldn't I have 3 copies of my data and so be protected from a double disk
> failure?
>
> Thank you,
>
> JL
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: Tim G. <Ti...@ns...> - 2019-11-04 11:43:32
Hi everyone.

I have a cloud backup service that I offer that runs on a CentOS 7 VM with a /mnt mount to a separate RAID 6 storage server.

I have around 40TB of data but no replication/backup as such apart from the RAID 6. I need to rectify this but struggle to get downtime on the CentOS VM to be able to change anything.

Is it possible to introduce an established Linux mount into a MooseFS file system for that mount to be goal 1, to then duplicate through to newer hardware for the 2nd goal?

Or will I have to copy the data from an established mount to the new MooseFS file system already configured?

I hope that makes sense.

Regards

Tim
From: Aleksander W. <ale...@mo...> - 2019-11-04 09:34:23
Hi, Jay,

I believe that we are talking about MooseFS 3.0.105, yes?

First of all I would like to ask about hard disks. Do you use a separate hard disk for the OS and a separate hard disk for chunks?

About question number 1: These components are independent and they are not designed to bring each other down. Is it possible that the OS and chunks are stored on the same physical disk? In such a scenario IO errors will influence the whole machine.

About the second question: That should work exactly like you described. It is extremely weird that you had some missing chunks. Goal 3 means 3 copies, so the loss of two components should not affect access to the data.

Is it possible to get some more logs from the master server?

Best regards,

Aleksander Wieliczko
System Engineer
MooseFS Development & Support Team | moosefs.pro

On Mon, 4 Nov 2019 at 05:17, Jay Livens <jl...@sl...> wrote:
> Hi,
>
> I just had a weird MFS problem occur and was hoping that someone could
> provide guidance. (Questions are at the bottom of this note.) My cluster
> is a simple one with 5 nodes and each node has one HDD. My goal is set to
> 3 for the share that I am referring to in this post.
>
> I just had a drive go offline. Annoying but manageable; however, when it
> went offline, it appears that it took another unrelated node offline with
> it and to make matters worse, when I looked at the info tab in MFS, it said
> that I was missing a number of chunks! I have no idea why this would
> happen.
>
> Here is the syslog from the unrelated node:
>
> Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks:
> got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
> Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3
> times: [ replicator,read chunks: got status: IO error from
> (192.168.x.x:24CE)] <-- The IP of the failed node
> Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+
>
> After those messages, the node stopped responding and I could not ping
> it. A reboot brought it back online.
>
> Here are my questions:
>
> 1. Why would a bad disk on one node bring down another so
> aggressively? Shouldn't they behave 100% independently of each other?
> 2. Since I have a goal of 3 and effectively lost 2 drives (e.g. the
> bad drive and the offline node) then shouldn't I still have access to all
> my data? Why was MFS indicating missing chunks in this scenario?
> Shouldn't I have 3 copies of my data and so be protected from a double disk
> failure?
>
> Thank you,
>
> JL
> _________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
From: Jay L. <jl...@sl...> - 2019-11-04 04:40:53
Hi,

I just had a weird MFS problem occur and was hoping that someone could provide guidance. (Questions are at the bottom of this note.) My cluster is a simple one with 5 nodes and each node has one HDD. My goal is set to 3 for the share that I am referring to in this post.

I just had a drive go offline. Annoying but manageable; however, when it went offline, it appears that it took another unrelated node offline with it and to make matters worse, when I looked at the info tab in MFS, it said that I was missing a number of chunks! I have no idea why this would happen.

Here is the syslog from the unrelated node:

Nov 4 03:24:45 chunkserver4 mfschunkserver[587]: workers: 10+
Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: replicator,read chunks: got status: IO error from (192.168.x.x:24CE) <-- The IP of the failed node
Nov 4 03:25:24 chunkserver4 mfschunkserver[587]: message repeated 3 times: [ replicator,read chunks: got status: IO error from (192.168.x.x:24CE)] <-- The IP of the failed node
Nov 4 03:26:32 chunkserver4 mfschunkserver[587]: workers: 20+

After those messages, the node stopped responding and I could not ping it. A reboot brought it back online.

Here are my questions:

1. Why would a bad disk on one node bring down another so aggressively? Shouldn't they behave 100% independently of each other?
2. Since I have a goal of 3 and effectively lost 2 drives (e.g. the bad drive and the offline node) then shouldn't I still have access to all my data? Why was MFS indicating missing chunks in this scenario? Shouldn't I have 3 copies of my data and so be protected from a double disk failure?

Thank you,

JL
From: David M. <dav...@pr...> - 2019-10-29 04:27:24
Hi,

I've been trying to phase out some chunkservers. I've been marking disks for removal and slowly reducing the number of disks and servers. The endangered chunk count has started increasing today, and it's not related to a disk or server getting disconnected without notice - the chunk migration process seems to be doing it.

Can anyone help me understand why MFS would be doing this, and is there a way to avoid it?

Thanks,
David

Sent with ProtonMail (https://protonmail.com) Secure Email.
From: Jay L. <jl...@sl...> - 2019-09-30 02:27:40
Dave,

Thank you. The smartctl output is below, and it appears that I do not have any reallocated sectors. Thank you in advance for any thoughts.

=== START OF INFORMATION SECTION ===
Device Model:     ST2000DM008-2FR102
Serial Number:    ZFL08677
LU WWN Device Id: 5 000c50 0b50ff26c
Firmware Version: 0001
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Sep 29 21:56:04 2019 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (   0) seconds.
Offline data collection capabilities: (0x73) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. No Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities:              (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time:      (   1) minutes.
Extended self-test routine recommended polling time:   ( 201) minutes.
Conveyance self-test routine recommended polling time: (   2) minutes.
SCT capabilities:                (0x30a5) SCT Status supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--  079   064   006    -    76784672
  3 Spin_Up_Time            PO----  098   098   000    -    0
  4 Start_Stop_Count        -O--CK  100   100   020    -    33
  5 Reallocated_Sector_Ct   PO--CK  100   100   010    -    0
  7 Seek_Error_Rate         POSR--  073   060   045    -    18906494
  9 Power_On_Hours          -O--CK  096   096   000    -    3537 (26 240 0)
 10 Spin_Retry_Count        PO--C-  100   100   097    -    0
 12 Power_Cycle_Count       -O--CK  100   100   020    -    18
183 Runtime_Bad_Block       -O--CK  100   100   000    -    0
184 End-to-End_Error        -O--CK  100   100   099    -    0
187 Reported_Uncorrect      -O--CK  100   100   000    -    0
188 Command_Timeout         -O--CK  100   100   000    -    0
189 High_Fly_Writes         -O-RCK  100   100   000    -    0
190 Airflow_Temperature_Cel -O---K  070   067   040    -    30 (Min/Max 28/32)
191 G-Sense_Error_Rate      -O--CK  100   100   000    -    0
192 Power-Off_Retract_Count -O--CK  100   100   000    -    138
193 Load_Cycle_Count        -O--CK  099   099   000    -    2301
194 Temperature_Celsius     -O---K  030   040   000    -    30 (0 19 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-  079   064   000    -    76784672
197 Current_Pending_Sector  -O--C-  100   100   000    -    0
198 Offline_Uncorrectable   ----C-  100   100   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK  200   200   000    -    0
240 Head_Flying_Hours       ------  100   253   000    -    3487 (125 231 0)
241 Total_LBAs_Written      ------  100   253   000    -    1813646568
242 Total_LBAs_Read         ------  100   253   000    -    26007428236
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%             3432  -
# 2  Short offline       Completed without error       00%             3429  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Jay

On Sun, Sep 29, 2019 at 6:49 PM David Myer <dav...@pr...> wrote:
> Hi Jay,
>
> I encounter this occasionally and have usually found no issues with the
> disks according to SMART. Can you post your smartctl output? I was advised
> that if there are "reallocated sectors" in the smart summary, this could
> relate to the problem. Alternatively, if your disk(s) have compression
> enabled, this may be causing errors.
>
> One thing you could try is mark the disk for removal, reformat it when
> ready, then re-add it to the cluster.
>
> Cheers,
> Dave
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Sunday, September 29, 2019 3:33 PM, Jay Livens <jl...@sl...> wrote:
>
> Hi,
>
> I have a small MooseFS cluster running on four identical nodes.
> Everything was running smoothly until a week ago when one of the nodes
> started showing a value under "Last Error." The "Last Error" field updates
> every couple of days. The status is still shown as "Ok" for the drive.
>
> I have run scans on the hard drive on the "Last Error" node, and they
> passed without issues. I don't see any issues in the SMART data either.
>
> What exactly is going on and what exactly does a value in "Last Error"
> tell me? Can someone advise on what else I should check on?
>
> Thank you,
>
> Jay
From: David M. <dav...@pr...> - 2019-09-29 22:49:17
Hi Jay,

I encounter this occasionally and have usually found no issues with the disks according to SMART. Can you post your smartctl output? I was advised that if there are "reallocated sectors" in the SMART summary, this could relate to the problem. Alternatively, if your disk(s) have compression enabled, this may be causing errors.

One thing you could try is to mark the disk for removal, reformat it when ready, then re-add it to the cluster.

Cheers,
Dave

Sent with ProtonMail (https://protonmail.com) Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Sunday, September 29, 2019 3:33 PM, Jay Livens <jl...@sl...> wrote:
> Hi,
>
> I have a small MooseFS cluster running on four identical nodes. Everything was running smoothly until a week ago when one of the nodes started showing a value under "Last Error." The "Last Error" field updates every couple of days. The status is still shown as "Ok" for the drive.
>
> I have run scans on the hard drive on the "Last Error" node, and they passed without issues. I don't see any issues in the SMART data either.
>
> What exactly is going on and what exactly does a value in "Last Error" tell me? Can someone advise on what else I should check on?
>
> Thank you,
>
> Jay
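Marking a disk for removal, as Dave suggests, is done in the chunkserver's mfshdd.cfg by prefixing the path with '*'; the path below is only an example:

  # /etc/mfs/mfshdd.cfg on the affected chunkserver
  */mnt/chunk1      # '*' marks this disk for removal; its chunks are replicated elsewhere

After editing, reload the chunkserver configuration (e.g. mfschunkserver reload, or via your service manager) and wait until the CGI interface shows all chunks from that disk replicated before actually pulling it.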