2016-07-27 12:13 AM
Hi All,
I'm getting the below aggregation errors on my packet concentrators:
NwConcentrator[8736]: [Aggregation] [failure] Unable to retrieve page from meta pool, please increase aggregate.meta.perpage and/or aggregate.meta.page.factor and restart
NwConcentrator[8736]: [Aggregation] [info] Aggregation is stopping
NwConcentrator[8736]: [Aggregation] [info] Aggregation threads are being shutdown
While restarting the packet concentrator aggregation service, I am getting an error like:
"Aggregation failed to start.TransportException: Message start was not recognized by concentrator."
While cross-verifying with the RSA SA community KB, I found it stated that the Concentrator takes some time before aggregation actually starts because it is "initializing databases" and "initializing the SDK"; as soon as everything is initialized properly, aggregation starts.
After successful initialization of the DB and SDK, my concentrator moves up to the consuming state, but then it drops back to the offline state.
What could be the possible reason for these fluctuations?
Thanks in advance!
Regards
Pranav
2016-07-27 03:34 AM
Hi Sankar,
Can you restart the nwconcentrator service from the CLI?
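For reference, a restart from the CLI would look something like the below (assuming the service is managed by Upstart, as on the standard SA appliances, and that you run it as root):
# Stop and start the nwconcentrator service, then confirm it came back up
stop nwconcentrator
start nwconcentrator
status nwconcentrator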
By the way, what version of SA are you using?
2016-07-27 03:39 AM
Hi Soumyajit,
SA version - 10.5.2
Before restarting, I would like to know the reasons for such fluctuations.
Regards
Pranav
2016-07-27 03:50 AM
Hi Sankar,
Sometimes this can happen due to a huge amount of packet traffic, so it takes time to aggregate it all simultaneously. If the box you are using is a Hybrid, you may see this situation occasionally depending on the rate of packet capture on the Decoder.
But if you are using separate boxes (I mean a physical Decoder and a physical Concentrator), it should not.
Please raise a case with support if you see this issue frequently.
Also, I would encourage you to move your platform from 10.5.2 to 10.6.1.
2016-07-27 04:59 AM
Hi Pranav,
From the logs you mentioned, there is a problem retrieving meta pages. To solve this, we need to increase both "aggregate.meta.perpage" and "aggregate.meta.page.factor" in the explore view of the concentrator, as shown below.
You could raise the values to:
/concentrator/config/aggregate.meta.page.factor = 200
/concentrator/config/aggregate.meta.perpage = 30
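If you prefer the command line over the explore view, the same nodes can usually be read and set through the concentrator's REST interface. The below is only a sketch: the hostname, REST port (50105) and admin credentials are assumptions you would need to adjust for your environment.
# Read the current values over the REST interface (port and credentials assumed)
curl -u admin:netwitness "http://concentrator-host:50105/concentrator/config/aggregate.meta.page.factor?msg=get"
curl -u admin:netwitness "http://concentrator-host:50105/concentrator/config/aggregate.meta.perpage?msg=get"
# Set the suggested values, then restart the service so they take effect
curl -u admin:netwitness "http://concentrator-host:50105/concentrator/config/aggregate.meta.page.factor?msg=set&value=200"
curl -u admin:netwitness "http://concentrator-host:50105/concentrator/config/aggregate.meta.perpage?msg=set&value=30"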
Note that regarding the second error, "Aggregation failed to start.TransportException: Message start was not recognized by concentrator."
This is normal, as the concentrator won't start aggregation until the initialization of the databases is finished.
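If you want to see how far the initialization has progressed, you can follow the concentrator lines in syslog (the log location /var/log/messages is assumed here, matching the NwConcentrator lines you quoted):
# Watch for the database/SDK initialization and aggregation messages
tail -f /var/log/messages | grep -i NwConcentrator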
After changing the values of "aggregate.meta.perpage" and "aggregate.meta.page.factor", you will have to restart the services. If the problem happens again after you change the values and restart, then this is not the main reason for the crash. You could also check the memory consumed by the services when the crash occurs. If you want, please send me the output of the "top" command and I will check whether any service is consuming more memory than it should.
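To capture that in one go, something like the below is enough (standard Linux commands, nothing SA-specific):
# One batch-mode snapshot of top, saved to a file you can attach here
top -b -n 1 > /tmp/top_snapshot.txt
# Quick view of the processes sorted by resident memory
ps aux --sort=-rss | head -15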
Hope this helps!
Best regards
Khaled
2016-07-27 05:09 AM
Hi Khaled,
Good Morning
I ran a check in the explore view of the concentrator, but the values are at the defaults, as you mentioned:
What could be the reason for this type of re-indexing?
Regards
Pranav
2016-07-27 05:14 AM
Good Morning Pranav,
The re-index part is normal, but what we need to know is why it crashes in the first place. When the nwconcentrator service comes back up after the crash, it re-indexes the last slice that was not closed properly.
When it crashes again, please send me the output of the top command so I can check whether it is crashing due to memory utilization.
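In the meantime, if you want to catch the numbers around the moment it happens, a rough sketch like the below (run in a screen session or in the background; the log file name is just an example) will record the NwConcentrator memory once a minute:
# Log NwConcentrator memory (virtual and resident, in KB) every 60 seconds
while true; do
  date >> /var/log/nwconc_mem.log
  ps -C NwConcentrator -o pid,vsz,rss,comm >> /var/log/nwconc_mem.log
  sleep 60
done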
Best regards
Khaled
2016-07-27 05:24 AM
Sure Khaled, I'll keep this post in observation mode.
Thanks !
2016-08-04 10:13 AM
Hey Khaled,
It seems the concentrator started to re-index again and is now in the consuming state.
I've issued the "top" command to check whether any service is consuming more memory. Please find the output below.
top - 14:09:11 up 21 days, 3:33, 1 user, load average: 2.52, 2.61, 2.61
Tasks: 682 total, 2 running, 680 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 1.4%sy, 0.0%ni, 92.2%id, 5.8%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 99051764k total, 97303988k used, 1747776k free, 10656k buffers
Swap: 20971516k total, 4237428k used, 16734088k free, 76491532k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10075 root 20 0 0 0 0 R 19.8 0.0 120:57.12 flush-253:9
8284 rabbitmq 20 0 2431m 88m 1876 S 14.5 0.1 3641:37 beam.smp
5608 root 20 0 319g 50g 36g S 4.6 53.1 2706:57 NwConcentrator
1959 root 39 19 0 0 0 S 4.3 0.0 389:26.93 kipmi0
4984 root 20 0 15428 1816 984 R 1.3 0.0 0:03.34 top
226 root 20 0 0 0 0 S 0.7 0.0 127:29.86 kblockd/0
227 root 20 0 0 0 0 S 0.3 0.0 172:23.35 kblockd/1
2846 root 20 0 196m 3500 1236 S 0.3 0.0 74:38.81 snmpd
3605 root 20 0 495m 1596 1016 S 0.3 0.0 12:33.39 dsm_sa_datamgrd
3799 root 20 0 900m 20m 2372 S 0.3 0.0 69:05.10 NwAppliance
1 root 20 0 19364 916 684 S 0.0 0.0 3:45.00 init
2 root 20 0 0 0 0 S 0.0 0.0 0:04.09 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 2:07.31 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 6:23.25 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/0
6 root RT 0 0 0 0 S 0.0 0.0 0:03.86 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 1:48.91 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/1
9 root 20 0 0 0 0 S 0.0 0.0 6:02.20 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:02.79 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:07.54 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/2
13 root 20 0 0 0 0 S 0.0 0.0 0:51.93 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:02.48 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:08.18 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/3
17 root 20 0 0 0 0 S 0.0 0.0 1:08.58 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:01.74 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:05.36 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/4
21 root 20 0 0 0 0 S 0.0 0.0 0:41.12 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 0:01.40 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 0:05.18 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/5
25 root 20 0 0 0 0 S 0.0 0.0 0:46.95 ksoftirqd/5
26 root RT 0 0 0 0 S 0.0 0.0 0:01.40 watchdog/5
27 root RT 0 0 0 0 S 0.0 0.0 0:03.43 migration/6
Kindly review it.
Regards
Pranav
2016-08-04 10:43 AM
Hi Pranav,
Regarding the output of the top command, especially the below line:
5608 root 20 0 319g 50g 36g S 4.6 53.1 2706:57 NwConcentrator
It seems that the concentrator is crashing due to memory issues, as the NwConcentrator service is consuming a lot of memory. It is consuming 50 GB of resident memory, which is pretty high for one process, but the more interesting part is the virtual memory of 319 GB, which is a sign of a possible memory leak. I don't know which version of SA you are running, but there were memory leak problems that were fixed in the newer releases. Note that upgrading should fix the memory leak issue, which is the virtual memory part ("319 GB"), but it won't decrease the resident memory, as that is the actual memory used by the process. To sum up all of the above:
1- You have a memory problem due to the concentrator being overloaded. To solve this, you will have to decrease the load on the concentrator.
2- There is a problem with virtual memory due to a memory leak. If you are not already on 10.6.1, I recommend you upgrade to it. If you are not able to do this, there is a workaround that has helped a lot of customers before, which is a cron job that flushes the cache overnight:
0 1 * * * root echo 3 > /proc/sys/vm/drop_caches
3 1 * * * root stop nwconcentrator
5 1 * * * root start nwconcentrator
The above is a cron job in /etc/crontab that will free the cache at 1:00 am and then stop and start nwconcentrator at 1:03 and 1:05 am respectively. This won't affect your production and won't lead to any data loss whatsoever. You can implement this to free up the virtual memory overnight; otherwise, an upgrade to 10.6.1 should fix the issue.
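To confirm the workaround is in place and doing its job, you can check the crontab entries and the memory figures the morning after (standard commands, nothing to adjust):
# Confirm the entries were added to the system crontab
grep -E "drop_caches|nwconcentrator" /etc/crontab
# Check overall memory and the NwConcentrator figures after the overnight restart
free -m
ps -C NwConcentrator -o pid,vsz,rss,comm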
If you address only one of the points above, the fluctuation will happen less often, but it won't fix the issue permanently. To fix it permanently, you will have to address both points 1 and 2.
Hope this helps!
Best regards
Khaled