2014-11-07 06:01 PM
Hello,
We first upgraded one log hybrid from 10.3.4 to 10.4 without any real issues. That log hybrid is on the same subnet as the SA server. We then upgraded a second log hybrid, which is not on the same subnet as the SA server. The installation hung, and we now believe the reason is that the upgrade instructions do not mention the ports used by Puppet, RabbitMQ, MCollective, and so on. During that install we clicked the Upgrade button in the GUI but never got the Reboot or Enable button.
We eventually got the puppet agent working, or at least it seems that way, and even removed and repurposed the appliance from the SA server. Earlier today I had to manually sign the puppet agent's request on the puppet master, and after that the appliance itself showed a message requesting a manual reboot. The message never went away no matter how many times we rebooted. We then removed all the SSL files from the log hybrid, cleaned the certificates from the puppet master, and ran puppet agent --test --waitforcert 30. That time we even got the puppet master to sign the request, which gave us the pop-up window in the SA GUI containing the fingerprint. However, we still have not got rid of the Enable button. I suspect that with a few more tries I might get it to display Reboot Required, but that would never go away either.
I think syslog collection is working ok, but the log collector service is not. Here are some sample errors gathered from all around the system:
log hybrid - /var/log/messages:
[AMQPClientBase] [failure] An error occurred creating an AMQP channel: : connection closed unexpectedly
[BufferedChannel] [failure] An error occurred publishing to an AMQP channel: : connection closed unexpectedly
[EventBroker] [failure] failure in updating statistics for: No such node (stats)
log hybrid - /var/log/rabbitmq/sa@localhost.log:
=WARNING REPORT==== 7-Nov-2014::17:30:03 ===
HTTP access denied: user 'logcollector' - invalid credentials
=ERROR REPORT==== 7-Nov-2014::17:30:03 ===
webmachine error: path="/api/nodes"
"Unauthorized"
=WARNING REPORT==== 7-Nov-2014::17:30:03 ===
HTTP access denied: user 'logcollector' - invalid credentials
=ERROR REPORT==== 7-Nov-2014::17:30:03 ===
webmachine error: path="/api/connections"
"Unauthorized"
=WARNING REPORT==== 7-Nov-2014::17:30:03 ===
HTTP access denied: user 'logcollector' - invalid credentials
=ERROR REPORT==== 7-Nov-2014::17:30:04 ===
closing AMQP connection <0.983.2> (127.0.0.1:48949 -> 127.0.0.1:5671):
{handshake_error,starting,0,
{amqp_error,access_refused,
"PLAIN login refused: user 'logcollector' - invalid credentials",
'connection.start_ok'}}
=ERROR REPORT==== 7-Nov-2014::17:30:05 ===
closing AMQP connection <0.987.2> (127.0.0.1:48950 -> 127.0.0.1:5671):
{handshake_error,starting,0,
{amqp_error,access_refused,
"PLAIN login refused: user 'logcollector' - invalid credentials",
'connection.start_ok'}}
=ERROR REPORT==== 7-Nov-2014::17:30:05 ===
closing AMQP connection <0.991.2> (127.0.0.1:48951 -> 127.0.0.1:5671):
{handshake_error,starting,0,
{amqp_error,access_refused,
"PLAIN login refused: user 'logcollector' - invalid credentials",
'connection.start_ok'}}
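To see at a glance which accounts RabbitMQ is rejecting, the log can be summarized per user. The following is only a minimal sketch: the function name and the regular expression are my own, and the pattern is written against the two message formats visible in the excerpt above, so other RabbitMQ versions may word these lines differently.

```python
import re
from collections import Counter

# Matches the two refusal formats seen in the excerpt above:
#   HTTP access denied: user 'NAME' - invalid credentials
#   "PLAIN login refused: user 'NAME' - invalid credentials",
DENIED = re.compile(r"(?:HTTP access denied|PLAIN login refused): user '([^']+)'")


def count_denied_users(log_text):
    """Return a Counter mapping user name -> number of refused logins."""
    return Counter(DENIED.findall(log_text))


# Two lines copied from the log excerpt in this post.
SAMPLE = """\
=WARNING REPORT==== 7-Nov-2014::17:30:03 ===
HTTP access denied: user 'logcollector' - invalid credentials
=ERROR REPORT==== 7-Nov-2014::17:30:04 ===
"PLAIN login refused: user 'logcollector' - invalid credentials",
"""

for user, hits in count_denied_users(SAMPLE).most_common():
    print(f"{user}: {hits} refused logins")
```

To run it against the real file, read /var/log/rabbitmq/sa@localhost.log into a string and pass that to count_denied_users. If only 'logcollector' shows up, the problem is probably limited to that one account's credentials rather than RabbitMQ as a whole.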
In /var/log/puppet/masterhttp.log on the SA server the log looks completely OK around every puppet test attempt - no error status codes are reported. For example:
[07/Nov/2014:22:37:46 UTC] "POST /production/catalog/530e71a6-d288-4b25-a2fe-455a10398a91 HTTP/1.1" 200 49979
Is it somehow possible to start from square one, meaning the point before everything went wrong, where I would see the "Upgrade to 10.4" button in the SA GUI instead of the red or yellow Reboot Required or Enable buttons? Or is there a simpler way to fix things? Honestly, I do not know what else might be corrupted or broken by the interrupted first installation attempt. The ports for puppet communication and so on were not open, because we did not know about such requirements. Re-imaging the device is not an option. What does the Enable button in the SA GUI actually initiate? If I knew that, it might be easier for me to track and debug the issue.
Finally, on the log hybrid the /etc/hosts includes:
x.x.x.x puppetmaster.local
where x.x.x.x is the IP address of the SA server.
Does anyone know how to approach the issue further, or how to force a 10.4 re-upgrade so that it runs as it did the first time? I have no problem re-installing every RPM as long as no data or event source configuration is lost. I have a support ticket open for this case, but I am hoping to find some help already during the weekend.
2014-11-10 11:52 AM
You would first need to check whether all of your RPMs were installed.
Compare the list with another instance. (If any RPMs are missing, I would probably do a re-upgrade.)
If all RPMs are there, then do a Remove and Repurpose of that appliance.
In addition to this, locate the UUID of that appliance,
then on SA, run this command: /etc/puppet/scripts/delNode.py <UUID of the appliance>
Wait for the script to complete.
Then perform a cleanup on that appliance.
Then, on the appliance page of the SA UI, perform a Discover using the Discover button.
If all goes well, the appliance you want to enable should appear.
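The RPM comparison in the first step can be sketched roughly as below. It assumes you have saved the output of rpm -qa from a known-good appliance and from the problem appliance into two text files. The function names are hypothetical, and the package strings in the example are illustrative placeholders, not real package versions.

```python
def parse_rpm_list(text):
    """Turn `rpm -qa` output (one package per line) into a set of entries."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_rpm_lists(reference_text, suspect_text):
    """Return (missing, extra) sorted lists relative to the reference list."""
    reference = parse_rpm_list(reference_text)
    suspect = parse_rpm_list(suspect_text)
    return sorted(reference - suspect), sorted(suspect - reference)


# Placeholder data only; real `rpm -qa` lines carry version, release and arch.
good = "erlang-A\nrabbitmq-server-B\nnwlogcollector-C\n"
bad = "erlang-A\nnwlogcollector-OLD\n"

missing, extra = compare_rpm_lists(good, bad)
print("missing on suspect appliance:", missing)
print("only on suspect appliance:", extra)
```

Because rpm -qa prints the full name with version and release, a package that exists on both appliances at different versions shows up in both lists, which is a quick way to spot RPMs that were left at the old version by an interrupted upgrade.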
Enable does the following things:
1: Exchanges the puppet certs.
2: Discovers the appliance and services running puppet agent.
3: Configures health and wellness and services on the appliance.
2014-11-11 04:16 AM
Thank you for your advice, Mudit. However, I managed to get the hybrid enabled before your post. I suppose there is a good chance that your method would also have worked, as I did not try cleaning the /etc/RabbitMQ/ssl folder. I did, however, clean the /var/lib/puppet/ssl folder multiple times, in addition to cleaning the cert from the puppet master (SA server). What I ended up doing to get things working was to remove rabbitmq-server along with its dependencies (nwlogcollector, erlang, etc.) and reinstall them. This seems to have done the trick for me. Before that, even Remove and Repurpose did not help.
My safest recommendation for other people is to have TCP port 8140 open to the SA server, along with 5671 open bi-directionally and 61614 open to the SA server (or even bi-directionally, I am not quite sure). At least the puppet port (8140) seems to be vital for the upgrade to work. That applies, of course, only if you do not have all the components in the same subnet, as was our situation. Unfortunately, these ports are not mentioned in the upgrade installation guide.
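Before starting an upgrade across subnets, reachability of these ports can be checked with a few lines of Python run on the appliance. This is only a sketch under assumptions: the function names are my own, 8140 and 61614 are the ports named in this post, 5671 is the AMQP port that appears in the RabbitMQ log earlier in the thread, and the list is what I observed, not an official one.

```python
import socket

# Ports from this thread; the labels are informal descriptions.
PORTS = {
    8140: "puppet master",
    5671: "AMQP over TLS (seen in the RabbitMQ log)",
    61614: "third port mentioned in the post",
}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports=PORTS, timeout=3.0):
    """Return a dict mapping each port number to True (reachable) or False."""
    return {port: port_open(host, port, timeout) for port in ports}
```

On the log hybrid, check_ports("puppetmaster.local") would test the path towards the SA server, using the host name from the /etc/hosts entry mentioned earlier. A TCP connect only tests one direction, so for the bi-directional port the same check has to be run from the SA server towards the appliance as well. A False result can mean a firewall drop, a closed port, or a host name that does not resolve.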
2014-11-11 04:23 AM
That's great to hear.
It is possible that the failures were due to some dependency errors.
But once all the RPMs have been properly installed/upgraded, the steps that I mentioned would definitely help in resolving the Enable problems.
2016-11-10 03:18 PM
Tomi, there is a helpful port scanning script available here: https://community.rsa.com/docs/DOC-45752