WAS restart script to kill off hung threads

Our WebSphere environment has nightly restarts because some of the apps are so shitty that they cannot run for more than 24 hours at a time and app owners do not care (another conversation for another day). Given this piece of information a long time ago we implemented nightly restarts that will reboot all apps on a given cluster.

Every so often we get a Splunk notification that a cluster node is not back up and running, and then when we investigate, we find that no java processes are running on this machine. After digging into it, I discovered that our script attempts to shut down each AppServer, but since a rogue app has a hung thread that is preventing the stopServer.sh command from completing.

To combat this in our Restart_AppServer.sh script, I have utilized the ‘timeout’ and ‘pgrep’ commands.

The timeout command was pretty straight forward: if the call does not return in the provided amount of seconds, then kill the command trying to run.

pgrep is also a pretty straight forward command, but the only problem with it is that the Restart_AppServer.sh command contains a parameter that is the name of the server. So if you have an AppServer named ‘Level_1’ then when you do a ‘pgrep -f Level_1’ you will get 2 PIDs: the one for the AppServer, and one for the Restart_AppServer.sh.

To get around this I looked up the PID of the Restart_AppServer.sh script, and then removed it from the grep command using the ‘-v’ option which is used to remove results from the result set.

timeout to stop the server, but kills the attempt if it doesn’t complete in time.

timeout 120 ${WAS_ROOT}/bin/stopServer.sh ${wrk_server}

pgrep grabs the process ID(s) of whatever you’re grepping for.

CONTROLLER_SCRIPT_PID=`pgrep -f Controller`
echo "********** pid =  ${CONTROLLER_SCRIPT_PID}"

SERVER_PID=`pgrep -f $1 | grep -v ${CONTROLLER_SCRIPT_PID}`
echo "********** $1 pid =  ${SERVER_PID}"
if [[ ${SERVER_PID} != "" ]]
    echo "### ERROR ### AppServer $1 could not be shutdown gracefully, and had to be killed" >> ${TEMP_LOG}
    pgrep -f $1 | grep -v ${CONTROLLER_SCRIPT_PID} | xargs kill -9

WebSphere jython installation script enhancement

We kept having an issue where the app would successfully deploy to all the nodes in the cluster, but for an unknown reason the app would only partially startup, or not startup at all. This would require our on-call to be paged, get on the phone, login to the WAS console, and manually restart the app.

I suspect that the recent errors occurred because the script was trying to start the app before the app was fully synced and installed, which meant it may start on some cluster nodes, but not all of them, resulting in our team having to manually go start the app.

By utilizing the AdminApp.isAppReady(app) function, the script will now verify whether the app is ready to start or not. If the app is ready right after install, it’s smooth sailing and the app will be started. However, if the app is not ready to be started, the script will sleep for 30 seconds, and then inspect the app again to see if it is ready. The script will do this a maximum of 5 times, but on the first instance of the app being ready, the app will be started. After the 5th time, the app will try to be started anyway and a log entry made that it MAY need further attention. At that point the interested party should attempt to hit the app and see if it is ready or not, and call us if needed.

import sys
import time
# get line separator
lineSeparator = java.lang.System.getProperty('line.separator')

print "Verify app is ready to start, and if not, give it more time to get ready"
ctr = 0
result = AdminApp.isAppReady(app)
print "initial isAppReady=" + result

while (result == "false" and ctr < 6):
        print "APP IS NOT READDY TO START!!!! Sleeping to give app time to be ready to start..."
        result = AdminApp.isAppReady(app)
        print "isAppReady=" + result
        ctr += 1

if(result == "false"):
        print "final isAppReady=false and app MAY need additional attention"

Pass parameters to interactive Unix script

I hhhhhhhhaaaaaattttteeee being a monkey that just pushes a button. There’s always a better (and cheaper) way to restart a system than to have a human push a button to restart a system. God gave us computers just for that reason!

We have an environment that treats the person running the restart command as someone that is not familiar with running the stop.sh and start.sh scripts. To get to this point in this environment requires a huge amount of IT experience, but for whatever reason the stop/start command requires a crap-ton of hand-holding.

I’m not in charge of this system, but during an on-call I had to be the monkey that pushes the stop/start buttons and follow-along with the series of very basic questions. Eff that, surely it can be scripted, but “impossible” replied my compadre, it cannot because the stop/start requires you to enter some usernames, passwords, numbers, and a prostate exam.

That’s bush league, there has to be a better way that can remove me from the process, sure enough, there is a cool feature in Unix that allows a line-separated list file to be passed to a script. An hour later, this is what was produced.

Test file that mimics the system stop/start interactive commands:

admin@server01:/tmp> cat intro.sh
# Ask the user for their name
echo Give me your number
read varname
echo Provied number: $varname

echo Give me your username
read varname
echo Username: $varname

echo Give me your password
read varname
echo Password: $varname

echo 2 Give me your username
read varname
echo Username: $varname

echo 2 Give me your password
read varname
echo Password: $varname

echo Sleeping...
sleep 10
echo Done sleeping.

echo Press 7 to exit
read varname
echo You have entered: $varname

Here’s the input file that correlates to the questions being asked

admin@server01:/tmp> cat input.txt

Here’s what the execution of the file looks like

admin@server01:/tmp> cat input.txt | ./intro.sh
Give me your number
Provied number: 1
Give me your username
Username: user1
Give me your password
Password: pwd1
2 Give me your username
Username: usr2
2 Give me your password
Password: password2
Press 7 to exit
You have entered: 7

As long as you know that the input from the user is the same order every time, you can use this method to “cat” a file of options to the script.