WAS restart script to kill off hung threads

Our WebSphere environment has nightly restarts because some of the apps are so shitty that they cannot run for more than 24 hours at a time, and the app owners do not care (another conversation for another day). Because of this, a long time ago we implemented nightly restarts that reboot all apps on a given cluster.

Every so often we get a Splunk notification that a cluster node is not back up and running, and when we investigate, we find that no java processes are running on the machine. After digging into it, I discovered that our script attempts to shut down each AppServer, but a rogue app has a hung thread that prevents the stopServer.sh command from completing.

To combat this in our Restart_AppServer.sh script, I have utilized the ‘timeout’ and ‘pgrep’ commands.

The timeout command was pretty straightforward: if the call does not return within the given number of seconds, timeout kills the command it was running.

pgrep is also a pretty straightforward command, but the only problem with it is that the Restart_AppServer.sh command takes the name of the server as a parameter. So if you have an AppServer named ‘Level_1’ then when you do a ‘pgrep -f Level_1’ you will get 2 PIDs: the one for the AppServer, and one for the Restart_AppServer.sh.

To get around this I looked up the PID of the Restart_AppServer.sh script, and then removed it from the pgrep results by piping them through grep with the ‘-v’ option, which drops matching lines from the result set.

timeout tries to stop the server, but kills the attempt if it doesn’t complete in time.

timeout 120 ${WAS_ROOT}/bin/stopServer.sh ${wrk_server}
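For anyone driving restarts from Python rather than bash, the same "give it N seconds, then kill it" idea can be sketched with the standard library. This is only an illustration, not part of our script: run_with_timeout is a made-up helper name, and note that subprocess kills the child outright on timeout, whereas the timeout command sends SIGTERM by default.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd (a list of arguments); return True if it finished in time.

    When the timeout expires, subprocess.run kills the child process and
    raises TimeoutExpired, which is reported here as False.
    """
    try:
        subprocess.run(cmd, timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        return False
```

A call like run_with_timeout(["/path/to/stopServer.sh", "Level_1"], 120) would then play the role of the timeout line above (the path here is a placeholder).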

pgrep grabs the process ID(s) of whatever you’re grepping for.

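# PID of the controller script, so it can be excluded from the matches below
CONTROLLER_SCRIPT_PID=$(pgrep -f Controller)
echo "********** pid =  ${CONTROLLER_SCRIPT_PID}"

# PIDs whose command line mentions the server name, minus the controller script
SERVER_PID=$(pgrep -f "$1" | grep -v "${CONTROLLER_SCRIPT_PID}")
echo "********** $1 pid =  ${SERVER_PID}"
if [[ ${SERVER_PID} != "" ]]
then
    echo "### ERROR ### AppServer $1 could not be shutdown gracefully, and had to be killed" >> "${TEMP_LOG}"
    # The graceful stop timed out, so force-kill whatever is still running
    pgrep -f "$1" | grep -v "${CONTROLLER_SCRIPT_PID}" | xargs kill -9
fi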
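One thing to watch with the grep -v approach is that it matches substrings, so excluding PID 123 would also drop a server whose PID happens to be 1234. A minimal sketch of the same filtering with whole-PID comparison, written in Python purely for illustration (filter_server_pids is a hypothetical helper, and it expects the text that pgrep prints, one PID per line):

```python
def filter_server_pids(pgrep_output, exclude_pids):
    """Return the PIDs found in pgrep output, dropping any in exclude_pids.

    PIDs are compared as whole numbers, so excluding 123 does not also
    drop 1234 the way a substring match would.
    """
    excluded = set(int(pid) for pid in exclude_pids)
    pids = []
    for token in pgrep_output.split():
        pid = int(token)
        if pid not in excluded:
            pids.append(pid)
    return pids
```

In the shell script itself, the closest equivalent would be making the grep match the whole line rather than a substring.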

WebSphere jython installation script enhancement

We kept having an issue where the app would successfully deploy to all the nodes in the cluster, but for an unknown reason the app would only partially start up, or not start up at all. This would require our on-call to be paged, get on the phone, log in to the WAS console, and manually restart the app.

I suspect that the recent errors occurred because the script was trying to start the app before the app was fully synced and installed, which meant it may start on some cluster nodes, but not all of them, resulting in our team having to manually go start the app.

By utilizing the AdminApp.isAppReady(app) function, the script will now verify whether the app is ready to start or not. If the app is ready right after install, it’s smooth sailing and the app will be started. However, if the app is not ready to be started, the script will sleep for 30 seconds, and then inspect the app again to see if it is ready. The script will do this a maximum of 5 times, and as soon as the app reports ready, it will be started. After the 5th check, the script will try to start the app anyway and log that it MAY need further attention. At that point the interested party should attempt to hit the app and see if it is ready or not, and call us if needed.

import sys
import time
# get line separator
lineSeparator = java.lang.System.getProperty('line.separator')

print "Verify app is ready to start, and if not, give it more time to get ready"
ctr = 0
# isAppReady returns the string "true" or "false"
result = AdminApp.isAppReady(app)
print "initial isAppReady=" + result

# Re-check up to 5 times, sleeping 30 seconds before each check
while (result == "false" and ctr < 5):
        print "APP IS NOT READY TO START!!!! Sleeping to give app time to be ready to start..."
        time.sleep(30)
        result = AdminApp.isAppReady(app)
        print "isAppReady=" + result
        ctr += 1

if (result == "false"):
        print "final isAppReady=false and app MAY need additional attention"
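The retry logic is easier to reason about (and to test outside of wsadmin) when it is pulled into a small function. The sketch below is plain Python rather than wsadmin Jython, and is_ready is a hypothetical stand-in for a call to AdminApp.isAppReady(app); the attempt count and delay mirror the numbers described above.

```python
import time


def wait_until_ready(is_ready, attempts=5, delay=30, sleep=time.sleep):
    """Poll is_ready() until it returns "true" or the attempts run out.

    is_ready is a zero-argument callable that returns the string "true"
    or "false". The sleep function is injectable so the loop can be
    exercised without actually waiting. Returns the last result seen.
    """
    result = is_ready()
    tries = 0
    while result == "false" and tries < attempts:
        sleep(delay)
        result = is_ready()
        tries += 1
    return result
```

The caller would start the app either way, and log the "MAY need additional attention" message when the returned value is still "false".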

Fix WebSphere OutOfMemory error during deployment

Deployments within the WebSphere v7 environment were randomly failing with OutOfMemory (OOM) errors. We initially thought it was due to a very large EAR file being deployed, but after a while this theory did not hold up, because OOMs occurred with small and large EARs alike. Within the SystemOut.log, we found this error:

[1/10/14 06:17:51:553 CST] 00000000 AbstractShell E   WASX7120E: Diagnostic information from exception with text "com.ibm.websphere.management.application.client.AppDeploymentException: com.ibm.websphere.management.application.client.AppDeploymentException [Root exception is java.lang.OutOfMemoryError]
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError " follows:

 com.ibm.websphere.management.application.client.AppDeploymentException [Root exception is java.lang.OutOfMemoryError]
    at com.ibm.ws.management.application.client.AppInstallHelper.getAppDeploymentInfoGenericRead(AppInstallHelper.java:520)
    at com.ibm.ws.management.application.client.DefaultBindingHelper.prepareTask(DefaultBindingHelper.java:216)
    at com.ibm.ws.scripting.AdminAppClient.createPreferences(AdminAppClient.java:2885)
    at org.eclipse.core.launcher.Main.basicRun(Main.java:282)
    at org.eclipse.core.launcher.Main.run(Main.java:981)
    at com.ibm.wsspi.bootstrap.WSPreLauncher.launchEclipse(WSPreLauncher.java:341)
    at com.ibm.wsspi.bootstrap.WSPreLauncher.main(WSPreLauncher.java:111)
Caused by: java.lang.OutOfMemoryError
    at org.objectweb.asm.Type.getDescriptor(Unknown Source)
    at com.ibm.ws.amm.scan.util.info.impl.InfoImpl.getClassName(InfoImpl.java:217)
    at com.ibm.ws.amm.scan.util.info.impl.InfoImpl.getClassInfo(InfoImpl.java:158)

From the install script, we saw this error:

[1/10/14 06:17:52:941 CST] 00000000 AbstractShell A   WASX7093I: Issuing message: "WASX7017E: Exception received while running file "/websphere/utilities/scripts/installNewApplication.py"; exception information: com.ibm.websphere.management.application.client.AppDeploymentException: com.ibm.websphere.management.application.client.AppDeploymentException [Root exception is java.lang.OutOfMemoryError]
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError

We knew we had to increase the heap size of the deployment manager, but that turned out to be a little less obvious than anticipated.

The Python script used to kick off the install eventually calls the Profile’s (Dmgr) script: /websphere/AppServer/profiles/Dmgr/bin/GenPluginCfg.sh

binDir=`dirname ${0}`
. ${binDir}/setupCmdLine.sh
${WAS_HOME}/bin/GenPluginCfg.sh "$@"

The last line of the profile’s GenPluginCfg.sh calls the cell’s version of GenPluginCfg.sh, which is where the heap can be adjusted as needed (-Xmx):

"$JAVA_HOME/bin/java" -Xmx1024m \
  -classpath "$WAS_CLASSPATH" com.ibm.ws.bootstrap.WSLauncher \
  com.ibm.websphere.plugincfg.generator.PluginConfigGenerator "$WAS_HOME" "$CONFIG_ROOT" "$WAS_CELL" "$WAS_NODE" $@

Don’t forget to back up any script you plan to alter before making updates. No reason to make a fat-finger mistake that causes a cryptic error message and takes forever to figure out.
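If you want the backup step to be a habit rather than something you remember half the time, a tiny helper does the job. This is just a sketch: backup_file and the timestamped ".bak" naming are my own convention here, not anything WebSphere provides.

```python
import shutil
import time


def backup_file(path):
    """Copy path to a timestamped sibling file and return the backup's path."""
    backup_path = "%s.%s.bak" % (path, time.strftime("%Y%m%d%H%M%S"))
    # copy2 copies the contents and also preserves permission bits and timestamps
    shutil.copy2(path, backup_path)
    return backup_path
```

Running it against GenPluginCfg.sh before touching the -Xmx value leaves a copy sitting next to the original that can be moved back if the edit goes wrong.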

Cloned Cluster on WebSphere does not start

We had a need to create a new cluster of WebSphere 7 JVMs (Cluster_B) that are identical to an existing cluster (Cluster_A).  No problem, an easy task that I’ve done many times before. I proceeded to venture through the WAS console to create a new cluster using the existing Cluster_A_was01 member as a template. The new config was told to create new ports, I clicked through the save buttons, and gave the cluster members a few minutes to ensure they were synced up properly with the new configuration.

Everything worked as expected right up to the point that the server did not start after issuing the start command from the CLI (Command Line Interface).

websphere_01:~> /was/AppServer/profiles/AppServer/bin/startServer.sh Cluster_B
ADMU0116I: Tool information is being logged in file
ADMU0128I: Starting tool with the AppServer profile
ADMU3100I: Reading configuration for server: Cluster_B
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3011E: Server launched but failed initialization. startServer.log,
SystemOut.log(or job log in zOS) and other log files under
should contain failure information.

This is a new server, that was cloned from an existing one, so there could be a conflict of a param that I missed (ports, cookie names, etc.).  I look inside the Cluster_B log directory, and there is no SystemOut.log to be found.

websphere_01:~> cd /was/AppServer/profiles/AppServer/logs/Cluster_B
websphere_01:/was/AppServer/profiles/AppServer/logs/Cluster_B> ls -latr
total 16
-rw-r--r--  1 websphereUser websphereGroup    0 2014-02-26 15:19 native_stdout.log
-rw-r--r--  1 websphereUser websphereGroup    5 2014-02-26 15:35 Cluster_B.pid
-rw-r--r--  1 websphereUser websphereGroup 1935 2014-02-28 13:26 startServer.log
-rw-r--r--  1 websphereUser websphereGroup 2259 2014-02-28 13:26 native_stderr.log

Note that I tried to start the server, it failed, and told me to look in the SystemOut.log.  There is no SystemOut.log listed.  I’m now in uncharted waters.  I’ve never seen an instance of starting up a new JVM where no SystemOut.log or SystemErr.log is created.  Thanks for mutton WebSphere.

After verifying the ports are different from the cloned JVM from Cluster_A, kicking kittens, and other config comparisons, I thought to look at the JVM args, which would be identical to Cluster_A, since it is a clone.  I see that AppDynamics is there, and right next to it is the bane of the past couple of hours: a check mark next to Debug with the port set to 7777, just like Cluster_A’s debug configuration.

To be sure that the identical debug ports are the issue (and not AppD), I first remove the AppDynamics JVM params and try again.  Failure.  Next the debug config is removed altogether, and the server boots right up.  I changed the debug port on Cluster_B to 7778, reboot, and it again starts right up.

It would have been nice for the WAS server to let me know that there was a debug port conflict, instead of me fumbling around in the dark with no idea of where to start.  It would have saved me a couple of hours, and several kicks to kittens.
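Since WAS will not tell you, a quick pre-flight check can. The sketch below tries to bind the port on the local machine and reports whether that worked; it assumes the conflict is with something already bound on the same host, which was the case here with Cluster_A holding 7777. port_is_free is a hypothetical helper name.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError:
        # Typically "address already in use": something else holds the port
        return False
    finally:
        probe.close()
```

Calling port_is_free(7777) on the box before starting the new cluster member would have pointed at the debug port right away.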


WebSphere 7 app startup exception without a hint

I received the error below after making an update to my app. As you can see, there is nothing but IBM-specific code in the stack trace. So I had a good idea which app was causing the error, but there is no indication as to which part of my code is the offender. After multiple debug sessions and System.out.println() calls in the constructor, I was able to find that my .properties files were not being loaded properly.

It’s not that WebSphere is a terrible app server, it’s just that it’s so freakin’ expensive for what you get.

[9/19/13 16:17:55:564 CDT] 0000002e servlet E com.ibm.ws.webcontainer.servlet.ServletWrapper run [Servlet Error]-[class java.lang.NullPointerException: null]: java.lang.ClassNotFoundException: class java.lang.NullPointerException: null
at java.beans.Beans.instantiate(Unknown Source)
at java.beans.Beans.instantiate(Unknown Source)
at com.ibm.ws.webcontainer.servlet.ServletWrapper$1.run(ServletWrapper.java:1909)
at com.ibm.ws.security.util.AccessController.doPrivileged(AccessController.java:118)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.loadServlet(ServletWrapper.java:1900)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:730)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:502)
at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:181)
at com.ibm.ws.webcontainer.webapp.WebApp.handleRequest(WebApp.java:3944)
at com.ibm.ws.webcontainer.webapp.WebGroup.handleRequest(WebGroup.java:276)
at com.ibm.ws.webcontainer.WebContainer.handleRequest(WebContainer.java:931)
at com.ibm.ws.webcontainer.WSWebContainer.handleRequest(WSWebContainer.java:1592)
at com.ibm.ws.webcontainer.channel.WCChannelLink.ready(WCChannelLink.java:186)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleDiscrimination(HttpInboundLink.java:452)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleNewRequest(HttpInboundLink.java:511)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.processRequest(HttpInboundLink.java:305)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.ready(HttpInboundLink.java:276)
at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.sendToDiscriminators(NewConnectionInitialReadCallback.java:214)
at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.complete(NewConnectionInitialReadCallback.java:113)
at com.ibm.ws.tcp.channel.impl.AioReadCompletionListener.futureCompleted(AioReadCompletionListener.java:165)
at com.ibm.io.async.AbstractAsyncFuture.invokeCallback(AbstractAsyncFuture.java:217)
at com.ibm.io.async.AsyncChannelFuture$1.run(AsyncChannelFuture.java:205)
at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1646)

WebSphere datasource-ish issue

The following error occurs because the port is not open on the server (hhudb), which means I cannot connect to the DB that resides on the server. Several commands can help you figure out whether the port is open:

netstat -an| grep 50000

netstat -an| grep LISTEN|grep ^tcp

telnet hhudb 50000
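If telnet is not installed on the box (it often isn't anymore), the same connectivity test can be sketched in a few lines of Python. can_connect is a hypothetical helper; the host and port you pass in would be the ones from the data source definition.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures
        return False
    conn.close()
    return True
```

For this case, can_connect("hhudb", 50000) returning False lines up with the "Connection refused" in the error.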

The error originally received:

The test connection operation failed for data source mcshh01 on server dmgr at node Dmgr with the following exception: java.sql.SQLNonTransientException: [jcc][t4][2043][11550][4.11.69] Exception java.net.ConnectException: Error opening socket to server HHUDB/ on port 50,000 with message: Connection refused. ERRORCODE=-4499, SQLSTATE=08001DSRA0010E: SQL State = 08001, Error Code = -4,499. View JVM logs for further details.