xmlstarlet to remove XML stanzas

Given our environment has both WebSphere v7 and WebSphere v9, we must merge their respective plugins. There are similarly named clusters in both v7 and v9 (e.g. Level_1, Level_2, etc.), and for some reason GenPluginCfg.sh will merge one (and only one) of those clusters. They’re not even the same clusters in test and prod. In addition, the merged stanza ends up with its own unique entries for the cluster in question.

I had noticed this before, but since everything worked as expected through our IHS (aka Apache), it never registered on our radar. However, when I moved our traffic to the latest IHS version, we began to see ServerIOTimeouts to the cluster that spans both WAS v7 and WAS v9. We have yet to pinpoint exactly why IHS v9 is stricter than IHS v7, but either way we had to fix this problem.

The error messages pointed to the ServerIOTimeout, but the numbers did not match what I had explicitly set for the Level_1 servers (60 seconds). That led me to the “Shared” cluster that the plugin merge had created on its own.

ERROR: ws_common: ServerActionfromReadRC: ServerIOTimeout fired. Time out 1. retry count 0. serverIOTimeoutRetry -1, retry YES, rc 2, server Level_1_was_v9_01_1, URI /someUrl, client port 1234

The plugin-cfg.xml file, with the merged and independent pieces, looks like this:

<ServerCluster CloneSeparatorChange="false" GetDWLMTable="false"
	IgnoreAffinityRequests="true" LoadBalance="Round Robin"
	Name="Shared_3_Cluster_0" PostBufferSize="64" PostSizeLimit="-1"
	RemoveSpecialHeaders="true" RetryInterval="60" ServerIOTimeoutRetry="-1">
	<Server CloneID="1basreo4a" ConnectTimeout="5"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="0"
		Name="wasv901Node_Level_1_was_v9_01_1"
		ServerIOTimeout="-1" WaitForContinue="false">
		<Transport ConnectionTTL="28" Hostname="wasv901"
			Port="9445" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<Server CloneID="1692lco3o" ConnectTimeout="90"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="-1"
		Name="wasv701Node_Level_1_WAS_v7_01_0"
		ServerIOTimeout="-1" WaitForContinue="false">
		<Transport Hostname="wasv701.company.com" Port="30006" Protocol="http"/>
		<Transport Hostname="wasv701.company.com" Port="31006" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<Server CloneID="1692lcpij" ConnectTimeout="90"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="-1"
		Name="wasv702Node_Level_1_WAS_v7_02_0"
		ServerIOTimeout="-1" WaitForContinue="false">
		<Transport Hostname="wasv702.company.com" Port="30006" Protocol="http"/>
		<Transport Hostname="wasv702.company.com" Port="31006" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<Server CloneID="1bassd4fp" ConnectTimeout="5"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="0"
		Name="wasv902Node_Level_1_was_v9_02_1"
		ServerIOTimeout="-1" WaitForContinue="false">
		<Transport ConnectionTTL="28" Hostname="wasv902"
			Port="9445" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<PrimaryServers>
		<Server Name="wasv901Node_Level_1_was_v9_01_1"/>
		<Server Name="wasv702Node_Level_1_WAS_v7_02_0"/>
		<Server Name="wasv902Node_Level_1_was_v9_02_1"/>
	</PrimaryServers>
	<BackupServers>
		<Server Name="wasv701Node_Level_1_WAS_v7_01_0"/>
	</BackupServers>
</ServerCluster>

<!-- WAS v7 -->
<ServerCluster CloneSeparatorChange="false" GetDWLMTable="false"
	IgnoreAffinityRequests="true" LoadBalance="Round Robin"
	Name="Level_1_0" PostBufferSize="64" PostSizeLimit="-1"
	RemoveSpecialHeaders="true" RetryInterval="60" ServerIOTimeoutRetry="-1">
	<Server CloneID="1692lco3o" ConnectTimeout="90"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="-1" Name="wasv701Node_Level_1_WAS_v7_01"
		ServerIOTimeout="60" WaitForContinue="false">
		<Transport Hostname="wasv701.company.com" Port="30006" Protocol="http"/>
		<Transport Hostname="wasv701.company.com" Port="31006" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<Server CloneID="1692lcpij" ConnectTimeout="90"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="-1" Name="wasv702Node_Level_1_WAS_v7_02"
		ServerIOTimeout="60" WaitForContinue="false">
		<Transport Hostname="wasv702.company.com" Port="30006" Protocol="http"/>
		<Transport Hostname="wasv702.company.com" Port="31006" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<PrimaryServers>
		<Server Name="wasv702Node_Level_1_WAS_v7_02"/>
	</PrimaryServers>
	<BackupServers>
		<Server Name="wasv701Node_Level_1_WAS_v7_01"/>
	</BackupServers>
</ServerCluster>


<!-- WAS v9 -->
<ServerCluster CloneSeparatorChange="false" GetDWLMTable="true"
	IgnoreAffinityRequests="false" LoadBalance="Round Robin"
	Name="Level_1_1" PostBufferSize="0" PostSizeLimit="-1"
	RemoveSpecialHeaders="true" RetryInterval="60" ServerIOTimeoutRetry="-1">
	<Server CloneID="1basreo4a" ConnectTimeout="5"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="0"
		Name="wasv901Node_Level_1_was_v9_01"
		ServerIOTimeout="60" WaitForContinue="false">
		<Transport ConnectionTTL="28" Hostname="wasv901"
			Port="9445" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<Server CloneID="1bassd4fp" ConnectTimeout="5"
		ExtendedHandshake="false" LoadBalanceWeight="77"
		MaxConnections="0"
		Name="wasv902Node_Level_1_was_v9_02"
		ServerIOTimeout="60" WaitForContinue="false">
		<Transport ConnectionTTL="28" Hostname="wasv902"
			Port="9445" Protocol="https">
			<Property Name="keyring" Value="/ihs/security/plugin-key.kdb"/>
			<Property Name="stashfile" Value="/ihs/security/plugin-key.sth"/>
		</Transport>
	</Server>
	<PrimaryServers>
		<Server Name="wasv901Node_Level_1_was_v9_01"/>
		<Server Name="wasv902Node_Level_1_was_v9_02"/>
	</PrimaryServers>
</ServerCluster>

My first thought was to see if I could prevent the GenPluginCfg.sh script from merging these clusters together, but that proved to be a waste of time. I then deleted the merged stanza from Test’s plugin-cfg.xml to see what would happen, and to my delight everything worked without issue.

Sometimes there are unintended consequences, so I put all of this info into an IBM Support ticket and had their brain power evaluate the problem at large. They said merging the clusters was a poor implementation choice on their side, that they have seen it before, and that there was no timetable to fix it.

I told them about my idea to simply remove the “shared” pieces from the plugin-cfg.xml, and they said that would be a perfectly fine way to fix the problem.

I first tried some combination of awk/sed/gawk, but those were close, but no cigar. That led me to xmlstarlet to parse the XML, which I put in another Unix script that manipulates the plugin-cfg.xml after the merge has occurred, but before it is pushed out to my IHS servers:

# PLUGIN_TEMP points at the freshly merged plugin-cfg.xml ("xml" is the xmlstarlet binary).
# "xml el -v" lists each element as an XPath with attribute predicates, so grep can pick
# out the auto-generated "Shared" stanzas and "xmlstarlet ed -d" can delete them.
xpathShared=`xml el -v ${PLUGIN_TEMP} | grep UriGroup | grep Shared_`
xmlstarlet ed -d "$xpathShared" plugin-cfg.xml > xml1

xpathShared=`xml el -v xml1 | grep ServerCluster | grep Shared_`
xmlstarlet ed -d "$xpathShared" xml1 > xml2

xpathShared=`xml el -v xml2 | grep Route | grep sharedCell_`
xmlstarlet ed -d "$xpathShared" xml2 > xml3

xpathShared=`xml el -v xml3 | grep UriGroup | grep sharedCell_`
xmlstarlet ed -d "$xpathShared" xml3 > xml4

xpathShared=`xml el -v xml4 | grep VirtualHost | grep sharedCell_`
xmlstarlet ed -d "$xpathShared" xml4 > plugin-cfg.xml

This script greatly simplified the removal of the unnecessary merged stanzas, and it is far more maintainable than the awk/sed approach would have been even if I had gotten those commands to work.
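If you want a quick sanity check after the cleanup, a grep over the same element listing works; this is just a minimal sketch run against the final plugin-cfg.xml, using the same patterns as the script above:

xmlstarlet el -v plugin-cfg.xml | grep -E 'Shared_|sharedCell_' && echo "merged stanzas still present" || echo "no merged stanzas found"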

WAS restart script to kill off hung threads

Our WebSphere environment has nightly restarts because some of the apps are so shitty that they cannot run for more than 24 hours at a time, and the app owners do not care (another conversation for another day). Because of that, a long time ago we implemented nightly restarts that bounce all apps on a given cluster.

Every so often we get a Splunk notification that a cluster node is not back up and running, and when we investigate, we find no java processes running on the machine. After digging into it, I discovered that our script attempts to shut down each AppServer, but a rogue app with a hung thread prevents the stopServer.sh command from ever completing.

To combat this in our Restart_AppServer.sh script, I have utilized the ‘timeout’ and ‘pgrep’ commands.

The timeout command is pretty straightforward: if the call does not return within the provided number of seconds, the command it is running gets killed.

pgrep is also a pretty straightforward command, but the wrinkle is that the Restart_AppServer.sh command takes the server name as a parameter. So if you have an AppServer named ‘Level_1’, then ‘pgrep -f Level_1’ returns 2 PIDs: one for the AppServer, and one for Restart_AppServer.sh itself.

To get around this, I look up the PID of the restart script itself and remove it from the result set with grep’s ‘-v’ option.

timeout stops the server, but kills the attempt if it doesn’t complete in time:

timeout 120 ${WAS_ROOT}/bin/stopServer.sh ${wrk_server}

pgrep grabs the process ID(s) of whatever you’re grepping for.

# PID of the controlling restart script, so we can exclude it from the matches below
CONTROLLER_SCRIPT_PID=`pgrep -f Controller`
echo "********** pid =  ${CONTROLLER_SCRIPT_PID}"

# any remaining PID matching the AppServer name ($1) means stopServer.sh did not finish
SERVER_PID=`pgrep -f $1 | grep -v ${CONTROLLER_SCRIPT_PID}`
echo "********** $1 pid =  ${SERVER_PID}"
if [[ ${SERVER_PID} != "" ]]
then
    echo "### ERROR ### AppServer $1 could not be shutdown gracefully, and had to be killed" >> ${TEMP_LOG}
    pgrep -f $1 | grep -v ${CONTROLLER_SCRIPT_PID} | xargs kill -9
fi

Pass parameters to interactive Unix script

I hhhhhhhhaaaaaattttteeee being a monkey that just pushes a button. There’s always a better (and cheaper) way to restart a system than having a human push a button. God gave us computers for exactly that reason!

We have an environment that treats whoever runs the restart as someone unfamiliar with the stop.sh and start.sh scripts. Getting to the point of being allowed to run them already requires a huge amount of IT experience, yet for whatever reason the stop/start process still demands a crap-ton of hand-holding.

I’m not in charge of this system, but during an on-call I had to be the monkey that pushes the stop/start buttons and follows along with a series of very basic questions. Eff that, surely it can be scripted. “Impossible,” replied my compadre, because the stop/start requires you to enter some usernames, passwords, numbers, and a prostate exam.

That’s bush league; there has to be a better way that removes me from the process. Sure enough, Unix lets you feed a line-separated answer file straight into an interactive script’s standard input. An hour later, this is what was produced.

Test file that mimics the system stop/start interactive commands:

admin@server01:/tmp> cat intro.sh
#!/bin/bash
# Prompt for the same values the real stop/start scripts ask for
echo Give me your number
read varname
echo Provided number: $varname

echo Give me your username
read varname
echo Username: $varname

echo Give me your password
read varname
echo Password: $varname

echo 2 Give me your username
read varname
echo Username: $varname

echo 2 Give me your password
read varname
echo Password: $varname

echo Sleeping...
sleep 10
echo Done sleeping.

echo Press 7 to exit
read varname
echo You have entered: $varname

Here’s the input file that correlates to the questions being asked

admin@server01:/tmp> cat input.txt
1
user1
pwd1
usr2
password2
7

Here’s what the execution of the file looks like

admin@server01:/tmp> cat input.txt | ./intro.sh
Give me your number
Provided number: 1
Give me your username
Username: user1
Give me your password
Password: pwd1
2 Give me your username
Username: usr2
2 Give me your password
Password: password2
Sleeping...
Done sleeping.
Press 7 to exit
You have entered: 7

As long as the script asks for its input in the same order every time, you can use this method to “cat” a file of answers into it.
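Piping the file with cat works, but plain input redirection does the same thing with one fewer process; this assumes the same intro.sh and input.txt shown above:

./intro.sh < input.txt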

CPU running high, AppDynamics help

[Screenshot: AppDynamics CPU notification, 2014-11-10]

We are still trying to get to the root of this error, but at least AppD notifies us of its existence, so we can either give it time to complete or just kill it off.

When the issue occurs, we typically use the Unix ‘top’ command to see which PID is pegging the CPU (PID 18679 in the sample below, at 159% CPU), stop the WebSphere node, and kill off the PID. The hope is to get AppD to help us track down the runaway Java method that is causing the spike, so we can fix the root cause instead of just killing off the symptom.

waspapps02:~> top
top - 16:45:47 up 63 days, 14:26,  1 user,  load average: 3.92, 3.58, 3.52
Tasks: 176 total,   1 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s): 91.3%us,  8.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.2%hi,  0.2%si,  0.0%st
Mem:   8129720k total,  8064584k used,    65136k free,    35760k buffers
Swap:  4192956k total,  2240932k used,  1952024k free,   212712k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18679 wasadmin  20   0 2966m 1.4g 6516 S  159 18.2 148:32.77 java
18211 wasadmin  20   0 1852m 1.3g 5904 S   15 16.7  49:43.31 java
18727 wasadmin  20   0 3388m 2.2g 7772 S    3 28.2  60:20.76 java
 5378 wasadmin  20   0 1877m 1.2g 7288 S    1 15.3  55:03.02 java
 8031 root      20   0  208m 3628 2252 S    1  0.0 134:01.03 aex-metricprovi
18541 wasadmin  20   0 1946m 940m 7556 S    1 11.8  27:35.41 java
 3278 wasadmin  20   0  165m  15m 2956 S    0  0.2   3:19.99 splunkd
17419 wasadmin  20   0  8772 1236  852 R    0  0.0   0:00.01 top
    1 root      20   0 10376   88   56 S    0  0.0   0:47.11 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.56 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:09.55 migration/0
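Before killing anything, here is a rough sketch of the manual triage we can do with that PID (18679 comes from the top output above; on an IBM JDK, SIGQUIT writes a thread dump rather than terminating the process):

ps -fp 18679      # full command line shows which AppServer owns the runaway PID
kill -3 18679     # IBM JDK: writes a javacore*.txt thread dump, typically to the server's working directory
# once the dump (and the AppD snapshot) are captured, kill -9 18679 as a last resort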

Unix inode table max

We have a legacy application that creates a crap-load of files to track key data points within the app. Why this is not in a DB, I have no idea. Probably because the app was written in the 90’s, and it is so fragile and complex that no one wants to touch it.

We recently migrated this app to an upgraded WebSphere environment and OS, which took 15 different people from 5 different teams and a few months’ worth of effort. We knew there would be hiccups along the way, and during one of my recent on-calls, I received a call with the error below:

19:12:22.763 [Thread-4795] ERROR - Database creation Error: 
java.io.FileNotFoundException: /mount/data.xml (No space left on device)
          at java.io.FileOutputStream.open(Native Method) ~[na:6.0]
          at java.io.FileOutputStream.<init>(FileOutputStream.java:179) ~[na:6.0]
          at java.io.FileOutputStream.<init>(FileOutputStream.java:131) ~[na:6.0]
          at com.company.web.servlet.CreateDatabase_Sax$CallWebServiceRunner.run(CreateDatabase_Sax.java:518) ~[classes/:na]
          at java.lang.Thread.run(Thread.java:735) [na:6.0]
19:16:12.294 [WebContainer : 9] ERROR com.web.servlet.FileUpload - Error processing file upload:
19:16:12.295 [WebContainer : 9] ERROR com.web.servlet.FileUpload - java.lang.Exception: Error creating server directory
java.lang.Exception: Error creating server directory
          at java.lang.Throwable.<init>(Throwable.java:67) ~[na:6.0]
          at com.company.web.servlet.FileUpload.doGet(FileUpload.java:81) [classes/:na]
          at com.company.web.servlet.FileUpload.doPost(FileUpload.java:152) [classes/:na]
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:738) [javax.j2ee.servlet.jar:na]
          at javax.servlet.http.HttpServlet.service(HttpServlet.java:831) [javax.j2ee.servlet.jar:na]
          at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1661) [com.ibm.ws.webcontainer.jar:na]
          at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:944) [com.ibm.ws.webcontainer.jar:na]
          at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:507) [com.ibm.ws.webcontainer.jar:na]

“No space left on device” made me think the mount was out of space, but a “df -h” revealed the mount had plenty of free space. So was the error message completely bogus, or possibly related to the hand-held device that triggered it? We knew orders were failing to be sent, so there was an error somewhere in the app. After the developers confirmed the hand-held devices were not the culprit, we got our Unix team on the horn, and they immediately knew the problem: the inode table had been maxed out.

The Unix admin’s explanation is much more informative and elegant than anything I could have put together on the topic:

The parameter changed on the filer was ‘maxfiles’, a setting which most applications never get close to maximizing. As the parameter name implies, it simply controls how many files exist at a given time on a volume and is enforced at a storage level from the NetApp rather than a filesystem level by the OS. When the problem occurred, the 3 million inode limit was reached and by nature of how it works could not be reduced until we first had a bit of overhead for breathing room. What we saw initially after increasing the limit was that 20,000 new inode assignments were made, but then when I checked a couple hours later it had dropped about 250,000. It has been fairly stable at about 2.82M now for the rest of the weekend. Given the drop, I would say we could likely reduce our ceiling as well down to perhaps 3.2M if we wanted to pull the reins in a little bit from the increase.

I bet somewhere there were Linux logs that referenced the inode issue, but unfortunately I didn’t have access to them (Splunk, anyone?). It would have been nice if the word “inode” had appeared somewhere in the error message; it would have saved me and a few of my teammates an hour or two on a weekend.
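For future reference, inode exhaustion is quick to check from the shell, assuming the filer reports inode counts over the mount; /mount here is the path from the stack trace above:

df -h /mount     # shows plenty of free space, which is what threw us off
df -i /mount     # IUse% near 100% is the real story behind "No space left on device"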

JDK not found on Linux Path

I’m researching Atlassian’s Stash to help us manage our Git repositories, and in the process I started with a completely fresh SUSE Linux machine. I unpacked the JDK and added it to the PATH:

export PATH=$PATH:/jdk/jdk1.7.0_25/bin

However, running java gave the dreaded “command not found”, and “which java” came back empty as well. After verifying that the JDK path did exist (running ../bin/java -version directly worked), I knew something higher up in the PATH had to be getting hit before my JDK was reached.

Digging a little further up the PATH, I found a /usr/lib/java that existed but was corrupt. Since I do not own this machine, I simply put my JDK first in the PATH to fix the issue:

export JAVA_HOME=/jdk/java/jdk1.7.0_25
export PATH=$JAVA_HOME/bin:$PATH
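
A quick way to confirm which java the shell now resolves first (paths are the ones from above):

which java       # should now point at $JAVA_HOME/bin/java
java -version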