Thursday, August 27, 2009

Where do you stop as a DBA? Thoughts about problem solving....

This is one of my few posts that are not about Informix features. I will write about a real life situation and use it for developing some thoughts about problem solving ability (or lack of it).

I will not use real names from either customers or suppliers except of course Informix and my employer. The scenario is this:

Several IBM Informix Dynamic Servers will be upgraded from version 10 on platform P1 to version 11.50 on platform P2. P2 is from a third party supplier, let's call it S1.
Currently the customer uses ontape utility for backup, but wants to take advantage of OnBar. So tests were made using OnBar, IDS 11.50 and a Storage Manager from another supplier S2 on platform P2. After some issues the tests run fine and some development instances are migrated. Suddently we start having issues. OnBar reports a BAR_TIMEOUT in BAR_ACT_LOG file... This message informs us that OnBar executed a BSA (XOpen standard for backup API) call which failed to return (successfuly or unsuccessfuly) within a specific ammount of time. OnBar will not give up, and will keep logging messages in BAR_ACT_LOG file. The Informix error message description is this:


"-43296 Warning: BAR_TIMEOUT Storage Manager Progress may be stalled. The current command has exceeded the BAR_TIMEOUT value. This might indicate a normal transfer of a large data set or it can mean the storage manager needs attention. Check to make sure the storage manager is still running and that data is still being transferred to/from the storage manager. Refer to the storage manager message log (or equivalent) for more information. If the cause of the error is still unclear, contact your storage manager Technical Support. If you are using IBM Informix Storage Manager, contact Technical Support at tsmail@us.ibm.com"

The message looks explicit and clear, but the Storage Manager operators decided that it should be a problem with OnBar. Let's make a small pause here... this is the first point I'd like to raise... What would you do? Trust the Informix error message that explicitly says to contact the Storage Manager technical support or do your own investigation?
Well, as an onsite consultant I decided to do some investigation... First things I did were:
  1. Running truss against onbar_d processes
  2. Running dbx/gdb/adb against the onbar_d processes
  3. Running lsof against the onbar_d processes
These simple steps revealed that:

  1. truss showed OnBar was "stuck" on send(). This is an OS function used to send data through a socket
  2. gdb showed that onbar_d had called the BSASendData() function which in turn had called send(). BSASendData() is part of the dynamic library we have to configure in $ONCONFIG (BSALIBPATH)
  3. lsof showed that the file descriptor (fd) used in the send() call was a socket opened to the storage manager "media server". This is a host which has direct access to the storage manager allocated tape drives
So, it was clear (only for me as we shall see...) that there was a problem in the storage manager while it tried to send data from the database server to the media server. The next logical step would be to analyse the status of the process receiving the data in the media server. Since I had no access to that server I asked for it, but without success... In the following days I received all kinds of evasive replies and useless requests both from the storage manager team and from the storage manager supplier (S2) technical support. Some of these were:

  • You cannot run a whole system backup with more than one concurrent process
    True for versions earlier than 11.10, but incorrect for version 11.50 (and this suggestion ignored all the debug info above)
  • We were trying to use more onbar_d processes than what the storage manager was configured to support, leading to waits on the OnBar side
    This could make sense, but when this happens, the onbar_d processes wait for a free "slot" at the begining of a dbspace backup.... they don't get "stuck" in the middle (usually it happened around 90% of a dbspace backup)
  • They suggested that we should change the BAR_TIMEOUT parameter (it exists only in XPS, not IDS....)
  • They even suggest me that I should consult the OnBar manual ;)
Finnally the tech support guy from S2 was on site, and I managed to have him collect the info (truss, gdb, lsof) from the media server processes. At that moment I understood why they ignored the info above. He didn't knew how to use or what were these utilities...
But I got a big surprise when I saw the outputs.... The process receiving data was "stuck" on recv(). This is an OS function used to get data from a socket. So, in short, we had an ESTABLISHED socket between two servers, one of them was on send(), and the other was on recv().... But no data was being transferred.

This made me think about network or TCP layer issues. But usually this are the last places to look for issues. Note that I don't consider network problems a rarity, but these usually translate into broken connections, not exactly hangs...
This observation was a turning point. I immediately assumed that the platform provider (S1) would have to be involved. The situation was even more complex because the media server was a Linux machine. So the problem could be in either platform...

During the next two months (!), I've been through a real nightmare, because the customer insisted that there could be something wrong with OnBar, I spent several hours in useless conference calls (customer and S2) and meetings. They insisted in making all kind of weird tests like:
  • Running onbar with few and more processes
  • Reconfigure their whole backup infrastructure
  • Reconfigure their storage manager parameterization
At some point they asked me the following:
  • Do you have customers running with this setup?
    They seemed desesperate to prove OnBar was doing something wrong...
    They don't understand we don't keep any kind of tracking on customer environments. We do keep an incredible useful internal knowledge base where customer problems are recorded, but as expected I was tired of making exhaustive searches on that KB without finding any similar cases...
  • They proposed to use another storage manager (interestingly from vendor S1)
    The point was more or less this: If it doesn't work with that, surely the problem is not with the storage manager (and that again would imply the problem was with IDS...)
  • They insisted in having a tech support case opened on IBM
    This seems reasonably, but the behavior of OnBar was completely explained in detail by me previously
At this point I was rather tired of the whole issue. Interestingly the tests with the second storage manager revealed two things:
  1. The new environment hit another problem
    This problem was already reported in IBM knowledge base and it was a problem in that storage manager BSA library. I had to wait one week in order to know if the version they were using included the necessary fix... it turned out it didn't ;)
  2. After solving that first issue the backups aborted with some sort of storage manager timeout... The real difference was in the storage manager behavior. First one waited indefinitely and this one establishes a timeout (much like OnBar does, but since it's the one calling the low level OS function it aborts the process instead of just warning the administrator like OnBar does)
After about two months of outstanding time waste I understood the only way to solve this in a timely manner was to reproduce the issue without any Informix or storage manager layer. Let me point that some people in the customer team had made exhaustive tests sending data across servers using scp for example. All these tests were successful, and they concluded it couldn't be anything in the TCP layer. I warn them (and showed them using the tools above that scp didn't use the same functions, block size etc.). if the tests had failed it would have beeen nice, but a working scp session didn't proove us that there was nothing wrong with the TCP stack.

After a few more days and a lot of GB sent, I was able to create a test case with just C programming. A simple client/server program that kept sending data through a socket. I made sure I established the socket options and block size similar to what the storage manager's BSA library used.

After this, they finally accepted that something was wrong with the vendor OS functions (specifically send() ). In the next day we had a conference call with all the parties involved. And in less than a week we had a fix (newly coded) for the OS. After installing we never had another similar hang on send() function.

The purpose of this post is to show how important it is to know how to debug a problem. Tools like truss, dbx/gdb/adb and lsof are invaluable for this. I personally think anybody working on IT problem solving independentely of their role (DBA, programer, system administrator, network administrator etc.) should have a minimal knowlege about how they work and what kind of information you can collect with them. There are obviously other tools equally useful in some scenarios. Tools like tcpdump, netstat etc. This whole case revealed that many people involved in this area don't have a clue about how these tools work and the info they can collect.
Everybody involved had a theory about the problem. But none of those theories were based on facts or real observations. They were merely speculations of what could be happening and all these tend to pass the problem to a different team...
Some facts about the situation:
  • lots of emails including polished accusations were exchanged
  • during two months I spent a considerable amout of time paying attention to this, trying to lead people into the righ direction (mostly without success)
  • Vendor S2 had a very large team involved. They even sent a consultant from another country into this customer site (when he arrived we were about to receive the fix, so apart from politics this guy was not able to do much about this problem)
  • The problem could have been solved in around two weeks (one for debugging and another for the vendor investigation and code fixing)
  • No one admitted that didn't understand the output of the tools above and no one (even after the conclusion) took the opportunity to ask me to explain how to use this kind of tools
    I don't think I know everything (and in fact my knowledge about these tools is somewhat superficial), but I only learn how to use this kind of stuff because in some situation in my past I came across some references to them and I took the time to experiment and read about them. In other words, we should use the problems as opportunities to gather more knowledge and to learn new things.
I keep receiving reports about problems without any useful information. My favorite ones are:

  • I tried it and it doesn't work!
    Surely I believe it doesn't work... But usually when something "doesn't work" it raises an error. Only rarely people include the error code/description. More often they include possible causes (usually not related to the error) than they include the error code or error message
  • The process was hang! It was probably in some kind of loop! So I killed it
    Well... processes don't "hang". They wait on something or they effectively stay in some loop. And the way to see that is by using the tools... And the tools don't work on a process that doesn't exist anymore...
  • I've found a bug!
    Sure... A bug is a malfunction. Something that works differently from what's documented or what is expected. Most of the times people find a problem. After some analisys (many times involving supplier's technical support) it may be mapped to a bug. People tend to expect something. If it doesn't happen they "found a bug". Usually they don't bother to read the documentation and try to understand the reasons for the unexpected behavior.
    Obviously people do hit bugs. Most of the cases that I open within IBM end up as bugs. But this is just a very small portion of the problems that customers report.
In short, I feel that in general, people's ability to study a problem in the IT world is vey limited. Usually people spend more time trying alternatives than collecting and understand problem data. Error messages and codes are ignored many times. And all these translate into a big waste of time, and obviously money... And of course, this directly impacts the quality and availability of the IT systems.


2 comments:

Anonymous said...

It happens all the time. Finding a problem can be really painful. Especially in big companies.

I've never used truss and after reading this article I will have to have a look at it.

Regards,
ukn

Anonymous said...

Hello Great Article
It am having a similar case and I would surely benefit from your 'Troubleshooting Style'
Kashif