"It's the errors, stupid..."
This article is written in English and Portuguese
Este artigo está escrito em Inglês e Português
English version:
This post's title is an allusion to "It's the economy, stupid" phrase that become popular in one of the USA presidential campaigns. It seems appropriate for today, and for the short problem description I'm going to describe.
Today, while working on a customer site, I was confronted with the following scenario:
A developer complained that two programs working against a non-logged database (yes, some customers still use them...) raised errors (-243/111) while DELETEing an overlapped result set. The customer DBAs replied the same as I probably would: "That can happen even on a non-logged database. Please use some sort of "SET LOCK MODE TO WAIT...' "
But the program was already using it... So... why does it happen?!
Well... we all know that there are some differences between logged and non-logged databases, but a complete list is hard to find. One is that we cannot change the isolation level in a non-logged database. It raises an error stating that the only mode allowed is DIRTY READ. But it does accept the SET LOCK MODE WAIT statement, so we would expect it to enforce it...
In short, what was I missing, apart from a possible, but improbable bug? The answer is, I'm missing a careful analysis of the error codes! And I should be ashamed, because I have recommended many times that people MUST always look carefully into the errors, before trying to create explanations for problems that they haven't qualified. Precisely what I was doing... So let's see some facts:
- The instruction causing he error was something like:
DELETE FROM some_table WHERE indexed_field = VALUE
And the query plan used the index on the column - Error -243 reads:
-243 Could not position within a table table-name.
The database server cannot set the file position to a particular row
within the file that represents a table. Check the accompanying ISAM
error code for more information. A hardware error might have occurred,
or the table or index might have been corrupted (truncated). Unless the
ISAM error code or an operating-system message points to another cause,
run the oncheck utility (secheck with IBM Informix SE or tbcheck with
IBM Informix OnLine) to check and repair table and index. - Error -111 reads:
-111 ISAM error: no record found.
The ISAM processor cannot locate the requested record. For C-ISAM
programs, no record was found with the requested index value or record
number, depending on the retrieval mode in use. Make sure that the
correct index is in use. For SQL products, see the SQL error message or
return code. Probably no row was found for this query - And finnaly, error -107 reads:
-107 ISAM error: record is locked.
Another user request has locked the record that you requested or the
file (table) that contains it. This condition is normally transient. A
program can recover by rolling back the current transaction, waiting a
short time, and re-executing the operation. For interactive SQL, redo
the operation. For C-ISAM programs, review the program logic and make
sure that it can handle this case, which is a normal event in
multiprogramming systems. You can obtain exclusive access to a table by
passing the ISEXCLLOCK flag to isopen. For SQL programs, review the
program logic and make sure that it can handle this case, which is a
normal event in multiprogramming systems. The simplest way to handle
this error is to use the statement SET LOCK MODE TO WAIT. For bulk
updates, see the LOCK TABLE statement and the EXCLUSIVE clause of the
DATABASE statement.
So, we were clearly overlooking the ISAM error code. Now that we passed that phase, what does -111 means in this context? That's not that evident, and I haven't found a good reference in the documentation yet.
In this case, both queries were using the index and found the rowid. But when then they moved to access the data, the row was already deleted by the other session. That's why the error is not -107, but -111 instead. You may be wondering "What?! Isn't that an inconsistency in the data processing?". Well, yes... And that's why things work differently if you're on a logged database, depending on the isolation level you use. In some cases you'd just receive a "no rows found" response. In other situations, where the isolation level would make this impossible, you'd get this -111 error (so that you notice something was wrong). But on non-logged databases the mechanisms that allow that, are not implemented, so the behavior is to raise an error.
The main points here is that you must always:
- Get the SQLCODE and the ISAM errors. Only the SQLCODE is too vague in many situations
- Look into them, and stop assuming things based just on SQLCODE or your previous experience (the richer it is, the more chances you'll have that it traps your mind)
Versão Portuguesa:
O título deste artigo é uma alusão à frase "It's the economy, stupid" que se tornou popular numa das campanhas presidenciais dos EUA. Pareceu-me apropriada para hoje e para o problema que vou descrever.
Hoje, durante uma presença num cliente, fui confrontado com o seguinte cenário:
Um programador queixou-se que dois programas a trabalhar numa base de dados sem logging (sim, alguns clientes ainda as usam...) estoiravam com erros (-243/-111) numa operação de DELETE em que os universos de registos se misturavam. Um dos DBAs do cliente respondeu como eu provavelmente teria feito: "Isso pode acontecer mesmo numa base de dados sem logging. Por favor usem algum tipo de 'SET LOCK MODE TO WAIT...'"
Mas os programas já o estavam a usar... Portanto... Porque é que os erros aconteciam?!
Bom... todos sabemos que existem diferenças entre as bases de dados com logging e as que não têm logging, mas não é fácil encontrar uma lista. Uma das diferenças é que não podemos mudar o nível de isolamento numa base de dados sem logging. Retorna-nos um erro indicando que o único nível suportado é o DIRTY READ. Mas no caso do SET LOCK MODE WAIT é aceite e seria de esperar que fosse respeitado.
Em suma, o que me estava a escapar, aparte algum possível mas improvável bug. A resposta é que me estava a escapar uma análise séria dos códigos de erro! E deveria ter vergonha, porque recomendei muitas vezes que as pessoas TÊM de olhar cuidadosamente para os erros antes de tentarem criar explicações para problemas que não chegaram a classificar/caracterizar corretamente. Exactamente o que estava a fazer.... Vejamos então os factos:
- A instrução que causava o erro era semelhante a:
DELETE FROM some_table WHERE indexed_field = VALUE
E o plano de execução usava o índice na coluna usada no WHERE - A descrição do erro -243 é:
-243 Could not position within a table table-name.
The database server cannot set the file position to a particular row
within the file that represents a table. Check the accompanying ISAM
error code for more information. A hardware error might have occurred,
or the table or index might have been corrupted (truncated). Unless the
ISAM error code or an operating-system message points to another cause,
run the oncheck utility (secheck with IBM Informix SE or tbcheck with
IBM Informix OnLine) to check and repair table and index. - A descrição do erro -111 é:
-111 ISAM error: no record found.
The ISAM processor cannot locate the requested record. For C-ISAM
programs, no record was found with the requested index value or record
number, depending on the retrieval mode in use. Make sure that the
correct index is in use. For SQL products, see the SQL error message or
return code. Probably no row was found for this query - E finalmente a descrição do erro -107 é:
-107 ISAM error: record is locked.
Another user request has locked the record that you requested or the
file (table) that contains it. This condition is normally transient. A
program can recover by rolling back the current transaction, waiting a
short time, and re-executing the operation. For interactive SQL, redo
the operation. For C-ISAM programs, review the program logic and make
sure that it can handle this case, which is a normal event in
multiprogramming systems. You can obtain exclusive access to a table by
passing the ISEXCLLOCK flag to isopen. For SQL programs, review the
program logic and make sure that it can handle this case, which is a
normal event in multiprogramming systems. The simplest way to handle
this error is to use the statement SET LOCK MODE TO WAIT. For bulk
updates, see the LOCK TABLE statement and the EXCLUSIVE clause of the
DATABASE statement.
Portanto, estava claramente a ignorar o código de erro ISAM. Agora que passámos essa fase, o que quer dizer o erro -111 neste contexto? Não é muito evidente, e ainda não encontrei uma boa explicação na documentação.
Neste caso, ambas as queries estavam a usar o índice para obterem o rowid. Mas quando depois iam aceder aos dados, a linha já tinha sido apagada pela outra sessão. É por isso que em vez do habitual -107 temos o -111. Pode estar a pensar "Hmmm?! Mas isso não é um processamento inconsistente?". Bem, sim... E é por isso que as coisas funcionam de forma diferente numa BD transacional (com log), dependendo do nível de isolamento que estamos a usar. Em alguns casos receberíamos apenas um "no rows found". Noutras situações, caso o nível de isolamento tornasse a situação impossível, então obteria-mos o erro -111 (para sinalizar que algo de verdadeiramente errado se passou). Mas em bases de dados sem logging os mecanismos que permitem esse comportamento não estão implementados, logo é retornado o erro.
Os pontos principais são que temos sempre de:
- Obter os erros SQLCODE e ISAM. Apenas o SQLCODE pode ser demasiado vago em muitas situações.
- Examinar
bem ambos os erros, e evitar tirar conclusões precipitadas baseadas
apenas no SQLCODE ou na nossa experiência passada (quanto mais rica for,
maiores serão as hipóteses de ficarmos presos a um raciocino que pode não ser o correto)
No comments:
Post a Comment