Java EE Support Patterns: Oracle Database

7.24.2011

ORA-01115 IO error reading solution

This case study describes the complete root cause analysis and resolution of an Oracle 10gR2 database problem (ORA-01115) affecting a Weblogic Integration 9.2 Java EE application.

Environment specifications

-          Java EE server: Oracle Weblogic Integration 9.2 MP2
-          Database server: Oracle 10gR2
-          Middleware OS: Sun Solaris 5.10
-          Database server OS: IBM AIX 5.3 TL5
-          JDK: Sun Java HotSpot(TM) Server VM Version 1.5.0_10-b03
-          Platform type: BPM

Problem overview

-      Problem type: java.io.IOException: [BEA][Oracle JDBC Driver][Oracle]ORA-01115: IO error reading block from file 30 (block # 255462)

An IO Exception error was observed in the Weblogic Integration server logs. This Exception was thrown by the remote Oracle database when trying to insert data into WLI internal tables.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·       What is the client impact? HIGH (full outage of our application)
·       Recent change of the affected platform? No
·       Any recent traffic increase to the affected platform? No
·       How long has this problem been observed? This error suddenly appeared for the first time early in the morning
·       What is the health of the Oracle database server? Our Oracle DBA did confirm a problem regarding access to one of our Oracle data files (/oracle/db/oracle_data01.dbf)
·       Did a restart of the Weblogic Integration server resolve the problem? No

·         Conclusion #1: The problem seems to be isolated to the Oracle database server side and related to a possible IO problem accessing one of the Oracle internal data files.

Weblogic log file analysis: ORA-01115!

A first analysis of the problem was done by reviewing the Weblogic Integration managed server log errors.

java.io.IOException: [BEA][Oracle JDBC Driver][Oracle]ORA-01115: IO error reading block from file 30 (block # 255462)
ORA-01110: data file 30: '/oracle/db/oracle_data01.dbf'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 110: Media surface error
Additional information: 7
Additional information: 255461
Additional information: -1

        at weblogic.jdbc.base.BaseBlobOutputStream.write(Unknown Source)
        at weblogic.jdbc.base.BaseBlobOutputStream.write(Unknown Source)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean$OracleTableAccess.doStoreByInsert(BMPContainerBean.java:860)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean.doInsert(BMPContainerBean.java:1835)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean.ejbStore(BMPContainerBean.java:1787)
        at com.bea.wli.bpm.runtime.ProcessContainerBean.ejbStore(ProcessContainerBean.java:204)
        at com.bea.wli.bpm.runtime.ProcessContainerBean.ejbStore(ProcessContainerBean.java:191)
        at com.bea.wlwgen.PersistentContainer_sq78z1_Impl.ejbStore(PersistentContainer_sq78z1_Impl.java:191)
        at weblogic.ejb.container.manager.ExclusiveEntityManager.beforeCompletion(ExclusiveEntityManager.java:609)
        at weblogic.ejb.container.manager.ExclusiveEntityManager.beforeCompletion(ExclusiveEntityManager.java:580)
        at weblogic.ejb.container.internal.TxManager$TxListener.beforeCompletion(TxManager.java:1028)
        at weblogic.transaction.internal.ServerSCInfo$CallBeforeCompletionsAction.run(ServerSCInfo.java:1529)
        at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
        at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:147)
        at weblogic.transaction.internal.ServerSCInfo.doBeforeCompletion(ServerSCInfo.java:1161)
        at weblogic.transaction.internal.ServerSCInfo.callBeforeCompletions(ServerSCInfo.java:1143)
        at weblogic.transaction.internal.ServerSCInfo.startPrePrepareAndChain(ServerSCInfo.java:115)
        at weblogic.transaction.internal.ServerTransactionImpl.localPrePrepareAndChain(ServerTransactionImpl.java:1288)
        at weblogic.transaction.internal.ServerTransactionImpl.globalPrePrepare(ServerTransactionImpl.java:2100)
        at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:259)
        at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:227)
        at weblogic.ejb.container.internal.MDListener.execute(MDListener.java:463)
        at weblogic.ejb.container.internal.MDListener.transactionalOnMessage(MDListener.java:335)
        at weblogic.ejb.container.internal.MDListener.onMessage(MDListener.java:291)
        at weblogic.jms.client.JMSSession.onMessage(JMSSession.java:4072)
        at weblogic.jms.client.JMSSession.execute(JMSSession.java:3962)
        at weblogic.jms.client.JMSSession$UseForRunnable.run(JMSSession.java:4490)
        at weblogic.work.ServerWorkManagerImpl$WorkAdapterImpl.run(ServerWorkManagerImpl.java:518)
        at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
        at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)

As you can see, the ORA-01115 error was thrown (along with ORA-01110, ORA-27091 and ORA-27072) during the insert operation of WLI data. An AIX-related error was also reported (Media surface error).

-          Conclusion #2: The problem seems to be related to physical inaccessibility (OS and/or disk level) of this critical Oracle data file. A logical file corruption problem was also kept in mind as a possible root cause.

OS problem replication: The media surface is damaged!

After the initial log review and problem isolation, an AIX system admin was engaged to further troubleshoot this problem.  A simple copy of the affected Oracle data file was attempted:

:> cp /oracle/db/oracle_data01.dbf .
cp: /oracle/db/oracle_data01.dbf: The media surface is damaged.


This simple problem replication step confirmed that the AIX OS itself could no longer access the file.
The following AIX command was also executed in order to review the AIX OS error report:

errpt -a
…………………………………………………………………………………..
Description
DISK OPERATION ERROR

Probable Causes
MEDIA

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES
…………………………………………………………………………………..

-          Conclusion #3: At this point, it was determined that the problem was very likely related to a physical problem with the SAN / RAID cluster. A technician was dispatched to perform further analysis of our RAID (Redundant Array of Independent Disks) environment.

Root cause: RAID disk failure!

A health check of the RAID environment confirmed that one of the 7 available disks had failed earlier in the morning, just a few minutes before the first occurrence of the ORA-01115 error in the Weblogic logs.

Unfortunately, the disk failure triggered corruption of the RAID parity, affecting one of our Oracle data files. New files could still be written to the RAID (6 disks left) but some existing files were not available due to the parity corruption.

A disk failure has a negative impact on parity. A disk can fail for various reasons; for example, it may be overused or subjected to excess variation in the power supply. Failure to replace a failed disk immediately can adversely affect system performance and may even damage the remaining disks relied upon for redundancy. RAID parity provides protection against drive failures, but if a parity error is encountered, bad data can end up being written to the drive.
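
To illustrate why corrupted parity is so harmful, here is a minimal sketch (purely illustrative, not taken from this environment) of how a single-parity RAID scheme such as RAID 5 uses byte-wise XOR: any one lost data block can be rebuilt from the surviving blocks plus the parity block, but if the parity itself is corrupted, the same rebuild silently yields bad data.

import java.util.Arrays;

// Illustrative sketch only: single-parity RAID keeps an XOR parity block so that
// any ONE lost data block can be rebuilt from the surviving blocks plus parity.
public class RaidParityDemo {

    // Parity (or a rebuilt block) is the byte-wise XOR of the supplied blocks.
    static byte[] xorAll(byte[]... blocks) {
        byte[] out = new byte[blocks[0].length];
        for (byte[] block : blocks) {
            for (int i = 0; i < out.length; i++) {
                out[i] ^= block[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] d1 = {1, 2, 3, 4};
        byte[] d2 = {5, 6, 7, 8};
        byte[] d3 = {9, 10, 11, 12};

        byte[] parity = xorAll(d1, d2, d3);               // written to the parity stripe

        // The disk holding d2 fails: rebuild it from the survivors + parity.
        System.out.println(Arrays.equals(d2, xorAll(d1, d3, parity)));   // true

        // If the parity itself is corrupted, the very same rebuild produces bad data.
        parity[0] ^= 0x7F;
        System.out.println(Arrays.equals(d2, xorAll(d1, d3, parity)));   // false
    }
}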

In our case, the technician was able to resolve the parity corruption without having to restore any data from the backup tapes. Resolution of the parity automatically allowed Oracle to regain access to the affected data file, without any Oracle listener or database restart.

A complete Weblogic restart was performed as the final resolution step in order to re-initialize the affected WLI JDBC Connection pools.

Conclusion and recommendations

-          When facing ORA-01115 with Oracle, please ensure you perform a fast and complete root cause analysis
-          Avoid taking unnecessary and inefficient resolution steps such as an early Weblogic restart or Oracle database restart

3.28.2011

Network Adapter could not establish the connection problem using Weblogic 8.1

This case study describes the complete steps from root cause analysis to resolution of an intermittent Weblogic Connection Pool connectivity problem experienced between an Oracle Weblogic 8.1 application server and an Oracle 10g database.

It also demonstrates the importance, for an application support person, of mastering basic network troubleshooting skills and techniques in order to perform proper problem isolation and root cause analysis for this type of problem.

Environment specifications

·         Java EE server: Oracle Weblogic Platform 8.1 SP6
·         OS: AIX 5.3 TL9 64-bit
·          JDK: IBM JRE 1.4.2 SR13 32-bit
·         RDBMS: Oracle 10gR2
·         Platform type: Ordering application

Monitoring and troubleshooting tools

·         AIX 5.3 PING command
·         AIX 5.3 TRACEROUTE command
·         Weblogic dbping utility

Problem overview

-          Problem type: The DBMS driver exception was: Io exception: The Network Adapter could not establish the connection

An intermittent connectivity problem was observed in our production environment between our application server and database server. The above Weblogic error was spotted in our logs during problem reproduction.

Initial problem mitigation involved restarting the affected Weblogic managed server(s) almost on a daily basis until a successful connection was established with the remote database server.

Gathering and validation of facts

A Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·         What is the client impact? Low, problem was intermittent and our platform has proper load balancing and fail-over in place
·         Recent change of the affected platform? No
·         Any recent traffic increase to the affected platform? No
·         Any recent activity or restart of the application server or database server? Yes, the application server is restarted on a daily basis. The remote database server was last physically restarted a few weeks ago following a network incident in the server farm
·         How long has this problem been observed? For a few weeks
·         Is the JDBC Connection Pool connectivity problem consistent or intermittent? The problem is intermittent
·         Did a restart of the Weblogic server resolve the problem? No, currently only used as a mitigation strategy
·         Did the DBA team find any problem with the Oracle 10g database? No problem found with the database itself
·         Did the support team analyze the Weblogic logs and errors? Yes, a Weblogic JDBC error was found; as per the Weblogic / JDBC driver documentation, it indicates that the JDBC driver is unable to physically connect to the remote Oracle database

·         Conclusion #1: The problem and error type appear to point to a network / connectivity problem between Weblogic application server and remote database server
·         Conclusion #2: The recent network problem and physical restart of the Oracle database server are potential trigger suspects

Weblogic error log review

The error below was found during problem reproduction. This error prevented the initialization and deployment of our primary application JDBC Data Source and application.
<Warning> <JDBC> <BEA-001129> <Received exception while creating connection for pool "<App Conn Pool>": Io exception: The Network Adapter could not establish the connection

<Error> <JDBC> <BEA-001150> <Connection Pool "<App Conn Pool>" deployment failed with the following error: 0:Could not create pool connection. The DBMS driver exception was: Io exception: The Network Adapter could not establish the connection.>

<Error> <JDBC> <BEA-001151> <Data Source "<App DS>" deployment failed with the following error: DataSource(App DS) can't be created with non-existent Pool (connection or multi) (App Conn Pool).>

Network health check using PING, TRACEROUTE and other utilities

Given the intermittent behaviour of this problem, the support team decided to perform some additional analysis of the network situation between our application and database server. The AIX PING command was used for that purpose as per below.

# Send 5 packets of 64 bytes to the remote database IP address
ping -c 5 -q -s 64 <IP address>

(Screenshot: ping output showing network packet loss)
# Validate the connectivity and route through the different HOP(s) from the source server (Weblogic application server) to the destination server (Oracle 10g database server)
traceroute <IP address>

(Screenshot: traceroute output showing the connectivity problem)

As per the above results, ~50% packet loss was observed between our application and database servers. The intermittent connectivity problem was also reproduced using the traceroute command.

Please note that Weblogic also has a database "ping" utility that you can use to test network connectivity and database listener availability from the WebLogic physical server to the remote DB server. This utility basically simulates the creation of a new JDBC Connection via java.sql.DriverManager.
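
If you prefer to reproduce this check in plain Java, the sketch below does essentially the same thing as dbping: it asks java.sql.DriverManager for a raw connection using the Oracle thin URL format. The host, port, SID and credentials shown are placeholders, and the Oracle JDBC driver JAR must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ManualDbPing {
    public static void main(String[] args) {
        // Placeholder connection details: adjust host, port, SID, user and password.
        String url = "jdbc:oracle:thin:@dbhost:1521:ORCL";
        Connection con = null;
        try {
            // On pre-JDBC 4 drivers you may first need: Class.forName("oracle.jdbc.OracleDriver");
            con = DriverManager.getConnection(url, "appuser", "apppassword");
            System.out.println("Connection OK: " + con.getMetaData().getDatabaseProductVersion());
        } catch (SQLException e) {
            // A listener or network problem typically surfaces here as
            // "The Network Adapter could not establish the connection".
            System.out.println("Connection FAILED: " + e.getMessage());
        } finally {
            if (con != null) {
                try { con.close(); } catch (SQLException ignore) { }
            }
        }
    }
}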


DB Ping Usage

../<JAVA_HOME>/bin/java -classpath ../<WL_HOME>/<WL_SERVER_HOME>/server/lib/weblogic.jar utils.dbping ORACLE_THIN <dbUserName> <dbPassword> <dbURL>


DB Ping - Other RDBMS Provider Usage

java utils.dbping DB2B         [-d dynamicSections] USER PASS HOST:PORT/DBNAME
java utils.dbping JCONN2       USER PASS HOST:PORT/DBNAME
java utils.dbping JCONN3       USER PASS HOST:PORT/DBNAME
java utils.dbping JCONNECT     USER PASS HOST:PORT/DBNAME
java utils.dbping INFORMIXB    USER PASS HOST:PORT/DBNAME/INFORMIXSERVER
java utils.dbping MSSQLSERVERB USER PASS HOST:PORT/[DBNAME]
java utils.dbping MYSQL        USER PASS [HOST][:PORT]/[DBNAME]
java utils.dbping ORACLEB      USER PASS HOST:PORT/DBNAME
java utils.dbping ORACLE_THIN  USER PASS HOST:PORT:DBNAME
java utils.dbping POINTBASE    USER PASS HOST[:PORT]/DBNAME
java utils.dbping SYBASEB      USER PASS HOST:PORT/DBNAME



Finally, you can also use the TELNET command for that same purpose:

Telnet Usage
telnet <DB hostname> <DB listener port>
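
When telnet is not available, or you want to script this check from the application side, a plain TCP socket test gives the same information. This is only a sketch; the hostname and listener port below are placeholders.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ListenerPortCheck {
    public static void main(String[] args) throws IOException {
        // Placeholder DB hostname and Oracle listener port.
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress("dbhost", 1521), 5000);  // 5 second timeout
            System.out.println("Listener port is reachable");
        } catch (IOException e) {
            System.out.println("Listener port is NOT reachable: " + e.getMessage());
        } finally {
            socket.close();
        }
    }
}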

Network sniffer analysis

Following these findings, our application support team engaged a network sniffer team to troubleshoot the problem further. Analysis was done by sniffing the inbound and outbound traffic packets generated by the ping and traceroute commands at the network switch level between the source and destination servers.

The sniffer team found that the lost packets were actually not coming out of the remote database server, which further isolated the problem to the remote database server level.

Suspected root cause

Based on the combination of the gathered facts, the application and network support teams concluded that a routing problem affecting the Oracle 10g database server was causing intermittent but recurring packet loss with our application server.

Given the recent network problem in the server farm and the physical reboot of the server, it was suspected that the root cause of the problem was an invalid ARP table on the network switch and/or on our Oracle database server.

ARP stands for Address Resolution Protocol. An IP packet delivered to the next hop is encapsulated in an Ethernet frame, and such a frame must contain a destination address, which is determined by inspecting the ARP cache table. If the table does not have an entry, the switch will issue an ARP request and wait for a response from the next hop. Any problem with this cached table could lead to routing and connectivity problems, requiring a reset.

Resolution and results

As per the root cause analysis, the physical server support team proceeded with a physical reboot of the affected database server, which reset/cleared the ARP tables at both the switch and server level.

The results were quite conclusive post restart: the packet loss dropped from ~50% to 0%. The traceroute command also indicated fast connectivity with no delay.

Conclusion and recommendations

·         When facing “The Network Adapter could not establish the connection” problem with Weblogic, do not assume that the database server is down; gather all the data and facts instead and proceed with simple network problem isolation using ping, traceroute and telnet (port health check)
·         Perform your due diligence and problem isolation before engaging a network sniffer team; this will help you speed up the root cause analysis process
·         Please make sure to keep track of key events/deployments inside your environment, including network problem episodes, as those types of events are often the trigger of Java EE related problems