Wednesday, August 27, 2014

Revoke privileges from PUBLIC may cause ORA-600

Oracle Enterprise Manager (EM) reports on many security policies when they are violated. One of them are about execute privileges on standard PL/SQL packages granted to PUBLIC. One a day this summer with nothing better to do I went on to revoke these privileges. The wages for my efforts was several  ORA-600 errors:
ORA-00600: internal error code, arguments: [qmxqtmChkXQAtomMapSQL:2], [], [], [], [], [], [], [], [], [], [], []
00600. 00000 -  "internal error code, arguments: [%s], [%s], [%s], [%s], [%s], [%s], [%s], [%s]"
*Cause:    This is the generic internal error number for Oracle program
           exceptions.     This indicates that a process has encountered an
           exceptional condition.
*Action:   Report as a bug - the first argument is the internal error number
A quick research on My Oracle Support indicated that some standard packages depends on these privileges granted to PUBLIC. The solution is to grant back to PUBLIC if you are in a hurry, or grant the privilege to the owners of the individual packages (like XDB). The errors continued after the granting was done, but stopped after I bounced the database instance.

Just because EM reports something and the solution seems obvious, it is not alway wise to apply remedy immediately, especially in production. It surprised me a little that EM suggests a security policy that Oracle's own packages do not adhere to.

Sunday, August 24, 2014

Correlation between restore time and used space in a datafile

Recently during restore of a large database (over 10 TB) we monitored the alert log looking for messages like  «Full restore complete of datafile 42 to datafile copy … Elapsed time: 0:39:32»

We saw that the time to restore a 32GB file varied a lot, from around 20 minutes to around 75 minutes. This made me wonder why. We noticed that some of the files belonging to the UNDO tablespace  took usually little time, and then I found a couple of more data files that took little time to restore. Then I thought that maybe the UNDO tablespace at the time of the backup had many unused blocks (because there weren’t many active transactions at the time). So I checked the other two files that did not take long time to restore and found out they had lots of free space in them too. 

So I decided to take the file id and the restore time from all the messages in the alert log and load it into Oracle, in the source database (this all happened when we were making a clone for a RAT session). Then I converted the elapsed time to seconds, found  file size from DBA_DATA_FILES and free space for each file from DBA_FREE_SPACE. Calculating the correlation between restore time and (file size - free space) resulted in 0.96, but for the correlation between restore time and file size, resulted in 0.38. (I used the CORR function.)

By the way the backup was stored on a file system created on a logical volume from a group whose   LUNs where stored on an EMC VMAX 20, but the data files were restored to ASM on LUNs in an EMC XtremIO. 

 As pointed out on Twitter:

it could be caused by the RMAN compression of unused blocks. Next step would be a test with creating a database with several files with different degree of free space, and doing two RMAN backups, one with and another without compression and compare restore time.

This may sounds like a lot of fuzz about nothing, but what we had in mind was the deduplication that the XtremIO does; wondering if the deduplication of empty blocks could  take less time than blocks with real data in it.

Wednesday, July 23, 2014

DBMS_LDAP and ORA-12703

In case you are using the DBMS_LDAP package from a reasonable modern environment you may run into the error:
ORA-12703 this character set conversion is not supported
The simple solution to this is to run:
or if you are using SQL*Plus or SQL Developer, just:

Monday, July 7, 2014

Connecting to a guest console in Oracle VM

I struggled to connect to the console of a newly created VM. The VM was created using Oracle VM Manager (OVMM) and despite efforts to install the required packages I could not use the console button in the Manager interface.

I noticed that the VNC server built into the VM was listening on only (on the OVM server), you can see this if you check the file vm.cfg for the guest. Without analysing how the OVMM would try to connect to the VNC server on the other host I decided to create an ssh tunnel from my Mac to the OVM server:

ssh -L 5901:localhost:5901

(replace with IP address or host name of your OVM server).

Now, I think I have found a reason why Oracle specifically recommends TightVNC as a VNC viewer; I tried two other VNC viewers on my Mac, but I could not make them work with the console on the VM, sometimes it connected, but later it would hang, another just aborted. I downloaded the java version of the VNC viewer from and connected without any problems to localhost and port 5901 (using the tunnel created above).

Friday, July 4, 2014

Importing ISOs to Oracle VM using OVM Manager

Just a simple trick to import ISO files to the repository using Oracle VM Manager (OVMM). The import function supports imports through protocols http, https and ftp. If you want to do this in the supported way you can use a simple web server to serve the ISO files for you.

Python comes with a simple http server built in. If you have python installed (which is the case on both the OVM server and the OVMM host) you can start it with:
python -m SimpleHTTPServer

To make it easy for you, go to directory where the ISO file is located, start the http server and verify that you can see the file listed in a web browser. Then do the import in Oracle VM Manager.

In this case I copied the ISO file to OVM server (dom0) and started the http server there, and the URL for the import becomes as shown in the capture below (or use the IP address of the server where the ISO file is located and remember to use port 8000):

The name of the imported ISO is the same as the file name, rename to a say Oracle Linux 6 U5 and put in the origin of the file in description.

Wednesday, July 2, 2014

Real Application Testing - some notes

This post is a mixture of experiences from a recent project and an overview on how to use RAT in a project.

Real Application Testing (RAT) is a licensed feature for the Enterprise Edition (EE) of the Oracle database. It was released on 11g, but can be used to test migration of databases from 9i and 10g to 11g (assuming that the source databases have a required version/patch). The term Real is, I reckon, used to hint at a more authentic testing process. The tests are executed based on a capture from a real production system.

RAT can be used to:

  • Evaluate new hardware, including migration from one platform to something completely different.
  • Verify execution of SQL from today’s production on a newer version of the database
  • Analyze the effect of changes in configuration
  • Scaling up, see how current system works with higher load; this works best for queries / reporting and not so well for data manipulation (DML).

A project involving RAT can be divided into these activities:

  1. Capture workload from a production system
  2. Create a clone of production system with flashback database enabled and a restore point created.
  3. Prepare for replay on test database and with agents on application servers or other client machines.
  4. Execute replay
  5. Generate reports (RAT and AWR)
  6. Flashback database
  7. If you haven’t reached a conclusion yet, repeat from 4.

In two projects I was asked to do step 2 - 6 and figured out that I should write down what I learned for posterity. (The fact that I have a friend in Belgium that reminded me about blogging more is a mere coincidence.) I will not put many details on step 1 and 2 here.

Some think it is a good idea with a long capture, but keep the replay part in mind and the time it takes to actually execute the replay. Though the flashback of the database is impressively fast, a longer replay (with much updates) will lead to more time spent on setting the database back with flashback. If you can limit the capture to say two hours that is a time saver later, but if you need to confirm say scalability for one night's ETL jobs, you may need to capture several hours. Before you start with capture, make sure you have all the backups you need to duplicate the database with point in time recovery as close to the starting point as possible. This should not be an issue since all productions system (IMHO) should run in archiving mode and being backed up. This is an opportunity to verify your backups.

For large databases it may take quite some time to transport the backup (and capture files) to new platform and have it restored to a clone. Flashback database is a wonderful feature since it takes litle time to flashback a database compared to a full restore for large databases. Flashback is turned on with command:

A restore point is created with:


The name of the restore point is your decision (within limits). You will need some space for flashback logs, depending of the amount of changes to the database during replay.

In a bypass; yes, there are as usual two ways to do this. Enterprise Manager for the young and aspiring DBAs, or you can use the PL/SQL packages. When I have the routines ready for using the API I tend to favour them, so I did most of the work with the packages DBMS_WORKLOAD_CAPTURE and DBMS_WORKLOAD_REPLAY together with the packages for creating AWR reports DBMS_WORKLOAD_REPOSITORY. Sure, I had some looks at nice graphs and reports in EM too.

Going into more details on step 3; the result of the capture is a collection of many files that needs to be transported to the new platform. On the test database create a database DIRECTORY object that points to the file directory where they are stored. First step is to process the capture with:


where REPLAY is the name for the directory. The result of this is a new directory with a name that starts with pp followed by a version number. Copy it all to the machine(s) that will host the replay agents (actually replay the workload and send it to database.)  If you are doing this on a RAC-database, copy the same stuff to other nodes with same path for the directory object.

In each installation of the Oracle database and client you have the workload replay client (wrc) included under $ORACLE_HOME/bin. It’s task is to send SQL from the captured workload to the database.  Before you start the actual replay a calibration should be run first. With ORACLE_HOME and PATH set, execute:

wrc mode=calibrate replaydir=/tmp/replay

This will report the recommended minimum number of clients. Too few means that the replay clients cannot replay the workload fast enough. The same will happen if the servers where the wrc binaries are overloaded, or the network between them and the database have any issues.

So if you have to execute 10 clients, put 10 copies of this in a script:

wrc system/welcome1@clonedb mode=replay replaydir=/tmp/replay 
sleep 1
wrc system/welcome1@clonedb mode=replay replaydir=/tmp/replay
sleep 1

wrc system/welcome1@clonedb mode=replay replaydir=/tmp/replay 
sleep 1

Don’t run it yet.

Now we’re onto step 4. You repeat the following sequence if you decide to repeat a replay, but the preparation hitherto is only done once. Initialize the replay with the following command, using a distinct name for the replay each time:

exec DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(replay_name=>’run_1’, replay_dir=>’REPLAY’)

In the middle of this we have to do some remapping... The connections to the database used during production capture are not the same as used during replay. After the previous command has completed you can query the view DBA_WORKLOAD_CONNECTION_MAP and see what was used during capture. This view is empty before the initialize_replay procedure is executed, after it has values in the column CAPTURE_CONN, but REPLAY_CONN is empty. First time you have to look at the output and see if you have many different connections to the database and figure out how you will remap them. Often, only one connection string is used (connection string as the key in TNSNAMES.ORA) and the mapping is simply replacing it with the one you are now using to reach the new clone database. In that case you can use the following PL/SQL block to remap:

  for i in (select conn_id, capture_conn 
    from dba_workload_connection_map m, dba_workload_replays r
    where replay_id = id
    and name = ‘run_1’)
    dbms_workload_replay.remap_connection(connection_id=>i.conn_id, replay_connection=>'cloneserver:1521/clonedb');
  end loop;
The name => ‘run_1’ refers to the name you gave to the replay. You can either use an entry from tnsnames.ora or connection string as shown above when setting the new connection. If you have some load balancing or application partitioning where you use different connections to the database they have to be remapped accordingly. Though you have to execute this for every replay, you only spend time on this first time and keep the PL/SQL ready for next time you do a replay.

Now after initialize, next step is prepare (!) In this step you set various parameters that how the replay is executed. In it simpliest way:

exec DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(synchronization => TRUE, think_time_scale=>100, scale_up_multiplier=>1)
synchronization set to ‘TRUE’ is equal to ‘SCN’, another more advanced (according to manual) is ‘OBJECT_ID’. This sometimes causes problems for the replay and testing several times with synchronization to ‘OFF’, ‘SCN’ or ‘OBJECT_ID’ may be necessary.

Later on when a normal replay has been executed it is always interesting to see what happens when you reduce think time or scale up the load. Note, that DML, when scaled up won’t actually update the same way, so this needs some testing from you. For reporting usage scaling up should work, however, note if you double the number of users, they are in realilty not likely to send the same SQL as today twice as many times, that is, users usually will run slightly different queries. But if this is a problem you have to find out by analysing the reports - if an a few statements are popular in an OLTP system you can assume they will remain so after you scale up.

Now you can add filter for what will be actually replayed, but I am going to leave it out. After you have exuted the prepare_replay procedure above, you can kick off the script on the clients, that is the script that includes the wrc commands. They will start and write out a message that they are waiting for replay to start.

You can start the replay with:

When this is executed the replay is running and the agents (wrc) writes out a message that replay has started.

EM has a nice page for monitoring a replay. A few times we have seen that the replay lags behind compared to capture, but catches up later on, or sometimes takes longer time than the capture. Which sometimes means an error with RAT itself and not necessarily a problem with the new platform.

In step 5, after replay is finished, you generate a RAT report and also AWR reports. In DBA_WORKLOAD_REPLAYS you’ll find the replay_id of the replay and the RAT report can be generated with:

from dual;

Generate the AWR reports as usual. If you want to compare AWR report from capture with replay, you will have to import the AWR data (AWR data is automatically exported after the capture and included in the same directory). First verify that the capture is visible in DBA_WORKLOAD_CAPTURE. If not you can import all the data about captures found in the directory with:

exec dbms_workload_capture.get_info(‘REPLAY’)
where ‘REPLAY’ is the name of the DIRECTORY-object where capture is stored. The capture(s) should now be visible in DBA_WORKLOAD_CAPTURES. The AWR-data belonging to it can be imported with

select dbms_workload_capture.import_awr(42,’LUSER’) 
from dual;

Take a note of the number returned. This is a generated DBID that you will use later to refer to the database of the capture (this means that this is a made up DBID that is different than the DBID of the database where you executed the capture). 42 above is the capture_id from DBA_WORKLOAD_CAPTURES and LUSER is the name of a schema used for staging. Just create on temporary if you don’t have one ready.

The import can take some time. When it is finished you can see in DBA_HIST_SNAPSHOT what snapshots are available, check for the DBID that was returned in last command. You can now create comparsion reports with:

select * 
from table(dbms_workoad_repository.awr_diff_report_html(123456789,1,120,130,734501234,1,10,20));

The first three parameters are DBID, start snap id, end snap id; all from the first database (source/capture) and the last three the same respectively for the second database where replay took place. Creating these reports work pretty well in SQL Developer, use Ctrl-A and Ctrl-C to copy the html code and paste it into an editor and save it. You can actually create more reports than I have shown here, see the documentation for the mentioned packages to get an idea. AWR data is automatically exported to the same directory after a replay. This means you can later be import them into another incarnation of the database to compare different replays.

Step 6, flashback database is pretty easy. Shutdown the database, start it in mount, execute:

flashback database to restore point b4_replay;

Then open it with:

alter database open resetlogs;

If you are on RAC you have to shutdown the other nodes and start them after the previous command is completed.

There used to be some bugs that caused the database to crash spectacularly with an ORA-600 during flashback leaving the database useless and forcing you to do a new restore/duplicate from backup. Since I’m stressed during a project like this, and therefore I don’t have time to waste, I tend to avoid structural changes (like moving tablespaces around after creation of the restore point) that can challenge the flashback. I am not sure if the bugs are still there, though.

Step 7, are you done yet? I usually have to run the replay over and over several times trying out with different parameters for procedure dbms_workload_replay.prepare_replay(...). This takes a lot of time, don’t underestimate the project. Also remember that people will question your results.

In conclusion, try to figure out before you start what you want to prove. Are you comparing two systems, you need to agree on a common set of parameters; or are you trying to prove that a system is good enough (without comparing it to the old?) Try to lower you ambitions or scope, the time it takes with restore, replay, generating reports, looking over them, and possibly analyze further with extended trace, starting over,... it may feel like a walk in the desert sometimes.

Sunday, June 22, 2014

Some thoughts from last PoC

For the second time I have been involved with a proof of concept (PoC) where testing of an alternative hardware platform was the goal. Common for both PoCs were:

  1. The other option was Exadata
  2. Real Application Testing (RAT) was used to replay production recordings on the new hardware.

And for the second time I have concluded that sales people take this too easy. It is not easy and assuming you will outperform previous platform just because you have a bunch of less expensive flash disks, cache cards, and more is arrogant at best.

A decent PoC requires lots of preparation and you need to have done some planning for capacity and load. If you have no idea of throughput, expected load, the nature of the applications and more, you are likely to fail.

Exadata is not the optimal solution for everybody, that is not why Oracle charges lots of money for it; it is not like they are thinking “Everybody will want this, so we can double the costs”. Not that I know from the inside, but some in Oracle probably thought they needed to create a solution for processing large amount of data fast. In order to do that a congestion of the CPU has to be avoided by filtering out data that is not needed before it reaches the database. As an added benefit of not shipping useless data the system will perform less I/O. If you don’t need that, Exadata may not be for you. On the other hand, if a customer needs this for a BI solution and you are selling what you like to call “standard commodity solution”, be prepared to have something that scales and can replace the Exadata features (if Exadata is what you are competing against.)

If you sell a solution to the 10% of the costs, it will not succeed simply because your bidding price is low; without performance it will fail. Without careful planning and analysis you will fail. Cheap food is nice, but if it makes people sick, they are not happy to spend the money on hospital bills. And they will go somewhere else.

There is no good substitute for proper understanding. DBAs have heard for many years that we will become redundant because modern systems are self-managing or something like that. Well, I gladly take up brewing if that becomes a reality, but what has changed is the amount of decibels from the CIO screaming when new and more expensive hardware does not perform after all.

The fun with Exadata does not all come from the fact that it is expensive and state of the art. Rather we have had these experiences where you combine your understanding of the platform with old wisdom like “filter early” and end up with incredible performance improvements. Like one day I asked a domain expert about the data we were querying, and after adding one redundant predicate to the SQL statement, the response time went from 90 seconds to less than 1 second. The whole improvement was due to smart scan taking place.

Back to the PoC; we spent weeks waiting for proper integration between servers (running Linux) and the storage. Wrong interface was used for the interconnect (this was RAC), RAID 5 was used because after all, they hadn’t provided enough disks. The devices that would provide the caching features was not used, but the low-level devices was discovered by ASM instead due to wrong configuration (parameter ASM_DISKSTRING was unchanged leading to the sd* devices being discovered). After a long time the RAT replay showed that for a small number of users the performance was acceptable, but as load increased the throughput stalled and we concluded that the new solution was not dimensioned properly. The vendor suggested a new and even better storage solution with even more flash…

Probably without seeing it themselves, the vendors in both cases demonstrated the benefit with an engineered solution. When the customer sees that you need weeks to have the storage play with your server, they start to think that maybe Oracle has a point with their preconfigured and tested solution, ready to plug in. (Yes, we know that reality is not that simple, search Twitter for #patchmadness for an illustration.)

There are a few takeaways from the RAT testing itself, which is material for the next post. What I can say now is that restore of database, performing replays and flashback and start over again takes a long time. Keep it simple and try to keep the capture as short as possible without losing value for what you are testing.
The goal with a PoC is to qualify or disqualify a new solution. If a new suggested solution is disqualified, the PoC should be deemed a success; this was actually stated by the project manager and not by me. PoCs should not be taken too easy. Think about all the problems you can avoid by stopping the wrong solution to enter your data center.