I was reading Pythian's blog on 'what makes a good dba' this week.
I could not agree more about the next part :
"There is nothing more important than backups.
End of story. Performance tuning and building an awesome server are cool, but this is the bottom line: our data is sacred. Before anything else, you have to make regular backups and you have to make certain that those backups work properly. If you want to know why, read this sad story: CouchSurfing Deletes Itself, Shuts Down."
It's not only the importance of backups, it's the ability to recover from them - I can tell at least a handful of stories were a backup was available (or thought was available) but one was not able to recover.
There's nothing better than stories like these to point out were a dba's priorities should be.
It seems in the end the users didn't mind re-entering their data ... don't except the same from your enterprise end-users
As a DBA, I sure do a lot of non-DBA stuff these days - the next story is a perfect example.
For the last 2,5years I've been spending about 60% of my time at a customer's site. Because I have a java background and know a thing or two about Oracle application server and networking, they give me projects you normally don't give to a DBA.
For example :
- setting up linux based clustered load balancers (instead of using blue coat or big-ip f5)
- setting up linux/apache based clustered reverse proxies/ssl terminators
- build a SAML v2 extension to Oracle SSO server
- Oracle SSO server integration with Belgian e-ID passports
(mental note - blog about above topics)
People might say those are not a dba's task - and I guess that's true but I need a challenge now and then to keep the job interesting ...
Anyway - I was in a meeting with this customer a couple of weeks ago and I got a new challenge.
They asked me if I would be able to consume webservices from within PLSQL.
So I said sure - no problem - piece of cake - bring it on.
They prefer PLSQL - as they have a lot of PLSQL experienced developers in-house, some VB/.net developers and no java
developers. So it would either be PL/SQL or .net.
Always verify your source - order your '404 page not found' menu here.
Let's hope they don't have a bouncer called '301 permanent redirect'.
I sometimes have to smile when I hear someone talking about best practises on backup and recovery. Don't get me wrong - they all make very valid points. The fact is that the real world is sometimes very different.
In the real world there is only one valid point - money.
In the real world, the IT departement is there to support the business and not the other way around. This sometimes means that IT budget is tight, resulting in all kinds of strange and funny things but ultimately may also - indirectly - lead to data loss.
RMAN is nice - but what if you're still running on an unsupported Oracle 7.3 on an old platform and a so very very tired hardware, all because there is no budget. Or what if you just blew all the IT budget on nice and shiny hardware and forgot to calculate that RMAN MML part for your backup solution. Or what if you do not have the luxury of a big tape library and still have to switch tapes by hand on each and every server - who is doing the tape management anyway ... it's a silly job ... but you better keep that person motivated and happy !
Most of the recovery cases I will cover here are the result of human error ... and those are sometimes the result of budget constraints.
The same goes for this case - the hammer case.
This was a recovery case where I actually went on site.
It was ooooold hardware – a mid ‘90’s RS6000 with a DAS storage cabinet using SCSI disks.
Remember the RS6000 F30, F50 and 220's - that will give you an idea. Oh while we are on memory lane - remember the time where you had to carry around 10 different scsi cables and terminators at all time.
I always had a bunch of scsi cables laying around in the trunk of my car - together with meters of ethernet coax cable, terminators, a couple RANs, db/de9/25 connectors, gender changers, null modem and parallel cables and of course the good old soldering iron and multimeter.
But I never ever carried a rubber hammer around !
So when entering the server room I immediately noticed this rubber hammer hanging on a small cord next to the storage cabinet – (naturally)I asked what it is was used for ?
Apparently, when the machine was rebooted or shutdown – the disks would cool down and their head(s) would stick to the platters.
So when they started the machine again, they would listen to the disks and once they heard the struggling sound of the heads trying to free themselves, they would hit the disks with the hammer.
That seemed to always get the job done
What happened ? One of the disks had finally failed ........ Duh.
The disk had crashed while migrating AIX to a higher level - you know ... exportvg, migrate/reinstall os, importvg.
The mapping of the filesystems was quite simple - one JFS would map to one LV on one PV or disk. All disks/PVs belonged to one VG. No RAID. They had about 7 data PVs/scsi disks, and all database files where spread around those PV's.
So a disk had crashed, making it impossible to perform an importvg.
No problem - we have a backup - let's just make a new VG of the remaining 6 disks - and rebuild the database from scratch.
Their backup solution was a user per user export. Not a best practise.
And to top it off - the backup script was hardcoded. Not a best practise.
You know what's next right ? - a new user was added a couple of months before and the backup script was not altered accordingly.
Bad luck - but wait. The new user was sitting on that crashed disk in a seperate tablespace !
Now, we were able to replace the PCB on the SCSI disk and it would spin up ! (how lucky can you be)
So the actual data on the disk was still intact, but because the disk was once part of a VG, made by a logical volume manager (LVM), you couldn't just pop in the disk in a machine and see the data.
So it’s not like a FAT or NTFS filesystem on windows where you hook up a disk to the server and voila, there are your files - we had to get an expert on site who had to extract the datafiles from the disk, wading through all the physical partitions. This guy knew his stuff but I can still see him sitting there breaking a sweat.
Once we got the datafiles, they were orphaned from their database – so we used DUDE's heuristic scanner to get the data out of the datafiles into flatfiles. One of the senior developers then quickly identified which flatfile was which table and using sqlldr we were able to load the data back in. There were only a 120 tables or so.
Later on I wrote a routine that would allow us to identify deleted records within the base dictionary tables – this makes it possible to map objectid's to table/column/usernames even if the objects were dropped – very much like the FEDS tool.
Unloading without a SYSTEM tablespace is always time-consuming as one needs to identify the flat files. Without SYSTEM, there's no dictionary and without a dictionary there are no table/column names and datatypes.
In conclusion – if you see a hammer attached to a storage device ... Walk away.
Kurt Van Meerbeeck
Cell : +32 495 580714
Service & Support Manager
Cell: +45 53 74 71 27
South Africa :
Cell : +27 82 7799275