31 Days of Disaster Recovery

Yüklə 211,03 Kb.

səhifə	9/15
tarix	16.08.2018
ölçüsü	211,03 Kb.
	#63138

1 ... 5 6 7 8 9 10 11 12 ... 15

Day 10: Monitoring for Corruption Errors
823 Errors
824 Errors
825 Errors
829 Errors
Creating Alerts

Summary

You can see that the checksum options I’ve described in this post are very useful for protecting the integrity of your data and for increasing the chances that you will find corruption early as well as being the quickest way to detect a corrupt backup file. I highly encourage you to step through the demo code and see how it protects you and can save you a lot of hardship down the line.

Day 10: Monitoring for Corruption Errors

It’s day 10 of my 31 Days of Disaster Recovery series, and I want to talk about monitoring for corruption errors. There are four errors related to corruption for which everyone should raise alerts and send notifications. The four alerts are 823, 824, 825, and 829. The sooner you identify and address corruption, the greater the chance that it can be resolved without data loss and with minimal downtime.

823 Errors

An 823 error is raised when a read of a page fails at the OS layer. The OS returns the failure to SQL Server who re-throws the error to the user. The attempt to read the page is retried 3 time and on the fourth failure, the error is raised. This means that when you see this error, it actually failed four times. If you see it frequently, then it has failed four times as many errors. This error is generally a sign that of problems with the storage system. It could be a disk going bad, faulty drivers, incorrectly configured disk subsystem, etc. If you see me at an event someday, ask me to share the funny story about a misconfigured SAN causing rampant file corruption on a production SQL Server (discovered before we put it into active use, fortunately).

When you receive this error, the message text will indicate the underlying cause of the error. It is important to note that not all 823 errors are an indication of corruption. For example, error 21 (Device is not ready) is often the result of the drive the file is ongoing offline or the service account losing permissions to access the drive. If the OS error returned is error 23 (Data error – cyclic redundancy check), then you definitely have a corruption of the file. In fact, OS error 23 is the only 823 error that will trigger automatic page repair in database mirroring and Availability Groups.

If you are receiving 823 errors, I highly advise having the disk system checked as well as it could be an indicator of pending hardware/disk failure.

824 Errors

Error 824 means that the read of the page from the file system was successful, but SQL Server detecting corruption. Just like error 823, the read is retried 3 times, and the error is raised only after the final failure. If you receive an 824 error, it has actually occurred 4 times. This error is a clear indicator of corruption and you should take corrective steps.

825 Errors

As I explained above, reads encountering 823 and 824 errors are retried 3 times and only raised if they fail all retries. An 825 error indicates that one of the retries was successful. Unlike 823 and 824 errors, 825 is NOT a fatal error and does NOT get bubbled up to the user. 825 errors can occur undetected by anyone. The error is logged in the SQL log and in the event log, so there are many ways you can detect these, but most people don’t think to do so. If you are getting many 825 errors, it can be a sign of impending disk failure, and the disks should be checked as soon as possible.

829 Errors

829 errors are lesser known errors, but they are potentially a sign of corruption caused by the disk subsystem. The error means that a page has been marked RestorePending, The only way to fix this problem is to restore the page (or file/database). It can’t be repaired using DBCC. This error, in addition to 823 (CRC failure) and 824, can trigger automatic page repair in database mirroring or Availability Groups. Ironically, when automatic page repair is triggered, it sets the page to a status of RestorePending to ensure that no other transactions can be run against it while the page repair is being performed. If you have this error occurring and it is not the result of automatic page repair, check the disk system.

Creating Alerts

Alerts can be easily created by using the GUI (under the SQL Server Agent node in Object Explorer) or via T-SQL. First, create an Operator to receive notifications and then create the alerts. Below is an example of how to create an Operator to receive emails and alerts for the four error codes assigned to email the Operator.

USE msdb;

-- Create operator

Exec msdb.dbo.sp_add_operator

@name=N'DBAs',

@enabled=1,

@email_address=N'';

Go
-- Create alert

Exec msdb.dbo.sp_add_alert

@name=N'Corruption (823) detected',

@message_id=823,

@enabled=1;

-- Assign operator to alert

Exec msdb.dbo.sp_add_notification

@alert_name=N'Corruption (823) detected',

@operator_name=N'DBAs',

@notification_method = 1;

Go
-- Create alert

Exec msdb.dbo.sp_add_alert

@name=N'Corruption (824) detected',

@message_id=824,

@enabled=1;

-- Assign operator to alert

Exec msdb.dbo.sp_add_notification

@alert_name=N'Corruption (824) detected',

@operator_name=N'DBAs',

@notification_method = 1;

Go
-- Create alert

Exec msdb.dbo.sp_add_alert

@name=N'Error 825 occurred',

@message_id=825,

@enabled=1;

-- Assign operator to alert

Exec msdb.dbo.sp_add_notification

@alert_name=N'Error 825 occurred',

@operator_name=N'DBAs',

@notification_method = 1;

Go
-- Create alert

Exec msdb.dbo.sp_add_alert

@name=N'Page RestorePending (829) detected',

@message_id=829,

@enabled=1;

-- Assign operator to alert

Exec msdb.dbo.sp_add_notification

@alert_name=N'Page RestorePending (829) detected',

@operator_name=N'DBAs',

@notification_method = 1;

Yüklə 211,03 Kb.

Dostları ilə paylaş:

1 ... 5 6 7 8 9 10 11 12 ... 15