UGTS Document #46 - Last Modified: 8/29/2015 3:23 PM
Microsoft SQL Server Database Structure
Credits - 2012-10-24 - Kuseni, for a correction to the number of single pages allocated before uniform extents get allocated, and a correction to the documented structure of IAM pages.
Microsoft SQL Server Database File Structure
A database file at its simplest level is an array of 8KB pages - bytes 0x0-0x1FFF in the file belong to page ID = 0. Bytes 0x2000-0x3FFF belong to page ID = 1, etc... Page ID values start at 0 and count upward. Page IDs are 32 bit signed values, meaning that the largest possible database file in SQL server is 2^31 pages * 2^13 bytes per page = 2^44 bytes = 16TB. This is a limitation that has been present from SQL Server 2000 up to 2008 R2.
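The page-to-byte-offset arithmetic above can be sketched in a few lines (Python is used here purely to illustrate the byte math, not anything SQL Server itself runs):

```python
# Byte range occupied by a given page ID within a database data file,
# per the 8KB-page layout described above.
PAGE_SIZE = 8192  # 8KB = 2^13 bytes

def page_byte_range(page_id):
    """Return (first, last) byte offsets of a page within its file."""
    start = page_id * PAGE_SIZE
    return (start, start + PAGE_SIZE - 1)

print(page_byte_range(0))   # page 0 occupies bytes 0x0-0x1FFF
print(page_byte_range(1))   # page 1 occupies bytes 0x2000-0x3FFF

# Largest possible file: 2^31 pages * 2^13 bytes per page = 2^44 bytes
print((2**31 * PAGE_SIZE) // 2**40, "TB")
```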
Pages are the fundamental unit of allocation for data stored to disk. Every object and structure in the database has a set of pages allocated for it, and two objects cannot share the same page. Note that database log files (LDF) do not have pages - they have a much simpler structure: they are just a sequence of log records. Also note that database data and log files do not have to have MDF, LDF, and NDF file extensions for SQL server to be able to use them, but that is the convention: MDF for the primary data file, NDF for secondary data files, and LDF for log files.
Eight consecutive pages form an extent (64KB). An extent is the usual size for a disk I/O read or write operation, since larger reads and writes allow disk access to perform more optimally. If all 8 pages in an extent belong to the same object, the extent is said to be uniform. If they belong to different objects, then the extent is mixed.
A uniform extent is preferred to a mixed extent for performance, because the fewer the extents, the less disk I/O is required to read or write the object, and less metadata is needed to track the allocation of the object. However, uniform extents are not used when the total size of the object is less than a single extent, because that would waste far too much space if the database had a large number of very small objects. The first 8 pages for an object will be allocated individually from mixed extents, but after that, only uniform extents will be allocated for the object.
As noted, a database can have multiple files beyond just a single MDF file - the additional data files have an NDF file extension. The File ID is used to indicate which file in a list of database files is referenced, and it starts at 1 and counts up to 2^15-1 = 32767. This means that a database can have up to 32767 files, for a total limit of 512PB per database. FileID = 0 is an invalid value which typically means 'not used'.
The combination of a PageID and a FileID is a 6 byte value called a page locator that uniquely identifies a page in a database. A page locator is written as FileID:PageID, such as 1:142 to indicate FileID = 1, PageID = 142. 0:0 typically means 'no page', or 'not used'. For the purposes of this article, a page locator of *:7 would mean page 7 in each database file.
An RID value (Row ID, also called a row locator, record locator, or Record ID) is a page locator plus a 2-byte slot index, for a total of 8 bytes. They are used to uniquely identify a table or index row within a page. The slot index starts counting at 0 up to the number of data rows on a page, and will be described in more detail. An RID is typically written in the form: FileID:PageID:Slot, such as 1:142:0 to indicate FileID = 1, PageID = 142, slot = 0.
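The composition of an RID can be illustrated with a short Python sketch. The field order assumed here (4-byte PageID, then 2-byte FileID, then 2-byte slot index, little-endian) matches how row locators commonly appear inside index records, but treat the exact byte order as an assumption for illustration:

```python
import struct

# Pack/unpack an 8-byte RID (row locator). Assumed layout: 4-byte
# PageID, 2-byte FileID, 2-byte slot index, all little-endian.

def pack_rid(file_id, page_id, slot):
    return struct.pack('<IHH', page_id, file_id, slot)

def unpack_rid(raw):
    page_id, file_id, slot = struct.unpack('<IHH', raw)
    return f'{file_id}:{page_id}:{slot}'   # conventional FileID:PageID:Slot form

rid = pack_rid(1, 142, 0)
print(len(rid))          # 8 bytes total: 6-byte page locator + 2-byte slot
print(unpack_rid(rid))   # 1:142:0
```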
SQL server is a transactional database system with logging and rollback capability, designed to be maximally responsive at all times to the maximum number of users. To facilitate this, data is generally only moved around within a database and updated to the minimum extent necessary to get the job done quickly and to make it easy to undo the change, should that be necessary. This means that database files are never automatically shrunk or compacted - unused pages will sit around in a file until they are reallocated. If a data row is deleted from within a page, the space is simply deallocated, but the data where the record used to be is not cleared. This means that page locators and RID values do not have to be updated most of the time, and undoing an operation is usually as simple as flipping a few bits back to their previous state. It means that database structures don't move around very much in a database, and if they get corrupted, you can sometimes restore a backup file to compare page contents before and after the corruption to see what went wrong. It also means that databases can become fairly large and fragmented over time, and that scheduled maintenance is sometimes necessary to compact and defragment the database when it gets out of hand.
SQL Server provides a few undocumented DBCC commands - DBCC IND and DBCC PAGE - that allow you to view the internal structure of a database page or object. Alternatively, you can always take the database offline, detach the database files, and view them in a hex editor if you want to edit them directly, but that is inconvenient, requires downtime, and lacks the parsing capabilities built into the DBCC commands.
To use these DBCC commands, you must first turn on trace flag 3604 (which redirects error log messages to the standard output) as follows:
DBCC TRACEON (3604)
Then you can dump the contents of a page with the DBCC PAGE command:
DBCC PAGE ('[dbname]', [FileNum], [PageNum], [printOpt])
The printOpt parameter is a verbosity level - it can be from 0-3, with 3 being the most verbose. However, if you try to use printOpt = 3 on a free (unused) page, DBCC PAGE will fail. The command will also obviously fail if you use a FileNum or PageNum outside the range of valid values allocated for the database files. To find out which PageID values belong to a database object, use the DBCC IND command:
DBCC IND ('[dbname]', '[objectname]', [indexid])
For example, if you have a table 'Beta' in database 'Alpha', to view the contents of the clustered index (IndexID = 1), you would use the following:
DBCC IND ('Alpha','Beta',1)
This will return a table of all the pages that belong to the selected index of the table Beta in the database Alpha, and you can then call DBCC PAGE on the PageID values to view them.
When you run the DBCC IND command, you will see a PageType column. There are 15 different known PageType values in SQL server, most of which can be broadly classified into three types - data, status, and options:
Every page in the database that has been initialized will have a 96 byte header, and the remaining part of the page (8096 bytes) is for data content specific to the type of page. The 96 byte header has the following format (as determined by Mark Rasmussen and Paul Randal):
The advantages of having the common format for the page header for all page types are:
As shown in the header's AllocationID field, a page can only belong to a single allocation unit. An allocation unit is an (ObjectID, allocation type, partition) tuple, where allocation type is one of IN_ROW_DATA, LOB_DATA, or ROW_OVERFLOW_DATA. All data pages will have an IN_ROW_DATA allocation unit, and the other two are optional. LOB_DATA holds chunks of values greater than 8KB in size. ROW_OVERFLOW_DATA holds values too large to be stored in-row and have to be stored in a separate page. Partitions are an enterprise level SQL Server feature allowing very large objects to be artificially split across boundaries for improved performance. They are discussed further on.
SQL Server 2008 R2 defines 25 different types of objects, including: table, stored procedure, view, trigger, synonym, function, etc.... Each database has system tables that contain the metadata about the objects in the database. In SQL Server 2000, you could query and view (and modify) the contents of system tables directly, but in SQL Server 2005, you aren't allowed to do this anymore, and you must view the information through the views that Microsoft supplies. Both the sys.objects and dbo.sysobjects views will show you all the objects defined in a database:
SELECT * FROM sys.objects
Every database object has a unique Object ID. Object IDs are unique 32 bit numbers assigned semi-randomly, except that system objects have ID values <= 100 in SQL Server 2008 R2 and earlier, with a few exceptions.
LOB means a Large OBject, which is anything over 8KB in size, such as varchar(max), text, varbinary(max). SLOB means Small LOB, which is a value not longer than 8000 bytes, but which must be stored on a different page via row overflow. CLOB means Character LOB (such as text and varchar), whereas BLOB means Binary LOB (such as image and varbinary).
Note that some pages in a database may not be initialized. In this case, the page will have garbage data both in the header and data portion of the page. Other pages may explicitly have a page type of 0. In this case, only the first two bytes of the page can be relied upon, and the rest of the data on the page may be zeroes or garbage.
The most commonly used page types in a database are Data, Index, Text Map, Text Tree, and Index Allocation Map (IAM). Let's start by looking at data and index pages.
Data and Index Pages
Table data is usually stored as a B-Tree ordered by the table's clustered index. If the table has no clustered index, then the data is stored as a simple unordered list of data pages (a heap). The leaf pages of a clustered index, and the pages of a heap, hold the rows of the table. The upper levels of a clustered index, and all levels of a non-clustered index, store only the columns in the index, and are used to look up data rows by the index. Because of this, data pages are sometimes referred to as Heap or B-Tree (HoBT) pages. Data and index pages share the same structure, with only minor variations.
The 8096 byte data content area of a data or index page is used to store an array of data/index records. At the end of page is a slot array of 2-byte values, each holding the offset to the start of the record. The slot array grows backwards as records are added.
The remaining space not counting the slot array is used for data records. Microsoft officially quotes the size limit for a data record as 8060 bytes, rather than 8096 bytes. This has probably been done for safety, to reserve extra space for header information should that ever be necessary. By limiting a data record to 8060 bytes now, the space can be used in a later version of SQL server without breaking any data records near that limit. Note also that the 8060 byte limit is just an upper limit - you won't ever get more than 8060 bytes without row overflow or LOB storage, and for a complex row with lots of columns and variable length fields, the effective limit will probably be less than 8060 bytes, due to the NULL bitmap and the variable length column data lengths and offsets that take up extra space in the record.
Each row of data is structured as follows:
Looking at this data structure, you can see that the minimum size for a row of data, not counting space used for indices, is 9 bytes, plus 2 bytes for the entry in the slot array. It would be 8 bytes if you could have a row with a single byte value, without needing the padding. But since the padding is needed, it makes no difference whether the row requires one byte or two bytes of data - the storage engine still uses 9 bytes either way. In practice, you rarely have a row smaller than 4 bytes, so this is not an issue.
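The byte counts above can be turned into a rough record-size estimator. This is a sketch following the structure described in this section (2 status bytes, 2-byte fixed-data length, fixed data padded to at least 2 bytes, 2-byte column count, NULL bitmap, optional variable-length section); real records have additional special cases (versioning info, forwarding stubs) that this ignores:

```python
import math

def record_size(fixed_data_bytes, num_columns, var_col_sizes=()):
    """Approximate on-page size of a data record, excluding the
    2-byte slot array entry."""
    size = 4                              # status bytes A/B + 2-byte fixed-data length
    size += max(fixed_data_bytes, 2)      # fixed-length data, padded to 2 bytes minimum
    size += 2                             # column count
    size += math.ceil(num_columns / 8)    # NULL bitmap, one bit per column
    if var_col_sizes:
        size += 2                         # variable-length column count
        size += 2 * len(var_col_sizes)    # one 2-byte end offset per variable column
        size += sum(var_col_sizes)        # the variable-length data itself
    return size

print(record_size(1, 1))                       # single tinyint column: 9 bytes minimum
print(record_size(4, 2, var_col_sizes=(10,)))  # int + populated varchar(10)
```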
Every row has at least 9 bytes of overhead space on a page, and most rows will also have variable length data, so most rows will have 11 bytes of overhead. In practice, there is usually a fair amount more overhead than this. In order from the greatest to least impact:
Index pages have the same general format as data pages, except that they only contain the columns in the index, and either the columns in the clustered index or the RID of the data record the index row points to. They may contain included columns if the index has some of these. There are three variations in the format of an index page (clustered leaf pages are the data pages for the table rather than index pages):
The variations of an index record compared to a data record are as follows:
Clustered index pages at non-leaf level only have the columns in the clustered index, because the index columns are the clustered index columns and there is no need to have them in there twice. Non-leaf level index pages also have a 6 byte page locator to point to index pages at the next level down. Included Columns (new for SQL 2005, described further on) are only present at leaf level. Note that the B-trees for indexes in SQL server are not exactly like typical B-trees (they are closer to B+-trees), because the 'separator' records at the higher levels are repeated at the lower levels, and the lowest (leaf) level has index records for all of the rows, including the separators. This simplifies the procedure for the lookup of index records, but wastes a small amount of storage space to store the separators at all levels of the tree.
Note that the clustered index key columns are also present on the non-leaf pages of non-clustered indexes, not because they are needed there, but most likely because it simplifies the code that manages the data structure, and it does not require significantly more space to store, seeing that there are relatively few non-leaf index records.
The minimum size for an index record is 6 bytes (compared to 9 bytes for a data row). This occurs for an index record on a non-clustered index at leaf level, where the only columns present are a 1-byte clustered index column and a 1-byte index column, along with status byte A, a 2-byte column count, and a NULL bitmap byte.
Normally the index records will contain the clustered index columns to lookup the rows. But if the table has no clustered index, then the table is a heap, and RID values will be used instead to lookup rows. Since rows in a heap are not moved or ordered in any way, RID values stay fairly constant, though a heap can become heavily fragmented over time as records change size or are deleted.
Note that a clustered index does not have to be unique. If the index looks up records by a clustered index, and it is not unique, then SQL server has to add information to the data and index rows to make the clustered index values unique. This extra column that gets added is called a 'uniqueifier', which is just a 32-bit int which counts from 1 upward and is stored as the first column in the variable length columns list. Because the uniqueifier is in the variable length columns list, it can be omitted for the rows where it is not needed. When your table requires a uniqueifier, it adds an overhead of two bytes to every data and index row in your table, because the length of the uniqueifier field must be stored for every row, and for the rows where it is present, there is an additional overhead of 4 bytes.
Note that while a clustered index doesn't have to be unique, a primary key must be unique and the columns in the primary key must not allow NULLs - SQL server will not permit you to create a primary key which has NULL columns or non-unique column tuple-values. And if you try to designate columns as part of the primary key in the designer, the UI will automatically uncheck the checkbox to allow NULLs for that column.
Note also that a table can have a clustered index or not, and it can have a primary key or not; if it has both, they can be on the same index or on different indexes. But there can be no more than one primary key for a table, and no more than one clustered index. The clustered index is mainly concerned with the order of the storage of the rows in the data pages, while the primary key is mainly concerned with uniquely identifying the rows. Indexes will always look up records by the clustered index value (with optional uniqueifier), unless the table has no clustered index, in which case the lookup is by RID to a heap.
Index Column Limitations
The maximum number of columns in an index (clustered or not) is 16, and the maximum size allowed for a set of index columns (clustered or not) is 900 bytes. If you try to insert or update a row so that any index exceeds this limit, the operation will fail. Presumably this limit was chosen so that in the worst case of a 900 byte index and a 900 byte clustered index, every page in the B-tree could still hold at least 4 index records (record size 1800+ bytes out of 8096 available in a page), and the B-tree does not have too many levels to walk through to reach the leaf level 0. This is still true even in SQL Server 2005 with Included Columns (described below), because included columns are not included in non-leaf level records, and so at least 4 index records can fit on a page at non-leaf level.
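The fan-out reasoning above can be sanity-checked with a quick calculation. The ~12 bytes of per-record overhead and the billion-row table are illustrative assumptions, not figures from the source:

```python
import math

PAGE_DATA_AREA = 8096           # usable data area of a page
worst_record = 1800 + 12        # 900-byte key + 900-byte clustered key + ~12 bytes overhead (assumed)

fanout = PAGE_DATA_AREA // worst_record
print(fanout)                   # worst-case records per non-leaf page

# B-tree depth needed to index a billion rows at that minimum fanout:
rows = 1_000_000_000
levels = math.ceil(math.log(rows, fanout))
print(levels)                   # still a manageable number of levels
```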
Starting with SQL Server 2005, an index can be defined with included columns. You cannot edit the list of included columns through the GUI (the property will be grayed-out / read-only), but you can define them if you create the index using T-SQL. An included column is included in the index records at leaf level like any other index column, but it is not used to look up data - it is only present as an optimization so that queries don't need to go read the actual data page if all they need are columns that are already included in the index record. Furthermore, a computed column can be defined as an included column. As usual, this is a performance / disk space and memory trade-off. Indexes should be as small as possible, so it is not wise to add extra included columns unless the performance gains are very noticeable.
Computed columns are dynamically computed from the values of other columns in the same table. A computed column can optionally be marked as PERSISTED, in which case the column's value is stored on the data page like any other column and is only re-computed when the dependent column values change. If a computed column is not persisted, then the column value is always calculated at run-time, which saves space, but somewhat at the cost of performance. Computed columns can only be used in an index in certain cases, either in the key, or as an included column. If a computed column is included in an index in any way, the value is persisted in the index records rather than being computed dynamically, since having to refer to the data page to calculate the column value dynamically would defeat the purpose of the index.
Filtered Indexes, Sparse Columns, and Complex Columns
Two new features in SQL Server 2008 that go hand-in-hand are sparse columns and filtered indexes. A filtered index is an index with a where clause added, so that only the data rows that satisfy the where clause are indexed. Filtered indexes are most often used to exclude NULL values from being indexed, when only a small fraction of the rows in the table have non-NULL values for the column that is being indexed. This can yield a substantial savings in space used by the index.
Sparse columns are columns that are only included in the row record when the column's value is non-NULL. This is slightly different than the way variable length columns are stored. With a variable length column, 2 bytes are always stored for each column to keep track of the length of that column, even if it is 0, and the NULL bitmap always has a bit allocated for the column to keep track of whether the column has a NULL value or not. In contrast to this, sparse columns use no extra space in the row record if the value is NULL, but they do have extra overhead when the value is not NULL. As mentioned before, sparse columns are stored in a new special structure in the data row called the sparse vector, which is stored within a complex column in the variable length columns list. Here is what the sparse vector contains:
So we see that the sparse vector has an overhead of 4 bytes per sparse column that is present, plus the actual sparse data, and no overhead if the sparse column is not present (it does not even appear in the NULL bitmap). This can be contrasted with NULLable variable length columns. Even if a variable length column is NULL, it still requires 2 bytes in the variable column section to keep track of the length of the column, and it takes a bit in the NULL bitmap. Note that SQL server does a slight optimization for variable length columns in that if you have 4 such columns, but only the first three columns have non-NULL values, then the 4th column will not be stored, and the column count will only be 3 for that record, but in most cases, nearly all of the variable length columns will require 2 bytes of overhead per column. One could argue that SQL server ought to know to exclude the NULL columns because it has the NULL bitmap, and 1/8 of a byte per column is usually good enough for minimizing overhead, but SQL server does not do this as of 2008 R2. If less than about 20% of the values in a column are non-NULL, then a sparse column is clearly the way to go to save space.
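The trade-off above can be made concrete with a rough comparison of the per-row overheads described in this section: a NULLable variable-length column costs 2 bytes per row whether or not it is NULL (plus a NULL bitmap bit), while a sparse column costs about 4 bytes of sparse-vector overhead per non-NULL value and nothing when NULL. A sketch only; it ignores the sparse vector's own fixed header:

```python
def varlen_overhead_bytes(rows):
    # 2-byte column length per row + one NULL bitmap bit (1/8 byte)
    return rows * (2 + 1/8)

def sparse_overhead_bytes(rows, non_null_fraction):
    # ~4 bytes of sparse vector overhead per non-NULL value, 0 when NULL
    return rows * non_null_fraction * 4

rows = 1_000_000
for frac in (0.1, 0.2, 0.5, 0.6):
    sparse_wins = sparse_overhead_bytes(rows, frac) < varlen_overhead_bytes(rows)
    print(f'{frac:.0%} non-NULL: sparse saves space = {sparse_wins}')
```

By this estimate the break-even point is a bit above 50% non-NULL values, so at 20% non-NULL a sparse column is clearly ahead.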
Complex columns in general start with a single byte for the column header to indicate what kind of column it is. 1 = LOB pointer, 2 = row-overflow pointer, 5 = sparse vector.
Foreign Keys and Indexes
Foreign key constraints in SQL Server are not indexes and do not automatically create indexes in the referenced tables. Now suppose that you have added a foreign key constraint to prevent deletion of a record if there is a reference to it in another table. When this constraint is in place, every time you do a delete of a record in the parent table, SQL server has to join to the referencing table to make sure there are no references to the record you are trying to delete. And if this foreign column has no index on it, then this join has to do a table scan as opposed to an index lookup, which can be incredibly slow. In general, you need to be wary of this with foreign key constraints and almost always make sure that the foreign columns have indexes created.
Heaps and Fragmentation
When a table does not have a clustered index, the rows in the data pages are stored as a heap, where the rows have no particular order. Rows are usually added to the end of the heap in the order that they are inserted, but may sometimes be inserted into free space in between existing rows. If a row is updated such that it has to grow (increase in variable length data), then the row is taken out of its current position and moved to the end of the heap, and the slot array is updated to show the new position of the row. If there is no space on the current page for the updated row, then the record becomes a forwarding stub to the actual record which is moved to a different page, so that the RID values in the indexes don't have to change (which would cause a lot of extra logging). If a row is deleted, then the deleted row is simply cleared from the slot array by writing a zero to the slot array position. Heaps do not get automatically compacted, and over time they can become very fragmented.
Indexes and the Fill Factor
Indexes are stored as B-trees, and when the index is initially created, a page will have a maximum number of index rows that it can hold. By default, SQL server will try to fill each page to capacity to minimize the number of pages needed overall and minimize the size of the index. However, a completely full index causes a large initial number of page splits as rows are inserted or updated. Specifying a fill factor of less than 100 percent causes SQL server to fill each page to less than capacity when the index is initially created or rebuilt. This may be good for speed in the short term for tables with lots of write activity, but it is only of temporary value because eventually all index pages average around 75% full anyway.
As mentioned in the section on heaps, if a row in a heap grows too much to fit on its original page any longer, it will be replaced with a 9-byte forwarding stub. This stub is simply the Status Byte A (1 byte) which will be 0x04, followed by the new RID of the actual record (8 bytes). Note that tables with clustered indexes do not have or need forwarding stubs, since these tables reference records by the clustered index key rather than the RID value.
Allocation Units / Partitions
As mentioned before, each data page points to an allocation unit through the IndexID (idInd) and ObjectID (idObj). An allocation unit ID uniquely identifies a group of data pages for a heap or index. Allocation unit IDs are 64-bit values where the uppermost 16 bits are the idInd (IndexID) value, the middle 32 bits are the idObj (ObjectID) value, and the lower 16 bits appear to be a partition id (which is always 0 in non-enterprise editions of SQL server).
If you have multiple partitions on an index, then each partition (by design) gets its own separate storage for data pages, and (by design) each partition must be searched separately for records - which can be a speed optimization or penalty depending on how well the partition is defined for large tables. Because there is one allocation unit per partition in a partitioned index, allocation units are sometimes called partitions by SQL server.
Note that the idObj values are not the same as sys.objects.object_id values - idObj values are assigned sequentially starting at 1, and refer to unique allocation units (which are unique by the schema object, index id, and type of data - in-row, LOB, or overflow), but object_id values are random 31-bit numbers that refer to schema objects. For system objects, the idObj and object_id values are the same, and the idInd value holds the index_id for the system object (though not exactly the same). The idInd values are all 256 for allocation units for user-defined objects.
Since allocation unit IDs are 64-bit, they have enough space to hold 2^31 unique objects, 2^15 indexes, and 2^15 partitions, which is enough not to be limiting in any aspect.
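The bit layout described above can be sketched as a pair of Python helpers. This follows the convention that allocation_unit_id = idInd * 2^48 + idObj * 2^16 (+ partition in the low 16 bits); treat the exact placement of the partition bits as an assumption, as the text does:

```python
def make_allocation_unit_id(id_ind, id_obj, partition=0):
    # top 16 bits = idInd, middle 32 bits = idObj, bottom 16 bits = partition
    return (id_ind << 48) | (id_obj << 16) | partition

def split_allocation_unit_id(au_id):
    id_ind = (au_id >> 48) & 0xFFFF
    id_obj = (au_id >> 16) & 0xFFFFFFFF
    partition = au_id & 0xFFFF
    return id_ind, id_obj, partition

au = make_allocation_unit_id(256, 27)   # typical user object: idInd = 256
print(au)
print(split_allocation_unit_id(au))     # (256, 27, 0)
```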
System Page Types Using the Data Page Format
Since the data page format is versatile, it is also used by all of the system page types, with the following differences:
GAM Intervals and GAM / SGAM / IAM / DCM / BCM Pages
SQL server tracks which pages and extents belong to which database objects / allocation units through GAM (Global Allocation Map) intervals. A GAM interval is a range of up to 63904 extents in a database file (3994MB, just short of 4GB, since an extent is 64KB in size). The first GAM covers extents 0-63903. The 2nd GAM covers extents 63904-127807, etc... Each GAM interval in a database file will have a single GAM, SGAM, DCM, and BCM page allocated for it to track space usage and status in that GAM interval. There will also be up to one IAM page per allocation unit in the GAM interval. The GAM, SGAM, IAM, DCM, and BCM pages are all 'synoptic' in the sense that they all have the same definition of a GAM interval in terms of the starting and ending extents.
Note that while you can artificially grow a database to any arbitrary size in pages (up to 16TB), GAM-type pages will not actually be created for an additional GAM interval until non-empty data pages (page type non-zero) are added to that interval. This prevents the number of IAM pages from getting out of hand in a pathological case where a database has hundreds of thousands of allocation units. IAM pages are only created in a GAM interval if the interval actually has data for that allocation unit.
As a corollary to this, IAM, data, and index pages are only ever allocated for a database object when there is data added to them. So if you create a table but the table has no rows, no data pages will be created for the table until rows are added - it will be strictly a schema-only change until then.
Each page of the 5 types uses the same format as a data page, and has two slots for two records of different schema. Slot 0 is the header record (94 bytes) and slot 1 is the bitmap record. Slot 0 has three fixed length columns:
Slot 1 is the bitmap slot, with a single bitmap column of variable length (up to 7988 bytes), holding one bit per extent in the GAM interval. The bits are stored as a byte array of little-endian bytes (the least significant bit in each byte corresponds to the first extent, and the most significant bit to the last extent). Looking at an actual IAM page bitmap will help to illustrate this. This IAM page belongs to an IN_ROW_DATA allocation unit, and in this example it has marked the extents that belong to this allocation unit. Running DBCC PAGE with printOpt = 3 showed the following output:
IAM: Extent Alloc Status Slot 1 @0x15B1C0C2
In this example, the file has 1024 pages total (8MB), and the last extent in the file starts at 1:1016. This allocation unit has 3 uniform extents allocated to it at 1:544, 1:616, and 1:624. DBCC PAGE shows the output this way for brevity - it would be far too verbose to list each extent individually with a yes/no value indicating whether the extent is allocated to this allocation unit. In this example, the corresponding bitmap would look like this in hexadecimal:
00000000 00000000 10600000 00000000
The above bitmap has 128 bits for 128 extents. Extents 68, 77, and 78 = 1 (allocated), and the rest are not allocated. Depending on the page type, a bit for a given extent has the following meaning:
Note that bits in the IAM, GAM, and SGAM pages are all mutually exclusive - for a given extent, a bit should never be set (owned) by more than one mapping page. If a bit is set for the same extent for multiple mapping pages, that indicates database corruption.
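Decoding such a bitmap is straightforward given the bit order described above (bytes in order, least significant bit first within each byte). This Python sketch reproduces the example bitmap's allocations:

```python
def allocated_extents(bitmap_bytes):
    """Return the extent numbers whose bits are set in an
    IAM/GAM/SGAM-style bitmap (LSB-first within each byte)."""
    extents = []
    for byte_index, b in enumerate(bitmap_bytes):
        for bit in range(8):
            if b & (1 << bit):
                extents.append(byte_index * 8 + bit)
    return extents

bitmap = bytes.fromhex('00000000000000001060000000000000')
exts = allocated_extents(bitmap)
print(exts)                      # [68, 77, 78]
print([e * 8 for e in exts])     # first page of each extent: 544, 616, 624
```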
A typical database can have hundreds of IAM pages, but only one GAM and SGAM page per 4GB of database file size. Given that the IAM page in the example used only 120 bytes out of 8k, one might ask why it is that SQL server stores allocation data in bitmap format rather than as a linked list. It would seem that having hundreds of mostly empty IAM pages is wasteful.
To answer this, we must remember three things:
Because no space would be saved, keeping the same format between the IAM, GAM, and SGAM pages simplifies processing, and the IAM pages don't really use much overhead in the database file, there has been no need to optimize the format of an IAM page. This is the most likely reason that it still uses a bitmap and not a list.
Finally, note that for the DCM and BCM pages, data is maintained in these pages even if the database has been set to use the Simple recovery model.
Page Free Space (PFS) pages are data pages with only one data record. This data record has only one fixed length column, no NULL bitmap, and no variable length columns. The single fixed length column is an 8088 byte array, and it stores one byte per page to track a number of statistics for that page.
The first PFS page is at *:1 in each database file, and it stores the statistics for pages 0-8087 in that database file. There will be one PFS page for just about every 64MB of file size (8088 pages tracked * 8KB per page). A large database file will use a long chain of PFS pages, linked together using the LastPage and NextPage pointers in the 96 byte header. Note that the PFS pages are NOT synoptic with the other mapping pages. All of the other mapping type pages refer to GAM intervals, but the PFS page range is not any kind of multiple of a GAM interval (one byte per page covering 8088 pages, vs one bit per extent covering 63904 extents) - the intervals are completely separate. Also note that because there is one byte per 8KB page, instead of only one bit per extent like most of the other map pages, there are roughly 64 times as many PFS pages in a large database as there are GAM pages.
Each PFS status byte has the following bit fields:
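A decoder sketch illustrates the commonly documented bit assignments (low 3 bits = free-space level, 0x08 = has ghost records, 0x10 = is an IAM page, 0x20 = page is in a mixed extent, 0x40 = page is allocated). Treat these exact assignments as assumptions rather than a definitive reference:

```python
FREE_SPACE_LEVELS = {0: '0% full (empty)', 1: '1-50% full', 2: '51-80% full',
                     3: '81-95% full', 4: '96-100% full'}

def decode_pfs_byte(b):
    """Split a PFS status byte into its assumed bit fields."""
    return {
        'free_space': FREE_SPACE_LEVELS.get(b & 0x07, 'unknown'),
        'has_ghost_records': bool(b & 0x08),
        'is_iam_page': bool(b & 0x10),
        'is_mixed_extent': bool(b & 0x20),
        'is_allocated': bool(b & 0x40),
    }

print(decode_pfs_byte(0x61))   # allocated, mixed extent, 1-50% full
```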
Log Sequence Numbers
LSN (Log Sequence Numbers) are stored internally as 10 byte values, and can be displayed in one of two ways - either as three fields in the form A:B:C, or as a 25-digit whole number (positive, no decimal places).
When an LSN value is stored as a 10-byte value, the storage structure is:
When displayed as a 25-digit whole number, B and C are padded on the left with zeroes to make them 10 and 5 digits and hold all possible values for a 32 or 16 bit number like this:
In the above, A is not left padded with zeroes, but B and C are. Writing an LSN in this way allows you to compare two LSN values with a numeric comparison to find out which one is smaller (and therefore which one happened first).
The meanings of the LSN fields are as follows. The A value is the VLF file sequence number - every log record in a given VLF has the same m_fSeqNo, and this number increments as each new VLF is added to the log - it is effectively the ID of the VLF. Note that because of this, a log file is limited to no more than 2^31 VLFs.
The B value is the block offset within the VLF; multiplying m_blockOffset by 512 bytes gives the byte offset of the block within the log file. The C value is the slot ID, the record index within the block, and it always counts from 1 upward.
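Assuming A, B, and C are 32, 32, and 16 bit fields and the padding rule described above, the two display forms can be converted back and forth with a short sketch like this:

```python
# Sketch: converting an LSN between its A:B:C form and the padded
# decimal form (A unpadded, B left-padded to 10 digits, C to 5 digits).

def lsn_to_decimal(a, b, c):
    """Concatenate A, zero-padded B, and zero-padded C into one integer."""
    return int(f"{a}{b:010d}{c:05d}")

def decimal_to_lsn(n):
    """Split the padded decimal form back into its A:B:C fields."""
    c = n % 10**5
    b = (n // 10**5) % 10**10
    a = n // 10**15
    return a, b, c
```

Because B and C occupy fixed widths, ordinary integer comparison of the decimal form agrees with LSN ordering.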
Datetime Data Type
The datetime data type is a packed byte array composed of two integers - the number of days since 1900-01-01 (a signed integer value), and the number of clock ticks since midnight (where each tick is 1/300th of a second), as explored in various blog posts and Microsoft's documentation.
This gives the interesting result that a datetime value with all bytes zero is equal to 1900-01-01 at midnight. It also tells us that the datetime structure is a rather inefficient way to store time (the datetime2 data type was created to address this concern), although it does default to a reasonable zero point, and the date and time parts can be split apart very easily by SQL server.
Note that while the structure is capable of storing dates roughly plus or minus 5.8 million years from the epoch, it is limited by rule to the range 1753-01-01 through 9999-12-31. And note that while the clock ticks part is a 32-bit number, in practice the highest value used will be 25919999. Since the datetime clock ticks are 1/300ths of a second, values display accuracy to the millisecond, but they will actually be rounded to the nearest 0, 3, 7, or 10 millisecond boundary in all conversions and comparisons.
Schema Objects / Types
Schema objects are objects with a permanent name which are addressable using the 4-part syntax server.database.schema.object. Since objects are uniquely identified by name regardless of type, you cannot have two objects with the same name, even if they are different types (view vs. procedure, for example), unless they belong to different schemas. SQL Server 2008 defines 26 different object types:
Every object type has a char(2) type code:
In addition to these types, SQL Server 2012 (Denali) defines a new object type called SEQUENCE, which makes the generation of identity values more flexible by centralizing the logic in a database object.
File Header Page
The first page in each database file is the file header page, and there is only one such page per file. Reading this page can be done either with the standard DBCC PAGE command, or with the DBCC FILEHEADER command which is also undocumented, see example below:
DBCC FILEHEADER ('[dbname]', [FileID])
The file header page has a format identical to a data page, but the table is a system table whose schema seems to be hard-coded into SQL server, rather than defined in any other table. This data page has only a single row, and most of the columns are in the variable length columns section.
The page/tornBits checksum in the page header acts as a security checksum for this page - if you try to manually edit the header data in a binary editor and then attach the file in SSMS, the UI will tell you that the file is not a primary database file.
The first 16 bytes of the file header are fixed length. The remaining fields are all optional / variable length, using the same structure as the variable length columns section of a data row. Therefore, rather than listing the offsets of each of these fields, the field index is given instead (there are 46 such fields in SQL 2008 R2, numbered 0x0 - 0x2D). To look up the position of field N, read the offset at 0x12 + N*2 and the offset just before it, to get the start and end positions of the field in the file header.
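The lookup rule can be sketched like this, assuming the offsets are little-endian 2-byte integers and that each array entry stores the end position of its field (as in the variable length column section of a data row); the handling of field 0, which has no preceding entry, is an assumption:

```python
import struct

VAR_FIELD_COUNT = 46  # fields 0x0 - 0x2D in SQL 2008 R2

def var_field_span(header_bytes, n):
    """Locate variable-length field n of a file header page: the 2-byte
    value at 0x12 + n*2 is taken as the end offset of field n, and the
    previous array entry as its start. Field 0 is assumed to begin
    immediately after the 46-entry offset array."""
    end = struct.unpack_from("<H", header_bytes, 0x12 + n * 2)[0]
    if n == 0:
        start = 0x12 + VAR_FIELD_COUNT * 2
    else:
        start = struct.unpack_from("<H", header_bytes, 0x12 + (n - 1) * 2)[0]
    return start, end
```

With a 46-entry array starting at 0x12, field 0 would begin at byte 110 (0x6E) of the header record.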
The fixed length file header fields, followed by the offsets for the variable length fields, are as follows:
The variable length fields are (by column index):
Boot Page
Each database has a single boot page at 1:9, which has much the same format as the file header page - a data page with a single record. However, this record has no NULL bitmap or variable length column data - it is composed entirely of fixed length columns, which makes parsing it very easy. Here are the columns, by offset relative to the start of the data record:
Instance Header Page
Each SQL server instance has one instance header page at 1:10 in the first database file of the master database. Todo... continue analysis of this page...
Dynamic Management Views (DMVs) and System Base Tables
SQL Server 2005 and higher no longer allow system base tables to be queried directly. Instead, DMVs (system views) are supplied that query these tables and present a relevant view of them to users. However, DBCC PAGE can still be used, even in SQL 2008, to view the contents of the system base tables directly.
Microsoft describes the purposes of the various system base tables in its documentation.
System Table: sysallocunits (ObjectID = 7)
The sysallocunits table is the entry point containing the metadata that describes all other tables in the database. The first page of this table is pointed to by the dbi_firstSysIndexes field on the boot page. The records in this table have 12 fixed length columns, a NULL bitmap, and a number-of-columns field. Most of the columns in this table can be seen through the DMVs sys.system_internals_allocation_units, sys.partitions, sys.dm_db_partition_stats, and sys.allocation_units. The names in parentheses are the columns as they appear in the DMVs. The structure of this table is as follows:
System Table: sysschobjs (ObjectID = 34)
The sysschobjs table is the underlying table for the sys.objects view. It has a NULL bitmap and one variable length column. Note that there is also a sys.system_objects view with object IDs less than 0, but the objects shown in that view are not in this table.
System Table: syscolpars (ObjectID = 41)
The sys.syscolumns and sys.columns DMVs show data from the syscolpars table, which lists every column defined in any table or view, every input parameter to a procedure or function, and every output column of a table valued function in the database. The columns in parentheses are what the sys.syscolumns view shows.
Things to note: colid is a 16 bit number, meaning that a table can have no more than 2^15 columns, and a procedure or function can have no more than that many parameters (though in practice the enforced limits are far lower).
System Table: sysobjvalues (ObjectID = 60)
The sysobjvalues system base table is a general storage area for metadata about an object which is not set based (that is to say, scalar single valued attributes on an object). For object types such as functions, procedures, views, and default constraints, the most important such attribute is the T-SQL code which is the 'guts' of the object and defines how it runs.
These attributes were previously stored in the system table 'syscomments', and there is now a DMV called sys.syscomments which provides the same view for backwards compatibility. However, this DMV is only for compatibility with old code; new code should use the DMV sys.sql_modules to get details about 'programmable' object types.
The sysobjvalues table has a NULL bitmap and two variable length columns:
One thing that we can see from this definition is that the body of a procedure or function is not practically limited in size, since imageval can be up to 1GB, and very complex expressions are possible in the T-SQL even of simple default and check constraint objects (though this would usually be bad for performance).
System Table: sysscalartypes (ObjectID = 50)
The sysscalartypes table holds a row for every built-in and user-defined data type. It can be read using the DMVs sys.systypes and sys.types. It has the following columns:
System Table: sysidxstats (ObjectID = 54)
The sysidxstats system table has one row for every index or heap in the database, including indexes on tables returned by table valued functions and user defined table types. It is visible through the DMVs sys.indexes and sys.sysindexes. This table is clustered by (id, indid). It also includes automatically created statistics on columns that don't have indexes - these statistics have automatically generated names of the form _WA_Sys_[ObjectID as 8 digit hex number]_[random 8 digit hex number] - the query optimizer uses them to select the best execution plan when a join includes unindexed columns that only have statistics. The format of this table is:
System Table: sysiscols (ObjectID = 55)
The sysiscols table has only 22 bytes of fixed length data per row, and it defines the columns of all indexes and statistics in system and user tables, for both clustered and non-clustered indexes. Heaps (index id = 0) are not included in this table. The data in this table is visible via the DMV sys.index_columns (indexes only; statistics columns are not included). This table is clustered by (idmajor, idminor, subid):
Page Verification (PAGE_VERIFY)
Every database has a database level option called PAGE_VERIFY, which can have one of three possible values - NONE, TORN_PAGE_DETECTION, and CHECKSUM. However, there is one limitation - prior to SQL server 2008, the tempdb database could only use the option NONE (this was corrected in SQL server 2008, where tempdb can use any PAGE_VERIFY option). This option can be set in SSMS through Database Properties, Options, Recovery, Page Verify. The setting can be queried through the sys.databases.page_verify_option_desc view column, and set in T-SQL through ALTER DATABASE [dbname] SET PAGE_VERIFY [value]. It is not yet known how the value is stored by the storage engine, as it does not appear to be written in the file header page.
This database level option selects the checksum algorithm used to verify that reads from and writes to disk are not a source of corruption. When the setting is something other than NONE, SQL server will use the algorithm to compute a checksum when writing a page to disk, and will write two data fields to the 96-byte header of the page: FlagBits will have the 0x100 (TORN_PAGE_DETECTION) or 0x200 (CHECKSUM) bit set, and the TornBits field will store the checksum. When the page is later read from disk, the FlagBits value tells SQL server how to verify the checksum. If FlagBits indicates NONE, or the database option is set to NONE, the checksum comparison is skipped. Otherwise, the algorithm indicated by the page is used to compute the checksum of the data read from disk, and the result is compared to the TornBits value. If the two don't match, then something bad happened between the write and the read, and data got corrupted. SQL server assumes that this corruption was not user generated and was due to a fault in the I/O subsystem, and raises a critical error telling you that database corruption was detected.
The TORN_PAGE_DETECTION algorithm is the weaker of the two, and it was all that was available in SQL server 2000. It computes a 2-bit checksum for each of the 16 512-byte sectors on the page, and joins them all together into a 32-bit checksum. In contrast, the CHECKSUM algorithm, added in SQL server 2005, computes a 32-bit checksum in which all 32 bits depend on the full contents of the 8KB page.
The CHECKSUM option is the stronger of the two because it has only a 1 in 2^32 chance of failing to detect corruption, whereas TORN_PAGE_DETECTION has at best a 1 in 4 chance of failing to detect corruption in a single sector. The CHECKSUM option is slightly more CPU intensive to compute, but compared to the delay required to read a page from disk, the overhead is minimal. CHECKSUM is the default in SQL server 2005 and higher.
Paul Randal posted an article about common misconceptions around page verification. Once you know how page verification actually works, it is easy to see why these are misconceptions: