Saturday, 2 July 2011

SQLite overflow pages and other loose ends...

This is the fourth post dealing with the elements making up SQLite databases and complements the previous three:
We will remember from these previous posts that:
  • The entire database file is divided into equally sized pages - SQLite database files always consist of an exact number of pages
  • The page size is always a power of two between 512 (29) and 65536 (216) bytes
  • All multibyte integer values are read big endian
  • The page size for a database file is determined by the 2 byte integer located at an offset of 16 bytes from the beginning of the database file
  • Pages are numbered beginning from 1, not 0
  • Therefore to navigate to a particular page when you have a page number you have to calculate the offset from the start of the database using the formula:
    offset = (page number-1) x page size
and that the database may have the following possible page types:
  • An index B-Tree internal node
  • An index B-Tree leaf node
  • A table B-Tree internal node
  • A table B-Tree leaf node
  • An overflow page
  • A freelist page
  • A pointer map page
  • The locking page
In this post I am going to take a closer look at Overflow pages.

Overflow pages are required when a record within a database requires more space than that available within a cell in one database page. One SQLite database of forensic interest is the Cache.db file maintained by the Apple Safari web browser. One of the tables within this database is entitled cfurl_cache_blob_data which uses the receiver_data field to store the cached item itself (e.g. cached jpgs, gifs, pngs, html et al ) as a BLOB. A BLOB is a Binary Large OBject. These cached objects often require overflow pages and we can demonstrate the mechanics of them by walking through a record within Cache.db.

If you run a file carver across a Cache.db file searching for pictures you are likely to carve out a number corrupt pictures as shown in the example within Figure 1.

Double click to enlarge
Figure 1

It can be seen that this picture starts at File offset 4583317 within Cache.db. By examining the two bytes at offset 16 within this SQLite database we have established that the database page size is 1024 bytes. The record that contains this picture has six fields as shown in Figure 2.

Double click to enlarge
Figure 2

As discussed in my earlier post An analysis of the record structure within SQLite databases the data making up this record is store in a serialised way (the data representing field 1 is immediately followed by the data representing field 2 and so on with no delimiters). It can be seen therefore that a cell storing a record within the cfurl_cache_blob_data table is almost bound to overflow the 1024 byte database page.
In our example our corrupt picture starts at FO 4583317. To calculate the database page it is stored in we divide the offset by the page size 4583317/1024=4475.8955078125 and round up to establish the page number. Our corrupt picture header is in page 4476.
The SQLite.org file format states that overflow pages are chained together using a singly linked list. The first 4 bytes of each overflow page is a big-endian unsigned integer value containing the page number of the next page in the list. The remaining usable database page space is available for record data.
We know that our picture is likely to be stored in a number of overflow pages and we can establish the next page by looking at the first 4 bytes of the page that is in. Using the formula offset = (page number-1) x page size I can calculate that the offset of these 4 bytes at the start of the page is 4475 x 1024=4582400. This offset can be seen in Figure 3.

Double click to enlarge
Figure 3

The 4 bytes in hex are 00 00 11 7D which decoded big endian is 4477. The next page therefore in the linked list is page 4477. The first 4 bytes of this page found at offset 4583424 in hex are 00 00 11 7E which decoded big endian is 4478. The first 4 bytes of this page found at offset 4584448 in hex are 00 00 11 7F which decoded big endian is 4479. The first four bytes of this page found at offset 4585472 are in hex 00 00 00 00. This value signifies the last page in the linked list.
I can be seen in this example that our corrupt picture starts in page 4476 and overflows into pages 4477, 4478 and 4479. Obviously the overflow pages are contiguous in this case, so in theory at least, if I copy the data from the jpg header of the corrupt picture to the jpg footer and edit out the 'corruption' I should end up with a complete picture. The corruption was likely caused by the overflow page values at the start of each page so using a hex editor I can remove these and hey presto:











Essentially what we have done here is start in the middle of the record and work forwards to the end. Because overflow pages are chained together using a linked list this is relatively straightforward.
But what do we do if we want to locate the earlier pages in the record? This is a little more complicated because each overflow page does not contain the page number of the previous page. The Safari cache.db SQLite 3 database is an auto-vacuum database so we could utilise Pointer Map pages to locate the parent page of the page (4476) where our corrupt picture header is stored. You will recall from my previous post that Pointer Map pages store a 5 byte record relating to every page that follows the Pointer Map page. Pointer Map pages found in Safari Cache.db files will have a lot of entries that relate to overflow pages. The 5 byte records are structured with 1 byte indicating a Page Type and then 4 bytes, decoded big endian, referencing the parent page number as follows:
  • 0x01 0x00 0x00 0x00 0x00
    This record relates to a B-tree root page which obviously does not have a parent page, hence the page number being indicated as zero.
  • 0x02 0x00 0x00 0x00 0x00
    This record relates to a free page, which also does not have a parent page.
  • 0x03 0xVV 0xVV 0xVV 0xVV (where VV indicates a variable)
    This record relates to the first page in an overflow chain. The parent page number is the number of the B-Tree page containing the B-Tree cell to which the overflow chain belongs.
  • 0x04 0xVV 0xVV 0xVV 0xVV (where VV indicates a variable)
    This record relates to a page that is part of an overflow chain, but not the first page in that chain. The parent page number is the number of the previous page in the overflow chain linked-list.
  • 0x05 0xVV 0xVV 0xVV 0xVV (where VV indicates a variable)
    This record relates to a page that is part of a table or index B-Tree structure, and is not an overflow page or root page. The parent page number is the number of the page containing the parent tree node in the B-Tree structure.
We can expect to see a lot of Page Types 0x03 and 0x04. So how do we find the pointer map page? We know that the Pointer Map page may contain up to (database page size/5) records (rounded down if necessary) - in this case 1024/5=204.8 so there are 204 records in each Pointer Map page. The first Pointer Map page is Page 2. This is followed by 204 pages and then another Pointer Map page, page 207, followed by 204 pages and then another Pointer Map page, page 412 and so on. In other words there is a Pointer Map page every 205th page, starting page 2. In our example we know that our corrupt picture header is in database page 4476 and the applicable Pointer Map page is prior to it. To calculate the applicable Pointer Map page number we divide 4476 by 205 = 21.834146341463415, round down to 21 and multiply by 205 and then add 2 which equals 4307. The applicable Pointer Map page for page 4476 of the database is page 4307. Using the formula offset = (page number-1) x page size I can calculate that the offset to this page is 4409344. This page can be seen in Figure 4. Each Page Type flag where it references an Overflow page is bookmarked in green, other Page Types are in blue. The first 5 byte record relates to database page 4308, the second 5 byte record page 4309 and so on. The record for page 4476 is the 169th record on the Pointer Map page (4476-4307).

Double click to enlarge
Figure 4

It can be seen that the 169th record is in hex 04 00 00 11 7B. This record has a flag 0x04 which indicates that this record relates to a page that is part of an overflow chain, but not the first page in that chain. The parent page number is 00 00 11 7B decoded big endian to page 4475. The record for page 4475 is the 168th record on the page 04 00 00 11 7A. This record also indicates that the page is part of an overflow chain but not the first page. The parent page number is 00 00 11 7A decoded big endian to page 4474. The record for page 4474 is the 167th record on the page 03 00 00 11 6E. This record has a flag 0x03 which indicates that this record relates to the first page in an overflow chain. The parent page number is the number of the B-Tree page containing the B-Tree cell to which the overflow chain belongs. The parent page number is 00 00 11 6E decoded big endian to page 4462.
Using the formula offset = (page number-1) x page size I can calculate that the offset to the beginning of this page is 4568064. We can now decode the page header (detailed more fully in the post An analysis of the record structure within SQLite databases ) shown in Figure 5.

Double click to enlarge
Figure 5

Figure 5 shows the page header of the B-tree leaf node page. The first byte 0D is a flag indicating the page is a table B-tree leaf node. The next two bytes 00 00 indicate that there are no free blocks within the page. The next two bytes 00 03 read big endian indicate that there are three cells stored on the page. The next two bytes at offset 5 within the page header 00 EB decoded big endian give a value of 235 which is the byte offset of the first byte of the cell content area relative to the start of the page. The last byte of the eight byte page header 00 is used to indicate the number of fragmented free bytes on the page, in this case there are none. The remaining highlighted three pairs of bytes 02 60, 01 F1 and 00 EB are the cell pointer array for this page. These three values are offsets to the start of each cell when decoded big endian are 608, 497 and 235 respectively. We will focus on the cell at offset 235. At offset 235 we find two Varints representing the Length of Payload and Row ID (see Figure 6).

Double click to enlarge
Figure 6

The varints are B1 66 and 82 11. The calculation needed to decode them follows in Figure 7:

Double click to enlarge
Figure 7

Following the Length of Payload and Row ID are varints representing the Length of the Payload Header and the serial type codes of the entry_ID, response_object, request_object, receiver_data, proto_props and user_info fields respectively as shown in Figure 8:

Double click to enlarge
Figure 8

Figure 8 shows highlighted in blue and green the first three elements of the Cell make up - the Payload Length, the Row ID and the Payload Header. We have already decoded the Payload Length B1 66 and the Row ID 82 11. The next byte h0A denotes the length of the Payload Header which is in this case 10 bytes (including the Payload Header Length byte). It can be seen therefore that the remaining 9 bytes contain the varints 00, 9B 54, 96 3E, B1 4A, 00 and 00. To determine what each varint indicates we have to consult the Serial Type Code chart detailed in the post An analysis of the record structure within SQLite databases . Each Serial Type Code details the type and length of the data in the payload that follows the payload header.
  • 00 This serial type code indicates that the first field is NULL and the content length is 0 bytes. We know that the first field in our record relates to Row ID however the SQLite.org file format states If a database table column is declared as an INTEGER PRIMARY KEY, then it is an alias for the rowid field, which is stored as the table B-Tree key value. Instead of duplicating the integer value in the associated record, the record field associated with the INTEGER PRIMARY KEY column is always set to an SQL NULL.
  • 9B 54 This serial type code has a value of 3540 which is greater than 12 and an even number. The chart indicates therefore that this field is a BLOB (3540-12)/2 bytes in length [1764 bytes]
  • 96 3E This serial type code has a value of 2878 which is greater than 12 and an even number. The chart indicates therefore that this field is a BLOB (2878-12)/2 bytes in length [1433 bytes]
  • B1 4A This serial type code has a value of 6346 which is greater than 12 and an even number. The chart indicates therefore that this field is a BLOB (6346-12)/2 bytes in length [3167 bytes]
  • 00 This serial type code indicates that the field is NULL
  • 00 This serial type code indicates that the field is NULL
The serial type code for the response_object indicates that this field is a BLOB 1764 bytes in length. The entire database page would not be big enough to store this BLOB and the cell is even less capable. Figure 9 shows each cell highlighted alternately blue and green:

Double click to enlarge
Figure 9

The first blue shaded area is the cell the elements of which are duscussed above. The last four bytes of this cell highlighted in darker blue is the hex value 00 00 11 7A which when decoded big endian gives the value 4474. This is the page number of the first Overflow page for this cell and is consistent with the information found in the Pointer Map page discussed above.

Next Post
Following on from my earlier SQLite blog posts James Crabtree has been kind enough to code a Varint decoder and Alex Caithness of CCL has supplied me with his fully featured SQLite record recovery tool EPILOG. I'll review this software next time. Thanks to James and CCL.

References
http://www.sqlite.org/fileformat.html
http://www.sqlite.org/fileformat2.html

Tuesday, 10 May 2011

An analysis of the record structure within SQLite databases

My two previous posts Carving SQLite databases from unallocated clusters and SQLite Pointer Maps pages looked at the structure of SQLite databases as a whole. Information contained in those posts may hopefully facilitate the carving of complete SQLite databases. This post is aimed at examining the potential of carving individual records within an SQLite database but should be read in conjunction with the Carving SQLite databases from unallocated clusters post.

Background
Carving individual SQLite database records in certain circumstances may be more fruitful than carving whole databases. There are in fact a number of applications that do exactly this for some types of SQLite database. For example Firefox 3 History Recovery (FF3HR) is an application written to recover Firefox records. A paper entitled Forensic analysis of the Firefox 3 Internet history and recovery of deleted SQLite records written by Murilo Tito Pereira also deals with the recovery of Firefox records.
SQLite databases can be considered as a mini file system in their own right. Within this file system are areas that are marked as free that may contain deleted data. Record based recovery may help identify records that have been deleted but are still contained within the parent SQLite database. More obviously record based recovery is indicated where only deleted and partially overwritten databases are available. However for record based recovery to be useful the data you wish to recover must be stored within one table within the SQLite database concerned. If it is necessary to query two or more tables to extract useful data record based recovery is probably not going to be appropriate.

Table Record

Figure 1


Figure 1 shows, as viewed with the SQLite Database Browser software, a record within the Google Chrome History file URL table at row ID 608. This record is stored within the Google Chrome History SQLite database within a B-tree table leaf node in an area known as a cell. It can be seen from the column headers that this record consists of an id, a url, a title, a visit_count, a typed_count, a last_visit_time, a hidden flag and lastly a favicon _id.  To aid viewing I will repeat the record data below:
  • ID
    608
  • URL
    http://www.sqlite.org/fileformat2.html
  • title
    File Format For SQLite Databases
  • visit_count
    1
  • typed_count
    0
  • last_visit_time
    12949409092779476
  • hidden
    0
  • favicon_id
    46

The urls table is stored within one table B-tree which will consist of a root page and possibly a number of internal and leaf node pages.  I have established that the data representing the record detailed above is stored in a cell that exists within a B-tree table leaf node database page.

Cells
Lets recap some of the information dealt with in the earlier post Carving SQLite databases from unallocated clusters.  SQLite databases are divided into equal sized pages, the size of which is detailed in two bytes, decoded as a 16 bit integer big endian, at offset 16 of the database file within the database header.  Most of an SQLite database consists of B-tree structures consisting of one or more B-tree pages.  Each B-tree page has either an 8 or 12 byte page header (depending on whether it is a leaf or internal node).
              Figure 2

















As can be seen in Figure 2 the cells tend to be at the end of each database page in an area referred to as the cell content area.  These cells are used to store the database records, one record per cell.   The first cell to be written in a database page is stored at the end of the page and additional cells work back towards the start of the page.

The number of cells and their location within a database page is stored within the B-Tree page header at the following offsets.
  • Offset 1 2 bytes 16 bit integer read big endian   
  • Byte offset of first block of free space on this page (0 if no free blocks)
  • Offset 3 2 bytes 16 bit integer read big endian
  • Number of entries (cells) on the page
  • Offset 5 2 bytes 16 bit integer read big endian
  • Byte offset of the first byte of the cell content area relative to the start of the page. If this value is zero, then it should be interpreted as 65536
Figure 3  Page Header and Cell Pointer array

Figure 3 shows the page header of the B-tree leaf node page that contains the record detailed in Figure 1 above.  The first byte 0D is a flag indicating the page is a table B-tree leaf node. The next two bytes 00 00 indicate that there are no free blocks within the page. The next two bytes 00 03 read big endian indicate that there are three cells stored on the page.  The next two bytes at offset 5 within the page header 0B 8E decoded big endian give a value of 2958 which is the byte offset of the first byte of the cell content area relative to the start of the page.  The last byte of the eight byte page header 00 is used to indicate the number of fragmented free bytes on the page, in this case there are none.

The remaining highlighted three pairs of bytes 0C 44,  0B EC and 0B 8E are the cell pointer array for this page. The SQLite.org file format notes helpfully state that the cell pointer array of a b-tree page immediately follows the b-tree page header.  Let K be the number of cells on the b-tree. The cell pointer array consists of K 2-byte integer offsets to the cell contents. The cell pointers are arranged in key order with left-most cell (the cell with the smallest key) first and the right-most cell (the cell with the largest key) last.  The key value referred to is the row ID.  In this case we have three cells and therefore three offsets which when decoded big endian are 3140, 3052 and 2958.   These offsets allow us to find the start of each cell, it is worth pointing out that there may be free blocks or fragments between each cell so we can not use the offsets to determine the length of each cell.

The record detailed in Figure 1 is contained within the cell at offset 2958 within the page.  We will decode the contents of this cell but first we better look at the make up of a cell.

Figure 4 Cell make up
Figure 4 indicates four areas of interest.  The payload is the data forming the record as detailed in this example in Figure 1 and as suggested in the diagram it is stored in a serialized way with all the relevant data concatenated together.  The payload header details how we can identify each field within the concatenated data (see the Payload Header section below for details of how this works).  The Row ID number and the Payload length are stored using variable length integers known as varints.  To successfully decode the Cell and Payload headers we have to be able to decode a varint.


Varint
http://www.sqlite.org/fileformat2.html and http://www.sqlite.org/fileformat.html#varint_format provides some detail in respect to how varints are structured.  I will try here to simplify things and provide a few example decodings when we decode the cell relating to the record detailed at figure 1.

  • Varints are variable length integers between 1 and 9 bytes in length depending on the value stored
  • They are a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values
  • Where the most significant bit of byte 1 is set this indicates that byte 2 is required, where the most significant bit of byte 2 is set this indicates that byte 3 is required, and so on
  • Varints are big-endian: bits taken from the earlier byte of the varint are the more significant than bits taken from the later bytes
  • Seven bits are used from each of the first eight bytes present, and, if present, all eight from the final ninth byte
Figure 5
Figure 5 shows the beginning of the cell at offset 2958 within the page.  As shown in figure 4 the first value is the payload length represented by a varint.  The first byte is 5B.  We have to establish the value of the most significant bit and this can be done by converting the hex 5B to binary:

It can be seen that in this case the most significant bit is zero and therefore not set.  This varint is only one byte long and indicates that the payload length is 91 bytes.  The payload length is the length in bytes of both the payload header and the payload.

The next byte is  84.  Converting this to binary:

The most significant bit here has the value of one and therefore is set.  This indicates that this varint includes, at least, the next byte 60 which converted to binary:

It can be seen that the most significant bit is zero and therefore not set.  This byte therefore is not followed by another and is the last byte of this varint.  To establish the value of this varint we now have to take the least significant 7 bits of each of the two contributing bytes and concatenate them together:

00001001100000

We discard the leading zeros and convert the binary 1001100000 to decimal, giving a value of 608.  This  varint represents the row ID and we can see in figure 1 that the row ID is confirmed as 608.

The calculation we have carried out can be represented by a formula.   If we say that the value of the varint is N and the unsigned integer value of the first byte is x and the unsigned integer value of the second byte is y we can use the formula:

N =  (x-128) x 128 + y

If we substitute the value of our unsigned integers 132 and 96 into the formula:

(132-128) x 128 + 96 = 608

This formula works for two byte varints that can represent a maximum value of 16383.  I suspect we are not likely to encounter larger varints in the SQLite databases we have an interest in with the possible exception of SQLite databases used to store browser cache.  It is also worth noting that the most significant bit if included and allowed to contribute to the unsigned integer value would have a value of 128 (hence the [x-128]).  Therefore if the first byte of a varint is less than 128 you can exclude the possibility of there being a second byte in the varint.

Payload Header and Payload
We have already looked at two of the four areas of interest within a cell, the payload length and row ID. Next up is the Payload Header and Payload.

Figure 6  Payload Header make up
Figure 6 shows the make up of the payload header of a record within the URLs table of the Google Chrome History SQLite database.  The payload is the data forming the record stored in a serialized way with all the relevant data concatenated together.  The payload header details how we can identify each field within the concatenated data and will vary from table to table, the contents of which is dictated by the fields required in each record. All payload headers will have however a Payload Header Length followed by one or more Serial Type Codes.  The Serial Type Code is used to denote the type of data found in a field within the payload and it's length. All possible Serial Type Codes are varints and are detailed in a chart provided by SQLite.org at Figure 7:

Figure 7
Lets have a look at the  Payload Header of our example record detailed in Figure 1.

Figure 8

Figure 8 shows highlighted in blue and green the first three elements of the Cell make up shown in Figure 4 - the Payload Length, the Row ID and the Payload Header.  We have already decoded the Payload Length 5B and the Row ID 84 60.  The next byte h09 denotes the length of the Payload Header which is in this case 9 bytes (including the Payload Header Length byte). It can be seen therefore that the remaining 8 bytes shown in hex are 00594D01010601 and 01. These bytes represent varints so we have to consider that a value may be represented by more than one byte, however in this case the unsigned integer value of each byte is less than 128.  We can conclude therefore that each varint is only a single byte in length.  To determine what each varint indicates we have to consult the Serial Type Code chart shown at figure 7.  Each Serial Type Code details the type and length of the data in the payload that follows the payload header.  The multi byte integers are decoded big endian.
  • 00  This serial type code indicates that the first field is NULL and the content length is 0 bytes.  We know that the first field in our record relates to Row ID (see figure 1) however the SQLite.org file format states If a database table column is declared as an INTEGER PRIMARY KEY, then it is an alias for the rowid field, which is stored as the table B-Tree key value. Instead of duplicating the integer value in the associated record, the record field associated with the INTEGER PRIMARY KEY column is always set to an SQL NULL.
  • 59  This serial type code has a value of 89 which is greater than 13 and an odd number.  The chart indicates therefore that this field is a text string (89-13)/2 bytes in length [38 bytes]
  • 4D  This serial type code has a value of 77 which is greater than 13 and an odd number.  The chart indicates therefore that this field is a text string (77-13)/2 bytes in length [32 bytes]
  • 01  This serial type code has a value of 1 indicating the next field is an 8 bit integer using 1 byte
  • 01  This serial type code has a value of 1 indicating the next field is an 8 bit integer using 1 byte
  • 06  This serial type code has a value of 6 indicating the next field is an 64 bit integer using 8 bytes
  • 01  This serial type code has a value of 1 indicating the next field is an 8 bit integer using 1 byte
  • 01  This serial type code has a value of 1 indicating the next field is an 8 bit integer using 1 byte
It can be seen that our payload is 82 bytes in length (38+32+1+1+8+1+1).  The payload header was 9 bytes and therefore the overall payload length is 91 (82+9) bytes, as previously calculated, and represented by the byte 5B.

Figure 9

Figure 9 shows each element of the payload highlighted alternately in green and blue.  The first element is http://www.sqlite.org/fileformat2.html 38 bytes in length, the next element is File Format For SQLite Databases 32 bytes in length.  The next element is represented by the byte 01 which denotes the visit_count of 1.  This is followed by the byte 00 denoting the typed_count of 0.  Next are the eight bytes 00 2E 01 6B 41 06 BD D4 decoded as a 64 bit integer big endian giving a value of 12949409092779476, the last_visit_time (stored in the Google format).  The next byte is 00, the hidden flag, followed lastly by 2E decoded as 46, the favicon_ID.  The next record in this case immediately follows at offset 3052 within the database page.

Notes
I have glossed over some possible combinations of data found in stored records in order to try and simplify things a little.  It is possible for a record to require more space than the space available in a cell within one database page.  In this eventuality pages known as overflow pages come into play.  I will leave any commentary on this to another day :-)

Carving Considerations
It can be seen that each record of the Google Chrome History URL table may vary in content and length.  This precludes simple carving of records using known headers.  It may be possible to define a scheme to assist with carving however by focussing on the parameters of each element of the record.  It is clear that for the Google Chrome History URL table the scheme would be fairly complicated, allowing for very large URLs and Page titles which may well induce many false positives.  For databases using a simpler record structure things are a bit easier.  A presentation presented by Alex Caithness of CCL details an approach that can be adopted for carving iPhone calls.db databases.

Deleted Data within Live Databases
This area will require another blog post on another day!  I am aware of two programs  possibly written to recover this deleted data. SQL Undeleter from Chirashi Security and Epilog from CCL.  If the developers will let me test these programs out I will report the results to you.


References
http://www.sqlite.org/fileformat.html
http://www.sqlite.org/fileformat2.html
http://www.ccl-forensics.com/images/f3%20presentation3.pdf
http://mobileforensics.wordpress.com/2011/04/30/sqlite-records/


Wednesday, 4 May 2011

SQLite Pointer Maps pages

This blog post complements and should be read in conjunction with the previous post Carving SQLite databases from unallocated clusters. In that post I looked at the information available within an SQLite database that may assist in carving one from unallocated clusters.

You will remember from the earlier blog post that SQLite databases are divided into equally sized pages and SQLite database files always consist of an exact number of pages. The page size for a database file is determined by the 2 byte integer located at an offset of 16 bytes from the beginning of the database file. The first page in an SQLite database is numbered page 1.

Auto-vacuum capable
Auto-vacuum capable SQLite databases make use of Pointer Map pages along with the other page types detailed in the earlier blog post. It is probably helpful to provide some information about what an auto-vacuum capable database is.

In a non-auto-vacuum-capable SQLite database when information is deleted the pages where it was stored are added to a list of free pages and these pages can be reused the next time data is inserted. Therefore, should data be deleted the file size of the database does not decrease. If a lot of data is deleted and it becomes necessary to shrink the database size the SQL VACUUM command can be run. This has the effect of reorganising the database from scratch and removing any free pages completely, thus making the database smaller.

When Auto-vacuum is enabled all free pages are moved to the end of the database file and the database file is truncated to remove the free pages at every transaction commit, thus removing free pages automatically.

Pointer Map Pages
The purpose of the Pointer Map is to facilitate moving pages from one position in the database file to another as part of auto vacuum. When a page is moved, the pointer in its parent must be updated to point to the new location. Pointer Maps are used to provide a lookup table to quickly determine what a pages parent page is. They only exist within auto-vacuum capable databases, which require the 32 bit integer value, read big endian, stored at byte offset 52 of the database header to be non-zero.

In auto-vacuum-capable SQLite databases page 2 of the database is always a Pointer Map page. Pointer Map pages store a 5 byte record relating to every page that follows the Pointer Map page. For example if we have an auto-vacuum-capable database that has 24 pages (each of 4096 bytes in size) in total, page 1 will contain the database header and the database schema and the next page, page 2, will be a Pointer Map page. This Pointer Map page will contain a 5 byte record for every one of the remaining 22 pages taking up 110 bytes of space within the page. The first 5 byte record begins at the very beginning of the Pointer Map page and therefore in a 4096 byte page a maximum of 819 (4096/5) records can be stored. If the database has more than 821 pages (when using a page size of 4096 bytes) page 822 would be an additional Pointer Map page that would contain records for the next 819 pages following this second Pointer Map page. Further additional Pointer Map pages can be added in the same way. Pointer Map pages do not store records relating to Pointer Map pages or page 1 of the database.

Pointer Map 5 byte records are structured with 1 byte indicating a Page Type and then 4 bytes, decoded big endian, referencing the parent page number as follows:

  • 0x01 0x00 0x00 0x00 0x00
    This record relates to a B-tree root page which obviously does not have a parent page, hence the page number being indicated as zero.
  • 0x02 0x00 0x00 0x00 0x00
    This record relates to a free page, which also does not have a parent page.
  • 0x03 0xVV 0xVV 0xVV 0xVV (where VV indicates a variable)
    This record relates to the first page in an overflow chain. The parent page number is the number of the B-Tree page containing the B-Tree cell to which the overflow chain belongs.
  • 0x04 0xVV 0xVV 0xVV 0xVV (where VV indicates a variable)
    This record relates to a page that is part of an overflow chain, but not the first page in that chain. The parent page number is the number of the previous page in the overflow chain linked-list.
  • 0x05 0xVV 0xVV 0xVV 0xVV (where VV indicates a variable)
    This record relates to a page that is part of a table or index B-Tree structure, and is not an overflow page or root page. The parent page number is the number of the page containing the parent tree node in the B-Tree structure.

Figure 1




The screenshot at Figure 1 shows the Pointer Map page of an iPhone SMS.db which uses a page size of 4096 bytes. The Page Type bytes are highlighted in light blue and occur at every fifth byte. The byte highlighted in green (0x00) is the first byte of the sequence that is not one of the page type bytes as described above and therefore indicates that there are no more records stored in this pointer map as can be seen within the highlighted darker blue area.

Extrapolating the database size from the Pointer Map page
If you count the Page Type bytes highlighted within Figure 1 in light blue you will find there are 22. This is because we have 22 pages following the Pointer Map page and therefore require 22 records. This allows us to conclude that there are 24 pages in total within this database (page 1, the Pointer Map page and then the 22 pages). By examining the 2 byte integer located at an offset of 16 bytes from the beginning of the database file we have determined that the page size within this database is 4096 bytes. 24 multiplied by 4096 equals 98304. The file size of this particular database is therefore 98,304 bytes which can also be seen within Figure 1.

Carving Considerations
To carve auto vacuum capable databases the following steps would be needed:

  • Identify first page of database by detecting the SQLite format 3 header
  • Establish page size by reading the 2 bytes at offset 16 as a 16 bit integer big endian
  • Check the 4 byte 32 bit integer at offset 52 for a non zero value indicating that the database is auto vacuum capable
  • Go to page 2 of the database at Offset page size
  • If value is 0x01 or 0x02 or 0x03 or 0x04 or 0x05 set a counter to 1
  • Move five bytes forward and if value is 0x01 or 0x02 or 0x03 or 0x04 or 0x05 increment the counter by 1
  • or if the value is not 0x01 or 0x02 or 0x03 or 0x04 or 0x05 begin the calculation of database file size using the formula

  • database size = (counter value + 2) x page size

The above discounts the possibility of their being more than one pointer map page. Some additional logic would be needed to cater for this eventuality. Pointer map pages may contain page size/5 records. If the counter increments to a point where it equals this value it would be necessary to locate the next pointer map page using the formula:

next pointer map page number = (page size/5) + 2 + number of existing pointer map pages.

To calculate the offset to this page use the formula:

offset = (page number-1) x page size.

References
http://www.sqlite.org/fileformat.html
http://www.sqlite.org/fileformat2.html
http://www.sqlite.org/src/artifact/cce1c3360c


Wednesday, 27 April 2011

Carving SQLite databases from unallocated clusters

Have you missed me?

Background
Carving SQLite databases from unallocated clusters is problematic because although these types of database have a header, there is no footer and the length of the file is not normally stored within the file either. Given that SQLite databases are used by so many programs now (e.g.Firefox, Google Chrome and numerous Mac OSX and IoS applications) to store data of forensic interest it would be useful to recover them from unallocated clusters. I am aware that there has been some comment in this area within our industry blogs and forums. A paper entitled Forensic analysis of the Firefox 3 Internet history and recovery of deleted SQLite records written by Murilo Tito Pereira became available in 2009 (at a cost) but my interest was piqued more recently when Rasmus Riis (who I believe works for Law Enforcement in Denmark) posted an enscript he had written - Chrome SQlite db finder v1.4 to Guidance Software's enscript download center. Rasmus Riis's approach to carving SQLite files used by Google Chrome is to identify the header and then carry out additional checking throughout the file for known values within page headers. I had mixed results using this enscript and also wondered whether it could be adapted to search for other SQLite databases, in particular the SMS.db used by the iPhone. So as a result I took a closer look at the problem.

SQLite Database Structure and File Format
The SQLite file format is well documented at http://www.sqlite.org/fileformat.html and also at http://www.sqlite.org/fileformat2.html. What I will try to do here is pick out the salient points that may be relevant to carving SQLite files from unallocated clusters, and also provide some commentary where it may be useful.

Size and numbering

  • The entire database file is divided into equally sized pages - SQLite database files always consist of an exact number of pages
  • The page size is always a power of two between 512 (29) and 65536 (216) bytes
  • All multibyte integer values are read big endian
  • The page size for a database file is determined by the 2 byte integer located at an offset of 16 bytes from the beginning of the database file
  • Pages are numbered beginning from 1, not 0
  • Therefore to navigate to a particular page when you have a page number you have to calculate the offset from the start of the database using the formula:
    offset = (page number-1) x page size

Possible Page Types
Each page is used exclusively to store one of the following:

  • An index B-Tree internal node
  • An index B-Tree leaf node
  • A table B-Tree internal node
  • A table B-Tree leaf node
  • An overflow page
  • A freelist page
  • A pointer map page
  • The locking page

However not every database will include all of these items.

The first page (page 1)
The first page of the database is a special page for two reasons; it contains within the first 100 bytes of the page the database header and it also contains the Database Schema (the structure of the database [tables, indexes etc.] described in formal language).

The database header begins with the following 16 byte sequence:

0x53 0x51 0x4c 0x69 0x74 0x65 0x20 0x66 0x6f 0x72 0x6d 0x61 0x74 0x20 0x33 0x00

which when read as UTF-8 encoded text reads SQLite format 3 followed by a nul-terminator byte.

Other significant values at offsets within the database header are as follows:

  • Offset 16 2 bytes 16 bit integer read big endian
    Page Size
  • Offset 28 4 bytes 32 bit integer read big endian
    The logical size of the database in pages which is only populated when the database was last written by SQLite version 3.7.0 or later. This field is only valid if it is nonzero and in all examples of SQLite databases I have examined this value was zero, so unfortunately not as exciting as it first appears!
  • Offset 32 4 bytes 32 bit integer read big endian
    Page number of first freelist trunk page
  • Offset 36 4 bytes 32 bit integer read big endian
    Total number of freelist pages including freelist trunk pages
  • Offset 52 4 bytes 32 bit integer read big endian
    The highest numbered root page number if the database is auto-vacuum capable, for non-auto-vacuum databases, this value is always zero. The majority of the databases we are likely to be interested in are non-auto-vacuum databases.

B-Tree pages

The SQLite.org file format notes helpfully state that:

A large part of any SQLite database file is given over to one or more B-Tree structures. A single B-Tree structure is stored using one or more database pages. Each page contains a single B-Tree node. The pages used to store a single B-Tree structure need not form a contiguous block.

So from a carving perspective we note that most of what we wish to carve are B-Tree pages but they are not necessarily stored contiguously. We can however identify B-Tree pages because they have a page header 8 bytes in length if the page is a leaf node page and 12 bytes in length if the page is an internal node page. In all pages, with the exception of page 1, the header starts at the beginning of the page at offset 0. On page 1 the header starts at offset 100.

The first byte of all B-Tree page headers is a flag field, each flag is detailed as follows:

  • 0x02
    flag indicating index B-Tree internal node
  • 0x0A
    flag indicating index B-Tree leaf node
  • 0x05
    flag indicating table B-Tree internal node
  • 0x0D
    flag indicating table B-Tree leaf node

These flags therefore allow us to potentially identify B-Tree pages (of all types) by examining the first byte of each page to see if it is either 0x02, 0x0A, 0x05 or 0x0D.

The B-Tree page header also contains some other potentially useful values (offsets from start of page):

  • Offset 1 2 bytes 16 bit integer read big endian
    Byte offset of first block of free space on this page (0 if no free blocks)
  • Offset 3 2 bytes 16 bit integer read big endian
    Number of entries (cells) on this page
  • Offset 5 2 bytes 16 bit integer read big endian
    Byte offset of the first byte of the cell content area relative to the start of the page. If this value is zero, then it should be interpreted as 65536

Freelist pages

Once again the SQLite.org file format notes help us out:

A database file might contain one or more pages that are not in active use. Unused pages can come about, for example, when information is deleted from the database. Unused pages are stored on the freelist and are reused when additional pages are required.

Each page in the freelist is classified as a freelist trunk page or a freelist leaf page. All trunk pages are linked together into a singly linked list. The first four bytes of each trunk page contain the page number of the next trunk page in the list, formatted as an unsigned big-endian integer. If the trunk page is the last page in the linked list, the first four bytes are set to zero.

The operation of the freelists might be better understood by a quick forensic examination of them:

The four bytes highlighted in blue are at offset 32 within the database header and decoded as a 32 bit integer big endian give a decimal value of 61 - this is the page number of the first freelist trunk page. The four bytes highlighted in green are at offset 36 within the database header and decoded as a 32 bit integer big endian give a decimal value of 70 - this is the total number of free pages including freelist trunk pages.

The page number of the first freelist trunk page is 61. To calculate the offset from the start of the database for this page we use the formula offset = (page number-1) x page size which in this case is (61-1) x 4096 = 245760.

The offset 24570 takes us to start of the first freelist trunk page. There we find an array of 4 byte big endian integers. The first four bytes (highlighted in green) decoded big endian denote the page number of the next freelist trunk page - in this case the value of the first four bytes is zero indicating that there is no more free freelist trunk pages. The second four bytes (highlighted in blue), in this case 0x00 0x00 0x00 0x45 when decoded as a 32 bit integer big endian give a value of decimal 69 - this is the number of leaf page pointers to follow. The remaining 69 blocks of 4 bytes (alternately highlighted in green and blue in the screen shot) each represent, when decoded as a 32 bit integer big endian, the page number of a free page. Examination of each free page revealed that the entire page had been zeroed out - although this may be a function of the application using the database, not a function of SQLite itself.


Pointer map pages
Pointer map pages will only exist if the database is auto-vacuum capable and the value within the 4 bytes of the database header at offset 52 is non zero. In a database with pointer map pages, the first pointer map page is page 2. The first byte of a pointer map page is one of five values 0x01, 0x02, 0x03, 0x04 or 0x05. Many of the databases we have a forensic interest in are not auto-vacuum capable and therefore do not have pointer map pages, however where they do (iPhone SMS.db for example) it is possible to calculate the length of the of the SQLite database by extrapolating information from the pointer map page entries. I will cover this in my next blog post.


Locking Page
The locking page is the database page that starts at byte offset 230 (1,073,741,824) and always remains unused (i.e all zeros). Most databases will not be big enough (> 1GB) to require a locking page.


Overflow pages
Once again the SQLite.org file format notes help us out:

Sometimes, a database record stored in either an index or table B-Trees is too large to fit entirely within a B-Tree cell. In this case part of the record is stored within the B-Tree cell and the remainder stored on one or more overflow pages. The overflow pages are chained together using a singly linked list. The first 4 bytes of each overflow page is a big-endian unsigned integer value containing the page number of the next page in the list. The remaining usable database page space is available for record data.

We can calculate that for the first byte to be any value other than 0x00 there must be at least 16,777,216 pages within the database (0x01 0x00 0x00 0x00 decoded big endian). At a page size of 4096 bytes this would equate to a database size of 64 Gigabytes. In most cases therefore we can discount the first byte of overflow pages being anything other than 0x00.


So how does this help with carving SQLite databases then?
We can use the database header and the first byte value of each page thereafter to determine whether the data within each page size block is valid. So from a carving perspective we can identify the first page with the database header, calculate the page size, read the next page and validate the first byte and so on until the first byte validation fails.

This essentially is what Rasmus Riis's Chrome SQlite db finder v1.4 enscript does for SQLite databases created by the Google Chrome web browser. His description of the enscripts functionality states that it checks the first byte of each page, other than the first page, for the values 13,10, 2, 5 or 0. Convert these values into hex and you get 0x0D, 0x0A, 0x02, 0x05 and 0x00. The first four of these values are the flags found at the first byte of the B-Tree page headers discussed above. Additionally he checks for 0x00 which may well be the first byte of either a free page or an overflow page. The script does not however allow 0x01, 0x03 or 0x04 to be the first page byte. This therefore does not allow for an auto-vacuum capable database. The databases used by Google Chrome are not auto-vacuum capable, however other databases such as the iPhone SMS.db database are.

Rasmus's enscript also carries out a test of the two bytes at offset 100 within the first database page. The enscript according to the description in the download center looks for the values 2243, 3853 or 3331. The coding however shows that it checks for 3343 or 3853 or 3331 decoded big endian. Converted to hex the first two bytes would be 0x0D 0x0F or 0x0F 0x0D or 0x0D 0x03. I am not sure what Rasmus had in mind for the second value 3853 (which is 0x0D 0x0F converted little endian) but focussing on the first and the third pairs of two bytes the script is taking the flag byte and the first byte of the two bytes used to store the Byte offset of first block of free space on the page. The first page of all SQLite databases beyond the 100 byte database header store the database schema and are therefore constant (in other words because each particular database has a unique set of tables and indexes the first page of a particular database does not change from instance to instance). Because the database schema does not change the combining of the first and second bytes seems to work in order to identify Chrome databases.

This lead me on to consider whether an analysis of the following values within the B-Tree page header of the first page containing the database schema would allow the identification of other types of SQLite databases. The following table appears to suggest that it would be possible to establish a fingerprint for each database type. I have not collected enough test data to be certain about this yet.

With regards to Rasmus's enscript because he was kind enough to share and also not Enpack it is possible to make some small changes to the code to allow it to parse unallocated for all types of SQLite.db. I am in touch with my programming friends to create a script that can carve and identify SQLite databases from the fingerprint discussed above and also include a greater amount of error checking.

Other stuff of note
When you examine data within SQLite databases have you noticed that most of the meaningful stuff is bunched up at the end of each page? The SQLite.org file format notes help us out here:

The total amount of free space on a b-tree page consists of the size of the unallocated region plus the total size of all freeblocks plus the number of fragmented free bytes. SQLite may from time to time reorganize a b-tree page so that there are no freeblocks or fragment bytes, all unused bytes are contained in the unallocated space region, and all cells are packed tightly at the end of the page. This is called "defragmenting" the b-tree page.

To Do
I have not as yet covered the part the journal files may play in the recovery of SQLite data from unallocated, a future post perhaps.

Thanks
to Rasmus Riis for sharing his enscript
and to Chris Vaughan for his help in firming up some of the theory.

References
http://www.sqlite.org/fileformat.html
http://www.sqlite.org/fileformat2.html
http://www.sqlite.org/requirements.html
http://www.sqlite.org/hlr30000.html
https://support.guidancesoftware.com/forum/downloads.php?do=file&id=1116




Tuesday, 4 January 2011

Reporting and Exporting Emails from Encase

Regular readers will know that I champion Encase for most forensic tasks but I have to admit that my favourite forensic tool does not handle the investigation of email very well.

My friend Oliver Smith, over at Cy4or, had cause to run a keyword search across a number of emails. These emails were contained in a number of email containers including dbx and pst files and the Encase email search had been carried out. The emails were reviewable in the records tab. There was a need however for the client to review emails that contained certain keyword hits. Encase provides an export to .msg facility whereby emails can be exported in the .msg format allowing a review using Microsoft Outlook. It is a therefore a simple task to filter the records tab to display only email with search hits (that is those with a value in the Search Hits column). Then by selecting those records you can export them as .msg files.

The problem with this approach is that it is difficult to marry up the exported .msg files to a report detailing each msg files provenance. So in a case where many thousands of emails have been exported it is a real pain to provenance the relevant emails after the client's review. Depending on the email container concerned (e.g. pst, dbx etc.) Encase names the .msg file either by its subject or by some arbitrary description (Inbox.dbx-0.msg, Inbox.dbx-1.msg and so on). In situations where the client has copied notable emails out of the original export directory it can be very difficult to quickly trace the source email container.

To address this problem Oliver has written an enscript to export selected emails to .msg along with a report detailing their provenance.


The html report contains hyperlinks to each message along with the emails provenance and it's metadata. The file name of each exported email marries up to the reference number in the report.


As always if you would like a copy please email me.

Richard Drinkwater