Coding of PSION-S5-DATA file

Note: All information in this document is based on my own tests, because PSION has published no documentation of the file structures of the S5. The information in this document may therefore be wrong. However, I thought that a buggy converter which works in most cases is better than no converter.

This document is an addition to Frodo Looijaard's documentation of the PSION file formats, so you will have to read that documentation first to understand the terms used in this document.

File type 0x10000050

Files of type 0x10000050 are files which remain on disc, like databases. Data is read and modified directly on disc. There may be additional data structures in RAM, but the structure on disc is the leading one. Therefore do not use this structure inside your own programs, except in converters from or to it: it may be seen as a part of the program, so you may provoke Psion/Symbian if you use it in your own databases or organizers.

File type 0x10000037 is used for files which are first loaded from disc into RAM, modified there and afterwards written back to disc. The structure in RAM may be completely different from the structure on disc. Even though I do not see this structure as a part of a program but as a public interface concept, do not use it in your own programs, because you may provoke Psion/Symbian.

Fundamental structures

MLIST

There seems to be a new kind of list. Because I was not able to decode it completely, I called it a mystery coded list. Normally a word at the beginning of the list codes the number of bytes inside the list. However, bit 14 is set and bit 15 is clear, and both must be masked out to get the true length. Because very long sections must be encoded with this structure (e.g. all the information inside the datasets of the DB), it is likely that a coding for longer lengths exists too. My name for this list is MLISTE or MLISTB. It is not clear to me why there is an additional list, because everything could be coded by the existing ones.
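The masking of the length word can be sketched in Python as follows. This is only an illustration of the observation above: the function name `mlist_length` is made up here, and the little-endian byte order of the word is an assumption (it matches the byte order seen in the Date examples later on).

```python
def mlist_length(buf: bytes, offset: int = 0) -> int:
    """Return the byte count coded in the MLIST word found at `offset`."""
    # Assumption: the word is stored little-endian, low byte first.
    word = int.from_bytes(buf[offset:offset + 2], "little")
    # Observed pattern: bit 15 clear, bit 14 set. Anything else is unknown.
    if word & 0xC000 != 0x4000:
        raise ValueError(f"not a known MLIST length word: {word:#06x}")
    # Mask out bits 14 and 15 to get the true length.
    return word & 0x3FFF
```

For example, the two bytes `05 40` would give a length of 5. Other bit patterns are rejected because their meaning has not been decoded.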

SStrings

There is a reduced form of strings which is coded as a BLISTB list. I will call this SString (Short String).
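A small Python sketch of reading such an SString, assuming that BLISTB means one length byte followed by that many content bytes. The function name `read_sstring` is made up, and `latin-1` is only a stand-in for the real PSION character set, which is not covered in this document.

```python
def read_sstring(buf: bytes, offset: int = 0, encoding: str = "latin-1"):
    """Read an SString at `offset`; return (text, offset of the next item)."""
    length = buf[offset]                  # B: number of bytes that follow
    start = offset + 1
    end = start + length
    if end > len(buf):
        raise ValueError("SString runs past the end of the buffer")
    # Assumption: one byte per character; the code page is a placeholder.
    return buf[start:end].decode(encoding), end
```

Returning the next offset makes it easy to read the element which follows the string, as needed in the tables below.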

IDs

Following IDs are used inside the Data file:

0x10000069 - Internal Table Storage Section

0x10000086 - Data-ID

0x10000089 - APP-ID

0x1000012E - Table Definition Section

0x10000131 - View Definition Section

0x10000132 - unknown Section 3

0x10000133 - unknown Section 5

0x10000137 - unknown Section 4

Type of Field

0x00 - Boolean

0x05 - Integer

0x09 - Double

0x0A - Date

0x0B - Text

0x0E - Memo

0x10 - Format. Occurs always with Text/Memo. May contain embedded object information

D

Double is stored using 8 bytes according to the IEEE double precision format. Note that the more significant long, which contains the exponent, is stored at positions 4-7 on disc and the less significant long, which contains the low part of the mantissa, at positions 0-3.
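The layout described above is the ordinary little-endian IEEE double, so in Python it can be read with the standard `struct` module. The function name `read_double` is only chosen for this sketch.

```python
import struct


def read_double(buf: bytes, offset: int = 0) -> float:
    """Read a D value: 8 bytes, less significant long first."""
    return struct.unpack_from("<d", buf, offset)[0]


# The value 1.0 is 0x3FF0000000000000, so the long with the exponent
# (0x3FF00000) ends up in bytes 4-7 of the stored form.
one = bytes.fromhex("000000000000F03F")
```

Calling `read_double(one)` yields 1.0.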

Date

Date information is stored in 8 bytes. It seems to be 2 longs: the lower long is stored at the beginning, the higher long at the end. Now enter two dates with one day difference:

00E088F28F42E000 = 10.04.2000

- 0080B1D47B42E000 = 09.04.2000

Internal Form:

0x00E0428FF288E000

- 0x00E0427BD4B18000

------------------

0x000000141DD76000

The result is 8.64E10 in decimal. An hour has 60 minutes and a minute has 60 seconds. Divide the result by 3600 and you get 24,000,000; note that our difference was one day, which is 24 hours. That means the time unit is 1/1,000,000 of a second! The base day is 1.1.0000. There is an important point to note: PSION implements the Gregorian calendar scheme. That means:

1. Every fourth year is a leap year (that means the year has 366 days)

2. If the year can be divided by 100 it is not a leap year

3. If the year can be divided by 400 it is a leap year

The Gregorian calendar has been in use since 1582. To correct the difference to the Julian calendar, the days from 5.10.1582 until 14.10.1582 were officially skipped. This has not been implemented by PSION. You can enter these days, and PSION applies rules 2 and 3 only from 1600 (inclusive) on! That means e.g. that the year 1500 is a leap year, which is not in accordance with rules 2 and 3 of the Gregorian calendar but is in accordance with the Julian calendar.

The system is able to handle dates before Christ (BC) too. All leap years in this range of time are handled in accordance with the Julian calendar:

00207C65EA16EEFF = 01.04.-160

Internal Form:

FFEE16EA657C2000

This is the usual negative integer format (two's complement). It works like the coding in bytes:

-1 -> 0xFF

-2 -> 0xFE

...
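The observations of this Date section can be put together in a small Python sketch: a signed 64-bit little-endian count of microseconds since 1.1.0000, with the century rules applied only from 1600 on. All names are made up for this sketch, and the existence of a year 0 is taken from the examples above rather than from any PSION documentation.

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000
MONTH_DAYS = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def is_leap(year: int) -> bool:
    """Leap year rule as observed: century rules only from 1600 on."""
    if year % 4 != 0:
        return False
    if year >= 1600 and year % 100 == 0:
        return year % 400 == 0
    return True


def year_days(year: int) -> int:
    return 366 if is_leap(year) else 365


def decode_date(raw: bytes):
    """Return (year, month, day, microseconds within the day)."""
    micros = int.from_bytes(raw[:8], "little", signed=True)
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    year = 0
    while days < 0:                      # dates before the base day
        year -= 1
        days += year_days(year)
    while days >= year_days(year):       # walk forward year by year
        days -= year_days(year)
        year += 1
    month = 0
    for month, length in enumerate(MONTH_DAYS, start=1):
        if month == 2 and is_leap(year):
            length = 29
        if days < length:
            break
        days -= length
    return year, month, days + 1, micros_of_day
```

With the three byte sequences quoted above, `decode_date(bytes.fromhex("00E088F28F42E000"))` gives 10.04.2000, the second sequence gives 09.04.2000 and `00207C65EA16EEFF` gives 01.04.-160, each at 0 microseconds into the day.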

General Data file concept

The Data file consists of three parts. First there is a header section, then a data section follows. The data section contains several subsections. The addresses of these subsections are managed by a Section Location Table, which follows the data section.

Header Section

The header section is always 0x1C bytes long.

Offset Size Data Description

0x0000 ID 0x10000050 UID1: Header Section Layout

0x0004 ID 0x1000006D UID2: File kind

0x0008 ID 0x10000086 UID3: Application ID

0x000C L 0x5508A1FE UID4: Checksum of UID1, UID2 and UID3

0x0010 L unknown

0x0014 L 0x00000000 or the number of entries in the Section Location Table

0x0018 L Offset, counted from 0x0014 on, to the Section Location Table if the long at 0x0014 is 0x00000000
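The seven longs of the header can be unpacked in one step. The following Python sketch uses made-up field names and assumes little-endian longs; it does not try to verify the checksum in UID4.

```python
import struct
from typing import NamedTuple


class Header(NamedTuple):
    uid1: int            # 0x10000050, header section layout
    uid2: int            # 0x1000006D, file kind
    uid3: int            # 0x10000086, application ID
    uid4_checksum: int   # checksum of UID1..UID3 (not verified here)
    unknown: int
    slt_entries: int     # 0, or number of Section Location Table entries
    slt_offset: int      # offset from 0x14 on, used if slt_entries is 0


def parse_header(buf: bytes) -> Header:
    if len(buf) < 0x1C:
        raise ValueError("file is shorter than the 0x1C byte header")
    header = Header(*struct.unpack_from("<7I", buf, 0))
    if header.uid1 != 0x10000050:
        raise ValueError("not a file of type 0x10000050")
    return header
```

A converter would call `parse_header` on the first 0x1C bytes of the file before looking at anything else.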

Data Section

The data section consists of subsections. Each subsection is encapsulated in a MLISTB list. Note that there may be unused areas between the subsections. If possible, these areas are removed when the program packs a table. If you are generating a new database you must build it without such unused areas.

There is an important, silly effect to note. The data section is divided into logical blocks of 0x4000 bytes! 2 bytes precede each such block (so a block is 0x4002 bytes in total). Maybe this is a checksum or a space usage indicator. The first block behaves specially: its bytes are located at 0x0000001C, then 0x4002 bytes of data follow! Note that the special bytes are always placed at fixed locations. They are inserted into the real information. That means that you have to strip them out of the data stream when reading bytes and, much more horrible, the lists do not count these bytes, so you have to calculate them out! This may be done by using special basic I/O operations which hide the bytes from the rest of the program. The following documentation is written as if these bytes were not present. There is one more remarkable, silly effect at the MLISTB entry which encapsulates a section: if it encapsulates a section which crosses such a block boundary, its length is zero! (Not sure! Check after reorganization of the DB too; not sure at the end.)

Offset Size Data Description

0x0000 LLISTB
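The idea of hiding the special block bytes behind a basic I/O layer could look like the following Python sketch. It follows the literal reading of the description above (2 special bytes at 0x1C, then 0x4002 bytes of data, then 2 special bytes before every further 0x4000 bytes), which is itself uncertain, so both sizes are parameters. The function name is made up, and whether the Section Location Table at the end of the file is affected in the same way is not known.

```python
HEADER_LEN = 0x1C
MARKER_LEN = 2


def strip_block_markers(raw: bytes,
                        first_data_len: int = 0x4002,
                        data_len: int = 0x4000) -> bytes:
    """Return everything after the header with the special bytes removed."""
    out = bytearray()
    pos = HEADER_LEN
    chunk = first_data_len              # the first block is special
    while pos < len(raw):
        pos += MARKER_LEN               # skip the two special bytes
        out += raw[pos:pos + chunk]     # copy the real information
        pos += chunk
        chunk = data_len                # all later blocks
    return bytes(out)
```

All offsets given in the rest of this document would then refer to positions in the returned byte string (plus the header length where stated).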

Overview over the information stored in the Data Section

The database is defined by the declaration of tables. A table is defined by its fields. All datasets of a table have the same fields, even though not every field must actually be filled with data. There are three definitions of the fields. First there is the basic declaration of the fields, that means the field name and the type of data it contains. I will call this the Table Definition Section. Second there are the definitions of how to display tables on the screen. I will call this the View Definition Section. Last there is an internal definition which tells how the fields are stored. I will call this the Internal Table Storage Section. Of course the content of the datasets itself must be stored too. I will call this the DB-Dataset Content Section (named Table Content Section in the rest of this document). Note that there will normally be a lot of such sections.

Section Location Table

The Section Location Table contains offsets (without the block bytes!) to the individual subsections inside the data section. Some sections seem to be placed at fixed positions. I do not think that this is really the case, but until now I cannot prove it. I assume that there is always a reference from one section to another, but until now I have not found these references in every case. There are references inside the ID-Binding Table, the Table Definition Section, unknown Section 2 and the Table Content Sections which tell what kind of subsection an element of the Section Location Table is pointing to. Until now the following kinds of entries are known (not every kind of entry must occur in each data file):

unknown Section 1 (always position 1?, or may be referenced by Table Definition Section)

Internal Table Storage Section (always position 2?, or may be referenced by Table Definition Section)

Table Content Sections (First is referenced by unknown Section 2)

unknown Section 2 (always position 4?)

Application Section (Referenced by ID-Binding Table)

ID-Binding Table (Referenced by an entry at the beginning of the Section Location Table)

Table Definition Section (Referenced by ID-Binding Table)

View Definition Section (Referenced by ID-Binding Table)

unknown Section 3 (Referenced by ID-Binding Table)

unknown Section 4 (Referenced by ID-Binding Table)

Memo Content Section (Referenced by a dataset memo field)

Format Content Section (Referenced by a dataset format field)

unknown Section 5 (Referenced by ID-Binding Table)

In some cases the known references to this table carry additional information in the top byte of the long containing the reference. The meaning of this information is unknown.

Note that there are two ways to determine the beginning of the Section Location Table. If the long at 0x00000014 is 0x00000000, the long at 0x00000018 contains the offset, counted from 0x00000014 on, to the start of the Section Location Table. If the long at 0x00000014 is not zero, it contains the number of elements in the Section Location Table. Multiply this by 5 and add 12 (3*4 bytes for 3 longs). Go this number of bytes backward from the end of the file to find the beginning of the Section Location Table.

Offset Size Data Description

0x0000 L Number of the element of the Section Location Table which points to the ID-Binding Table.

0x0004 L unknown

0x0008 LLISTE E=constant 5

Section Location Table Element

Offset Size Data Description

0x0000 B unknown. normally 0 but other values possible. May be a usage tag or something like that

0x0001 L 0x1C added to value gives the offset to the beginning of a special subsection of the Data Section
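Both ways of finding the Section Location Table, and the decoding of its 5-byte elements, can be sketched in Python as follows. The function names are made up. The sketch assumes little-endian longs, assumes that LLISTE means a long element count followed by the elements, and assumes a buffer in which the special block bytes of the data section are not present.

```python
import struct

HEADER_LEN = 0x1C


def locate_slt(buf: bytes) -> int:
    """Return the offset at which the Section Location Table starts."""
    entries, offset = struct.unpack_from("<2I", buf, 0x14)
    if entries == 0:
        return 0x14 + offset                 # offset counted from 0x14 on
    return len(buf) - (entries * 5 + 12)     # count back from end of file


def read_slt(buf: bytes):
    """Return (ID-Binding Table element number, [(tag, section offset)])."""
    start = locate_slt(buf)
    id_binding, _unknown, count = struct.unpack_from("<3I", buf, start)
    elements = []
    pos = start + 12
    for _ in range(count):
        tag, value = struct.unpack_from("<BI", buf, pos)
        elements.append((tag, value + HEADER_LEN))   # 0x1C added to value
        pos += 5
    return id_binding, elements
```

Whether element numbers count from 0 or from 1 is not decided by this sketch; it only returns the elements in file order.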

Unknown Section 1

Offset Size Data Description

0x0000 MLISTB

content of unknown Section 1

Offset Size Data Description

0x0000-0x0008 unknown 00 00 00 00 00 00 00 00 00

Internal Table Storage Section

It describes how the information of the tables is stored. There is a list which contains one or more entries belonging to each field of a table.

Offset Size Data Description

0x0000 MLISTB

content of Internal Table Storage Section

Offset Size Data Description

0x0000 ID 0x10000069

0x0004 B unknown 00

0x0005 L unknown

0x0009 XLISTE Table Storage Definition Table

0xXXXX-0xXXX6 unknown 20 04 00 00 00 00

Coding of an Element of the Table Storage Definition Table

Offset Size Data Description

0x0000 SLISTB Name of the table

0xXXXX XLISTE Table of DB-Field Storage Binding Field Elements of various length

Coding of a DB-Field Storage Binding Field Element

Offset Size Data Description

0x0000 SLISTB field identification

0xXXXX B type of field

[0xXXX1- Depends on field type]

The coding of a DB-Field Storage Binding Field Element depends on the type of the field information which is coded. The first two entries are the same for all types.

The first entry is a string and gives the name of the element. The name always begins with "Col", followed by a letter which indicates that the element belongs to a particular field. A new field always starts with the letter A. Further elements of the same field may follow, which are named "B", "C" and so on. After the letter, the number of the database field follows in decimal digits!

The second entry is a byte which gives the type of the element. Everything after it depends on the type of the element.
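The naming scheme of the first entry can be taken apart with a regular expression. This Python sketch uses a made-up function name and assumes that the letter is a single upper case letter and that nothing follows the digits.

```python
import re

_STORAGE_NAME = re.compile(r"Col([A-Z])(\d+)")


def parse_storage_name(name: str):
    """Split e.g. 'ColA12' into (letter, field number, starts_new_field)."""
    match = _STORAGE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected storage element name: {name!r}")
    letter, number = match.groups()
    # The letter A marks the first element of a database field.
    return letter, int(number), letter == "A"
```

A converter can use the third value to group the elements: every element whose letter is A opens a new field, and elements with B, C and so on belong to the field opened before.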

Boolean

Offset Size Data Description

0x0000 String internal Name of the field

0xXXXX B 0x00 type of element: Boolean

0xXXX1 B unknown always 0

Integer

Offset Size Data Description

0x0000 String internal Name of the field

0xXXXX B 0x07 type of element: Integer

0xXXX1 B unknown always 0

Double

Offset Size Data Description

0x0000 String internal Name of the field

0xXXXX B 0x09 type of element: Double

0xXXX1 B unknown always 0

Date

Offset Size Data Description

0x0000 String internal Name of the field

0xXXXX B 0x0A type of element: Date

0xXXX1 B unknown always 0

Text

Offset Size Data Description

0x0000 String internal Name of the field

0xXXXX B 0x0B type of element: Text

0xXXX1 B unknown always 0

0xXXX2 B maximal length of the string

Memo

Offset Size Data Description

0x0000 String internal Name of the field

0xXXXX B 0x0E type of element: Memo

0xXXX1 B unknown always 0

Format

Offset Size Data Description

0x0000 String internal Name of the field

0xXXXX B 0x10 type of element: Format

0xXXX1 B unknown always 0

Table Content Section

These sections contain the information stored in the database. For internal reasons explained later, the database is split up into several Table Content Sections. Every section contains a fraction of the datasets. The first Table Content Section contains a pointer to the next one and so on. Up to 16 datasets may be stored inside one section. Note that the address of a section may change when the content of its datasets is changed. The old address of the section is lost and the space occupied before the change becomes an unused area. If there are too many unused areas in the DB it is packed automatically. This means that the program tries to move used sections from the end of the file forward into unused areas. If they do not fit exactly, there may still be unused areas after the packing process. Some sections seem never to change their location.

Offset Size Data Description

0x0000 MLISTB

content of Table Content Section

Offset Size Data Description

0x0000 L Number of element of Section Location Table which gives the offset to the next Table Content Section or 0 if this is the last Table Content Section. (Top byte contains special Information of unknown meaning, so mask it out).

0x0004 W Dataset representation flags

0x0006-0xXXXX Dataset length table

0xXXX1 content of datasets

Dataset representation flags

To handle the information stored inside a Table Content Section you have to know how many datasets are stored inside it. Note also that it may be necessary to address a dataset by its internal unique number, so if a dataset is deleted the remaining ones must not be renumbered. To meet these requirements there is a word of 16 bits. Each bit represents a dataset. A "1" indicates that the dataset exists. In this case its length is stored in the Dataset length table and its content in the content of datasets. A "0" indicates that the dataset is not present. Then there are no entries belonging to it in the other parts of the Table Content Section. This scheme keeps the internal numbering of the datasets and lets you detect where a new dataset can be placed.
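Reading the flag word can be sketched in Python like this. The function names are made up, and the assignment of bit 0 to the first dataset slot is an assumption which the tests described here do not settle.

```python
def present_slots(flags: int):
    """Return the slot numbers (0..15) whose dataset exists."""
    return [slot for slot in range(16) if (flags >> slot) & 1]


def first_free_slot(flags: int):
    """Return the first slot without a dataset, or None if all 16 are used."""
    for slot in range(16):
        if not (flags >> slot) & 1:
            return slot
    return None
```

For the flag word 0x000D the slots 0, 2 and 3 are occupied, so the section holds three datasets and a new one could be placed in slot 1.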

Dataset length table

The Dataset length table contains the length of the content of each dataset which is stored inside this Table Content Section (that means the corresponding bit in the Dataset representation flags is "1"). The length itself is encoded like the length of an EXTRA-encoded list. That means that the length may be coded using 1, 2 or 4 bytes. If you want to jump to the third dataset inside this Table Content Section you can do this easily: add the first two entries of this list, scan to the end of the Dataset length table and from there jump the sum of bytes forward, without scanning the content of the other datasets.
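The jump described above can be sketched in Python. The 1, 2 or 4 byte length coding used in `read_x_length` is NOT taken from this document: it is the extended length coding as I understand it from the general PSION file format documentation (lowest bits `0`, `01` or `011` select the size, the remaining bits hold the value) and should be checked against that documentation. All function names are made up.

```python
def read_x_length(buf: bytes, pos: int):
    """Read one 1/2/4 byte coded length; return (length, next position)."""
    first = buf[pos]
    if (first & 0x01) == 0x00:                       # 1 byte, 7 bit value
        return first >> 1, pos + 1
    if (first & 0x03) == 0x01:                       # 2 bytes, 14 bit value
        return int.from_bytes(buf[pos:pos + 2], "little") >> 2, pos + 2
    if (first & 0x07) == 0x03:                       # 4 bytes, 29 bit value
        return int.from_bytes(buf[pos:pos + 4], "little") >> 3, pos + 4
    raise ValueError("unknown length coding")


def dataset_spans(buf: bytes, table_pos: int, flags: int):
    """Return (start, end) of each stored dataset, in table order."""
    count = bin(flags & 0xFFFF).count("1")   # one length per set flag bit
    lengths = []
    pos = table_pos
    for _ in range(count):
        length, pos = read_x_length(buf, pos)
        lengths.append(length)
    spans = []
    for length in lengths:                   # contents follow the table
        spans.append((pos, pos + length))
        pos += length
    return spans
```

Inside the content of a Table Content Section, `table_pos` would be 6, directly after the flag word.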

Content of datasets

The datasets are stored dataset by dataset and field by field. To save memory there is information which tells whether a field is present in a dataset or not. Each field is represented by a bit, so 8 fields are encoded in one byte (the first bit represents field 1, the second bit field 2 and so on). After this byte the contents of the present fields follow, for up to 8 fields. Then the next presence byte follows and so on.

One exception to this rule is made for Boolean fields. There is also one bit which tells whether the field exists or not. If it exists, the next bit of the presence byte tells you whether the field is true or false. Another exception is made for Memo fields. Here the first bit tells whether the field exists or not. If it exists, the next bit tells how the content of the Memo field is stored. The same scheme is used for Format fields. Note that some field existence bits may be missing when no field information follows them. This can be detected when the end of the current dataset is reached!

It is still open what happens if a new presence byte must be introduced because a field is added. There seems to be the following rule: no empty fields are added to existing datasets, even if the new field can be represented by a constant amount of memory. It also seems that if a dataset is added, space is allocated for every content which can be stored in a constant amount of memory. If such fields are changed, this is simply done by changing the content of the corresponding piece of storage.

Strings are stored using an amount of memory which corresponds to their actual length. Therefore the storage must be reorganised if the length changes. It seems that the following scheme is used: the whole Table Content Section which contains the changed dataset is copied to the end of the file, with the changed content of the field. The Section Location Table is modified afterwards, and the address of the old, now invalid Table Content Section is replaced by the new one. This copy process is the reason for splitting up the database into several Table Content Sections. Otherwise, in the worst case, the whole content of the database would have to be duplicated, which cannot be tolerated for large databases. Note that Memo fields may be stored in a separate section.
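How the bits of one presence byte are distributed over the fields can be sketched in Python. This is only an interpretation: the function name is made up, the bits are assumed to be used from the lowest bit upward, the extra bit is assumed to follow directly after the existence bit and only when the field exists, and what happens when the extra bit would fall into the next byte is not known (the sketch then leaves it as None).

```python
BOOLEAN, MEMO, FORMAT = 0x00, 0x0E, 0x10
TWO_BIT_TYPES = {BOOLEAN, MEMO, FORMAT}


def decode_presence_byte(mask: int, field_types):
    """Return [(field type, present, extra bit or None)] for one byte."""
    states = []
    bit = 0
    for ftype in field_types:
        if bit >= 8:                       # byte used up, next byte needed
            break
        present = bool((mask >> bit) & 1)
        bit += 1
        extra = None
        if present and ftype in TWO_BIT_TYPES and bit < 8:
            # Boolean: the value. Memo/Format: the kind of storage.
            extra = bool((mask >> bit) & 1)
            bit += 1
        states.append((ftype, present, extra))
    return states
```

For the field types Text, Boolean, Integer, Memo and the byte 0x17, this reading gives: Text present, Boolean present and true, Integer not present, Memo present with its storage bit clear.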

Boolean

Offset Size Data Description

Nothing, but coded with an extra bit inside existence mask

Integer

Offset Size Data Description

0x0000 L

Double

Offset Size Data Description

0x0000 D

Date

Offset Size Data Description

0x0000 Date

Text

Offset Size Data Description

0x0000 SString

Memo

May be coded in two ways. An extra bit inside existence mask tells what kind of coding is valid. If it is false then a Memo Content Section is used to keep the content of the Memo field. If it is true the content of the Memo field is stored as a normal Text.

Offset Size Data Description

0x0000 L Number of the Section Location Table element which points to the corresponding Memo Content Section. (Attention: the top byte has a special unknown meaning, so ignore it by masking it out!)

0x0004 L Length of Memo

0x0000 SString
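For the first coding (Memo stored in its own section), reading the reference and masking out the top byte can be sketched in Python as follows. The function name is made up and little-endian longs are assumed.

```python
import struct


def read_memo_reference(buf: bytes, offset: int = 0):
    """Return (Section Location Table element number, memo length)."""
    raw_index, length = struct.unpack_from("<2I", buf, offset)
    # The top byte has an unknown meaning, so only the low 24 bits count.
    return raw_index & 0x00FFFFFF, length
```

The returned element number is then looked up in the Section Location Table to find the Memo Content Section.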

Format

May be coded in two ways. An extra bit inside existence mask tells what kind of coding is valid. If it is false then a Format Content Section is used to keep the content of the Format field. If it is true the content of the Format field is stored directly in the data set. See Notes about Format Content Section for details.

Offset Size Data Description

0x0000 L Number of the Section Location Table element which points to the corresponding Format Content Section.

0x0000 L Offset to Text Layout Section

[0x0004 embedded Objects]

0xXXXX Text Layout Section

Unknown Section 2

This is an important section because it contains a pointer to the first Table Content Section. The meaning of the other information is unknown.

Offset Size Data Description

0x0000 MLISTB

content of unknown Section 2

Offset Size Data Description

0x0000 L Pointer to first Table Content Section (top byte may contain other information and must be masked out)

0x0004-0x000C unknown

Application ID Section

The usual Application section.

Offset Size Data Description

0x0000 MLISTB

content of Application ID Section

Offset Size Data Description

0x0000 ID 0x10000086

0x0004 String Name of the Application.

ID-Binding Table

It contains IDs, each followed by an offset. This offset seems to be the position, inside the Section Location Table, of the element for the subsection which corresponds to the ID.

Offset Size Data Description

0x0000 MLISTB

content of ID-Binding Table

Offset Size Data Description

0x0000 BLISTL two Long gives a pair of data

Coding of an Element of the ID-Binding-Table

Offset Size Data Description

0x0000 ID

0x0004 L Position inside the Section Location Table
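Once the longs of the ID-Binding Table have been read, pairing them up is simple. The Python sketch below takes the already decoded longs as input, because it is not clear whether the count of the BLISTL counts longs or pairs. The names are made up, the ID names are the ones from the ID list at the beginning of this document, and masking the top byte of the position is an assumption based on the remark about references into the Section Location Table.

```python
KNOWN_IDS = {
    0x10000069: "Internal Table Storage Section",
    0x10000086: "Data-ID",
    0x10000089: "APP-ID",
    0x1000012E: "Table Definition Section",
    0x10000131: "View Definition Section",
    0x10000132: "unknown Section 3",
    0x10000133: "unknown Section 5",
    0x10000137: "unknown Section 4",
}


def bind_ids(longs):
    """Turn [id, position, id, position, ...] into {id: position}."""
    if len(longs) % 2:
        raise ValueError("ID-Binding Table needs an even number of longs")
    return {ident: position & 0x00FFFFFF
            for ident, position in zip(longs[0::2], longs[1::2])}
```

A converter would typically look up 0x1000012E and 0x10000131 in the result to find the Table Definition Section and the View Definition Section.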

Table Definition Section

The Table Definition Section contains definitions of the fields of the tables (without internal aspects).

Offset Size Data Description

0x0000 MLISTB

content of Table Definition Section

Offset Size Data Description

0x0000 L 02 00 00 00 unknown. May be the number of the element for the Internal Table Storage Section in the Section Location Table.

0x0004 LLISTL Search Definition Table

0xXXXX B unknown. Normally 0x00, 0x01 if file is sorted

0xYYYY LLISTE E=constant 5. Sort Definition Table

0xZZZ1 XLISTE unknown. May be a Table Definition Table

Coding of an Element of the Sort Definition Table

The Sort Definition Table contains the fields which are used to sort the table, each followed by the kind of sorting (ascending/descending).

Offset Size Data Description

0x0000 L tag of the field to sort by

0x0004 B Sortorder. 0x00 ascending, 0x01 descending

Coding of an Element of the Search Definition Table

The Search Definition Table contains the fields which are searched during a searching process.

Offset Size Data Description

0x0000 L tag of the field to search

Coding of an Element of the Table Definition Table

Offset Size Data Description

0x0000 String Name of the table

0xXXXX L Internal counter which is used to produce a unique tag for each defined field. It contains the number assigned to the field which has been defined last.

0xXXX4 XLISTE Table Field Definition Table

Coding of an Element of the Table Field Definition Table

The coding differs according to the kind of the field. The first entries are the same.

Boolean

Offset Size Data Description

0x0000 String Name of the field

0xXXXX B 0x00 type of the field: Boolean

0xXXX4 L unique number of field

0xXXX8-0xXX13 unknown FF FF FF FF 00 00 01 00 04 00 02 00

0xXX14 L unknown 9C FF FF FF

0xXX18 L unknown 64 00 00 00

Integer

Offset Size Data Description

0x0000 String Name of the field

0xXXXX B 0x05 type of the field: Integer

0xXXX4 L unique number of field

0xXXX8-0xXX13 unknown FF FF FF FF 00 00 01 00 04 00 02 00

0xXX14 L Minimal allowed value as Integer

0xXX18 L Maximal allowed value as Integer

Double

Offset Size Data Description

0x0000 String Name of the field

0xXXXX B 0x09 type of the field: Double

0xXXX4 L unique number of field

0xXXX8-0xXX13 unknown FF FF FF FF 00 00 01 00 04 00 02 00

0xXX14 L Minimal allowed value as Integer

0xXX18 L Maximal allowed value as Integer

Date

Offset Size Data Description

0x0000 String Name of the field

0xXXXX B 0x0A type of the field: Date

0xXXX4 L unique number of field

0xXXX8-0xXX13 unknown FF FF FF FF 00 00 01 00 04 00 02 00

0xXX14 L unknown 9C FF FF FF

0xXX18 L unknown 64 00 00 00

Text

Offset Size Data Description

0x0000 String Name of the field

0xXXXX B 0x0B type of the field: text

0xXXX1 L unique number of field

0xXXX5 B maximal length of the text in the field

0xXXX6 L unknown 00 00 00 00

0xXXXA B unknown some Flags, Bit 2 shows that this field will be shown at dialing.

0xXXXB B number of characters to use at sort

0xXXXC B unknown 00

0xXXXD B unknown 01

0xXXXE B unknown 00

0xXXXF B unknown 02

0xXX10 B unknown 00

Memo

Offset Size Data Description

0x0000 String Name of the field

0xXXXX B 0x0E type of the field: Memo

0xXXX4 L unique number of field

0xXXX8-0xXX13 unknown FF FF FF FF 00 00 01 00 04 00 02 00

0xXX14 L unknown 9C FF FF FF

0xXX18 L unknown 64 00 00 00

View Definition Section

The View Definition Section contains the definitions of how to display the information of the datasets on the screen. In detail this means in which order the fields are displayed, which fields are displayed at all, what kind of character set is used to display them and so on. Note that there are two kinds of views (table view, dataset view) and this section contains both definitions.

Offset Size Data Description

0x0000 MLISTB

content of View Definition Section

Offset Size Data Description

0x0000 View Definition: Displayed as a table

0xXXXX View Definition: Displayed as a card

Content of View Definition

Offset Size Data Description

0x0000 ELISTE View Field Definition List

0xXXXX BLISTE View Field Layout List

0xYYYY Sequence of bytes. Each byte seems to belong to one field. Meaning is unknown

Element of View Field Definition List

Each Element seems to be 9 Bytes long.

Offset Size Data Description

0x0000 W subsequent number of the view field

0x0002 L unique tag of the field

0x0006 W unknown: direct relation to unique tag

0x0008 B unknown: always 01
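The 9-byte element can be unpacked with a fixed `struct` layout. In this Python sketch the names are made up, little-endian values are assumed, and the number of elements is expected to come from the enclosing list.

```python
import struct
from typing import NamedTuple

VIEW_FIELD = struct.Struct("<HIHB")      # W, L, W, B without padding


class ViewField(NamedTuple):
    sequence: int        # subsequent number of the view field
    field_tag: int       # unique tag of the field
    unknown_word: int    # direct relation to the unique tag
    unknown_byte: int    # always 01 so far


def read_view_fields(buf: bytes, offset: int, count: int):
    """Read `count` View Field Definition elements starting at `offset`."""
    return [ViewField(*VIEW_FIELD.unpack_from(buf, offset + i * VIEW_FIELD.size))
            for i in range(count)]
```

`VIEW_FIELD.size` is 9, which agrees with the element length observed above.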

Element of View Field Layout List

Each Element seems to be 18 Bytes long.

Offset Size Data Description

0x0000-0x0011 unknown: 02 4F 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00

Unknown Section 3

Offset Size Data Description

0x0000 MLISTB

content of unknown Section 3

Offset Size Data Description

0x0000-0x0039 unknown. Value 0010 changed if one field added, increased by one but does not contain the right field number

Unknown Section 4

Offset Size Data Description

0x0000 MLISTB

content of unknown Section 4

Offset Size Data Description

0x0000-0x0005 unknown 03 37 02 02 02 00

Unknown Section 5

Offset Size Data Description

0x0000 MLISTB

content of unknown Section 5

Offset Size Data Description

Memo Content Section

Inside a Memo Content Section the content of exactly one Memo field of a dataset is stored if it is too large to be stored inside a SString. An element of the Section Location Table contains a pointer to the corresponding Memo Content Section. The number of this element is stored inside the memo field content of the dataset, followed by the length of the memo. Note that the data is stored immediately after the MLISTB descriptor! Due to memory management techniques the descriptor may not contain the true length of the section.

Offset Size Data Description

0x0000 MLISTB

Format Content Section

Inside a Format Content Section the content of exactly one Format field of a dataset is stored if it is too large to be stored inside a SString. Each Format corresponds to a text (stored as a Text or Memo field). The usual Text Layout Section is used to store the formats belonging to each paragraph of the text. Note that there is a Text Layout Section for each Format field and not only one as in the Word format. Another problem is storing embedded objects. This is also done in the Format Content Section. Objects are stored in front of the Text Layout Section. To allow skipping this information, there is a long at the beginning of each Format Content Section which holds the offset to the beginning of the Text Layout Section.

Offset Size Data Description

0x0000 L Offset to Text Layout Section from 0x0000 on.

[0x0004 embedded Objects]

0xXXXX Text Layout Section
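Splitting the content of a Format Content Section into its two parts can be sketched in Python as follows. The function name is made up, a little-endian long is assumed, and the content is expected without the enclosing MLISTB descriptor.

```python
import struct


def split_format_content(content: bytes):
    """Return (embedded objects, Text Layout Section) as byte strings."""
    (layout_offset,) = struct.unpack_from("<I", content, 0)
    # The offset is counted from 0x0000 on, so it can never be below 4.
    if not 4 <= layout_offset <= len(content):
        raise ValueError("offset to Text Layout Section is out of range")
    return content[4:layout_offset], content[layout_offset:]
```

If there are no embedded objects the offset is 4 and the first part is empty.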

Open Standing Questions

To clarify everything which is marked with "there seems", "it is likely" or "unknown" in the text.

It seems that there can be more than one table. How does the system know where the first content section of each table starts?

I see that there are big additional structures if you sort the database. Because this does not influence the conversion process, I did not analyze these structures.