CS 403 File Structures

SYLLABUS

CH: 2-0-0
CR: 2
Storage device structures and input output mechanisms.

Concepts of record, block, and file.

File Structures: Pile, Sequential, Indexed Sequential, Direct access, Inverted files; Indexing structures- B-tree and its variations.

Books:

1. Loomis, Marry; Data Management and File Structures, PHI
2. Horowitz and Sahni, Data Structures, Narosa Publishing House.


SUMMARY of matter in the course

Data Structures - A summary

Record

Roughly, a record is a collection of pieces of information pertaining to a single object. The object may be physical or conceptual. For example a student record may comprise items of information such as student's name, roll number, registration number, age, grade in semester 1, grade in semester 2, etc. A bank account record (a bank account is a conceptual object) may comprise account holder's name, present balance, overdraft limit, etc. The individual items of information in a record are often called fields of the records. This notion of a record also provides for the existance of multiple instances of a type of record, i.e., there may be several student records (each about a distinct student object), or several bank account records (each about a distinct account). The sizes of individual instances of records of the same type may be same or allowed to vary.

Representing a record in computer

Each field of a record may be represented (i.e. held in memory and processed) in a computer system by the data abstraction mechanisms provided by the programming languages. Data abstraction mechanisms mean built-in data types in a language and user defined data types. Further, most programming languages also allow user to define composite data types (viz., struct and union in C, class in C++, record in Pascal). In these mechanisms, when a record is identified, its constituent fields can be identified (accessed). So records can be represented in a computer using such mechanisms.

Exercise
1. Define a record type in C language for student records to store the name, roll number, and communication address of a student. The address can be either a single email address or a record in itself with the fields house number, locality, post office and state.

File

It can be easily imagined that in most situations a computer system (i.e. a system comprising hardware and software) has to handle multiple instances of some type of record. Such as records about students in an University, records about books in a library, and so on. In this case the common part among the several instances of a record is the type of the record. Again, we might find that two completely independent software in a computer dealing with a collection of records of the same type, say two departments of an University, each maintaining the records of their students in the same computer. In such case we are not interested in considering the two groups of records (of the two departments) together. So we can roughly define a file as a collection of records of the same type and with a well understood reason to be considered together.

Operations with a file

Representing a file in computer

As mentioned earlier, most programming languages provide some mechanisms to represent individual records. The programming languages also provide mechanisms to represent a file, that is a collection of related records. A first approximation can be like this - for each record get some area in the memory to hold the record. But to implement the concept of a file, we need to be able to identify all records in a file when the file is identified. In programming languages this is facilitated in the data structures such as arrays, linked lists, trees, etc. Usually, in all these representations, the individual fields of a record are placed in contigous memory area, whereas the individual records may be scattered. But whenever necessary, when the file is identified (in case of an array, the first location of the array, in linked list, the first node of the list, in tree, the root node of the tree, etc.) all the records of the file can also be identified.

Exercise
2. Implement a file of student records of the type described in the earlier exercise using linked lists. Write a routine to sort the list in non-decreasing order of roll numbers. How is non-decreasing order different from increasing order ?

Some important observations about files that affect their representation

  1. The size of a file can possibly be very large, say thousands, or millions of records. In terms of number of bytes, a file can be of Gigabytes.
  2. A file may need to be maintained in a computer for a long period, such as several days, months, years, decades, or virtually, forever. During this period the programs that access the file may be invoked and terminated multiple times. Also, the computer system holding the file may be shutdown and brought up multiple times.
  3. During a particular run of a program that accesses a file, only a part of the file may be actually accessed.
  4. The sizes of files, i.e., number of records in a file, can vary during the lifetime of the file.

Exercise
3. Describe a scenario for each of the above observations.

File representation in secondary storage devices

The first three of the above observation about files brings out the shortcomings of the use of main-memory as an adequate means of storing a file. Hence, secondary storage devices are used to hold files. This provides-
  1. capability of holding very large files.
  2. permanance, to some extent, of storage of files.
Here it needs to be mentioned that a secondary storage device is used by various programs to store different bodies or blocks of information which may be mutually unrelated. From that point of view, the file that we have been discussing here is one such block of information. Now, each such block of information in a secondary storage device is almost universally called a file. This notion of a file is different from the one that we have been dealing with here. To avoid this confusion we shall refer to such a file as a disk file as against our file that is a collection of records. Disk file management is an elaborate topic and is usually discussed in the context of Operating Systems. Each operating system enforces some naming conventions for disk files to unambigously identify each such entity in a secondary storage device. We shall see later (while discussing indexed sequential file structure) that to represent a file in a secondary storage device, we might use more than one disk-files.

Physically a disk file may be spread over non-contiguous portions in the storage medium. But the operating systems hides this fact and allows programs to assume that a disk file exists as a single continuous block. Further, the operating systems also provides primitives to access a particular portion of the disk file by specifying the offset from the beginning of the disk file. With such assumtion and access primitives it is possible to implement data structures such as arrays, trees, etc. within a disk file. But random access in a disk-file is much slower than in case of main memory.

A CPU cannot read from or write to a secondary storage device directly. So, for actual operations involving files, the contents have to be moved between main memory and the secondary storage devices. If the file is too large to fit in the available main memory, part of it, say a subset of the records, may be held in the memory at a given time.

Exercise
4. Write a C program that creates an array of records in memory, and then appends the records into a file. How are the functions fprintf and fwrite different ?
5. How are the file opening modes "r+", "w+" and "a+" different ? What does b in "rb" signify ?

Nature of secondary storage devices

Exercise
6. What are the structures of magnetic tapes, floppy disks, hard disks, compact disks, and how does a computer store and retrieve information in these devices ?

Searching records in a file

Searching records underly most operations involving a file. So, the representation of a file in a computer should facilitate efficient searching of records. The physical layout of the records of a file in a computer is referred to as file structure. Different file structures have different characteristics and are suitable in different situations. Some structures are easier to implement/enforce than others. Similarly, some structures facilitate more efficient searching than others. Ease of enforcing a structure basically implies - economy of storage space, efficiency in inserting, modifying and updating a record, and simplicity of layout in the physical storage device. These aspects greatly depend on the type of storage device being used. For example, for a file stored in main memory economy of space is very important, but in secondary storage devices efficiency of access needs more attention than economy of space.

File Structures

Sequential File

A very natural way to store a file is in the form of an array, or a linked list of the records. In these representations, the entire file may be traversed in a linear fashion. This file structure is called sequential file. It is simple to implement and can be economic in space.

On the negative side, most search operations in such a file are likely to be inefficient, since searching requires traversing of the sequence of records according to the storage sequence. In case of large files, particularly when the file is in disk, such traversal times can be too long. If the structure used is an array (and not a linked list), and the records are of equal sizes, then searching can be made efficient by keeping the records sorted on one key field. In that case searching based on that key field can be carried out using binary search. But search based on any other field value cannot derive the benefit of binary search. Moreover, using an array for storing a file makes growth or shrinking of the file (due to insertions and deletions) inconvenient. To overcome this one may use some kind of search tree structure to store records instead of a simple sequential structure. However, even in a search tree we can assume an ordering among the records only on one of the fields.

Exercise
7. Even though it may be required to retrieve records based on a particular key value always, what could be the disadvantages of having a sequential file structure in disk ?

Indexed Sequential File

To overcome the shortcomings of the sequential file structures, indexed sequential file structure is used. In this the records of the file are stored like in a sequential file. But in addition to this, some index of the records is (are) maintained. An index is a collection of (key_value, record_address) pairs. The index is represented using a separate data structure. A search for a record using a key field shall now be carried out in the index based on that key value. Once the index entry is located, the record_address part of the entry can be used to directly access the record. Due to smaller size of the index, it can be structured in a way that facilitates efficient search, eg. sorted array (for binary search), hash table, search-trees, etc. Also, for the same file several indexes using different key field can be maintained to help search based on different key fields. Since an index can contain many entries (as many as the number of records in the file), structuring them appropriately to facilitate an efficient search is important.

Some Index structures

1. Cylinder Surface Indexing - This index structure is for files stored in hard disks. It needs the records to be sorted on some field. Then assuming a sequence of the cylinders in the disk and a sequence of surfaces (tracks) in each cylinder, the sorted records are stored starting from the first cylinder and proceeding in the assumed sequence through the cylinders. An index of the first record in each cylinder is maintained using the field on which the records are sorted as the key field. This is the cylinder index. The key values of this index defines the ranges of the key value present in each cylinder, so that for a search of a particular record the cylinder in which the record may be present can be found out. So the search for the record can be performed only in that cylinder. Again within a cylinder, records are stored starting from the first surface and through the assumed sequence of surfaces. Here an index of the first record in each surface is maintained using the same key field. This is the surface index. There is a cylinder index for the entire file and one surface index for each cylinder that the file spreads over.

The idea of cylinder surface index can be used even without considering cylinders and surfaces. If the records are sorted and kept in a disk file, then an index containing the (key-value, offset) of equally spaced records records can help a sequential search by requiring the search first in the index, and then in the proper segment of the disk file.

2. Hash Index - In this indexing the (key_value, record_address) pairs of each record in the file are stored in a hash table which is an array. Insertion and search of such an index element is done using a suitable hash function and collision resolution mechanisms. The good performance associated with hashing can be achieved in this method, but requires selection of appropriate hash function and collision resolution mechanisms.

Exercise
8. Discuss the merits and demerits of storing the entire records in the hash table instead of the index entries.
9. Why must a hash table be stored in an array and not a linked list ?

3. Tree index - The basic idea in searching using any search tree is to successively divide the searching domain (set of elements where search is to be done), untill either the search is successful or it is found that the element is not present in the given set. An m-way search tree is a very general search tree where the elements are ordered on a key field and m denotes the maximum number of partitions that the set can be divided into at any level. In other words it is the maximum number of children that each node can have. At each level the root of the tree can contain upto m - 1 key values that define the partition boundaries. A binary search tree is a m-way search tree of order 2.

A balanced search tree gives a search performance of logmN. But unless care is taken a search tree tends to get skewed as insertions and deletions are performed in it. Maintaining the balanced property of a tree after each insertion or deletion is somewhat difficult. There are some m-way search tree structures on which insertions and deletions can be carried out with logmN complexity and still retain a fair degree of balance in the tree structure. One such tree is the B-tree where the number of children of a node is at least the ceiling of m/2 , and all leaf nodes are at the same level. Actually the important thing to be noted in such tree organisations is the insertion and deletion methods which retains the required properties of the tree. A B-tree of order 3 is called 2-3 tree and is frequently used. Some basic algorithms for B-trees implemented in C are presented in http://www.tezu.ernet.in/~utpal/daa_code/b_tree.c

Some variations of B-trees are B* tree, B+ tree, and B' tree.

B' Tree - We note that in the index structures that we are considering, each key value in the structure should provide us the physical location of the record in the file. To achieve that we may store in the nodes the (key_value, record_address) pairs, and traverse the tree by considering the key values of these pairs in the nodes. Sometimes an additional requirement is placed on a B-Tree whereby composite values such as (key_value, record_address) pairs are kept only in the leaf nodes of the tree. All other nodes keep only the key values that are used in the tree traversal during a search. These key values of the non-leaf nodes are repeated in leaf nodes at appropriate places. Thus the left to right listing of the leaf nodes of the tree gives a complete ordered sequence of the key_values. This form of a B tree is sometimes called a B' tree.

B* Tree - Searching in an m-way search tree is more efficient if each node contain more key values and children, since this will tend to reduce the height of the tree. In case of a B tree, during insertion when the number of key values in a node becomes m, we split the node. This creates two nodes which are about half filled, i.e., they have about the minimum number of key values in them. To improve the situation, when during insertion we are about to split a node to two when the number of key value in the node becomes m, instead of indiscriminately splitting we may resort to redistribution of key values of that node with its immediate sibling(s). In that case if there is room for more key values in the sibling, the splitting can be avoided. If this redistribution is attempted with one immediate sibling, then splitting shall be required only if that sibling is also already full. Then we split 2 to 3, i.e., we make three nodes out of two. The number of descendents (subtrees) in these nodes shall be ceiling of (2m - 1)/3. Such a tree is called a B* tree.

Example : Consider a B-Tree of order 3 where two siblings have 2 and 3 children respectively. Suppose an insertion of a new element requires one more child to be accomodated in the latter. Than instead of splitting the node into two nodes with 2 children each, we can redistribute the 6 children (2 + 3 + 1) in the two nodes. Splitting shall be required only now, if it is required to insert another new element in either of these nodes both of which are now full. And during a split, too, redistribution of elements in two siblings is performed.

B+ Tree - Sometimes it is useful to maintain the actual records in a sequential structure in some order. This is called a sequence set. In such a case once one record is located in the file it is possible to get to the next record in constant time without much difficulty. Now, suppose the file is present inside a disk file in such a way that each block contains a continuous sub-sequence of the sorted sequence of records, and the sets of records in the blocks are disjoint. In such a situation, we may opt to maintain a B' tree of index, only upto one level before the leaf nodes. In other words, the index actually takes us to the first record in each block of the file. The key value of these records act as separators in the sequence of ordered records. The records after a separator record until the next separator record (or the end of the file, in case of the last separator record) can be easily reached from these separator records since they exist in the same disk block. We can see that during a search operation the index mechanism will lead to the block of records where the record being searched must be present if it is at all in the file. Since the entire set of records in that block must be anyway loaded into the memory, the rest of the search can be easily carried out in the block in memory. Such a tree is called B+ tree.

Example : Consider a file with records whose key values are K1, K2, ...KN, in ascending order of the key values. Suppose, we store them in such a way that record with key Ki+1 can be accessed easily after accessing the record with key Ki (possible if the file is stored in an array, or as a linked list). We can have a B+ tree for such a file, in which the leaf nodes will contain the (key_value, address) pairs for some of the records which are separators in the sequence list, and the other records can be reached easily from the records corresponding to these.
Leaves of B+ tree : K2 K5 K7...KN-1
Keys in Sequence set : K1 K2 K3 ... KN

Exercise
10. Implement a file of student records with an index structures as a 2-3 tree.
11. What would you say about the searching efficiency of two m-way search trees each with a different value of m ?
12. What does sibling mean in a tree ?