Concepts of record, block, and file.
File Structures: Pile, Sequential, Indexed Sequential, Direct access, Inverted files; Indexing structures- B-tree and its variations.
Books:
1. Loomis, Marry; Data Management and File Structures, PHI
2. Horowitz and Sahni, Data Structures, Narosa Publishing House.
Exercise
1. Define a record type in C language for student records to store the name,
roll number, and communication address of a student. The address can be
either a single email address or a record in itself with the fields house
number, locality, post office and state.
Exercise
2. Implement a file of student records of the type described in the earlier
exercise using linked lists. Write a routine to sort the list in
non-decreasing order of roll numbers. How is non-decreasing order
different from increasing order ?
Exercise
3. Describe a scenario for each of the above observations.
Physically a disk file may be spread over non-contiguous portions in the storage medium. But the operating systems hides this fact and allows programs to assume that a disk file exists as a single continuous block. Further, the operating systems also provides primitives to access a particular portion of the disk file by specifying the offset from the beginning of the disk file. With such assumtion and access primitives it is possible to implement data structures such as arrays, trees, etc. within a disk file. But random access in a disk-file is much slower than in case of main memory.
A CPU cannot read from or write to a secondary storage device directly. So, for actual operations involving files, the contents have to be moved between main memory and the secondary storage devices. If the file is too large to fit in the available main memory, part of it, say a subset of the records, may be held in the memory at a given time.
Exercise
4. Write a C program that creates an array of records in memory, and then
appends the records into a file. How are the functions fprintf
and fwrite different ?
5. How are the file opening modes "r+", "w+" and "a+" different ? What does
b in "rb" signify ?
Exercise
6. What are the structures of magnetic tapes, floppy disks, hard disks,
compact disks, and how does a computer store and retrieve information in
these devices ?
On the negative side, most search operations in such a file are likely to be inefficient, since searching requires traversing of the sequence of records according to the storage sequence. In case of large files, particularly when the file is in disk, such traversal times can be too long. If the structure used is an array (and not a linked list), and the records are of equal sizes, then searching can be made efficient by keeping the records sorted on one key field. In that case searching based on that key field can be carried out using binary search. But search based on any other field value cannot derive the benefit of binary search. Moreover, using an array for storing a file makes growth or shrinking of the file (due to insertions and deletions) inconvenient. To overcome this one may use some kind of search tree structure to store records instead of a simple sequential structure. However, even in a search tree we can assume an ordering among the records only on one of the fields.
Exercise
7. Even though it may be required to retrieve records based on a particular
key value always, what could be the disadvantages of having a sequential file
structure in disk ?
The idea of cylinder surface index can be used even without considering cylinders and surfaces. If the records are sorted and kept in a disk file, then an index containing the (key-value, offset) of equally spaced records records can help a sequential search by requiring the search first in the index, and then in the proper segment of the disk file.
2. Hash Index - In this indexing the (key_value, record_address) pairs of each record in the file are stored in a hash table which is an array. Insertion and search of such an index element is done using a suitable hash function and collision resolution mechanisms. The good performance associated with hashing can be achieved in this method, but requires selection of appropriate hash function and collision resolution mechanisms.
Exercise
8. Discuss the merits and demerits of storing the entire records in the hash
table instead of the index entries.
9. Why must a hash table be stored in an array and not a linked list ?
3. Tree index - The basic idea in searching using any search tree is to successively divide the searching domain (set of elements where search is to be done), untill either the search is successful or it is found that the element is not present in the given set. An m-way search tree is a very general search tree where the elements are ordered on a key field and m denotes the maximum number of partitions that the set can be divided into at any level. In other words it is the maximum number of children that each node can have. At each level the root of the tree can contain upto m - 1 key values that define the partition boundaries. A binary search tree is a m-way search tree of order 2.
A balanced search tree gives a search performance of logmN. But unless care is taken a search tree tends to get skewed as insertions and deletions are performed in it. Maintaining the balanced property of a tree after each insertion or deletion is somewhat difficult. There are some m-way search tree structures on which insertions and deletions can be carried out with logmN complexity and still retain a fair degree of balance in the tree structure. One such tree is the B-tree where the number of children of a node is at least the ceiling of m/2 , and all leaf nodes are at the same level. Actually the important thing to be noted in such tree organisations is the insertion and deletion methods which retains the required properties of the tree. A B-tree of order 3 is called 2-3 tree and is frequently used. Some basic algorithms for B-trees implemented in C are presented in http://www.tezu.ernet.in/~utpal/daa_code/b_tree.c
Some variations of B-trees are B* tree, B+ tree, and B' tree.
B' Tree - We note that in the index structures that we are considering, each key value in the structure should provide us the physical location of the record in the file. To achieve that we may store in the nodes the (key_value, record_address) pairs, and traverse the tree by considering the key values of these pairs in the nodes. Sometimes an additional requirement is placed on a B-Tree whereby composite values such as (key_value, record_address) pairs are kept only in the leaf nodes of the tree. All other nodes keep only the key values that are used in the tree traversal during a search. These key values of the non-leaf nodes are repeated in leaf nodes at appropriate places. Thus the left to right listing of the leaf nodes of the tree gives a complete ordered sequence of the key_values. This form of a B tree is sometimes called a B' tree.ExerciseB* Tree - Searching in an m-way search tree is more efficient if each node contain more key values and children, since this will tend to reduce the height of the tree. In case of a B tree, during insertion when the number of key values in a node becomes m, we split the node. This creates two nodes which are about half filled, i.e., they have about the minimum number of key values in them. To improve the situation, when during insertion we are about to split a node to two when the number of key value in the node becomes m, instead of indiscriminately splitting we may resort to redistribution of key values of that node with its immediate sibling(s). In that case if there is room for more key values in the sibling, the splitting can be avoided. If this redistribution is attempted with one immediate sibling, then splitting shall be required only if that sibling is also already full. Then we split 2 to 3, i.e., we make three nodes out of two. The number of descendents (subtrees) in these nodes shall be ceiling of (2m - 1)/3. Such a tree is called a B* tree.
Example : Consider a B-Tree of order 3 where two siblings have 2 and 3 children respectively. Suppose an insertion of a new element requires one more child to be accomodated in the latter. Than instead of splitting the node into two nodes with 2 children each, we can redistribute the 6 children (2 + 3 + 1) in the two nodes. Splitting shall be required only now, if it is required to insert another new element in either of these nodes both of which are now full. And during a split, too, redistribution of elements in two siblings is performed.
B+ Tree - Sometimes it is useful to maintain the actual records in a sequential structure in some order. This is called a sequence set. In such a case once one record is located in the file it is possible to get to the next record in constant time without much difficulty. Now, suppose the file is present inside a disk file in such a way that each block contains a continuous sub-sequence of the sorted sequence of records, and the sets of records in the blocks are disjoint. In such a situation, we may opt to maintain a B' tree of index, only upto one level before the leaf nodes. In other words, the index actually takes us to the first record in each block of the file. The key value of these records act as separators in the sequence of ordered records. The records after a separator record until the next separator record (or the end of the file, in case of the last separator record) can be easily reached from these separator records since they exist in the same disk block. We can see that during a search operation the index mechanism will lead to the block of records where the record being searched must be present if it is at all in the file. Since the entire set of records in that block must be anyway loaded into the memory, the rest of the search can be easily carried out in the block in memory. Such a tree is called B+ tree.
Example : Consider a file with records whose key values are K1, K2, ...KN, in ascending order of the key values. Suppose, we store them in such a way that record with key Ki+1 can be accessed easily after accessing the record with key Ki (possible if the file is stored in an array, or as a linked list). We can have a B+ tree for such a file, in which the leaf nodes will contain the (key_value, address) pairs for some of the records which are separators in the sequence list, and the other records can be reached easily from the records corresponding to these.
Leaves of B+ tree : K2 K5 K7...KN-1
Keys in Sequence set : K1 K2 K3 ... KN