Tape Reels and COBOL Records

In a previous post I looked at the hardware side of open-reel (or “reel-to-reel”) data storage in mainframes of the ’50s to the ’80s. I promised a look at exactly how these reels were actually used, though, and never quite delivered it there. On top of the rather impressive hardware there must be some kind of software layer, of course, and you can imagine that linear-access devices like tape are not exactly suitable for btrfs.

The data format on the tapes was generally entirely up to software. The hardware specified only one logical element of the data structure: a certain length of blank tape (often 1/3″ to 1″, depending on the hardware) had to be left between blocks. This “interblock gap” allowed sufficient time for the tape to be stopped or started in between logical sections, so that the tape would be at full speed when data was read. This had implications for the design of any software scheme, but still left a great deal of room for deciding how to store actual data.
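To get a sense of why the gap mattered so much to software design, consider some rough arithmetic. The figures below (a 2,400-foot reel, 1600 bytes per inch, a 0.6″ gap) are illustrative assumptions in the right general ballpark, not the specifications of any particular drive:

  # Rough illustration of how interblock gaps eat into capacity.
  # All figures are illustrative assumptions, not the specs of any real drive.
  TAPE_LENGTH_IN = 2400 * 12   # a 2,400-foot reel, in inches
  DENSITY_BPI = 1600           # recording density, bytes per inch
  GAP_IN = 0.6                 # interblock gap, inches

  def reel_capacity(block_bytes):
      """Approximate usable bytes on the reel for a given block size."""
      block_len_in = block_bytes / DENSITY_BPI + GAP_IN   # data plus one gap
      return int(TAPE_LENGTH_IN / block_len_in) * block_bytes

  # Tiny blocks waste most of the tape on gaps; larger blocks recover
  # most of the reel's nominal capacity (about 46 MB at this density).
  for size in (80, 800, 8000, 32000):
      print(f"{size:>6}-byte blocks: ~{reel_capacity(size) / 1e6:.1f} MB")

The takeaway is that any software scheme had a strong incentive to write data in large blocks rather than one small record at a time.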

In this post I’ll look at three common software schemes used with tapes: native file support, COBOL record files, and the tape archiver (tar), which is still often used today.

Mainframe and minicomputer operating systems generally supported the modern idea of writing files to a tape, but in a severely limited way. Hierarchical organization of files was nonexistent (and remains a bolt-on feature of IBM mainframe operating systems today), leading to very simple structures, and these schemes were usually space-inefficient as well. In the most common case, each file was written out (usually as a bit of metadata followed by the contents) and separated from the next by an interblock gap. Searching the tape then meant stopping and restarting at every file, which was slow, and especially frustrating on drives that offered a high-speed advance feature this scheme couldn’t take advantage of, but it was a simple and reasonably low-overhead solution. Another approach was to allot each file a fixed-size space, which sped up access since files could be skipped en masse when searching for a file farther into the tape, but was very space-inefficient if the files on the tape varied much in size.

The implementation details of file storage depended on the architecture and software in use, and varied widely even across identical hardware depending on who had written the system software. As operating systems like UNIX (and quite a number that did not survive to this day) spread, it became less and less usual to use any operating system functionality to format tapes at all. On UNIX, tapes were typically not mounted, but instead handled directly as devices by userspace utilities written for the purpose. The random-access nature of the hard disks for which file systems were designed was simply too dissimilar to the sequential-access nature of tapes.

COBOL, the Common Business Oriented Language, was designed in 1959 with the express purpose of simplifying the development of business software. As you can imagine, business software frequently involves storing and reading records of various formats. COBOL was also used for scientific computing applications that were data-heavy (in the 1960s, “Big Data” started at a few megabytes), demanding the same capabilities. To support these users, COBOL had a somewhat unusual feature: native support for a rather sophisticated data format.

COBOL could write to an output device (today usually a file, but originally a tape) in one of several different formats, ranging from a sequence of lines of text (what we would call a text file today) to a collection of multi-field records indexed for random access on multiple fields. Indexed files, however, required random access, which the resolutely sequential tape hardware could provide only slowly and awkwardly, so they were generally not used on tapes. These files are often referred to as COBOL files, COBOL records, or by various similar names, and a huge portion of the open-reel computer tapes you may find in trash heaps today will hold COBOL records.

Except in the case of sequential text files, where each “record” was simply a line of characters, COBOL files consisted of a number of “records”, each of which contained various “fields”. This might sound a bit like the rows and columns of a relational database, and the idea is much the same, but don’t get too excited: there were no relational features available, and indeed relational databases wouldn’t exist even conceptually until Edgar Codd’s paper in 1970.

COBOL files consist of records, written in order, with an interblock separator between each block of records; the number of records per block is a configurable value. The format of each record is defined in a “COBOL layout”, a sort of schema provided to the COBOL compiler when files are written or read. The layout is a chunk of text with a fairly simple format (like all of COBOL): each line of the layout gives a level, a field name, and a data type.

First, let’s look at levels. COBOL records are actually organized in a hierarchical fashion, as defined by the level number. Level numbers range from 01 to 49; the record itself sits at level 01, so that level is used to name the layout. Typically you will see a level 05 field, followed by several level 10 fields. The level 10 fields are subordinate to the level 05 field before them, and so the level 05 entry is considered a “group” (and is not actually a field, so it has no data type). There are also a couple of special levels greater than 49: level 66 gives an alternate name to previously defined fields, and, more interestingly, level 88 is used to define enumerations. We’ll look at an example of this later.

The field type is oddly defined by the keyword PIC, apparently for “picture”. This doesn’t mean that the field contains a picture, it means that the following characters will be a “picture” of the field, with a certain number of characters indicating the type and width. For example, “PIC 999” would define a field with three digits, while “PIC AAA” would define a field of three characters. This can be simplified by specifying a number for the field width, e.g. “PIC A(3)” means the same thing as “PIC AAA” but is more readable for large numbers of characters. Finally, the data type “X” means any arbitrary byte, and can be used for binary data or various other purposes.
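As a quick illustration of how those picture strings translate to field widths, here is a small Python sketch. It is my own toy helper (the name parse_pic is made up), and it handles only the simple A/9/X forms described above, not the full PICTURE syntax with signs, decimal points, and editing characters:

  import re

  # Toy interpreter for simple PIC strings like "999", "A(15)", or "XX".
  # Real COBOL PICTURE clauses are much richer; this covers only the
  # forms discussed above.
  def parse_pic(pic):
      kind, width = None, 0
      for symbol, repeat in re.findall(r"([9AX])(?:\((\d+)\))?", pic.upper()):
          width += int(repeat) if repeat else 1
          kind = {"9": "numeric", "A": "alphabetic", "X": "any byte"}[symbol]
      return kind, width

  print(parse_pic("999"))    # ('numeric', 3)
  print(parse_pic("A(15)"))  # ('alphabetic', 15)
  print(parse_pic("XX"))     # ('any byte', 2)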

Note that, in the simplest case, numbers are stored as their individual digits. Frustratingly, numbers can also be stored in proper floating-point formats or in a format called “packed decimal”. The floating-point format used would be specific to the architecture of the machine that wrote the file and is generally not IEEE, while packed decimal is an odd scheme for storing numbers digit-wise more space-efficiently.
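For the curious, here is a sketch of decoding one common flavor of packed decimal, the IBM-style COMP-3 encoding, in which each byte holds two decimal digits and the final nibble carries the sign. This is only one variant of the idea, and the function name is my own:

  # Decode IBM-style COMP-3 packed decimal: two BCD digits per byte,
  # with the final nibble holding the sign (0xD means negative).
  def unpack_comp3(data: bytes) -> int:
      nibbles = []
      for byte in data:
          nibbles.append(byte >> 4)
          nibbles.append(byte & 0x0F)
      sign = nibbles.pop()                      # last nibble is the sign
      value = int("".join(str(d) for d in nibbles))
      return -value if sign == 0x0D else value

  # 0x12 0x34 0x5C packs +12345 into three bytes instead of five characters.
  print(unpack_comp3(bytes([0x12, 0x34, 0x5C])))  # 12345
  print(unpack_comp3(bytes([0x01, 0x23, 0x4D])))  # -1234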

So, with those explanations in mind, let’s look at a simple COBOL layout. This example is partially drawn from this article from a storage conversion company, in that, now that I have seen their example, I can’t come up with a different one.

01 CUSTOMER.
   05 NAME.
      10 FIRST-NAME PIC A(10).
      10 MIDDLE-INITIAL PIC A.
      10 LAST-NAME PIC A(15).
   05 SEX PIC X.
      88 MALE VALUE "M".
      88 FEMALE VALUE "F".
   05 PHONE PIC 9(10).

Note the following features of this layout:

  1. “Name” is a group containing three subordinate fields with higher level numbers, which are text fields of 10, 1, and 15 characters.
  2. “Sex” is a field with arbitrary type that has special level 88 subfields which define an enumeration.
  3. “Phone” is a numeric field of ten digits.
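Since every field has a fixed width, a record conforming to this layout is simply a fixed run of 37 characters (10 + 1 + 15 + 1 + 10), and it can be picked apart by position. Here is a sketch of doing so in Python; it is my own illustration, and it assumes the record has already been converted from EBCDIC (which real mainframe tapes usually used) into ordinary text:

  # Slice a fixed-width record matching the CUSTOMER layout above.
  # Field widths come straight from the PIC clauses.
  LAYOUT = [
      ("FIRST-NAME",     10),
      ("MIDDLE-INITIAL",  1),
      ("LAST-NAME",      15),
      ("SEX",             1),
      ("PHONE",          10),
  ]

  def parse_customer(record):
      fields, offset = {}, 0
      for name, width in LAYOUT:
          fields[name] = record[offset:offset + width].rstrip()
          offset += width
      return fields

  sample = "JOHN".ljust(10) + "Q" + "SMITH".ljust(15) + "M" + "5055551234"
  print(parse_customer(sample))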

On the actual tape, a COBOL record file will consist of a header, records, and then a trailer. The header contains layout information, while the trailer contains a record count and checksums to verify the tape if needed. This whole thing is written directly to tape in order, and can later be read back in order.

The ‘tar’ utility is still widely used today when multiple files need to be packed into one file, usually to distribute a package of multiple files more easily. One of the things that newcomers to the Linux world often find odd about this arrangement is that the “tarball” file format produced by tar is not capable of compression – instead, when compression is desired, tarballs are fed through a separate compression algorithm such as gzip (producing .tar.gz files) or XZ (producing .tar.xz files). This is quite a bit different from formats like zip or 7z, common in the Windows world, which combine ‘archiving’ multiple files and compressing a stream into one file format.
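Python’s standard library makes the layering easy to see: the tarfile module handles the archiving, and compression, if wanted, is a separate pass over the resulting byte stream. A quick sketch (the file names are made up):

  import tarfile, gzip, shutil

  # Step 1: archive several files into one uncompressed tarball.
  with tarfile.open("bundle.tar", "w") as tar:
      tar.add("notes.txt")
      tar.add("data.csv")

  # Step 2: compression is a separate, independent layer over the same bytes.
  with open("bundle.tar", "rb") as src, gzip.open("bundle.tar.gz", "wb") as dst:
      shutil.copyfileobj(src, dst)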

The reason for this separation is essentially historical. tar stands for ‘tape archive’, and its original purpose was to write multiple files to a sequential-access medium – a tape. Because of some of the limitations noted earlier, it is often undesirable to store files on a tape using the simple native systems offered by many platforms. The tar utility was designed to pack multiple files onto a tape in a way that was space-efficient and allowed for (relatively) fast access to arbitrary files.

tar achieves this by breaking the archive up into blocks, which are further subdivided into records. Records are always 512 bytes in length, and blocks are generally 20 records long, or 10 kilobytes. A tarball consists of a header record for each file, followed by the file contents, padded to a whole number of records. Each file header record contains the basic metadata you would expect of a UNIX file – a name, permissions, owner and group, and a type, which could indicate a normal file or one of the various kinds of special UNIX files (such as links, directories, etc). At the end of the archive, padding is added to round the archive up to a whole number of blocks.
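To make that structure concrete, here is a minimal sketch of walking a tarball’s headers by hand. It assumes the plain POSIX ustar layout and ignores extensions such as pax headers and GNU long names:

  # Walk a tarball record by record: read each 512-byte header, print the
  # member's name and size, then seek past its padded contents.
  RECORD = 512

  def list_members(path):
      with open(path, "rb") as f:
          while True:
              header = f.read(RECORD)
              if len(header) < RECORD or header == b"\0" * RECORD:
                  break                                 # end-of-archive padding
              name = header[0:100].rstrip(b"\0").decode()
              size = int(header[124:136].rstrip(b" \0") or b"0", 8)  # octal ASCII
              print(f"{name}  ({size} bytes)")
              # contents are padded out to a whole number of 512-byte records
              f.seek((size + RECORD - 1) // RECORD * RECORD, 1)

  list_members("bundle.tar")   # e.g. the tarball created in the earlier sketch

Listing the members this way means scanning past every file’s contents, which is exactly the enumeration problem discussed below.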

This format makes a great deal of sense on a tape. Padding to whole numbers of records per file, and whole numbers of blocks per archive, allows for efficient use of fast advancing and rewinding in tape drives. A major downside of the format, however, is that the basic metadata for each file (such as its name) is found only in the file header that immediately precedes it. This means that enumerating the files in a tarball requires reading the entire tarball, which is irritating for very large tarballs even on modern storage devices, and was particularly irritating on relatively slow tapes. Most newer archive formats provide a metadata table at the beginning of the archive, removing this need. We have to cut tar some slack, though: it is very old, originating with UNIX V7 in 1979.

If we are honest, tar is now significantly outdated, mostly in its lack of efficient random access. However, its standardization in POSIX.1-1988 ensures that a tar implementation is available on every POSIX-compliant platform, making it an exceptionally safe option for distributing groups of files. It is composable with a number of compression implementations (including the state-of-the-art XZ, based on LZMA2), and modern storage devices are fast enough that the lack of efficient random access simply isn’t that noticeable. In short, there are several reasons to use tar despite its obsolescence, and not so many reasons to eliminate it.

Remarkably, open-reel tapes were manufactured until at least 2002, when, according to Wikipedia, the last manufacturer stopped making them. I’ve had a hard time determining what the last open-reel tape drive made was. It may have been the HP 88781, a high-end 9-track tape drive launched in 1995. Later tape drives supported fairly modern SCSI interfaces and were often auto-threading. Many of these drives are still sold used and recertified today, typically to organizations with a large amount of legacy data on 9-track tapes that they need converted to more modern formats. Several companies also offer this as a service.

On the whole, though, 9-track tapes are a part of history, entirely replaced by the more modern LTO standard. LTO tapes were initially used with either tar or a format proprietary to whatever backup software was in use, but the LTFS (Linear Tape File System) standard is now taking over the LTO market and provides fairly modern file-system features, including relatively efficient random access (though random access to tape, of course, remains very slow in absolute terms). Additionally, the LTO standard specifies compression algorithms called ALDC and SLDC that are implemented in hardware in the tape drives. These compression formats reduce throughput but roughly double tape capacity (2.5:1 is widely advertised, but not always achieved).

Tapes face a significant disadvantage against other storage media because of their less efficient access characteristics. However, up to now they have maintained a significantly better price per unit of capacity than random-access storage media, and this means that tapes will continue to have a use in backup and other high-latency storage applications for the foreseeable future.
