1

Is it a plain text file or binary file or just character file? Can someone explain what a flat file actually means?

Girish Sunkara
    In what context have you seen it? Usually it means a "non-structured" file, i.e. something not containing records (database) or other structures like trees etc. It doesn't matter if binary, text, or whatever. – dirkt Dec 30 '16 at 12:40

4 Answers

10

dirkt is right to ask for context, but is very wrong about flat files not being databases or containing records.

Flat file databases

In the context of databases, a flat file database satisfies the following:

  • It is a database that comprises exactly one table in one file.
  • The table has no indexes.
  • The structure is not relational, hierarchical, or networked.
  • The file may comprise fixed length records, with fields denoted by column positions, or variable length records, with records and fields separated from one another by delimiter characters.
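To make the two record layouts concrete, here is a minimal Python sketch. The field layout (an 8-byte name followed by two 4-byte integers) and the sample data are invented for illustration and do not correspond to any real file format.

```python
import struct

# Hypothetical fixed-length layout: 8-byte name, 4-byte id, 4-byte balance.
# Fields are found purely by their byte position within the record.
RECORD = struct.Struct("<8sII")

def read_fixed(data: bytes):
    """Yield (name, id, balance) from back-to-back fixed-length records."""
    for offset in range(0, len(data), RECORD.size):
        name, ident, balance = RECORD.unpack_from(data, offset)
        yield name.rstrip(b"\0").decode("ascii"), ident, balance

def read_delimited(text: str, field_sep: str = ":", record_sep: str = "\n"):
    """Yield field lists from variable-length, delimiter-separated records."""
    for record in text.split(record_sep):
        if record:  # skip the empty piece after a trailing separator
            yield record.split(field_sep)

# The same two made-up records in both layouts.
fixed_data = RECORD.pack(b"alice", 1, 100) + RECORD.pack(b"bob", 2, 250)
delimited_data = "alice:1:100\nbob:2:250\n"

fixed_rows = list(read_fixed(fixed_data))
delimited_rows = list(read_delimited(delimited_data))
```

In both cases the reader simply walks the file from start to end; nothing in the file says where a particular record lives.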

One might encounter terms such as Basic Sequential Access Method (BSAM) and Queued Sequential Access Method (QSAM) bandied about in the literature when discussing this. This relates to how one accesses flat file databases. One has to read and write them linearly and sequentially, because the concept does not require records to be sorted or even to be keyed.

Insertions and deletions involve processing the whole file. Flat file databases used to be commonly stored on media which were good for sequential access, such as magnetic tape, and database updates would sometimes take the form of reading from input tapes A and B and merging to output tape C. (For example: A could be the master file to the start of today, B could be today's transactions, and C would be the master file to date for tomorrow's run.)
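The A/B/C tape run can be sketched roughly as follows in Python. The transaction format (a leading "+" to add a record, a leading "-" to delete by key) is an assumption made up for this example, and the sketch keeps the day's transactions in memory, which a real tape job would not necessarily do.

```python
import io

def apply_transactions(master, transactions, output):
    """Copy master (A) to output (C), applying the day's transactions (B).

    Assumed, hypothetical transaction format: "+<record>" appends a record,
    "-<key>" deletes every record whose first colon-separated field is <key>.
    """
    deletions, additions = set(), []
    for line in transactions:
        line = line.rstrip("\n")
        if line.startswith("-"):
            deletions.add(line[1:])
        elif line.startswith("+"):
            additions.append(line[1:])
    # Every record of the old master is read once, in order, and either
    # copied or dropped: nothing is deleted "in place".
    for line in master:
        record = line.rstrip("\n")
        if record.split(":", 1)[0] not in deletions:
            output.write(record + "\n")
    for record in additions:
        output.write(record + "\n")

# In-memory stand-ins for the three tapes.
tape_a = io.StringIO("alice:100\nbob:250\ncarol:75\n")
tape_b = io.StringIO("-bob\n+dave:10\n")
tape_c = io.StringIO()
apply_transactions(tape_a, tape_b, tape_c)
new_master = tape_c.getvalue()
```

Even a one-record deletion touches every record of the master file, which is the cost that the paragraph above describes.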

Everyday flat file databases

You might think that you do not encounter such things any more, or at the very least that you do not encounter them on Unix and Linux. You'd be very wrong, too. Here are some flat file databases that you encounter daily:

  • The user account database on Linux is a collection of flat files.
    • The Version 7 /etc/passwd file is a single table, with variable length records, the colon character as field separator, and the linefeed as record separator.
    • So too are /etc/group, /etc/shadow, and /etc/gshadow.
  • The /etc/fstab file is a single table, with variable length records, non-linefeed whitespace characters as field separator, and the linefeed as record separator. Filesystem table — it is in the name.
    • So too are /etc/services, /etc/crontab, /etc/phones, /etc/ttys, /etc/hosts, and /etc/protocols.
  • The login databases (/run/utmp and /var/log/wtmp on Linux; /run/utx.active, /var/log/utx.lastlogin, and /var/log/utx.log on FreeBSD et al.) are flat file databases with fixed length records, no field nor record separators, and fields denoted by column position.
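As an illustration of how little machinery the variable-length formats need, here is a Python sketch that reads passwd-style and fstab-style tables. The sample contents are made up, and the parsers are deliberately naive; they are not what the C library does internally.

```python
from typing import NamedTuple

class PasswdEntry(NamedTuple):
    name: str
    passwd: str
    uid: int
    gid: int
    gecos: str
    home: str
    shell: str

def parse_passwd(text):
    """One record per line, fields separated by colons."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        name, passwd, uid, gid, gecos, home, shell = line.split(":")
        entries.append(PasswdEntry(name, passwd, int(uid), int(gid), gecos, home, shell))
    return entries

def parse_fstab(text):
    """One record per line, fields separated by runs of spaces or tabs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            rows.append(line.split())
    return rows

# Made-up sample data in the two formats.
SAMPLE_PASSWD = "root:x:0:0:root:/root:/bin/sh\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
SAMPLE_FSTAB = "# <fs> <mount> <type> <opts> <dump> <pass>\n/dev/sda1\t/\text4  defaults 0 1\n"

passwd_entries = parse_passwd(SAMPLE_PASSWD)
fstab_rows = parse_fstab(SAMPLE_FSTAB)
```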

You might be thinking that you don't read the entire file in and then re-write the entire file back out in order to perform insertions and deletions of records in the aforementioned variable-length record databases. You move a cursor and perform line deletion and insertion operations. But you are overlooking that whole file I/O is exactly what your text editor itself actually does when it loads and saves the file. The actual access method of the file itself is a flat file database one.

Everyday databases that are not flat file

These flat file databases illustrate the poor properties of flat file databases when one wants something other than sequential access. Looking up a host in /etc/hosts, or a user account in /etc/passwd, involves reading the file sequentially. There's no index, and entries are not sorted in order of the key that one uses to search. Look at the C library routines used to search these flat file databases (such as gethostent(), getpwent(), getfsent(), getgrent(), and getutxent()) and, with an exception that we'll get to in just a moment, you'll see sequential access methods. (The various getXbyY() routines are built on top of these: they simply call the sequential access routines until a match is found.)
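The relationship between the sequential routines and the getXbyY() routines can be sketched in Python like this. The function names echo the C library ones purely for readability; these are stand-ins operating on an in-memory string, not the real API, and the sample data is invented.

```python
def getpwent(text):
    """Yield one record at a time, in file order (sequential access only)."""
    for line in text.splitlines():
        if line:
            yield line.split(":")

def getpwnam(text, name):
    """Look up by user name: scan from the top until the first match."""
    for entry in getpwent(text):
        if entry[0] == name:
            return entry
    return None

def getpwuid(text, uid):
    """Look up by UID: the same linear scan, comparing a different field."""
    for entry in getpwent(text):
        if int(entry[2]) == uid:
            return entry
    return None

SAMPLE = (
    "root:x:0:0:root:/root:/bin/sh\n"
    "alice:x:1000:1000:Alice:/home/alice:/bin/sh\n"
)
```

Each lookup costs time proportional to the number of records in front of the match, and a miss reads the entire file.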

Hence, on the BSDs the actual user account databases are not flat file databases. They are Berkeley DB files, which are indexed by UID and by username. They are compiled from a flat file database, which is stored in /etc/master.passwd, by the pwd_mkdb program. The C library actually reads /etc/pwd.db or (if it can) /etc/spwd.db.
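The idea of compiling a flat file into keyed lookups can be sketched in Python with two dictionaries standing in for the Berkeley DB indexes. This is only an illustration of the principle; it is not how pwd_mkdb is implemented, the sample records are invented, and keeping the first entry for a duplicated UID is a choice made by this sketch.

```python
def build_indexes(text):
    """Read the flat file once and build lookups keyed by name and by UID."""
    by_name, by_uid = {}, {}
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(":")
        by_name[fields[0]] = fields
        # Sketch-only policy: the first record seen for a UID wins.
        by_uid.setdefault(int(fields[2]), fields)
    return by_name, by_uid

# Made-up records in a ten-field, master.passwd-like layout.
SAMPLE_MASTER = (
    "root:*:0:0::0:0:Charlie &:/root:/bin/sh\n"
    "toor:*:0:0::0:0:Bourne-again Superuser:/root:\n"
    "alice:*:1001:1001::0:0:Alice:/home/alice:/bin/sh\n"
)

by_name, by_uid = build_indexes(SAMPLE_MASTER)
```

After the one sequential pass, a lookup such as `by_uid[1001]` no longer needs to scan the records at all.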

The BSD "capability database" source file structure, which you will find in the likes of /etc/gettytab, /etc/login.conf, and /etc/termcap, isn't really a flat file. (The compiled file structure, found in /etc/login.conf.db and /etc/termcap.db, most definitely is not.) Records can include other records by reference, forming chains that have to be followed in order to find all of the fields for a given record. Indeed the compiler, cap_mkdb, does that very thing.
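Following such a chain of references might look like the Python sketch below. It assumes the records have already been parsed into dictionaries and that a field named "tc" holds the name of the referenced record; the record names and values are invented, and real capability files have more syntax than this handles.

```python
def resolve(records, name, _seen=None):
    """Return all fields for `name`, following "tc" references recursively."""
    seen = _seen if _seen is not None else set()
    if name in seen:
        raise ValueError(f"reference loop at {name!r}")
    seen.add(name)
    record = dict(records[name])      # copy, so the source table is untouched
    parent = record.pop("tc", None)
    if parent is None:
        return record
    merged = resolve(records, parent, seen)
    merged.update(record)             # the referring record's own fields win
    return merged

# Hypothetical, pre-parsed records forming the chain root -> staff -> default.
records = {
    "default": {"path": "/bin:/usr/bin", "umask": "022"},
    "staff": {"umask": "027", "tc": "default"},
    "root": {"ignorenologin": True, "tc": "staff"},
}

root_fields = resolve(records, "root")
```

A single sequential read of one record is not enough here: the reader has to jump around the file, which is exactly what a flat file access method does not provide.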

Flat files are not "ASCII text"

ASCII defines specific control characters for file, group, record, and unit (i.e. field) separators. They go mainly unused on Unices and Linux in favour of characters like space, TAB, LF, and colon.
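For comparison, a table that uses the dedicated ASCII separators could be written and read as in this Python sketch. The helper names and sample rows are made up; only the control character values come from ASCII itself.

```python
# ASCII file, group, record, and unit separators (codes 28 to 31).
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

def encode(records):
    """Join fields with US and records with RS."""
    return RS.join(US.join(fields) for fields in records)

def decode(data):
    """Split on RS, then on US; the inverse of encode()."""
    return [record.split(US) for record in data.split(RS)] if data else []

# Fields may contain colons, spaces, and even linefeeds without any escaping.
rows = [["alice", "C:\\home", "line one\nline two"], ["bob", "a:b:c", ""]]
round_tripped = decode(encode(rows))
```

The price is that such a file is awkward to view and edit with ordinary text tools, which goes some way to explaining why space, TAB, LF, and colon won out.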

People sometimes state that "flat file databases comprise simple ASCII text". As should be clear from some of the aforegiven examples, this is not so. This is only the case for one particular common type of variable-length record flat file database. However the equally-widespread login database on your Unix or Linux system is a flat file database, too, but one where various fields are most definitely not interpreted as ASCII character encodings.

(And from the wider world when one has cast aside the Unix/Linux tunnel vision: This is a misconception that was helped by the fact that xBase stored field contents encoded in ASCII in dbf files. People used to talk of xBase, once touted as the most popular database system in the world, as a "flat file" system, although they were really using the term in opposition to "relational" or "object-oriented" and mis-using it in the way that people abuse "legacy" for "old". Thus "flat file" systems "used ASCII". But that was not even true. dbf files had quite a lot of stuff that was, again, most definitely not interpreted as ASCII character encodings.)

Further reading

  • Donald K. Burleson (1998). Inside the Database Object Model. CRC Press. ISBN 9780849318078.
  • Rob Mattison (1998). Understanding Database Management Systems. McGraw-Hill. ISBN 9780070499997.
JdeBP
2

There may not be a very good answer to this; I doubt there is one exact meaning for the term.

In general, it means something with only a little structure and useful by itself (without required indices, etc), but from there the definitions diverge:

FOLDOC defines a flat file as something containing (ASCII) text, apparently as opposed to binary data. The (unsourced) definition on Wikipedia seems to be "something that has to be read and written to/from memory in entirety", as opposed to something that could be updated by only modifying the relevant parts. A usual text file would fit that definition, as lines have variable lengths and are stored back-to-back, so most changes would shift the positions of all the following lines.

A text dump of a database would count as a "flat file" for me, but I'm not so sure about a binary file. YMMV.

ilkkachu
2

There is no such thing as a "flat file" in the POSIX standard for Unix, except for a single mention of it:

The format of the system mailbox is intentionally unspecified. Not all systems implement system mailboxes as flat files, particularly with the advent of multimedia mail messages. Some system mailboxes may be multiple files, others records in a database.

In my line of work (bioinformatics), a "flat file" is usually a plain text ASCII file which contains data, usually genomic data in Fasta format (or in one of the derivative formats).

This is opposed to data stored in a database, or in a non-ASCII file (such as files in BAM or CRAM format), although non-ASCII files are also called "flat files" by some.

The Wikipedia definition of "flat file" that @ilkkachu alludes to is not entirely appropriate here, as genomic data often comes in huge amounts. Reading the entire DNA sequence of a species from a flat file into RAM may be slow, unnecessary, and in many cases impossible.

Bioinformaticians also seem to use the term "flat file" loosely for any type of data file that they use, even if the data file is structured in such a way that its records may contain references within the file, as may be the case in GTF/GFF and GenBank files, for example.

Maybe the most generic definition of a "flat file" is a file containing data which requires the whole file to be rewritten if a change to the data has to be made. This also covers some binary formats, I believe.

Kusalananda
0

Joint Database Technology - Flat files for data storage, working in tandem with Perl SDBM files of key/value pairs tied to program hash tables, for persistent random access indexing to fixed-length record offsets (in bytes). This is an ISAM/NoSQL database system implementation. SEE: http://www.perlmonks.org/?node_id=1121222

Flat File databases of "text" data stored in fixed-length records can be set up for instantaneous random access by employing SDBM binary files of key/value pairs tied to program Hash tables, where the "value" is the record offset in bytes to position the file pointer, and where the key is: a single field, partial field, or multiple single and/or partial fields concatenated together, from the data contained within the Flat File records.
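The key-to-offset idea can be sketched as follows. This is Python rather than Perl, an ordinary dictionary stands in for the SDBM file, and the 20-byte record layout (a 9-digit account number followed by an 11-character name) is invented for the example.

```python
import io

RECORD_LEN = 20  # hypothetical layout: 9-byte account number + 11-byte name
KEY_LEN = 9

def build_index(f):
    """Scan the flat file once, mapping each key to its record's byte offset."""
    index = {}
    f.seek(0)
    offset = 0
    while True:
        record = f.read(RECORD_LEN)
        if len(record) < RECORD_LEN:
            break
        index[record[:KEY_LEN].decode("ascii")] = offset
        offset += RECORD_LEN
    return index

def fetch(f, index, key):
    """Random access: position the file pointer at the stored offset and read."""
    f.seek(index[key])
    return f.read(RECORD_LEN).decode("ascii")

# Made-up data; io.BytesIO stands in for the flat file on disk.
data = b"".join(
    f"{acct:09d}{name:<11}".encode("ascii")
    for acct, name in [(1, "Alice"), (2, "Bob"), (3, "Carol")]
)
flat_file = io.BytesIO(data)
index = build_index(flat_file)
record = fetch(flat_file, index, "000000002")
```

The flat file itself is unchanged; the index lives alongside it and only records where each key's record starts.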

Social Security Number, or Loan Number, or Account Number, would be examples of fields that could be used as a UNIQUE key.

You can set up multiple indexes of SDBM files to your Flat Files, each SDBM file containing a different set of key/value pairs based upon what fields or partial fields make up the "key". ALTERNATE indexes, with DUPLICATES, can be employed by adding a sequence number to the "key".
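One possible way to handle duplicates with a sequence number is sketched below in Python, again with a dictionary standing in for the SDBM file. The "#" separator and the four-digit sequence width are assumptions of this sketch, not a convention taken from the linked post.

```python
def add_alternate(index, key, offset):
    """Store offset under key#NNNN, using the first unused sequence number."""
    seq = 0
    while f"{key}#{seq:04d}" in index:
        seq += 1
    index[f"{key}#{seq:04d}"] = offset

def lookup_alternate(index, key):
    """Return every offset stored for key, in insertion order."""
    offsets = []
    seq = 0
    while (full_key := f"{key}#{seq:04d}") in index:
        offsets.append(index[full_key])
        seq += 1
    return offsets

# Hypothetical alternate index on last name, where duplicates are expected.
by_last_name = {}
add_alternate(by_last_name, "SMITH", 0)
add_alternate(by_last_name, "JONES", 20)
add_alternate(by_last_name, "SMITH", 40)
smith_offsets = lookup_alternate(by_last_name, "SMITH")
```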

A separate Flat File could be employed that held multiple CHILD records relative to each single PARENT record held in another Flat File. Example: A LOAN flat file (Parent) and a COLLATERAL flat file (CHILDREN), where the CHILDREN are all the collateral securing each particular installment loan.

I use FLAT FILES of up to 4 GB each and SDBM files of up to 2 GB each.

Data can be logically segregated for ease of both random and sequential access. Example: US_Census_2010_TX_A.dat could be a flat file that held the data for just citizens of Texas whose last name begins with the letter "A". You could have 50 STATE subdirectories holding 26 flat files each (one for A-Z). A batch application or user-interface could access the correct file by understanding the Flat File and SDBM file naming convention.
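A naming convention like this could be encoded in a small helper, sketched here in Python. The function name, the root directory argument, and the rule of using the upper-cased first letter are assumptions; only the file name pattern comes from the example above.

```python
from pathlib import PurePosixPath
import string

def census_file(root, state, last_name):
    """Return the path of the flat file that should hold this person's record."""
    letter = last_name[:1].upper()
    if letter not in string.ascii_uppercase or not letter:
        raise ValueError(f"cannot file last name {last_name!r} under A-Z")
    # One subdirectory per state, one file per initial letter.
    return PurePosixPath(root) / state / f"US_Census_2010_{state}_{letter}.dat"

path = census_file("data", "TX", "adams")
```

With the file chosen by name alone, the application only needs to open one comparatively small flat file (and its index) per lookup.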

NOTE: A similar technique can be used to build a huge MS-ACCESS (actually MS-Jet "Red" Engine) database, accessed via ODBC, so that each *.MDB file only contains one or more back-end TABLE OBJECTS, and each MDB file acts as a: single table, group of tables, or partial table common to all MDB files. The MDB files are not a database in and of themselves, as they typically would be. They are simply containers to store huge amounts of data, logically segregated, and indexed, for optimized sequential and random access. MS-Access has its own indexing capabilities, so SDBM would not be employed here for indexing. MS-Access software is not needed since the MDAC and ODBC Administrator utility come factory installed on Windows 7. Use ODBC Administrator to create the empty MDB files, and use a programming language to issue the SQL statements which build the tables and indexes for those tables.

user13782