File Formats

Summary/Best Practices

Ideally, files for long-term preservation should be in open (non-proprietary) formats, have relatively complete embedded metadata, and be stored on media that is not copy-protected. Embed metadata in the files themselves whenever possible.

For guidelines on determining sustainable formats, see "Sustainability of Digital Formats: Planning for Library of Congress Collections". The 'Content Categories' section is particularly useful.




Step by Step


Level 0 to Level 1

  • When you can give input into the creation of digital files, encourage the use of a limited set of known open formats and codecs

Formats

Provide users with recommended file formats:

For guidelines on determining sustainable formats, see "Sustainability of Digital Formats: Planning for Library of Congress Collections". The 'Content Categories' section is particularly useful.

An example set of formats is provided in the Table of Preferred Formats linked above in the summary. It is loosely based on the Bentley Historical Library's "Format Conversion Strategies For Long-Term Preservation."

See other useful example tables at:

File Naming

Provide users with ‘best practice’ guidelines for naming files:

  • Use only letters, digits, hyphens and underscores. No spaces, ampersands, apostrophes, parentheses, etc.
  • Keep names short and easy to read, using camel case to distinguish words (e.g. StudentGovernmentBylaws.docx)
  • When including dates in filenames, use the ISO 8601 date format (YYYY-MM-DD). This greatly enhances sorting. (e.g. MeetingNotes_2014-03-27.docx)
  • Think about how file names will sort when naming large numbers of files. Use leading zeros so that numbered files sort in the intended order: Page2.docx will often sort after Page100.docx, so consider using ‘Page002.docx’ instead.

See Stanford Libraries’ File Naming Guidelines for additional suggestions.
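
The naming guidelines above can be checked or applied in bulk with a short script. The following is a minimal sketch in Python, not a definitive tool: the function names (is_safe_name, dated_name, numbered_name) and the default padding width of three digits are assumptions made for illustration.

```python
import re
from datetime import date

# Letters, digits, hyphens and underscores, with an optional extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_safe_name(filename):
    """Return True if the name uses only the recommended characters."""
    return SAFE_NAME.fullmatch(filename) is not None


def dated_name(stem, day, extension):
    """Build a name ending in an ISO 8601 date (YYYY-MM-DD)."""
    return f"{stem}_{day.isoformat()}.{extension}"


def numbered_name(stem, number, extension, width=3):
    """Build a name with a zero-padded number so that names sort in order."""
    return f"{stem}{number:0{width}d}.{extension}"
```

For example, is_safe_name("Student Government Bylaws.docx") is False because of the spaces, while numbered_name("Page", 2, "docx") gives a name that sorts before Page100.docx.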


Level 1 to Level 2

  • Inventory of file formats in use

Depending on where your items are stored, you may be able to use some of the following tools to inventory file formats. Several graphical and command line tools are available for file identification, including DROID, JHOVE and FITS.
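
Identification tools such as DROID look at the content of a file rather than trusting its extension. The sketch below illustrates that idea in Python with a handful of well-known leading byte signatures. It is a simplified illustration only and is not how any of the tools above are implemented; the names SIGNATURES, identify_header and sniff_format are assumptions, and the table covers just a few formats.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}


def identify_header(header):
    """Return a format name if the bytes start with a known signature."""
    for signature, name in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


def sniff_format(path):
    """Read the first bytes of a file and identify them, ignoring the extension."""
    with open(path, "rb") as handle:
        return identify_header(handle.read(16))
```

A file named photo.jpg that actually contains PDF data would be reported as "PDF" by sniff_format, and a file with no recognised signature returns None.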

For a quick inventory of file extensions in use in a directory, simply sort the folder by file type (usually by clicking on the 'type' column).

Command line tools can also be used to find and count file types in a directory and all subdirectories:

Windows

prompt> dir *.jpg /b /s | find /c /v ""

Returns the number of files with a .jpg extension at or below the current directory. Replace .jpg with any file extension you are interested in inventorying. This only reports on file extensions; it does not test whether a file is actually of that type.

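Mac OS X / Unix

prompt> find . -type f -iname "*.jpg" | wc -l

Returns the number of files with a .jpg extension (in any letter case) at or below the current directory. Use wc -l rather than wc -w: counting lines gives one per file, whereas counting words overcounts any file whose name or path contains spaces.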
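
The commands above count one extension at a time. To inventory every extension in a directory tree in one pass, a short Python script can be used on any platform. This is a minimal sketch; the function name count_extensions and the "(none)" label for files without an extension are assumptions. Like the commands above, it only reports on file extensions and does not test the actual file type.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files by lower-cased extension at or below root."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling count_extensions(".").most_common() returns (extension, count) pairs for the current directory, most frequent first.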



Level 2 to Level 3

  • Monitor file format obsolescence issues

Stay abreast of the FADGI (Federal Agencies Digital Guidelines Initiative) and Library of Congress sustainability guidelines. Useful sites include:


Level 3 to Level 4

  • Perform format migrations, emulation and similar activities as needed

Start by identifying file formats that don’t conform to your recommended list of preservation formats.
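
As a rough first pass, this identification step can be scripted by comparing file extensions against your recommended list. The sketch below is in Python; the set PREFERRED_EXTENSIONS is a hypothetical placeholder, not a recommended list, and the function name migration_candidates is an assumption. Because it relies on extensions alone, treat its output as a list of candidates to review rather than a final answer.

```python
from pathlib import Path

# Hypothetical placeholder: substitute your own recommended preservation formats.
PREFERRED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".wav", ".txt", ".csv"}


def migration_candidates(root, preferred=PREFERRED_EXTENSIONS):
    """List files at or below root whose extension is not on the preferred list."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() not in preferred
    )
```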

Below are a few select tools, many freely available, to perform format migrations.

File types                       | Command line and batch editing tools | Graphical user interface tools
Image files (including PDF)      | ImageMagick                          | GIMP, Pixlr, Photoshop, IrfanView
PDF files                        | Ghostscript                          | Adobe Acrobat
Audio/video files                | FFmpeg                               | VLC Media Player, ffmprovisr
Illustrator vector graphic files | ImageMagick                          | Inkscape

The graphical user interface tools are intended for small numbers of conversions, though some programs do have batch processing features.
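
Command line tools like these can be driven from a script to migrate many files at once. The following Python sketch wraps FFmpeg, which must be installed separately. The WAV to FLAC conversion is only a hypothetical example and not a recommendation, and the function names ffmpeg_command and migrate_audio are assumptions. By default it only prints the commands it would run (a dry run), and it never deletes the original files.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(source, target_extension=".flac"):
    """Build an FFmpeg command that writes a converted copy next to the source."""
    source = Path(source)
    target = source.with_suffix(target_extension)
    # -n: never overwrite an existing output file; -i: the input file.
    return ["ffmpeg", "-n", "-i", str(source), str(target)]


def migrate_audio(root, source_extension=".wav", dry_run=True):
    """Convert every matching file at or below root; originals are kept."""
    commands = [
        ffmpeg_command(path)
        for path in sorted(Path(root).rglob("*"))
        if path.is_file() and path.suffix.lower() == source_extension
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises an error if FFmpeg reports a failure.
            subprocess.run(command, check=True)
    return commands
```

Review the dry-run output first, then call migrate_audio(root, dry_run=False) to perform the conversions. Spot-check the converted copies before retiring any originals.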