File Formats
Summary/Best Practices
Ideally, files for long-term preservation should be in open (non-proprietary) formats, have relatively complete embedded metadata, and be stored on media that is not copy-protected. Embed metadata in files when possible.
For guidelines on determining sustainable formats, see "Sustainability of Digital Formats: Planning for Library of Congress Collections". The 'Content Categories' section is particularly useful.
Step by Step
Level 0 to Level 1
- When you can give input into the creation of digital files, encourage the use of a limited set of known open formats and codecs
Formats
Provide users with recommended file formats:
An example set of formats is provided in the Table of Preferred Formats linked above in the summary. It is loosely based on the Bentley Historical Library "Format Conversion Strategies For Long-Term Preservation."
See other useful example tables at:
- University of Michigan Deep Blue Preservation and Format Support Policy
- Virginia Tech Digital Library and Archives
File Naming
Provide users with ‘best practice’ guidelines for naming files:
- Use only letters, digits, hyphens and underscores. No spaces, ampersands, apostrophes, parentheses, etc.
- Keep names short and easy to read, using camel case to distinguish words (e.g. StudentGovernmentBylaws.docx)
- When including dates in filenames, use the ISO 8601 date format (YYYY-MM-DD), which greatly improves sorting (e.g. MeetingNotes_2014-03-27.docx)
- Think about how file names will sort when naming large numbers of files. Consider using leading zeros to bring files to the top or to keep numbered files in order. For example, Page2.docx will often sort after Page100.docx; consider using ‘Page002.docx’ instead.
See Stanford Libraries’ File Naming Guidelines for additional suggestions.
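The naming rules above can be checked or applied automatically. The following is a minimal Python sketch, not part of the original guidelines; the function names `is_safe_name`, `dated_name` and `padded_name` are hypothetical, and the pattern assumes a name consists of one stem plus a single extension.

```python
import re
from datetime import date

# Assumed reading of the rules above: letters, digits, hyphens and
# underscores only, followed by a single dot-separated extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_safe_name(filename: str) -> bool:
    """Return True if the whole name matches the allowed character set."""
    return SAFE_NAME.fullmatch(filename) is not None


def dated_name(stem: str, day: date, extension: str) -> str:
    """Build a name such as MeetingNotes_2014-03-27.docx (ISO 8601 date)."""
    return f"{stem}_{day.isoformat()}.{extension}"


def padded_name(stem: str, number: int, extension: str, width: int = 3) -> str:
    """Zero-pad a sequence number so that names sort in numeric order."""
    return f"{stem}{number:0{width}d}.{extension}"
```

For instance, `is_safe_name("Meeting Notes (final).docx")` is rejected because of the spaces and parentheses, while `padded_name("Page", 2, "docx")` produces a name that sorts before the padded name for page 100.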
Level 1 to Level 2
- Inventory of file formats in use
Depending on where your items are stored, you may be able to use some of the following tools to inventory file formats. Several graphical and command line tools are available for file identification, including DROID, JHOVE and FITS.
For a quick inventory of file extensions in use in a directory, sort the folder by file type in your file manager (usually by clicking the 'Type' column).
Command line tools can also be used to find and count file types in a directory and all subdirectories:
Windows |
prompt> dir *.jpg /b /s | find /c /v "" |
Returns the number of files with a .jpg extension at or below the current directory. Replace .jpg with any file extension you are interested in inventorying. This only reports on file extensions, and does not test if the file is actually of that type. |
Mac OSX / Unix |
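prompt> find . -name "*.jpg" | wc -l |
Returns the number of files with a .jpg extension at or below the current directory. Counting lines (wc -l) rather than words (wc -w) keeps file names that contain spaces from being counted more than once. |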
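The commands above count one extension at a time. To tally every extension in a directory tree in one pass, a small script can help. This is a minimal Python sketch using only the standard library; the function name `count_extensions` is hypothetical, and like the commands above it looks only at file extensions, not at the actual file contents.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root: str) -> Counter:
    """Count files at or below root, grouped by lowercased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_extensions(".").most_common()` returns the extensions in use, most frequent first. For identification based on file contents, use a tool such as DROID, JHOVE or FITS.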
Level 2 to Level 3
- Monitor file format obsolescence issues
Stay abreast of the FADGI (Federal Agency Digitization Guidelines Initiative) and Library of Congress sustainability guidelines. Useful sites include:
Level 3 to Level 4
- Perform format migrations, emulation and similar activities as needed
Start by identifying file formats that don’t conform to your recommended list of preservation formats.
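One way to build that list of candidates is to compare the extensions found on disk with your recommended formats. This is a minimal Python sketch: the set `PREFERRED_EXTENSIONS` is a hypothetical placeholder, not the Table of Preferred Formats, and the check relies on extensions only.

```python
from pathlib import Path

# Hypothetical list for illustration only; substitute the extensions
# from your own institution's recommended preservation formats.
PREFERRED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".wav", ".txt", ".csv", ".xml"}


def find_nonconforming(root, preferred=PREFERRED_EXTENSIONS):
    """Return sorted paths under root whose extension is not preferred."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() not in preferred
    )
```

The resulting paths are only candidates for migration; confirm the actual format with an identification tool before converting anything.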
Below are a few select tools, many freely available, to perform format migrations.
File types | Command line and batch editing tools | Graphical tools (best for small numbers of conversions; some also support batch processing) |
---|---|---|
Image Files (including PDF) | ImageMagick | GIMP, Pixlr, Photoshop, IrfanView |
PDF files | Ghostscript | Adobe Acrobat |
Audio/Video Files | FFmpeg | VLC Media Player, ffmprovisr |
Illustrator vector graphic files | ImageMagick | Inkscape |
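Command line tools such as FFmpeg lend themselves to batch migration. The following is a minimal Python sketch that only plans the work by building one FFmpeg command per file without running anything. The function name `plan_audio_migration` and the .wma to .wav pairing are assumptions chosen for illustration; it uses FFmpeg's basic `-i input output` form and the `-n` option, which stops FFmpeg from overwriting an existing output file.

```python
from pathlib import Path


def plan_audio_migration(root, source_ext=".wma", target_ext=".wav"):
    """Return a list of FFmpeg argument lists, one per file to convert.

    Nothing is executed here; the caller decides when to run the commands.
    """
    commands = []
    for src in sorted(Path(root).rglob("*")):
        if src.is_file() and src.suffix.lower() == source_ext:
            # Write the new file next to the original, changing only the extension.
            dst = src.with_suffix(target_ext)
            # -n makes FFmpeg refuse to overwrite an existing output file.
            commands.append(["ffmpeg", "-n", "-i", str(src), str(dst)])
    return commands
```

After reviewing the plan, each command can be passed to `subprocess.run(command, check=True)`. Keep the original files until the converted copies have been checked.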