Wherever possible, MiModD tries to adhere to the official
specification of each file format used. Certain formats, however, have
ambiguous specifications or have specifications that are incompatible
with those of other formats in MiModD analysis workflows. In these
cases, we are forced to interpret or modify the specs for the sake of
clarity and user experience.

We have tried hard to make error messages, particularly those related
to file format violations, as clear and user-friendly as possible. In
most situations, when you pass data to a MiModD tool that cannot
process it, the tool should complain loudly and clearly. If the error
message alone is not enough to understand the exact cause of the
problem, it may help to study the definition of the problematic file
format in the list below.

In addition, the format specifications can help you avoid
incompatibilities should you want to pass MiModD-generated files to
other software tools.


Character Encoding
==================

The definitions of many text-based file formats used in bioinformatics
do not specify their expected character encoding explicitly, but
implicitly almost all expect English alphabet characters to be ASCII-
encoded. Characters outside the ASCII-range, if not explicitly
disallowed in a format, are a common cause of problems and
incompatibilities. In recent years, UTF-8, a superset of ASCII, has
emerged as the standard for universal character encoding, and some
modern bioinformatics file format specifications (see the VCF format
below) already define UTF-8 as the file encoding of the format.

Since version 0.1.7.2, **all output produced by MiModD is UTF-8
encoded**. Any input is expected to be UTF-8 encoded as well, but most
tools will handle decoding errors flexibly. For users this means:

When passing data to MiModD:

* if your data (including sequence/chromosome names, annotations,
  sample names, etc.) contains only ASCII-range characters, you will,
  in most circumstances, not have to worry about its encoding. If the
  data has been stored in an ASCII-compatible encoding (UTF-8,
  latin-1, pure ASCII, etc.), things will just work.

* if your data contains characters outside the ASCII-range (like
  non-English alphabet characters) and the input format allows them,
  you should make sure the input data has been saved in UTF-8
  encoding. Otherwise, be prepared to see your special characters
  decoded incorrectly and rendered unreadable.

When using MiModD-generated data with other tools:

* if your data contains only ASCII-range characters, you can pass
  output from MiModD to any other tool that expects ASCII-encoded
  input.

* if your data contains characters outside the ASCII-range, you are
  at the mercy of the downstream tool, which may or may not expect
  UTF-8 encoding.

Hint: To increase the chance of successful communication with other
  tools, make sure UTF-8 is set as default encoding in your system
  locale settings. MiModD reads and writes UTF-8, independent of your
  system encoding settings, but other programs may use the system
  encoding.
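
The distinction between ASCII-only, UTF-8 and other encodings made
above can be checked before passing a file to MiModD. The following is
a minimal Python sketch, not part of MiModD: the function names
"classify_encoding" and "reencode_to_utf8" are hypothetical, and
latin-1 is used only as an example of a source encoding you would have
to know in advance.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for a chunk of raw file content."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Hypothetical sample name containing a non-ASCII character.
sample_name = "Müller_sample"
latin1_bytes = sample_name.encode("latin-1")

print(classify_encoding(b"chrI"))                       # ascii
print(classify_encoding(sample_name.encode("utf-8")))   # utf-8
print(classify_encoding(latin1_bytes))                  # other
print(classify_encoding(reencode_to_utf8(latin1_bytes, "latin-1")))  # utf-8
```

Data classified as "other" is the case in which special characters
are likely to be decoded incorrectly unless the file is re-saved as
UTF-8 first.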


MiModD File Format Specifications
=================================


FASTA
-----

FASTA is a text format that can store multiple sequences in a single
file.

Each sequence begins with a single-line description, followed by lines
of sequence data. Description lines are distinguished from sequence
data lines by a greater-than ">" symbol at the beginning of the line.

Lines can be terminated with either "CR+LF" (Windows-style) or "LF"
(Unix/Linux-style). Blank lines are not allowed.

The sequence data lines should be formatted as blocks of equal line
length.

In MiModD, FASTA format is used exclusively for reference genome input
files and MiModD-specific restrictions apply to the description lines
found in the files. Specifically, **description lines must not
contain**:

* non-printable or non-ASCII characters

* whitespace characters

* any of the characters: "<>[]*;=,"

This restriction is enforced by all tools that require a fasta
reference genome. The MiModD.sanitize tool can be used to substitute
illegal characters in description lines and also ensures that sequence
data lines are block-formatted.

Note: The character restriction exists because MiModD will use the
  full content of the description line as the sequence name and we
  must ensure that this name is a valid sequence name in all
  downstream data formats generated during any analysis.
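
The character restrictions listed above can be expressed as a short
check. This is a minimal Python sketch and not the code MiModD itself
uses: the helper name "illegal_characters" is hypothetical, and the
sketch assumes that the leading ">" marker is treated as syntax rather
than as part of the sequence name.

```python
import string

# Characters explicitly forbidden in description lines (from the list above).
FORBIDDEN = set("<>[]*;=,")
# Printable, non-whitespace ASCII characters minus the forbidden ones.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation) - FORBIDDEN


def illegal_characters(description_line: str) -> list:
    """Return the illegal characters found in a FASTA description line.

    Assumption: the leading '>' marker is not part of the sequence name.
    """
    name = description_line.rstrip("\r\n")
    if name.startswith(">"):
        name = name[1:]
    return [char for char in name if char not in ALLOWED]


print(illegal_characters(">chrI"))                 # []
print(illegal_characters(">chrI Caenorhabditis"))  # [' ']
print(illegal_characters(">contig=7;draft"))       # ['=', ';']
```

An empty result means the description line satisfies all three
restrictions; anything else lists the characters that would have to
be substituted.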

See also: MiModD tools that use fasta input files

  snap, snap-batch, index, varcall

  MiModD tools to manipulate fasta files

  MiModD.sanitize

======================================================================


SAM
---

The Sequence Alignment/Map format is a TAB-delimited text format
defined as part of the hts-specs project.

MiModD sticks quite closely to the official specification, in
particular:

* SAM headers must follow the official format, where every line
  (except comment lines) must conform to the following regular
  expression:

     /^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/

  This also means that only printable characters from the ASCII range
  are allowed in non-comment header lines.

  Comment lines must conform to the regular expression:

     /^@CO\t.*/

  meaning they have to start with "@CO" and a "TAB", but the actual
  comment text can consist of any characters (which MiModD will encode
  using UTF-8).

* Body lines consist of eleven TAB-delimited columns with the exact
  character restrictions given in the official specs.
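
The two header regular expressions quoted above can be tried out
directly with Python's "re" module. The following is a minimal sketch
for experimenting with header lines, not MiModD's own validation code;
the function name "is_valid_header_line" and the example lines are
hypothetical.

```python
import re

# The two patterns from the text above, written as Python raw strings.
HEADER_LINE = re.compile(r"^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_LINE = re.compile(r"^@CO\t.*")


def is_valid_header_line(line: str) -> bool:
    """Check one SAM header line against the patterns above."""
    line = line.rstrip("\r\n")
    if line.startswith("@CO\t"):
        # Comment lines may contain arbitrary characters after the TAB.
        return COMMENT_LINE.match(line) is not None
    return HEADER_LINE.match(line) is not None


print(is_valid_header_line("@HD\tVN:1.0\tSO:unsorted"))      # True
print(is_valid_header_line("@CO\tSample from Müller lab"))   # True
print(is_valid_header_line("@RG\tID:1\tSM:Müller"))          # False
print(is_valid_header_line("@HD VN:1.0"))                    # False
```

The third line fails because of the non-ASCII character in a
non-comment line, the fourth because the fields are separated by a
space instead of a TAB.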

See also: MiModD tools that accept SAM input files

  snap, snap-batch

  MiModD tools that produce sam output files

  snap, snap-batch, header

  MiModD tools to manipulate sam files

  convert, reheader

======================================================================


BAM
---

The Binary Alignment/Map format is the binary companion of the SAM
format and is also defined as part of the hts-specs project.

MiModD sticks to these specifications as it does for the SAM format.

See also: MiModD tools that accept bam input files

  snap, snap-batch, varcall, delcall, index

  MiModD tools that produce bam output files

  snap, snap-batch

  MiModD tools to manipulate bam files

  convert, reheader, sort

======================================================================


VCF
---

The Variant Call Format is defined as part of the hts-specs project.

See also: MiModD tools that use vcf input files

  map, varreport

  MiModD tools that produce vcf output files

  varextract

  MiModD tools to manipulate vcf files

  vcf-filter, rebase, annotate

======================================================================


BCF
---

The Binary Call Format is the binary counterpart of VCF and is also
defined as part of the hts-specs project.

See also: MiModD tools that use bcf input files

  varextract, delcall, covstats

  MiModD tools that produce bcf output files

  varcall

  MiModD tools to manipulate bcf files

======================================================================


CloudMap-style sequence dictionary
----------------------------------

This format is defined by the CloudMap analysis pipeline, in which it
is referred to as an *Other Species Configuration File*.

It is used to specify the names of the contigs (or chromosomes) along
with their sizes that make up a given reference genome.

Where needed, MiModD embeds this information into its output files.
As long as you are working with MiModD-generated files, a sequence
dictionary file is therefore never needed. Currently, only the MiModD
map tool provides an option for specifying such a file as input, which
enables its use with external input files.

A sequence dictionary file has a simple two-column tab-delimited
format, in which each line consists of a contig name (exactly as it
appears in the corresponding reference file) in the first column and
the length of that contig in megabases (rounded up) in the second.

As an example, this is what a sequence dictionary for the six
chromosomes of the roundworm C. elegans could look like:

   I     16
   II    16
   III   14
   IV    18
   V     21
   X     18

Remember, however, that you may have to modify the sequence names to
match those defined in your reference file before using any pre-made
sequence dictionary.
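
If you know the contig lengths in bases, the two-column layout
described above is easy to generate. The following is a minimal Python
sketch under that assumption; the function name
"sequence_dictionary_lines" and the contig names and lengths are
hypothetical and only illustrate the rounding up to whole megabases.

```python
def sequence_dictionary_lines(contig_lengths):
    """Yield 'name<TAB>size' lines with sizes in megabases, rounded up."""
    for name, length_in_bases in contig_lengths.items():
        # Ceiling division with integers avoids floating-point rounding.
        megabases = -(-length_in_bases // 1_000_000)
        yield f"{name}\t{megabases}"


# Hypothetical contig lengths in bases, for illustration only.
example_lengths = {"chrA": 15_072_434, "chrB": 13_000_000, "chrC": 999}

for line in sequence_dictionary_lines(example_lengths):
    print(line)
```

With these made-up lengths, the sizes written to the second column
are 16, 13 and 1. The contig names must still match those in your
reference file exactly.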
