RepeatSeq Introduction

RepeatSeq is a microsatellite repeat genotyper for reference-mapped Illumina reads. Genotyping repeats is fundamentally distinct from calling SNPs or indels in non-repetitive sequence in that there is no sound basis for inferring homology between pairs of aligned repeat units. RepeatSeq therefore scores reads in terms of allele length, or the number of sequenced bases within a read separating the non-repetitive flanking boundaries aligned to the reference, irrespective of intervening alignment gaps. Thus, although several separate reads of the same allelic variant might have been aligned with the gap/insertion at a different location within the repeat, they will all yield the same allele length call with this method. This approach effectively negates the well-known problem of large numbers of false positive SNP and indel calls resulting from inconsistent alignment of ambiguously positioned indels. RepeatSeq outputs genotypes in several formats, including VCF, for compatibility with other variant callers.
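The allele-length idea above can be illustrated with a minimal sketch. This is not RepeatSeq's own code: the function name `allele_length`, the 0-based half-open coordinates, and the use of a CIGAR string as input are all assumptions made for illustration. It counts the read bases that fall between the repeat boundaries, so the position of a gap inside the repeat does not change the result.

```python
import re

# Splits a CIGAR string such as "9M2D19M" into (length, operation) pairs.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def allele_length(read_start, cigar, repeat_start, repeat_end):
    """Count sequenced bases between the repeat boundaries (hypothetical helper).

    read_start   -- 0-based reference position of the first aligned base
    repeat_start -- 0-based reference start of the repeat (inclusive)
    repeat_end   -- 0-based reference end of the repeat (exclusive)
    Returns None if the read does not reach into both flanks.
    """
    ref_pos = read_start
    count = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            # Aligned bases: count the part that overlaps the repeat interval.
            overlap = min(ref_pos + n, repeat_end) - max(ref_pos, repeat_start)
            count += max(0, overlap)
            ref_pos += n
        elif op == "I":
            # Inserted bases exist only in the read; count them if they sit
            # anywhere between the two flanking boundaries.
            if repeat_start <= ref_pos <= repeat_end:
                count += n
        elif op in "DN":
            # Deleted bases exist only in the reference; nothing to count.
            ref_pos += n
        # S, H and P consume no reference bases and are ignored here.
    if read_start >= repeat_start or ref_pos <= repeat_end:
        return None
    return count


# Same 2-base deletion placed at two different positions inside a 12-base repeat:
print(allele_length(5, "9M2D19M", 10, 22))  # 10
print(allele_length(5, "7M2D21M", 10, 22))  # 10
```

Both alignments yield an allele length of 10 even though the aligner placed the deletion differently, which is the property the paragraph above describes.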


The current version of RepeatSeq is 0.5.0 and can be downloaded here. To build RepeatSeq, you will first need to download bamtools and fastahack:

[..navigate to repeatseq directory..]
Download bamtools, extract to a directory named "bamtools" within repeatseq/
Download fastahack, extract to a directory named "fastahack" within repeatseq/
$ mkdir bamtools/build
$ cd bamtools/build/
$ cmake ..
$ make
$ cd ../.. [working directory now repeatseq/]
$ make

RepeatSeq has been tested on Ubuntu 11 and several other Linux variants, as well as OS X Snow Leopard and OS X Lion.


To use RepeatSeq, you will need a BAM file, a reference FASTA file, and a region file. The region file is a two-column, tab-delimited text file in which each line gives the chromosomal location of a repetitive region in the first column and an annotation for that region in the second column. We recommend using Tandem Repeats Finder to identify repetitive regions in reference genomes. As a convenience, we provide region files for the human reference (HG19) and the Drosophila melanogaster reference (release 5.13). Region lists for additional reference genomes will be available shortly. Feel free to contact the developers with requests for reference genomes.
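As a rough illustration of the two-column layout, the sketch below reads a region file into records. The exact syntax of the location column is not specified above, so the `chr:start-end` form used here is an assumption, as are the names `Region` and `parse_regions` and the sample annotation text.

```python
from typing import Iterable, Iterator, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    annotation: str


def parse_regions(lines: Iterable[str]) -> Iterator[Region]:
    """Parse tab-delimited region lines (hypothetical helper).

    Assumes column 1 looks like "chr1:100-120"; adjust if your
    region file encodes locations differently.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        location, _, annotation = line.partition("\t")
        # rpartition keeps any ":" inside the chromosome name intact.
        chrom, _, span = location.rpartition(":")
        start, _, end = span.partition("-")
        yield Region(chrom, int(start), int(end), annotation)


# Hypothetical example line; real annotations depend on how the file was made.
for region in parse_regions(["chr1:100-120\tCA_repeat\n"]):
    print(region.chrom, region.start, region.end, region.annotation)
```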

You can customize the RepeatSeq method as well as the output with the following command line parameters:

-f          use only a specific read length or range of read lengths [use all reads]
-L          required number of reference matching bases BEFORE the repeat [3]
-R          required number of reference matching bases AFTER the repeat [3]
-M          minimum MapQ for a read to be used for allele determination [0]
-multi      exclude reads flagged with the XT:A:R tag
-pp         exclude reads that are not properly paired (for PE reads only)
-error      set a constant error rate
-haploid    assume a haploid rather than diploid genome
-repeatseq  write .repeatseq file
-calls      write .calls file
-t          include user-defined tag in the output filename
-o          number of flanking bases to output from each read

Refer to the included documentation for a more detailed explanation of the parameters and the output formats.


1. Does RepeatSeq only work with Illumina data?

In principle, RepeatSeq will work with any BAM file. However, we built our error model using whole-genome sequencing data generated on an Illumina GAII instrument. In the future we plan to build other platform-specific error models.

2. What is the minimum read size necessary for genotyping repeats?

Appropriate read sizes depend on the distribution of repeat lengths in the target genome. For human genomes we recommend reads of at least 75 bases. You can define which read sizes RepeatSeq considers by using the -f flag. For example, "-f 75:150" limits genotyping to reads in a BAM file that are 75-150 bases long.
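The effect of a length filter such as "-f 75:150" can be pictured with the short sketch below. It is not taken from RepeatSeq itself: the helper names are hypothetical, and treating both bounds as inclusive, and a bare value such as "75" as a single exact length, are assumptions.

```python
def parse_length_spec(spec):
    """Turn "75:150" into (75, 150) and "75" into (75, 75) (hypothetical helper)."""
    low, sep, high = spec.partition(":")
    if not sep:
        high = low  # a single value is treated as an exact read length
    return int(low), int(high)


def keep_read(read_length, spec):
    """Return True if the read length falls within the (assumed inclusive) range."""
    low, high = parse_length_spec(spec)
    return low <= read_length <= high


print(keep_read(100, "75:150"))  # True
print(keep_read(50, "75:150"))   # False
```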

3. Can I use RepeatSeq with genomes for which there is no reference sequence?

No, RepeatSeq requires both a reference genome and a region list of repetitive regions in that reference.

Refer to the included documentation for more information, or contact the developers if you have further questions or need additional assistance.