How to Annotate Genome Sequences With Prokka
An easy-to-follow intro to Prokka, sharing both the steps and my experience learning how to annotate prokaryotic genomes.
A few months ago, I started my first bioinformatics project, and as a beginner I often spend weeks figuring out tasks that might seem simple to others. One task that I have figured out, and I daresay mastered, is genome annotation with Prokka.
To commit the workflow to memory and share with anyone interested, I’m writing this tutorial on how to annotate genome sequences with Prokka.
In plain terms, genome annotation is the process of identifying and attaching biological meaning to raw genome sequences.
Without annotation, there is only so much we can learn from genome sequences. A raw genome file usually contains thousands of lines of A, T, G, and C, which by themselves don’t reveal much about an organism.
This is where annotation tools come in. Prokka (rapid prokaryotic genome annotation) interprets DNA sequences to reveal both the structural and functional elements within a genome. As the name suggests, Prokka is designed mainly for prokaryotic genomes (bacteria and archaea).
Rather than reinventing the wheel, Prokka functions as a wrapper, integrating several command-line tools (such as Prodigal for coding sequences and Aragorn for tRNAs) to handle different parts of the annotation process. It predicts coding sequences and other features, then assigns likely functions to them based on similarity to sequences in reference databases.
In practice, Prokka performs two key tasks:
Structural annotation – finding genes and other genomic features, and mapping their precise locations.
Functional annotation – determining what those genes encode and assigning putative roles.
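To make the distinction concrete, here is a minimal Python sketch that splits one GFF3-style record into its structural part (where the feature is) and its functional part (what it is thought to do). The record is invented for illustration: the contig name, coordinates, ID, and product are not from a real genome, and split_annotation is a hypothetical helper.

```python
# One tab-separated GFF3-style record. All values are made up for illustration.
record = (
    "contig_1\tProdigal:2.6\tCDS\t337\t2799\t.\t+\t0\t"
    "ID=DEMO_00001;locus_tag=DEMO_00001;product=hypothetical protein"
)


def split_annotation(line):
    """Separate the structural and functional parts of one GFF3 record."""
    seqid, source, ftype, start, end, score, strand, phase, attrs = line.split("\t")
    # Attributes are semicolon-separated key=value pairs.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    structural = {
        "contig": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
    }
    functional = {"product": attributes.get("product", "unknown")}
    return structural, functional


structural, functional = split_annotation(record)
print(structural)   # where the feature sits
print(functional)   # what it is predicted to encode
```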
In this article, we’ll walk through the steps for running Prokka and cover some of the common challenges (and troubleshooting tips) I encountered along the way.
Installing Prokka
There are various ways to install Prokka. Conda is commonly recommended because it handles dependencies automatically and works on Linux and macOS (and on Windows through WSL). However, I found using Docker images to be easier, so we’ll explore both methods.
Option 1: Installing Prokka with Conda
To use this method, you first need to install Conda or Miniconda. Linux or macOS users can install Miniconda from the terminal or from the official website.
Then install Prokka from the terminal with the command below:
conda install -c conda-forge -c bioconda prokka
Tip: While optional, it's recommended to install Prokka inside an environment to keep dependencies organized. For example, the commands below will:
Create a new environment named prokka_env
Install Prokka inside it
Activate the environment
conda create -n prokka_env -c conda-forge -c bioconda prokka
conda activate prokka_env
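After installing, it is worth confirming that the prokka command is actually reachable from your shell. The sketch below does this from Python; prokka_version is a hypothetical helper, and it assumes only that prokka supports the --version flag.

```python
import shutil
import subprocess


def prokka_version():
    """Return the output of `prokka --version`, or None if prokka is not on PATH."""
    if shutil.which("prokka") is None:
        return None
    result = subprocess.run(["prokka", "--version"], capture_output=True, text=True)
    # Check both streams, since command-line tools differ in where they print this.
    return (result.stdout or result.stderr).strip()


version = prokka_version()
print(version if version else "prokka not found; is the environment activated?")
```

If this prints the "not found" message inside a Conda setup, the usual cause is that the environment has not been activated in the current terminal.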
Windows User Note
You can't install Prokka directly on Windows because it relies on several Linux-specific tools. However, you can set up a Linux environment on your Windows system using Windows Subsystem for Linux (WSL).
To install WSL, open PowerShell as Administrator and run:
wsl --install
Once WSL is installed and set up, open your Linux terminal (e.g., Ubuntu) and install Prokka inside WSL using Conda as described earlier.
This roundabout method is one of the reasons I found using Docker images much easier, especially since I work on a Windows laptop.
Option 2: Installing Prokka with Docker Images
If you prefer a simpler setup or want to avoid installing Prokka and all the dependencies manually, you can use Docker.
Docker allows you to run Prokka in a container without worrying about operating system compatibility.
Step 1: Install Docker
Download and install Docker Desktop for Windows or Mac.
Follow the setup instructions and make sure Docker is running.
Step 2: Pull the Prokka Docker Image
Open your terminal (PowerShell, Command Prompt, or Linux terminal inside WSL) and run:
docker pull staphb/prokka
This command downloads a Prokka image maintained by StaPH-B (State Public Health Bioinformatics), a community group rather than the Prokka author.
Step 3: Run Prokka inside Docker
You can now annotate genomes by running:
docker run --rm -v /path/to/your/data:/data staphb/prokka prokka /data/your-genome.fasta
Replace /path/to/your/data with the directory on your computer that contains your FASTA file.
/data is the folder inside the container where your files will be accessible.
your-genome.fasta is your input file.
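The mapping between the folder on your computer and /data inside the container is the part that is easiest to get wrong. As a rough sketch, the Python helper below builds the same docker run command from a single FASTA path; docker_prokka_command is a hypothetical name, and the only assumption is the staphb/prokka image used above.

```python
from pathlib import Path


def docker_prokka_command(fasta_path, image="staphb/prokka"):
    """Build the `docker run` argument list for one FASTA file on the host."""
    fasta = Path(fasta_path).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "-v", f"{fasta.parent}:/data",    # host folder -> /data in the container
        image,
        "prokka", f"/data/{fasta.name}",  # path as seen inside the container
    ]


cmd = docker_prokka_command("your-genome.fasta")
print(" ".join(cmd))
# To launch it (requires Docker to be running):
# import subprocess; subprocess.run(cmd, check=True)
```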
Preparing Input Files
Once Prokka is installed, the next step is making sure your genome file is in the correct format. Prokka takes a FASTA file of assembled nucleotide sequences (contigs), typically with a .fasta, .fa, or .fna extension.
The sequences must be DNA (no protein or RNA), and each FASTA header should be clean, with no spaces or special characters that could confuse the parser.
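These checks are easy to automate before a run. The following is a minimal Python sketch, not part of Prokka: check_fasta and clean_header are hypothetical helpers, and the allowed character sets (A, C, G, T, N for sequence; letters, digits, underscore, dot, and hyphen for headers) are my own conservative assumptions.

```python
import re

# Assumption: only unambiguous bases plus N are expected in the sequence lines.
DNA_CHARS = set("ACGTNacgtn")


def check_fasta(lines):
    """Return a list of problems found in FASTA-formatted lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            # Flag anything other than letters, digits, underscore, dot, or hyphen.
            if not re.fullmatch(r"[A-Za-z0-9_.-]+", header):
                problems.append(f"line {number}: header '{header}' has spaces or special characters")
        elif not set(line) <= DNA_CHARS:
            problems.append(f"line {number}: non-DNA characters found")
    return problems


def clean_header(header_line):
    """Keep the first word of a header and replace unusual characters with underscores."""
    words = header_line[1:].split()
    first_word = words[0] if words else "unnamed"
    return ">" + re.sub(r"[^A-Za-z0-9_.-]", "_", first_word)


# Made-up example: one messy header and one RNA-like sequence line.
example = [">contig 1|len=12", "ACGTACGTNNAC", ">contig_2", "ACGUACGU"]
for problem in check_fasta(example):
    print(problem)
print(clean_header(">contig 1|len=12"))
```

With a real file, you could pass an open file handle to check_fasta instead of a list, since it only needs something that yields lines.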
Running Genome Annotations with Prokka
With your input genome ready, you're now set to run Prokka and generate annotation results.
Quick Start:
For a basic run with default settings, use:
prokka genome.fasta
By default, this will:
Annotate the genome
Create a new output directory named PROKKA_YYYYMMDD (with the analysis date)
Generate output files in multiple formats (.gff, .gbk, .faa, .ffn, .fna, .sqn, etc.)
However, to get the most out of Prokka, especially if you need specific metadata, you’ll want to customize your run:
prokka --outdir ecoli_annotation --prefix ecoli --genus Escherichia --species coli --strain K12 genome.fasta
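Here, --outdir sets the output directory, --prefix sets the base name of the output files, and --genus, --species, and --strain record the organism's taxonomy in the annotation. Prokka has many more options; run prokka --help or see the README on the Prokka GitHub page.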
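If you annotate several genomes, it can help to assemble the command in a script rather than retyping it. Here is a minimal Python sketch that rebuilds the customized command from this section; build_prokka_command is a hypothetical helper, and it uses only the flags shown in that command.

```python
import subprocess


def build_prokka_command(fasta, outdir, prefix, genus=None, species=None, strain=None):
    """Assemble a prokka argument list from the options used in this section."""
    cmd = ["prokka", "--outdir", outdir, "--prefix", prefix]
    for flag, value in (("--genus", genus), ("--species", species), ("--strain", strain)):
        if value:  # skip metadata that was not supplied
            cmd += [flag, value]
    cmd.append(fasta)  # the input file goes last
    return cmd


cmd = build_prokka_command(
    "genome.fasta", "ecoli_annotation", "ecoli",
    genus="Escherichia", species="coli", strain="K12",
)
print(" ".join(cmd))
# To actually run it (requires prokka on PATH):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a strain name or path contains spaces.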
Understanding Prokka Output Files
A Prokka annotation run generates multiple output files, each capturing different aspects of the genome’s annotation. Understanding the content and purpose of these files is important for accurately interpreting the biological significance of genes and other genomic features, as well as for carrying out downstream analyses. The main files in a Prokka output folder include:
.gff (General Feature Format):
Contains all predicted features (genes, CDS, rRNA, tRNA, etc.) in one standardized format. This is the key file for most downstream analyses and is widely supported by genome browsers and bioinformatics tools.
.gbk (GenBank Format):
Provides the annotation in GenBank’s standard format. Useful for visualization in tools like Artemis or for submission to NCBI.
.faa (Protein FASTA):
Contains the amino acid sequences of all predicted coding sequences (CDS). Often used for functional annotation (e.g., BLASTp, protein family/domain searches).
.ffn (Nucleotide FASTA of genes):
Contains the nucleotide sequences of all predicted genes. Useful for codon usage studies or gene-level alignments.
.fna (Nucleotide FASTA of genome):
A copy of the input genome sequence in FASTA format, stored with the Prokka results for convenience.
.sqn (Sequin Submission File):
Formatted for direct submission to GenBank via NCBI’s Sequin tool, saving time during official submissions.
.log (Log File):
Records details of the run, including parameters used and any issues encountered. Useful for troubleshooting and reproducibility.
.txt (Summary Statistics):
A human-readable summary of the annotation, reporting counts of coding sequences, tRNAs, rRNAs, and other features.
.err (Error File):
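Despite the name, this is not Prokka's own error log: it holds the NCBI discrepancy report, which lists annotations that NCBI would consider unacceptable for submission. If your Prokka job fails or produces unexpected results, check the .log file for clues instead.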
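As one example of a downstream use, the .gff file can be summarized with a few lines of Python. The sketch below counts feature types (the third GFF3 column) and stops at the ##FASTA marker, on the assumption that the GFF file embeds the genome sequence after its annotation lines, as Prokka's output normally does. count_features is a hypothetical helper and the demo records are made up.

```python
from collections import Counter


def count_features(gff_lines):
    """Count feature types (column 3) in GFF3 lines, stopping at the ##FASTA section."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # what follows is sequence, not annotation
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


# Tiny made-up example. With real results, pass an open file instead, e.g.
# with open("ecoli_annotation/ecoli.gff") as handle:
#     print(count_features(handle))
demo = [
    "##gff-version 3",
    "contig_1\tProdigal:2.6\tCDS\t337\t2799\t.\t+\t0\tID=DEMO_00001;product=hypothetical protein",
    "contig_1\tAragorn:1.2\ttRNA\t3000\t3075\t.\t-\t.\tID=DEMO_00002;product=tRNA-Ala",
    "##FASTA",
    ">contig_1",
    "ACGT",
]
print(count_features(demo))
```

The counts should roughly agree with the summary in the .txt file, which makes this a handy sanity check.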
Overall, I found Prokka fairly easy to get running once the installation was sorted out, and it’s impressively fast at turning raw genome sequences into something more interpretable. I’m still learning how to make the most of the output files, so if you have tips on downstream uses of Prokka results, I’d love to hear them.
Further Reading
Torsten Seemann, Prokka: rapid prokaryotic genome annotation, Bioinformatics, Volume 30, Issue 14, July 2014, Pages 2068–2069, https://doi.org/10.1093/bioinformatics/btu153