Direct Inference predictions by steps

See the scripts used for Direct Inference prediction here .

Singularity container

For installation, we suggest that you to download the singularity container . This container includes all tools required to run the pipeline.

To use the container, add singularity run --cleanenv evidence.sif before the tools in each script. For example:

1singularity run --cleanenv evidence.sif hisat2

Input files for DirectInf

RNA-Seq raw reads fastq files: *_1.fastq.gz, *_2.fastq.gz
reference genome fasta file: TAIR10_chr_all.fas
list of input prediction: list.txt

Do RNA-Seq alignment by Hisat2

1while read line; do
2      ./01_runHisat2.sh ${line} TAIR10_chr_all.fas;
3done < SRR_Acc_List.txt

Note: Cufflinks and Class2 may takes a long time to process BAM file with deep depth region. Better to generate single bam file for each RNA-Seq data for next step.

Generate high confidence splice sites

1./05_runPortCullis.sh TAIR10_rnaseq.bam

Note: The merged bam file ( TAIR10_rnaseq.bam ) obtained from the process of braker. You can also provide multiple single bam files got from 01_runHisat2.sh , but it takes some time to merge bam files. It’s faster to provide merged bam if you already run braker.

Here generate bed and tab file in portcullis_out/3-filt, we only use bed file.

Consolidate all the transcripts, and predict potential protein coding sequence

  1. Make a configure file and prepare transcripts:

    You should prepare a `list.txt as below to include gtf path (1st column), gtf abbrev (2nd column), stranded-specific or not (3rd column):

    1SRRID1_class.gtf    cs_SRRID1    False
    2SRRID1_cufflinks.gtf cl_SRRID1     False
    3SRRID1_stringtie.gtf st_SRRID1    False
    4SRRID2_class.gtf    cs_SRRID2    False
    5SRRID2_cufflinks.gtf cl_SRRID2     False
    6SRRID2_stringtie.gtf st_SRRID2    False
    7...
    

    Then run the script as below:

    1./06_runMikado_round1.sh TAIR10_chr_all.fas junctions.bed list.txt DI
    

    This will generate DI_prepared.fasta file that will be used for predicting ORFs in the next step.

2. Predict potential CDS from transcripts:
1./07_runTransDecoder.sh DI_prepared.fasta

We will use DI_prepared.fasta.transdecoder.bed in the next step.

Note: Here we only kept complete CDS for next step. You can revise 07_runTransDecoder.sh to use both incomplete and complete CDS if you need.

3. Pick best transcripts for each locus and annotate them as gene:
1./08_runMikado_round2.sh DI_prepared.fasta.transdecoder.bed DI

This will generate:

1mikado.metrics.tsv
2mikado.scores.tsv
3DI.loci.gff3

Optional: Filter out transcripts with redundant CDS

1./09_rm_redundance.sh DI.loci.gff3 TAIR10_chr_all.fas

Optional: Filter out transcripts whose predicted proteins mapped to transposon elements

1./10_TEsorter.sh filter.pep.fa DI.loci.gff3

Note: filter.pep.fa is an output from previous step for removing redundant CDSs. You can also use all protein sequence if you don’t want to remove redundant CDSs.