Direct Inference prediction, step by step
See the scripts used for Direct Inference prediction here.
Singularity container
For installation, we suggest that you download the Singularity container. This container includes all the tools required to run the pipeline.
To use the container, prepend singularity run --cleanenv evidence.sif
to the tool invocation in each script. For example:
singularity run --cleanenv evidence.sif hisat2
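If you wrap many commands this way, a small helper function can cut the repetition. A minimal sketch (the function name in_container and the DRY_RUN flag are illustrative conveniences, not part of the pipeline scripts):

```shell
#!/bin/sh
# Run any tool inside the Singularity container.
# Set DRY_RUN=1 to print the command instead of executing it,
# e.g. when testing on a machine without Singularity installed.
in_container() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "singularity run --cleanenv evidence.sif $*"
    else
        singularity run --cleanenv evidence.sif "$@"
    fi
}

# Example: show what would be run, without needing Singularity.
DRY_RUN=1 in_container hisat2 --version
```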
Input files for DirectInf
RNA-Seq raw-read FASTQ files: *_1.fastq.gz, *_2.fastq.gz
reference genome FASTA file: TAIR10_chr_all.fas
list of input transcript assemblies for prediction: list.txt
Run RNA-Seq alignment with HISAT2
while read line; do
  ./01_runHisat2.sh ${line} TAIR10_chr_all.fas;
done < SRR_Acc_List.txt
Note: Cufflinks and CLASS2 may take a long time to process BAM files with deep-coverage regions. It is better to generate a single BAM file for each RNA-Seq dataset before the next step.
Run transcriptome assembly with CLASS2, Cufflinks, and StringTie
while read line; do
  ./02_runClass2.sh ${line}_sorted.bam;
  ./03_runCufflinks.sh ${line}_sorted.bam;
  ./04_runStringtie.sh ${line}_sorted.bam;
done < SRR_Acc_List.txt
Note: CLASS2, Cufflinks, and StringTie each generate a GTF file of transcripts for each SRR sample. You can use additional transcriptome assemblers as needed.
Generate high confidence splice sites
./05_runPortCullis.sh TAIR10_rnaseq.bam
Note: The merged BAM file (TAIR10_rnaseq.bam) is obtained from the BRAKER step. You can also provide the individual BAM files produced by 01_runHisat2.sh, but merging them takes some time; if you have already run BRAKER, it is faster to provide the merged BAM.
This step generates a BED file and a TAB file in portcullis_out/3-filt; we only use the BED file.
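A quick sanity check on the filtered junctions is to count the records in that BED file. A small generic helper (the exact filename under portcullis_out/3-filt depends on your Portcullis version, so adjust the path):

```shell
#!/bin/sh
# Count junction records in a Portcullis BED file, skipping any
# "track ..." header line that Portcullis writes at the top.
count_junctions() {
    grep -vc '^track' "$1"
}

# Usage (filename is an assumption; check your portcullis_out/3-filt):
# count_junctions portcullis_out/3-filt/portcullis_filtered.pass.junctions.bed
```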
Consolidate all transcripts and predict potential protein-coding sequences
Make a configuration file and prepare transcripts:
Prepare a list.txt as below, with the GTF path (1st column), a GTF abbreviation (2nd column), and whether the library is strand-specific (3rd column):
SRRID1_class.gtf cs_SRRID1 False
SRRID1_cufflinks.gtf cl_SRRID1 False
SRRID1_stringtie.gtf st_SRRID1 False
SRRID2_class.gtf cs_SRRID2 False
SRRID2_cufflinks.gtf cl_SRRID2 False
SRRID2_stringtie.gtf st_SRRID2 False
...
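If your GTFs follow the naming convention above, list.txt can be generated from SRR_Acc_List.txt with a short loop. A sketch, assuming non-stranded libraries (hence False in every row) and tab-separated columns, which is what Mikado expects:

```shell
#!/bin/sh
# Build the Mikado input list from a file of SRR accessions (one per line).
# Columns: GTF path, assembly label, strand-specific flag.
make_list() {
    while read -r srr; do
        printf '%s_class.gtf\tcs_%s\tFalse\n'     "$srr" "$srr"
        printf '%s_cufflinks.gtf\tcl_%s\tFalse\n' "$srr" "$srr"
        printf '%s_stringtie.gtf\tst_%s\tFalse\n' "$srr" "$srr"
    done < "$1"
}

# Usage:
# make_list SRR_Acc_List.txt > list.txt
```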
Then run the script as below:
./06_runMikado_round1.sh TAIR10_chr_all.fas junctions.bed list.txt DI
This will generate a DI_prepared.fasta file that will be used for predicting ORFs in the next step.
./07_runTransDecoder.sh DI_prepared.fasta
We will use DI_prepared.fasta.transdecoder.bed in the next step.
Note: Here we keep only complete CDSs for the next step. You can revise 07_runTransDecoder.sh to use both incomplete and complete CDSs if you need to.
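To see how such a completeness filter can work, here is an illustrative helper (not the pipeline's implementation), assuming complete ORFs are tagged with the string "complete" in the BED name column, as in TransDecoder's standard output:

```shell
#!/bin/sh
# Keep only complete ORFs from a TransDecoder BED file.
# TransDecoder ORF types are complete, 5prime_partial, 3prime_partial,
# and internal, so matching "complete" in the name column (column 4)
# selects only the complete ones.
keep_complete() {
    awk -F'\t' 'NR==1 && /^track/ {print; next} $4 ~ /complete/' "$1"
}

# Usage:
# keep_complete DI_prepared.fasta.transdecoder.bed > complete.bed
```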
./08_runMikado_round2.sh DI_prepared.fasta.transdecoder.bed DI
This will generate:
mikado.metrics.tsv
mikado.scores.tsv
DI.loci.gff3
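A quick sanity check on DI.loci.gff3 is to tally how many features of each type it contains. A generic GFF3 helper (not part of the pipeline scripts):

```shell
#!/bin/sh
# Count GFF3 feature types (column 3), ignoring comment lines.
count_features() {
    awk -F'\t' '!/^#/ && NF >= 3 {n[$3]++} END {for (t in n) print t, n[t]}' "$1"
}

# Usage:
# count_features DI.loci.gff3
```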
Optional: Filter out transcripts with redundant CDS
./09_rm_redundance.sh DI.loci.gff3 TAIR10_chr_all.fas
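09_rm_redundance.sh handles the redundancy removal for the pipeline. Conceptually, the simplest form of this is keeping one record per distinct sequence; the sketch below shows exact-duplicate removal on a FASTA file, which is only an illustration and not the pipeline's actual implementation:

```shell
#!/bin/sh
# Keep the first FASTA record for each distinct sequence.
# Step 1: linearize records to "header<TAB>sequence".
# Step 2: keep the first occurrence of each sequence, restore FASTA layout.
dedup_fasta() {
    awk '/^>/ {if (h) print h "\t" s; h=$0; s=""; next}
         {s=s $0}
         END {if (h) print h "\t" s}' "$1" |
    awk -F'\t' '!seen[$2]++ {print $1; print $2}'
}

# Usage:
# dedup_fasta proteins.fa > proteins.dedup.fa
```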
Optional: Filter out transcripts whose predicted proteins map to transposable elements
./10_TEsorter.sh filter.pep.fa DI.loci.gff3
Note: filter.pep.fa is an output of the previous step (redundant-CDS removal). You can also use all protein sequences if you do not want to remove redundant CDSs.