Variant Call Using FASTA

Gepoliano Chaves, Ph. D.

September 13, 2023

1.1) Introduction

In Chaves (2019), we discuss on the application of knowledge on the Biological Association between genotype (the DNA polymorphisms) and phenotype using statistical software (PLINK). We also had an introduction to programming, with the installation of libraries in R and and Bash and a brief discussion on how to process files using R and Bash. Here, we expand the notion of a text file, to the notion of information stored in the form of text. Therefore, in a file, we can also store a biological sequence. The first biological sequence storage file that we should discuss should be the FASTA file. Our discussion should also include information stored in FASTQ, SAM, BAM, BCF and VCF files. In this notebook, we will use information stored in a VCF file to generate commands that allow us to extract useful information from the files we analyze.

1.2) Introduction: what will we do today

1) FASTA file

FASTA file formatting: a text file containing a biological sequence.

1) Explaining the Pipeline

In this notebook, we deepen the analysis of sequences with the beginning of the Identification of Genetic Variants. Here, we treat genetic variants as DNA polymorphisms. A DNA polymorphism is a mutation or variant, which differs from the nucleotide found at the same position in the FASTA sequence used as a reference. Thus, the present pipeline defines the computational protocol for Biological or Genetic Association with a phenotypic trait. In English this computational protocol is called Variant Call.

1) Explaining the Pipeline: SNPs

In this Booklet, we will make a Variant Call using FASTA files that contain the SARS-CoV-2 genome, which is the biological sequence of the virus genome, in a text file, specifically called FASTA (Figure 1). We identified SARS-CoV-2 variants using samtools and bcftools to extract genetic polymorphisms from FASTA files. The FASTA file must be downloaded from GISAID and stored locally, in a folder on the researcher’s local system. Genetic variants, also called SNPs (Single Nucleotide Polymorphisms), are determined by comparing a FASTA file used as a reference, with the isolated sequence of the geographic region of interest.

1) Explaining the Pipeline: GISAID Database

By comparing the FASTA sequence of each region of the planet to a reference FASTA sequence, for example the first FASTA sequence corresponding to the initial patient presenting symptoms of respiratory syndrome, later identified as caused by SARS-CoV-2, we can note each position where there is a nucleotide different from the FASTA reference sequence obtained from the first Chinese patient. We say, then, that in the other regions, there are virus mutations, or, more technically, polymorphisms of the viral genetic material. FASTA files from virtually every region of the planet can be obtained from the German GISAID (Global Initiative on Sharing Avian Influenza Data) database. Information about the GISAID database can be obtained from its website:

https://www.gisaid.org

1) Explaining the Pipeline: Gene Expression Omnibus (GEO)

In Computational Biology Notebook Number 6, we will learn how to identify genetic variants using FASTQ files instead of FASTA files. FASTQ files can be analyzed directly from a database called Gene Expression Omnibus (GEO). The advantage of GEO compared to GISAID is that if the proper pipeline of identification of genetic variants in GEO is used, local storage of sequencing files is not required.

1) script submission to computational server

Often, the amount of sequencing data to be analyzed is so large that it is necessary to use computers with super-powerful storage capacity. The part below, encoded in Bash, is made using the Linux Operating System. The code represents a type of “header” that must be included for submitting scripts using the Linux system on computer servers.

1) script submission to computational server

#!/bin/bash
#SBATCH --partition=128x24
##SBATCH --job-name=Variants_BWA # Job name
##SBATCH --mail-type=ALL              # Mail events (NONE, BEGIN, END, FAIL, ALL)
##SBATCH --mail-user=gchaves@ucsc.edu  # Where to send mail 
##SBATCH --nodes=1                    # Use one node
##SBATCH --ntasks=1                   # Run a single task   
##SBATCH --cpus-per-task=4            # Number of CPU cores per task
##SBATCH --output=Variants_BWA     # Standard output and error log
# 
# module load gcc/5.4.0
source ~/.bashrc

2) Dealing with your computer system architecture: Creating Folders for Data Storage

File organization is essential for clear and organized steps in the various procedures used, often without perfect visualization of how the computer performs. The excerpt below creates folders for the storage of the reference FASTA file, the FASTA files of the region of interest (Brazil), the SAM and BAM alignment files and finally, the BCF and VCF files, which contain information about the location in the linear genome , of the position of the identified mutations. VCF files are extracted in two ways, one containing SNPs and one containing indels.

# Store Figures
if (!dir.exists("../figures")) {
  dir.create("../figures")
}

# Store reference FASTA
if (!dir.exists("../fasta_reference_file")) {
  dir.create("../fasta_reference_file")
}

# Store reference FASTA
if (!dir.exists("../regions")) {
  dir.create("../regions")
}

# Store files from Brazil
if (!dir.exists("../regions/brazil")) {
  dir.create("../regions/brazil")
}

# Create the year for samples
if (!dir.exists("../regions/brazil/2020")) {
  dir.create("../regions/brazil/2020")
}

# Store FASTA for regions
if (!dir.exists("../regions/brazil/2020/fasta/")) {
  dir.create("../regions/brazil/2020/fasta/")
}


# Store BAM files
if (!dir.exists("../regions/brazil/2020/bam")) {
  dir.create("../regions/brazil/2020/bam")
}

# Store BCF files
if (!dir.exists("../regions/brazil/2020/bcf")) {
  dir.create("../regions/brazil/2020/bcf")
}

# Store FASTA files
if (!dir.exists("../regions/brazil/2020/fasta")) {
  dir.create("../regions/brazil/2020/fasta")
}

# Store SAM files
if (!dir.exists("../regions/brazil/2020/sam")) {
  dir.create("../regions/brazil/2020/sam")
}

# Store VFC files
if (!dir.exists("../regions/brazil/2020/vcf")) {
  dir.create("../regions/brazil/2020/vcf")
}


# Store VCF from SNPs
if (!dir.exists("../regions/brazil/2020/vcf_snp")) {
  dir.create("../regions/brazil/2020/vcf_snp")
}

# Store VCF from indels
if (!dir.exists("../regions/brazil/2020/vcf_indel")) {
  dir.create("../regions/brazil/2020/vcf_indel")
}

2) General Organization of the Pipeline

The following steps must be performed by the genetic variant identification pipeline:

2.1) Indexing Using BWA Program

~/anaconda3/bin/bwa index -a bwtsw ../fasta_reference_file/SARS-CoV-2.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [BWTIncCreate] textLength=59806, availableWord=76494
## [bwt_gen] Finished constructing BWT in 5 iterations.
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: /Users/gepolianochaves/anaconda3/bin/bwa index -a bwtsw ../fasta_reference_file/SARS-CoV-2.fasta
## [main] Real time: 0.010 sec; CPU: 0.015 sec

2.1) What is indexing for?

2.1) What is indexing for?

2.1) What is indexing for?

https://www.biostars.org/p/212594/

https://www.youtube.com/user/BenLangmead

2.2) Download and split FASTA

## Change directory to regions folder
cd ../regions

## Use splitfasta to split big FASTA file into its pieces.
/Users/gepolianochaves/anaconda3/bin/splitfasta Brazil_2020_07_01-2020_12_31.fasta

2.3) Move FASTA files

cp ../regions/Brazil_2020_07_01-2020_12_31_split_files/Brazil* ../regions/brazil/2020/fasta

2.4) Create FASTA file list

cd ../regions/brazil/2020/fasta
for i in Brazil*; do echo $i; done
## Brazil_2020_07_01-2020_12_31_1.fasta
## Brazil_2020_07_01-2020_12_31_10.fasta
## Brazil_2020_07_01-2020_12_31_100.fasta
## Brazil_2020_07_01-2020_12_31_101.fasta
## Brazil_2020_07_01-2020_12_31_102.fasta
## Brazil_2020_07_01-2020_12_31_103.fasta
## Brazil_2020_07_01-2020_12_31_104.fasta
## Brazil_2020_07_01-2020_12_31_105.fasta
## Brazil_2020_07_01-2020_12_31_106.fasta
## Brazil_2020_07_01-2020_12_31_107.fasta
## Brazil_2020_07_01-2020_12_31_108.fasta
## Brazil_2020_07_01-2020_12_31_109.fasta
## Brazil_2020_07_01-2020_12_31_11.fasta
## Brazil_2020_07_01-2020_12_31_110.fasta
## Brazil_2020_07_01-2020_12_31_111.fasta
## Brazil_2020_07_01-2020_12_31_112.fasta
## Brazil_2020_07_01-2020_12_31_113.fasta
## Brazil_2020_07_01-2020_12_31_114.fasta
## Brazil_2020_07_01-2020_12_31_115.fasta
## Brazil_2020_07_01-2020_12_31_116.fasta
## Brazil_2020_07_01-2020_12_31_117.fasta
## Brazil_2020_07_01-2020_12_31_118.fasta
## Brazil_2020_07_01-2020_12_31_119.fasta
## Brazil_2020_07_01-2020_12_31_12.fasta
## Brazil_2020_07_01-2020_12_31_120.fasta
## Brazil_2020_07_01-2020_12_31_121.fasta
## Brazil_2020_07_01-2020_12_31_122.fasta
## Brazil_2020_07_01-2020_12_31_123.fasta
## Brazil_2020_07_01-2020_12_31_124.fasta
## Brazil_2020_07_01-2020_12_31_125.fasta
## Brazil_2020_07_01-2020_12_31_126.fasta
## Brazil_2020_07_01-2020_12_31_127.fasta
## Brazil_2020_07_01-2020_12_31_128.fasta
## Brazil_2020_07_01-2020_12_31_129.fasta
## Brazil_2020_07_01-2020_12_31_13.fasta
## Brazil_2020_07_01-2020_12_31_130.fasta
## Brazil_2020_07_01-2020_12_31_131.fasta
## Brazil_2020_07_01-2020_12_31_132.fasta
## Brazil_2020_07_01-2020_12_31_133.fasta
## Brazil_2020_07_01-2020_12_31_134.fasta
## Brazil_2020_07_01-2020_12_31_135.fasta
## Brazil_2020_07_01-2020_12_31_136.fasta
## Brazil_2020_07_01-2020_12_31_137.fasta
## Brazil_2020_07_01-2020_12_31_138.fasta
## Brazil_2020_07_01-2020_12_31_139.fasta
## Brazil_2020_07_01-2020_12_31_14.fasta
## Brazil_2020_07_01-2020_12_31_140.fasta
## Brazil_2020_07_01-2020_12_31_141.fasta
## Brazil_2020_07_01-2020_12_31_142.fasta
## Brazil_2020_07_01-2020_12_31_143.fasta
## Brazil_2020_07_01-2020_12_31_144.fasta
## Brazil_2020_07_01-2020_12_31_145.fasta
## Brazil_2020_07_01-2020_12_31_146.fasta
## Brazil_2020_07_01-2020_12_31_147.fasta
## Brazil_2020_07_01-2020_12_31_148.fasta
## Brazil_2020_07_01-2020_12_31_149.fasta
## Brazil_2020_07_01-2020_12_31_15.fasta
## Brazil_2020_07_01-2020_12_31_150.fasta
## Brazil_2020_07_01-2020_12_31_151.fasta
## Brazil_2020_07_01-2020_12_31_152.fasta
## Brazil_2020_07_01-2020_12_31_153.fasta
## Brazil_2020_07_01-2020_12_31_154.fasta
## Brazil_2020_07_01-2020_12_31_155.fasta
## Brazil_2020_07_01-2020_12_31_156.fasta
## Brazil_2020_07_01-2020_12_31_157.fasta
## Brazil_2020_07_01-2020_12_31_158.fasta
## Brazil_2020_07_01-2020_12_31_159.fasta
## Brazil_2020_07_01-2020_12_31_16.fasta
## Brazil_2020_07_01-2020_12_31_160.fasta
## Brazil_2020_07_01-2020_12_31_161.fasta
## Brazil_2020_07_01-2020_12_31_162.fasta
## Brazil_2020_07_01-2020_12_31_163.fasta
## Brazil_2020_07_01-2020_12_31_164.fasta
## Brazil_2020_07_01-2020_12_31_165.fasta
## Brazil_2020_07_01-2020_12_31_166.fasta
## Brazil_2020_07_01-2020_12_31_167.fasta
## Brazil_2020_07_01-2020_12_31_168.fasta
## Brazil_2020_07_01-2020_12_31_169.fasta
## Brazil_2020_07_01-2020_12_31_17.fasta
## Brazil_2020_07_01-2020_12_31_170.fasta
## Brazil_2020_07_01-2020_12_31_171.fasta
## Brazil_2020_07_01-2020_12_31_172.fasta
## Brazil_2020_07_01-2020_12_31_173.fasta
## Brazil_2020_07_01-2020_12_31_174.fasta
## Brazil_2020_07_01-2020_12_31_175.fasta
## Brazil_2020_07_01-2020_12_31_176.fasta
## Brazil_2020_07_01-2020_12_31_177.fasta
## Brazil_2020_07_01-2020_12_31_178.fasta
## Brazil_2020_07_01-2020_12_31_179.fasta
## Brazil_2020_07_01-2020_12_31_18.fasta
## Brazil_2020_07_01-2020_12_31_180.fasta
## Brazil_2020_07_01-2020_12_31_181.fasta
## Brazil_2020_07_01-2020_12_31_182.fasta
## Brazil_2020_07_01-2020_12_31_183.fasta
## Brazil_2020_07_01-2020_12_31_184.fasta
## Brazil_2020_07_01-2020_12_31_185.fasta
## Brazil_2020_07_01-2020_12_31_186.fasta
## Brazil_2020_07_01-2020_12_31_187.fasta
## Brazil_2020_07_01-2020_12_31_188.fasta
## Brazil_2020_07_01-2020_12_31_189.fasta
## Brazil_2020_07_01-2020_12_31_19.fasta
## Brazil_2020_07_01-2020_12_31_190.fasta
## Brazil_2020_07_01-2020_12_31_191.fasta
## Brazil_2020_07_01-2020_12_31_192.fasta
## Brazil_2020_07_01-2020_12_31_193.fasta
## Brazil_2020_07_01-2020_12_31_194.fasta
## Brazil_2020_07_01-2020_12_31_195.fasta
## Brazil_2020_07_01-2020_12_31_196.fasta
## Brazil_2020_07_01-2020_12_31_197.fasta
## Brazil_2020_07_01-2020_12_31_198.fasta
## Brazil_2020_07_01-2020_12_31_199.fasta
## Brazil_2020_07_01-2020_12_31_2.fasta
## Brazil_2020_07_01-2020_12_31_20.fasta
## Brazil_2020_07_01-2020_12_31_200.fasta
## Brazil_2020_07_01-2020_12_31_201.fasta
## Brazil_2020_07_01-2020_12_31_202.fasta
## Brazil_2020_07_01-2020_12_31_203.fasta
## Brazil_2020_07_01-2020_12_31_204.fasta
## Brazil_2020_07_01-2020_12_31_205.fasta
## Brazil_2020_07_01-2020_12_31_206.fasta
## Brazil_2020_07_01-2020_12_31_207.fasta
## Brazil_2020_07_01-2020_12_31_208.fasta
## Brazil_2020_07_01-2020_12_31_209.fasta
## Brazil_2020_07_01-2020_12_31_21.fasta
## Brazil_2020_07_01-2020_12_31_210.fasta
## Brazil_2020_07_01-2020_12_31_211.fasta
## Brazil_2020_07_01-2020_12_31_212.fasta
## Brazil_2020_07_01-2020_12_31_213.fasta
## Brazil_2020_07_01-2020_12_31_214.fasta
## Brazil_2020_07_01-2020_12_31_215.fasta
## Brazil_2020_07_01-2020_12_31_216.fasta
## Brazil_2020_07_01-2020_12_31_217.fasta
## Brazil_2020_07_01-2020_12_31_218.fasta
## Brazil_2020_07_01-2020_12_31_219.fasta
## Brazil_2020_07_01-2020_12_31_22.fasta
## Brazil_2020_07_01-2020_12_31_220.fasta
## Brazil_2020_07_01-2020_12_31_221.fasta
## Brazil_2020_07_01-2020_12_31_222.fasta
## Brazil_2020_07_01-2020_12_31_223.fasta
## Brazil_2020_07_01-2020_12_31_224.fasta
## Brazil_2020_07_01-2020_12_31_225.fasta
## Brazil_2020_07_01-2020_12_31_226.fasta
## Brazil_2020_07_01-2020_12_31_227.fasta
## Brazil_2020_07_01-2020_12_31_228.fasta
## Brazil_2020_07_01-2020_12_31_229.fasta
## Brazil_2020_07_01-2020_12_31_23.fasta
## Brazil_2020_07_01-2020_12_31_230.fasta
## Brazil_2020_07_01-2020_12_31_231.fasta
## Brazil_2020_07_01-2020_12_31_232.fasta
## Brazil_2020_07_01-2020_12_31_233.fasta
## Brazil_2020_07_01-2020_12_31_234.fasta
## Brazil_2020_07_01-2020_12_31_235.fasta
## Brazil_2020_07_01-2020_12_31_236.fasta
## Brazil_2020_07_01-2020_12_31_237.fasta
## Brazil_2020_07_01-2020_12_31_238.fasta
## Brazil_2020_07_01-2020_12_31_239.fasta
## Brazil_2020_07_01-2020_12_31_24.fasta
## Brazil_2020_07_01-2020_12_31_240.fasta
## Brazil_2020_07_01-2020_12_31_241.fasta
## Brazil_2020_07_01-2020_12_31_242.fasta
## Brazil_2020_07_01-2020_12_31_243.fasta
## Brazil_2020_07_01-2020_12_31_244.fasta
## Brazil_2020_07_01-2020_12_31_245.fasta
## Brazil_2020_07_01-2020_12_31_246.fasta
## Brazil_2020_07_01-2020_12_31_247.fasta
## Brazil_2020_07_01-2020_12_31_248.fasta
## Brazil_2020_07_01-2020_12_31_249.fasta
## Brazil_2020_07_01-2020_12_31_25.fasta
## Brazil_2020_07_01-2020_12_31_250.fasta
## Brazil_2020_07_01-2020_12_31_26.fasta
## Brazil_2020_07_01-2020_12_31_27.fasta
## Brazil_2020_07_01-2020_12_31_28.fasta
## Brazil_2020_07_01-2020_12_31_29.fasta
## Brazil_2020_07_01-2020_12_31_3.fasta
## Brazil_2020_07_01-2020_12_31_30.fasta
## Brazil_2020_07_01-2020_12_31_31.fasta
## Brazil_2020_07_01-2020_12_31_32.fasta
## Brazil_2020_07_01-2020_12_31_33.fasta
## Brazil_2020_07_01-2020_12_31_34.fasta
## Brazil_2020_07_01-2020_12_31_35.fasta
## Brazil_2020_07_01-2020_12_31_36.fasta
## Brazil_2020_07_01-2020_12_31_37.fasta
## Brazil_2020_07_01-2020_12_31_38.fasta
## Brazil_2020_07_01-2020_12_31_39.fasta
## Brazil_2020_07_01-2020_12_31_4.fasta
## Brazil_2020_07_01-2020_12_31_40.fasta
## Brazil_2020_07_01-2020_12_31_41.fasta
## Brazil_2020_07_01-2020_12_31_42.fasta
## Brazil_2020_07_01-2020_12_31_43.fasta
## Brazil_2020_07_01-2020_12_31_44.fasta
## Brazil_2020_07_01-2020_12_31_45.fasta
## Brazil_2020_07_01-2020_12_31_46.fasta
## Brazil_2020_07_01-2020_12_31_47.fasta
## Brazil_2020_07_01-2020_12_31_48.fasta
## Brazil_2020_07_01-2020_12_31_49.fasta
## Brazil_2020_07_01-2020_12_31_5.fasta
## Brazil_2020_07_01-2020_12_31_50.fasta
## Brazil_2020_07_01-2020_12_31_51.fasta
## Brazil_2020_07_01-2020_12_31_52.fasta
## Brazil_2020_07_01-2020_12_31_53.fasta
## Brazil_2020_07_01-2020_12_31_54.fasta
## Brazil_2020_07_01-2020_12_31_55.fasta
## Brazil_2020_07_01-2020_12_31_56.fasta
## Brazil_2020_07_01-2020_12_31_57.fasta
## Brazil_2020_07_01-2020_12_31_58.fasta
## Brazil_2020_07_01-2020_12_31_59.fasta
## Brazil_2020_07_01-2020_12_31_6.fasta
## Brazil_2020_07_01-2020_12_31_60.fasta
## Brazil_2020_07_01-2020_12_31_61.fasta
## Brazil_2020_07_01-2020_12_31_62.fasta
## Brazil_2020_07_01-2020_12_31_63.fasta
## Brazil_2020_07_01-2020_12_31_64.fasta
## Brazil_2020_07_01-2020_12_31_65.fasta
## Brazil_2020_07_01-2020_12_31_66.fasta
## Brazil_2020_07_01-2020_12_31_67.fasta
## Brazil_2020_07_01-2020_12_31_68.fasta
## Brazil_2020_07_01-2020_12_31_69.fasta
## Brazil_2020_07_01-2020_12_31_7.fasta
## Brazil_2020_07_01-2020_12_31_70.fasta
## Brazil_2020_07_01-2020_12_31_71.fasta
## Brazil_2020_07_01-2020_12_31_72.fasta
## Brazil_2020_07_01-2020_12_31_73.fasta
## Brazil_2020_07_01-2020_12_31_74.fasta
## Brazil_2020_07_01-2020_12_31_75.fasta
## Brazil_2020_07_01-2020_12_31_76.fasta
## Brazil_2020_07_01-2020_12_31_77.fasta
## Brazil_2020_07_01-2020_12_31_78.fasta
## Brazil_2020_07_01-2020_12_31_79.fasta
## Brazil_2020_07_01-2020_12_31_8.fasta
## Brazil_2020_07_01-2020_12_31_80.fasta
## Brazil_2020_07_01-2020_12_31_81.fasta
## Brazil_2020_07_01-2020_12_31_82.fasta
## Brazil_2020_07_01-2020_12_31_83.fasta
## Brazil_2020_07_01-2020_12_31_84.fasta
## Brazil_2020_07_01-2020_12_31_85.fasta
## Brazil_2020_07_01-2020_12_31_86.fasta
## Brazil_2020_07_01-2020_12_31_87.fasta
## Brazil_2020_07_01-2020_12_31_88.fasta
## Brazil_2020_07_01-2020_12_31_89.fasta
## Brazil_2020_07_01-2020_12_31_9.fasta
## Brazil_2020_07_01-2020_12_31_90.fasta
## Brazil_2020_07_01-2020_12_31_91.fasta
## Brazil_2020_07_01-2020_12_31_92.fasta
## Brazil_2020_07_01-2020_12_31_93.fasta
## Brazil_2020_07_01-2020_12_31_94.fasta
## Brazil_2020_07_01-2020_12_31_95.fasta
## Brazil_2020_07_01-2020_12_31_96.fasta
## Brazil_2020_07_01-2020_12_31_97.fasta
## Brazil_2020_07_01-2020_12_31_98.fasta
## Brazil_2020_07_01-2020_12_31_99.fasta

2.4) Counting number of FASTA files in Brazil folder

cd ../regions/brazil/2020/fasta
for i in Brazil*; do echo $i; done | wc -l
##      250
cd ../regions/brazil/2020/fasta
for i in Brazil*; do echo $i; done > COVID_List_Region.txt
wc -l COVID_List_Region.txt ## Total number of FASTA files
##      250 COVID_List_Region.txt

3) Alignment in for loop


for fasta_file in $(cat ../regions/brazil/2020/fasta/COVID_List_Region.txt); do 

  ## Print names of the fasta file
  echo $fast_file
  
  ## Alignment
  bwa mem -M -R \
  '@RG\tID:SampleCorona\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:SampleCorona' \
  ../fasta_reference_file/SARS-CoV-2.fasta \
  ../regions/brazil/2020/fasta/$fasta_file > \
  ../regions/brazil/2020/sam/$fasta_file".sam"
  
  ## SAM to BAM
  samtools view -S -b .$fasta_file".sam" > \
  ../regions/brazil/2020/bam/$fasta_file".bam"
  
    ## This part is different in the VMs
  bcftools mpileup --fasta-ref ../fasta_reference_file/SARS-CoV-2.fasta ../regions/brazil/2020/bam/$fasta_file".bam" > \
  ../regions/brazil/2020/bcf/$fasta_file".bcf"
  
  ## Bcftools extracts SNPs
  bcftools view -v snps ../regions/brazil/2020/bcf/$fasta_file".bcf" > ../regions/brazil/2020/vcf_snp/$fasta_file"_snps.vcf"

  ## Bcftools extracts indels
  bcftools view -v indels ../regions/brazil/2020/bcf/$fasta_file".bcf" > ../regions/brazil/2020/vcf_indel/$fasta_file"_indels.vcf"

done

4) Counting the number of mutations

cd brazil_vcf_julDec_snp

echo "Mutation N501Y"
grep -w 23063 * | wc -l #2/250 files
grep -w 23064 * | wc -l # 0/250 files
grep -w 23065 * | wc -l # 0/250 files
echo "Mutation P681H"
grep -w 23603 * | wc -l # 0/250 files
grep -w 23604 * | wc -l # 2/250 files
grep -w 23605 * | wc -l # 0/250 files
echo "HV 69–70 deletion"
grep -w 21766 * | wc -l
grep -w 21767 * | wc -l
grep -w 21768 * | wc -l
grep -w 21769 * | wc -l
grep -w 21770 * | wc -l
grep -w 21771 * | wc -l
grep -w 21772 * | wc -l
echo "Mutations from Yin"
grep -w 241 * | wc -l
grep -w 3037 * | wc -l
grep -w 11083 * | wc -l
grep -w 14408 * | wc -l
grep -w 17747 * | wc -l
grep -w 17858 * | wc -l
grep -w 18060 * | wc -l
grep -w 23403 * | wc -l
grep -w 26144 * | wc -l
grep -w 27046 * | wc -l
grep -w 28144 * | wc -l
grep -w 28881 * | wc -l
grep -w 28882 * | wc -l
grep -w 28883 * | wc -l
## bash: line 0: cd: brazil_vcf_julDec_snp: No such file or directory
## Mutation N501Y
##        6
##        2
##        2
## Mutation P681H
##        2
##        2
##        2
## HV 69–70 deletion
##        2
##        2
##        2
##        2
##        2
##        2
##        2
## Mutations from Yin
##        2
##        2
##        2
##        2
##        2
##        2
##        2
##        2
##        2
##        2
##        2
##        2
##        2
##        2

5.1) Implementation of Allelic Frequency Calculation (Original Script)

https://stackoverflow.com/questions/4651437/how-do-i-set-a-variable-to-the-output-of-a-command-in-bash

cd brazil_fasta_julDec
NumeroTotalFASTA=$(for i in Brazil*; do echo $i; done | wc -l)
echo "${NumeroTotalFASTA}"


cd ~/Desktop/Gepoliano/SARS-CoV-2_Analysis/brazil_vcf_julDec_snp
NumeroN501Y=$(grep 23063 * | wc -l )
echo "${NumeroN501Y}"

echo "scale=3; ${NumeroN501Y} / ${NumeroTotalFASTA} " | bc


##frequency= ( $NumeroTotalFASTA / $NumeroN501Y )
##echo " A Frequência é de N501Y = $frequency"
## bash: line 0: cd: brazil_fasta_julDec: No such file or directory
##        1
##        2
## 2.000

5.2) Calculate Several Frequencies

## Count total number of FASTA files
cd brazil_fasta_julDec
NumeroTotalFASTA=$(for i in Brazil*; do echo $i; done | wc -l)
echo "${NumeroTotalFASTA}"

## Check current directory
pwd

## Go back to the previous directory, because VCF files are stored inside it
cd ..

## Go to VCF file directory and count number of times a mutation is observed
cd brazil_vcf_julDec_snp
NumeroN501Y=$(grep 23063 * | wc -l )
echo "${NumeroN501Y}"

## Print resut of frequency calculation to screen
echo "scale=3; ${NumeroN501Y} / ${NumeroTotalFASTA} " | bc
## bash: line 1: cd: brazil_fasta_julDec: No such file or directory
##        1
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/2023-bioinfo-bridge-course/recombio/2-Variant-Call-Pipeline/scripts
## bash: line 12: cd: brazil_vcf_julDec_snp: No such file or directory
## grep: fasta_reference_file: Is a directory
## grep: figures: Is a directory
## grep: regions: Is a directory
## grep: scripts: Is a directory
##        0
## 0

5.3) Calculate Frequency of Variants per Region in for loop

cd brazil_fasta_julDec
NumeroTotalFASTA=$(for i in Brazil*; do echo $i; done | wc -l)

cd ..

for Variant in $(cat VariantList.txt.tmp); do
  Variant_Total_Occurrences=$(grep $Variant brazil_vcf_julDec_snp/* | wc -l )
  echo "scale=2; 100*${Variant_Total_Occurrences} / ${NumeroTotalFASTA} " | bc
  #echo "scale=3; ${Variant_Total_Occurrences} / ${NumeroTotalFASTA} " | bc
done
## bash: line 0: cd: brazil_fasta_julDec: No such file or directory
## cat: VariantList.txt.tmp: No such file or directory

6) Viewing Variants of Interest

Polar interactions between the SARS-CoV-2 Spike RBD protein (white) and the human ACE2 protein (blue) calculated by Pymol using the mutagenesis tool. A) Interaction of wild-type ASN501 (N501) residue in the Spike SARS-CoV protein binding domain (RBD) with tyrosine 41 in ACE2; B) When Asparagine is replaced by Tyrosine in RBD, the number of possible hydrogen interactions increases between Spike RBD and ACE2, possibly explaining the higher affinity between the virus Spike protein and ACE2.

6) Viewing Variants of Interest

  1. Interaction of wild-type ASN501 (N501) residue in the Spike SARS-CoV protein binding domain (RBD) with tyrosine 41 in ACE2;

Polar interactions between the SARS-CoV-2 Spike RBD protein (white) and the human ACE2 protein (blue) calculated by Pymol using the mutagenesis tool. A) Interaction of wild-type ASN501 (N501) residue in the Spike SARS-CoV protein binding domain (RBD) with tyrosine 41 in ACE2; B) When Asparagine is replaced by Tyrosine in RBD, the number of possible hydrogen interactions increases between Spike RBD and ACE2, possibly explaining the higher affinity between the virus Spike protein and ACE2.

6) Viewing Variants of Interest

  1. When Asparagine is replaced by Tyrosine in RBD, the number of possible hydrogen interactions increases between Spike RBD and ACE2, possibly explaining the greater affinity between the virus Spike protein and ACE2.

Polar interactions between the SARS-CoV-2 Spike RBD protein (white) and the human ACE2 protein (blue) calculated by Pymol using the mutagenesis tool. A) Interaction of wild-type ASN501 (N501) residue in the Spike SARS-CoV protein binding domain (RBD) with tyrosine 41 in ACE2; B) When Asparagine is replaced by Tyrosine in RBD, the number of possible hydrogen interactions increases between Spike RBD and ACE2, possibly explaining the higher affinity between the virus Spike protein and ACE2.

7) References: Identification of Variants

https://www.the-scientist.com/news-opinion/a-guide-to-emerging-sars-cov-2-variants-68387

https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2014/140217_AgriOmics/dan_bolser_snp_calling.pdf

7) References: 3D Protein Visualization with Pymol

https://www.youtube.com/watch?v=hcnnKrlqa9M

https://www.rcsb.org/structure/6VW1

https://www.youtube.com/watch?v=nFY3EjBNPBQ

https://www.youtube.com/watch?v=M-VCBz83nfs