In this part, we’ll try to extract specific region of sequence from genome and substitute the SNP sites in python
Basic Tech:
Let’s assume:
1: A → C
4: TAT → TA
5: A → C
10: C → AC,G
There are two things we have to highlight:
Conflict between 4:TAT and 5: A→C
In this situation, we simply select the TA because it can reduce the calculation.
10: C→ AC,G
we simply select the first one.
import pandas as pd VCF = pd.DataFrame([[1 ,"A" ,"C" ], [4 , "TAT" , "TA" ], [5 ,"A" ,"C" ], [10 , "G" , "AC,C" ]], columns=["Pos" , "Ref" , "Alt" ]) STR = 'ATCTATCACGCATAGCTAGCTAGTCA' TB = pd.DataFrame([*STR]) TB['Ref' ] = "" TB['Alt' ] = "" TB.iloc[VCF.Pos-1 ,1 ] = VCF.Ref.to_list() TB.iloc[VCF.Pos-1 ,2 ] = [i.split("," )[0 ] for i in VCF.Alt.to_list()] TB.head()
0 Ref Alt
0 A A C
1 T
2 C
3 T TAT TA
4 A A C
!!! Now, we can fill the blank in Alt
with Reference:
TB.Alt[TB.Alt=="" ] = TB[0 ][TB.Alt=="" ].to_list() TB.head()
0 Ref Alt
0 A A C
1 T T
2 C C
3 T TAT TA
4 A A C
!!! Last but not least step is fold the table based on the Ref (Exp, 4:TAT)
TB["Len" ] = [len (i)-1 for i in TB.Ref] TB.Len[TB.Len<0 ]=0 TB["DEL" ] = 0 TB["DEL_mark" ] = 0 TB.DEL =TB.Len.to_list()[:1 ] +TB.Len.to_list()[:-1 ] TB.DEL_mark += TB.DEL TB.DEL[TB.DEL>1 ] = TB.DEL_mark[TB.DEL>1 ]-1 while TB.DEL.max ()!=0 : TB.DEL =TB.DEL.to_list()[:1 ] +TB.DEL.to_list()[:-1 ] TB.DEL_mark += TB.DEL TB.DEL[TB.DEL>0 ] = TB.DEL[TB.DEL>0 ]-1 STR2 = "" .join(TB.Alt[TB.DEL_mark==0 ].to_list())from Bio.pairwise2 import format_alignment alignments = pairwise2.align.globalxx(STR, STR2) print(format_alignment(*alignments[0 ]))
A-TCTATCACG-C-ATAGCTAGCTAGTCA
|||| ||| | |||||||||||||||
-CTCTA-CAC-ACCATAGCTAGCTAGTCA
Score=23
index.js -let strRegExp = '(?<=^\n)(^!!! *)(note|info|todo|warning|attention|caution|failure|missing|fail|error)(.*\n)((^ {4}.*\n|^\n)+)'; +let strRegExp = '(?<=^\n)(^!!! *)(note|info|todo|warning|attention|caution|failure|missing|fail|error|question)(.*\n)((^ {4}.*\n|^\n)+)';
Why pandas?
Because we can now extract any part of sequence based on the annotation information from gtf
file.
Prepare a csv file which contains the information of location you want:
CHROM START END NAME Dirc
mitochondrion_genome 1480 2988 COX1 +
mitochondrion_genome 3083 3754 COX2 +
This script only works on specific example (Drosophila mitochondria Genes in genome V6.39 from Flybase)
echo -e "CHROM\tSTART\tEND\tNAME\tDirc" > Target.csv grep -i "^mitochondrion_genome" genes.gtf| \ grep -v transcript_symbol| \ awk '{OFS="\t"; print $1,$4,$5,$12,$7}' | \ sed 's/[";]//g' >> Target.csv
An script could be like:
import gzipimport pandas as pdfrom Bio import SeqIOfrom Bio import pairwise2from Bio.pairwise2 import format_alignmentdef get_vcf_names (vcf_path ): with gzip.open (vcf_path, "rt" ) as ifile: for line in ifile: if line.startswith("#CHROM" ): vcf_names = [x for x in line.split('\t' )] break ifile.close() return vcf_names VCF_file = "G1FFH.vcf.gz" Genome = "../Mito.fa" Target = "Target.csv" SAMPLE = VCF_file.replace(".vcf.gz" , "" ) names = get_vcf_names(VCF_file) vcf = pd.read_csv('G1FFH.vcf.gz' , compression='gzip' , comment='#' , header=None , names=names, sep='\t' ) TARGET = pd.read_csv(Target, sep='\t' )for seq_record in SeqIO.parse(Genome, "fasta" ): if seq_record.id in TARGET.CHROM.unique(): print("Chrom:" , seq_record.id ) TB = pd.DataFrame([*seq_record.seq]) TB['REF' ] = "" TB['ALT' ] = "" TB.iloc[vcf.POS-1 ,1 ] = vcf.REF.to_list() TB.iloc[vcf.POS-1 ,2 ] = [i.split("," )[0 ] for i in vcf.ALT.to_list()] TB.ALT[TB.ALT=="" ] = TB[0 ][TB.ALT=="" ].to_list() TB.head() TB["Len" ] = [len (i)-1 for i in TB.REF] TB.Len[TB.Len<0 ]=0 TB["DEL" ] = 0 TB["DEL_mark" ] = 0 TB.DEL =TB.Len.to_list()[:1 ] +TB.Len.to_list()[:-1 ] TB.DEL_mark += TB.DEL TB.DEL[TB.DEL>1 ] = TB.DEL_mark[TB.DEL>1 ]-1 while TB.DEL.max ()!=0 : TB.DEL =TB.DEL.to_list()[:1 ] +TB.DEL.to_list()[:-1 ] TB.DEL_mark += TB.DEL TB.DEL[TB.DEL>0 ] = TB.DEL[TB.DEL>0 ]-1 TMP = TB.ALT[TB.DEL_mark==0 ] F = open (SAMPLE + "_" + seq_record.id +".fa" , "w" ) F.close() F = open (SAMPLE + "_" + seq_record.id +".fa" , "a" ) LIST = [] for i in range (len (TARGET[TARGET.CHROM==seq_record.id ])): FROM =TARGET.START[TARGET.CHROM==seq_record.id ][i] END = TARGET.END[TARGET.CHROM==seq_record.id ][i] print(FROM,END) STR1 = "" .join(TB[0 ][TB.index.isin(range (FROM-1 ,END))].to_list()) STR2 = "" .join(TMP[TMP.index.isin(range (FROM-1 ,END))].to_list()) F.write(">" +TARGET.NAME[TARGET.CHROM==seq_record.id ][i] + "_" + SAMPLE + "\n" + STR2 + "\n" ) LIST += range (FROM-1 ,END) F.close()
reference: d.vitale199, 2021 , how to read the VCF files with python pandas.
This script was uploaded into GitHub:Karobben/Bio_tools
A demo would be like:
python vcf2fasta.py -g Genome.fa -v File.vcf.gz -t Target.csv
Other Methods
For a whole chromosome
vcf2fasta
rlebron88, 2016
git clone https://github.com/rlebron88/vcf2fasta.git python vcf2fasta/vcf2fasta.py -v <VCF file from the sample> -f <reference FASTA file> -o <FASTA file from the sample>
Limitations:
ONLY ONE SEQUENCE/CHROMOSOME PER VCF.
USE “X” TO REFER TO THE SEQUENCE OF REF IN THE VCF FILE.
IF THERE IS MORE THAN ONE ALT, THE FIRST IS USED.
DO NOT USE multiFASTA.
MatthewP, 2018 : vcftools has a command vcf-consensus which can convert VCF to FASTA file if you got the reference
bgzip -c file.vcf > file.vcf.gz tabix -fp vcf file.vcf.gz cat Reference.fasta | vcf-consensus file.vcf.gz > file.fa
Python scripts
Python Scripts : – Didn’t be test.
Peal scripte : – Didn’t be test.
Python code from Jared Andrews, 2017 :
consensus = FastaVariant('yourgenome.fasta' , 'variants.vcf.gz' , het=True , hom=True ) out = open ("variants.fasta" , "w" )for chrom in consensus.keys: for var in consensus[chrom].variants_sites: record = consensus[chrom][var-1 :var] print(record, file=out) out.close()