0 Posted 2022-09-16Updated 2024-01-11Biology / Bioinformatics / Protocol7 minutes read (About 1066 words)

Extract Fasta from VCF

Extract only a part of genes

In this part, we’ll try to extract specific region of sequence from genome and substitute the SNP sites in python

Basic Tech:

Let’s assume:

1: A → C
4: TAT → TA
5: A → C
10: C → AC,G

There are two things we have to highlight:

Conflict between 4:TAT and 5: A→C

In this situation, we simply select the TA because it can reduce the calculation.

10: C→ AC,G

we simply select the first one.

import pandas as pd

VCF = pd.DataFrame([[1,"A","C"],
            [4, "TAT", "TA"],
            [5,"A","C"],
            [10, "G", "AC,C"]],
            columns=["Pos", "Ref", "Alt"])

STR = 'ATCTATCACGCATAGCTAGCTAGTCA'

TB = pd.DataFrame([*STR])
TB['Ref'] = ""
TB['Alt'] = ""

TB.iloc[VCF.Pos-1,1] = VCF.Ref.to_list()
TB.iloc[VCF.Pos-1,2] = [i.split(",")[0] for i in VCF.Alt.to_list()]
TB.head()

   0  Ref Alt
0  A    A   C
1  T         
2  C         
3  T  TAT  TA
4  A    A   C

!!! Now, we can fill the blank in Alt with Reference:

TB.Alt[TB.Alt==""] =  TB[0][TB.Alt==""].to_list()
TB.head()

   0  Ref Alt
0  A    A   C
1  T        T
2  C        C
3  T  TAT  TA
4  A    A   C

!!! Last but not least step is fold the table based on the Ref (Exp, 4:TAT)

TB["Len"] = [len(i)-1 for i in TB.Ref]
TB.Len[TB.Len<0]=0

TB["DEL"] = 0
TB["DEL_mark"] = 0

TB.DEL =TB.Len.to_list()[:1] +TB.Len.to_list()[:-1]
TB.DEL_mark += TB.DEL
TB.DEL[TB.DEL>1] = TB.DEL_mark[TB.DEL>1]-1


while TB.DEL.max()!=0:
    TB.DEL =TB.DEL.to_list()[:1] +TB.DEL.to_list()[:-1]
    TB.DEL_mark += TB.DEL
    TB.DEL[TB.DEL>0] = TB.DEL[TB.DEL>0]-1

STR2 = "".join(TB.Alt[TB.DEL_mark==0].to_list())

# Print Alignment results
from Bio.pairwise2 import format_alignment
alignments = pairwise2.align.globalxx(STR, STR2)
print(format_alignment(*alignments[0]))

A-TCTATCACG-C-ATAGCTAGCTAGTCA
  |||| |||  | |||||||||||||||
-CTCTA-CAC-ACCATAGCTAGCTAGTCA
  Score=23

index.js-let strRegExp = '(?<=^\n)(^!!! *)(note|info|todo|warning|attention|caution|failure|missing|fail|error)(.*\n)((^ {4}.*\n|^\n)+)';
+let strRegExp = '(?<=^\n)(^!!! *)(note|info|todo|warning|attention|caution|failure|missing|fail|error|question)(.*\n)((^ {4}.*\n|^\n)+)';

Why pandas?

Because we can now extract any part of sequence based on the annotation information from gtf file.

Prepare a csv file which contains the information of location you want:

CHROM	START	END	NAME Dirc
mitochondrion_genome	1480	2988	COX1	+
mitochondrion_genome	3083	3754	COX2	+

This script only works on specific example (Drosophila mitochondria Genes in genome V6.39 from Flybase)

echo -e "CHROM\tSTART\tEND\tNAME\tDirc" > Target.csv
grep -i "^mitochondrion_genome" genes.gtf|  \
    grep -v transcript_symbol|              \
    awk '{OFS="\t"; print $1,$4,$5,$12,$7}'|   \
    sed 's/[";]//g' >> Target.csv

An script could be like:

import gzip
import pandas as pd
from Bio import SeqIO
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

def get_vcf_names(vcf_path):
    with gzip.open(vcf_path, "rt") as ifile:
          for line in ifile:
            if line.startswith("#CHROM"):
                  vcf_names = [x for x in line.split('\t')]
                  break
    ifile.close()
    return vcf_names

VCF_file = "G1FFH.vcf.gz"
Genome = "../Mito.fa"
Target = "Target.csv"
SAMPLE = VCF_file.replace(".vcf.gz", "")
names = get_vcf_names(VCF_file)
vcf = pd.read_csv('G1FFH.vcf.gz', compression='gzip', comment='#',  header=None, names=names, sep='\t')

TARGET = pd.read_csv(Target, sep='\t')

for seq_record in SeqIO.parse(Genome, "fasta"):
    if seq_record.id in TARGET.CHROM.unique():
        print("Chrom:", seq_record.id)
        TB = pd.DataFrame([*seq_record.seq])
        TB['REF'] = ""
        TB['ALT'] = ""
        TB.iloc[vcf.POS-1,1] = vcf.REF.to_list()
        TB.iloc[vcf.POS-1,2] = [i.split(",")[0] for i in vcf.ALT.to_list()]
        TB.ALT[TB.ALT==""] =  TB[0][TB.ALT==""].to_list()
        TB.head()
        TB["Len"] = [len(i)-1 for i in TB.REF]
        TB.Len[TB.Len<0]=0

        TB["DEL"] = 0
        TB["DEL_mark"] = 0

        TB.DEL =TB.Len.to_list()[:1] +TB.Len.to_list()[:-1]
        TB.DEL_mark += TB.DEL
        TB.DEL[TB.DEL>1] = TB.DEL_mark[TB.DEL>1]-1
        while TB.DEL.max()!=0:
            TB.DEL =TB.DEL.to_list()[:1] +TB.DEL.to_list()[:-1]
            TB.DEL_mark += TB.DEL
            TB.DEL[TB.DEL>0] = TB.DEL[TB.DEL>0]-1

        TMP = TB.ALT[TB.DEL_mark==0]
        F = open(SAMPLE + "_" + seq_record.id +".fa", "w")
        F.close()

        F = open(SAMPLE + "_" + seq_record.id +".fa", "a")
        LIST = []
        for i in range(len(TARGET[TARGET.CHROM==seq_record.id])):
            FROM =TARGET.START[TARGET.CHROM==seq_record.id][i]
            END = TARGET.END[TARGET.CHROM==seq_record.id][i]
            print(FROM,END)
            STR1 = "".join(TB[0][TB.index.isin(range(FROM-1,END))].to_list())
            STR2 = "".join(TMP[TMP.index.isin(range(FROM-1,END))].to_list())
            F.write(">"+TARGET.NAME[TARGET.CHROM==seq_record.id][i] + "_" + SAMPLE + "\n" + STR2 + "\n" )
            LIST += range(FROM-1,END)

        #F.write(">NonHit_" + SAMPLE + "\n" + "".join(TMP[~TMP.index.isin(LIST)]) + "\n" )

        F.close()

reference: d.vitale199, 2021, how to read the VCF files with python pandas.

This script was uploaded into GitHub:Karobben/Bio_tools
A demo would be like:

python vcf2fasta.py -g Genome.fa -v File.vcf.gz -t Target.csv

Other Methods

For a whole chromosome

vcf2fasta

rlebron88, 2016

git clone https://github.com/rlebron88/vcf2fasta.git
python vcf2fasta/vcf2fasta.py -v <VCF file from the sample> -f <reference FASTA file> -o <FASTA file from the sample>

Limitations:

ONLY ONE SEQUENCE/CHROMOSOME PER VCF.
USE “X” TO REFER TO THE SEQUENCE OF REF IN THE VCF FILE.
IF THERE IS MORE THAN ONE ALT, THE FIRST IS USED.
DO NOT USE multiFASTA.

vcftools

MatthewP, 2018: vcftools has a command vcf-consensus which can convert VCF to FASTA file if you got the reference

bgzip -c file.vcf > file.vcf.gz
tabix -fp vcf file.vcf.gz
cat Reference.fasta | vcf-consensus file.vcf.gz > file.fa

Python scripts

Python Scripts: – Didn’t be test.
Peal scripte: – Didn’t be test.

Python code from Jared Andrews, 2017:

consensus = FastaVariant('yourgenome.fasta', 'variants.vcf.gz', het=True, hom=True)

out = open("variants.fasta", "w")

for chrom in consensus.keys:
    for var in consensus[chrom].variants_sites:
        record = consensus[chrom][var-1:var]
        print(record, file=out)

out.close()

Extract Fasta from VCF

https://karobben.github.io/2022/09/16/Bioinfor/fastafromvcf/

Author

Karobben

Posted on

2022-09-16

Updated on

2024-01-11

Licensed under

Extract Fasta from VCF

Extract only a part of genes

Other Methods

For a whole chromosome

vcf2fasta

vcftools

Python scripts

Author

Posted on

Updated on

Licensed under

Like this article? Support the author with

Comments

Catalogue

Tags

Subscribe for updates

Links

Recommends

Categories