Gap from the head and tail could have huge effects on the result of the tree. So, we should avoid the side effects from the gaps because of the length-difference.
Here, we just counting the gaps from the head and tail of each sequences and retain the largest number. On the other hand, the numbers’ of gap extension were printed out so you can customize your own.
## reading the alignments align = AlignIO.read(OUTPUT + ".fasta", "fasta")
## Count the longest space on the head
Gap_head = 0 Gap_tail = 0 Gap_list_head = [] Gap_list_tail = [] for seq_tmp in align: Num = 0 while seq_tmp[Num] == "-": Num +=1 Gap_list_head += [Num] if Num > Gap_head: Gap_head = Num ## Count the longest space on the tail for seq_tmp in align: Num = 0 while seq_tmp[::-1][Num] == "-": Num +=1 Gap_list_tail += [Num] if Num > Gap_tail: Gap_tail = Num
print("\nGap head list: ", Gap_list_head, "\nGap head tail: ", Gap_list_tail) print("\nGap head: ", Gap_head, "\nGap head: ", Gap_tail)
## Count the longest space on the head
# Alignment objects can be manipulated
# slice alignment align_slice = align[:, Gap_head : len(seq_tmp)-Gap_tail] print ("Slice of alignment from position " + str(Gap_head) + " to " + str(Gap_tail) +"\n") print(align_slice)
## wirte it out F = open(OUTPUT + "_scliced.fa", 'w') F2 = open(OUTPUT + "_gap.fa", 'w') for Seq_tmp in align_slice: F.write(">" + Seq_tmp.id+"\n") F2.write(">" + Seq_tmp.id+"\n") Seq_seq = str(Seq_tmp.seq) F2.write(Seq_seq+"\n") Seq_seq = Seq_seq.replace("-","") F.write(Seq_seq+"\n") F.close() F2.close()
Gap head list: [115, 115, 86, 86, 86, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 115, 91, 170, 70, 115, 117, 0, 115, 115, 192]
Gap head tail: [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 0, 0, 12]
Gap head: 192
Gap head: 13
Slice of alignment from position 192 to 13
Alignment with 28 rows and 167 columns