Demystifying the odd base conundrum

The picture above shows a lot of beads, mostly in shades of blue or pink. But one orange bead clearly stands out.

In the same way, our DNA is a long string of four chemical beads (bases), Adenine (A), Thymine (T), Guanine (G), and Cytosine (C), arranged in a specific order. Like the orange bead, some bases in DNA differ from the expected order. A base can be replaced by another base (substitution), lost (deletion), or gained (insertion). Such a change is called a mutation or variation; when a single base is substituted, it is often called a single nucleotide polymorphism (SNP). Some variations are common among all of us and are called high-frequency variations. Others are seen in only some of us and are called low-frequency variations. High-frequency variations contribute to the diversity we see among people, such as hair color, eye color, perception of smell, and response to drugs. Low-frequency variations are found in a small fraction of the population; they can contribute to heightened responses to stimuli and can also cause certain rare disorders. These variations can be used to diagnose a disorder or to guide its treatment.

The picture above has 18 beads, only one of which is orange; in DNA terms, the orange bead would be a low-frequency variant. We read our DNA as short sequenced fragments called reads. Wouldn’t it be nice if we could see how many reads support a given base? Before we find out whether there is a way to do that, let’s look at how a variant is identified.

Identifying the variant

The DNA is isolated and sequenced, which yields the sequence of bases in small fragments called reads. The reads are then aligned (mapped) to a reference DNA sequence. Each position is covered by multiple reads, and if all of them carry the same base as the reference, there is no variant at that position. Sequencing can introduce errors, and these have to be taken into account. So at every position, every read is checked for base quality, mapping quality, and error rate; this step is called pre-processing. Then the number of reads supporting a variant is counted, the probability of the variant is calculated, and a log score is assigned. A position is said to have a variant if its score is higher than the threshold score.
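The scoring step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scoring that Strand NGS or any specific variant caller uses: it assumes a single uniform sequencing error rate, asks how likely it is to see at least the observed number of variant-supporting reads from errors alone (a binomial tail), and reports that probability on a Phred-like log scale. The names `variant_score` and `is_variant`, the default error rate, and the threshold value are all hypothetical.

```python
import math


def variant_score(supporting_reads, total_reads, error_rate=0.01):
    """Phred-like score: -10 * log10 P(X >= supporting_reads), where
    X ~ Binomial(total_reads, error_rate) models reads that differ from
    the reference purely because of sequencing error (an assumption)."""
    if not 0 <= supporting_reads <= total_reads:
        raise ValueError("need 0 <= supporting_reads <= total_reads")
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")

    # Work in log space so deep coverage does not overflow or underflow.
    log_terms = [
        math.lgamma(total_reads + 1)
        - math.lgamma(i + 1)
        - math.lgamma(total_reads - i + 1)
        + i * math.log(error_rate)
        + (total_reads - i) * math.log(1.0 - error_rate)
        for i in range(supporting_reads, total_reads + 1)
    ]
    largest = max(log_terms)
    log_p = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    return max(0.0, -10.0 * log_p / math.log(10.0))


def is_variant(supporting_reads, total_reads, threshold=50.0, error_rate=0.01):
    """Call a variant when the score is higher than the (assumed) threshold."""
    return variant_score(supporting_reads, total_reads, error_rate) > threshold


if __name__ == "__main__":
    print(is_variant(1, 18))    # one odd read out of 18: explainable by error
    print(is_variant(89, 197))  # 89 of 197 reads: far beyond the error model
```

A real caller would also weigh the per-read base and mapping qualities from pre-processing rather than use one fixed error rate.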

Visualizing the variant

Yes, the identified variant can be seen in the genome browser, just like the orange bead. The reference sequence is shown at the top and any variant is marked in the reads. A deleted base is shown as ‘-’ in the read, and an inserted base is shown as ‘-’ in the reference.

Genome browser view-Visualizing the variant

Easy, but we don’t have just 18 reads. A variant is usually covered by many reads, and it is hard to tell from the genome browser how many of them support it. In Figure (2) there is a base deletion, but only some of the reads support it. Is there an easier way to look at this?

Yes, there is: the Variant Support View.

Right-click on the preferred SNP sample report and click “Show in Variant Support View”

This view shows all the variants. Click on a variant and the Variant Support View opens for that variant.

You can also right-click a position in the read list in the genome browser and click “Show Variant Support View”.

The selected variant is highlighted, and 10 bases on each side of it are shown. The reads are clustered for easier visualization. The size of each cluster is shown on the right, and only clusters with more than one read are displayed. The view is colored by base quality, and the coloring can be changed to mapping quality or positional bias.
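The clustering idea can be sketched as follows. This is a simplified illustration, not how the Variant Support View is implemented: it assumes every read has already been aligned and padded into a gapped string in reference coordinates, so the same index means the same position in each read. The function name `cluster_reads` and its parameters are hypothetical. Reads whose bases are identical within the window around the variant fall into one cluster, and clusters with a single read are dropped.

```python
from collections import Counter


def cluster_reads(aligned_reads, variant_index, flank=10, min_size=2):
    """Group reads that look identical in the window around the variant.

    aligned_reads: gapped strings that share reference coordinates (assumed).
    Returns (window_sequence, cluster_size) pairs, largest cluster first.
    """
    start = max(0, variant_index - flank)
    end = variant_index + flank + 1
    windows = Counter(read[start:end] for read in aligned_reads)
    return [(seq, size) for seq, size in windows.most_common() if size >= min_size]


if __name__ == "__main__":
    # Toy data: three reads with G, two with C, and one lone read with an A.
    reads = ["ACGTA"] * 3 + ["ACCTA"] * 2 + ["ACGAA"]
    for seq, size in cluster_reads(reads, variant_index=2, flank=1):
        print(seq, size)
```

The cluster size plays the role of the number shown on the right-hand side of the view.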

Changing the wizard options

The reference base (T) is shown at the top. In each cluster, a consensus base that matches the reference is shown as ‘.’, a substituted base (A/G/C) is shown as the variant base itself, and a deletion is shown as ‘-’ in the cluster. For an insertion, ‘-’ is shown in the reference.
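The display convention above can be mimicked with a small function. This is a sketch that assumes the reference and the cluster consensus are given as equal-length gapped strings; `render_cluster` is a hypothetical name, and the handling of a position where both strings carry ‘-’ (kept as ‘-’ here) is an assumption rather than documented behavior of the tool.

```python
def render_cluster(reference, consensus):
    """Render a cluster consensus against the reference:
    '.' for a match, the base itself for a substitution or inserted base,
    and '-' where the cluster has a deletion."""
    if len(reference) != len(consensus):
        raise ValueError("reference and consensus must have the same length")
    return "".join(
        "." if ref_base == base and ref_base != "-" else base
        for ref_base, base in zip(reference, consensus)
    )


if __name__ == "__main__":
    print(render_cluster("ACGTA", "ACGCA"))  # substitution T -> C
    print(render_cluster("ACGTA", "AC-TA"))  # deletion of G
    print(render_cluster("AC-TA", "ACATA"))  # insertion of A ('-' in reference)
```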

Substitution

The base ‘T’ is substituted with ‘C’. Of the 197 reads (count shown at the bottom), 89 (shown on the right-hand side) support the variant ‘C’. That is about 45% of the reads, close to the 50% expected when one of two alleles carries the change. This suggests a heterozygous variant, with one allele having the mutation.
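The arithmetic behind this reading is simply supporting reads divided by total reads. The sketch below computes that fraction and attaches a rough label; the cutoffs (below 30% low-frequency, 30% to 70% consistent with heterozygous, above 70% consistent with homozygous) are illustrative assumptions, not values from Strand NGS, and the function names are hypothetical.

```python
def supporting_fraction(supporting_reads, total_reads):
    """Fraction of reads at a position that support the variant."""
    if total_reads <= 0 or not 0 <= supporting_reads <= total_reads:
        raise ValueError("need total_reads > 0 and 0 <= supporting_reads <= total_reads")
    return supporting_reads / total_reads


def rough_label(supporting_reads, total_reads, low=0.3, high=0.7):
    """Very rough interpretation; the cutoffs are illustrative assumptions."""
    fraction = supporting_fraction(supporting_reads, total_reads)
    if fraction < low:
        return "low-frequency"
    if fraction <= high:
        return "consistent with heterozygous"
    return "consistent with homozygous"


if __name__ == "__main__":
    fraction = supporting_fraction(89, 197)
    print(f"{fraction:.1%}", rough_label(89, 197))
```

For the substitution above, 89 of 197 reads works out to about 45%, which falls in the range this sketch labels as consistent with a heterozygous variant.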

Insertion

Here, an ‘A’ is inserted after the base ‘T’. 29 reads support the insertion, which is ~9% of the reads, so it is a low-frequency insertion. Tumor cells are highly heterogeneous, so in a tumor sample such an insertion may be present in only some of the cells. The reads can also be divided by positive or negative strand by selecting “Split Clusters by strand” in the properties.

Deletion

Here, the base ‘C’ is deleted. 30 reads support the deletion, which is ~25% of the reads.

Complex variation

Substitutions and indels can also occur together, and this is very difficult to see in a genome browser. Here, after the base ‘G’, there are two insertions and one substitution followed by an insertion, and the supporting reads for each are easy to identify.

How does this help in a real case scenario?

Tumors are treated with chemotherapy, anti-cancer drugs, and other therapies. In some cases, anti-cancer drugs fail to shrink the tumor. Some tumor cells overexpress certain genes that hamper the effectiveness of the drug. One such gene family is the ABC (ATP-binding cassette) family. Mutations in ABC genes cause them to be overexpressed in tumor cells, so the drugs are not effective against these cells (1).

The paper by Kadioglu, Onat, et al. (1) used Strand NGS 3.4 to identify the variants that cause ABC genes to be overexpressed and found that they are mostly rare variants. These variants were visualized using the Variant Support View.