Manuscript Title:

ANALYSIS OF FREQUENT NUCLEOTIDE PATTERNS IN COVID-19 GENOME SEQUENCES USING SPM ALGORITHMS

Authors:

AQSA UMAR, SANIA BHATTI, AREEJ FATEMAH MEGHJI, NAEEM AHMED MAHOTO, SAPNA KUMARI

DOI Number:

DOI:10.17605/OSF.IO/D7GA3

Published : 2023-02-10

About the author(s)

1. AQSA UMAR - Research Scholar, Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan.
2. SANIA BHATTI - Professor, Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan.
3. AREEJ FATEMAH MEGHJI - Assistant Professor, Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan.
4. NAEEM AHMED MAHOTO - Associate Professor, Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan.
5. SAPNA KUMARI - Research Scholar, Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan.


Abstract

COVID-19 was first identified in Wuhan, China, in December 2019. It has since been declared a pandemic and has spread across the globe, affecting millions of people. The genome sequences of COVID-19 strains must be examined to understand the behavior and origin of this virus. To this end, we apply four Sequential Pattern Mining (SPM) techniques to six strains of COVID-19 genome sequences (CGS) from Pakistan, India, Spain, the United Kingdom, China, and Brazil: fast vertical mining of sequential patterns using co-occurrence information (CM-SPAM), vertical mining of maximal sequential patterns (VMSP), closed SPM using sparse and vertical id-lists (CloFAST), and efficient mining of top-k sequential patterns (TKS). First, frequent patterns are extracted from the CGS using the CM-SPAM, VMSP, and CloFAST algorithms, and the extracted patterns are then checked to determine whether they encode codons of amino acids. Second, the TKS algorithm is used with a user-defined value of k = 500 to extract the most frequent patterns from the six strains. Third, the results show that frequent patterns corresponding to most of the codons are present in the six strains. These results are encouraging: the study provides an efficient way to analyze CGS and points to a direction for future CGS analysis.
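
The two checks described above (ranking the most frequent patterns and testing whether a pattern encodes a codon) can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not implement CM-SPAM, VMSP, CloFAST, or TKS: as a simplifying assumption it counts only contiguous three-nucleotide windows, whereas SPM algorithms mine more general sequential patterns. The function names `top_k_patterns` and `codon_to_amino_acid`, the toy input sequence, and the use of DNA letters (T rather than U) are likewise assumptions made for illustration.

```python
from collections import Counter

# Standard genetic code in DNA letters, built in the usual T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def top_k_patterns(sequence, length=3, k=500):
    """Return the k most frequent contiguous patterns of the given length.

    Hypothetical stand-in for a top-k miner: it only slides a window over
    the sequence and counts what it sees.
    """
    sequence = sequence.upper()
    windows = (sequence[i:i + length] for i in range(len(sequence) - length + 1))
    return Counter(windows).most_common(k)


def codon_to_amino_acid(pattern):
    """Return the one-letter amino acid for a 3-base pattern, or None."""
    return CODON_TABLE.get(pattern.upper())


if __name__ == "__main__":
    toy_sequence = "ATGATGATGAAA"  # made-up fragment, not real genome data
    for pattern, count in top_k_patterns(toy_sequence, k=5):
        print(pattern, count, codon_to_amino_acid(pattern))
```

In a real analysis the toy string would be replaced by a full genome sequence, and the window counter by the output of an actual SPM implementation; the codon lookup step would remain the same.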


Keywords

COVID-19 Genome Sequences, Frequent Patterns, Nucleotide Bases, Sequential Pattern Mining, Top-k Sequential Patterns, Vertical Mining of Maximal Sequential Patterns, Nucleotides.