Ryan Avery


This program uses B-Trees to handle the large amount of data from a given GeneBank (genome) file, thus allowing the genome to be quickly searched in the future. Due to the large amount of data involved (that would likely not fit into memory on most machines), a B-Tree data structure that reads and writes its contents to physical memory was a good choice for this project.

The program converts the DNA sequence in a given GeneBank file into a B-Tree were each object is a subsequencelarge project picture of a specified length k (1 <= k <= 31). For example, when given the DNA sequence AATTCGACGT, the subsequences when k=4 are: AATT, ATTC, TTCG, TCGA, CGAC, GACG, ACGT. After the GeneBank file is broken into these subsequences, each possible one is stored in a new file.

Once a B-Tree is created with these subsequences, this program then has the capability of searching the file created for a given query sequences of length k. The search returns the frequency of the occurrences of the query string (which will be zero if it is not found) in a list format.

The B-Tree uses parent pointers and corrects oversized nodes all the way up to the root node on insertion. This was a design decision to make testing for an accurate B-Tree easier. The B-Tree also has a fully-functioning delete method. Keys that are deleted can lead to nodes being deleted, and currently, the file offsets of deleted nodes are not refreshed and are therefore never used again.

The B-Tree was tested against this solution to determine correct results. Once finished, the entire project was tested against the provided results for test3.gbk with query 7, as was required by the documentation for this project which can be found here.