So another aspect of thinking about data structures is
making them efficient.
As I said before, in the genomics world we're mostly dealing with sequence data.
Sequence data has a natural representation as letters and
computers typically represent each letter as one byte.
So any letter in the alphabet, a through z, and any numeral, zero
through nine, is represented the same way inside the computer, using one byte.
Now, the word byte is related to the word bit; a byte is a group of bits.
A bit is a binary digit.
And that's just a zero or a one,
and that's at the most fundamental level how computers represent information.
If you take eight bits in a row, you can consider that as
an eight-bit binary number, which can store up to 256 values, 0 to 255.
The standard encoding of text, ASCII, uses only 128 of those, the values 0 to 127.
So the standard representation of text inside the computer
is to represent every letter as one of those values between 0 and 127.
So with that much space to represent information, we can
represent all the lowercase letters, that's 26, and all the uppercase letters, that's another 26.
We can represent the ten single digits, that's another ten, and
then we have room for all the special characters.
So basically everything on your computer keyboard is represented as a single byte.
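
To make the one-byte-per-character idea concrete, here is a minimal Python sketch. The sample characters and the sequence "GATTACA" are arbitrary choices for illustration, not something from the lecture. It prints the integer code of a few keyboard characters and shows that a DNA string stored as ordinary text takes one byte per letter.

```python
# Each keyboard character maps to a small integer code that fits in one byte.
for ch in ["a", "Z", "7", "#"]:
    code = ord(ch)                  # integer code of the character
    bits = format(code, "08b")      # the same value written out as eight bits
    print(ch, code, bits)

# An arbitrary example sequence, stored the usual way: as text.
dna = "GATTACA"
encoded = dna.encode("ascii")       # one byte per letter
print(len(dna), "letters ->", len(encoded), "bytes")
```

Every code printed here is below 128, which is why plain text like this fits the 0 to 127 range described above.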
However, if you look at DNA, you see right away, well,
there are actually only four letters there.
So we can do much, much better when we're representing DNA.
And this is how most serious, highly efficient programs for
processing lots of DNA operate internally.
Instead of representing the four DNA letters as one byte each,
we can represent them as just two bits.
So simply take A and
represent that by the two bits 00; C is 01, G is 10, and T is 11.
And by doing it this way, we get a fourfold compression.
So instead of using eight bits per letter of DNA, we're only using two bits.
So, because we're storing gigabytes or even terabytes of DNA sequence data,
a fourfold compression, right out of the box,
is an important efficiency we can gain from that representation.
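
The two-bit encoding can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names `pack` and `unpack` are made up here, the choice to put the first base in the top two bits of each byte is arbitrary, and the sketch handles only the letters A, C, G, and T. The codes themselves (A=00, C=01, G=10, T=11) are the ones given above.

```python
# Two-bit codes from the lecture: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTER = "ACGT"  # index i gives the letter whose code is i

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (hypothetical helper)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)      # first base goes in the top two bits
        out[i // 4] |= CODE[base] << shift
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        LETTER[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )

seq = "GATTACAGATTACAGA"             # 16 bases: 16 bytes when stored as text
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
assert unpack(packed, len(seq)) == seq   # round trip gives back the original
```

Note that the sequence length has to be kept alongside the packed bytes, because the last byte may be only partly filled.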
So, finally, let's look at a slightly more sophisticated way of
representing DNA.
One thing that we like to capture when
analyzing DNA is patterns of sequence that have some biological function.
And here I'm showing you a picture of the ends of an intron.
So introns are the interrupting sequences in the genes
in our genome that don't actually encode proteins, but get snipped out and
thrown away in the process of going from DNA to RNA to protein.
And introns almost always start with the letters GT and
they almost always end with the letters AG.
And if you collect lots of them together and notice what these patterns have in
common, you can create a probabilistic picture of
what letters are most likely to occur at the beginnings and ends of introns.
So these two pictures show you exactly those patterns for
the beginning of an intron, which is called the donor site, and
the end, which is called the acceptor site.
So now we could represent all the donor sites we've ever seen as a big set
of strings, say ten letters long, if we chopped out a window of ten
bases around those sites. Or we could be much more efficient about it and
capture much more interesting data by computing, for every position in
that little window, the probability that the letter was A, C, G, or T.
And these logos you see across the top use the height of the letter to represent
the probability that letter appears at that location.
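
Computing a probability for each letter at each position of the window might look like the following Python sketch. The four windows are invented for illustration and are not real donor sites (they are also eight letters long rather than ten, to keep the output short), and `position_probabilities` is a hypothetical helper name.

```python
from collections import Counter

# Invented example windows around donor sites; not real data.
windows = [
    "AGGTAAGT",
    "CGGTGAGT",
    "AGGTAAGA",
    "TGGTAAGC",
]

def position_probabilities(seqs):
    """For each column, return a dict of letter -> observed frequency."""
    n = len(seqs)
    width = len(seqs[0])             # assumes all windows have equal length
    matrix = []
    for pos in range(width):
        counts = Counter(s[pos] for s in seqs)
        matrix.append({base: counts[base] / n for base in "ACGT"})
    return matrix

ppm = position_probabilities(windows)
for pos, column in enumerate(ppm):
    print(pos, column)
```

In this toy data, positions 2 and 3 come out as G and T with probability 1.0, which corresponds to the tallest letters in a logo.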
And with this kind of representation we've now essentially compressed
the information from hundreds or even thousands of sequences that we've seen
into a simple pattern, which we can then use to process other data, for
example to recognize these patterns when we see them again.
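
One common way to use such a pattern for recognition is to score a new sequence by how much better it fits the per-position probabilities than a uniform background. The lecture does not say which scoring method is used, so the log-odds score below is an assumption, as are the pseudocount, the helper names `build_matrix` and `log_odds`, and the invented training windows.

```python
import math
from collections import Counter

# Invented training windows; not real data.
windows = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTAAGC"]

def build_matrix(seqs, pseudocount=1.0):
    """Per-position probabilities, with a pseudocount so no letter gets zero."""
    total = len(seqs) + 4 * pseudocount
    matrix = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        matrix.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return matrix

def log_odds(seq, matrix, background=0.25):
    """Sum of log2(p / background) over positions; higher means a closer match."""
    return sum(math.log2(matrix[i][b] / background) for i, b in enumerate(seq))

matrix = build_matrix(windows)
for candidate in ["AGGTAAGT", "ACCATTCA"]:
    print(candidate, round(log_odds(candidate, matrix), 2))
```

A sequence that looks like the training windows gets a positive score, while an unrelated one scores lower; where to set the cutoff is a separate choice that this sketch leaves open.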