0:07

Let's look at some of the technical challenges involved in astronomical data and calculations, and how we tackle those challenges. We're going to look at how much data we want to store, how long it takes to search through it, and how long it takes to do calculations. In each of those cases, we'll see there's a brute-force method and a smart method - and in fact we usually want both. Then right at the end, we'll wrap up by looking at how we try to make data access over the internet as easy as possible.


0:48

So let's talk about how much data we need. First of all, let's think about a single CCD image. Maybe one CCD image is about 4,000 pixels across, and each of those pixels stores a number with 16 bits - that is, 2 bytes - of information. If you do the sums (4,000 x 4,000 pixels, 2 bytes each), that adds up to 32 megabytes.
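As a sanity check, that single-image sum can be written out directly; the 4,000-pixel width and 16-bit depth are the figures quoted above:

```python
# Storage for one CCD image: (pixels across)^2 pixels, 2 bytes each.
pixels_across = 4_000        # image is 4,000 x 4,000 pixels
bytes_per_pixel = 2          # 16 bits = 2 bytes

image_bytes = pixels_across ** 2 * bytes_per_pixel
print(image_bytes / 1e6)     # 32.0 (megabytes)
```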

1:24

So that's what a single CCD image might require for storage - not too much by modern standards. However, as you've seen, we tend to use large mosaic cameras with many CCDs put together in a big array. Those gigapixel cameras can be much bigger: for a mosaic camera, a single image might be 1 to a few gigabytes.

1:55

So that's a big image. Now, what about if we want to survey the whole sky? Well, you could ask: how many pixels are there over the whole sky? If we try to pave the sky with CCD images, what do we need? It depends how fine the pixels are, but let's suppose the pixels are about one third - 0.3 - of an arcsecond, which makes them small enough to get reasonably decent images.

2:31

Then again, assume we have 16-bit numbers. Oh, and by the way, a 16-bit number is enough to store values up to about 65,000, and that's typically what CCDs produce.

2:53

Now actually, of course, we want to do the same thing at several different wavelengths - different colors - and we also want to repeat the sky. So in practice, for a typical modern sky survey, the whole sky will be something like petabyte scales. Now, a reminder about all of these petas and gigas and so on: a factor of 10 to the power 3, a thousand, in scientific notation, that's kilo, as in kilobyte, etcetera. 10 to the 6, that's big M for mega. 10 to the power 9, a billion, is big G for giga. 10 to the power 12, big T for tera. And then we get up to 10 to the power 15, a thousand trillion - that's big P for peta. So a petabyte is a thousand trillion bytes, and remember, each byte is 8 bits in computer speak. So that's how big a sky survey needs to be.
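To see where the petabyte figure comes from, here is a rough version of those sums. The whole sky is about 41,253 square degrees; the pixel size is the 0.3 arcseconds from above, while the number of wavelength bands and repeat visits below are illustrative assumptions, not figures from the lecture:

```python
# Rough whole-sky sums: pave the sky with 0.3-arcsecond, 16-bit pixels.
SKY_SQ_DEG = 41_253                      # total sky area in square degrees
pixel_arcsec = 0.3
bytes_per_pixel = 2                      # 16-bit numbers

pixels_per_deg = 3600 / pixel_arcsec     # arcseconds per degree / pixel size
pixels_whole_sky = SKY_SQ_DEG * pixels_per_deg ** 2

one_pass_bytes = pixels_whole_sky * bytes_per_pixel
print(f"one pass, one band: {one_pass_bytes / 1e12:.0f} TB")

# Several wavelength bands and many repeat visits push this to petabytes.
bands, repeats = 5, 20                   # assumed, for illustration only
survey_bytes = one_pass_bytes * bands * repeats
print(f"full survey: {survey_bytes / 1e15:.1f} PB")
```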

4:04

Now today, if you've got a reasonably good laptop, you could get yourself a 1-terabyte disk - it's not that unusual to do so. So to store a sky survey for yourself, you'd need about 1,000 laptops.

4:24

So it's doable, but really a bit daft. If every astronomer has to have 1,000 laptops to store their own copies of all their favourite databases, that's just silly.

So here's the smart solution. What it drives us into is what's known as a service economy. Around the world, there's a handful of big data centers - one of which is here in Edinburgh, and there are a number of others -

4:49

where we store the major sky surveys on big computers with lots of disks. Astronomers around the world, through the internet, go and get only the data they want, or do calculations or searches on our servers here. So that's the smart way to do it.


5:13

It's not just the pixels - the images - that astronomers want, but also catalogues of objects. Imagine an image of the sky with lots of stars and galaxies on it. We have software that goes over the image and spots each of those objects. Each object becomes a row in a table, and that makes a catalogue of objects. So for each object there is a row, and there are lots of columns, and each column is a different piece of information about that object. That, then, is our catalogue. So how big are these catalogues? Well, they're not nearly as big as the pixel data.

5:57

So for instance - it's a lot of objects; out of a sky survey we might have a billion objects, or a few billion - and maybe there might be 50 of these columns. If each entry is a couple of bytes, then we end up with something like 100 gigabytes for a big sky survey.
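That catalogue estimate is the same kind of back-of-the-envelope sum, using the round numbers just quoted:

```python
# Catalogue size: about a billion rows, ~50 columns, ~2 bytes per entry.
n_objects = 1_000_000_000     # a billion catalogue rows
n_columns = 50
bytes_per_entry = 2

catalogue_bytes = n_objects * n_columns * bytes_per_entry
print(catalogue_bytes / 1e9)  # 100.0 (gigabytes)
```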

So that's no problem to store; however, searching through it is something else, and that's what we'll look at next.


So how do we search through a table of a billion objects and find just the one we want - that redshift-seven quasar, or the killer rock, or whatever? Imagine we've got lots of rows in our table, sitting on the hard drive of our computer, and over here is the CPU, the bit that does the calculating. In order to do our search, essentially what we have to do is take one row of data, bring it into the CPU, do a calculation, and decide whether we want that row or not. Then we take the next row and do it again, and the next row, and so on.

Now, imagine all this data streaming from the hard drive to the CPU. A good PC will run at gigahertz rates, so in principle you can stream a billion rows of data through in a split second. It's not a problem. However, it doesn't really work like that. In any search process like this, the transfer from the hard drive to the CPU happens in lots of chunks, and each chunk has some kind of overhead. That overhead may be only a few milliseconds, but when you multiply a few milliseconds by a billion, you're into a much longer time. It could take days to search through your big database.
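The days figure follows directly from the overhead arithmetic; the 3 ms per access used here is just an illustrative value in the "few milliseconds" range mentioned above:

```python
# Streaming vs. overhead: a billion rows at GHz rates is ~1 second of
# CPU work, but a per-access overhead of a few milliseconds dominates.
n_rows = 1_000_000_000
overhead_per_access = 0.003        # seconds; illustrative "few ms"

total_seconds = n_rows * overhead_per_access
print(total_seconds / 86_400)      # ~34.7 days
```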

7:59

Now, modern solid-state disks, as opposed to spinning hard drives, have much smaller overheads - they're faster - but they are still expensive, so data centers, at least scientific ones, are largely not using those.

8:16

So the key point is that you don't actually have to search through everything every time. The first thing - and this is just the same as, say, Google or Amazon do - is to save the most popular searches, so they can be brought back quickly the next time somebody asks pretty much the same thing. The next thing is to figure out which of the columns are the most commonly searched, and put your database in the right order so that you can search through them quickly.

8:48

And then you can pick some of these columns to build an index on, and search through those particularly quickly.
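In a relational database, building such an index is a one-line operation. Here is a minimal SQLite sketch; the table and column names are invented for illustration:

```python
# Indexing a catalogue column so searches on it no longer scan every row.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (ra REAL, dec REAL, mag REAL)")
con.executemany("INSERT INTO objects VALUES (?, ?, ?)",
                [(i * 0.01, -i * 0.01, 15.0 + i % 10) for i in range(10_000)])

# A B-tree index on the commonly-searched column:
con.execute("CREATE INDEX idx_mag ON objects (mag)")

# The query planner now uses the index instead of a full table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM objects WHERE mag = 17.0").fetchall()
print(plan)   # the plan should mention idx_mag rather than a full scan
```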


So let's talk about astronomical calculations and how long they take. I'll take as an example the so-called N-body calculations that cosmological theorists use. The idea here is that we take lots of fake matter particles, and for any two of those particles we can calculate the gravitational force between them, which tells us how they're going to move over time. But we need to take every possible pair of particles and calculate the forces between all of them, to understand how the whole ensemble of particles is going to evolve with time.

Now, a big calculation might have a million fake matter particles, and a really big one might have 100 million. That step from a million to 100 million makes a big difference, as we'll see. But first of all, let's make the basic point. Imagine we've got one of these big simulations with 100 million particles - that's 10 to the power 8 particles.

10:19

Now, on a fast computer, doing one calculation per particle is going to take less than a second - it's not a problem. Let's just say that takes 1 second. However, as I just described, we need to do not one calculation per particle, but one calculation for every pair of particles. So we need to do 10 to the power 8 times 10 to the power 8 calculations - every particle in that 10 to the 8 has to be paired with all the others.

10:53

So that's an enormous amount of time: it's going to take years. And that essentially would make one frame, one time step, in the simulation movie that we saw earlier. And you need lots of those to see how the universe evolves. So this is just very difficult.
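The pairwise structure of the problem can be sketched in a few lines. This is a minimal direct-summation version, not any particular production code; the units and the softening constant are illustrative assumptions:

```python
# Brute-force N-body step: every pair of particles contributes a
# gravitational force, so the work grows as n squared.
import numpy as np

def accelerations(pos, mass, G=1.0, soft=1e-3):
    """Direct-summation gravitational acceleration over all O(n^2) pairs."""
    n = len(pos)
    acc = np.zeros_like(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = pos[j] - pos[i]                  # vector from i to j
            r2 = d @ d + soft**2                 # softened squared distance
            acc[i] += G * mass[j] * d / r2**1.5  # Newtonian inverse-square pull
    return acc

rng = np.random.default_rng(0)
pos = rng.standard_normal((100, 3))   # 100 particles -> 9,900 ordered pairs
mass = np.ones(100)
a = accelerations(pos, mass)
```

Even at this toy size the double loop is the bottleneck; scale n up to 10 to the 8 and the n-squared pair count is exactly what makes the full calculation take years.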

So the brute-force solution is to say: okay, if this is what one PC can do, we need a supercomputer. A supercomputer is really just thousands of computers chained together, working in parallel. A big supercomputer might have several thousand nodes, and every one of those nodes might have 10 or 20 cores in it, so there could effectively be many thousands of CPUs working in parallel.

11:45

But even then, there are two snags. It still wouldn't be fast enough, with this sort of calculation, to get things done really quickly - and the other snag is that those machines cost millions of pounds. We'd like to do something a bit smarter.

12:03

So the smart solution is to do with being approximate. What I've described here is what you have to do if you're going to do this calculation exactly: every particle and its effect on every other particle. But there are shortcuts you can take, which have to do with the fact that particles further away are, as individual pairs, less important. (They do add up to a lot.) Now, we haven't got time to explain exactly how this works, but for the mathematically minded: instead of a problem that scales as n times n, we can get a speed-up so that it scales as n times the logarithm of n. And this makes a very serious difference to the speed of our calculation. So for instance, let's start... if we imagine we have our 10 to the 6 particles and let's say...

14:04

If we do the n-times-the-logarithm-of-n method, that comes out to about 30 minutes. Notice that's still quite a long time to calculate one timestep in this simulation, but at least it's something plausible that we can do. So big calculations are very difficult.
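To get a feel for the gain, here is the n-squared versus n-log-n operation count for the particle numbers quoted above. Log base 2 is used here for illustration; the exact base depends on the method, so treat the ratios as order-of-magnitude:

```python
# Compare the work for the exact O(n^2) method and an O(n log n) method.
import math

for n in (10**6, 10**8):
    brute = n * n                    # every pair of particles
    smart = n * math.log2(n)         # tree-style approximate method
    print(f"n = {n:.0e}: speed-up factor ~ {brute / smart:,.0f}")
```

For a million particles the saving is a factor of tens of thousands; for 100 million it is millions, which is what turns "years" into something like minutes per timestep.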


So we've been talking about technical challenges: how much data there is to store, how long it takes to search through, and how long it takes to do calculations. All those problems are about machine time - how long it takes a computer or a supercomputer to perform these tasks. But the real-world problem is not just about machine time; it's about human time as well.

15:00

So, for example, what we don't want is that every time we go and get some data, we have to spend a whole afternoon working out how this particular website works and what we have to do - and then, when we get the data, we have to write a special piece of software to deal with it and plot it on top of something else. All of that just uses up astronomer time, even if the machines are very fast.

Now, the modern internet that we're used to - when we're looking for information, doing our shopping, etc. - is very point-and-click. It's very easy. It took a lot of effort to get it that way, but it's automated. And we want astronomy to work the same way: grabbing data, mixing and matching it. That ideal is known as the virtual observatory. And the secret - to either the virtual observatory or the internet as a whole - is the same thing: standardization. What we need, essentially, is for everything to have the same screw threads, so that the bits fit together - the different web pages, the data sets, and so on. For the internet, that's about things like TCP/IP, the basic internet protocols; or HTTP, which defines how you speak to a website; or HTML, how you write the content of a website. All those things are standards which were agreed internationally, and that's what makes it all magic and easy. In the same way, in astronomy we want to standardize the format of data, the data access protocols, and so on. And we're in the middle of that process as we speak.
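As a concrete taste of such a standard, the IVOA Simple Cone Search protocol lets any compliant service be queried with the same three URL parameters: RA, DEC, and a search radius SR, all in degrees. The service URL below is a placeholder, not a real endpoint:

```python
# Building a standard cone-search request: because the protocol is
# standardized, the same query shape works against any compliant service.
from urllib.parse import urlencode

def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Return a Simple Cone Search URL for a sky position and radius."""
    params = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    return f"{base_url}?{params}"

# Hypothetical service endpoint, for illustration only.
url = cone_search_url("https://example.org/scs", ra_deg=180.0,
                      dec_deg=-1.5, radius_deg=0.1)
print(url)   # https://example.org/scs?RA=180.0&DEC=-1.5&SR=0.1
```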
