My name is Ashish Mahabal. I am a senior research scientist at Caltech,
specifically at the Center for Data-Driven Discovery here,
and I have been working on large-scale sky surveys, which means that
I've been using a lot of mathematical and statistical techniques.
And that has gotten me interested in methodology transfer,
in how other fields also use these methods, so I have also been applying some of these techniques
to Earth science and health care data, and some cancer research data as well.
Since I came to Caltech in 1999, I have been working on Big Data,
because of the surveys that I was involved in: the Palomar-Quest Survey,
before that the Digitized Palomar Observatory Survey,
more recently the Catalina Real-Time Transient Survey, and a little bit
the Palomar Transient Factory.
So what we do is observe large parts of the sky again and again.
So essentially, this is like taking digital movies.
And that has become possible only in the last several years.
Until then, people would mainly go out,
look at small parts of the sky or specific samples, and come back and study those.
But these digital movies essentially give you lots of data, and moreover,
you can find what is changing at different levels in the universe:
in our solar system, in our galaxy, and also outside of our galaxy.
First of all, what one needs to do is make sure
that the data are of good quality, and that there are not too many missing data points
in what you have. When it comes to large data from these surveys,
we mainly work on the time series in the data sets.
So the time series can be very gappy.
They are heteroscedastic in the sense that the error bars
can vary on the same object depending on when you are observing it.
And of course, the objects that we have vary in brightness quite a lot.
So what that means is that the time series that we deal with are quite different from what
the financial services people deal with, where you have very specific times
when the data are taken. And so that provides new challenges:
trying to figure out what objects are doing when you are not observing them.
And that's most of the time, because the amount of time we observe
is really small compared to the total time
over which the variations in the objects are taking place.
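To make that concrete, here is a minimal sketch - the light curve is invented for the example - of how one might handle a gappy, heteroscedastic time series in Python, weighting each point by its error bar:

```python
import numpy as np

# Hypothetical light curve: observation times (days), magnitudes, and
# per-point error bars. Survey light curves look like this: irregularly
# sampled, with large gaps (see np.diff(t)) and varying errors.
t = np.array([0.0, 1.3, 1.4, 52.1, 53.0, 210.7])
mag = np.array([17.2, 17.3, 17.1, 16.4, 16.5, 17.2])
err = np.array([0.05, 0.04, 0.06, 0.12, 0.10, 0.05])

# Inverse-variance weights: noisier points count for less.
w = 1.0 / err**2
wmean = np.sum(w * mag) / np.sum(w)

# Error-weighted scatter, compared to the typical error bar, asks
# whether the object is genuinely variable or just noisy.
wvar = np.sum(w * (mag - wmean) ** 2) / np.sum(w)
excess = wvar - np.mean(err**2)  # > 0 hints at intrinsic variability

print(f"weighted mean = {wmean:.3f}, excess variance = {excess:.4f}")
```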
Rather than study a single object which may be doing something weird,
you try to do it in a statistical manner. So for instance - and I don't work on those,
but let's take an example of how stars evolve.
You have a star that spends its time in, say, the main sequence for a long time
and then it will evolve into a giant and so on.
And those time scales are so long that you don't get to see them in your own lifetime.
So what you instead do is observe millions of stars - some of them in one phase
and some of them in the other.
Similarly, when we are looking at objects that vary in brightness - consider a supernova,
for instance: the supernova stage lasts only a few weeks,
but if you had observations before that, and if you're lucky,
you can find the star that was its progenitor.
And then, by looking at the entire time series,
you can try to design specific statistical features,
which you can then look for in your entire data set.
So once you start understanding a little bit more about the kinds of objects
that you are interested in, you design these features, or filters,
that you can use across the data set to try to find more of them.
And once you have a large enough sample, then you are in business.
Because then you can start applying many of the standard techniques
to the data set after that.
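As an illustration of that feature-and-filter idea, here is a minimal Python sketch; the catalog, the features, and the thresholds are all made up for the example, and real feature sets are much richer:

```python
import numpy as np
from scipy import stats

def light_curve_features(mag):
    # A few illustrative statistical features; real feature sets are
    # larger and tuned to the classes of interest.
    return {
        "amplitude": np.max(mag) - np.min(mag),
        "median": np.median(mag),
        "mad": stats.median_abs_deviation(mag),
        "skew": stats.skew(mag),
    }

# Hypothetical catalog: object name -> magnitudes over time.
catalog = {
    "obj1": np.array([17.0, 17.1, 16.9, 17.0, 17.1]),
    "obj2": np.array([17.5, 16.0, 17.6, 17.4, 17.5]),  # one big brightening
}

# A "filter" designed from a known sample: large amplitude plus
# negative skew (occasional brightenings; in magnitudes, brighter
# means a smaller number).
candidates = []
for name, mags in catalog.items():
    f = light_curve_features(mags)
    if f["amplitude"] > 1.0 and f["skew"] < 0:
        candidates.append(name)

print(candidates)  # objects worth following up: ['obj2']
```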
So I can answer that on two different levels.
One level is getting a good data set in the first place.
Most surveys are designed with specific goals in mind
and what that essentially means is that you are trying to go after
either some low-hanging fruit or some specific classes.
So data on other classes also exist in that data set,
but those may not have been observed optimally in order to go after those classes.
And so what would be useful is if you could combine different data sets to do that.
And I'm also working on what is called Domain Adaptation or Model Adaptation,
where you try to combine these data sets.
And then that becomes very interesting.
Because, for instance, if you want to do classification,
then you may find that objects that don't vary in brightness
hugely outnumber all other classes.
And within the classes that vary, there may be some classes, like the flaring M stars,
which are far more numerous than some other class.
And so what that means is that the data sets are not balanced.
And if they are not balanced, most techniques don't work well on them directly.
So what you need to do is find artificial ways to balance them,
and make sure that the technique
that you are applying makes sense, because you don't want to find correlations
that don't really mean anything.
Because correlation is not causation and you are always going to find some correlation.
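Here is one minimal way such rebalancing might look in scikit-learn - the data are synthetic, and class weighting is just one of several possible mitigations (over- or under-sampling, SMOTE, and so on):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical, highly imbalanced training set: 1000 non-variables
# vs. 30 variables, each described by a few light-curve features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (1000, 3)), rng.normal(2, 1, (30, 3))])
y = np.array([0] * 1000 + [1] * 30)

# Reweight classes inversely to their frequency so the rare class
# is not drowned out by the majority class.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)

# Sanity-check performance on the rare class (here on the training
# set, just for illustration) rather than overall accuracy, which
# an imbalanced set can make misleadingly high.
print(clf.score(X[y == 1], y[y == 1]))
```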
So getting a good data set, ordering it in a way
that gives you good balance, and having proper metadata
that tells you enough about the data set -
I think those are the biggest challenges while you are pre-processing.
And during the process itself: making sure
that you can follow up each step with a proven answer
and make it reproducible. So that's the other angle of the challenges.
So many times, what you do is start with simple correlations
and simple visualization. And languages like Python and R are great for that.
Python is becoming the workhorse for many, many things, and things like scikit-learn
are lovely to just start playing around with.
R has a large number of statistical libraries which have been written by statisticians.
So that's the good part. And so playing around with a bunch of these different ones,
I would say, should be one of the first things.
And I would advise people to learn both of them - Python and R -
because both of them have some good things,
and they should have both in their repertoire, I think.
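As a sketch of that kind of first pass in Python - the feature table here is invented for the example - one might do:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical table of per-object features; in practice this would
# come from a survey catalog.
df = pd.DataFrame({
    "amplitude": [0.10, 0.20, 1.50, 0.15, 1.20],
    "period":    [0.50, 0.70, 30.0, 0.60, 25.0],
    "color":     [0.30, 0.40, 1.10, 0.35, 1.00],
})

# Simple correlations as a first look at the data...
print(df.corr())

# ...and a quick visualization before any modeling.
pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()
```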
Combining diverse data sets which were not taken with the same goals in mind.
There are huge data sets out there
which have not been combined in that fashion,
and doing something like that remains a big challenge.
And I hope to see more progress happening in that area.
And there are many, many new tools that are coming up
that are likely to help there; for instance, in the image domain.
Deep learning is getting popular everywhere
and there are very good tools out there to do that.
But again, the basics are physics and mathematics, and so students
would want to make sure that, while it's easy to use online tools
and simply connect them to each other and do a lot of things,
going back to the basic physics and statistics is something that they should keep in mind.
So one thing that has been good in astronomy is that we have been good at maintaining
meta data for our data sets - so data about data.
When we take images, for instance, we have been using what is called the FITS format.
And the FITS images have a very good header
which has all kinds of information: where were
the data taken, what telescope was it, what was the size of the mirror,
what was the filter, at what time was it taken,
was the shutter open for this long or less than that, and all that.
Now what we find is that, because of that, we have been able to standardize structures
and the names of the columns that we use,
and then be able to transfer information from one data set to another easily.
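For instance, reading such a header is straightforward with a library like astropy (the file name here is hypothetical, and the exact keyword names vary by survey and instrument):

```python
from astropy.io import fits

# Open a FITS image and inspect its header metadata.
with fits.open("survey_image.fits") as hdul:
    hdr = hdul[0].header
    # Common keywords; availability depends on the survey/instrument.
    print(hdr.get("TELESCOP"))  # which telescope took the data
    print(hdr.get("DATE-OBS"))  # when the exposure started
    print(hdr.get("FILTER"))    # which filter was used
    print(hdr.get("EXPTIME"))   # how long the shutter was open (seconds)
```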
And the same is not true in some of the other fields:
geoscience is still reasonably good, but when it comes to health care, for instance,
the metadata keeping there has been at least a few years behind what
we astronomers have been doing.
Astronomy is fantastic because you are trying to solve the origin of the universe,
you are trying to figure out where we came from, why we are here, and all that.
Whereas, in healthcare, especially when you work on something like cancer
- the Early Detection Research Network is one area I'm
working in - you are trying to see how we can continue to be here longer.
And so that is, in fact, rewarding.
And when one sees that the same kinds of techniques can be applied, that's fantastic.
Because once you take a data set and abstract it enough,
the tools that you are using
don't care where the data came from,
so long as you are careful about maintaining the domain knowledge
and not, as I said before, digging into
noise levels that are too high or finding trivial correlations.
So it's highly rewarding to be able to work on these two completely different scales
from the universe level to the cell level.