In this tutorial, we approach the case of a very large event file, larger than the memory of our computer. Will we be able to analyze it?
[1]:
%load_ext autoreload
%autoreload 2
%load_ext memory_profiler
import psutil
import os
import numpy as np
import gc
from stingray import EventList, AveragedPowerspectrum
[2]:
pid = os.getpid()
python_process = psutil.Process(pid)
memory_use = python_process.memory_info()[0]/2.**20
print(f"Current memory use ({pid}): {memory_use:.2f} MB")
Current memory use (29237): 91.16 MB
Data preparation
Now we simulate and load a full dataset. Let’s simulate a large observation, an event file of about 2 GB. We use HENDRICS; you can install it with pip install hendrics
[4]:
fname = "events_large.evt"
[5]:
!HENfake -c 20000 --tstart 0 --tstop 10000 --mjdref 56000 -o events_large.evt
/Users/meo/devel/StingraySoftware/hendrics/hendrics/io.py:38: UserWarning: Warning! NetCDF is not available. Using pickle format.
warnings.warn(msg)
/Users/meo/devel/StingraySoftware/hendrics/hendrics/fold.py:38: UserWarning: PINT is not installed. Some pulsar functionality will not be available
warnings.warn(
Naive procedure: create light curve, then calculate PDS
[6]:
%memit events = EventList.read(fname, fmt="ogip")
peak memory: 5645.70 MiB, increment: 5347.78 MiB
Loading the observation into memory takes about 5 GB. Now we want a power spectrum reaching very high frequencies. Let us do it the traditional way: first create a light curve, then analyze it with AveragedPowerspectrum
[7]:
fine_sample_time = 0.00001
segment_size = 128
[8]:
%memit lc = events.to_lc(dt=fine_sample_time)
peak memory: 15315.70 MiB, increment: 9670.00 MiB
This very finely sampled light curve will take a lot of memory: 10000 s sampled at 10 µs gives ~10^9 bins, or ~8 GB of float64 values for the time array and the same again for the counts array. Here, the measured peak comes out slightly smaller because the operating system is using swap! Some of the swapped data will come back into main memory when we calculate the power spectrum.
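As a quick sanity check, we can estimate this footprint with a few lines of arithmetic (a rough sketch: it assumes two float64 arrays, one for times and one for counts, and ignores any overhead):

obs_length = 10000                           # seconds, from the simulated observation above
n_bins = obs_length / fine_sample_time       # ~1e9 bins
bytes_per_array = n_bins * 8                 # float64: 8 bytes per element
print(f"Expected size: ~{2 * bytes_per_array / 2**30:.1f} GB (time + counts arrays)")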
[9]:
%memit ps = AveragedPowerspectrum.from_lightcurve(lc, segment_size=segment_size)
ps.power
78it [00:28, 2.69it/s]
peak memory: 14324.80 MiB, increment: 2063.22 MiB
[9]:
array([1.02539377e-04, 1.03036920e-04, 9.54479649e-05, ...,
1.10340774e-04, 1.07141777e-04, 9.10072280e-05])
[10]:
del events, lc, ps.power, ps
gc.collect()
[10]:
0
[11]:
memory_use = python_process.memory_info()[0]/2.**20
print(f"Current memory use ({pid}): {memory_use:.2f} MB")
Current memory use (29237): 1876.75 MB
So, if we want to measure the maximum memory usage of the full procedure, we can profile the three steps done so far in a single function:
[12]:
def legacy_pds_procedure(fname, sample_time, segment_size):
    events = EventList.read(fname, fmt="ogip")
    lc = events.to_lc(dt=sample_time)
    return AveragedPowerspectrum.from_lightcurve(lc, segment_size=segment_size)
%memit ps = legacy_pds_procedure(fname, fine_sample_time, segment_size)
ps.power
78it [00:28, 2.70it/s]
peak memory: 16715.56 MiB, increment: 14837.95 MiB
[12]:
array([1.02539377e-04, 1.03036920e-04, 9.54479649e-05, ...,
1.10340774e-04, 1.07141777e-04, 9.10072280e-05])
Let’s clean up the memory a little bit.
[13]:
del ps.power, ps
gc.collect()
memory_use = python_process.memory_info()[0]/2.**20
print(f"Current memory use ({pid}): {memory_use:.2f} MB")
Current memory use (29237): 1876.75 MB
Slightly better: PDS from events
What if we calculate the power spectrum directly from the events, without first binning the full light curve? In this case, the binning happens only segment by segment, freeing memory.
[14]:
def pds_from_events(fname, sample_time, segment_size):
    events = EventList.read(fname, fmt="ogip")
    return AveragedPowerspectrum.from_events(events, dt=sample_time, segment_size=segment_size)
%memit ps = pds_from_events(fname, fine_sample_time, segment_size)
ps.power
78it [00:28, 2.71it/s]
peak memory: 7543.30 MiB, increment: 5701.02 MiB
[14]:
array([1.02539377e-04, 1.03036920e-04, 9.54479649e-05, ...,
1.10349785e-04, 1.07137039e-04, 9.09968704e-05])
Much better! The memory increment is now dominated by the loading of the events, so just about 5.7 GB. Let’s clean up the memory a little bit again.
[15]:
del ps.power, ps
gc.collect()
memory_use = python_process.memory_info()[0]/2.**20
print(f"Current memory use ({pid}): {memory_use:.2f} MB")
Current memory use (29237): 2245.31 MB
Let’s be “lazy”: lazy loading with FITSTimeseriesReader
Now, let’s try not to pre-load the events at all. What will happen? First of all, we use the new class FITSTimeseriesReader to lazy-load the data, meaning that the data remain in the FITS file until we try to access them. This occupies very little memory.
[16]:
from stingray.io import FITSTimeseriesReader
%memit fitsreader = FITSTimeseriesReader(fname, data_kind="times")
peak memory: 2245.34 MiB, increment: 0.00 MiB
[17]:
from stingray.gti import time_intervals_from_gtis
start, stop = time_intervals_from_gtis(fitsreader.gti, segment_size)
%memit interval_times = np.array(list(zip(start, stop)))
peak memory: 2245.36 MiB, increment: 0.00 MiB
Let’s create an iterable that uses the FITSTimeseriesReader to send AveragedPowerspectrum the pre-binned light curves for each segment. Events will be read in chunks from the FITS file, and streamed as light curve segments on the fly.
[18]:
from stingray.utils import histogram

def fits_times_iterable(fname, segment_size, sample_time):
    """Create light curve iterables to be analyzed by AveragedPowerspectrum.from_lc_iterable."""
    fitsreader = FITSTimeseriesReader(fname, data_kind="times")
    start, stop = time_intervals_from_gtis(fitsreader.gti, segment_size)
    intvs = [[s, e] for s, e in zip(start, stop)]
    times = fitsreader.filter_at_time_intervals(intvs, check_gtis=True)
    for ts, (s, e) in zip(times, intvs):
        lc = histogram(ts, bins=np.rint((e - s)/sample_time).astype(int), range=[s, e])
        yield lc
%memit ps_it = AveragedPowerspectrum.from_lc_iterable(fits_times_iterable(fname, segment_size, fine_sample_time), segment_size=segment_size, dt=fine_sample_time)
ps_it.power[:10]
78it [00:32, 2.43it/s]
peak memory: 4531.69 MiB, increment: 2286.30 MiB
[18]:
array([1.02539377e-04, 1.03036920e-04, 9.54479649e-05, 1.13488540e-04,
1.10641888e-04, 1.11452625e-04, 1.15657206e-04, 1.04863608e-04,
9.25844488e-05, 9.50754514e-05], dtype=float64)
Hooray! We managed to keep the memory increment to ~2 GB, comparable to the AveragedPowerspectrum operation alone!
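If you need this lazy procedure more than once, it can be wrapped into a single helper, mirroring the functions defined earlier. The sketch below just reuses the code above; the name pds_lazy is our own, not a Stingray API:

def pds_lazy(fname, sample_time, segment_size):
    """Averaged PDS computed by streaming binned segments from the FITS file."""
    lc_iter = fits_times_iterable(fname, segment_size, sample_time)
    return AveragedPowerspectrum.from_lc_iterable(
        lc_iter, segment_size=segment_size, dt=sample_time
    )

# ps = pds_lazy(fname, fine_sample_time, segment_size)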