Getting started with Mass2

The Microcalorimeter Analysis Software Suite, or Mass, is a library designed for the analysis of data generated by superconducting microcalorimeters, such as Transition-Edge Sensors (TESs) and Thermal Kinetic-Inductance Detectors (TKIDs).

Comparison of Mass versions 2 and 1

Mass2 is different from the previous version of Mass in several important ways, including:

  1. Mass2 takes a different view of data wrangling and bookkeeping than version 1, leveraging the power of the Polars DataFrame for this purpose. Polars is a high-performance, modern dataframe library. It organizes the data structures and provides data I/O. This change increases efficiency and performance for some activities. More importantly, using a DataFrame makes the arrays of per-pulse quantities both flexible (you can add fields of your own design on equal footing with standard quantities) and clearer (you can easily find a complete list of all such quantities with data types and statistics).
  2. Mass2 no longer uses HDF5 datasets to automatically cache analysis results alongside the raw data. This change removes a valuable, but often maddeningly imperfect, feature. We think the cost is worth paying: there are big benefits in clarity and in avoiding both errors and hidden side effects.
  3. Mass2 gives access to the native save/load features of Polars, making it easy to preserve and recover subsets or complete sets of analyzed data. The preferred file format is Apache Parquet, but several other choices such as CSV are supported and can be used for interchange with frameworks other than Mass.
  4. Many data structures in Mass2 use Python's "frozen dataclasses" feature, making them immutable. With immutable objects, it is much easier to avoid inconsistent state and sneaky bugs. Immutability takes some getting used to at first (see the short sketch after this list), but (just as with the lack of HDF5 backing) we believe the benefits are worth the costs.
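
As a reminder of how frozen dataclasses behave (this is plain Python, not specific to Mass2): attributes cannot be reassigned after construction, so the idiom is to build a modified copy instead.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)
# p.x = 3.0             # would raise dataclasses.FrozenInstanceError
p2 = replace(p, x=3.0)   # instead, create a modified copy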

Some things have not changed. Mass still makes heavy use of the libraries:

  • Numpy. For numerical arrays and scientific computation.
  • Scipy. Additional algorithms for scientific computation.
  • Matplotlib. For high-quality plotting.
  • LMFIT. Convenient non-linear curve fitting.

Mass also still:

  • Lets you operate on many microcalorimeter data channels in parallel with a single function call.
  • Works from full pulse records in LJH files, or pre-filtered pulses in OFF files.
  • Generates standard pulse-summary quantities.
  • Creates and applies optimal filters.
  • Creates and applies energy-calibration functions.
  • Fits known fluorescence-line shapes to measured spectra.

This guide is based on the evolving API of Mass2, which is still under development.

How to load pulse data

Raw pulse data are stored in the proprietary LJH format. An Optimally Filtered Format ("OFF") has also been used, in which the Dastard data acquisition program performs optimal filtering in real time. Mass2 can work with data of both types, though of course any process that requires access to the full raw pulse record will fail on an OFF file. (Other formats could be added to Mass2 in the future, if necessary.)

How to load full pulses from LJH files

Mass2 can read LJH files, whether the modern version 2.2 or the earlier 2.1 or 2.0 versions. The older versions have much reduced time resolution and no ability to connect with external triggers; for the most part, they were superseded in late 2015 by the current version. The file version is automatically detected by Mass, and you should not have to think about it.

For this getting started guide, we'll assume you have the pulsedata package installed. (This should happen automatically when you install Mass2.) It's a standard source of a small number of LJH and OFF example files.

Often, you want to load all the data files in a directory, possibly with companion noise files from another. That looks like this:

# mkdocs: render
import numpy as np
import pylab as plt
import polars as pl
import pulsedata
import mass2

pn_pair = pulsedata.pulse_noise_ljh_pairs["bessy_20240727"]
data = mass2.Channels.from_ljh_folder(
    pulse_folder=pn_pair.pulse_folder,
    noise_folder=pn_pair.noise_folder
)

The value returned, data, is a mass2.Channels object. At its heart is data.channels, a dictionary mapping from channel numbers to single-sensor objects of the type mass2.Channel.
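
For a quick check of what was loaded, the channel dictionary behaves like any Python dict:

print(f"Loaded {len(data.channels)} channels: {sorted(data.channels.keys())}")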

When you create the Channels, there are options to exclude certain channels with problems, or limit the result to some maximum number of channels. For these needs, you can use the arguments limit: int and/or exclude_ch_nums: Iterable[int]. For instance, the following will return a Channels that omits channel 4220 and limits the output to no more than 5 channels (in fact, there will be only 1, given the data source has only one channel other than number 4220):

less_data = mass2.Channels.from_ljh_folder(
    pulse_folder=pn_pair.pulse_folder, noise_folder=pn_pair.noise_folder,
    limit=5, exclude_ch_nums=[4220]
)

Advanced cases, for deep exploration

If you want to create a single mass2.Channel object (not created as part of a mass2.Channels object):

pfile = pn_pair.pulse_folder / "20240727_run0002_chan4220.ljh"
nfile = pn_pair.noise_folder / "20240727_run0000_chan4220.ljh"  # the noise file is optional
ch = mass2.Channel.from_ljh(pfile, nfile)

To open a single LJH file and study it as a pure file, you can use the internal class LJHFile, like this:

# mkdocs: render
ljh = mass2.LJHFile.open(pn_pair.pulse_folder/"20240727_run0002_chan4220.ljh")
print(ljh.npulses, ljh.npresamples, ljh.nsamples)
print(ljh.is_continuous)
print(ljh.dtype)

plt.clf()
y = ljh.read_trace(30)
plt.plot(y, ".r")

How to load pre-filtered pulses from OFF files

OFF files do not record the entire microcalorimeter pulse, but only a summary of it. This means that no analysis steps that require the raw pulse can be performed. Also, many of the per-pulse summary quantities that exist in an OFF file have different names (and perhaps slightly different definitions) than the corresponding fields computed in a standard LJH-based analysis. Use of OFF files is not a primary goal of Mass2 in the initial release, but they are supported for legacy analysis.

# Load one OFF file into an `OffFile` object
p = pn_pair.pulse_folder
off = mass2.core.OffFile(p / "20240727_run0002_chan4220.off")

# Load multiple OFF files into a `Channels` object
import glob
files = glob.glob(str(p / "*_chan*.off"))
offdata = mass2.Channels.from_off_paths(files, description="OFF file demo")

# Here's the dataframe for a single channel's per-pulse data given an OFF input
ch = offdata.ch0
print(ch.df)

How to load additional information

Most data sets come with experiment-state labels, and some also have external trigger timing. Both sets of information are universal across channels, in the sense that a single table of state labels or external trigger times is relevant to all channels in that data acquisition. Therefore, they are best loaded for all channels at once.

Here is the first time we see a consequence of the immutability of the Channel and Channels classes: instead of adding information to an existing object, we have to create a new object, a copy of what came before but with added information. In this case, the added information will be new data columns in the dataframe of each channel. But to use the new, enhanced object, we have to re-assign the name data to the new object. Thus:

# mkdocs: render

# Load the *_experiment_state.txt file, assuming it has the usual name and path
data = data.with_experiment_state_by_path()
ch = data.ch0
print(ch.df.columns)
states = ch.df["state_label"]
print(states.dtype)
print(states.unique(), states.unique_counts())
assert states.unique()[-2] == "SCAN3"
assert states.unique_counts()[-2] == 42433

The output reads:

['timestamp', 'pulse', 'subframecount', 'state_label']
Categorical
shape: (6,)
Series: 'state_label' [cat]
[
    null
    "START"
    "CAL2"
    "PAUSE"
    "SCAN3"
    "SCAN4"
] shape: (6,)
Series: 'state_label' [u32]
[
    4
    4
    11468
    40854
    42433
    5237
]

Notice that Polars saves the state labels as a "Categorical" type. This type improves performance when the values are strings known to take on one of a limited number of possibilities (far fewer than the number of rows in a table).
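
Because the state labels live in an ordinary dataframe column, you can use them directly in Polars expressions. For example, to select only the pulses recorded during one state:

scan3_df = ch.df.filter(pl.col("state_label") == "SCAN3")
print(scan3_df.shape)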

Here's how you would load external trigger data, if this example data set had external triggers:

# exttrig_path = p / "20240727_run0002_external_trigger.bin"
# data = data.with_external_trigger_by_path(exttrig_path)

What is in the Channel and Channels objects

The most important part of Channels is a dictionary mapping each channel number to a Channel object. You can access a specific channel, or the "first", like this:

ch = data.channels[4220]
# or if you just want any rando channel, this returns the first channel in order of creation (normally, the one with the lowest channel #)
arb_channel = data.ch0

Other features include a dictionary of channels declared "bad":

print(f"There are {len(data.bad_channels)} bad channels in the data at first.")
data_withbad = data.set_bad(4219, "We think chan 4219 is no good at all.")
print(f"There are {len(data_withbad.bad_channels)} bad channels if we capriciously declare one bad.")

It is easy to join the data from successive experiments into a single concatenated object. This example just joins a second copy of the same data to itself, because the pulsedata repository doesn't happen to have more data from the same experiment.

later_data = data
data_merged = data.concat_data(later_data)

The most important part of each Channel object is the underlying per-pulse dataframe. Print it, then print the facts specific to the sensor channel as a whole, first as an object, then as a dataframe:

print(ch.df)
print(ch.header)
print(ch.header.df)

# Here are the pulse dataframe's column names, the contents of the subframecount column, and the min/mean/max of the timestamp columns
print(ch.df.columns)
print(ch.df["subframecount"])
ts = ch.df["timestamp"]
print(ts.min(), ts.mean(), ts.max())
assert len(ts) == 100_000

The above commands produce the output:

shape: (100_000, 4)
┌────────────────────────────┬──────────────────────┬───────────────┬─────────────┐
│ timestamp                  ┆ pulse                ┆ subframecount ┆ state_label │
│ ---                        ┆ ---                  ┆ ---           ┆ ---         │
│ datetime[μs]               ┆ array[u16, 500]      ┆ u64           ┆ cat         │
╞════════════════════════════╪══════════════════════╪═══════════════╪═════════════╡
│ 2024-07-27 13:30:44.451920 ┆ [6063, 6076, … 6655] ┆ 1519640266112 ┆ null        │
│ 2024-07-27 13:30:44.472140 ┆ [6071, 6065, … 6670] ┆ 1519640589632 ┆ null        │
│ 2024-07-27 13:30:44.475788 ┆ [6120, 6123, … 6215] ┆ 1519640648000 ┆ null        │
│ 2024-07-27 13:30:44.497137 ┆ [6074, 6065, … 6660] ┆ 1519640944448 ┆ null        │
│ 2024-07-27 13:30:44.573915 ┆ [6074, 6072, … 6672] ┆ 1519642206080 ┆ START       │
│ …                          ┆ …                    ┆ …             ┆ …           │
│ 2024-07-27 16:11:55.982246 ┆ [6065, 6063, … 6278] ┆ 1674384771712 ┆ PAUSE       │
│ 2024-07-27 16:11:55.986849 ┆ [6077, 6089, … 6662] ┆ 1674384833216 ┆ PAUSE       │
│ 2024-07-27 16:11:56.054243 ┆ [6058, 6048, … 6637] ┆ 1674385923008 ┆ PAUSE       │
│ 2024-07-27 16:11:56.080575 ┆ [6067, 6064, … 6639] ┆ 1674386344320 ┆ PAUSE       │
│ 2024-07-27 16:11:56.220080 ┆ [6062, 6055, … 6483] ┆ 1674388576448 ┆ PAUSE       │
└────────────────────────────┴──────────────────────┴───────────────┴─────────────┘
ChannelHeader(description='20240727_run0002_chan4219.ljh', ch_num=4219, frametime_s=4e-06, n_presamples=250, n_samples=500)
shape: (1, 27)
┌──────────────────┬─────────────────────────┬───────────────────┬─────────────┬───┬────────────┬──────────┬─────────────────────────┬────────────┐
│ Save File Format ┆ Software Version        ┆ Software Git Hash ┆ Data source ┆ … ┆ Pixel Name ┆ Timebase ┆ Filename                ┆ continuous │
│ Version          ┆ ---                     ┆ ---               ┆ ---         ┆   ┆ ---        ┆ ---      ┆ ---                     ┆ ---        │
│ ---              ┆ str                     ┆ str               ┆ str         ┆   ┆ str        ┆ f64      ┆ str                     ┆ bool       │
│ str              ┆                         ┆                   ┆             ┆   ┆            ┆          ┆                         ┆            │
╞══════════════════╪═════════════════════════╪═══════════════════╪═════════════╪═══╪════════════╪══════════╪═════════════════════════╪════════════╡
│ 2.2.1            ┆ DASTARD version 0.3.2   ┆ 508fd92           ┆ Abaco       ┆ … ┆            ┆ 0.000004 ┆ /Users/fowlerj/qsp/lib/ ┆ false      │
│                  ┆                         ┆                   ┆             ┆   ┆            ┆          ┆ python3…                ┆            │
└──────────────────┴─────────────────────────┴───────────────────┴─────────────┴───┴────────────┴──────────┴─────────────────────────┴────────────┘
['timestamp', 'pulse', 'subframecount', 'state_label']
shape: (100_000,)
Series: 'subframecount' [u64]
[
    1519640266112
    1519640589632
    1519640648000
    1519640944448
    1519642206080
    …
    1674384771712
    1674384833216
    1674385923008
    1674386344320
    1674388576448
]
2024-07-27 13:30:44.451920 2024-07-27 14:43:53.618149 2024-07-27 16:11:56.220080

Each column (e.g., ch.df["timestamp"]) is a Polars "series" (type: pl.Series). See the Polars documentation to learn about the enormous range of operations you can perform on dataframes and data series. The library is optimized for parallel operation and offers database-like queries to select and join frames. It offers the option of lazy queries, where a query is built up step-by-step and then optimized before being executed. We cannot stress enough how worthwhile it is for a Mass2 user to learn and take advantage of the Polars library for any query, no matter how simple or complex. Polars is excellent.
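
As a small taste, here is one way to summarize the per-pulse data by experiment state with a lazy query (a minimal sketch, assuming a reasonably recent Polars version):

# Count pulses and find the time span covered by each experiment state
per_state = (
    ch.df.lazy()
    .group_by("state_label")
    .agg(
        pl.col("timestamp").count().alias("n_pulses"),
        pl.col("timestamp").min().alias("first"),
        pl.col("timestamp").max().alias("last"),
    )
    .collect()
)
print(per_state)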

Analyzing pulse data

One of the design goals for Mass2 was to enable pulse analysis in each of two distinct situations:

  1. Analysis of a data set from a new instrument, or a new configuration (e.g., a changed bath temperature or bias voltage), or an unfamiliar sample (especially a new calibration target). In this case, we don't yet know the appropriate optimal filter to use for each sensor, or how to convert measured pulse sizes into photon energies. We might also not even know the optimal set of analysis choices to make, such as whether arrival-time ("phase") correction is required.
  2. Analysis of new data from an established instrument in a known configuration with a known energy-calibration curve. In this case, we can apply an analysis "recipe" learned from the other case. It is assumed that this earlier recipe was learned from similar data, and that it will therefore produce good, if imperfect, results. The results should be good enough to use during real-time, or "online" data analysis to assess data quality, even if they have to be improved offline for a final result.

This section concentrates on the first case, analysis of new data for which no established recipe exists. For the second case, see Re-using a recipe and Storing analysis recipes.

Summarizing pulses

The Channel.summarize_pulses() method returns a new Channel with a much-enhanced dataframe that has nearly a dozen new columns, such as ["pretrig_mean", "pulse_rms", ...]. To use this method on all the Channels in a Channels object, we make use of the handy Channels.map() method. It takes a single callable argument f, which must accept a Channel object and return a modified version of it. We also often follow this up with a function that attaches a "good expression" to each channel. Here we'll consider pulses "good" when their pretrigger rms and their postpeak derivatives are no more than 8 sigma off the median. These two steps can be performed with map() like this:

# mkdocs: render
def summarize_and_cut(ch: mass2.Channel) -> mass2.Channel:
    return (
        ch.summarize_pulses()
        .with_good_expr_pretrig_rms_and_postpeak_deriv(8, 8)
    )
data = data.map(summarize_and_cut)

# Plot a distribution
ch = data.ch0
prms = ch.df["pulse_rms"]
hist_range = np.percentile(prms, [0.5, 99.5])
bin_edges = np.linspace(hist_range[0], hist_range[1], 1000)
ch.plot_hist("pulse_rms", bin_edges)
plt.xlabel("Pulse rms (arbs)")
plt.title(f"The middle 99% of pulse rms values for channel {ch.header.ch_num}")

It is usual to replace the previous object with the new, enhanced one, but it isn't necessary. The above result could be named data2 instead of data if you need access to data both before and after for some reason. Most operations only add to the fields in a data frame, without modifying or removing any existing fields, so preserving the old object is not usually useful.

Computing and applying optimal filters

To compute an optimal filter for each channel, one must analyze the noise (to learn its autocorrelation function) and create a model of pulses (typically, the model is an average pulse, possibly computed over a selected subset of the records). The simplest approach is to use the Channel.filter5lag method. It analyzes the noise in the standard way, and it computes an average of all pulses designated "good" in the previous step. If you wanted to restrict those pulses further, an optional use_expr argument lets you pass in a Polars expression to reject pulses outside a certain range of times or experiment states, or to select based on a range of pulse_rms values. We'll skip the "use" here and instead take the default choice of using all good pulses.

# mkdocs: render
def do_filter(ch: mass2.Channel) -> mass2.Channel:
    return ch.filter5lag(f_3db=10000)
data = data.map(do_filter)
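
For reference, if you did want to restrict which pulses enter the pulse model, the call might look like the following (a sketch only, not executed here; it assumes the optional use_expr keyword described above):

# Hypothetical: build the pulse model only from good pulses recorded in the CAL2 state
def do_filter_cal_only(ch: mass2.Channel) -> mass2.Channel:
    return ch.filter5lag(f_3db=10000, use_expr=(pl.col("state_label") == "CAL2"))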

Applying filter5lag (or any 5-lag filter) will add new fields ("5lagx", "5lagy") to the channel dataframe. It will also add a new, final "step" to each channel's analysis "recipe". See more in Saving results and the Recipe system below. Here's how you can look at the last step and make its built-in debug plot. For optimal filter construction, the debug plot is simply a plot of the filter created in the step.

# mkdocs: render
ch = data.ch0
step = ch.steps[-1]
step.dbg_plot(ch.df)

The step also contains both the new optimal filter and the FilterMaker used to create it. You can look at the latter to find the noise correlation and spectrum, and the signal model--all the ingredients needed to make optimal filters:

# mkdocs: render
maker = step.filter_maker
plt.clf()
plt.subplot(221)
plt.plot(maker.noise_autocorr[:100], ".-b")
plt.plot(0, maker.noise_autocorr[0], "ok")
plt.title("Noise autocorrelation")
plt.xlabel(f"Lags (each lag = {maker.sample_time_sec*1e6:.2f} µs)")

plt.subplot(222)
freq_khz = np.linspace(0, 0.5*1e-3/maker.sample_time_sec, len(maker.noise_psd))
plt.loglog(freq_khz[1:], maker.noise_psd[1:], "-g")
plt.xlabel("Frequency (kHz)")
plt.title("Noise power spectral density")
plt.ylabel("Noise PSD (arbs$^2$ / Hz)")

plt.subplot(212)
t_ms = (np.arange(ch.header.n_samples)-ch.header.n_presamples)*maker.sample_time_sec*1000
plt.plot(t_ms, maker.signal_model, "r")
plt.title("Model pulse")
plt.xlabel("Time after trigger (ms)")
plt.ylabel("Signal (arbs)")
plt.tight_layout()

Corrections and energy calibration

We can apply the drift correction based on pretrigger mean to the filtered values, and also perform a rough calibration. That might look like:

# mkdocs: render

def dc_and_rough_cal2(ch: mass2.Channel) -> mass2.Channel:
    import polars as pl
    use_cal = (pl.col("state_label") == "CAL2")
    use_dc = pl.lit(True)
    line_names = ["CKAlpha", "NKAlpha", "OKAlpha", "FeLl", "FeLAlpha", "FeLBeta",
                "NiLAlpha", "NiLBeta", "CuLAlpha", "CuLBeta", 980]
    return (
        ch.rough_cal_combinatoric(
            line_names=line_names,
            uncalibrated_col="pulse_rms",
            calibrated_col="energy_pulse_rms",
            ph_smoothing_fwhm=6,
            use_expr=use_cal,
        )
        .driftcorrect(
            indicator_col="pretrig_mean", uncorrected_col="5lagy",
            use_expr=use_dc)
        .rough_cal_combinatoric(
            line_names,
            uncalibrated_col="5lagy_dc",
            calibrated_col="energy_5lagy_dc",
            ph_smoothing_fwhm=6,
            use_expr=use_cal,
        )
    )

data = data.map(dc_and_rough_cal2)
ch = data.ch0
plt.clf()
use = (pl.col("state_label") == "CAL2")
ax = plt.subplot(211)
ch.plot_hist("energy_pulse_rms", np.linspace(0, 1000, 1001), axis=ax, use_expr=use)
plt.title("Rough-calibrated energy (not optimally filtered)")
plt.xlabel("Rough energy (eV)")

ax = plt.subplot(212)
ch.plot_hist("5lagy_dc", np.linspace(0, 2600, 1001), axis=ax, use_expr=use)
plt.title("Uncalibrated, optimally filtered pulse heights")
plt.xlabel("Pulse heights (arbs)")
plt.tight_layout()

Now, an improved calibration can be achieved with actual fits to the known calibration lines. (One subtle aspect of this improved calibration is that there must already be a calibration from the same field into energy, the one you want to improve upon for the final result. In this case, we must have that second rough_cal_combinatoric above, because only that second one uses the field 5lagy_dc as its input. Furthermore, we have to track which step produced the calibration we are improving upon. Here, that is the previous step, index -1 in Python. The first rough calibration wouldn't meet our needs, because it is a calibration from pulse_rms into energy, a totally different mapping.)

# mkdocs: render
def do_multifit(ch: mass2.Channel) -> mass2.Channel:
    import mass2
    import polars as pl
    STEP_WITH_ROUGH_CAL = -1
    multifit = mass2.MultiFit(
        default_fit_width=80,
        default_use_expr=(pl.col("state_label") == "CAL2"),
        default_bin_size=0.3,
    )
    line_names = ["CKAlpha", "NKAlpha", "OKAlpha", "FeLl", "FeLAlpha",
                "NiLAlpha", "CuLAlpha", 980.0]
    dhigh = {"CuLAlpha": 12, "NiLAlpha": 12}
    dlow = {980.0: 20}
    for line in line_names:
        multifit = multifit.with_line(line, dlo=dlow.get(line, None), dhi=dhigh.get(line, None))
    return ch.multifit_mass_cal(multifit, STEP_WITH_ROUGH_CAL, "energy_5lagy_best")

data2 = data.map(do_multifit)
ch = data2.ch0
plt.clf()
ax1 = plt.subplot(211)
use_cal = (pl.col("state_label") == "CAL2")
edges = np.linspace(0, 1000, 1001)
ch.plot_hist("energy_5lagy_best", edges, axis=ax1, use_expr=use_cal)
plt.title("Calibrated energy, state='CAL2'")
plt.xlabel("Pulse energy (eV)")

ax2 = plt.subplot(212, sharex=ax1)
use_noncal = (pl.col("state_label") == "SCAN3")
ch.plot_hist("energy_5lagy_best", edges, axis=ax2, use_expr=use_noncal)
plt.title("Calibrated energy, state='SCAN3'")
plt.xlabel("Pulse energy (eV)")
plt.xlim([0, 1000])
plt.tight_layout()

Saving results and the Recipe system

Mass2 is designed to make storing results easy, both the quantities produced in analysis and the steps taken to compute them. This section explains how, covering the saving of analysis steps and of numerical results.

Analysis "steps" have been mentioned before. The big idea is that each channel is associated with a "recipe" containing all the operations and parameters used to create it, starting from the raw pulse records in an LJH file. (LJH files are never modified by Mass.) This recipe consists of one or more steps to be applied in a fixed order. Each step takes for input fields that existed in the channel's dataframe before the step, and it creates one or more new fields that exist in the dataframe afterwards. Importantly, a recipe can be stored to disk, and applied in the future, either to the same data set, or to another.

Warning: One catch is that each recipe is specific to one sensor, and they are tracked only by channel number. If the set of active sensors changes after the learning phase, any new sensors will lack a recipe. Worse, if a probe of the µMUX resonator frequencies leads to renumbering of the sensors, then we won't be able to find the right recipe for most (or all) sensors.

Caching computed data (work in progress)

Sometimes you run an analysis that you consider "final"; you want to keep the results only, and you expect (hope?) never to look at the raw LJH files again. We are still working out a standard system for caching computed data: how to name the files, how to combine them, etc. Here are two approaches that employ Parquet files. One stores a file per channel, and the other stores all data in a single file.

Approach A: one file per channel

Here we store each channel's dataframe in a separate file. Notice that we want to drop the column representing the raw pulse data, because the output would otherwise be far too large (and redundant). Probably it's fine to drop the subframe count, too, so we do that here.

# For automated tests, we want the output in a temporary directory
import tempfile, os
output_dir = tempfile.TemporaryDirectory(prefix="mass2_getting_started")
print(f"Test output lives in '{output_dir.name}'")

columns_to_drop = ("pulse", "subframecount")
for cnum, ch in data.channels.items():
    filename = os.path.join(output_dir.name, f"output_test_chan{cnum}.parquet")
    df = ch.df.drop(columns_to_drop)
    df.write_parquet(filename)

You can pass options to write_parquet to control compression and other aspects of the file. If you prefer to write out only a specified subset of the fields in the dataframes, replace the .drop(columns_to_drop) in the code above with a .select(columns_to_keep) operation. Any of the methods that act on dataframes can be used here (see the sketch after this list):

  • drop to remove specific columns
  • select to keep specific columns
  • limit to write only a maximum number of rows
  • filter to select rows according to a Polars expression
  • rename to rename columns
  • and many others.
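
For instance, here is a sketch that keeps only a few columns and a single experiment state, and asks write_parquet for zstd compression (the particular column list is only illustrative):

columns_to_keep = ["timestamp", "state_label", "energy_5lagy_dc"]
for cnum, ch in data.channels.items():
    filename = os.path.join(output_dir.name, f"output_scan3_chan{cnum}.parquet")
    df = ch.df.select(columns_to_keep).filter(pl.col("state_label") == "SCAN3")
    df.write_parquet(filename, compression="zstd")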

TODO: we should work on a way to make it easier to load up these files and start analyzing where you left off. One might match up these files with an LJH file, to permit creation of a new Channel object. For now, this sort of data caching is most useful when you just need the data, not the full Mass2 framework. (But keep in mind, the Polars framework is really powerful, and there's a lot you could imagine doing with just the data.)
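
In the meantime, reading such a cache back for a plain-Polars analysis is straightforward. Here is a sketch, assuming the per-channel files written above:

import glob, re
cached = {}
for fname in glob.glob(os.path.join(output_dir.name, "output_test_chan*.parquet")):
    cnum = int(re.search(r"chan(\d+)", fname).group(1))
    cached[cnum] = pl.read_parquet(fname)
print(sorted(cached.keys()))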

Approach B: one file for all channels

This is similar, except that we use some new Polars tricks:

  1. polars.DataFrame.with_columns() to append a new column containing this channel's number (we will need it later, because we're about to mix all channels into a new dataframe)
  2. polars.concat() to join rows from multiple dataframes

columns_to_drop = ("pulse", "subframecount")
all_df = []
for cnum, ch in data.channels.items():
    df = ch.df.drop(columns_to_drop).with_columns(chan_number=pl.lit(cnum))
    all_df.append(df)
filename = os.path.join(output_dir.name, "output_test_allchan.parquet")
pl.concat(all_df).write_parquet(filename)

Storing analysis recipes

The system for storing analysis recipes is extremely simple. The following will save the full analysis recipe for each channel in the data object.

recipe_file = os.path.join(output_dir.name, "full_recipes.pkl")
data.save_recipes(recipe_file)

Beware that this operation will drop all "debug information" in certain steps in the recipe (generally, any steps that store a lot of data that's used not for performing the step but only for debugging). The benefit is that the recipes file is smaller, and faster to save and load. The cost is that debugging plots cannot be made after the recipes are reloaded. You probably save a recipe because you've already debugged its steps, so this is usually an acceptable tradeoff. If you don't like it, use the drop_debug=False argument.
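
That call might look like this (a sketch, assuming the drop_debug keyword described above):

# Keep the debug information: a larger file, but dbg_plot still works after reloading
debug_recipe_file = os.path.join(output_dir.name, "full_recipes_with_debug.pkl")
data.save_recipes(debug_recipe_file, drop_debug=False)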

By default all steps in a recipe are saved, but you might not want or need this. For instance, in the examples given on this page, we computed two "rough" calibrations and a final one. The two rough ones are "dead ends" if you only care about the final energy. When you are saving a recipe for later use on future data sets in a real-time pipeline, you might want to store a minimal recipe--one that omits any steps not needed to reach a few critical fields. Perhaps you have no need to compute anything that doesn't directly lead to the final energy value. If you omit dead ends from the stored recipes, then not only is the file smaller, but the computation required to run the recipes is reduced. You might prefer instead:

required_fields = {"energy_5lagy_best", "pretrig_mean"}
trimmed_recipe_file = os.path.join(output_dir.name, "trimmed_recipes.pkl")
data.save_recipes(trimmed_recipe_file, required_fields=required_fields)

This version will save all the steps that produce the two specified required_fields, and all steps that they depend on. The two rough calibration steps, however, will be trimmed away as dead ends. (In this instance, having the pretrigger mean as a required field is superfluous: it is already required by the drift-correction step, which must run before the step that converts the drift-corrected, optimally filtered pulse height into energy.)

Re-using a recipe (on the same or new data)

Once a mass2.Channels object is created from raw LJH files, an existing recipe can be loaded from a file and executed on the raw data. The result will be that the dataframe for each Channel will now contain the derived fields, including {"energy_5lagy_best", "pretrig_mean"}, as well as any fields computed along with them, or on the direct path to them.

pn_pair = pulsedata.pulse_noise_ljh_pairs["bessy_20240727"]
data_replay = mass2.Channels.from_ljh_folder(
    pulse_folder=pn_pair.pulse_folder,
    noise_folder=pn_pair.noise_folder
).with_experiment_state_by_path()
data_replay = data_replay.load_recipes(trimmed_recipe_file)

Further examples

There are more examples of how one can use the Mass2 approach to data analysis in the examples/ directory, in the form of Marimo notebooks. Many examples in the top level of that directory have been designed for use on data found in the https://github.com/ggggggggg/pulsedata repository, so they should work for any user of Mass. Other examples rely on data that isn't public, but they can still be useful training materials for ways to use Mass2.

Marimo is similar to Jupyter, but it has some nice advantages. It stores notebooks as pure Python (vs JSON for Jupyter). It uses a reactive execution model with an understanding of the data-flow dependency graph, which fixes a lot of the hidden-state problems and unexpected behavior that often come with Jupyter's linear-flow (top-to-bottom) execution model. Another really cool Marimo feature (though it's still a bit buggy) is that Matplotlib figures can be rendered and made interactive in the web browser (rather than having to explicitly set x/y limits and re-run a cell)--the result is closer to what you expect from an IPython session.

Marimo is easy to start from the command line. The marimo shell command will start a marimo server running, which both serves web pages and runs the underlying Python code. Here are a few examples of how to start marimo:

cd ~/where/I/stored/mass2

# Each of the following commands starts a local Marimo webserver,
# then opens a browser tab pointing to it. They also...

# ...display a chooser to select whatever example you like:
marimo edit examples

# ...display one specific example notebook, with (editable) code:
marimo edit examples/bessy_20240727.py

# ...display a notebook's results, but not the code that created it:
marimo run examples/ebit_juy2024_from_off.py

We strongly suggest exploring a couple of these notebooks as you launch your adventure with Mass2.