Parameter Extraction - Temporal and Statistical Parameters
Difficulty Level:
Tags: extract, statistics, temporal, signals

When working with a signal recording, it is important to perform a preliminary analysis to understand how it behaves and to obtain its main characteristics. One of the most common ways to do this is by extracting descriptive parameters. There are different ways to approach this task, and this notebook will cover some of the most common statistical and temporal parameters that can be used to analyse time series.

These descriptive characteristics will help us decide the most effective strategy to follow during data preparation, model selection and model tuning, depending on the intended use of the data.

List of statistical parameters:

  • Maximum, minimum and range
  • Mean
  • Mode
  • Median and quantiles
  • Variance and standard deviation
  • Skewness and kurtosis

List of temporal parameters:

  • Autocorrelation
  • Stationarity
  • Seasonal decomposition

1 - Importation of the needed packages

In [1]:
# biosignalsnotebooks own package for loading and plotting the acquired data
import biosignalsnotebooks as bsnb

# Scientific packages
from numpy import linspace, array, mean, median, quantile, std, var
from scipy.stats import mode, skew, kurtosis
from statsmodels.tsa.stattools import acf, adfuller, kpss
from statsmodels.tsa.seasonal import seasonal_decompose
In [2]:
# Base packages used in OpenSignals Tools Notebooks for plotting data
from bokeh.plotting import figure, output_file, show, curdoc
from bokeh.io import output_notebook
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Plot, LinearAxis, BoxAnnotation, Arrow, VeeHead, Range1d
output_notebook(hide_banner=True)

# Hidden methods
from numpy import where, asarray, sum, histogram, arange, sqrt
from pandas import DataFrame, Series

import warnings
warnings.filterwarnings('ignore')

2 - Load of sample signal data

In [3]:
# Load of data (from a relative file path inside the project folder)
data, header = bsnb.load("/signal_samples/ecg_20_sec_100_Hz.h5", get_header=True)
channel = list(data.keys())[0]
In [4]:
print(header)
{'channels': array([1]), 'comments': '', 'date': b'2018-9-28', 'device': 'biosignalsplux', 'device connection': b'BTH00:07:80:D8:A7:F9', 'device name': b'00:07:80:D8:A7:F9', 'digital IO': array([0, 1]), 'firmware version': 773, 'resolution': array([16]), 'sampling rate': 100, 'sync interval': 2, 'time': b'14:36:39.886', 'sensor': [b'ECG'], 'column labels': {1: 'channel_1'}}

3 - Storage of sampling frequency and acquired data in variables

Since in this case we know that the original signal is an ECG recording, we can also convert its units to mV following the method explained in the ECG Sensor - Unit Conversion notebook.

In [5]:
# Sampling frequency and acquired data
fs = header["sampling rate"]

# Signal Samples
signal_raw = data[channel]
time = bsnb.generate_time(signal_raw, fs)

# Let's convert the signal's units, since we know it is an ECG signal
vcc = 3000 # mV
gain = 1000
ch = "CH1" # Channel
sr = header['sampling rate'] # Sampling rate
resolution = header['resolution'] # Resolution (number of available bits)
signal = (((array(signal_raw) / 2**resolution) - 0.5) * vcc) / gain
In [6]:
# Let's plot our raw signal
p = figure(plot_width=1000, plot_height=200, y_axis_label="Electric Voltage (mV)", x_axis_label="Time (s)", **bsnb.opensignals_kwargs("figure"))
p.ygrid.grid_line_alpha=0.5

# add a circle renderer with x and y coordinates, size, color, and alpha
p.line(time, signal, **bsnb.opensignals_kwargs("line"))

bsnb.opensignals_style([p])

show(p) # show the results

4 - Extraction of statistical parameters in a time signal

A fundamental task in many statistical analyses is to characterise the location (a central value that best describes the data) and the variability (or spread) of a dataset, in order to describe and understand it better. These descriptors are independent of the domain of the signal and can be used to describe any type of information or dataset. Due to their simplicity and low computational cost, they are widely used in statistical analyses.

4.1 - Minimum, maximum and range values

The minimum and maximum values are, respectively, the lowest and highest amplitude values across the signal duration. There may be more than one occurrence of the minimum or maximum value across the time series.

The range is the difference between the maximum and the minimum value.

In [7]:
# Let's calculate the minimum, maximum and range of our signal
max_value = max(signal)
min_value = min(signal)
signal_range = max_value - min_value
In [8]:
print("Maximum (mV): ", max_value, "\nMinimum (mV): ", min_value, "\nRange (maximum - minimum): ", signal_range)
Maximum (mV):  1.10101318359375 
Minimum (mV):  -0.170013427734375 
Range (maximum - minimum):  1.271026611328125
In [9]:
# Let's draw these values on our signal
index_max = where(signal == max_value)[0]
index_min = where(signal == min_value)[0]

# Let's plot our raw signal
p_maxmin = figure(plot_width=1000, plot_height=200, y_axis_label="Electric Voltage (mV)", x_axis_label="Time (s)", **bsnb.opensignals_kwargs("figure"))
p_maxmin.ygrid.grid_line_alpha=0.5

p_maxmin.line(time, signal, **bsnb.opensignals_kwargs("line"))
color_1 = bsnb.opensignals_color_pallet()
color_2 = bsnb.opensignals_color_pallet()
p_maxmin.circle([time[i] for i in index_max],[max_value for i in range(len(index_max))], size=15, line_color=color_1, fill_color=color_1, fill_alpha=0.5, legend_label='Maximum')
p_maxmin.circle([time[i] for i in index_min],[min_value for i in range(len(index_min))], size=15, line_color=color_2, fill_color=color_2, fill_alpha=0.5, legend_label='Minimum')

bsnb.opensignals_style([p_maxmin])

show(p_maxmin)

4.2 - Mean

The arithmetic mean ($\bar{x}$) of a temporal dataset represents the average amplitude value across the signal duration.

\begin{align} \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i=\frac{x_1+x_2+\cdots+x_n}{n} \\ \end{align}

With $x_1, x_2, \ldots, x_n$ being the values in the dataset and $n$ the number of elements in the dataset.

In [10]:
# Let's calculate the mean value of our signal
mean_val = mean(signal)
In [11]:
print('Mean (mV):', mean_val)
Mean (mV): 0.00021564978679627862
In [12]:
# Let's draw the mean line in our plot
p.line(time, mean_val, line_alpha=0.5, color=bsnb.opensignals_color_pallet(), line_width=3, legend_label='Mean')

show(p)

4.3 - Mode

The mode represents the amplitude value with the highest number of occurrences across the signal duration.

There may be more than one mode, if two or more different values have the same number of occurrences.

In [13]:
# Let's find the mode (or modes) in our signal
# (stored as mode_val so we do not shadow scipy's mode function)
mode_val, occurrences = mode(signal)
In [14]:
print('Mode: ', mode_val, '. Occurrences: ', occurrences)
Mode:  [-0.04655457] . Occurrences:  [9]
In [15]:
# Let's plot the mode values in a separate new graph
index_mode = where(signal == mode_val)

p_modes = figure(plot_width=1000, plot_height=200, y_axis_label="Electric Voltage (mV)", x_axis_label="Time (s)", **bsnb.opensignals_kwargs("figure"))
p_modes.line(time, signal, **bsnb.opensignals_kwargs("line"))
p_modes.ygrid.grid_line_alpha=0.5

index_aux = 0
for i in index_mode:
    p_modes.line(time, mode_val[index_aux], color="green", line_alpha=0.5, **bsnb.opensignals_kwargs("line")) # draw a horizontal line marking the mode value on the y-axis
    p_modes.circle([time[idx] for idx in i], mode_val[index_aux], 
             size=10, line_color="green", fill_color="green", fill_alpha=0.5) # mark the points where the signal takes the mode value
    index_aux += 1

bsnb.opensignals_style([p_modes])
show(p_modes)

4.4 - Median value and quantiles

The median is the value that separates the top half from the lower half in a dataset. It is a popular summary statistic because it is simple to calculate and gives a measure that is more robust than the mean in the presence of outlier values.

The median is also known as the 2-quantile because it marks the middle point of the dataset. However, other quantiles can also be explored.

For example, 4-quantiles (or quartiles) divide the dataset in four sections:

  • the first quartile (Q1), also known as the 25th percentile, splits off the lowest 25% of data from the highest 75%.
  • the second quartile (Q2), also known as the 50th percentile, is the same as the median.
  • the third quartile (Q3), also known as the 75th percentile, splits off the lowest 75% of data from the highest 25%.
In [16]:
# Let's calculate the median and the first and third quartiles
median_val = median(signal) # 50% of the values of the signal are lower than the median (50th percentile)
Q1 = quantile(signal, 0.25) # 25% of the values of the signal are lower than the 25th percentile
Q3 = quantile(signal, 0.75) # 75% of the values of the signal are lower than the 75th percentile
In [17]:
print('Mean: ', mean_val)
print('| Q1 |\t\t\t| Q2 (Median) |\t\t\t| Q3 |\n', Q1, '\t', median_val, '\t\t', Q3)
Mean:  0.00021564978679627862
| Q1 |			| Q2 (Median) |			| Q3 |
 -0.0703582763671875 	 -0.0411529541015625 		 0.026275634765625
In [18]:
# Let's see them in a separated new graph to avoid cluttering the main one
p_quartiles = figure(plot_width=1000, plot_height=200, y_axis_label="Electric Voltage (mV)", x_axis_label="Time (s)", **bsnb.opensignals_kwargs("figure"))
p_quartiles.line(time, signal, **bsnb.opensignals_kwargs("line"))
p_quartiles.ygrid.grid_line_alpha=0.5

# Redrawing the mean line...
p_quartiles.line(time, mean_val, color=bsnb.opensignals_color_pallet(), legend_label='Mean', **bsnb.opensignals_kwargs("line"))

p_quartiles.line(time, Q3, line_dash='dashed', legend_label='Q3', **bsnb.opensignals_kwargs("line"))
p_quartiles.line(time, median_val, line_dash='dashed', legend_label='Q2 (Median)', **bsnb.opensignals_kwargs("line"))
p_quartiles.line(time, Q1, line_dash='dashed', legend_label='Q1', **bsnb.opensignals_kwargs("line"))

p_quartiles.legend.location = "top_left"
p_quartiles.y_range = Range1d(-0.20, 0.80)
p_quartiles.x_range = Range1d(-1, 6)

bsnb.opensignals_style([p_quartiles])

show(p_quartiles)

In this example we can observe that the distance between Q3 and Q2 is bigger than the distance between Q2 and Q1. Quartiles give us an idea of how the signal's values are distributed.

Even though the signal has extreme values (peaks), their frequency of occurrence is low, which is why the distance between the Q3 line and the top of the graph is much bigger than the distance between any two quartile lines.

4.5 - Standard deviation and variance

Both the standard deviation and the variance are statistics determined by using the mean, and used to measure the degree of dispersion in a dataset. Informally, they measure how close a group of numbers tend to be to the mean.

The variance ($\sigma^2$) is defined as the average of the squared differences from the mean of the dataset. It is obtained by calculating the difference between each point of the dataset and the mean, then squaring and averaging the results. Squaring the distances gives greater weight to values that are further from the mean. However, because of the squaring, the variance is no longer in the same measurement units as the dataset. This makes it harder to interpret, so the standard deviation is often used instead.

The standard deviation ($\sigma$) is the square root of the variance. A low standard deviation indicates that the values tend to be concentrated close to the mean, whilst a high value indicates that they tend to spread out over a wider range.

\begin{align} \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})^2 } \\ \end{align}
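Note that the formula above uses the sample denominator ($N-1$), while NumPy's `var` and `std` default to the population denominator ($N$, i.e. `ddof=0`). A minimal sketch of the difference, using a hypothetical toy array:

```python
from numpy import array, std, var, sqrt

x = array([1.0, 2.0, 3.0, 4.0])

pop_var = var(x)             # population variance (ddof=0): divide by N
sample_var = var(x, ddof=1)  # sample variance (ddof=1): divide by N - 1, as in the formula above

print(pop_var)     # 1.25
print(sample_var)  # ~1.6667

# The standard deviation is the square root of the matching variance
assert abs(std(x) - sqrt(pop_var)) < 1e-12
assert abs(std(x, ddof=1) - sqrt(sample_var)) < 1e-12
```

For long signals such as the one in this notebook the two denominators give nearly identical results, so the default `ddof=0` used below is fine.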
In [19]:
# Let's calculate the standard deviation and the variance
std_val = std(signal)
variance = var(signal)
In [20]:
print('Mean: ', mean_val)
print('Standard deviation: ', std_val, '\nVariance: ', variance)
Mean:  0.00021564978679627862
Standard deviation:  0.16587464942332372 
Variance:  0.02751439932131055
In [21]:
# Let's draw the std line in our graph
color_std = bsnb.opensignals_color_pallet()
p.line(x=time, y=mean_val+std_val, line_dash='dashed', legend_label='Standard deviation', line_color=color_std)
p.line(x=time, y=mean_val-std_val, line_dash='dashed', line_color=color_std)

show(p)

4.6 - Skewness and kurtosis

The skewness is a metric of symmetry (or the lack of) in a dataset. A dataset is symmetric if it looks the same to the left and right of the center point. A normal distribution has a skewness of 0.

  • Negative (<0) skew value: longer left tail; the distribution concentrates on the right.
  • Positive (>0) skew value: longer right tail; the distribution concentrates on the left.

Kurtosis indicates if the dataset is heavy-tailed or light-tailed in comparison to a normal distribution. Datasets with high kurtosis ("heavy-tailed") tend to have outliers. The lower the kurtosis value, the less common it is to find outliers in a dataset.
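A quick illustration of both metrics on hypothetical toy datasets (the values are chosen purely for the example):

```python
from numpy import array
from scipy.stats import skew, kurtosis

symmetric = array([1.0, 2.0, 3.0, 4.0, 5.0])      # mirror-symmetric around 3
right_tailed = array([1.0, 1.0, 1.0, 2.0, 10.0])  # one large value on the right

print(skew(symmetric))     # 0.0: perfectly symmetric
print(skew(right_tailed))  # positive: longer right tail

# The heavy right tail also raises the (excess) kurtosis relative to the symmetric set
print(kurtosis(right_tailed) > kurtosis(symmetric))  # True
```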

In [22]:
# Let's calculate the skewness and kurtosis
skewness = skew(signal)
kurtosis_val = kurtosis(signal)
In [23]:
print('Skewness: ', skewness, '\nKurtosis: ', kurtosis_val)
Skewness:  4.271322272366648 
Kurtosis:  21.250186573425474
In [24]:
arr_hist, edges = histogram(signal, bins = int(180/5))

# Put the information in a dataframe
delays = DataFrame({'arr_delay': arr_hist, 'left': edges[:-1], 'right': edges[1:]})

p_histogram = figure(plot_height = 250, plot_width = 1000, x_axis_label="mV", y_axis_label="bins", **bsnb.opensignals_kwargs("figure"))

# Add a quad glyph
p_histogram.quad(bottom=0, top=delays['arr_delay'], 
       left=delays['left'], right=delays['right'], 
       fill_color=bsnb.opensignals_color_pallet(), line_color='black')

bsnb.opensignals_style([p_histogram])

# Show the plot
show(p_histogram) 

The histogram clearly shows that the time series is right-tailed, which is consistent with the positive skew value obtained. Some outliers are also present on the right side of the histogram, around the 1 mV value. This is consistent with the high kurtosis value: as we previously mentioned in the analysis of the quartiles, the highest peaks of the heartbeat cycle appear throughout the signal, although with low frequency.

5 - Temporal parameters

Temporal parameters are used to analyse time series and extract meaningful statistics or other characteristics of the signal. These parameters are specific to data series in the time domain: they extract relevant, descriptive information that derives from the temporal nature of the data, which should be taken into account when analysing time series.

5.1 - Autocorrelation

Autocorrelation, or serial correlation, is often used with time domain signals. It is the correlation of a signal with a delayed copy of itself, as a function of the delay. Autocorrelation analysis finds repeating patterns, such as the presence of a periodic signal obscured by noise.

The ACF (autocorrelation function) shows the correlation between points separated by various time lags. So, in other words, it is the degree of association between points based on how many time steps apart they are. Normally, the autocorrelation function falls towards 0 as points become more separated, because the bigger the separation, the less correlation between the points. This is not a rule, but it is the most typical scenario.
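Before using statsmodels' acf, it may help to see how the autocorrelation at a given lag is computed directly. A minimal sketch on a synthetic sine wave (the helper name is our own, not part of any library):

```python
from numpy import arange, mean, sin, pi

def autocorr(x, lag):
    """Biased sample autocorrelation at a given lag, normalised by the lag-0 value."""
    x = x - mean(x)
    return (x[:len(x) - lag] * x[lag:]).sum() / (x * x).sum()

t = arange(0, 10, 0.01)  # 10 s sampled at 100 Hz
x = sin(2 * pi * t)      # 1 Hz sine: the period spans 100 samples

print(autocorr(x, 100))  # strongly positive: points one full period apart
print(autocorr(x, 50))   # strongly negative: points half a period out of phase
```

Like the statsmodels estimator, this biased version shrinks towards 0 as the lag grows, simply because fewer sample pairs remain at large lags.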

In [25]:
# Let's obtain the autocorrelation function for our signal
autocorrelation_function = acf(signal, fft=True, nlags=round(len(signal)*0.75))
In [26]:
def get_autocorrelation_plot_params(series):
    n = len(series)
    data = asarray(series)
    mean_val = mean(data)
    c0 = sum((data - mean_val) ** 2) / float(n)

    def r(h):
        return ((data[:n - h] - mean_val) *
                (data[h:] - mean_val)).sum() / float(n) / c0
    x = arange(n) + 1
    y = list(map(r, x))
    z95 = 1.959963984540054 # confidence interval 95%
    z99 = 2.5758293035489004 # confidence interval 99%
    return n, x, y, z95, z99

n, x, y, z95, z99 = get_autocorrelation_plot_params(autocorrelation_function)
x = x/fs # convert the lag index into seconds using the sampling frequency
auto_correlation_plot2 = figure(title='Time Series Auto-Correlation', plot_width=1000,
                                plot_height=400, x_axis_label="Lag (s)", y_axis_label="Autocorrelation", **bsnb.opensignals_kwargs("figure"))

color_3 = bsnb.opensignals_color_pallet()
color_4 = bsnb.opensignals_color_pallet()
auto_correlation_plot2.line(x, y=z99 / sqrt(n), line_dash='dashed', line_color=color_3, legend_label='99% confidence band')
auto_correlation_plot2.line(x, y=z95 / sqrt(n), line_color=color_3, legend_label='95% confidence band')
auto_correlation_plot2.line(x, y=0.0, line_color=color_4)
auto_correlation_plot2.line(x, y=-z95 / sqrt(n), line_color=color_3)
auto_correlation_plot2.line(x, y=-z99 / sqrt(n), line_dash='dashed', line_color=color_3)

auto_correlation_plot2.line(x, y, **bsnb.opensignals_kwargs("line"))
#auto_correlation_plot2.circle(x, y, fill_color="white", size=8)  # optional

curdoc().add_root(column(auto_correlation_plot2))

bsnb.opensignals_style([auto_correlation_plot2])

show(auto_correlation_plot2)

As expected, the ACF tends to 0 as the distance between points grows. However, the analysis highlights the presence of a highly correlated periodic pattern, characterised by the succession of a negative, a positive and a negative peak, with a period of less than 1 second.

In this case, since the signal is an ECG recording, it makes sense that there is a periodicity associated with the heartbeat.

5.2 - Stationarity

A time series is stationary when its statistical properties, such as the mean and variance, do not change over time.

The stationarity of a dataset can often be assessed by looking at the signal's plot, histogram or ACF. It can also be checked by calculating the mean and variance of the signal over different time intervals: if the results are similar, the signal is probably stationary. There are also statistical tests to determine whether a time series is stationary, for instance the (Augmented) Dickey-Fuller test or the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test.

In [27]:
# Let's try to check the stationarity of our signal
# We can measure the mean and variance of different intervals of the signal and see if they are different
mean_1 = mean(signal[1:300])
mean_2 = mean(signal[800:1100])
mean_3 = mean(signal[1600:1900])

var_1 = var(signal[1:300])
var_2 = var(signal[800:1100])
var_3 = var(signal[1600:1900])

# Let's try the Dickey-Fuller test
adf_test = adfuller(signal)

# Let's try the KPSS test
kpsstest = kpss(signal, regression='c')
In [28]:
print('Length of the signal: ', len(signal))
print('Means: ', mean_1, mean_2, mean_3)
print('Variances: ', var_1, var_2, var_3, '\n')
Length of the signal:  1965
Means:  -0.0026015980187865805 0.006810760498046875 -0.00485137939453125
Variances:  0.0260274864826267 0.03005196664656978 0.025713346455339344 

In [29]:
# Using pandas Series, we can print the ADF results in a readable way.
print ('Results of ADF Test:')
dfoutput = Series(adf_test[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in adf_test[4].items():
   dfoutput['Critical Value (%s)'%key] = value
print (dfoutput)

# Using pandas Series, we can print the KPSS results in a readable way.
print ('\nResults of KPSS Test:')
kpss_output = Series(kpsstest[0:3], index=['Test Statistic','p-value','Lags Used'])
for key,value in kpsstest[3].items():
    kpss_output['Critical Value (%s)'%key] = value
print (kpss_output)
Results of ADF Test:
Test Statistic                -9.868093e+00
p-value                        4.063043e-17
#Lags Used                     2.600000e+01
Number of Observations Used    1.938000e+03
Critical Value (1%)           -3.433729e+00
Critical Value (5%)           -2.863033e+00
Critical Value (10%)          -2.567565e+00
dtype: float64

Results of KPSS Test:
Test Statistic            0.014125
p-value                   0.100000
Lags Used                26.000000
Critical Value (10%)      0.347000
Critical Value (5%)       0.463000
Critical Value (2.5%)     0.574000
Critical Value (1%)       0.739000
dtype: float64

Although the means remain pretty similar in the three chosen intervals, the variances do show some significant changes.

Let's analyse the results of the statistical tests:

  • The Dickey-Fuller (ADF) test returned a very small p-value. Since it is much smaller than 0.05 and 0.01, we can reject the null hypothesis of non-stationarity: this test deems the signal stationary.
  • The KPSS test has to be interpreted the opposite way: its null hypothesis is that the series is stationary, so a low p-value points towards non-stationarity. Here the reported p-value is 0.1, which is the largest value statsmodels reports (the true p-value is at least that large), and the test statistic (0.014) lies well below every critical value. We therefore cannot reject the null hypothesis: the KPSS test also deems the signal stationary.

In this case the two tests agree, but that is not always so, and when they disagree the combination of the two results is still informative, because there is more than one type of non-stationarity. As a short introduction, there are three possible types of signals based on stationarity:

  • Strict stationary : signals where the mean, variance and covariance do not vary with time. The usual aim is to convert non-stationary series into strict stationary series before making predictions.
  • Trend stationary : a time series that can be made strict stationary by removing its trend. KPSS is a trend stationarity test.
  • Difference stationary : a time series that can be made strict stationary by differencing. ADF is a difference stationarity test.

Therefore, the results of the KPSS and ADF tests can be combined to determine whether a signal is stationary, trend stationary or difference stationary:

  • If both tests conclude that the series is not stationary: the series is not stationary
  • If both tests conclude that the series is stationary: the series is stationary
  • If KPSS = stationary and ADF = not stationary: the signal is trend stationary
  • If KPSS = not stationary and ADF = stationary: the signal is difference stationary
So in this case both tests place our signal in the stationary category. To understand more about signal stationarity, there are some interesting online resources that go deeper into how to transform non-stationary signals and the different ways to handle them.
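The combination rules above can be wrapped in a small helper. This is only a sketch (the function name and the 0.05 threshold are our own choices, not part of statsmodels); recall that the ADF null hypothesis is non-stationarity, while the KPSS null hypothesis is stationarity:

```python
def classify_stationarity(adf_p, kpss_p, alpha=0.05):
    """Combine ADF and KPSS p-values into a single verdict.

    ADF null hypothesis: the series is non-stationary -> small p means stationary.
    KPSS null hypothesis: the series is stationary    -> small p means non-stationary.
    """
    adf_stationary = adf_p < alpha      # reject non-stationarity
    kpss_stationary = kpss_p >= alpha   # fail to reject stationarity
    if adf_stationary and kpss_stationary:
        return "stationary"
    if adf_stationary and not kpss_stationary:
        return "difference stationary"
    if not adf_stationary and kpss_stationary:
        return "trend stationary"
    return "non-stationary"

print(classify_stationarity(adf_p=0.001, kpss_p=0.50))  # stationary
print(classify_stationarity(adf_p=0.60, kpss_p=0.50))   # trend stationary
```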

5.3 - Seasonality decomposition

The seasonal decomposition of a signal divides the main time series into different components, each of which has a specific characteristic. This decomposition can follow two different models:

  • additive , for linear distributions, where the time signal is decomposed following the formula:
    $Signal=trend+seasonality+residue$
  • multiplicative , for exponential series, where the time signal is decomposed following the formula:
    $Signal=trend \times seasonality \times residue$
The resulting components are the following:
  • Trend : increasing or decreasing tendency in the time series
  • Seasonality : the repeating short-term cycles in the series
  • Residue : random noise present in the signal that cannot be attributed to trend or seasonality
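To make the additive model concrete, a synthetic series can be built from the three components and recovered exactly by summing them (all names and values below are hypothetical, chosen only for illustration):

```python
from numpy import arange, sin, pi
from numpy.random import default_rng

t = arange(0, 10, 0.01)                          # 10 s at 100 Hz
trend = 0.05 * t                                 # slow upward drift
seasonality = sin(2 * pi * t)                    # 1 Hz repeating cycle
residue = default_rng(0).normal(0, 0.1, len(t))  # random noise

# Additive model: Signal = trend + seasonality + residue
signal_synthetic = trend + seasonality + residue
assert abs(signal_synthetic - trend - seasonality - residue).max() < 1e-12
```

seasonal_decompose tries to invert this construction: it estimates the trend with a moving average, averages the detrended cycles to obtain the seasonality, and leaves the remainder as residue.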
In [30]:
# Let's decompose our signal into trend, seasonal and residual components using seasonal decomposition
signal = array(signal) - mean(signal)
filter_signal_1 = bsnb.lowpass(signal, f=40, order=1, fs=sr)
filter_signal_2 = bsnb.lowpass(filter_signal_1, f=40, order=3, fs=sr)
results = seasonal_decompose(filter_signal_2, model='additive', freq=sr, extrapolate_trend='freq')
In [31]:
decomposition_plot2 = figure(title='Seasonal decomposition - Observed and trend', plot_width=1200, plot_height=200, **bsnb.opensignals_kwargs("figure"))
decomposition_plot2.line(arange(0,len(results.observed)), results.observed, legend_label='Observed', **bsnb.opensignals_kwargs("line"))
decomposition_plot2.line(arange(0,len(results.trend)), results.trend, legend_label='Trend', **bsnb.opensignals_kwargs("line"))
bsnb.opensignals_style([decomposition_plot2])
show(decomposition_plot2)

decomposition_plot3 = figure(title='Seasonal decomposition - Seasonality and residuals', plot_width=1200, plot_height=200, **bsnb.opensignals_kwargs("figure"))
decomposition_plot3.line(arange(0,len(results.seasonal)), results.seasonal, legend_label='Seasonality', **bsnb.opensignals_kwargs("line"))
decomposition_plot3.line(arange(0,len(results.resid)), results.resid, legend_label='Residuals', **bsnb.opensignals_kwargs("line"))
bsnb.opensignals_style([decomposition_plot3])
show(decomposition_plot3)

The decomposition in this case is unable to separate the cyclic element (the heartbeat) from the signal, which is why the most significant part of the signal ends up in the residual component. This is not a good result, and it is probably caused by the naive approach of the seasonal_decompose function in the statsmodels package.

There are more complex and robust approaches to this problem, like STL (Seasonal-Trend decomposition using Loess), which would however require a more in-depth explanation.

We hope that you have enjoyed this guide. biosignalsnotebooks is an environment in continuous expansion, so don't stop your journey and learn more with the remaining Notebooks!

In [32]:
bsnb.css_style_apply()