Parameter Extraction - Temporal and Statistical Parameters
When working with a signal recording, it is important to perform a preliminary analysis to understand how it behaves and obtain its main characteristics. One of the most prevalent ways to do this is by extracting descriptive parameters. There are different ways to approach this, and this notebook will cover some of the most common statistical and temporal parameters that can be used to analyse time series.
The use of these descriptive characteristics will help us decide the most effective strategy to follow during data preparation, model selection and model tuning depending on the use we will give to the data.
List of statistical parameters: minimum, maximum and range; mean; mode; median and quantiles; standard deviation and variance; skewness and kurtosis.
List of temporal parameters: autocorrelation; stationarity; seasonality decomposition.
1 - Importation of the needed packages
# biosignalsnotebooks own package for loading and plotting the acquired data
import biosignalsnotebooks as bsnb
# Scientific packages
from numpy import linspace, array, mean, median, quantile, std, var
from scipy.stats import mode, skew, kurtosis
from statsmodels.tsa.stattools import acf, adfuller, kpss
from statsmodels.tsa.seasonal import seasonal_decompose
2 - Load of sample signal data
# Load of data (from a relative file path inside the project folder)
data, header = bsnb.load("/signal_samples/ecg_20_sec_100_Hz.h5", get_header=True)
channel = list(data.keys())[0]
3 - Storage of sampling frequency and acquired data in variables
Since in this case we know that the original signal is an ECG recording, we can also convert its units to mV following the method explained in the ECG Sensor - Unit Conversion notebook.
# Sampling frequency and acquired data
fs = header["sampling rate"]
# Signal Samples
signal_raw = data[channel]
time = bsnb.generate_time(signal_raw, fs)
# Let's convert the signal's units, since we know it is a ECG signal
vcc = 3000 # mV
gain = 1000
ch = "CH1" # Channel
sr = header['sampling rate'] # Sampling rate
resolution = header['resolution'] # Resolution (number of available bits)
signal = (((array(signal_raw) / 2**resolution) - 0.5) * vcc) / gain
4 - Extraction of statistical parameters in a time signal
A fundamental task in many statistical analyses is to characterise the location (a central value that best describes the data) and the variability (or spread) of a dataset, in order to describe and understand it better. These descriptors are independent of the domain of the signal and can be used to describe any type of information or dataset. Because they are simple to compute and interpret, they are widely used in statistical analyses.
4.1 - Minimum, maximum and range values
The minimum and maximum values are, respectively, the lowest and highest power values across the signal duration. There may be more than one occurrence of the minimum and maximum value across the time series. The range is the difference between the maximum and the minimum value.
# Let's calculate the minimum, maximum and range of our signal
max_value = max(signal)
min_value = min(signal)
signal_range = max_value - min_value
4.2 - Mean
The arithmetic mean ($\bar{x}$) of a temporal data set represents the average power value across the signal duration. \begin{align} \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i=\frac{x_1+x_2+\cdots+x_n}{n} \\ \end{align}
With $x_1, x_2, \ldots, x_n$ being the values in the dataset and $n$ the number of elements in the dataset.
# Let's calculate the mean value of our signal
mean_val = mean(signal)
4.3 - Mode
The mode represents the power value with the highest number of occurrences in the dataset across the signal duration. There may be more than one mode if two or more different values have the same number of occurrences.
# Let's find the mode (or modes) in our signal
# Avoid naming the result "mode", which would shadow the scipy.stats.mode function
mode_result = mode(signal)
mode_val, occurrences = mode_result.mode, mode_result.count
4.4 - Median value and quantiles
The median is the value that separates the upper half from the lower half of a dataset. It is a popular summary statistic because it is simple to calculate and is more robust than the mean in the presence of outlier values. The median is also known as the 2-quantile because it marks the middle point of the dataset. However, other quantiles can also be explored.
For example, 4-quantiles (or quartiles) divide the dataset in four sections:
# Let's calculate the median and the 25th and 75th percentiles (Q1 and Q3)
median_val = median(signal) # 50% of the values of the signal are lower than the median (Q2, or 50th percentile)
Q1 = quantile(signal, 0.25) # 25% of the values of the signal are lower than the 25th percentile
Q3 = quantile(signal, 0.75) # 75% of the values of the signal are lower than the 75th percentile
In this example we can observe that the distance between Q3 and Q2 (the median) is bigger than the distance between Q2 and Q1. Quartiles give us an idea about how the signal's values are distributed.
Even though the signal has extreme values (peaks), their frequency of occurrence is low, which is why the distance between the Q3 line and the top of the graph is much bigger than the distance between any of the other quartiles.
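The spread described by the quartiles is often summarised by the interquartile range (IQR = Q3 - Q1), and a common rule of thumb flags values further than 1.5 × IQR beyond the quartiles as outliers. A minimal sketch on toy data (the values and variable names here are illustrative, not taken from the recording):

```python
from numpy import array, quantile

# Toy dataset with one extreme value (illustrative, not the ECG recording)
toy = array([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 1.2])

q1 = quantile(toy, 0.25)
q3 = quantile(toy, 0.75)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

# Rule-of-thumb fences: values beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = toy[(toy < lower_fence) | (toy > upper_fence)]
```

Here only the extreme 1.2 value falls outside the fences, mirroring how the ECG peaks sit far from the bulk of the samples.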
4.5 - Standard deviation and variance
Both the standard deviation and the variance are statistics derived from the mean, used to measure the degree of dispersion in a dataset. Informally, they measure how close a group of numbers tends to be to the mean. The variance ($\sigma^2$) is defined as the average of the squared differences from the mean of the dataset. It is obtained by calculating the difference between each point of the dataset and the mean, then squaring and averaging the results. Squaring the distance from the mean gives greater weight to values that are further from it. However, because of the squaring, the variance is no longer in the same measurement units as the dataset. This makes the variance harder to interpret, so the standard deviation is often used instead.
The standard deviation ($\sigma$) is the square root of the variance. A low standard deviation indicates that the values tend to be concentrated close to the mean, whilst a high value indicates that the values tend to spread out over a wider range.
\begin{align} \sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})^2 }, \\ \end{align}
# Let's calculate the standard deviation and the variance
std_val = std(signal)
variance = var(signal)
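One detail worth noting: NumPy's std and var divide by $N$ by default (population statistics), while the formula above divides by $N-1$ (the sample estimate). Passing ddof=1 switches NumPy to the sample convention. A small illustration on toy data:

```python
from numpy import array, std, var

x = array([1.0, 2.0, 3.0, 4.0])

# NumPy's defaults divide by N (population statistics, ddof=0)
pop_var = var(x)
# Pass ddof=1 to divide by N - 1, matching the sample formula above
sample_var = var(x, ddof=1)
sample_std = std(x, ddof=1)  # square root of sample_var
```

For this dataset the population variance is 1.25 while the sample variance is 5/3; the difference shrinks as $N$ grows, so for long recordings like ours either convention gives nearly the same value.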
4.6 - Skewness and kurtosis
Skewness is a measure of the symmetry (or lack of it) of a dataset. A dataset is symmetric if it looks the same to the left and right of the centre point; a normal distribution has a skewness of 0. Kurtosis indicates whether the dataset is heavy-tailed or light-tailed in comparison to a normal distribution. Datasets with high kurtosis ("heavy-tailed") tend to contain outliers. The lower the kurtosis value, the less common it is to find outliers in the dataset.
# Let's calculate the skewness and kurtosis
skewness = skew(signal)
kurtosis_val = kurtosis(signal)
The histogram clearly shows that the time series is right-tailed, which is consistent with the positive skewness value obtained. Some outliers are also present on the right side of the histogram, around the 1 mV value. This is consistent with the kurtosis result: as we previously mentioned in the analysis of the quartiles, the tall peaks of the heartbeat cycle occur throughout the signal, although with a low frequency of occurrence.
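Since both metrics are defined relative to a normal distribution, a quick sanity check is to compute them on synthetic Gaussian noise: both should come out close to 0 (scipy's kurtosis uses Fisher's definition by default, which subtracts 3 so that a normal distribution scores 0). A small sketch, independent of the ECG recording:

```python
from numpy.random import default_rng
from scipy.stats import skew, kurtosis

rng = default_rng(0)
gaussian = rng.normal(loc=0.0, scale=1.0, size=100_000)

# For a normal distribution, skewness ~ 0 and excess kurtosis ~ 0,
# so both values should be very small for this synthetic sample
gaussian_skew = skew(gaussian)
gaussian_kurt = kurtosis(gaussian)
```

Comparing the ECG's clearly positive values against this near-zero baseline makes the "heavy right tail" interpretation concrete.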
5 - Temporal parameters
Temporal parameters are used to analyse time series and extract meaningful statistics or other characteristics from the signal. These parameters are specific to data series in the time domain: they capture relevant and descriptive information that derives from the temporal ordering of the data, which should be taken into account when analysing time series.
5.1 - Autocorrelation
Autocorrelation, or serial correlation, is often used with time domain signals. It is the relationship of a signal with a delayed copy of itself, as a function of the delay. Autocorrelation analysis finds repeating patterns, such as the presence of a periodic signal obscured by noise. The ACF (autocorrelation function) shows the correlation between points separated by various time lags; in other words, it is the degree of association between points based on how many time steps apart they are. Normally, the autocorrelation function falls towards 0 as points become more separated, because the bigger the separation, the less correlation between the points. This is not a rule, but it is the most typical scenario.
# Let's obtain the autocorrelation function for our signal
autocorrelation_function = acf(signal, fft=True, nlags=round(len(signal)*0.75))
As expected, the ACF tends to 0 as the distance between points grows. However, the analysis highlights the presence of a periodic pattern with high correlation, characterised by the succession of a negative, positive and negative peak with a period of less than 1 second.
In this case, since this signal is an ECG recording, it makes sense that there is a periodicity associated with the heartbeat.
5.2 - Stationarity
A time series is stationary when its statistical properties, such as the mean and variance, do not change over time. The stationarity of a dataset can usually be assessed by looking at the signal's plot, histogram or ACF. It can also be checked by calculating the mean and variance of the signal in different time intervals: if the results are similar, then the signal is probably stationary. There are also statistical tests to determine if a time series is stationary, for instance the Dickey-Fuller test or the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test.
# Let's try to check the stationarity of our signal
# We can measure the mean and variance of different intervals of the signal and see if they are different
mean_1 = mean(signal[1:300])
mean_2 = mean(signal[800:1100])
mean_3 = mean(signal[1600:1900])
var_1 = var(signal[1:300])
var_2 = var(signal[800:1100])
var_3 = var(signal[1600:1900])
# Let's try the Dickey-Fuller test
adf_test = adfuller(signal)
# Let's try the KPSS test
kpsstest = kpss(signal, regression='c')
Although the means remain fairly similar in the three chosen intervals, the variances do show some significant changes.
Let's analyse the results of the statistical tests:
So we got opposite results from the two tests. What does this mean? In reality, there is more than one type of non-stationarity. As a short introduction, based on stationarity a signal can be classified as stationary, trend stationary (stationary around a deterministic trend) or difference stationary (stationary after differencing).
Therefore, the results of the KPSS and ADF tests can be combined to test whether a signal is stationary (both tests indicate stationarity), trend stationary (KPSS indicates stationarity but ADF does not), difference stationary (ADF indicates stationarity but KPSS does not) or non-stationary (neither test indicates stationarity).
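That combination of the two tests can be wrapped in a small helper. The function below is hypothetical (not part of statsmodels) and uses the conventional 0.05 significance level; remember that the ADF null hypothesis is non-stationarity while the KPSS null hypothesis is stationarity, so their p-values point in opposite directions:

```python
def interpret_stationarity(adf_p, kpss_p, alpha=0.05):
    """Combine ADF and KPSS p-values into a verdict (hypothetical helper).

    ADF null hypothesis: the series has a unit root (is NON-stationary).
    KPSS null hypothesis: the series IS (trend) stationary.
    """
    adf_says_stationary = adf_p < alpha       # reject ADF null -> stationary
    kpss_says_stationary = kpss_p >= alpha    # fail to reject KPSS null -> stationary

    if adf_says_stationary and kpss_says_stationary:
        return "stationary"
    if adf_says_stationary and not kpss_says_stationary:
        return "difference stationary"
    if kpss_says_stationary:
        return "trend stationary"
    return "non-stationary"
```

For our signal we would pass `adf_test[1]` and `kpsstest[1]` (the p-values returned by adfuller and kpss) into this helper.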
5.3 - Seasonality decomposition
The seasonality decomposition of a signal divides the main time series into different components, each of which has a specific characteristic. This decomposition can follow two different models: additive, where the components are summed to recover the signal, and multiplicative, where they are multiplied.
# Let's decompose our signal into trend, seasonal and residual components using seasonal decomposition
signal = array(signal) - mean(signal)
filter_signal_1 = bsnb.lowpass(signal, f=40, order=1, fs=sr)
filter_signal_2 = bsnb.lowpass(filter_signal_1, f=40, order=3, fs=sr)
results = seasonal_decompose(filter_signal_2, model='additive', period=sr, extrapolate_trend='freq')  # recent statsmodels versions use period=; older ones used freq=
The decomposition in this case is unable to separate the cyclic element (the heartbeat) from the signal, which is why the most significant part of the signal ends up in the residual component. This is not a good result, and it is probably caused by the naive moving-average approach of the seasonal_decompose function in the statsmodels package.
There are more complex and robust approaches to this problem, such as STL (Seasonal-Trend decomposition using Loess), which would however require a more in-depth explanation.
We hope that you have enjoyed this guide. biosignalsnotebooks is an environment in continuous expansion, so don't stop your journey and learn more with the remaining Notebooks!