mt.pandas.dataframe
Additional utilities dealing with dataframes.
Functions
rename_column()
: Renames a column in a dataframe.
row_apply()
: Applies a function on every row of a pandas.DataFrame, optionally with a progress bar.
parallel_apply()
: Parallel-applies a function on every row or column of a pandas.DataFrame, optionally with a progress bar.
warn_duplicate_records()
: Warns of duplicate records in the dataframe based on a list of keys.
filter_rows()
: Returns df[s] but warns if the number of rows drops.
- mt.pandas.dataframe.rename_column(df: DataFrame, old_column: str, new_column: str) bool
Renames a column in a dataframe.
- Parameters:
df (pandas.DataFrame) – the dataframe to work on
old_column (str) – the column name to be renamed
new_column (str) – the new column name
- Returns:
whether or not the column has been renamed
- Return type:
bool
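The documentation does not state when the rename is skipped or whether the dataframe is changed in place. As a rough illustration only, the sketch below uses plain pandas and a hypothetical stand-in named rename_column_sketch. It assumes the rename is skipped when old_column is absent or new_column already exists, and that the dataframe is modified in place. The library's actual behaviour may differ.

```python
import pandas as pd


def rename_column_sketch(df: pd.DataFrame, old_column: str, new_column: str) -> bool:
    """Hypothetical stand-in for rename_column, written with plain pandas."""
    # Assumption: nothing happens if the source column is missing
    # or the target name is already taken.
    if old_column not in df.columns or new_column in df.columns:
        return False
    # Assumption: the dataframe is modified in place.
    df.columns = [new_column if c == old_column else c for c in df.columns]
    return True


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
renamed = rename_column_sketch(df, "a", "x")     # column "a" exists
missing = rename_column_sketch(df, "nope", "y")  # column "nope" does not exist
```

The boolean return value lets a caller tell a successful rename apart from a no-op without inspecting the columns afterwards.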
- mt.pandas.dataframe.row_apply(df: DataFrame, func, bar_unit='it') DataFrame
Applies a function on every row of a pandas.DataFrame, optionally with a progress bar.
- Parameters:
df (pandas.DataFrame) – a dataframe
func (function) – a function to map each row of the dataframe to something
bar_unit (str, optional) – unit name to be passed to the progress bar. If None is provided, no bar is displayed.
- Returns:
the output of invoking df.apply on every row. A progress bar is shown if requested.
- Return type:
pandas.DataFrame
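Assuming row_apply forwards func to df.apply with axis=1, as the Returns note suggests, the shape of the result depends on what func returns. The sketch below uses only plain pandas (no progress bar) to show the two common cases. The column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"width": [2, 3], "height": [5, 7]})

# A row function that returns a scalar produces a Series, one value per row.
areas = df.apply(lambda row: row["width"] * row["height"], axis=1)


# A row function that returns a Series produces a DataFrame,
# with one column per key of the returned Series.
def describe(row: pd.Series) -> pd.Series:
    return pd.Series(
        {
            "area": row["width"] * row["height"],
            "perimeter": 2 * (row["width"] + row["height"]),
        }
    )


shapes = df.apply(describe, axis=1)
```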
- mt.pandas.dataframe.parallel_apply(df: DataFrame, func, axis: int = 1, n_cores: int = -1, parallelism: str = 'multiprocess', logger: IndentedLoggerAdapter | None = None, scoped_msg: str | None = None) Series
Parallel-applies a function on every row or column of a pandas.DataFrame, optionally with a progress bar.
The function wraps pandas_parallel_apply.DataFrameParallel. By default it applies func along rows. Progress bars are shown if and only if a logger is provided.
- Parameters:
df (pandas.DataFrame) – a dataframe
func (function) – a function to map a series to a series. It must be picklable for parallel processing.
axis ({0,1}) – axis along which to apply: 1 for rows (default), 0 for columns.
n_cores (int) – number of CPUs to use. Passed as-is to pandas_parallel_apply.DataFrameParallel.
parallelism ({'multithread', 'multiprocess'}) – multi-threading or multi-processing. Passed as-is to pandas_parallel_apply.DataFrameParallel.
logger (mt.logg.IndentedLoggerAdapter, optional) – logger for debugging purposes.
scoped_msg (str, optional) – if provided, the message used to scoped_info the progress bars. Only valid if a logger is provided.
- Returns:
output dataframe by invoking df.apply.
- Return type:
pandas.DataFrame
See also
pandas_parallel_apply.DataFrameParallel
the wrapped class for the parallel_apply purpose
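To make the idea of a parallel row apply concrete, here is a minimal sketch built on the standard library's ThreadPoolExecutor and plain pandas. It is not how pandas_parallel_apply or parallel_apply is implemented. The name threaded_row_apply_sketch and the n_workers parameter are hypothetical, and only a thread-based variant is shown.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def threaded_row_apply_sketch(df: pd.DataFrame, func, n_workers: int = 2):
    """Hypothetical sketch: apply func to every row, one chunk per worker thread."""
    if df.empty:
        return df.apply(func, axis=1)
    # Split the rows into contiguous chunks of roughly equal size.
    size = max(1, -(-len(df) // n_workers))  # ceiling division
    chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves the order of the chunks.
        parts = list(pool.map(lambda chunk: chunk.apply(func, axis=1), chunks))
    # Stitch the per-chunk results back together in the original row order.
    return pd.concat(parts)


def row_sum(row: pd.Series) -> int:
    return row["x"] + row["y"]


df = pd.DataFrame({"x": range(5), "y": range(5, 10)})
result = threaded_row_apply_sketch(df, row_sum, n_workers=2)
```

With threads, func does not need to be picklable. That requirement matters for process-based parallelism, where the function has to be sent to worker processes.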
- mt.pandas.dataframe.warn_duplicate_records(df: DataFrame, keys: list, msg_format: str = 'Detected {dup_cnt}/{rec_cnt} duplicate records.', logger: IndentedLoggerAdapter | None = None)
Warns of duplicate records in the dataframe based on a list of keys.
- Parameters:
df (pandas.DataFrame) – a dataframe
keys (list) – list of column names
msg_format (str, optional) – the message to be logged. Two keyword arguments will be provided: ‘rec_cnt’ and ‘dup_cnt’.
logger (mt.logg.IndentedLoggerAdapter, optional) – logger for debugging purposes.
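The documentation does not say how the duplicate count is computed. The sketch below is one plausible reading, written with plain pandas and the standard logging module in place of mt.logg. It assumes that every row repeating an earlier key combination counts as a duplicate. The name warn_duplicate_records_sketch is hypothetical, and the sketch returns the count purely for illustration.

```python
import logging

import pandas as pd

DEFAULT_FORMAT = "Detected {dup_cnt}/{rec_cnt} duplicate records."


def warn_duplicate_records_sketch(df, keys, msg_format=DEFAULT_FORMAT, logger=None):
    """Hypothetical stand-in: warn when rows share the same values in `keys`."""
    rec_cnt = len(df)
    # Assumption: a row is a duplicate if an earlier row has the same key values.
    dup_cnt = int(df.duplicated(subset=keys).sum())
    if dup_cnt > 0 and logger is not None:
        logger.warning(msg_format.format(dup_cnt=dup_cnt, rec_cnt=rec_cnt))
    return dup_cnt  # returned here only to make the sketch easy to inspect


df = pd.DataFrame({"user": ["a", "b", "a", "a"], "day": [1, 1, 1, 2]})
dup_cnt = warn_duplicate_records_sketch(
    df, ["user", "day"], logger=logging.getLogger("demo")
)
```

A custom msg_format can use either placeholder, for example "{dup_cnt} of {rec_cnt} rows share a key".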
- mt.pandas.dataframe.filter_rows(df: DataFrame, s: Series, msg_format: str | None = None, logger: IndentedLoggerAdapter | None = None) DataFrame
Returns df[s] but warns if the number of rows drops.
- Parameters:
df (pandas.DataFrame) – a dataframe
s (pandas.Series) – the boolean series to filter the rows of df. Must be of the same size as df.
msg_format (str, optional) – the message to be logged. Two keyword arguments will be provided: ‘n_before’ and ‘n_after’.
logger (mt.logg.IndentedLoggerAdapter, optional) – logger for debugging purposes.
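As a rough picture of the behaviour described above, the sketch below filters with a boolean series and logs when rows are lost. It uses plain pandas and the standard logging module in place of mt.logg. The name filter_rows_sketch is hypothetical, and the sketch assumes a warning is emitted only when both msg_format and logger are given. What the library does when msg_format is None is not documented here.

```python
import logging

import pandas as pd


def filter_rows_sketch(df, s, msg_format=None, logger=None):
    """Hypothetical stand-in: return df[s], warning if rows were dropped."""
    n_before = len(df)
    filtered = df[s]
    n_after = len(filtered)
    # Assumption: warn only when both a message format and a logger are provided.
    if n_after < n_before and msg_format is not None and logger is not None:
        logger.warning(msg_format.format(n_before=n_before, n_after=n_after))
    return filtered


df = pd.DataFrame({"score": [0.2, 0.9, 0.5]})
kept = filter_rows_sketch(
    df,
    df["score"] >= 0.5,
    msg_format="Rows dropped: {n_before} -> {n_after}.",
    logger=logging.getLogger("demo"),
)
```

Note that boolean indexing keeps the original index labels of the surviving rows, so kept is indexed by 1 and 2 rather than renumbered from 0.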