mt.pandas.dataframe

Additional utilities dealing with dataframes.

Functions

  • rename_column(): Renames a column in a dataframe.

  • row_apply(): Applies a function on every row of a pandas.DataFrame, optionally with a progress bar.

  • parallel_apply(): Parallel-applies a function on every row or column of a pandas.DataFrame, optionally with a progress bar.

  • warn_duplicate_records(): Warns of duplicate records in the dataframe based on a list of keys.

  • filter_rows(): Returns df[s] but warns if the number of rows drops.

mt.pandas.dataframe.rename_column(df: DataFrame, old_column: str, new_column: str) → bool

Renames a column in a dataframe.

Parameters:
  • df (pandas.DataFrame) – the dataframe to work on

  • old_column (str) – the column name to be renamed

  • new_column (str) – the new column name

Returns:

whether or not the column has been renamed

Return type:

bool
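The contract above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: rename_column_sketch is a hypothetical stand-in, and it assumes that the rename happens in place and that False is returned when old_column is absent, neither of which is stated in the reference above.

```python
import pandas as pd


def rename_column_sketch(df: pd.DataFrame, old_column: str, new_column: str) -> bool:
    """Hypothetical stand-in for rename_column (assumed in-place semantics)."""
    if old_column not in df.columns:
        # Assumption: nothing to rename, so report False.
        return False
    df.rename(columns={old_column: new_column}, inplace=True)
    return True


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
renamed = rename_column_sketch(df, "a", "x")    # column "a" exists
missing = rename_column_sketch(df, "zzz", "y")  # column "zzz" does not exist
print(renamed, missing, list(df.columns))
```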

mt.pandas.dataframe.row_apply(df: DataFrame, func, bar_unit='it') → DataFrame

Applies a function on every row of a pandas.DataFrame, optionally with a progress bar.

Parameters:
  • df (pandas.DataFrame) – a dataframe

  • func (function) – a function to map each row of the dataframe to something

  • bar_unit (str, optional) – unit name to be passed to the progress bar. If None is provided, no bar is displayed.

Returns:

the result of invoking df.apply on every row, with a progress bar shown unless bar_unit is None.

Return type:

pandas.DataFrame
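Since the result comes from df.apply, the following sketch shows the underlying row-wise call in plain pandas, without the progress bar. In plain pandas the shape of the result depends on what func returns: a scalar per row yields a Series, while a Series per row yields a DataFrame. The functions area and dims are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"w": [2, 3], "h": [5, 7]})


def area(row: pd.Series) -> int:
    # Maps one row to a scalar.
    return row["w"] * row["h"]


def dims(row: pd.Series) -> pd.Series:
    # Maps one row to a Series, which becomes one row of the output.
    return pd.Series(
        {"area": row["w"] * row["h"], "perimeter": 2 * (row["w"] + row["h"])}
    )


areas = df.apply(area, axis=1)  # pandas.Series, one value per row
table = df.apply(dims, axis=1)  # pandas.DataFrame, one row per input row
print(areas)
print(table)
```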

mt.pandas.dataframe.parallel_apply(df: DataFrame, func, axis: int = 1, n_cores: int = -1, parallelism: str = 'multiprocess', logger: IndentedLoggerAdapter | None = None, scoped_msg: str | None = None) → Series

Parallel-applies a function on every row or column of a pandas.DataFrame, optionally with a progress bar.

This function wraps pandas_parallel_apply.DataFrameParallel. By default, func is applied to rows (axis=1). Progress bars are shown if and only if a logger is provided.

Parameters:
  • df (pandas.DataFrame) – a dataframe

  • func (function) – a function to map a series to a series. It must be picklable for parallel processing.

  • axis ({0,1}) – the axis along which func is applied: 1 for rows (default), 0 for columns.

  • n_cores (int) – number of CPUs to use. Passed as-is to pandas_parallel_apply.DataFrameParallel.

  • parallelism ({'multithread', 'multiprocess'}) – multi-threading or multi-processing. Passed as-is to pandas_parallel_apply.DataFrameParallel.

  • logger (mt.logg.IndentedLoggerAdapter, optional) – logger for debugging purposes.

  • scoped_msg (str, optional) – if provided, the progress bars are shown inside a scoped_info with this message. Only valid if a logger is provided.

Returns:

the output dataframe, as produced by invoking df.apply.

Return type:

pandas.DataFrame

See also

pandas_parallel_apply.DataFrameParallel

the wrapped class for the parallel_apply purpose
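The axis convention and the picklability requirement can be checked sequentially before going parallel. The sketch below uses only plain pandas and the standard pickle module. It does not call parallel_apply or pandas_parallel_apply, and the names share and is_picklable are hypothetical helpers introduced for illustration.

```python
import pickle

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})


def share(s: pd.Series) -> pd.Series:
    # Maps a series to a series: each value as a fraction of the series total.
    return s / s.sum()


by_row = df.apply(share, axis=1)  # func receives one row at a time
by_col = df.apply(share, axis=0)  # func receives one column at a time


def is_picklable(obj) -> bool:
    """Hypothetical helper: True if pickle can serialise obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Lambdas cannot be pickled by the standard pickle module, so they are a
# poor fit for process-based parallelism.
lambda_ok = is_picklable(lambda s: s / s.sum())
print(by_row)
print(by_col)
print(lambda_ok)
```

A function defined at the top level of an importable module is generally picklable, which is why a named function such as share is a safer choice than a lambda when parallelism is 'multiprocess'.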

mt.pandas.dataframe.warn_duplicate_records(df: DataFrame, keys: list, msg_format: str = 'Detected {dup_cnt}/{rec_cnt} duplicate records.', logger: IndentedLoggerAdapter | None = None)

Warns of duplicate records in the dataframe based on a list of keys.

Parameters:
  • df (pandas.DataFrame) – a dataframe

  • keys (list) – list of column names

  • msg_format (str, optional) – the message to be logged. Two keyword arguments are provided when formatting it: ‘rec_cnt’ and ‘dup_cnt’.

  • logger (mt.logg.IndentedLoggerAdapter, optional) – logger for debugging purposes.
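To show how the two keyword arguments fit into msg_format, here is a minimal sketch in plain pandas. It assumes that dup_cnt counts every record after the first occurrence of a key combination (the default behaviour of DataFrame.duplicated); the reference above does not say how the library counts duplicates, so treat that as an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 1, 2, 3, 3, 3], "value": [10, 11, 20, 30, 31, 32]}
)
keys = ["id"]
msg_format = "Detected {dup_cnt}/{rec_cnt} duplicate records."

# Assumption: a record is a duplicate if an earlier record has the same keys.
dup_cnt = int(df.duplicated(subset=keys).sum())
rec_cnt = len(df)

message = msg_format.format(dup_cnt=dup_cnt, rec_cnt=rec_cnt)
print(message)
```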

mt.pandas.dataframe.filter_rows(df: DataFrame, s: Series, msg_format: str | None = None, logger: IndentedLoggerAdapter | None = None) → DataFrame

Returns df[s] but warns if the number of rows drops.

Parameters:
  • df (pandas.DataFrame) – a dataframe

  • s (pandas.Series) – the boolean series to filter the rows of df. Must be of the same size as df.

  • msg_format (str, optional) – the message to be logged. Two keyword arguments are provided when formatting it: ‘n_before’ and ‘n_after’.

  • logger (mt.logg.IndentedLoggerAdapter, optional) – logger for debugging purposes.
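The behaviour described above can be approximated in plain pandas with the standard logging module. This is a sketch under assumptions, not the library's code: filter_rows_sketch is a hypothetical stand-in, a standard logging.Logger replaces mt.logg.IndentedLoggerAdapter, and the warning is assumed to be emitted only when both msg_format and logger are given.

```python
import logging

import pandas as pd


def filter_rows_sketch(df, s, msg_format=None, logger=None):
    """Hypothetical stand-in for filter_rows using a standard logger."""
    n_before = len(df)
    out = df[s]  # boolean-mask row selection
    n_after = len(out)
    # Assumption: warn only when rows were dropped and both a message
    # format and a logger are available.
    if n_after < n_before and msg_format is not None and logger is not None:
        logger.warning(msg_format.format(n_before=n_before, n_after=n_after))
    return out


df = pd.DataFrame({"x": [1, -2, 3, -4]})
mask = df["x"] > 0  # boolean series of the same size as df

kept = filter_rows_sketch(
    df,
    mask,
    msg_format="Rows dropped: {n_before} -> {n_after}.",
    logger=logging.getLogger("filter_rows_sketch"),
)
print(kept)
```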