mt.pandas.convert

Functions

  • dfload_asyn(): An asyn function that loads a dataframe file based on the file’s extension.

  • dfload(): Loads a dataframe file based on the file’s extension.

  • dfsave_asyn(): An asyn function that saves a dataframe to a file based on the file’s extension.

  • dfsave(): Saves a dataframe to a file based on the file’s extension.

  • dfpack(): Packs a dataframe into a more compact format.

  • dfunpack(): Unpacks a compact dataframe into a more expanded format.

async mt.pandas.convert.dfload_asyn(df_filepath, *args, show_progress=False, unpack=True, parquet_convert_ndarray_to_list=False, file_read_delayed: bool = False, max_rows: int | None = None, nrows: int | None = None, context_vars: dict = {}, **kwargs) → DataFrame

An asyn function that loads a dataframe file based on the file’s extension.

Parameters:
  • df_filepath (str) – local path to an existing dataframe file. The file extension is used to determine the file type.

  • show_progress (bool) – whether to show a progress spinner in the terminal.

  • unpack (bool) – whether or not to unpack the dataframe after loading. Ignored for ‘.pdh5’ format.

  • parquet_convert_ndarray_to_list (bool) – whether or not to convert 1D ndarrays in the loaded parquet table into Python lists. Ignored for ‘.pdh5’ format.

  • file_read_delayed (bool) – whether or not some columns can be delayed for reading later. Only valid for ‘.pdh5’ format.

  • max_rows (int, optional) – limit the maximum number of rows to read. Only valid for ‘.csv’, ‘.pdh5’ and ‘.parquet’ formats. This argument is kept only for backward compatibility; use nrows instead.

  • nrows (int, optional) – limit the maximum number of rows to read. Only valid for ‘.csv’, ‘.pdh5’ and ‘.parquet’ formats.

  • context_vars (dict) – a dictionary of context variables within which the function runs. It must include context_vars[‘async’] to tell whether to invoke the function asynchronously or not. Ignored for ‘.pdh5’ format.

  • *args (tuple) – additional positional arguments to pass to the corresponding reader. Ignored for ‘.pdh5’ format.

  • **kwargs (dict) – dictionary of keyword arguments to pass to the corresponding reader. Ignored for ‘.pdh5’ format.

Returns:

loaded dataframe

Return type:

pandas.DataFrame

Notes

For ‘.csv’ or ‘.csv.zip’ files, we use mt.pandas.csv.read_csv(). For ‘.parquet’ files, we use pandas.read_parquet(). For ‘.pdh5’ files, we use mt.pandas.pdh5.load_pdh5_asyn().

Raises:

TypeError – if file type is unknown
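
Example

A minimal usage sketch, assuming a hypothetical local file data.parquet; the context_vars[‘async’] entry is required to tell the function whether to run asynchronously, as described above:

    import asyncio

    from mt.pandas.convert import dfload_asyn

    async def main():
        # Load at most 1000 rows asynchronously from a hypothetical file.
        df = await dfload_asyn(
            'data.parquet',
            show_progress=True,
            nrows=1000,
            context_vars={'async': True},
        )
        print(df.shape)

    asyncio.run(main())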

mt.pandas.convert.dfload(df_filepath, *args, show_progress=False, unpack=True, parquet_convert_ndarray_to_list=False, file_read_delayed: bool = False, max_rows: int | None = None, nrows: int | None = None, **kwargs) → DataFrame

Loads a dataframe file based on the file’s extension.

Parameters:
  • df_filepath (str) – local path to an existing dataframe file. The file extension is used to determine the file type.

  • show_progress (bool) – whether to show a progress spinner in the terminal.

  • unpack (bool) – whether or not to unpack the dataframe after loading. Ignored for ‘.pdh5’ format.

  • parquet_convert_ndarray_to_list (bool) – whether or not to convert 1D ndarrays in the loaded parquet table into Python lists. Ignored for ‘.pdh5’ format.

  • file_read_delayed (bool) – whether or not some columns can be delayed for reading later. Only valid for ‘.pdh5’ format.

  • max_rows (int, optional) – limit the maximum number of rows to read. Only valid for ‘.csv’, ‘.pdh5’ and ‘.parquet’ formats. This argument is kept only for backward compatibility; use nrows instead.

  • nrows (int, optional) – limit the maximum number of rows to read. Only valid for ‘.csv’, ‘.pdh5’ and ‘.parquet’ formats.

  • *args (tuple) – additional positional arguments to pass to the corresponding reader. Ignored for ‘.pdh5’ format.

  • **kwargs (dict) – dictionary of keyword arguments to pass to the corresponding reader. Ignored for ‘.pdh5’ format.

Returns:

loaded dataframe

Return type:

pandas.DataFrame

Notes

For ‘.csv’ or ‘.csv.zip’ files, we use mt.pandas.csv.read_csv(). For ‘.parquet’ files, we use pandas.read_parquet(). For ‘.pdh5’ files, we use mt.pandas.pdh5.load_pdh5_asyn().

Raises:

TypeError – if file type is unknown
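
Example

A minimal synchronous sketch with hypothetical file names; keyword arguments are forwarded to the underlying reader as described in the Notes, so columns below goes to pandas.read_parquet():

    from mt.pandas.convert import dfload

    # The extension selects the reader: mt.pandas.csv.read_csv() for '.csv'.
    df = dfload('data.csv', nrows=500, show_progress=True)

    # Keyword arguments are forwarded to the reader, pandas.read_parquet() here.
    df2 = dfload('data.parquet', columns=['a', 'b'])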

async mt.pandas.convert.dfsave_asyn(df, df_filepath, file_mode=436, show_progress=False, pack=True, context_vars: dict = {}, file_write_delayed: bool = False, **kwargs)

An asyn function that saves a dataframe to a file based on the file’s extension.

Parameters:
  • df (pandas.DataFrame) – a dataframe

  • df_filepath (str) – local path to which the dataframe will be saved. The file extension is used to determine the file type.

  • file_mode (int) – file mode to set via os.chmod(); the default 436 equals 0o664. If None is given, the file mode is not changed.

  • show_progress (bool) – whether to show a progress spinner in the terminal.

  • pack (bool) – whether or not to pack the dataframe before saving. Ignored for ‘.pdh5’ format.

  • context_vars (dict) – a dictionary of context variables within which the function runs. It must include context_vars[‘async’] to tell whether to invoke the function asynchronously or not. Ignored for ‘.pdh5’ format.

  • file_write_delayed (bool) – only valid in asynchronous mode. If True, the file write task is wrapped in a future and the future is returned; in all other cases the function proceeds as usual. Ignored for ‘.pdh5’ format.

  • **kwargs (dict) – dictionary of keyword arguments to pass to the corresponding writer. Ignored for ‘.pdh5’ format.

Returns:

either a future or the number of bytes written, depending on whether the file write task is delayed or not. For ‘.pdh5’ format, 1 is returned.

Return type:

asyncio.Future or int

Notes

For ‘.csv’ or ‘.csv.zip’ files, we use mt.pandas.csv.to_csv(). For ‘.parquet’ files, we use pandas.DataFrame.to_parquet(). For ‘.pdh5’ files, we use mt.pandas.pdh5.save_pdh5().

Raises:

TypeError – if file type is unknown or if the input is not a dataframe
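
Example

A minimal sketch of the delayed-write path, assuming a hypothetical output path out.parquet:

    import asyncio

    import pandas as pd
    from mt.pandas.convert import dfsave_asyn

    async def main():
        df = pd.DataFrame({'x': [1, 2, 3]})
        # With file_write_delayed=True the result may be a future; otherwise it
        # is the number of bytes written.
        res = await dfsave_asyn(
            df,
            'out.parquet',
            file_write_delayed=True,
            context_vars={'async': True},
        )
        nbytes = await res if asyncio.isfuture(res) else res
        print(nbytes)

    asyncio.run(main())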

mt.pandas.convert.dfsave(df, df_filepath, file_mode=436, show_progress=False, pack=True, **kwargs)

Saves a dataframe to a file based on the file’s extension.

Parameters:
  • df (pandas.DataFrame) – a dataframe

  • df_filepath (str) – local path to which the dataframe will be saved. The file extension is used to determine the file type.

  • file_mode (int) – file mode to set via os.chmod(); the default 436 equals 0o664. If None is given, the file mode is not changed.

  • show_progress (bool) – whether to show a progress spinner in the terminal.

  • pack (bool) – whether or not to pack the dataframe before saving.

  • **kwargs (dict) – dictionary of keyword arguments to pass to the corresponding writer.

Returns:

whatever the corresponding writer returns

Return type:

object

Notes

For ‘.csv’ or ‘.csv.zip’ files, we use mt.pandas.csv.to_csv(). For ‘.parquet’ files, we use pandas.DataFrame.to_parquet().

Raises:

TypeError – if file type is unknown or if the input is not a dataframe
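
Example

A minimal synchronous sketch, assuming a hypothetical output path out.csv.zip; 0o664 is the default file mode 436 written in octal:

    import pandas as pd

    from mt.pandas.convert import dfsave

    df = pd.DataFrame({'x': [1, 2, 3]})

    # The extension selects the writer: mt.pandas.csv.to_csv() for '.csv.zip'.
    dfsave(df, 'out.csv.zip', file_mode=0o664, pack=True)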

mt.pandas.convert.dfpack(df, spinner=None)

Packs a dataframe into a more compact format.

At the moment, it converts each ndarray column into 3 columns, and each cv.Image column into a json column.

Parameters:
  • df (pandas.DataFrame) – dataframe to be packed

  • spinner (Halo, optional) – spinner for tracking purposes

Returns:

output dataframe

Return type:

pandas.DataFrame
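
Example

A minimal sketch of packing a dataframe that has an ndarray column; the names of the generated columns are an implementation detail of the library:

    import numpy as np
    import pandas as pd

    from mt.pandas.convert import dfpack

    df = pd.DataFrame({'id': [0, 1], 'vec': [np.zeros(3), np.ones(3)]})

    # Per the note above, the ndarray column 'vec' is converted into 3 columns.
    packed = dfpack(df)
    print(packed.columns.tolist())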

mt.pandas.convert.dfunpack(df, spinner=None)

Unpacks a compact dataframe into a more expanded format.

This is the reverse function of dfpack().

Parameters:
  • df (pandas.DataFrame) – dataframe to be unpacked

  • spinner (Halo, optional) – spinner for tracking purposes

Returns:

output dataframe

Return type:

pandas.DataFrame
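
Example

A minimal round-trip sketch; since dfunpack() is the reverse of dfpack(), packing and then unpacking should restore the original columns:

    import numpy as np
    import pandas as pd

    from mt.pandas.convert import dfpack, dfunpack

    df = pd.DataFrame({'id': [0, 1], 'vec': [np.zeros(3), np.ones(3)]})

    # dfunpack() reverses dfpack(), so the round trip should restore 'vec'.
    restored = dfunpack(dfpack(df))
    print(restored.columns.tolist())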

Classes

  • Pdh5Cell: A read-only cell of a pdh5 column.

class mt.pandas.convert.Pdh5Cell(col: Pdh5Column, row_id: int)

A read-only cell of a pdh5 column.

Inheritance

Pdh5Cell (see mt.pandas.pdh5.Pdh5Cell)