file_io.ObjectDataReader

Base class for reading object-related data from a variety of sources and returning a numpy structured array.

Each subclass of ObjectDataReader must implement at least the functions _read_rows_internal and _read_objects_internal, both of which return a numpy structured array. Each data source needs to have a column ObjID that identifies the object and can be used for joining and filtering.

Caching is implemented in the base class. When enabled, the full table is lazily loaded into memory from the chosen data source, so it should only be used with smaller data sets. Both read_rows and read_objects check for a cached table before reading the files, which lets them perform direct numpy operations if the data is already in memory.
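The caching behavior described above can be sketched roughly as follows. This is a simplified stand-in, not the library's implementation: the class names SketchReader and InMemoryReader and the source_reads counter are hypothetical, and the post-processing and validation steps performed by the real read_rows are omitted.

```python
from abc import ABC, abstractmethod

import numpy as np


class SketchReader(ABC):
    """Hypothetical stand-in for the base class; not the library's code."""

    def __init__(self, cache_table=False):
        self._cache_table = cache_table
        self._table = None  # filled on first read when caching is enabled

    @abstractmethod
    def _read_rows_internal(self, block_start=0, block_size=None, **kwargs):
        """Source-specific read; returns a numpy structured array."""

    def read_rows(self, block_start=0, block_size=None, **kwargs):
        if self._cache_table:
            if self._table is None:
                # Lazy load: read the full table once, on first access.
                self._table = self._read_rows_internal(0, None, **kwargs)
            end = None if block_size is None else block_start + block_size
            return self._table[block_start:end]
        # No caching: go to the source for every request.
        return self._read_rows_internal(block_start, block_size, **kwargs)


class InMemoryReader(SketchReader):
    """Toy subclass that counts how often the 'source' is actually read."""

    def __init__(self, data, **kwargs):
        super().__init__(**kwargs)
        self._data = data
        self.source_reads = 0

    def _read_rows_internal(self, block_start=0, block_size=None, **kwargs):
        self.source_reads += 1
        end = None if block_size is None else block_start + block_size
        return self._data[block_start:end]


data = np.array(
    [("a", 1.0), ("b", 2.0), ("c", 3.0)],
    dtype=[("ObjID", "U8"), ("value", "f8")],
)
reader = InMemoryReader(data, cache_table=True)
first = reader.read_rows(block_start=1, block_size=1)
second = reader.read_rows(block_start=0, block_size=2)
```

With caching enabled, the second call is served from the in-memory table, so the source is read only once.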

Classes

ObjectDataReader

The base class for reading in the object data.

Module Contents

class ObjectDataReader(cache_table=False, format_column_name: str = None, required_columns: list[str] = [], primary_id_column_name: str = 'ObjID', station_column_name: str = 'stn', **kwargs)

Bases: abc.ABC

The base class for reading in the object data.

_cache_table = False

_table = None

_format_column_name = None

_required_columns = []

_primary_id_column_name = 'ObjID'

_station_column_name = 'stn'

abstractmethod get_reader_info()

Return a string identifying the current reader name and input information (for logging and output).

Returns: name – The reader information.
Return type: str

abstractmethod get_row_count()

Return the total number of rows in the input file.

Returns: row_count – The number of rows in the input file.
Return type: int

read_rows(block_start=0, block_size=None, **kwargs)

Reads in a set number of rows from the input, performs post-processing and validation, and returns a numpy structured array.

Parameters:

block_start (int, optional) – The 0-indexed row number from which to start reading the data. For example, in a CSV file, block_start=2 skips the first two rows after the header and returns data starting at row 2. Default = 0
block_size (int, optional) – The number of rows to read in. Use block_size=None to read in all available data. Default = None
**kwargs (dictionary, optional) – Extra arguments

Returns:

res – The data read in from the file.

Return type:

numpy structured array
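As a rough illustration of the block_start and block_size semantics, the sketch below reads a table in fixed-size blocks. The module-level read_rows function here is a hypothetical stand-in that only mimics the documented slicing behavior on an in-memory array, and len(table) plays the role of get_row_count().

```python
import numpy as np

# Toy in-memory "source" with five rows.
table = np.array(
    [(f"obj{i}", float(i)) for i in range(5)],
    dtype=[("ObjID", "U8"), ("value", "f8")],
)


def read_rows(block_start=0, block_size=None):
    """Hypothetical stand-in that mimics the documented slicing semantics."""
    end = None if block_size is None else block_start + block_size
    return table[block_start:end]


row_count = len(table)  # stands in for get_row_count()
block_size = 2

# Walk the table block by block; the last block may be shorter.
blocks = []
for block_start in range(0, row_count, block_size):
    blocks.append(read_rows(block_start=block_start, block_size=block_size))

sizes = [len(block) for block in blocks]
```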

abstractmethod _read_rows_internal(block_start=0, block_size=None, **kwargs)

Function to do the actual source-specific reading.

read_objects(obj_ids, **kwargs)

Read in a chunk of data corresponding to all rows for a given set of object IDs.

Parameters:

obj_ids (list) – A list of object IDs to use.
**kwargs (dictionary, optional) – Extra arguments

Returns:

res – The data read in from the file.

Return type:

numpy structured array
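When the table is already in memory, selecting all rows for a set of object IDs can be done with a boolean mask. The following is a minimal sketch of that idea, assuming the default ObjID column name; the helper select_objects is hypothetical and is not part of the documented API.

```python
import numpy as np

table = np.array(
    [("a", 0, 1.0), ("b", 0, 2.0), ("a", 1, 3.0), ("c", 0, 4.0)],
    dtype=[("ObjID", "U8"), ("epoch", "i4"), ("value", "f8")],
)


def select_objects(table, obj_ids, id_column="ObjID"):
    """Return every row whose ID column matches one of obj_ids."""
    mask = np.isin(table[id_column], obj_ids)
    return table[mask]


subset = select_objects(table, ["a", "c"])
```

Row order from the original table is preserved, and an object with several rows contributes all of them.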

abstractmethod _read_objects_internal(obj_ids, **kwargs)

Function to do the actual source-specific reading.

_validate_object_id_column(input_table)

Checks that the object ID column exists and converts it to a string. This is the common validity check for all object data tables.

Parameters: input_table (structured array) – A loaded table.
Returns: input_table – Returns the input table modified in-place.
Return type: structured array
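A check of this kind might look like the sketch below. It is an assumption about the general approach, not the library's code: the function name validate_object_id_column is hypothetical, and because the dtype of a single field is changed here by building a new structured array, this version returns a copy rather than modifying the input in place.

```python
import numpy as np


def validate_object_id_column(table, id_column="ObjID"):
    """Check that id_column exists and return a table with it cast to str."""
    names = table.dtype.names
    if names is None or id_column not in names:
        raise ValueError(f"Missing required column {id_column!r}")

    # Convert the ID values to strings, then rebuild the dtype around them.
    ids = table[id_column].astype(str)
    new_dtype = [
        (name, ids.dtype if name == id_column else table.dtype[name])
        for name in names
    ]

    # Copy every column into the new array, swapping in the string IDs.
    out = np.empty(table.shape, dtype=new_dtype)
    for name in names:
        out[name] = ids if name == id_column else table[name]
    return out


raw = np.array([(101, 1.5), (202, 2.5)], dtype=[("ObjID", "i8"), ("value", "f8")])
checked = validate_object_id_column(raw)
```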

_process_and_validate_input_table(input_table, **kwargs)

Perform any input-specific processing and validation on the input table. Modifies the input table in place.

Parameters:

input_table (structured array) – A loaded table.
**kwargs (dictionary, optional) – Extra arguments

Returns:

input_table – Returns the input table modified in-place.

Return type:

numpy structured array

Notes

The base implementation includes filtering that is common to most input types. Subclasses should call super()._process_and_validate_input_table() to ensure that the ancestor’s validation is also applied.

Additional arguments to use:

disallow_nan (boolean) – If True, checks the data for NaNs or nulls.
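The pattern of chaining validation through super() can be sketched as follows. The classes BaseSketch and PositiveValueSketch are hypothetical stand-ins, and the NaN check shown (floating-point columns only, raising ValueError) is an assumption about how such a check could work, not a description of the library's behavior.

```python
import numpy as np


class BaseSketch:
    """Hypothetical stand-in for the base class validation step."""

    def _process_and_validate_input_table(self, input_table, **kwargs):
        if kwargs.get("disallow_nan", False):
            # Assumption: only floating-point columns are checked for NaN.
            for name in input_table.dtype.names:
                column = input_table[name]
                if np.issubdtype(column.dtype, np.floating) and np.isnan(column).any():
                    raise ValueError(f"NaN found in column {name!r}")
        return input_table


class PositiveValueSketch(BaseSketch):
    """Hypothetical subclass that adds its own check after the ancestor's."""

    def _process_and_validate_input_table(self, input_table, **kwargs):
        # Run the ancestor's validation first so the common checks still apply.
        input_table = super()._process_and_validate_input_table(input_table, **kwargs)
        if (input_table["value"] <= 0).any():
            raise ValueError("value must be positive")
        return input_table


dtype = [("ObjID", "U8"), ("value", "f8")]
clean = np.array([("a", 1.0), ("b", 2.0)], dtype=dtype)
dirty = np.array([("a", 1.0), ("b", np.nan)], dtype=dtype)

reader = PositiveValueSketch()
validated = reader._process_and_validate_input_table(clean, disallow_nan=True)
```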