file_io.ObjectDataReader
Base class for reading object-related data from a variety of sources and returning a numpy structured array.
Each subclass of ObjectDataReader must implement at least the functions _read_rows_internal and _read_objects_internal, both of which return a numpy structured array. Each data source needs to have a column ObjID that identifies the object and can be used for joining and filtering.
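As a rough illustration of the subclassing contract described above, the sketch below uses a stand-in base class rather than the library's own ObjectDataReader. The names SketchReader and InMemoryReader and the sample table are hypothetical, and the real base class also performs post-processing and validation that is omitted here.

```python
from abc import ABC, abstractmethod

import numpy as np


class SketchReader(ABC):
    """Stand-in for the base class (hypothetical, not the library's code)."""

    def read_rows(self, block_start=0, block_size=None, **kwargs):
        # Public entry point delegates to the source-specific reader.
        return self._read_rows_internal(block_start, block_size, **kwargs)

    def read_objects(self, obj_ids, **kwargs):
        return self._read_objects_internal(obj_ids, **kwargs)

    @abstractmethod
    def _read_rows_internal(self, block_start=0, block_size=None, **kwargs):
        ...

    @abstractmethod
    def _read_objects_internal(self, obj_ids, **kwargs):
        ...


class InMemoryReader(SketchReader):
    """Hypothetical subclass whose data source is an array held in memory."""

    def __init__(self, table):
        self._table = table

    def _read_rows_internal(self, block_start=0, block_size=None, **kwargs):
        # block_size=None means "read everything from block_start onward".
        end = None if block_size is None else block_start + block_size
        return self._table[block_start:end]

    def _read_objects_internal(self, obj_ids, **kwargs):
        # Keep every row whose ObjID is one of the requested IDs.
        mask = np.isin(self._table["ObjID"], obj_ids)
        return self._table[mask]


table = np.array(
    [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0)],
    dtype=[("ObjID", "U8"), ("value", "f8")],
)
reader = InMemoryReader(table)
rows = reader.read_rows(block_start=1, block_size=2)
objs = reader.read_objects(["a"])
```

Both internal methods return a numpy structured array, and the ObjID column is what makes the per-object filter possible.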
Caching is implemented in the base class. When enabled, the full table
is lazily loaded into memory from the chosen data source, so caching
should only be used with smaller data sets. Both read_rows and
read_objects check for a cached table before reading the files, which
lets them operate directly on the in-memory numpy array.
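The caching behaviour can be pictured with a minimal sketch like the one below. It is not the library's implementation: CachedTable, load_full_table, and the call counter are hypothetical names introduced only to show that the full table is loaded once, on first use, and then served from memory.

```python
import numpy as np


class CachedTable:
    """Hypothetical illustration of lazy table caching (not the library's code)."""

    def __init__(self, load_full_table):
        self._load_full_table = load_full_table  # callable that reads everything
        self._table = None  # stays empty until the first read

    def _get_table(self):
        # Lazy load: only touch the data source if nothing is cached yet.
        if self._table is None:
            self._table = self._load_full_table()
        return self._table

    def read_rows(self, block_start=0, block_size=None):
        table = self._get_table()
        end = None if block_size is None else block_start + block_size
        return table[block_start:end]

    def read_objects(self, obj_ids):
        table = self._get_table()
        return table[np.isin(table["ObjID"], obj_ids)]


calls = {"n": 0}  # counts how often the "file" is actually read


def load_full_table():
    calls["n"] += 1
    return np.array(
        [("a", 1.0), ("b", 2.0), ("c", 3.0)],
        dtype=[("ObjID", "U8"), ("value", "f8")],
    )


cache = CachedTable(load_full_table)
first = cache.read_rows(block_start=0, block_size=2)
second = cache.read_objects(["c"])
```

After both calls the loader has run only once. The trade-off is that the whole table sits in memory, which is why the approach suits smaller data sets.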
Classes
ObjectDataReader | The base class for reading in the object data.
Module Contents
- class ObjectDataReader(cache_table=False, format_column_name: str = None, required_columns: list[str] = [], primary_id_column_name: str = 'ObjID', station_column_name: str = 'stn', **kwargs)[source]
Bases: abc.ABC
The base class for reading in the object data.
- abstractmethod get_reader_info()[source]
Return a string identifying the current reader name and input information (for logging and output).
- Returns:
name – The reader information.
- Return type:
str
- abstractmethod get_row_count()[source]
Return the total number of rows in the input file.
- Returns:
row_count – The number of rows in the input file.
- Return type:
int
- read_rows(block_start=0, block_size=None, **kwargs)[source]
Reads in a set number of rows from the input, performs post-processing and validation, and returns a numpy structured array.
- Parameters:
block_start (int, optional) – The 0-indexed row number from which to start reading the data. For example, in a CSV file block_start=2 skips the first two lines after the header and returns data starting at row 2. Default = 0
block_size (int, optional) – The number of rows to read in. Use block_size=None to read in all available data. Default = None
**kwargs (dictionary, optional) – Extra arguments
- Returns:
res – The data read in from the file.
- Return type:
numpy structured array
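One way block_start and block_size might be combined with get_row_count is to walk through the input in fixed-size blocks. The sketch below uses stand-in functions with the slicing semantics described above. The source table, the stand-in read_rows and get_row_count, and the chosen block size are assumptions for illustration, not the library's code.

```python
import numpy as np

# Stand-in table playing the role of the input source (hypothetical data).
source = np.array(
    [(f"obj{i}", float(i)) for i in range(7)],
    dtype=[("ObjID", "U8"), ("value", "f8")],
)


def read_rows(block_start=0, block_size=None):
    """Stand-in with the documented semantics: 0-indexed start, None reads all."""
    end = None if block_size is None else block_start + block_size
    return source[block_start:end]


def get_row_count():
    """Stand-in for the reader's total row count."""
    return len(source)


block_size = 3
blocks = []
for block_start in range(0, get_row_count(), block_size):
    # The final block may be shorter than block_size.
    blocks.append(read_rows(block_start=block_start, block_size=block_size))

# block_start=2 skips the first two data rows and starts at row 2.
from_row_two = read_rows(block_start=2)
```

With seven rows and a block size of three, the loop yields blocks of three, three, and one row.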
- abstractmethod _read_rows_internal(block_start=0, block_size=None, **kwargs)[source]
Function to do the actual source-specific reading.
- read_objects(obj_ids, **kwargs)[source]
Read in a chunk of data corresponding to all rows for a given set of object IDs.
- Parameters:
obj_ids (list) – A list of object IDs to use.
**kwargs (dictionary, optional) – Extra arguments
- Returns:
res – The data read in from the file.
- Return type:
numpy structured array
- abstractmethod _read_objects_internal(obj_ids, **kwargs)[source]
Function to do the actual source-specific reading.
- _validate_object_id_column(input_table)[source]
Checks that the object ID column exists and converts it to a string. This is the common validity check for all object data tables.
- Parameters:
input_table (structured array) – A loaded table.
- Returns:
input_table – The input table, modified in place.
- Return type:
structured array
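A check of this kind could look roughly like the sketch below. It is an assumption about the general approach, not the library's implementation: the helper name validate_object_id_column is hypothetical, and because the dtype of a single field in a numpy structured array cannot be changed in place, this sketch returns a new array when a conversion is needed.

```python
import numpy as np


def validate_object_id_column(input_table, id_column="ObjID"):
    """Hypothetical helper: require the ID column and return it as strings."""
    names = input_table.dtype.names or ()
    if id_column not in names:
        raise ValueError(f"Required column '{id_column}' not found in the table.")

    if input_table.dtype[id_column].kind == "U":
        return input_table  # already a unicode string column

    # Build a new array whose ID field is a string type and copy each column.
    id_strings = input_table[id_column].astype(str)
    new_dtype = [
        (name, id_strings.dtype if name == id_column else input_table.dtype[name])
        for name in names
    ]
    result = np.empty(input_table.shape, dtype=new_dtype)
    for name in names:
        result[name] = id_strings if name == id_column else input_table[name]
    return result


table = np.array([(101, 1.5), (202, 2.5)], dtype=[("ObjID", "i8"), ("value", "f8")])
checked = validate_object_id_column(table)
```

Storing IDs as strings means integer and text identifiers from different sources can be compared and joined consistently.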
- _process_and_validate_input_table(input_table, **kwargs)[source]
Perform any input-specific processing and validation on the input table. Modifies the input table in place.
- Parameters:
input_table (structured array) – A loaded table.
**kwargs (dictionary, optional) – Extra arguments
- Returns:
input_table – Returns the input table modified in-place.
- Return type:
numpy structured array
Notes
The base implementation includes filtering that is common to most input types. Subclasses should call super()._process_and_validate_input_table() to ensure that the ancestor’s validation is also applied.
Additional arguments to use:
- disallow_nan (boolean) – If True, checks the data for NaNs or nulls.
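A NaN check of the kind that disallow_nan enables might be sketched as follows. The helper name check_no_nans is hypothetical, and this version only inspects floating-point columns. How the library itself detects nulls in other column types is not specified here.

```python
import numpy as np


def check_no_nans(input_table):
    """Hypothetical helper: raise if any floating-point column contains NaN."""
    for name in input_table.dtype.names:
        column = input_table[name]
        # np.isnan is only defined for numeric data, so test float columns only.
        if column.dtype.kind == "f" and np.isnan(column).any():
            raise ValueError(f"Column '{name}' contains NaN values.")
    return input_table


dtype = [("ObjID", "U8"), ("value", "f8")]
clean = np.array([("a", 1.0), ("b", 2.0)], dtype=dtype)
dirty = np.array([("a", 1.0), ("b", np.nan)], dtype=dtype)

check_no_nans(clean)  # passes silently

try:
    check_no_nans(dirty)
    nan_detected = False
except ValueError:
    nan_detected = True
```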