File import parameters
======================

``str or file-like`` parameters specify where to read the input data
from. They can be:

* a file name.
* a file object, representing a file opened in binary mode.
* an object that acts like an instance of `io.BufferedIOBase`_. Reading from it
  returns bytes.

``format`` is a string specifying an input format.
The following input formats are supported:

* "msgpack" - the data is MessagePack_ serialized
* "json" - the data is JSON_ serialized.
* "csv" - the data is CSV, and will be read using the `Python CSV module`_.
* "tsv" - the data is TSV (tab separated data), and will be read using the
  `Python CSV module`_ with ``dialect=csv.excel_tab`` explicitly set.

.. _`io.BufferedIOBase`: https://docs.python.org/3/library/io.html#io.BufferedIOBase
.. _MessagePack: https://msgpack.org/
.. _JSON: https://www.json.org/
.. _`Python CSV module`: https://docs.python.org/3/library/csv.html

If ``.gz`` is appended to the format name (for instance, ``"json.gz"``) then
the data is assumed to be gzip compressed, and will be uncompressed as it is
read.

Both MessagePack and JSON data are composed of an array of records, where each
record is a dictionary (hash or mapping) of column name to column value.

In all import formats, every record must have a column named "time".

JSON data
---------

JSON data is read using the utf-8 encoding.

CSV data
--------

When reading CSV data, the following parameters may also be supplied,
all of which are optional:

* ``dialect`` specifies the CSV dialect. The default is ``csv.excel``.
* ``encoding`` specifies the encoding that will be used to turn the binary
  input data into string data. The default encoding is ``"utf-8"``
* ``columns`` is a list of strings, giving names for the CSV
  columns. The default is ``None``, meaning that the column names will be
  taken from the first record in the CSV data.
* ``dtypes`` is a dictionary used to specify a datatype for individual
  columns, for instance ``{"col1": "int"}``. The available datatypes are
  ``"bool"``, ``"float"``, ``"int"``, ``"str"`` and ``"guess"``, where
  ``"guess"`` means to use the function guess_csv_value_.
* ``converters`` is a dictionary used to specify a function that will be used
  to parse individual columns, for instace ``{"col1", int}``. The function
  must take a string as its single input parameter, and return a value of the
  required type.

If a column is named in both ``dtypes`` and ``converters``, then the function
given in ``converters`` will be used to parse that column.

If a column is not named in either ``dtypes`` or ``converters``, then it will
be assumed to have datatype ``"guess"``, and will be parsed with
guess_csv_value_.

Note that errors raised when calling a function from the ``converters``
dictionary will not be caught. So if ``converters={"col1": int}`` and "col1"
contains ``"not-an-int"``, the resulting ``ValueError`` will not be caught.

.. _guess_csv_value: api/misc.html#tdclient.util.guess_csv_value

To summarise, the default for reading CSV files is:

  ``dialect=csv.excel, encoding="utf-8", columns=None, dtypes=None, converters=None``

TSV data
--------

When reading TSV data, the parameters that may be used are the same as for
CSV, except that:

* ``dialect`` may not be specified, and ``csv.excel_tab`` will be used.

The default for reading TSV files is:

  ``encoding="utf-8", columns=None, dtypes=None, converters=None``