This page has instructions for migrating data from Avro files into CockroachDB using `IMPORT`.
For best practices for optimizing import performance in CockroachDB, see Import Performance Best Practices.
Step 1. Export data to Avro
Please refer to the documentation of your database for instructions on exporting data to Avro.
You will need to export one file per table, with the following requirements:
- Files must be self-contained object container files (OCF), or binary or JSON files containing one Avro record per line.
- Files must be UTF-8 encoded.
Data type mapping
Avro data types are flexibly mapped to the target schema; that is, Avro and CockroachDB SQL types do not need to match exactly. By default, CockroachDB ignores unknown Avro fields and sets any columns to `NULL` if they were not set in the Avro record.
Use the table below for data type mappings:
Avro Data Type | CockroachDB Data Type
---|---
`BOOL` | `BOOL`, `STRING`
`INT` | `INT`, `STRING`
`FLOAT` | `FLOAT`, `STRING`
`STRING` | `STRING`
`BYTES` | `BYTES`, `STRING`
`ARRAY` | `ARRAY`, `STRING`
`UUID` | `STRING`
`DATE` | `STRING`
`TIME` | `STRING`
`INTERVAL` | `STRING`
`TIMESTAMP` | `STRING`
`JSON` | `STRING`
`BIT` | `STRING`
`DECIMAL` | `STRING`
CockroachDB will attempt to convert the Avro data type to the CockroachDB data type; if the conversion is not possible, it will report an error.
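For example, per the table above, an Avro `TIMESTAMP` value can only land in a `STRING` column, while an Avro `INT` can land in either an `INT` or a `STRING` column. The following is a minimal sketch; the table, column, and file names are illustrative:

> IMPORT TABLE mapping_example (
    created_at STRING, -- Avro TIMESTAMP must map to STRING
    id INT             -- Avro INT can map to INT or STRING
)
AVRO DATA (
    's3://[bucket-placeholder]/mapping_example.avro?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]'
);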
Step 2. Host the files where the cluster can access them
Each node in the CockroachDB cluster needs to have access to the files being imported. There are several ways for the cluster to access the data; for more information on the types of storage `IMPORT` can pull from, see Import File Location.
We strongly recommend using cloud storage such as Amazon S3 or Google Cloud to host the data files you want to import.
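For example, the same `IMPORT` statement can read from cloud storage or, for local testing only, from a node's extern directory via `nodelocal`. A minimal sketch; the table and file names are illustrative:

> IMPORT TABLE test_import (id INT, name STRING)
AVRO DATA (
    'nodelocal://1/test_import.avro'
);

For production imports, replace the `nodelocal` URL with a cloud storage URL such as the `s3://` URLs used in the examples below.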
Step 3. Import the Avro
To import Avro data:
Import an object container file
An object container file (OCF) is a self-contained Avro file and includes both the table schema and records. For Avro OCF, there are two available import options (see the sketch after this list):

- `strict_validation`, which rejects Avro records that do not have a one-to-one data type mapping to the target schema. By default, CockroachDB ignores unknown Avro fields and sets missing SQL fields to `NULL`.
- New in v21.1: `row_limit`, which is useful for finding errors quickly before executing a more time- and resource-consuming import. Setting `row_limit = 'n'` will import the first n rows of a table.
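A minimal sketch that combines both options; the table, bucket, and file names are placeholders, and the row limit value is illustrative:

> IMPORT TABLE simple_employees (id INT, name STRING)
AVRO DATA (
    's3://[bucket-placeholder]/simple_employees.avro?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]'
)
WITH
    strict_validation, -- reject records that do not map one-to-one to the schema
    row_limit = '10';  -- import only the first 10 rows as a smoke test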
The following example uses sample data from Teradata. To import the data from `userdata1.avro` into an `employees` table, issue the following `IMPORT` statement:
> IMPORT TABLE employees (
registration_dttm STRING,
id INT,
first_name STRING,
last_name STRING,
email STRING,
gender STRING,
ip_address STRING,
cc INT,
country STRING,
birthdate STRING,
salary FLOAT,
title STRING,
comments STRING
)
AVRO DATA (
's3://[bucket-placeholder]/userdata1.avro?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]'
);
job_id | status | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+------+---------------+---------
535041064396062721 | succeeded | 1 | 1000 | 0 | 162825
(1 row)
> SELECT * FROM employees LIMIT 5;
registration_dttm | id | first_name | last_name | email | gender | ip_address | cc | country | birthdate | salary | title | comments
-----------------------+----+------------+-----------+--------------------------+--------+----------------+------------------+------------------------+------------+-----------+--------------------------+-----------
2016-02-03T07:55:29Z | 1 | Amanda | Jordan | ajordan0@com.com | Female | 1.197.201.2 | 6759521864920116 | Indonesia | 3/8/1971 | 49756.53 | Internal Auditor | 1E+02
2016-02-03T17:04:03Z | 2 | Albert | Freeman | afreeman1@is.gd | Male | 218.111.175.34 | NULL | Canada | 1/16/1968 | 150280.17 | Accountant IV |
2016-02-03T01:09:31Z | 3 | Evelyn | Morgan | emorgan2@altervista.org | Female | 7.161.136.94 | 6767119071901597 | Russia | 2/1/1960 | 144972.51 | Structural Engineer |
2016-02-03T12:36:21Z | 4 | Denise | Riley | driley3@gmpg.org | Female | 140.35.109.83 | 3576031598965625 | China | 4/8/1997 | 90263.05 | Senior Cost Accountant |
2016-02-03T05:05:31Z | 5 | Carlos | Burns | cburns4@miitbeian.gov.cn | | 169.113.235.40 | 5602256255204850 | South Africa | | NULL | |
(5 rows)
Repeat this process for each OCF you want to import.
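A long-running import can be monitored from another SQL session with `SHOW JOBS`, or with `SHOW JOB` and the job ID returned by `IMPORT` (the ID here is from the output above):

> SHOW JOB 535041064396062721;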
As of v21.2, `IMPORT TABLE` will be deprecated. We recommend using `CREATE TABLE` followed by `IMPORT INTO` to import data into a new table, as sketched below. For an example, read Import into a new table from a CSV file.

To import data into an existing table, use `IMPORT INTO`. You will need to run `ALTER TABLE ... ADD CONSTRAINT` to add any foreign key relationships.
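A minimal sketch of the `CREATE TABLE` followed by `IMPORT INTO` workflow; the table, column, and file names are illustrative:

> CREATE TABLE employees (
    id INT PRIMARY KEY,
    first_name STRING,
    last_name STRING
);

> IMPORT INTO employees (id, first_name, last_name)
AVRO DATA (
    's3://[bucket-placeholder]/employees.avro?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]'
);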
Import binary or JSON records
In addition to importing Avro OCF, you can also import a binary or JSON file containing Avro records:
- To import a binary file, use the `data_as_binary_records` option.
- To import a JSON file, use the `data_as_json_records` option.
The schema is not included in these types of files, so you need to specify it in one of two ways:

- Specify the schema inline with the `schema` option.
- Specify the schema by pointing to a schema file with the `schema_uri` option.
There are additional import options you can use when importing binary and JSON files (see the sketch after this list):

- `strict_validation`, which rejects Avro records that do not have a one-to-one data type mapping to the target schema. By default, CockroachDB ignores unknown Avro fields and sets missing SQL fields to `NULL`.
- `records_terminated_by`, which specifies the Unicode character used to indicate new lines in the input binary or JSON file (default: `\n`).
- New in v21.1: `row_limit`, which is useful for finding errors quickly before executing a more time- and resource-consuming import. Setting `row_limit = 'n'` will import the first n rows of a table.
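A minimal sketch of a binary-records import that uses these options together with `schema_uri`; all bucket and file names are placeholders:

> IMPORT TABLE simple (i INT, s STRING, b BYTES)
AVRO DATA (
    's3://[bucket-placeholder]/simple.bin?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]'
)
WITH
    data_as_binary_records,
    records_terminated_by = '\n', -- the default; shown here for completeness
    schema_uri = 's3://[bucket-placeholder]/simple-schema.json?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]';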
The following example uses sample data generated by Avro tools. To import the data from `simple-sorted.json` into a `simple` table, issue the following `IMPORT` statement:
> IMPORT TABLE simple
CREATE USING
's3://[bucket-placeholder]/simple-schema.sql?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]'
AVRO DATA (
's3://[bucket-placeholder]/simple-sorted.json?AWS_ACCESS_KEY_ID=[placeholder]&AWS_SECRET_ACCESS_KEY=[placeholder]'
)
WITH
data_as_json_records,
schema = '{ "type": "record",
"name": "simple",
"fields":
[
{ "name": "i", "type": "int" },
{ "name": "s", "type": "string" },
{ "name": "b", "type": ["null", "bytes"] }
]
}';
job_id | status | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+------+---------------+--------
535294613223669761 | succeeded | 1 | 1000 | 0 | 50372
(1 row)
> SELECT * FROM simple LIMIT 5;
i | s | b
--------------+-----------------------------------+---------------------------------------------...
-2135825688 | dcpamywjlvtohbbtbtpypubccu | \303\204\303\264\303\216\027\303\221\017\303...
-2135463332 | rmspluxnumigrpbrkfmuktphnmfskt | \303\232\017i>{b.~\302\277\177A\302\264\303\...
-2132354298 | mebfxrhurtngsqvlyjechuglymuxfjpvv | \303\2541E\302\277\302\2714\302\257\303\201\...
-2131856455 | hfrgfefflpopvtemrspaixitncghwqfrr | NULL
-2116408431 | thuosfwm | \016s\026\303\264\303\247\302\201\302\264o\3...
(5 rows)
Repeat this process for each binary or JSON file you want to import.
As of v21.2, `IMPORT TABLE` will be deprecated. We recommend using `CREATE TABLE` followed by `IMPORT INTO` to import data into a new table. For an example, read Import into a new table from a CSV file.

To import data into an existing table, use `IMPORT INTO`. You will need to run `ALTER TABLE ... ADD CONSTRAINT` to add any foreign key relationships, as sketched below.
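A minimal sketch of adding a foreign key after the import completes; the table, column, and constraint names are illustrative:

> ALTER TABLE employees
ADD CONSTRAINT fk_country
FOREIGN KEY (country_id) REFERENCES countries (id);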