Here is a summary of what I found along the way, written while my cluster is busy running.
What I have:
Raw input: Avro files with many different schemas mixed in one S3 bucket (this is bad, later I will tell you why)
Target: store the data in Redshift for querying (this is fine. I could work on Parquet directly, but storing it in a db-like place saves me a lot of trouble when exports and ad-hoc queries are badly needed in no time!)
So the choices of approach:
- parse the Avro into Parquet and use the Spark Redshift package to write into Redshift.
- load the Avro directly into Redshift via the COPY command
Choice 2 is better than Choice 1, because the Parquet-to-Redshift write actually converts the data back to Avro and stages it in S3 first. So Choice 1 costs two rounds of network I/O while Choice 2 needs only one, but with Choice 2 you need to create your table first (which is simple, and should be done anyway for better control over the schema).
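For Choice 2, the COPY statement can be assembled like this. A minimal sketch: the table name, manifest URL, and IAM role below are placeholders for your own values; `'auto'` tells Redshift to match Avro field names to column names.

```python
def build_avro_copy(table, manifest_url, iam_role, max_error=0):
    """Build a Redshift COPY statement that loads Avro files via a manifest.

    All arguments are caller-supplied placeholders. Note that MAXERROR only
    tolerates per-row load errors -- it does not help with corrupted files.
    """
    return (
        f"COPY {table} "
        f"FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS AVRO 'auto' "
        f"MANIFEST "
        f"MAXERROR {max_error};"
    )

sql = build_avro_copy(
    "events",                                      # hypothetical table
    "s3://my-bucket/manifests/load-001.json",      # hypothetical manifest
    "arn:aws:iam::123456789012:role/redshift-load" # hypothetical role
)
```

The generated string is then run through whatever SQL client you use against the cluster.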
What made me crazy:
- Mixing different Avro schemas into one S3 bucket is a bad, bad idea: the libraries currently can't load different schemas in one call, and they fail to detect that the schemas differ. Instead, the Spark Redshift package will throw a "schema column not found" error, and the COPY command will silently mash the different schemas together into a strange result.
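Since neither loader copes with mixed schemas, one workaround is to partition the file list by schema before loading, so each manifest (or Spark read) covers exactly one schema. A sketch, assuming you have already extracted each file's schema JSON from its Avro header (the `key_to_schema` mapping is hypothetical input):

```python
import json
from collections import defaultdict

def group_keys_by_schema(key_to_schema):
    """Group S3 keys so that every group shares one exact Avro schema.

    key_to_schema: dict mapping an S3 key to that file's schema as a JSON
    string (e.g. read from the file's Avro header). Schemas are canonicalized
    by re-serializing with sorted keys, so pure formatting differences
    don't split a group.
    """
    groups = defaultdict(list)
    for key, schema_json in key_to_schema.items():
        canonical = json.dumps(json.loads(schema_json), sort_keys=True)
        groups[canonical].append(key)
    return dict(groups)
```

Each resulting group can then be loaded with its own manifest and COPY call.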
- The Avro needs to be checked extensively: corrupted data will make the programs cry, and Redshift COPY's MAXERROR option has no effect on corrupted files. I saw two types of errors from COPY: 1. "Cannot init avro reader from s3 file Incorrect sync bytes"; 2. "Invalid AVRO file found. Unexpected end of AVRO file". They correspond in the Spark Redshift package to 1. "Invalid Sync!" and 2. silently empty Avro data.
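Both failure modes can be caught before COPY sees them. A stdlib-only sketch of such a pre-check, based on the Avro object container format (magic bytes, a header metadata map, then data blocks each terminated by the file's 16-byte sync marker); it detects both bad sync bytes and truncated files:

```python
import io
import json

def _read_long(stream, first=b""):
    """Decode one zigzag-varint-encoded long from the stream."""
    shift, acc, pending = 0, 0, first
    while True:
        if not pending:
            pending = stream.read(1)
            if not pending:
                raise ValueError("unexpected end of Avro file")
        byte, pending = pending[0], b""
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (acc >> 1) ^ -(acc & 1)
        shift += 7

def _read_bytes(stream):
    size = _read_long(stream)
    data = stream.read(size)
    if len(data) < size:
        raise ValueError("unexpected end of Avro file")
    return data

def check_avro(stream):
    """Walk an Avro object container file; return its schema or raise."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:  # header metadata is an Avro map<bytes>
        n = _read_long(stream)
        if n == 0:
            break
        if n < 0:  # negative count: a byte-size hint for the block follows
            n = -n
            _read_long(stream)
        for _ in range(n):
            key = _read_bytes(stream)
            meta[key] = _read_bytes(stream)
    sync = stream.read(16)
    schema = json.loads(meta[b"avro.schema"])
    while True:  # data blocks: record count, byte size, payload, sync marker
        first = stream.read(1)
        if not first:
            return schema  # clean EOF at a block boundary
        _read_long(stream, first)  # record count (not needed here)
        size = _read_long(stream)
        if len(stream.read(size)) < size:
            raise ValueError("unexpected end of Avro file")
        if stream.read(16) != sync:
            raise ValueError("incorrect sync bytes")
```

Running `check_avro` over each S3 object before building the manifest lets you quarantine the corrupted files instead of having the whole COPY fail.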
- For the COPY command, I use a manifest in S3. My advice is to use a different key for each manifest; don't try to overwrite an existing manifest. There are cases where the write did not overwrite but appended! Once a manifest is appended to, Redshift will append those records as well, creating duplicates.
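Generating a fresh key per load makes the overwrite problem impossible by construction. A small sketch (the `manifests/` prefix is an assumption; upload the body with your S3 client of choice, e.g. boto3's `put_object`, and point `COPY ... MANIFEST` at the new key):

```python
import json
import uuid

def build_manifest(s3_urls):
    """Redshift manifest body; mandatory=True makes COPY fail if a file is missing."""
    return json.dumps(
        {"entries": [{"url": u, "mandatory": True} for u in s3_urls]},
        indent=2,
    )

def unique_manifest_key(prefix="manifests/"):
    """A fresh key per load, so an old manifest is never overwritten or appended to."""
    return f"{prefix}{uuid.uuid4().hex}.json"
```

Old manifests can then be cleaned up on a schedule instead of being reused.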