
Possible error with TfRecordReader #1619

Closed
tmabraham opened this issue Feb 11, 2020 · 32 comments
@tmabraham

There is a Kaggle competition for TPUs but the data is provided as TFRecords.

I wanted to use PyTorch for this competition and use this amazing library. The library seems to have TFRecord support via TfRecordReader. However, please see this thread: it seems that TfRecordReader might not be reading the dataset properly. The image is not read in correctly, and multiple ids show up when it is supposed to be a single example. Could there be a bug in the way TfRecordReader loads the data?

Also, is it possible to add this officially to the library? Finally, is it possible to add support for reading multiple files in parallel, like tf.data.TFRecordDataset does?
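Until multi-file reading lands, a round-robin interleave over per-file iterators can be sketched in plain Python (the iterator protocol here is generic; wiring it up to per-file `read_example` loops is left as an assumption):

```python
def interleave(iterators):
    """Round-robin over several iterators, akin to what
    tf.data interleave does at a high level (minus the threading)."""
    active = list(iterators)
    while active:
        still_going = []
        for it in active:
            try:
                yield next(it)
                still_going.append(it)
            except StopIteration:
                pass  # this file is exhausted; drop it
        active = still_going

# e.g. three "files" yielding records in lockstep
merged = list(interleave([iter([1, 4]), iter([2, 5]), iter([3])]))
# → [1, 2, 3, 4, 5]
```

True parallelism would layer threads and a prefetch queue on top of this, but the ordering semantics are the same.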

@dlibenzi
Collaborator

About the image ... you get JPEG data, not a Tensor with image data.
You do have to decode the JPEG.

What do you mean "officially"?
Isn't it already?
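A minimal decode sketch, assuming the reader hands back the raw JPEG bytes as an int8 tensor (the `image/encoded` field name matches the dumps later in this thread; `decode_jpeg` itself is an illustrative helper, not a library API):

```python
import io

import numpy as np
import torch
from PIL import Image


def decode_jpeg(encoded: torch.Tensor) -> torch.Tensor:
    """int8 tensor of JPEG bytes -> HxWx3 uint8 image tensor."""
    # Reinterpret the signed bytes as unsigned and let PIL do the decode.
    raw = encoded.numpy().astype(np.uint8).tobytes()
    image = Image.open(io.BytesIO(raw)).convert("RGB")
    return torch.from_numpy(np.asarray(image))
```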

@ailzhang
Contributor

@tmabraham By "official", did you mean adding it to the documentation? https://pytorch.org/xla/#pytorch-xla-api

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 12, 2020

Yes, I think the APIs should be added to the official docs, since there's going to be a lot of activity now (thanks to the Kaggle competition and free TPUs)!
Plus, a link to the official docs at the top of the repo might help!

@tmabraham
Author

@dlibenzi Do you know of a good way to decode the JPEG data with PyTorch?

Also, I am not sure decoding is the main issue. There is still the problem that the function returns multiple ids even though it should be a single example/image. I say it should be a single image rather than, say, a batch of images, because only a single target class is returned.

However, I will still look into the decoding of the JPEG images. Thanks...

@dlibenzi
Collaborator

Here is a quick hack I had put together when I was testing it:

https://gist.github.com/dlibenzi/0075e27fca67ce31f7a6d701d77de48a

If I run it over an imagenet TFRecord file, I get:

image/colorspace	RGB
image/class/label	tensor([130])
image/class/synset	n02006656
image/channels	tensor([3])
image/object/bbox/label	tensor([], dtype=torch.int64)
image/width	tensor([455])
image/format	JPEG
image/height	tensor([500])
image/class/text	spoonbill
image/object/bbox/ymin	tensor([])
image/encoded	tensor([ -1, -40,  -1,  ..., -41,  -1, -39], dtype=torch.int8)
image/object/bbox/ymax	tensor([])
image/object/bbox/xmin	tensor([])
image/filename	n02006656_9965.JPEG
image/object/bbox/xmax	tensor([])


image/channels	tensor([3])
image/object/bbox/label	tensor([], dtype=torch.int64)
image/width	tensor([900])
image/format	JPEG
image/height	tensor([600])
image/class/text	ptarmigan
image/object/bbox/ymin	tensor([])
image/encoded	tensor([ -1, -40,  -1,  ..., -30,  -1, -39], dtype=torch.int8)
image/object/bbox/ymax	tensor([])
image/object/bbox/xmin	tensor([])
image/filename	n01796340_812.JPEG
image/object/bbox/xmax	tensor([])
image/colorspace	RGB
image/class/label	tensor([82])
image/class/synset	n01796340

...

@dlibenzi
Collaborator

Note that JPEG decoding is crucial to input pipeline scalability, so I am not sure the library I use in that test code is the best in class.
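As one illustration (not necessarily best in class either), decoding can be spread across threads with just the standard library; PIL releases the GIL during the C-level decode, so threads overlap usefully here. `decode_one` and `decode_batch` are illustrative names, and the worker count is an assumption:

```python
import io
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from PIL import Image


def decode_one(raw: bytes) -> np.ndarray:
    # Raw JPEG payload -> HxWx3 uint8 array.
    return np.asarray(Image.open(io.BytesIO(raw)).convert("RGB"))


def decode_batch(raws, workers=4):
    # Decode several payloads concurrently, preserving input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_one, raws))
```

For heavy pipelines, a dedicated decoder (libjpeg-turbo bindings, or a GPU/accelerator path) would replace `decode_one` without changing this structure.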

@tmabraham
Author

@dlibenzi Thank you for demonstrating how to successfully decode the image. @AdityaSoni19031997 has already created a nice Kaggle kernel demonstrating its use over here.

However, the actual data provided by TfRecordReader is still in the wrong format. Here is the format returned by PyTorch XLA:
[screenshot: decoded output for the train split]
[screenshot: decoded output for the validation split]

Here you can see there are multiple ids, and each image is duplicated. Even weirder, the dataset appears to be the same across all the files: the image data for the train and validation sets is identical, and the same images are plotted after decoding, which is definitely incorrect.

There definitely seems to be a bug somewhere, but I cannot pinpoint where exactly this is happening. I am not familiar with TFRecords and this competition is honestly the first time I have played around with TFRecords.

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 12, 2020

Also,

import torch_xla.core.xla_model as xm
xm.xrt_world_size()  # gives 1
devices = xm.get_xla_supported_devices()
# This call never completes; the notebook freezes, I guess

Not sure why TF says there are 8 replicas but PyTorch detects only 1? (If I understand that function correctly)

@jysohn23
Collaborator

@AdityaSoni19031997 You want to make sure you're using our multiprocessing API by calling xmp.spawn(...) first. Take a look at #1576.

Also if you're using Colab, take a look at our sample notebooks as well: https://github.com/pytorch/xla/tree/master/contrib/colab.

@tmabraham
Author

@dlibenzi @jysohn23 Do you have any tips for debugging the PyTorch XLA TfRecordReader?

I also plan to try NVIDIA DALI's TFRecord reader as well in the next couple of days.

@dlibenzi
Collaborator

Is the data in the correct format?
As you can see from the examples I posted, extracted from our TFRecord imagenet, there is no duplication.
Are the prints that you show above from a single record?
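To check how many records a file really holds, the TFRecord wire format (uint64 payload length, masked crc32c of the length, payload bytes, masked crc32c of the payload) can be walked without TensorFlow. This sketch skips the CRC verification a real reader performs:

```python
import struct


def iter_tfrecord(path):
    """Yield the raw payload bytes of each record in a TFRecord file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte little-endian length + 4-byte masked crc
            if len(header) < 12:
                return
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            f.read(4)  # trailing masked crc of the payload, unchecked here
            yield payload


def count_records(path):
    return sum(1 for _ in iter_tfrecord(path))
```

One record corresponds to one serialized `tf.train.Example` protobuf, so a count mismatch against the expected number of images points at the file, not the reader.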

@dlibenzi
Collaborator

Can you try using this file:

https://storage.cloud.google.com/pytorch-tpu-releases/davide/train-00012-of-01024

@tmabraham
Author

@dlibenzi is this a tfrecord file?

@dlibenzi
Collaborator

Yes.

@martin-gorner

martin-gorner commented Feb 14, 2020

In the decoded TFRecords you posted, @tmabraham, for training data "class" and "image" seem to be correct; "id" is not a label that exists in this dataset, so I do not know what is being returned there. I would venture that it's a case of bad error handling in the API.

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 14, 2020

Well, my experiment with the file @dlibenzi shared above works perfectly with the TFRecord reader for PyTorch XLA!

Now, coming to the Kaggle competition data: the data seems somewhat weird (only the id); the class and the image work correctly, as Martin said above. (Tested on 4 training tfrecs.)

What I am wondering is this: if you are starting from scratch trying to use TPUs with PyTorch XLA, does it need some special configuration?

Because when I did what was shared here, the code block just freezes! (NB: I followed the instructions we use on Colab for PyTorch XLA.) By "code block", I mean functions like xm.xla_device() just freeze...

Thanks.

@ailzhang
Contributor

@AdityaSoni19031997 Would you mind sharing a link to the colab where it freezes? I read through the thread but didn't find one. Thanks!

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 14, 2020

@ailzhang It runs very smoothly on Colab. Since the competition is on Kaggle, I was trying to use Kaggle Kernels with the accelerator set to TPU. But as of now, Kaggle Kernels probably don't support PyTorch XLA (yet). So I borrowed the installation section from the Colab notebooks (namely the "torch-xla==nightly" cell) and executed it in a Kaggle Kernel.

Would be great if you could check it out there. Sorry for the confusion.

@dlibenzi
Collaborator

What I have noticed is that some of the Colabs being shared do not have the correct runtime (TPU) selected.

@tmabraham
Author

Hello all,

I didn't get a chance to look into this further, but since there may be an additional error, I will try to explain more carefully what this new apparent error is.

Here is the code I am using in Kaggle Kernels to read in the files:

from kaggle_datasets import KaggleDatasets
import tensorflow as tf  # needed for tf.io.gfile.glob below

# Data access
GCS_DS_PATH = KaggleDatasets().get_gcs_path()

# Configuration
IMAGE_SIZE = [512, 512]
EPOCHS = 20
BATCH_SIZE = 16 * 1

GCS_PATH_SELECT = { # available image sizes
    192: GCS_DS_PATH + '/tfrecords-jpeg-192x192',
    224: GCS_DS_PATH + '/tfrecords-jpeg-224x224',
    331: GCS_DS_PATH + '/tfrecords-jpeg-331x331',
    512: GCS_DS_PATH + '/tfrecords-jpeg-512x512'
}
GCS_PATH = GCS_PATH_SELECT[IMAGE_SIZE[0]]

TRAINING_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/train/*.tfrec')
VALIDATION_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/val/*.tfrec')
TEST_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/test/*.tfrec')

# REFERENCE https://gist.github.com/dlibenzi/0075e27fca67ce31f7a6d701d77de48a

from PIL import Image
import numpy as np
import sys
import torch
import torch_xla.utils.tf_record_reader as tfrr


def decode(ex):
    w = 512 #ex['image/width'].item()
    h = 512 # ex['image/height'].item()
    image = Image.frombytes("RGB", (512, 512),
                          ex['image'].numpy().tobytes(),
                          "JPEG".lower(), 'RGB', None)
    return torch.from_numpy(np.asarray(image))

l, max_id = [], 19999
def readem(path):
    global l, max_id
    transforms = {}
    r = tfrr.TfRecordReader(path, compression='', transforms=transforms)
    count = 0
    
    while True:
        ex = r.read_example()
        if not ex:
            break 
            print('\n')
        for lbl, data in ex.items():
            print(ex)
            #print('{}\t{}'.format(lbl, data))
            l.append(decode(ex))
            count += 1
    print('\n\nDecoded {} samples'.format(count), max_id)
    return l

imgs = readem(path=TRAINING_FILENAMES[0])

Now this is what I get returned:

[screenshot: reader output with each example repeated]

You can see that in the boxed areas, the examples are being repeated thrice.

Even weirder and more worrisome is the following:

[screenshot: decoded training images]
[screenshot: decoded test images]

The training and testing images are coming up the same! It's very possible that there's an error in my code, so let me know, but I don't see one right now, which makes this really weird.

@dlibenzi and @martin-gorner any clue as to why this is happening?

Also, again there is still a problem with multiple ids. @martin-gorner posted over here that this should just be ignored, but if I understand correctly the dataset does contain an id field, and one is needed for proper submission of predictions. @martin-gorner, maybe this could be better addressed in the Kaggle forums?

Finally, I think @AdityaSoni19031997 tested the tfrecord file @dlibenzi shared, and he mentioned that it seems to work. However, I haven't had a chance to try it out yet and see if it works fine.

@dlibenzi
Collaborator

Your list l is global; can you try removing that?
You can try using TensorFlow and see which data you get.
But we use the TF record reader code, so it must be the same.

@martin-gorner

... the dataset does contain an id field ...

The training and validation datasets contain an "image" and a "class" field. The "class" field is of type int64 and there is no "id" field.

The test dataset contains an "image" and an "id" field. The "id" field is of type tf.string and there is no "class" field.

For reference, look at the Getting started with 100 flowers on TPU notebook, functions read_labeled_tfrecord and read_unlabeled_tfrecord.
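The split schemas above can be written down as a tiny check, assuming a decoded example arrives as a dict keyed by field name (`check_example` is an illustrative helper, not part of any library):

```python
def check_example(ex: dict, labeled: bool) -> bool:
    """Labeled (train/val) examples carry image+class; test carries image+id."""
    expected = {"image", "class"} if labeled else {"image", "id"}
    return set(ex) == expected


check_example({"image": b"\xff\xd8", "class": 42}, labeled=True)   # True
check_example({"image": b"\xff\xd8", "id": "abc"}, labeled=True)   # False
```

An "id" showing up in train/val output would fail this check, which matches the symptom reported earlier in the thread.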

@tmabraham
Author

@dlibenzi I tried not making the list global, but the problem is still the same.

@martin-gorner Ah, OK, true, but the test dataset's id field also has multiple values. Also, the test dataset seems to repeat and contains the same data as the training set.

I hear you are working with the PyTorch XLA team to get Kaggle TPUs working with PyTorch? Are the problems associated with this issue, or is there a different problem? If so, could you please describe what other issues need to be fixed by the Kaggle team in order to use PyTorch XLA on Kaggle TPUs?

@dlibenzi
Collaborator

The problem with imgs[0] is definitely the global-list problem.
You are appending the test set after the existing training data, so imgs[0] is going to be the same.
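The pitfall in miniature (`read_all_buggy` / `read_all_fixed` are illustrative stand-ins for the kernel's `readem`):

```python
l = []


def read_all_buggy(items):
    # Accumulates into the module-level list, so a second call
    # returns the first call's items too.
    global l
    for x in items:
        l.append(x)
    return l


def read_all_fixed(items):
    out = []  # fresh list per call
    for x in items:
        out.append(x)
    return out


train = read_all_buggy(["t1", "t2"])
test = read_all_buggy(["s1", "s2"])
# train and test now name the SAME list, so test[0] is 't1', not 's1'
```

Making the accumulator local to the function (or returning a new list per file) removes the cross-split contamination.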

@dlibenzi
Collaborator

So I created a new copy that creates an MD5 for every image:

https://gist.github.com/dlibenzi/0aa7d2b47aaffcd91c83cad70c080035

Then `grep HASH | sort | uniq -c` shows every hash with count 1 in the imagenet TFRecord file I posted above.

@tmabraham
Author

@dlibenzi Ah, you are right, I did make a mistake with the global variable. Sorry about that.
However, the other two errors (multiple ids, repeated examples) are still there.

@tmabraham
Author

@dlibenzi @martin-gorner Interestingly, Aditya showed over here that even though the hash codes are all unique, the images are repeated in the training set. Again, there could be an error in the code, but it seems like this could be an actual bug?

@dlibenzi
Collaborator

That code using MD5 is wrong again (you cannot cache a global MD5 object and keep updating it to get per-item hashes). Guys, stop using globals 😉

import hashlib

m = hashlib.md5()
m.update('1'.encode())
print(m.hexdigest())
m.update('1'.encode())
print(m.hexdigest())
m.update('1'.encode())
print(m.hexdigest())
$ python /tmp/md5_test.py
c4ca4238a0b923820dcc509a6f75849b
6512bd43d9caa6e02c990b0a82652dca
698d51a19d8a121ce581499d7b701668

I have tried with the example TFRecord I pulled from our internal imagenet repo (posted above), and there are no repeated images.
I did not sample the whole of imagenet, but that file did not seem to have any.
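The per-item fix is to construct a fresh hashlib.md5 for each payload instead of updating one cached object (`image_hashes` is an illustrative helper):

```python
import hashlib
from collections import Counter


def image_hashes(payloads):
    # One fresh md5 per payload, unlike the cached-object version above,
    # so identical payloads yield identical digests.
    return [hashlib.md5(p).hexdigest() for p in payloads]


counts = Counter(image_hashes([b"a", b"b", b"a"]))
# duplicate payloads now collide on the same digest, so they are countable
```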

@dlibenzi
Collaborator

So here a new version which saves to the /tmp/tf_images folder:

https://gist.github.com/dlibenzi/c9868a1090f6f8ef9d79d2cfcbadd8ab

And this is my folder:

https://storage.cloud.google.com/pytorch-tpu-releases/davide/tf_images.tar.gz

It's hard to check all these images, but from a quick look they seem OK.

[screenshot: grid of decoded images from /tmp/tf_images]

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 20, 2020

Well, my apologies, I had never used MD5 before :(

Thanks for all your help! I learnt a lot from you; that's what experience teaches you!

Thanks again :)

@dlibenzi
Collaborator

I would try using my TFRecord file to check your code experiments, because it could be that yours is not properly created (not the TFRecord format itself, otherwise we could not read it at all, but the content).

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 20, 2020

So I have now plotted the images: kernel link

Thanks a lot for the help @dlibenzi.

I guess we can close the issue now, @tmabraham (though we still need to figure out the id column!)
