
Possible error with TfRecordReader #1619

Closed
tmabraham opened this issue Feb 11, 2020 · 32 comments
@tmabraham

There is a Kaggle competition for TPUs but the data is provided as TFRecords.

I wanted to use PyTorch for this competition and use this amazing library. The library seems to have TFRecord support via TfRecordReader. However, please see this thread: it seems that TfRecordReader might not be reading the dataset properly. The image is not read in correctly, and multiple ids show up when it is supposed to be a single example. Could there be a bug in the way TfRecordReader loads the data?

Also, is it possible to add this officially to the library? Finally, is it possible to add support for reading multiple files in parallel, like tf.data.TFRecordDataset does?
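Until multi-file reading lands, a round-robin interleave over per-file iterators can be sketched in plain Python (the iterator protocol here is generic; wiring it up to per-file `read_example` loops is left as an assumption):

```python
def interleave(iterators):
    """Round-robin over several iterators, akin to what
    tf.data interleave does at a high level (minus the threading)."""
    active = list(iterators)
    while active:
        still_going = []
        for it in active:
            try:
                yield next(it)
                still_going.append(it)
            except StopIteration:
                pass  # this file is exhausted; drop it
        active = still_going

# e.g. three "files" yielding records in lockstep
merged = list(interleave([iter([1, 4]), iter([2, 5]), iter([3])]))
# → [1, 2, 3, 4, 5]
```

True parallelism would layer threads and a prefetch queue on top of this, but the ordering semantics are the same.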

@dlibenzi
Collaborator

About the image ... you get JPEG data, not a Tensor with image data.
You do have to decode the JPEG.

What do you mean "officially"?
Isn't it already?
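A minimal decode sketch, assuming the reader hands back the raw JPEG bytes as an int8 tensor (the `image/encoded` field name matches the dumps later in this thread; `decode_jpeg` itself is an illustrative helper, not a library API):

```python
import io

import numpy as np
import torch
from PIL import Image


def decode_jpeg(encoded: torch.Tensor) -> torch.Tensor:
    """int8 tensor of JPEG bytes -> HxWx3 uint8 image tensor."""
    # Reinterpret the signed bytes as unsigned and let PIL do the decode.
    raw = encoded.numpy().astype(np.uint8).tobytes()
    image = Image.open(io.BytesIO(raw)).convert("RGB")
    return torch.from_numpy(np.asarray(image))
```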

@ailzhang
Contributor

@tmabraham By "official", did you mean adding it to the documentation? https://pytorch.org/xla/#pytorch-xla-api

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 12, 2020

Yes, I think the APIs should be added to the official docs, since there's going to be a lot of activity now (thanks to the Kaggle competition and free TPUs)!
Plus, a link to the official docs at the top of the repo might help!

@tmabraham
Author

@dlibenzi Do you know of a good way to decode the JPEG data with PyTorch?

Also, I am not sure decoding is the main issue. There is still the problem that the function returns multiple ids even though it should be a single example/image. I say it should be a single image rather than, say, a batch of images, because only a single target class is returned.

However, I will still look into the decoding of the JPEG images. Thanks...

@dlibenzi
Collaborator

Here is a quick hack I had put together when I was testing it:

https://gist.github.com/dlibenzi/0075e27fca67ce31f7a6d701d77de48a

If I run it over an imagenet TFRecord file, I get:

image/colorspace	RGB
image/class/label	tensor([130])
image/class/synset	n02006656
image/channels	tensor([3])
image/object/bbox/label	tensor([], dtype=torch.int64)
image/width	tensor([455])
image/format	JPEG
image/height	tensor([500])
image/class/text	spoonbill
image/object/bbox/ymin	tensor([])
image/encoded	tensor([ -1, -40,  -1,  ..., -41,  -1, -39], dtype=torch.int8)
image/object/bbox/ymax	tensor([])
image/object/bbox/xmin	tensor([])
image/filename	n02006656_9965.JPEG
image/object/bbox/xmax	tensor([])


image/channels	tensor([3])
image/object/bbox/label	tensor([], dtype=torch.int64)
image/width	tensor([900])
image/format	JPEG
image/height	tensor([600])
image/class/text	ptarmigan
image/object/bbox/ymin	tensor([])
image/encoded	tensor([ -1, -40,  -1,  ..., -30,  -1, -39], dtype=torch.int8)
image/object/bbox/ymax	tensor([])
image/object/bbox/xmin	tensor([])
image/filename	n01796340_812.JPEG
image/object/bbox/xmax	tensor([])
image/colorspace	RGB
image/class/label	tensor([82])
image/class/synset	n01796340

...

@dlibenzi
Collaborator

Note that JPEG decoding is crucial to input pipeline scalability, so I am not sure the library I use in that test code is the best in class.
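As one illustration (not necessarily best in class either), decoding can be spread across threads with just the standard library; PIL releases the GIL during the C-level decode, so threads overlap usefully here. `decode_one` and `decode_batch` are illustrative names, and the worker count is an assumption:

```python
import io
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from PIL import Image


def decode_one(raw: bytes) -> np.ndarray:
    # Raw JPEG payload -> HxWx3 uint8 array.
    return np.asarray(Image.open(io.BytesIO(raw)).convert("RGB"))


def decode_batch(raws, workers=4):
    # Decode several payloads concurrently, preserving input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_one, raws))
```

For heavy pipelines, a dedicated decoder (libjpeg-turbo bindings, or a GPU/accelerator path) would replace `decode_one` without changing this structure.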

@tmabraham
Author

@dlibenzi Thank you for demonstrating how to successfully decode the image. @AdityaSoni19031997 has already created a nice Kaggle kernel demonstrating its use over here.

However, the actual data provided by TfRecordReader is still in the wrong format. Here is the format returned by PyTorch XLA:
[screenshot: decoded output for the train split]
[screenshot: decoded output for the validation split]

Here you can see there are multiple ids, and each image is duplicated. Even weirder, the dataset appears to be the same across all the files: the image data for the train and validation sets is identical, and the same images are plotted after decoding, which is definitely incorrect.

There definitely seems to be a bug somewhere, but I cannot pinpoint where exactly this is happening. I am not familiar with TFRecords and this competition is honestly the first time I have played around with TFRecords.

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 12, 2020

Also,

import torch_xla.core.xla_model as xm
xm.xrt_world_size()  # gives 1
devices = xm.get_xla_supported_devices()
# This call never completes; the notebook freezes, I guess

Not sure why TF says there are 8 replicas but PyTorch detects only 1? (If I understand that function correctly)

@jysohn23
Collaborator

@AdityaSoni19031997 You want to make sure you're using our multiprocessing API by calling xmp.spawn(...) first. Take a look at #1576.

Also if you're using Colab, take a look at our sample notebooks as well: https://github.com/pytorch/xla/tree/master/contrib/colab.

@tmabraham
Author

@dlibenzi @jysohn23 Do you have any tips for debugging the PyTorch XLA TfRecordReader?

I also plan to try NVIDIA DALI's TFRecord reader as well in the next couple of days.

@dlibenzi
Collaborator

Is the data in the correct format?
As you can see from the examples I posted, extracted from our TFRecord imagenet, there is no duplication.
Are the prints that you show above from a single record?
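To check how many records a file really holds, the TFRecord wire format (uint64 payload length, masked crc32c of the length, payload bytes, masked crc32c of the payload) can be walked without TensorFlow. This sketch skips the CRC verification a real reader performs:

```python
import struct


def iter_tfrecord(path):
    """Yield the raw payload bytes of each record in a TFRecord file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte little-endian length + 4-byte masked crc
            if len(header) < 12:
                return
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            f.read(4)  # trailing masked crc of the payload, unchecked here
            yield payload


def count_records(path):
    return sum(1 for _ in iter_tfrecord(path))
```

One record corresponds to one serialized `tf.train.Example` protobuf, so a count mismatch against the expected number of images points at the file, not the reader.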

@dlibenzi
Collaborator

Can you try using this file:

https://storage.cloud.google.com/pytorch-tpu-releases/davide/train-00012-of-01024

@tmabraham
Author

@dlibenzi is this a tfrecord file?

@dlibenzi
Collaborator

Yes.

@martin-gorner

martin-gorner commented Feb 14, 2020

In the decoded TFRecords you posted, @tmabraham, for training data "class" and "image" seem to be correct; "id" is not a label that exists in this dataset, so I do not know what is being returned there. I would venture that it's a case of bad error handling in the API.

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 14, 2020

Well, my experiment with the file @dlibenzi shared above works perfectly with the TFRecord reader for PyTorch XLA!

Now, coming to the Kaggle competition data: the data seems somewhat weird (only the id); the class and the image work correctly, as Martin said above. (Tested on 4 training tfrecs.)

What I am wondering is this: if you are starting from scratch trying to use TPUs with PyTorch XLA, does it need some special configuration?

Because when I did what was shared here, the code block just freezes! (NB: I followed the instructions we use on Colab for PyTorch XLA.) By "code block", I mean functions like xm.xla_device() just freeze...

Thanks.

@ailzhang
Contributor

@AdityaSoni19031997 Would you mind sharing a link to the colab where it freezes? I read through the thread but didn't find one. Thanks!

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 14, 2020

@ailzhang It runs very smoothly on Colab. Since the competition is on Kaggle, I was trying to use Kaggle Kernels with the accelerator set to TPU. But as of now, Kaggle Kernels probably don't support PyTorch XLA (yet). So I borrowed the installation section from the Colab notebooks (namely the "torch-xla==nightly" cell) and executed it in a Kaggle Kernel.

Would be great if you could check it out there. Sorry for the confusion.

@dlibenzi
Collaborator

What I have noticed is that some of the Colabs being shared do not have the correct runtime (TPU) selected.

@tmabraham
Author

Hello all,

I didn't get a chance to look into this further, but since there may be an additional error, I will try to explain more carefully what this new apparent error is.

Here is the code I am using in Kaggle Kernels to read in the files:

from kaggle_datasets import KaggleDatasets
import tensorflow as tf  # needed for tf.io.gfile.glob below

# Data access
GCS_DS_PATH = KaggleDatasets().get_gcs_path()

# Configuration
IMAGE_SIZE = [512, 512]
EPOCHS = 20
BATCH_SIZE = 16 * 1

GCS_PATH_SELECT = { # available image sizes
    192: GCS_DS_PATH + '/tfrecords-jpeg-192x192',
    224: GCS_DS_PATH + '/tfrecords-jpeg-224x224',
    331: GCS_DS_PATH + '/tfrecords-jpeg-331x331',
    512: GCS_DS_PATH + '/tfrecords-jpeg-512x512'
}
GCS_PATH = GCS_PATH_SELECT[IMAGE_SIZE[0]]

TRAINING_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/train/*.tfrec')
VALIDATION_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/val/*.tfrec')
TEST_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/test/*.tfrec')

# REFERENCE https://gist.github.com/dlibenzi/0075e27fca67ce31f7a6d701d77de48a

from PIL import Image
import numpy as np
import sys
import torch
import torch_xla.utils.tf_record_reader as tfrr


def decode(ex):
    w = 512 #ex['image/width'].item()
    h = 512 # ex['image/height'].item()
    image = Image.frombytes("RGB", (512, 512),
                          ex['image'].numpy().tobytes(),
                          "JPEG".lower(), 'RGB', None)
    return torch.from_numpy(np.asarray(image))

l, max_id = [], 19999
def readem(path):
    global l, max_id
    transforms = {}
    r = tfrr.TfRecordReader(path, compression='', transforms=transforms)
    count = 0
    
    while True:
        ex = r.read_example()
        if not ex:
            break 
            print('\n')
        for lbl, data in ex.items():
            print(ex)
            #print('{}\t{}'.format(lbl, data))
            l.append(decode(ex))
            count += 1
    print('\n\nDecoded {} samples'.format(count), max_id)
    return l

imgs = readem(path=TRAINING_FILENAMES[0])

Now this is what I get returned:

[screenshot: reader output with each example repeated]

You can see that in the boxed areas, the examples are being repeated thrice.

Even weirder and more worrisome is the following:

[screenshot: decoded training images]
[screenshot: decoded test images]

The training and testing images are coming up the same! It's very possible that there's an error in my code, so let me know, but I don't see one right now, which makes this really weird.

@dlibenzi and @martin-gorner any clue as to why this is happening?

Also, again there is still a problem with multiple ids. @martin-gorner posted over here that this should just be ignored, but if I understand correctly the dataset does contain an id field, and one is needed for proper submission of predictions. @martin-gorner, maybe this could be better addressed in the Kaggle forums?

Finally, I think @AdityaSoni19031997 tested the tfrecord file @dlibenzi shared, and he mentioned that it seems to work. However, I haven't had a chance to try it out yet and see if it works fine.

@dlibenzi
Collaborator

Your list l is global; can you try removing that?
You can try using TensorFlow and see which data you get.
But we use the TF record reader code, so it must be the same.

@martin-gorner

... the dataset does contain an id field ...

The training and validation datasets contain an "image" and a "class" field. The "class" field is of type int64 and there is no "id" field.

The test dataset contains an "image" and an "id" field. The "id" field is of type tf.string and there is no "class" field.

For reference, look at the Getting started with 100 flowers on TPU notebook, functions read_labeled_tfrecord and read_unlabeled_tfrecord.
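The split schemas above can be written down as a tiny check, assuming a decoded example arrives as a dict keyed by field name (`check_example` is an illustrative helper, not part of any library):

```python
def check_example(ex: dict, labeled: bool) -> bool:
    """Labeled (train/val) examples carry image+class; test carries image+id."""
    expected = {"image", "class"} if labeled else {"image", "id"}
    return set(ex) == expected


check_example({"image": b"\xff\xd8", "class": 42}, labeled=True)   # True
check_example({"image": b"\xff\xd8", "id": "abc"}, labeled=True)   # False
```

An "id" showing up in train/val output would fail this check, which matches the symptom reported earlier in the thread.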

@tmabraham
Author

@dlibenzi I tried not making the list global, but the problem is still the same.

@martin-gorner Ah, OK, true, but the test dataset's id field also has multiple values. Also, the test dataset seems to repeat and contains the same data as the training set.

I hear you are working with the PyTorch XLA team to get Kaggle TPUs working with PyTorch? Are the problems associated with this issue, or is there a different problem? If so, could you please describe what other issues need to be fixed by the Kaggle team in order to use PyTorch XLA on Kaggle TPUs?

@dlibenzi
Collaborator

The problem with imgs[0] is definitely the global-list problem.
You are appending the test set after the existing training data, so imgs[0] is going to be the same.
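The pitfall in miniature (`read_all_buggy` / `read_all_fixed` are illustrative stand-ins for the kernel's `readem`):

```python
l = []


def read_all_buggy(items):
    # Accumulates into the module-level list, so a second call
    # returns the first call's items too.
    global l
    for x in items:
        l.append(x)
    return l


def read_all_fixed(items):
    out = []  # fresh list per call
    for x in items:
        out.append(x)
    return out


train = read_all_buggy(["t1", "t2"])
test = read_all_buggy(["s1", "s2"])
# train and test now name the SAME list, so test[0] is 't1', not 's1'
```

Making the accumulator local to the function (or returning a new list per file) removes the cross-split contamination.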

@dlibenzi
Collaborator

So I created a new copy that creates an MD5 for every image:

https://gist.github.com/dlibenzi/0aa7d2b47aaffcd91c83cad70c080035

Then `grep HASH | sort | uniq -c` shows every hash with count 1 in the imagenet TFRecord file I posted above.

@tmabraham
Author

@dlibenzi Ah, you are right, I did make a mistake with the global variable. Sorry about that.
However, the other two errors (multiple ids, repeated examples) are still there.

@tmabraham
Author

@dlibenzi @martin-gorner Interestingly, Aditya showed over here that even though the hash codes are all unique, the images are repeated in the training set. Again, there could be an error in the code, but it seems like this could be an actual bug?

@dlibenzi
Collaborator

That code using MD5 is wrong again (you cannot cache a global MD5 object and keep updating it to get per-item hashes). Guys, stop using globals 😉

import hashlib

m = hashlib.md5()
m.update('1'.encode())
print(m.hexdigest())
m.update('1'.encode())
print(m.hexdigest())
m.update('1'.encode())
print(m.hexdigest())
$ python /tmp/md5_test.py
c4ca4238a0b923820dcc509a6f75849b
6512bd43d9caa6e02c990b0a82652dca
698d51a19d8a121ce581499d7b701668

I have tried with the example TFRecord I pulled from our internal imagenet repo (posted above), and there are no repeated images.
I did not sample the whole of imagenet, but that file did not seem to have any.
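The per-item fix is to construct a fresh hashlib.md5 for each payload instead of updating one cached object (`image_hashes` is an illustrative helper):

```python
import hashlib
from collections import Counter


def image_hashes(payloads):
    # One fresh md5 per payload, unlike the cached-object version above,
    # so identical payloads yield identical digests.
    return [hashlib.md5(p).hexdigest() for p in payloads]


counts = Counter(image_hashes([b"a", b"b", b"a"]))
# duplicate payloads now collide on the same digest, so they are countable
```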

@dlibenzi
Collaborator

So here a new version which saves to the /tmp/tf_images folder:

https://gist.github.com/dlibenzi/c9868a1090f6f8ef9d79d2cfcbadd8ab

And this is my folder:

https://storage.cloud.google.com/pytorch-tpu-releases/davide/tf_images.tar.gz

It's hard to check all these images, but from a quick look they seem OK.

[screenshot: grid of decoded images from /tmp/tf_images]

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 20, 2020

Well, my apologies, I had never used MD5 before :(

Thanks for all your help! I learnt a lot from you; that's what experience teaches you!

Thanks again :)

@dlibenzi
Collaborator

I would try using my TFRecord file to check your code experiments, because it could be that yours is not properly created (not the TFRecord format itself, otherwise we could not read it at all, but the content).

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 20, 2020

So I have now plotted the images: kernel link

Thanks a lot for the help @dlibenzi.

I guess we can close the issue now, @tmabraham (though we still need to figure out the id column!)
