Possible error with TfRecordReader #1619
Comments
About the image ... you get JPEG data, not a Tensor with image data. What do you mean "officially"?
@tmabraham By "official", did you mean we should add them to the documentation? https://pytorch.org/xla/#pytorch-xla-api
Yes, I think the APIs should be added to the official docs, since there's going to be a lot of activity now (thanks to the Kaggle comp and free TPUs)!
@dlibenzi Do you know of a good way to decode the JPEG data with PyTorch? Also, I am still not sure the decoding is the main issue here. There is still the problem that the function is returning multiple ids even though it should be a single example/image. I say it should be a single image rather than, say, a batch of images, because only a single target class is returned. However, I will still look into the decoding of the JPEG images. Thanks...
Here is a quick hack I had put together when I was testing it: https://gist.github.com/dlibenzi/0075e27fca67ce31f7a6d701d77de48a If I run it over an imagenet TFRecord file, I get:
Note that JPEG decoding is crucial to input pipeline scalability, so I am not sure the library I use in that test code is best in class.
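For what it's worth, one common way to turn the raw JPEG bytes stored in the "image" feature into a PyTorch tensor goes through PIL and torchvision; this is only a sketch of that approach, not necessarily what the gist above uses:

```python
import io

from PIL import Image
from torchvision import transforms

def decode_jpeg(jpeg_bytes):
    # jpeg_bytes is the raw byte string from the record's "image" feature
    # (convert to bytes first if the reader hands back a uint8 tensor).
    image = Image.open(io.BytesIO(jpeg_bytes)).convert('RGB')
    # CHW float tensor in [0, 1]; resizing/augmentation would go here.
    return transforms.ToTensor()(image)
```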
@dlibenzi Thank you for demonstrating how to successfully decode the image. @AdityaSoni19031997 has already created a nice Kaggle kernel demonstrating its use over here. However, the actual data provided by TfRecordReader is still in the wrong format. Here is the PyTorch XLA output returned: you can see there are multiple ids, and each image is duplicated. Even weirder is that the dataset seems to be the same for all the files (the image data for the train set and validation set is the same; the same images are plotted after decoding), which is definitely incorrect. There definitely seems to be a bug somewhere, but I cannot pinpoint where exactly this is happening. I am not familiar with TFRecords, and this competition is honestly the first time I have played around with them.
Also, I'm not sure why TF says there are 8 replicas but PyTorch detects only 1? (If I understand that function correctly.)
@AdityaSoni19031997 You want to make sure you're using our multiprocessing API. Also, if you're using Colab, take a look at our sample notebooks as well: https://github.com/pytorch/xla/tree/master/contrib/colab.
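For context, a minimal sketch of the multiprocessing entry point being referred to, assuming the standard xmp.spawn API; the per-process function body is just a placeholder:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process drives one TPU core.
    device = xm.xla_device()
    print(index, device, xm.xrt_world_size())

if __name__ == '__main__':
    # nprocs=8 uses all cores of a v2-8/v3-8; with a single process only one
    # replica is visible, which would explain the "8 vs 1" mismatch above.
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')
```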
Is the data in the correct format?
Can you try using this file: https://storage.cloud.google.com/pytorch-tpu-releases/davide/train-00012-of-01024 |
@dlibenzi is this a tfrecord file?
Yes.
In the decoded TFRecords you posted @tmabraham, for training data, "class" and "image" seem to be correct; "id" is not a label that exists in this dataset, so I do not know what is being returned there. I would venture that it's a case of bad error handling in the API.
Well, my experiment with the file @dlibenzi shared above works perfectly with the TFRecord reader for PyTorch-XLA! Now, coming to the Kaggle comp data, I feel the data is somewhat weird (only the id); the class and the image work correctly! (As Martin said above; tested on 4 training tfrecs.) But what I am wondering is this: if you are starting from scratch trying to use TPUs with PyTorch XLA, does it need some special configuration etc.? Because when I did what was shared here, the code block just freezes! (NB: I followed the instructions we use on Colab for PyTorch XLA.) By code blocks, I meant the functions mentioned above. Thanks.
@AdityaSoni19031997 Would you mind sharing a link to the Colab where it freezes? I read through the thread but didn't find one. Thanks!
@ailzhang It runs very smoothly on Colab. Since the competition is on Kaggle, I was trying to use Kaggle Kernels with the accelerator set to TPU. But as of now, Kaggle Kernels probably don't support PyTorch-XLA (yet). So I borrowed the installation section we have in the Colab notebooks (namely the "torch-xla==nightly" cell) and executed it in a Kaggle Kernel. It would be great if you guys could check it out there itself. I am sorry for the confusion.
What I have noticed is that some of the Colabs being shared do not have the correct runtime (TPU) selected.
Hello all, I didn't get a chance to look into this further, but since there seems to possibly be an additional error, I will try to explain more carefully what this new apparent error is. Here is the code I am using in Kaggle Kernels to read in the files:
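(A minimal sketch along these lines, modeled on the gist linked earlier rather than the exact snippet from this comment; the TfRecordReader constructor defaults and the read_example() return convention are assumptions worth double-checking:)

```python
import glob

import torch_xla.utils.tf_record_reader as tfrr

def read_records(pattern, limit=8):
    examples = []
    for path in sorted(glob.glob(pattern)):
        reader = tfrr.TfRecordReader(path)
        while len(examples) < limit:
            # read_example() is assumed to return a dict of feature name -> value,
            # and None once the file is exhausted.
            example = reader.read_example()
            if example is None:
                break
            examples.append(example)
    return examples

# Path is a placeholder for the competition's tfrecord files.
train_examples = read_records('path/to/train/*.tfrec')
```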
Now this is what I get returned: you can see that in the boxed areas, the examples are being repeated thrice. Even weirder and more worrisome is the following: the training and testing images are coming up the same! It's very much possible that there's an error in my code, so let me know, but I don't see any right now, so this is really weird. @dlibenzi and @martin-gorner, any clue as to why this is happening? Also, again, there is still a problem with multiple ids. @martin-gorner posted over here that this should just be ignored, but if I understand correctly the dataset does contain an id field (for the test set). Finally, I think @AdityaSoni19031997 had tested out the tfrecord file @dlibenzi shared, and he mentioned to me it seems to work. However, I haven't gotten a chance to try it out yet and see if it works fine.
You declare that list as a global, so it keeps accumulating entries across records.
The training and validation datasets contain an "image" and a "class" field. The "class" field is of type int64 and there is no "id" field. The test dataset contains an "image" and an "id" field. The "id" field is of type tf.string and there is no "class" field. For reference, look at the Getting started with 100 flowers on TPU notebook and its TFRecord parsing functions.
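In tf.data terms, the schema described above corresponds roughly to the following parsing specs (a paraphrase of what the notebook's parsing functions do; the constant and function names here are made up for illustration):

```python
import tensorflow as tf

# Train/validation records: an encoded JPEG and an int64 class label.
LABELED_TFREC_FORMAT = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "class": tf.io.FixedLenFeature([], tf.int64),
}

# Test records: an encoded JPEG and a string sample id (no class).
UNLABELED_TFREC_FORMAT = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "id": tf.io.FixedLenFeature([], tf.string),
}

def parse_labeled(example_proto):
    example = tf.io.parse_single_example(example_proto, LABELED_TFREC_FORMAT)
    image = tf.image.decode_jpeg(example["image"], channels=3)
    return image, example["class"]
```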
@dlibenzi I tried not making the list global, but still the same problem. @martin-gorner Ah, OK, true, but the test dataset id field also has multiple values. Also, the test dataset seems to repeat as well and has the same data as the training set. I hear you are working with the PyTorch XLA team to get Kaggle TPUs to work with PyTorch? Are those problems associated with this issue, or is there a different problem? If so, could you please describe what other issues need to be fixed by the Kaggle team in order to use PyTorch XLA on Kaggle TPUs?
The problem of
So I created a new copy that computes an MD5 for every image: https://gist.github.com/dlibenzi/0aa7d2b47aaffcd91c83cad70c080035 Then
@dlibenzi Ah, you are right, I did make a mistake with the global variable. Sorry about that.
@dlibenzi @martin-gorner Interestingly, Aditya showed over here that even though the hash codes are all unique, the images are repeated in the training set. Again, there could be an error in the code, but it seems like this could be an actual bug?
That code using MD5 is wrong again (you cannot cache a global MD5 object and keep updating it to get per-item hashes). Guys, stop using globals 😉

```python
import hashlib

# A single cached md5 object accumulates state across update() calls, so feeding
# the same data three times yields three *different* digests. "All hashes are
# unique" therefore proves nothing about the images being distinct.
m = hashlib.md5()
m.update('1'.encode())
print(m.hexdigest())
m.update('1'.encode())
print(m.hexdigest())
m.update('1'.encode())
print(m.hexdigest())
```
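A correct per-image hash would create a fresh digest object (or use hashlib's one-shot form) for each item; a small sketch:

```python
import hashlib

def image_md5(jpeg_bytes):
    # A new md5 object per item, so identical inputs yield identical digests.
    return hashlib.md5(jpeg_bytes).hexdigest()

assert image_md5(b'1') == image_md5(b'1')
```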
I have tried with the example TFRecord I pulled from our internal imagenet repo (posted above), and there are no repeated images.
So here is a new version which saves the decoded images out to a folder: https://gist.github.com/dlibenzi/c9868a1090f6f8ef9d79d2cfcbadd8ab And this is my folder: https://storage.cloud.google.com/pytorch-tpu-releases/davide/tf_images.tar.gz It's hard to check all these images, but from a quick look they seem OK.
Well, my apologies, I had never used md5 before :( Thanks for all your help! Learnt a lot of things from you; that's what experience teaches you! Thanks again :)
I would try using my TFRecord file to check your code experiments, because it could be that yours is not properly created (not the TFRecord format itself, otherwise we could not read it at all, but the content).
So I have now plotted the images (kernel link). Thanks a lot for the help @dlibenzi.
There is a Kaggle competition for TPUs but the data is provided as TFRecords.
I wanted to use PyTorch for this competition and use this amazing library. The library seems to have TFRecord support, with the TfRecordReader. However, please see this thread. It seems that the TfRecordReader might not be properly reading in the dataset: the image is not being read in properly, and multiple ids are showing up when it is supposed to be a single example. Could there be a bug in the way TfRecordReader is loading in the data? Also, is it possible to add this officially to the library? Finally, is it possible to add support for reading multiple files in parallel like tf.data.TFRecordDataset does?