
TF2: U-Net

One of the main decisions is how to train the vision model. We have an NVIDIA Jetson NX now, which can work on training in the background.

We will try Tensorflow 2 first, and if training is slow, we can try TensorFlow with TensorRT (TF-TRT).

But we’re starting from scratch. As the title suggests, we’re going to try to get U-Net working: a neural network shaped like a U, used for image segmentation.

So: a dev environment with virtual environments and pip? Or Docker?

Let’s try Docker first. Some instructions here and here…

https://github.com/NVIDIA/nvidia-docker

https://www.tensorflow.org/install/docker

docker pull tensorflow/tensorflow:latest-gpu-jupyter  # latest release w/ GPU support and Jupyter


# ok, but we need the NVIDIA Container Toolkit on the host:

sudo apt-get install curl

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get install -y nvidia-docker2

For the Jetson, we need to install the NVIDIA Container Toolkit so that containers get access to the host’s GPU.

Ok going for this one…

sudo docker pull tensorflow/tensorflow:2.4.1-gpu-jupyter

I prefer tagged versions to ‘latest’ because they’re probably more stable.

Working from a Jupyter Notebook will be a good way to preserve the code, and if we can use Docker, let’s do that, because containers are usually easier to deal with than virtual Python environments on a host. We’ll leave this for now, because we need to prepare the data.

OIDv6

In the meantime, I need to redo the OID (Open Images) download with bounding boxes or segmentation mask info. Let’s go straight for segmentation, using the method we tried before.

Need dev setup basics. Give me some curl and some pip3.

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

python3 get-pip.py

pip install openimages

WARNING: The script wheel is installed in '/home/chicken/.local/bin' which is not on PATH.

ok…

export PATH="/home/chicken/.local/bin:$PATH"

and again… pip install openimages

So we download some files with mask file names

wget https://storage.googleapis.com/openimages/v5/test-annotations-object-segmentation.csv
wget https://storage.googleapis.com/openimages/v5/validation-annotations-object-segmentation.csv
wget https://storage.googleapis.com/openimages/v5/train-annotations-object-segmentation.csv

I tried v6 in that URL, but nope. Whatever.

mkdir OID
mkdir OID/v6
cd OID/v6
mkdir csv
mkdir csv/full
mkdir images
mkdir images/Chicken
mkdir images/Chicken/train
mkdir images/Chicken/test
mkdir images/Chicken/validation
mkdir masks
mkdir masks/Chicken
mkdir masks/Chicken/train
mkdir masks/Chicken/test
mkdir masks/Chicken/validation
mkdir recordsTf
mkdir recordsTf/Chicken
mkdir recordsTf/Chicken/test
mkdir recordsTf/Chicken/train
mkdir recordsTf/Chicken/validation

Ok new website page. https://storage.googleapis.com/openimages/web/download.html

Ok seems like Google’s links are still using v5, so let’s stick with v5.

Need some egrep to find the related images.

egrep '/m/09b5t' csv/full/test-annotations-object-segmentation.csv | egrep -o ^[0-9a-f]* > csv/chicken-test-images-ids.txt

egrep '/m/09b5t' csv/full/validation-annotations-object-segmentation.csv | egrep -o ^[0-9a-f]* > csv/chicken-validation-images-ids.txt

egrep '/m/09b5t' csv/full/train-annotations-object-segmentation.csv | egrep -o ^[0-9a-f]* > csv/chicken-train-images-ids.txt

and now feed this into a downloader program. We could use the suggested downloader.py script, but I liked this bash function method. The downloader.py needs each file ID prefixed with its subdirectory, which is a bit annoying; in Linux you’d use sed to put the directory name in front of every line.

function getTestImages { echo wget $2 -O images/Chicken/test/$1.jpg >> csv/gettestimages.sh; }
export -f getTestImages

csvtool call getTestImages csv/test-images-urls.csv
bash csv/gettestimages.sh

function getValidationImages { echo wget $2 -O images/Chicken/validation/$1.jpg >> csv/getevaluationimages.sh; }
export -f getValidationImages

csvtool call getValidationImages csv/validation-images-urls.csv
bash csv/getevaluationimages.sh

function getTrainImages { echo wget $2 -O images/Chicken/train/$1.jpg >> csv/gettrainimages.sh; }
export -f getTrainImages

csvtool call getTrainImages csv/train-images-urls.csv
bash csv/gettrainimages.sh

This is a surprisingly epic task, all of this. Lots of Flickr accounts have closed since 2018, it seems. Lots of 404s.

But ultimately quite a few pics of chickens:

2.3G ./images/Chicken/train
88M ./images/Chicken/validation
323M ./images/Chicken/test
2.7G ./images/Chicken

Now I need the PNG files that are the masks for these images.

It seems like these come as 16 zip files.

wget https://storage.googleapis.com/openimages/v5/train-masks/train-masks-0.zip through train-masks-f.zip. There are 16 of them, but the indices go 0-9, then a-f.

So, ok how to automate this? bash or perl or python? ok..

for i in {0..9}; do wget https://storage.googleapis.com/openimages/v5/train-masks/train-masks-$i.zip; done

Well, good enough automation for now. If bash looped in hex I could do 0-f in one go. Let’s compromise; I could have copy-pasted the rest in the time this took.

for i in {'a','b','c','d','e','f'}; do wget https://storage.googleapis.com/openimages/v5/train-masks/train-masks-$i.zip; done

They’re 262MB each file.
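
(For the record, since I was debating bash vs perl vs python: the whole 0-f loop in Python might look something like this. A sketch only; wget did the job.)

import urllib.request

# download all 16 train-masks zips: indices 0-9 then a-f
for i in '0123456789abcdef':
    name = 'train-masks-%s.zip' % i
    url = 'https://storage.googleapis.com/openimages/v5/train-masks/' + name
    urllib.request.urlretrieve(url, name)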

unzip '*.zip'

2686684 files… yikes

Ok, I need to find the PNG masks associated with the JPG images. I can work this out, but I am flying blind. Chicken is /m/09b5t –

ls -l | grep 09b5t

ls -l | grep 09b5t | wc -l

shows 2237 masks for Chickens. But we only have 1324 images of Chickens.

Ok, I need to see pics on the Jetson. Ultimately an RDP (remote desktop protocol) connection would be best? VNC server is an old code, but it checks out. Followed these instructions, and connected to 192.168.101.109:5901

Nope. It’s comically small at 640×480.

VNC listening on port 5901

Ok but yeah I guess I just wanted to see the pictures. But this isn’t really necessary yet, or practical over VNC. I want to verify that the PNG mask corresponds to the JPG image contents. I’ll probably use a Jupyter Notebook ultimately. (I do end up using Jupyter Lab.)

We’re configuring Tensorflow 2 or PyTorch to train some convolutional network with this segmentation data.

The mappings are in these files:

train-annotations-object-segmentation.csv
test-annotations-object-segmentation.csv
validation-annotations-object-segmentation.csv

It’s got the mappings, plus some extra factoids about where the Google data-entry annotators clicked with their wand selection tool, and a “Predicted IoU”, which is a big topic on its own. We should only need the image-to-segmentation-file mapping.

  • MaskPath: name of the corresponding mask image.
  • ImageID: the image this mask lives in.
  • LabelName: the MID of the object class this mask belongs to.
  • BoxID: an identifier for the box within the image.
  • BoxXMin, BoxXMax, BoxYMin, BoxYMax: coordinates of the box linked to the mask, in normalized image coordinates. Note that this is not the bounding box of the mask, but the starting box from which the mask was annotated. These coordinates can be used to relate the mask data with the boxes data.
  • PredictedIoU: if present, indicates a predicted IoU value with respect to ground-truth. This quality estimate is machine-generated based on human annotator behaviour. See [3] for details.
  • Clicks: if present, indicates the human annotator clicks, which provided guidance during the annotation process we carried out (See [3] for details). This field is encoded using the following format: X1 Y1 T1;X2 Y2 T2;X3 Y3 T3;… Xi Yi are the coordinates of the click in normalized image coordinates. Ti is the click type: value 0 indicates the annotator marks the point as background, value 1 as part of the object instance (foreground). These clicks can be interesting for researchers in the field of interactive segmentation. They are not necessary for users interested in the final masks only.

Ok, the mask filename starts with the image ID. Easy enough.

MaskPath,ImageID,LabelName,BoxID,BoxXMin,BoxXMax,BoxYMin,BoxYMax,PredictedIoU,Clicks
677c122b0eaa5d16_m04yx4_9a041d52.png,677c122b0eaa5d16,/m/04yx4,9a041d52,0.8875,0.960938,0.454167,0.720833,0.86864,0.95498 0.65197 1;0.89370 0.56579 1;0.94701 0.48968 0;0.91049 0.70010 1;0.93927 0.47160 1;0.90269 0.56068 0;0.92061 0.70749 0;0.92509 0.64628 0;0.92248 0.65188 1;0.93042 0.46071 1;0.93290 0.71142 1;0.94431 0.48783 0
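
This should also be easy enough to slice with pandas, if we want the mapping programmatically; a sketch, assuming the CSV lives in csv/full/ like the egrep commands above:

import pandas as pd

# filter the segmentation CSV down to chicken rows (/m/09b5t),
# then build the ImageID -> list-of-MaskPath mapping
df = pd.read_csv('csv/full/train-annotations-object-segmentation.csv')
chickens = df[df['LabelName'] == '/m/09b5t']
masks_per_image = chickens.groupby('ImageID')['MaskPath'].apply(list)
print(masks_per_image.head())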

We have our images downloaded…

Ok, the masks folder is too big, though. Let’s just do Chicken, ok? So we’ll delete any PNGs that don’t have m09b5t in their filename, and delete these zip files.

find . -type f -print0 | xargs --null grep -Z -L 'm09b5t' | xargs --null rm

Lol, that deleted everything. Oops. Don’t do that. (grep -L lists the files that don’t contain the pattern, and it searches file contents, not filenames, so it matched virtually every file.) Ok, download again…

We’ll process zip files one at a time.

 unzip train-masks-0.zip -d ./masks   (1 minute passes)
 cd masks
 find \! -name '*m09b5*png' -delete (30 seconds)
 mv * ../Chicken 

1…2….3…

OK unzipstuff.sh

I automated the process.

chicken@jetson:~/OID/v6$ cat unzipstuff.sh

#!/bin/bash
for i in 1 2 3 4 5 6 7 8 9 a b c d e f
do
  eval "unzip train-masks-$i.zip -d masks/"
  cd masks
  find \! -name '*m09b5*png' -delete
  mv /home/chicken/OID/v6/masks/* /home/chicken/OID/v6/Chicken
  cd ..
done
I need to display the information somehow. Jupyter Lab (Notebooks) is probably the best way to display code, and run it interactively.


chicken@jetson:~$ jupyter notebook --generate-config
Writing default config to: /home/chicken/.jupyter/jupyter_notebook_config.py
chicken@jetson:~$ jupyter-lab

Ok, so I wasn’t sure why I couldn’t connect to the server on the Jetson directly, but I’m able to reach it at http://localhost:8888/ through an SSH tunnel:

ssh -L 8888:127.0.0.1:8888 chicken@192.168.101.109

I’m not sure what the difference between Lab and Notebook is, exactly, yet, either. But I think Notebook is a subset of Lab.

Ok, so I’m trying to match JPGs and PNGs. Some interesting data: multiple masks for some images, and no masks for others.

I set up SAMBA to copy files over and investigate.

I see. The disturbing part is that no images in my test and validation folders matched any masks. But all of the train images had a match…

OH. train, validation and test ALL have their own 16 zip files of masks.

Good thing I automated that… ok, so: same thing, but changing ‘train’ to ‘validation’ and ‘test’.

I did a programmatic test on the directories to see if any images were missing a mask:

import os
import glob

# test_images_dir / test_masks_dir point at the image and mask folders above
for fname in os.listdir(test_images_dir):
    if len(glob.glob(test_masks_dir + "*" + fname[:-4] + "*")) == 0:
        print(fname)  # this image has no matching mask

It’s looking better. Still some missing, but good enough now: missing 6 validation masks and 12 test masks. All training images have at least one mask.

Number of Train images: 1122
Number of Train masks: 2237
Number of validation images: 44
Number of validation masks: 59
02a0f2858f27a7ba.jpg
01463f5494340d3d.jpg
00e71a70a2f669ff.jpg
05887f57bc232041.jpg
0d3da02e79f84dde.jpg
0ed7092c41c81d14.jpg
Number of test images: 154
Number of test masks: 186
0e9be8b09f71f909.jpg
0913fbf6fa5c190e.jpg
0f8a38312499d209.jpg
0650a130d7f707b5.jpg
0a8a5aa471796fd5.jpg
0cc4722ca906f86c.jpg
04423d3f6f5b8e74.jpg
03bc7fbc956b3a9a.jpg
07621394c8ad0b47.jpg
000411001ff7dd4f.jpg
0e5ecc56e464dcb8.jpg
05600e8a393e3c3a.jpg

I’ll move these ones out of the folder.

mkdir ~/backup
cd /home/chicken/OID/v6/images/Chicken/validation/
mv 02a0f2858f27a7ba.jpg ~/backup
mv 01463f5494340d3d.jpg ~/backup
mv 00e71a70a2f669ff.jpg ~/backup
mv 05887f57bc232041.jpg ~/backup
mv 0d3da02e79f84dde.jpg ~/backup
mv 0ed7092c41c81d14.jpg ~/backup
cd /home/chicken/OID/v6/images/Chicken/test/
mv 0e9be8b09f71f909.jpg ~/backup
mv 0913fbf6fa5c190e.jpg ~/backup
mv 0f8a38312499d209.jpg ~/backup
mv 0650a130d7f707b5.jpg ~/backup
mv 0a8a5aa471796fd5.jpg ~/backup
mv 0cc4722ca906f86c.jpg ~/backup
mv 04423d3f6f5b8e74.jpg ~/backup
mv 03bc7fbc956b3a9a.jpg ~/backup
mv 07621394c8ad0b47.jpg ~/backup
mv 000411001ff7dd4f.jpg ~/backup
mv 0e5ecc56e464dcb8.jpg ~/backup
mv 05600e8a393e3c3a.jpg ~/backup

Ok and now all the images have masks!

Number of Train images: 1122 
Number of Train masks: 2237 
Number of validation images: 38 
Number of validation masks: 59 
Number of test images: 142 
Number of test masks: 186

Momentous. Looking at the Nicolas Windt article, there might be some dead links, leaving zero-byte files. So let’s delete those images too.

find -size 0 -delete

Number of Train images: 982 
Number of Train masks: 2237 
Number of validation images: 32 
Number of validation masks: 59 
Number of test images: 130 
Number of test masks: 186

Oof, still good. Let’s load a picture in Jupyter. Ok, tensorflow has a load_img function.

No module named 'tensorflow'

Right. We tried installing it with Docker. How will that even work from the host? Eish, gotta read up on this.

Back to Tensorflow.

Ok, I already downloaded an NVIDIA-friendly tensorflow image 3 weeks ago. Well, things move slowly, but all incremental gains move things forward. With experience you learn ways not to do things.

chicken@jetson:~/OID/v6/images$ sudo docker images
REPOSITORY                 TAG                 IMAGE ID       CREATED        SIZE
tensorflow/tensorflow      2.4.1-gpu-jupyter   64d8717296f8   3 weeks ago    5.71GB
dustynv/jetson-inference   r32.5.0             ccc2a5f19dad   3 weeks ago    2.89GB
nvidia/cuda                11.0-base           2ec708416bb8   5 months ago   122MB

Ok, the TF2 instructions say…

Run a Jupyter notebook server with your own notebook directory (assumed here to be ~/notebooks). To use it, navigate to localhost:8888 in your browser:

$ docker run -it --rm -v $(realpath ~/notebooks):/tf/notebooks -p 8888:8888 tensorflow/tensorflow:latest-jupyter

So, in our case…

$ docker run -it --rm -v ~/notebooks:/tf/notebooks -p 8888:8888 tensorflow/tensorflow:2.4.1-gpu-jupyter

Error...

standard_init_linux.go:211: exec user process caused "exec format error"

And pip?

chicken@jetson:~$ pip3 install tensorflow
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement tensorflow
ERROR: No matching distribution found for tensorflow

Great. Sanity check…

docker run -it --rm tensorflow/tensorflow bash

standard_init_linux.go:211: exec user process caused "exec format error"

Ok. Right, the Jetson is aarch64, not x86-64… so Google is suggesting Archiconda. That’s too much for now. What’s wrong with pip? Python 3.6.9 is supposed to work with TF 2.4.1 (https://pypi.org/project/tensorflow/)… hmm, I guess there’s just no precompiled aarch64 build of TF2.

So… one option is to switch to PyTorch. The other option is to try Archiconda. Instead, I’m going to try this: https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml

“The Machine learning container contains TensorFlow, PyTorch, JupyterLab, and other popular ML and data science frameworks such as scikit-learn, scipy, and Pandas pre-installed in a Python 3.6 environment. Get started on your AI journey quickly on Jetson with everything pre-installed in this container.”

docker pull nvcr.io/nvidia/l4t-ml:r32.5.0-py3

sudo docker run -it --rm --runtime nvidia --network host -v /home/chicken/OID:/opt/OID -v /home/chicken/notebooks:/opt/notebooks nvcr.io/nvidia/l4t-ml:r32.5.0-py3

ok now we’re cooking. (No chickens were cooked during the making of this.)

So now I’m back on track, at like step 0.

I’m working off the Keras U-Net code now, from https://keras.io/examples/vision/oxford_pets_image_segmentation/ because U-Net, from 2015, is one of the simplest segmentation CNNs out there. I’ve also opened up another implementation, because it has more useful examples for training.

Note, though, that due to U-Net’s simplicity, it is often used for medical computer vision applications, since there’s not so much deep learning magic going on. You can quite easily imagine the latent representation dwelling, somehow, at the bottom of the U-shaped neural network. It should give us something interesting.

Let’s find the latent representation of a chicken.

We need to correlate the images and masks. We can glob by file name; that’s probably as good as anything. But we should put the result in a map from an image filename to a list of mask filenames, since it’s one image, many masks. (Python calls maps ‘dictionaries’.)
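
Something like this sketch, with the directory layout from earlier (the variable names are mine):

import os
import glob

train_images_dir = '/home/chicken/OID/v6/images/Chicken/train/'
train_masks_dir = '/home/chicken/OID/v6/masks/Chicken/train/'

# one image id -> many mask files
image_to_masks = {}
for fname in os.listdir(train_images_dir):
    image_id = fname[:-4]  # strip the .jpg
    image_to_masks[image_id] = glob.glob(train_masks_dir + image_id + '*.png')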

Ok amazing, that works. I can see image and mask, and they correspond.

At some point I need to transform these. Make them all 256×256 pixels or something like that. Hmm.
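
Probably something like this, with PIL (a sketch; the filenames are hypothetical). Bilinear resampling for the photo, but nearest-neighbour for the mask, so we don’t invent in-between grey values at the mask edges:

from PIL import Image

img = Image.open('some_chicken.jpg').resize((256, 256), Image.BILINEAR)
mask = Image.open('some_chicken_mask.png').resize((256, 256), Image.NEAREST)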

OK, I got the training running. I got the Jetson like a month ago now, probably.

Had to reduce the batch size and the number of epochs to get rid of an Out of Memory error. Then had a sort of browser freeze.

I should really run a script like this, instead:

nohup python3 train.py &

but instead I’m hoping I can run it in Jupyter and have it just follow the execution and not freeze up. Maybe if I remove some debugging text…

But the loss function wasn’t going anywhere, even after 50 epochs, overnight. The mask prediction is just all black.

And I need to restart the Docker container to open the TensorBoard port.

For Docker users: In case you are running a Docker image of Jupyter Notebook server using TensorFlow’s nightly, it is necessary to expose not only the notebook’s port, but the TensorBoard’s port. Thus, run the container with the following command:

docker run -it -p 8888:8888 -p 6006:6006 \
tensorflow/tensorflow:nightly-py3-jupyter 

or in my case,

sudo docker run -it -p 8888:8888 -p 6006:6006 --rm --runtime nvidia --network host -v /home/chicken/OID:/opt/OID -v /home/chicken/notebooks:/opt/notebooks nvcr.io/nvidia/l4t-ml:r32.5.0-py3


Hmm, the Python 'magic' is not working.

Ok, so I ran tensorboard inside the docker terminal instead of in the notebook. (You can do that by finding the container ID with 'docker ps' and calling 'docker exec -it <ID> bash'.)


python3 -m tensorboard.main --logdir=/opt/notebooks/logs



from tensorboard import notebook
import datetime

# load (or reload) the TensorBoard notebook extension
%reload_ext tensorboard
%tensorboard --logdir /opt/notebooks/logs

notebook.list()  # show currently running TensorBoard instances
notebook.display(port=6006, height=1000)



Ok, yeah, so my ML model didn't learn shit.
Also, apparently they don't have TensorFlow 2 in this NVIDIA ML docker container.

root@jetson:/opt/notebooks/logs# pip3 show tensorflow
Name: tensorflow
Version: 1.15.4+nv20.11

So how to debug? The images are converted to an n-dimensional array.


Got array with shape: (4, 256, 256, 1)

Ok, things are going weird now, right as I notice the TF version. It must be getting late.

Next day: Ok, NVIDIA has a TF2 docker image, and it shares about half of its layers with the other image, so that’s cool: nvcr.io/nvidia/l4t-tensorflow:r32.5.0-tf2.3-py3

But it doesn’t have Jupyter installed. Maybe I can copy the relevant bits from the Dockerfile. I’ve tried installing Jupyter and committing the container, but got “Failed building wheel for cffi”, some aarch64 issue.

RUN apt-get update && apt-get install -y libffi6 libffi-dev

The NVIDIA Dockerfiles are hard to find, and they only have l4t-base available.

#
# JupyterLab Dockerfile bits
#

RUN pip3 install jupyter jupyterlab --verbose

#RUN jupyter labextension install @jupyter-widgets/jupyterlab-manager@2

RUN jupyter lab --generate-config

RUN python3 -c "from notebook.auth.security import set_password; set_password('nvidia', '/root/.jupyter/jupyter_notebook_config.json')"


CMD /bin/bash -c "jupyter lab --ip 0.0.0.0 --port 8888 --allow-root &> /var/log/jupyter.log" & echo "allow 10 sec for JupyterLab to start @ http://localhost:8888 (password nvidia)" && echo "JupterLab logging location:  /var/log/jupyter.log  (inside the container)" && /bin/bash
- from https://github.com/dusty-nv/jetson-containers/blob/master/Dockerfile.ml

Ok, sweet jeebus: after a big detour, I am using this successfully.

chicken@jetson:~$ cat Dockerfile

FROM docker.io/datamachines/jetsonnano-cuda_tensorflow_opencv:10.2_2.3_4.5.1-20210218
RUN pip3 install jupyter jupyterlab --verbose
RUN jupyter lab --generate-config
RUN python3 -c "from notebook.auth.security import set_password; set_password('nvidia', '/root/.jupyter/jupyter_notebook_config.json')"
EXPOSE 6006
EXPOSE 8888
CMD /bin/bash -c "jupyter lab --ip 0.0.0.0 --port 8888 --allow-root &> /var/log/jupyter.log" & \
echo "allow 10 sec for JupyterLab to start @ http://$(hostname -I | cut -d' ' -f1):8888 (password nvidia)" && \
echo "JupterLab logging location: /var/log/jupyter.log (inside the container)" && \
/bin/bash

chicken@jetson:~$ sudo docker build -t nx_setup .

chicken@jetson:~$ sudo docker run -it -p 8888:8888 -p 6006:6006 --rm --runtime nvidia --network host -v /home/chicken/:/dmc nx_setup

Finally. So, back to tensorflow, and running U-Net!

So, maybe I see a problem with the semantic segmentation: it could be related to chickens being one category among many, rather than a binary chicken-ness and non-chicken-ness:

SparseCategoricalCrossentropy class

Use this crossentropy metric when there are two or more label classes. 

I only have one class. Chicken. So that won’t work; with a single foreground class, this is really a binary segmentation problem. I need an egg dataset. Luckily this implementation has an example of an eye and its veins, and that is why we want the U-Net, for the egg anomaly detection.
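
In other words, with only chicken-vs-background, the output layer reduces to the binary case. A sketch of the difference, as I read it (not verbatim from the example):

from tensorflow.keras import layers

# multi-class: softmax over N classes, with (Sparse)CategoricalCrossentropy
multi_class_output = layers.Conv2D(2, 3, activation='softmax', padding='same')

# single class: one sigmoid channel, with binary crossentropy
single_class_output = layers.Conv2D(1, 1, activation='sigmoid')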

The problem’s symptom is that nothing is being learned during training. So maybe I’m using the wrong loss function.

I need to review instance segmentation “options”.

The loss function is currently measuring “the crossentropy metric between the labels and predictions.”

The reason I want instance segmentation is to differentiate between chickens, where possible. Panoptic segmentation actually makes the most sense for this project.

Panoptic segmentation uses a semantic network and an instance network, and uses them both, to deliver something like (“cat”,0), (“cat”,1), (“cat”,3)

The COCO Panoptic API looks great, but it seems to need JSON describing all of the PNG images. Bounding boxes seem unnecessary for our purposes, but COCO needs bounding box data.

We’ll start a new post on Panoptic Segmentation using COCO, and get back to Tensorflow 2 for U-Net, for semantic segmentation, when training on lit up eggs for in ovo sexing.

Update after a hiatus: I see a recent nnU-Net advancement… It’s a meta-modelling, process-evolution thing: “self-configuring” for biomedical imaging. Hmm. Very interesting.

We’re not there yet. We just want to get a basic U-Net working.

I see, too, that Perceptilabs from W&B has been released, and they have some beautiful screenshots, but it’s not available on pip3 yet for aarch64. So it’s not an option at the moment.

So, as a reminder: in this post, we’re trying to get basic U-Net segmentation working. Here’s a good explanation of it.

“Back to U-net”

I’ve found another implementation of U-Net that seems a bit more plug and play. There is also a useful note regarding U-Net and the number of classes here: https://github.com/karolzak/keras-unet/issues/3

(173, 512, 512, 3) (173, 512, 512)
vs
(30, 512, 512) (30, 512, 512)

One of their notebooks looks promising, the kz-isbi-challenge.py, and I rigged it to run on my data, and I get OOM: Out of Memory. But this is Jupyter Lab, and training in Jupyter Lab seems like a bad idea. It feels like a common problem with a known solution, where the solution is probably ‘use python, dumbass’. So: converted to a .py, and edited. Had to take out all the plotting code. Pity. But same problem.

I found jetson-stats (https://github.com/rbonghi/jetson_stats) and its jtop program, and though it only showed 6.2GB/8GB of RAM used the whole time (I wasn’t even using up all the RAM?), it did remind me that I’m in a Docker container, that maybe I’m not using swap space, and that 8GB is probably not enough RAM for a conv net. The U-Net has 31 million params.

Trainable params: 31,030,593




ResourceExhaustedError:  OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node functional_1/concatenate_3/concat (defined at <ipython-input-26-51303ee95255>:7) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_test_function_3292]

Hmm. Well, about the Docker swap space: Docker will use the resources it can on the host, which is going to be just a bit less than whatever the host can handle. So when it crashed, it appears to me that it was trying to allocate GPU memory, and only had 400MB or so.

2021-06-07 19:06:35.653219: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] total_region_allocated_bytes_: 404856832 memory_limit_: 404856832 available bytes: 0 curr_region_allocation_bytes_: 809713664
 2021-06-07 19:06:35.653456: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Stats: 
 Limit:                       404856832
 InUse:                       395858688
 MaxInUse:                    404771584
 NumAllocs:                         540
 MaxAllocSize:                 69172736
 Reserved:                            0
 PeakReserved:                        0
 LargestFreeBlock:                    0
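
(An aside, untested at this point: one knob that might help is asking TensorFlow to allocate GPU memory incrementally, instead of reserving one fixed region up front. Something like:)

import tensorflow as tf

# grow GPU memory as needed (the Jetson shares its RAM between CPU and GPU)
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)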

That matches the advice from the repo author: check whether other threads or processes have already allocated memory, leaving none for your process (use top or ps -ef to see what’s running).

After killing Jupyter, I left it training overnight on 300 training images and masks from our chicken dataset, and it ran out of memory. But it looks like it finished training before it crapped out, and this time the Out of Memory (OOM) error had some bigger numbers.

2021-06-08 08:15:21.038084: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] total_region_allocated_bytes_: 1400856576 memory_limit_: 1400856576 available bytes: 0 curr_region_allocation_bytes_: 2801713152
2021-06-08 08:15:21.038151: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Stats:
Limit: 1400856576
InUse: 616462592
MaxInUse: 1400851712
NumAllocs: 37528
MaxAllocSize: 1280887296
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0


And you can see the loss was decreasing. That's cool.
So that third, ghostly column is the one we're watching. I think it's just not very good yet. But maybe I don't understand what it's doing, exactly, either. I am expecting that when I'm done here, it should be able to produce the mask from just the image.

The loss functions I’ve used have been,

model.compile(
     optimizer=Adam(), 
     loss='binary_crossentropy',
     metrics=[iou, iou_thresholded]
 )

and

model.compile(
     optimizer=SGD(lr=0.01, momentum=0.99),
     loss=jaccard_distance,
     metrics=[iou, iou_thresholded]
 )

So that was training with the second one, last night. I will continue with it for now. Jaccard distance is union minus intersection, over union. Sounds good to me. Optimising using Stochastic Gradient Descent, with some hyperparameters (lr=0.01, momentum=0.99).

d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
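
As a Keras loss, that’s roughly the following (my sketch; the keras-unet version may differ in details):

from tensorflow.keras import backend as K

def jaccard_distance(y_true, y_pred, smooth=100):
    # 1 - IoU, smoothed so an empty mask doesn't divide by zero
    intersection = K.sum(K.abs(y_true * y_pred), axis=-1)
    union = K.sum(K.abs(y_true) + K.abs(y_pred), axis=-1) - intersection
    jac = (intersection + smooth) / (union + smooth)
    return (1 - jac) * smooth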

Let’s leave it training again. I’m also upping the ratio of training to validation data, from 50/50 to 80/20. Why not.

Also, the code we had before, for the first U-Net attempt in the ‘Chicken Vision.py’ notebook, seemed more memory efficient, because it was lazy loading the images. But maybe it’s much of a muchness. We’ll see, perhaps.
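
In Keras terms, lazy loading means a Sequence that opens images per batch; something like this sketch (the class name and details are mine, not the actual Chicken Vision.py code):

import numpy as np
from PIL import Image
from tensorflow import keras

class LazyChickens(keras.utils.Sequence):
    # only one batch of images and masks in memory at a time
    def __init__(self, image_paths, mask_paths, batch_size=8, size=(256, 256)):
        self.image_paths, self.mask_paths = image_paths, mask_paths
        self.batch_size, self.size = batch_size, size

    def __len__(self):
        return len(self.image_paths) // self.batch_size

    def __getitem__(self, idx):
        i = idx * self.batch_size
        x = np.array([np.asarray(Image.open(p).convert('RGB').resize(self.size)) / 255.0
                      for p in self.image_paths[i:i + self.batch_size]])
        # assuming 0/255 masks, one channel
        y = np.array([np.asarray(Image.open(p).convert('L').resize(self.size, Image.NEAREST))[..., None] / 255.0
                      for p in self.mask_paths[i:i + self.batch_size]])
        return x, y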

So training isn’t working anymore, it seems.

W tensorflow/core/kernels/gpu_utils.cc:49] 
Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.

Followed by OOM. Benign.

Stats: 
 Limit:                      1403920384
 InUse:                       650411520
 MaxInUse:                   1403915520
 NumAllocs:                       37625
 MaxAllocSize:               1266649600
 Reserved:                            0
 PeakReserved:                        0
 LargestFreeBlock:                    0

Ok, we might need a cloud GPU. The Jetson NX isn’t cutting it.

From a while later, after the cloud GPUs: it is worth noting that there is a weed-detection U-Net using two different loss functions, Dice loss and ‘Focal Tversky loss’, with only a 19,667-parameter NN. That’s orders of magnitude smaller, so I might want to come back and see how.
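
(For when I come back to it: Dice loss is Jaccard’s close cousin. A sketch:)

from tensorflow.keras import backend as K

def dice_loss(y_true, y_pred, smooth=1.0):
    # 1 - Dice coefficient, where Dice = 2|A.B| / (|A| + |B|)
    intersection = K.sum(y_true * y_pred)
    return 1 - (2.0 * intersection + smooth) / (K.sum(y_true) + K.sum(y_pred) + smooth)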