I’ve now got a UNet that can provide predictions for where an egg is, in simulation.
So I want to design a reward function related to the egg prediction mask.
I haven’t ‘plugged in’ the trained neural network though, because it will slow things down, and I can just as well make use of the built-in pybullet segmentation to get the simulation egg pixels. At some point though, the robot will have to exist in a world where egg pixels are not labelled as such, and the simulation trained vision will be a useful basis for training.
I think a good reward function might be, (to not fall over), and to maximize the number of 1s for the egg prediction mask. An intermediate award might be the centering of egg pixels.
The numpy way to count mask pixels could be
arr = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])
np.count_nonzero(arr == 1)
I ended up using the following to count the pixels:
seg = Image.fromarray(mask.astype('uint8'))
self._num_ones = (np.array(seg) == 1).sum()
Hmm for centering, not sure yet.
I’m looking into how to run pybullet / gym on the cloud and get some of it rendering.
I’ve found a few leads. VNC is an obvious solution, but probably won’t be available on Chrome OS. Pybullet has a broken link, but I think it’s suggesting something like this colab, more or less, using ‘pyrender’. User matpalm has a minimal example of sending images to Google Dataflow. Those might be good if I can render video. There’s a Jupyter example with capturing images in pybullet. I’ll have to research a bit more. An RDP viewer would probably be easiest, if it’s possible.
I set up the Ray Tune training again, on google cloud, and enabled the dashboard by opening some ports (8265, and 6006), and initialising ray with ray.init(dashboard_host=”0.0.0.0″)
I can see it improving the episode reward mean, but it’s taking a good while on the 4 CPU cloud machine. Cost is about $3.50/day on the CPU machine, and about $16/day on the GPU machine. Google is out of T4 GPUs at the moment.
I have it saving the occasional mp4 video using a Monitor wrapper that records every 10th episode.
We’ve got an egg in the gym environment now, so we need to collect some data for training the robot to go pick up an egg.
I’m going to have it save the rgba, depth and segmentation images to disk for Unet training. I left out the depth image for now. The pictures don’t look useful. But some papers are using the depth, so I might reconsider. Some weed bot paper uses 14-channel images with all sorts of extra domain specific data relevant to plants.
I wrote some code to take pics if the egg was in the viewport, and it took 1000 rgb and segmentation pictures or so. I need to change the colour of the egg for sure, and probably randomize all the textures a bit. But main thing is probably to make the segmentation layers with pixel colours 0,1,2, etc. so that it detects the egg and not so much the link in the foreground.
So sigmoid to softmax and so on. Switching to multi-class also begs the question whether to switch to Pytorch & COCO panoptic segmentation based training. It will have to happen eventually, as I think all of the fastest implementations are currently in Pytorch and COCO based. Keras might work fine for multiclass or multiple binary classification, but it’s sort of the beginning attempt. Something that works. More proof of concept than final implementation. But I think Keras will be good enough for these in-simulation 256×256 images.
Regarding multi-class segmentation, karolzak says “it’s just a matter of changing num_classes argument and you would need to shape your mask in a different way (layer per class??), so for multiclass segmentation you would need a mask of shape (width, height, num_classes)“
I’ll keep logging my debugging though, if you’re reading this.
So I ran segmask_linkindex.py to see what it does, and how to get more useful data. The code is not running because the segmentation image actually has an array of arrays. I presume it’s a numpy array. I think it must be the rows and columns. So anyway I added a second layer to the loop, and output the pixel values, and when I ran it in the one mode:
Ok I see. Hmm. Well the important thing is that this code is indeed for extracting the pixel information. I think it’s going to be best for the segmentation to use the simpler segmentation mask that doesn’t track the link info. Ok so I used that code from the guy’s thesis project, and that was interpolating the numbers. When I look at the unique elements of the mask without interpolation, I’ve got…
[ 0 2 255]
[ 0 2 255]
[ 0 2 255]
[ 0 2 255]
[ 0 2 255]
[ 0 1 2 255]
[ 0 1 2 255]
[ 0 2 255]
[ 0 2 255]
Ok, so I think:
255 is the sky
0 is the plane
2 is the robotable
1 is the egg
So yeah, I was just confused because the segmentation masks were all black and white. But if you look closely with a pixel picker tool, the pixel values are (0,0,0), (1,1,1), (2,2,2), (255,255,255), so I just couldn’t see it.
The interpolation kinda helps, to be honest.
As per OpenAI’s domain randomization helping with Sim2Real, we want to randomize some textures and some other things like that. I also want to throw in some random chickens. Maybe some cats and dogs. I’m afraid of transfer learning, at this stage, because a lot of it has to do with changing the structure of the final layer of the neural network, and that might be tough. Let’s just do chickens and eggs.
Both techniques increase the computational requirements: dynamics randomization slows training down by a factor of 3x, while learning from images rather than states is about 5-10x slower.
Ok that’s a bit more complex than I was thinking. I want to randomize textures and colours, first
I’ve downloaded and unzipped the ‘Describable Textures Dataset’
And ok it’s loading a random texture for the plane
and random colour for the egg and chicken
Ok, next thing is the Simulation CNN.
Interpolation doesn’t work though, for this, cause it interpolates from what’s available in the image:
[ 0 85 170 255]
[ 0 63 127 191 255]
[ 0 63 127 191 255]
I kind of need the basic UID segmentation.
[ 0 1 2 3 255]
Ok, pity about the mask colours, but anyway.
Let’s train the UNet on the new dataset.
We’ll need to make karolzak’s changes.
I’ve saved 2000+ rgb.jpg and seg.png files and we’ve got [0,1,2,3,255] [plane, egg, robot, chicken, sky]
So num_classes=5
And
“for multiclass segmentation you would need a mask of shape (width, height, num_classes) “
What is y.shape?
(2001, 256, 256, 1)
which is 2001 files, of 256 x 256 pixels, and one class. So if I change that to 5…? ValueError: cannot reshape array of size 131137536 into shape (2001,256,256,5)
Um… Ok I need to do more research. Brb.
So the keras_unet library is set up to input binary masks per class, and output binary masks per class.
I coded it up using the library author’s suggested method, as he pointed out that the gains of the integer encoding method are minimal. I’ll check it out another time. I think it might still make sense for certain cases.
Ok that’s pretty awesome. We have 4 masks. Human, chicken, egg, robot. I left out plane and sky for now. That was just 2000 images of training, and I have 20000. I trained on another 2000 images, and it’s down to 0.008 validation loss, which is good enough!
So now I want to load the CNN model in the locomotion code, and feed it the images from the camera, and then have a reward function related to maximizing the egg pixels.
I also need to look at the pybullet-planning project and see what it consists of, as I imagine they’ve made some progress on the next steps. “built-in implementations of standard motion planners, including PRM, RRT, biRRT, A* etc.” – I haven’t even come across these acronyms yet! Ok, they are motion planning. Solvers of some sort. Hmm.
The attempted training of the U-Net on the Jetson NX has been a bit slow, making odd progress over 2 nights, and I’m not sure if it’s working. I’ve had to reduce batch size to 1, and the filter size, which has reduced the number of parameters by about a factor of 10, and still, loading the NN into memory sometimes dies on a concatenation call. The number of images per batch can also crash it, so perhaps some memory can be saved with a better image loading process.
Anyway, projects under an official NVIDIA repo are suggesting that we should be able to train smaller networks like resnet18, with 11 million parameters, on the Jetson. So maybe we can still avoid the cloud.
But judging by the NVIDIA TLT info, any training of resnet50s or 100s are going to need serious GPUs and memory and space for training.
After looking at Google, Amazon and Microsoft offerings, the AWS g4dn.xlarge instance looks like it might be the best option, at $0.526/hr, or Google’s got a T4 based compute engine for only $0.35/hr. These are good options, if 16GB of video ram will be enough. It should be, because we’re working with like 5GB on the Jetson.
Microsoft has the NC6 option, which looks good for a much more beefy GPU and memory, at $0.90/hr.
We’re just looking at Pay-as-you-go prices, as the 1-year and 3-year commitments will end up being expensive.
I’m still keen to try train on the Jetson, but the cloud is becoming more and more probable. In Sweden, visiting Miranda, we’re unable to order a Jetson AGX Xavier, the 32GB version. Arrow won’t ship here without a VAT number, and SiliconHighway is out of stock.
So, attempting Cloud GPUs. If you want to cut to the chase, read this one backwards. So many problems. In the end, it turned out setting it up yourself is practically impossible, but there is an ‘AI Platform’ section that works.
Amazon AWS. Tried to log in to AWS. “Authentication failed because your account has been suspended.” Tells me to create a new account. But then brings me back to the same failure screen. Ok, sending email to their accounts department. Next.
Google Cloud. I tried to create a VM and add a T4 GPU, but none of the regions have them. So I need to download the Gcloud SDK and CLI tool first, to run a command to describe the regions, according to the ‘Before you begin‘ instructions..
Ok, GPUs will only run on N1 and A2 VMs. The A2 VMs are only for A100s, so I need an N1 VM in one of these regions, and we add a T4 GPU.
There’s an option to load a specific docker, and unfortunately they don’t seem to have one with both Pytorch and TF2. Let’s start with TF2 gcr.io/deeplearning-platform-release/tf2-gpu.2-4
So this looks like a good enough VM. 30GB RAM, 8 cpus. For europe-west3, the cost is about 50 cents / hr for the VM and 41 cents / hr for the GPU.
n1-standard-8
8
30GB
$0.4896
$0.09840
1 GPU
16 GB GDDR6
$0.41 per GPU
So let’s round up to about $1/hour. I ended up picking the n1-standard-4 (4 cpus, 15 gb ram).
At these prices I’ll want to get things up and running asap. So I am going to prep a bit, before I click the Create VM button.
I had to try a few things to find a cloud instance with a gpu, because the official list didn’t really work. I eventually got one with a T4 GPU from europe-west4-c.
It seems like Google Drive isn’t really part of the google cloud platform ecosystem, so I started a storage bucket with 50GB of space, and am uploading the chicken images to it.
The instance doesn’t have pip or jupyter installed. So let’s do that…
ok so when I sudo’ed, I got this error
Jul 20 14:45:01 chicken-vm konlet-startup[1665]: {"errorDetail":{"message":"write /var/lib/docker/tmp/GetImageBlob362062711: no space left on device"},"error":"write /var/lib/docker/tmp/GetImageBl
Jul 20 14:45:01 chicken-vm konlet-startup[1665]: ).
Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 No containers created by previous runs of Konlet found.
Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 Found 0 volume mounts in container chicken-vm declaration.
Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 Error: Failed to start container: Error: No such image: gcr.io/deeplearning-platform-release/tf2-gpu.2-4
Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 Saving welcome script to profile.d
So 10GB wasn’t enough to load gcr.io/deeplearning-platform-release/tf2-gpu.2-4 , I guess.
Ok deleting the VM. Next time, bigger hard drive. I’m now adding a cloud storage bucket and uploading the chicken images, so I can copy them to the VM’s drive later. It’s taking forever. Wow. Ok.
Now I am trying to spin up a VM again, and it’s practically impossible. I’ve tried every region and zone possible. Ok europe-west1-c. Finally. I also upped my ‘quota’ of gpus, under IAM->Quotas, in case that is a reason I couldn’t find a GPU VM. They reviewed and approved it in about 15 minutes.
+------------------+--------+-----------------+ | Name | Region | Requested Limit | +------------------+--------+-----------------+ | GPUS_ALL_REGIONS | GLOBAL | 1 | +------------------+--------+-----------------+
So after like 10 minutes of nothing, I see the docker container started up.
68ee22bf268f gcr.io/deeplearning-platform-release/tf2-gpu.2-4 "/entrypoint.sh /run…" 5 minutes ago Up 4 minutes klt-chicken-vm-template-1-ursn
I’ve enabled tcp:8080 port in the firewall settings, but the external ip and new port don’t seem to connect. https://35.195.66.139:8080/ Ah ha. http. We’re in!
Jupyter Lab starting up.
So I tried to download the gcloud tools to get gsutil to access my storage bucket, but was getting ‘Permission denied’, even as root. I chown’ed it to my user, but still no.
I had to go out, so I stopped the VM. Seems you can’t suspend a VM with a GPU. I also saw when I typed ‘sudo -i’ to switch user to root, it said to ‘docker attach’ to my container. But the container is just like a tty printing out logs, so you can get stuck in the docker, and need to ssh in again.
I think the issue was just that I need to be inside the docker to do things. The VM you log into is just a minimal container running environment. So I think that was my issue. Next time I install gsutil, I’ll run ‘docker exec -it 68ee22bf268f bash’ to get into the docker first.
Ok fired up the VM again. This time I exec’ed into the docker, and gsutil was already installed. gsutil cp -r gs://chicken-drive . is copying the files now. It’s slow, and it says to try with -m, for parallel copying, but I’m just going to let it carry on for now. It’s slow, but I can do some other stuff for now. So far our gcloud bill is $1.80.
Ok, /opt/jupyter/chicken-drive has my data now. But according to /opt/jupyter/.jupyter/jupyter_notebook_config.py, I need to move it under /home/jupyter.
Hmm. No space left on drive. What? 26GB all full. But it wasn’t full a second ago. How can moving files cause this? I guess the mv operation must copy and then delete. Ok, so deleting the new one. Let’s try again, one folder at a time. Oh boy. This is something a bit off about the google process. I didn’t start my container, and if I did, I’d probably map a volume. But the host is sort of read only. Anyway. We’re in. I can see the files in Jupyter Lab.
So now we’re training U-Net binary classification using keras-unet, by karolzak, based on the kz-isbi-chanllenge.ipynb notebook.
But now I’m getting this error when it’s clearly there…
FileNotFoundError: [Errno 2] No such file or directory: '/OID/v6/images/Chicken/train/'
Ok well I can’t work it out but changing it to a path relative to the notebook worked. base_dir = “../../../”
Ok first test round of training, binary classification: chicken, not-chicken. Just 173 image/mask pairs, 10 epochs of 40 steps.
Now let’s try with the training set. 1989 chickens this time. 50/50 split. 30 epochs of 50 steps. Ok second round… hmm, not so good. Pretty much all black.
Ok I’m changing the parameters of the network, fixing some code, and starting again.
I see that the pngs were loading float values, whereas in the example, they were loading ints. I fixed it by adding a m = m.convert(‘L’) to the mask (png) loading code. I think previously, it was training with the float values from 0 to 1, divided by 255, whereas the original example had int values from 0 to 255, divided by 255.
So I’m also resetting the parameters, to make this a larger network, since we’re training in the cloud. 512×512 instead of 256×256. Batch size of 3. Horizontal flip augmentation. 64 filters. 10 epochs of 100 steps. Go go go. Ok, out of memory. Batch size of 1. Still out of memory. Back to test set of 173 chickens. Ok it’s only maxing at 40% RAM now. I’ll let it run.
Ok, honestly I don’t know anymore. What is it even doing? Looks like it’s inversing black and white. That’s not very useful.
Ok before giving up, I’m going to make some changes.
The next day, I’m starting up the VM. Total cost so far, $8.84. The files are all missing, so I’m recopying, though using the gsutil -m cp -R gs://chicken-drive . option, and yes it is a lot faster. Though it slows down.
I think the current setup is maybe failing because we’re using 173 images with one kind of augmentation. Instead of 10 epochs of 100 steps of the same shit, let’s rather swap out the training images.
First problem is that Keras is basically broken, in this regard. I’ve immediately discovered that saving and loading a checkpoint does not save and load the metrics, and so it keeps evaluating against a loss of infinity, instead of what your saved model achieved. Very annoying.
Now, after stopping and restarting the VM, and enabling all cloud APIs, I’m having a new problem. gsutil no longer works. After 4% copied, network throughput drops to 0.0B/s. I tried reconnecting and now get:
Connection via Cloud Identity-Aware Proxy Failed
Code: 4003
Reason: failed to connect to backend
You may be able to connect without using the Cloud Identity-Aware Proxy.
I’ve switched back to ‘Allow default access’. Still getting 4003.
Ok, I’ve deleted the instance. Trying again. Started it up. It’s not installing the docker I asked for, after 22 minutes. Something is wrong. Let’s try again. Stopping VM. I’m ticking the ‘Run as priviliged’ box this time.
Ok now it’s working again. It even started up with the docker ready. I’m trying with the multiprocess copying again, and it slowed down at 55%, but is still going. Phew. Ok.
I changed to using the TF2 SavedModel format. Still restarts the ‘best’ metric. What a piece of shit. I can’t actually believe it. Ok I wrote my own code for finding the best, by saving all weights with the val_loss in the filename, and then loading the best weights for the next epoch. It’s still not perfect, but it’s better than Keras overwriting the best weights every time.
Interestingly, it seems like maybe my training on the Jetson was actually working, because the same weird little vignette-ing is occurring.
Ok we’re up to $20 billing, on gcloud. It’s adding up, but not too badly yet. Nothing seems to be beating a round of training from like 4 hours ago, so to keep things more exploratory, I added a 50/50 chance to pick from the saved weights at random, rather than loading the winner every time.
Something seems to be happening. The vignette is shrinking, but some chicken border action, maybe.
I left it running overnight, and this morning, we’re up to $33 spent, and today, we can’t log into the VM again. Pretty annoying. Of the 3 reasons for ‘Permission denied’, only one makes sense, Your key expired and Compute Engine deleted your ~/.ssh/authorized_keys file.
Same story if I run the gcloud commands: gcloud beta compute ssh –zone “europe-west4-c” “chicken-vm-template-1” –project “gpu-ggr”
So I apparently need to add a new public key to the Metadata section. I just know something is going to go wrong. Yeah, so I did everything I know I’m supposed to do, and it didn’t work. I generated an OpenSSH private/public key pair in PuttyGen, I changed the permissions on the private key so that only I have access, I updated the SSH Keys in the VM instance metadata, and the metadata for good measure. And ssh -i opensshprivate daniel_brownell@34.91.21.245 -v just ends up with Permission denied (publickey).
Ok and then print the public key, and copy paste it to the VM Instance ‘Edit…’ / SSH Keys… and connect with PuTTY with the private key and… nope. Permission denied (publickey).. Ok I need to go through these answers and find one that works. Same error with windows cmd line ssh, except also complains that the openssh key is an invalid format. Try again later.
Fuck you gcloud. Ok I’m stopping and deleting the VM. $43 used so far.
Also, the training through the night didn’t improve on the val_loss score. Something’s fucked.
Ok I’ve started it up again a few days later. I was wondering about the warnings at the beginning of my training that carious CUDA things were not installed. So apparently I need:
So I increased the boot disk to 35GB and called ‘ cos-extensions install gpu’ again, after cd’ing into /mnt/stateful-partition and it worked a bit better. Still has ‘ERROR: Unable to load the kernel module 'nvidia.ko'.‘ in the logs though. But install logs at ./mnt/stateful_partition/var/lib/nvidia/nvidia-installer.log say its ok…
So the error now is ‘Could not load dynamic library ‘libcuda.so.1′; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64’
And so we need to modify the docker container run command, something like the example in the instructions.
Ok so our container is… gcr.io/deeplearning-platform-release/tf2-gpu.2-4
According to this stackoverflow answer, this already has everything installed. Ok but the host needs the drivers installed.
tf.config.list_physical_devices('GPU')
[]
So yeah, i think i need to install the cos crap, and restart the container with those volume and device bits.
docker stop klt-chicken-vm-template-1-ursn
docker run \
--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
--device /dev/nvidiactl:/dev/nvidiactl \
gcr.io/deeplearning-platform-release/tf2-gpu.2-4
...
[I 14:54:49.167 LabApp] Jupyter Notebook 6.3.0 is running at:
[I 14:54:49.168 LabApp] http://46fce08b5770:8080/
[I 14:54:49.168 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
^C^C^C^C^C^C^C^C^C^C
Not so good. Ok can’t access it either. -p 8080:8080 fixes that. It didn’t like --gpus all.
“Unable to determine GPU information”. Container optimised shit.
Ok I’m going to delete the VM again. Going to check out these nvidia cloud containers. There’s 21.07-tf2-py3 and NGC stuff.
So I can’t pull the dockers cause there’s no space, and even after attaching a persistent disk, not, because things are stored on the boot disk. Ok but I can tell docker to store stuff on a persistent disk.
/etc/docker/daemon.json:
{
"data-root": "/mnt/x/y/docker_data"
}
root@nvidia-ngc-tensorflow-test-b-1-vm:/mnt/disks/disk# docker run --gpus all --rm -it -p 8080:8080 -p 6006:6006 nvcr.io/nvidia/tensorflow:21.07-tf2-py3
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
chmod +x cuda_11.1.0_455.23.05_linux.run
sudo ./cuda_11.1.0_455.23.05_linux.run
or some newer version:
wget https://developer.download.nvidia.com/compute/cuda/11.4.1/local_installers/cuda_11.4.1_470.57.02_linux.run
sudo sh cuda_11.4.1_470.57.02_linux.run
‘boost::filesystem::filesystem_error’
Ok using all the space again. 32GB. Not enough. Fuck this. I’m deleting the VM again. 64GB. SSD persistent disk. Ok installed driver. Running docker…
And…
FFS. Something is compromised. In the time it took to install CUDA and run docker on an Ubuntu VM, an army of Indian hackers managed to delete my root user.
Ok. Maybe it’s time to consider AWS again for GPUs. I think I can officially count GCP GPU as unusable. Learned a few useful things, but overall, yeesh.
I think maybe I’ll just run the training on a cheap non-GPU VM on GCP for now, so that I’m not paying for a GPU that I’m not using.
docker run -d -p 8080:8080 -v /home/daniel_brownell:/home/jupyter gcr.io/deeplearning-platform-release/tf2-cpu.2-4
Ok wow so now with the cpu version, the loss is improving like crazy. It went from 0.28 to 0.24 in 10 epochs (10 minutes or so). That sort of improvement was not happening after like 10 hours on the ‘gpu’.
So yeah, amazing. The code now does a sort of population based training, by picking a random previous set of weights instead of the best weights, half of the time. Overall it slows things down, but should result in a bit more variation in the end.
What finally worked
Ok there’s also an ‘AI platform – notebook’ option. I might try that too.
Ok the instance started up. But it failed to start 4 cron services: nscd, unscd, crond, sshd. CPU use goes to zero. Nothing. Ok so I need to ssh tunnel apparently.
Successfully opened dynamic library libcudart.so.11.0
‘ModelCheckpoint’ object has no attribute ‘_implements_train_batch_hooks’
Ok, needed to change all keras.* etc. to tensorflow.keras.*
Ok fuck me that’s a lot faster than CPU.
Permission denied: ‘weights-0.2439.hdf5’
Ok, let’s sudo it.
Ok there she goes. It’s like 20 times faster maybe. Strangely isn’t doing much better than the CPU though. But I’ll let it run for a bit. It’s only been a minute. I think maybe the CPU doing well was just good luck. Perhaps we trained them too well on the original set of like 173 images, and it was getting good results on those original images.
Ok now it’s been an hour or so, and it’s not beating the CPU. I’ve changed the train / validation set to 50/50 now, and the learning rate is randomly chosen between 0.001 and 0.0003. And I’m upping the epochs to 30. And the filters to 64. batch_size=4, use_batch_norm=True.
We’re down to 23.3 after an hour and a half. 21 now… 3 hours maybe now
Ok 5 hours, lets check:
Holy shit it’s working. That’s great. I’ll leave it running overnight. The overnight results didn’t improve much for some reason.
(TODO: learn about focal loss / dice loss / jaccard distance as possible change to loss function.? less necessary now.)
So it’s cool but it’s 364MB. We need it 1/4 size to run it on the Jetson NX I think.
So, retraining, with filters=32. We’re already down to 0.24 after an hour. Ok I stopped at 0.2104 after a few hours.
So yeah. Good enough for now.
There’s some other things to train, too.
The eggs in simulation: generate views, save images to disk. save segmentation images to disk.
Train the walking again with the gripper.
Eggs in the real world. Use augmentation to place real egg pics in scenes. Possibly use Mask-RCNN/YOLACT code with COCO, instead of continuing in Keras.
The now-working U-net binary chicken segmentation is in Keras, so there will be some tricks required, to run a multi-class segmentation detector, or multiple binary classifiers. Advice for multi-class segmentation is here and the multiple binary classifier advice is here.
When we finally try running it all on a Jetson, we will maybe need to shrink the neural network further. But that can be done last minute. It looks like we can save the h5fs file to TF2’s SavedModel format with model.save(model_fname) and convert to frozen graph, to import into TensorRT, the NVIDIA format. Similar to this. TensorRT shrinks neurons to single bytes, I believe.
One of the main decisions is how to train the Vision. We have an NVIDIA Jetson NX now, which can work on training in the background.
We will try Tensorflow 2 first, and if training is slow, we can try TensorFlow with TensorRT (TF-TRT).
But we’re starting from scratch. As the title suggests, we’re going to try get U-Net working. A neural network shaped like a U, for instance segmentation.
So, dev environment with virtual environments and pip? or Docker?
Let’s try Docker first. Some instructions here and here…
I prefer tagged versions to ‘latest’ because they’re probably more stable.
Working from Jupyter Notebook will be a good way to preserve the code, and if we can use Docker, let’s do that, because containers are easier to deal with, usually, than virtual python environments on a host. We’ll leave this for now because we need to prepare the data.
OIDv6
In the meantime, I need to redo the OID (Open Images) download with bounding boxes or segmentation mask info. Let’s go straight for segmentation, using the method we tried before.
Need dev setup basics. give me some curl and some pip3.
and now feed this into a downloader program. We can use the suggested downloader.py script. but I liked this bash function method. The downloader.py needs the files prefixed with the directory, which is a bit annoying. In Linux, you’d need to use sed to put the directory names in front of every line.
Now I need the PNG files that are the masks for these images.
It seems like these are the 16 zip files.
wget https://storage.googleapis.com/openimages/v5/train-masks/train-masks-0.zip through 16. Oh but it goes 0-9, then A-F.
So, ok how to automate this? bash or perl or python? ok..
for i in {0..9}; do wget https://storage.googleapis.com/openimages/v5/train-masks/train-masks-$i.zip; done
well good enough automation for now. if I used hex maybe I can loop 1..F in bash. Let’s compromise. I could have copy pasted in this time.
for i in {'a','b','c','d','e','f'}; do wget https://storage.googleapis.com/openimages/v5/train-masks/train-masks-$i.zip; done
They’re 262MB each file.
unzip *
2686684 files… yikes
ok i need to find the PNG masks associated with the JPG images. I can work this out but I am flying blind. Chicken is /m/09b5t –
ls -l | grep 09b5t
ls -l | grep 09b5t | wc -l
shows 2237 masks for Chickens. But we only have 1324 images of Chickens.
Ok I need to see pics on the jetson. Ultimately an RDP (remote desktop protocol would be best?). VNC server is an old code but it checks out. Followed these instructions. and connected to 192.168.101.109:5901
Nope. It’s comically small at 640×480.
Ok but yeah I guess I just wanted to see the pictures. But this isn’t really necessary yet, or practical over VNC. I want to verify that the PNG mask corresponds to the JPG image contents. I’ll probably use a Jupyter Notebook ultimately. (I do end up using Jupyter Lab.)
We’re configuring Tensorflow 2 or PyTorch to train some convolutional network with this segmentation data.
It’s got the mappings, and some extra factoids about where the Google data entry annotator people clicked with their wand selection tool, and a “Predicted IoU”, which is a big topic. We should hopefully only need the image to segmentation file mapping.
MaskPath: name of the corresponding mask image.
ImageID: the image this mask lives in.
LabelName: the MID of the object class this mask belongs to.
BoxID: an identifier for the box within the image.
BoxXMin, BoxXMax, BoxYMin, BoxYMax: coordinates of the box linked to the mask, in normalized image coordinates. Note that this is not the bounding box of the mask, but the starting box from which the mask was annotated. These coordinates can be used to relate the mask data with the boxes data.
PredictedIoU: if present, indicates a predicted IoU value with respect to ground-truth. This quality estimate is machine-generated based on human annotator behaviour. See [3] for details.
Clicks: if present, indicates the human annotator clicks, which provided guidance during the annotation process we carried out (See [3] for details). This field is encoded using the following format: X1 Y1 T1;X2 Y2 T2;X3 Y3 T3;.... Xi Yi are the coordinates of the click in normalized image coordinates. Ti is the click type, value 0 indicates the annotator marks the point as background, value 1 as part of the object instance (foreground). These clicks can be interesting for researchers in the field of interactive segmentation. They are not necessary for users interested in the final masks only.
Ok the masks folder is too big though. Let’s just do Chicken, ok? So we’ll delete any PNGs that don’t have m09b5t in their filename. And delete these zip files.
#!/bin/bash
for i in 1 2 3 4 5 6 7 8 9 a b c d e f
do
eval "unzip train-masks-$i.zip -d masks/"
cd masks
find ! -name 'm09b5png' -delete
mv /home/chicken/OID/v6/masks/* /home/chicken/OID/v6/Chicken
cd ..
done
I need to display the information somehow. Jupyter Lab (Notebooks) are probably the best way to display code, and run it interactively.
chicken@jetson:~$ jupyter notebook --generate-config
Writing default config to: /home/chicken/.jupyter/jupyter_notebook_config.py
chicken@jetson:~$ jupyter-lab
Ok so I wasn’t sure why I couldn’t connect to the server on the Jetson, but I’m able to run it at http://localhost:8888/ through an SSH tunnel.
I’m not sure what the difference between Lab and Notebook is, exactly, yet, either. But I think Notebook is a subset of Lab.
Ok so I’m trying to match JPGs and PNGs. Some interesting data, with multiple masks for some images, and no masks for some images.
I set up SAMBA to copy files over and investigate.
I see. The disturbing part is that no images in my test and validation folders matched any masks. But all of the train images had a match…
OH. train, validation and test ALL have their own 16 zip files of masks.
Good thing I automated that… ok so same thing, but changing ‘train’ to the ‘validation’ and ‘test’.
I did a programmatic test on the directories to see if any images were missing a mask:
for fname in os.listdir(test_images_dir):
if len(glob.glob(test_masks_dir + "*" + fname[:-4] + "*")) == 0: print(fname)
It’s looking better. Still some missing, but good enough now. Missing 6 validation masks, and 12 test masks. All training images have at least one mask
Number of Train images: 1122
Number of Train masks: 2237
Number of validation images: 44
Number of validation masks: 59
02a0f2858f27a7ba.jpg
01463f5494340d3d.jpg
00e71a70a2f669ff.jpg
05887f57bc232041.jpg
0d3da02e79f84dde.jpg
0ed7092c41c81d14.jpg
Number of test images: 154
Number of test masks: 186
0e9be8b09f71f909.jpg
0913fbf6fa5c190e.jpg
0f8a38312499d209.jpg
0650a130d7f707b5.jpg
0a8a5aa471796fd5.jpg
0cc4722ca906f86c.jpg
04423d3f6f5b8e74.jpg
03bc7fbc956b3a9a.jpg
07621394c8ad0b47.jpg
000411001ff7dd4f.jpg
0e5ecc56e464dcb8.jpg
05600e8a393e3c3a.jpg
I’ll move these ones out of the folder.
mkdir ~/backup
cd /home/chicken/OID/v6/images/Chicken/validation/
mv 02a0f2858f27a7ba.jpg ~/backup
mv 01463f5494340d3d.jpg ~/backup
mv 00e71a70a2f669ff.jpg ~/backup
mv 05887f57bc232041.jpg ~/backup
mv 0d3da02e79f84dde.jpg ~/backup
mv 0ed7092c41c81d14.jpg ~/backup
cd /home/chicken/OID/v6/images/Chicken/test/
mv 0e9be8b09f71f909.jpg ~/backup
mv 0913fbf6fa5c190e.jpg ~/backup
mv 0f8a38312499d209.jpg ~/backup
mv 0650a130d7f707b5.jpg ~/backup
mv 0a8a5aa471796fd5.jpg ~/backup
mv 0cc4722ca906f86c.jpg ~/backup
mv 04423d3f6f5b8e74.jpg ~/backup
mv 03bc7fbc956b3a9a.jpg ~/backup
mv 07621394c8ad0b47.jpg ~/backup
mv 000411001ff7dd4f.jpg ~/backup
mv 0e5ecc56e464dcb8.jpg ~/backup
mv 05600e8a393e3c3a.jpg ~/backup
Ok and now all the images have masks!
Number of Train images: 1122
Number of Train masks: 2237
Number of validation images: 38
Number of validation masks: 59
Number of test images: 142
Number of test masks: 186
Momentous. Looking at the nicolas windt article, there might be some dead links. So let’s delete those images too.
find -size 0 -delete
Number of Train images: 982
Number of Train masks: 2237
Number of validation images: 32
Number of validation masks: 59
Number of test images: 130
Number of test masks: 186
Oof, still good. Let’s load a picture in Jupyter. Ok tensorflow has a loadimage function.
No module named 'tensorflow'
Right. We tried installing it with Docker. How will that even work? Eish, gotta read up on this.
Back to Tensorflow.
Ok I already downloaded an NVIDIA-friendly tensorflow 3 weeks ago. Well, things move slowly, but all incremental gains move things forward. With experience you learn ways not to do things.
chicken@jetson:~/OID/v6/images$ sudo docker images REPOSITORY TAG IMAGE ID CREATED SIZE tensorflow/tensorflow 2.4.1-gpu-jupyter 64d8717296f8 3 weeks ago 5.71GB dustynv/jetson-inference r32.5.0 ccc2a5f19dad 3 weeks ago 2.89GB nvidia/cuda 11.0-base 2ec708416bb8 5 months ago 122MB
Ok the TF2 instructions say…
Start a GPU container, using the Python interpreter.
Run a Jupyter notebook server with your own notebook directory (assumed here to be ~/notebooks). To use it, navigate to localhost:8888 in your browser. So…
$ docker run -it --rm -v ~/notebooks:/tf/notebooks -p 8888:8888 tensorflow/tensorflow:2.4.1-gpu-jupyter
Error...
standard_init_linux.go:211: exec user process caused "exec format error"
And pip?
chicken@jetson:~$ pip3 install tensorflow Defaulting to user installation because normal site-packages is not writeable ERROR: Could not find a version that satisfies the requirement tensorflow ERROR: No matching distribution found for tensorflow
Great. Sanity check…
docker run -it --rm tensorflow/tensorflow bash
standard_init_linux.go:211: exec user process caused "exec format error"
Ok. Right, Jetson is aarch64, not x86-64… so google is suggesting Archiconda. This is too much for now. What’s wrong with pip? Python 3.6.9 is supposed to work with TF2.4.1 https://pypi.org/project/tensorflow/ hmm i guess there’s just no aarch64 version of TF2 precompiled.
So… one option is switch to PyTorch. Other option is try archiconda. I’m going to try this: https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml
“The Machine learning container contains TensorFlow, PyTorch, JupyterLab, and other popular ML and data science frameworks such as scikit-learn, scipy, and Pandas pre-installed in a Python 3.6 environment. Get started on your AI journey quickly on Jetson with everything pre-installed in this container.”
ok now we’re cooking. (No chickens were cooked during the making of this.)
So now I’m back on track, at like step 0.
I’m working off the Keras U-Net code now, from https://keras.io/examples/vision/oxford_pets_image_segmentation/ because it’s one of the simplest CNNs out there, from 2015. I’ve also opened up another implementation because it has more useful examples for training.
Note though that due to U-Net’s simplicity, it is often used for medical computer vision applications, since there’s not so much deep learning magic going on. You can quite easily imagine the latent representation dwelling somehow, at the bottom of the U shaped neural network. It should give us something interesting.
Let’s find the latent representation of a chicken.
We need to correlate the images and masks. We can glob by file name. Probably good as anything. But should probably put it in arrays of arrays or something. One image, many masks. So like a map from an image filename, to a list of mask filenames. As python calls maps, ‘dictionaries’.
Ok amazing, that works. I can see image and mask, and they correspond.
At some point I need to transform these. Make them all 256×256 pixels or something like that. Hmm.
OK, I got the training running. I got the Jetson like a month ago now, probably.
Had to reduce the batch size and epoch size, to get rid of an Out of Memory error. Then had a sort of browser freeze.
I should really run a script like this, instead:
nohup train.py &
but instead i’m hoping i can run it in Jupyter and it just follows the execution, and doesn’t freeze up. Maybe if I remove some debugging text…
But the loss function wasn’t going anywhere, even after 50 epochs, overnight. The mask prediction is just all black.
And I need to restart the Docker to open the tensorboard port
For Docker users: In case you are running a Docker image of Jupyter Notebook server using TensorFlow’s nightly, it is necessary to expose not only the notebook’s port, but the TensorBoard’s port. Thus, run the container with the following command:
docker run -it -p 8888:8888 -p 6006:6006 \
tensorflow/tensorflow:nightly-py3-jupyter
or in my case,
sudo docker run -it -p 8888:8888 -p 6006:6006 --rm --runtime nvidia --network host -v /home/chicken/OID:/opt/OID -v /home/chicken/notebooks:/opt/notebooks nvcr.io/nvidia/l4t-ml:r32.5.0-py3
hmm the python 'magic' is not working
Ok so I ran tensorboard inside the docker terminal, instead of in the notebook. (You can do that by checking the container ID of 'docker ps' and calling 'docker exec -it <ID> bash')
python3 -m tensorboard.main --logdir=/opt/notebooks/logs
from tensorboard import notebook
import datetime
#%load_ext tensorboard
%reload_ext tensorboard
%tensorboard --logdir /opt/notebooks/logs
notebook.list()
notebook.display(port=6006, height=1000)
ok yeah so my ML model didn't learn shit.
Also apparently they don't have tensorflow 2 in this nvidia ML docker container.
root@jetson:/opt/notebooks/logs# pip3 show tensorflow
Name: tensorflow
Version: 1.15.4+nv20.11
So how to debug? The images are converted to an n-dimensional array.
Got array with shape: (4, 256, 256, 1)
Ok things are going weird now, almost as I notice the TF version. It must be getting late.
Next day: Ok Nvidia has a TF2 docker, and it shares about half the layers with the other docker, so that’s cool: nvcr.io/nvidia/l4t-tensorflow:r32.5.0-tf2.3-py3
But it doesn’t have jupyter installed. Maybe I can copy the relevant bits from the Dockerfile. I’ve tried installing Jupyter and committing the docker, but “Failed building wheel for cffi”, some aarch64 issue.
RUN apt-get update && apt-get install -y libffi6 libffi-dev
Hard to find the nvidia docker files, and they only have l4t-base available.
finally. So, back to tensorflow, and running U-Net!
So, maybe I see a problem with the semantic segmentation, possibly, which is related to chickens being a category among other things, rather than a binary chickeness and non-chickenness :
SparseCategoricalCrossentropy class
” Use this crossentropy metric when there are two or more label classes. “
I only have one class. Chicken. So that won’t work. I need an egg dataset. Luckily this implementation has an example of an eye, and the veins, and that is why we want the U-Net, for the egg anomaly detection.
The problem’s symptom is that nothing is being learned during training. So maybe I’m using the wrong loss function.
I need to review instance segmentation “options”.
The loss function is currently measuring “the crossentropy metric between the labels and predictions.”
The reason I want instance segmentation is to differentiate between chickens, where possible. Panoptic segmentation actually makes the most sense for this project.
Panoptic segmentation uses a semantic network and an instance network, and uses them both, to deliver something like (“cat”,0), (“cat”,1), (“cat”,3)
…
COCO Panoptic API looks great, but it seems to need json to describe all of the PNG images. Bounding boxes seems unnecessary but COCO needs bounding boxes data.
We’ll start a new post on Panoptic Segmentation using COCO, and get back to Tensorflow 2 for U-Net, for semantic segmentation, when training on lit up eggs for in ovo sexing.
…
Update after a hiatus: I see a recent nnU-Net advancement… It’s a meta modelling process evolution thing. “self-configuring” for biomedical imaging. Hmm. Very interesting.
We’re not there yet. We just want to get a basic U-Net working.
I see too, Perceptilabs from W&B is released and they have some beautiful screenshots too, though not available on pip3 yet for aarch64. So it’s not an option at the moment.
So, for reminder, in this post, we’re trying to get basic U-Net segmentation working. Here’s a good explanation of it.
…
“Back to U-net”
I’ve found another implementation of U-Net that seems a bit more plug and play. There is also a useful note here regarding U-Net and the number of classes. https://github.com/karolzak/keras-unet/issues/3
One of their notebooks looks like a promising notebook, the kz-isbi-challenge.py, and I rigged it to run on my data, and I get OOM. Out of Memory. But this is jupyter lab. Let’s not train it in jupyter lab. Seems like a bad idea. Like a common problem that there’s probably a solution to, but where the solution is probably, ‘use python, dumbass’ So, converted to py, and edited. Had to take out all the plotting code. Pity. But same problem.
I found a jetson-stats https://github.com/rbonghi/jetson_stats jtop program and though it only showed 6.2GB/8GB of RAM the whole time, (I wasn’t even using up all the RAM?), it did remind me that i’m in a Docker, and maybe I’m not using swap space, and that 8GB is probably not enough RAM for a conv net. The U-Net had 31 million params.
Trainable params: 31,030,593
ResourceExhaustedError: OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node functional_1/concatenate_3/concat (defined at <ipython-input-26-51303ee95255>:7) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_test_function_3292]
Hmm. Well, about the docker swap space, docker will use the resources it can, on the host, which is gonna be just a bit less than whatever the host can handle. So when it crashed, It appears to me that it’s trying to load gpu memory, and only has 400MB or so.
So that was the advice from the repo author, that you should check your threads to see if they’ve allocated memory already, leaving none for other processes. (top or ps -ef) to see processes running.
After killing jupyter, I left it training overnight, on 300 training images and masks, from our chicken dataset, and it ran out of memory. But it looks like it finished training before it crapped out, and this time, the Out of Memory (OOM) error had some bigger numbers.
2021-06-08 08:15:21.038084: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] total_region_allocated_bytes_: 1400856576 memory_limit_: 1400856576 available bytes: 0 curr_region_allocation_bytes_: 2801713152
2021-06-08 08:15:21.038151: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Stats:
Limit: 1400856576
InUse: 616462592
MaxInUse: 1400851712
NumAllocs: 37528
MaxAllocSize: 1280887296
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
And you can see the loss was decreasing. That's cool.
So that third, ghostly column, is the one we're watching. I think it's just not very good yet. But maybe I don't understand what it's doing, exactly, either. I am expecting that when I'm done here, it should be able to make the mask, from just the image.
So that was training with the second one, last night. I will continue with it for now. Jaccard distance is, union minus intersection, over union. Sounds good to me. Optimising, using Stochastic Gradient Descent, with some hyperparameters.
Let’s leave it training again. I’m also upping the ratio between training and validation data, from 50/50 to 80/20. why not.
Also, the code we had before, for the first U-Net attempt, in the ‘Chicken Vision.py’ notebook, seemed more memory efficient, because it was lazy loading the images. But maybe much of a muchness. We’ll see, perhaps.
So training isn’t working anymore, it seems.
W tensorflow/core/kernels/gpu_utils.cc:49]
Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
Ok we might need a cloud gpu. Jetson NX not cutting it.
From a while later, after cloud gpus, it is worth noting that there is a weed detection U-Net using two different loss functions, Dice loss, and ‘Focal Tversky loss’, and only has a 19,667 parameter NN. That’s orders of magnitude smaller, so I might want to come back and see how.
So first of all, there are TPUs, Tensor Processing Units, like this one that Google bought https://coral.ai/ / https://coral.ai/products/ that are more specialised. They’re ASICs.
A tensor processing unit (TPU) is an AI acceleratorapplication-specific integrated circuit (ASIC) developed by Google specifically for neural networkmachine learning, particularly using Google’s own TensorFlow software.[1] Google began using TPUs internally in 2015, and in 2018 made them available for third party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.
Second, like, you don’t even need physical cards. You can rent a server at Hetzner, or just buy “Compute” on AWS or Google, etc.
So, how to train your dragon?
Gaming cards are not as fast as TPUs, but they’re pretty good for gaming. That’s something to consider too.
2019 is bit outdated. Before the latest AMD RX 57*/58*
(eg “RX 580X”) series.
Latest advice, 2020 August:
“AMD Ryzen Threadripper 2950x with 2 x Nvidia RTX 2080 Ti.”
NVIDIA has better software support, usually. It’s almost like vi vs. emacs – an eternal battle of the hardware Gods, to increase FLOPS. AMD vs. NVIDIA, newt vs snake, red vs. blue.
AMD has “Vega” 7nm manufacturing process. It’s ahead, for now.
Anyway, what to think of all this? Graphics cards are pretty expensive. And there’s a whole new world of IoT edge computing devices, which are more what we’re interested in, anyway.
For graphics cards, about a year ago, GTX1060 (6GB) was the best deal. AMD was out of the race. But then they got 7nm processing and whipped up some cool sounding CPUs, in like 16 and 32 core versions. So however shitty their software is, they make very efficient, parallelised products, using CPU and GPU, and have historically been the one that follows open standards. NVIDIA is proprietary. But CUDA used to be practically the only game in town.
Anyway, we can just see how long it takes to train the detectron2 chicken and egg image segmentation code.
I can probably just leave my 4 CPU cores training over night for the things we want to do, or set up the raspberry pi to work on something.
This https://github.com/facebookresearch/meshrcnn is maybe getting closer to holy grail in my mind. I like the idea of bridging the gap between simulation and reality in the other direction too. By converting the world into object meshes. Real2Sim.
The OpenAI Rubik’s cube hand policy transfer was done with camera in simulation and camera in real world. This could allow a sort of dreaming, i.e., running simulations on new 3d obj data.)
It could acquire data that it could mull over, when chickens are asleep.
There’s no chicken category in Pix3d. But getting closer. Just need a chicken and egg dataset.
Downloading blender again, to check out the obj file that was generated. Ok Blender doesn’t want to show it, but here’s a handy site https://3dviewer.net/ to view OBJ files. The issue in blender required selecting the obj, then View > Frame Selected to make it zoom in. Switching to orthographic from perspective view also helps.
We provide the first evidence that a smart algorithm with modest CPU OpenMP parallelism can outperform the best available hardware NVIDIA-V100, for training large deep learning architectures. Our system SLIDE is a combination of carefully tailored randomized hashing algorithms with the right data structures that allow asynchronous parallelism. We show up to 3.5x gain against TF-GPU and 10x gain against TF-CPU in training time with similar precision on popular extreme classification datasets. Our next steps are to extend SLIDE to include convolutional layers. SLIDE has unique benefits when it comes to random memory accesses and parallelism. We anticipate that a distributed implementation of SLIDE would be very appealing because the communication costs are minimal due to sparse gradients.
Conclusion The explosion in computing power used for deep learning models has ended the “AI winter” and set new benchmarks for computer performance on a wide range of tasks. However, deep learning’s prodigious appetite for computing power imposes a limit on how far it can improve performance in its current form, particularly in an era when improvements in hardware performance are slowing. This article shows that the computational limits of deep learning will soon be constraining for a range of applications, making the achievement of important benchmark milestones impossible if current trajectories hold. Finally, we have discussed the likely impact of these computational limits: forcing Deep Learning towards less computationally-intensive methods of improvement, and pushing machine learning towards techniques that are more computationally-efficient than deep learning.
Yeah, well, the neocortex has like 7 “hidden” layers, with sparse distributions, with voting / normalising layers. Just a 3d graph of neurons, doing some wiggly things.