Categories
dev envs Linux OpenCV

Installing OpenCV on Jetson

I have the Jetson NX, and I’ve spent the last couple of days trying to install OpenCV, and am still fighting it. But we’re going to give it a few more rounds.

Let’s see, so I used DustyNV’s Dockerfile with an OpenCV setup for 4.4 or 4.5.

But the build dies, or is still missing libraries. There’s a bunch of them, and as I’m learning, everything is a linker issue. Everything. sudo ldconfig.

Here’s a 2019 quora answer to “What is sudo ldconfig in linux?”

“ldconfig updates the cache for the linker in a UNIX environment with libraries found in the paths specified in “/etc/ld.so.conf”. sudo executes it with superuser rights so that it can write to “/etc/ld.so.cache”.

You usually use this if you get errors about some dynamically linked libraries not being found when starting a program although they are actually present on the system. You might need to add their paths to “/etc/ld.so.conf” first, though.” – Marcel Noe

So taking my own advice, let’s see:

chicken@chicken:/etc/ld.so.conf.d$ find | xargs cat $1
cat: .: Is a directory
/opt/nvidia/vpi1/lib64
/usr/local/cuda-10.2/targets/aarch64-linux/lib
# Multiarch support
/usr/local/lib/aarch64-linux-gnu
/lib/aarch64-linux-gnu
/usr/lib/aarch64-linux-gnu
/usr/lib/aarch64-linux-gnu/libfakeroot
# libc default configuration
/usr/local/lib
/usr/lib/aarch64-linux-gnu/tegra
/usr/lib/aarch64-linux-gnu/fakechroot
/usr/lib/aarch64-linux-gnu/tegra-egl
/usr/lib/aarch64-linux-gnu/tegra

Ok. On our host (Jetson), let’s see if we can install it, or access it. It’s Jetpack 4.6.1 so it should have it installed already.

ImportError: libblas.so.3: cannot open shared object file: No such file or directory

cd /usr/lib/aarch64-linux-gnu/
ls -l libblas.so*

libblas.so -> /etc/alternatives/libblas.so-aarch64-linux-gnu

cd /etc/alternatives
ls -l libblas.so*

libblas.so.3-aarch64-linux-gnu -> /usr/lib/aarch64-linux-gnu/atlas/libblas.so.3
libblas.so-aarch64-linux-gnu -> /usr/lib/aarch64-linux-gnu/atlas/libblas.so

Let’s see that location…

chicken@chicken:/usr/lib/aarch64-linux-gnu/atlas$ ls -l
total 22652
libblas.a
libblas.so -> libblas.so.3.10.3
libblas.so.3 -> libblas.so.3.10.3
libblas.so.3.10.3
liblapack.a
liblapack.so -> liblapack.so.3.10.3
liblapack.so.3 -> liblapack.so.3.10.3
liblapack.so.3.10.3

And those are shared objects.

So why do we get ‘libblas.so.3: cannot open shared object file: No such file or directory?’
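A quick way to ask the same question without importing all of cv2 is to have Python try to open the library by its soname. This is just a minimal sketch using ctypes from the standard library; the helper name can_load is mine, and the two sonames are the ones from the error and the listing above.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the shared library `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the loader cannot find or open the file.
        return False


if __name__ == "__main__":
    for name in ("libblas.so.3", "liblapack.so.3"):
        print(name, "loads" if can_load(name) else "cannot be opened")
```

If this says a library cannot be opened while the file is sitting right there on disk, the problem is the loader's search path or cache rather than a missing package.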

So let’s try

sudo apt-get install libopenblas-dev liblapack-dev libatlas-base-dev gfortran


Sounds promising. Ha it worked.

chicken@chicken:/usr/lib/aarch64-linux-gnu/atlas$ python3
Python 3.6.9 (default, Dec  8 2021, 21:08:43) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> 

Now let’s try to fix DustyNV’s Dockerfile. Oops, right: it takes forever to build things, or even to download and install them, so try not to change anything early on in the install. Besides, Dusty’s setup already installs these packages, so it’s not that they’re missing. It’s some linking issue.

Ok I start up the NV docker and try import cv2, but

admin@chicken:/workspaces/isaac_ros-dev$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/cv2/__init__.py", line 96, in <module>
    bootstrap()
  File "/usr/local/lib/python3.6/dist-packages/cv2/__init__.py", line 86, in bootstrap
    import cv2
ImportError: libtesseract.so.4: cannot open shared object file: No such file or directory

So where is that file?

admin@chicken:/usr/lib/aarch64-linux-gnu$ find | grep tess | xargs ls -l $1
-rw-r--r-- 1 root root 6892600 Apr  7  2018 ./libtesseract.a
lrwxrwxrwx 1 root root      21 Apr  7  2018 ./libtesseract.so -> libtesseract.so.4.0.0
-rw-r--r-- 1 root root     481 Apr  7  2018 ./pkgconfig/tesseract.pc

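Well indeed, it plainly doesn’t exist: there’s no libtesseract.so.4 at all, and libtesseract.so points at a libtesseract.so.4.0.0 that isn’t there either.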
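Spotting this by eye in an ls listing is easy to miss, so here is a small sketch that lists the symlinks in a directory whose targets are missing. It only uses the standard library; the function name dangling_symlinks is made up for this post.

```python
import os


def dangling_symlinks(directory):
    """Return sorted names of symlinks in `directory` whose target is missing."""
    broken = []
    for entry in os.scandir(directory):
        # os.path.exists() follows the link, so it is False for a dangling one.
        if entry.is_symlink() and not os.path.exists(entry.path):
            broken.append(entry.name)
    return sorted(broken)


if __name__ == "__main__":
    lib_dir = "/usr/lib/aarch64-linux-gnu"
    if os.path.isdir(lib_dir):
        print(dangling_symlinks(lib_dir))
```

In the container above I would expect libtesseract.so to show up in that list, since its target libtesseract.so.4.0.0 is not in the listing.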

admin@chicken:/$ sudo apt-get install libtesseract-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libarchive13 librhash0 libuv1
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  libleptonica-dev
The following NEW packages will be installed:
  libleptonica-dev libtesseract-dev
0 upgraded, 2 newly installed, 0 to remove and 131 not upgraded.
Need to get 2,666 kB of archives.
After this operation, 14.1 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://ports.ubuntu.com/ubuntu-ports bionic/universe arm64 libleptonica-dev arm64 1.75.3-3 [1,251 kB]
Get:2 http://ports.ubuntu.com/ubuntu-ports bionic/universe arm64 libtesseract-dev arm64 4.00~git2288-10f4998a-2 [1,415 kB]
Fetched 2,666 kB in 3s (842 kB/s)           
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libleptonica-dev.
dpkg: warning: files list file for package 'libcufft-10-2' missing; assuming package has no files currently installed
dpkg: warning: files list file for package 'cuda-cudart-10-2' missing; assuming package has no files currently installed
(Reading database ... 97997 files and directories currently installed.)
Preparing to unpack .../libleptonica-dev_1.75.3-3_arm64.deb ...
Unpacking libleptonica-dev (1.75.3-3) ...
Selecting previously unselected package libtesseract-dev.
Preparing to unpack .../libtesseract-dev_4.00~git2288-10f4998a-2_arm64.deb ...
Unpacking libtesseract-dev (4.00~git2288-10f4998a-2) ...
Setting up libleptonica-dev (1.75.3-3) ...
Setting up libtesseract-dev (4.00~git2288-10f4998a-2) ...

ImportError: libtesseract.so.4: cannot open shared object file: No such file or directory

admin@chicken:/$ echo $LD_LIBRARY_PATH 
/opt/ros/foxy/install/opt/yaml_cpp_vendor/lib:/opt/ros/foxy/install/lib:/usr/lib/aarch64-linux-gnu/tegra-egl:/usr/local/cuda-10.2/targets/aarch64-linux/lib:/usr/lib/aarch64-linux-gnu/tegra:/opt/nvidia/vpi1/lib64:/usr/local/cuda-10.2/targets/aarch64-linux/lib::/opt/tritonserver/lib

So this sounds like a linker issue to me. We tell the linker where things are, it finds them.

admin@chicken:/$ export LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/atlas/:/usr/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu/lapack:$LD_LIBRARY_PATH

admin@chicken:/$ sudo ldconfig

Ok, would be bad to have too much hope now. Let’s see… no, of course it didn’t work.
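In hindsight, adding directories to the search path can't help if the file isn't in any of them. As far as I understand it, ldconfig builds its cache from /etc/ld.so.conf and does not read LD_LIBRARY_PATH at all; that variable is only consulted by the loader when a program starts. A rough sketch to check which directories on a search path actually contain a given soname (the helper name dirs_containing is hypothetical):

```python
import os


def dirs_containing(soname, search_path):
    """List the directories in a colon-separated `search_path` that hold `soname`."""
    hits = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # empty entries, like the "::" above, are skipped in this sketch
        # exists() follows symlinks, so a dangling link does not count as a hit.
        if os.path.exists(os.path.join(directory, soname)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print(dirs_containing("libtesseract.so.4", path))
```

An empty list here means no amount of path juggling will fix the import; the library itself has to be installed somewhere first.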

So let’s see what libleptonica and libtesseract-dev set up

./usr/lib/aarch64-linux-gnu/libtesseract.so
./usr/lib/aarch64-linux-gnu/libtesseract.a
./usr/lib/aarch64-linux-gnu/pkgconfig/tesseract.pc

And it wants 

admin@chicken:/$ ls -l ./usr/lib/aarch64-linux-gnu/libtesseract.so
lrwxrwxrwx 1 root root 21 Apr  7  2018 ./usr/lib/aarch64-linux-gnu/libtesseract.so -> libtesseract.so.4.0.0

and yeah it's installed.  

Start-Date: 2022-04-15  14:03:03
Commandline: apt-get install libtesseract-dev
Requested-By: admin (1000)
Install: libleptonica-dev:arm64 (1.75.3-3, automatic), libtesseract-dev:arm64 (4.00~git2288-10f4998a-2)
End-Date: 2022-04-15  14:03:06

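This guy has a smart idea, to install them, which is pretty clever. But I tried that already, and tesseract’s build failed, of course. Then it complains about undefined references to jpeg, png, TIFF, zlib, etc. Hmm. All that shit is installed. (The references come out of the static liblept.a, and a static archive doesn’t pull in its own dependencies, so the image libraries would have to be named on the link line as well.)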

/usr/lib/gcc/aarch64-linux-gnu/8/../../../aarch64-linux-gnu/liblept.a(libversions.o): In function `getImagelibVersions':
(.text+0x98): undefined reference to `jpeg_std_error'
(.text+0x158): undefined reference to `png_get_libpng_ver'
(.text+0x184): undefined reference to `TIFFGetVersion'
(.text+0x1f0): undefined reference to `zlibVersion'
(.text+0x21c): undefined reference to `WebPGetEncoderVersion'
(.text+0x26c): undefined reference to `opj_version'

But so here’s the evidence: cv2 is looking for libtesseract.so.4, which doesn’t exist at all. And even if we symlinked it to point to the libtesseract.so file, that just links to libtesseract.so.4.0.0, which doesn’t exist either.

Ah. Ok, I had to sudo apt-get install libtesseract-dev on the Jetson host, not inside the docker! Hmm. Right. Because the container shares most of its libs with the host anyway, it’s gotta be on the host.

admin@chicken:/usr/lib/aarch64-linux-gnu$ ls -l *tess*
-rw-r--r-- 1 root root 6892600 Apr  7  2018 libtesseract.a
lrwxrwxrwx 1 root root      21 Apr  7  2018 libtesseract.so -> libtesseract.so.4.0.0
lrwxrwxrwx 1 root root      21 Apr  7  2018 libtesseract.so.4 -> libtesseract.so.4.0.0
-rw-r--r-- 1 root root 3083888 Apr  7  2018 libtesseract.so.4.0.0



admin@chicken:/usr/lib/aarch64-linux-gnu$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>> exit()


Success.

So now back to earlier, we were trying to run jupyter lab, to try run the camera calibration code again. I added the installation to the dockerfile. So this command starts it up at http://chicken:8888/lab (or name of your computer).

jupyter lab --ip 0.0.0.0 --port 8888 --allow-root &

Needed matplotlib, so I just did a quick install with the %pip magic:

%pip install matplotlib

ModuleNotFoundError: No module named 'matplotlib'

Note: you may need to restart the kernel to use updated packages.

K... restart kernel.  Kernel -> Restart.

Ok now I’m going to try calibrate the stereo cameras, since OpenCV is back.

Seems after some successes, the cameras are not creating capture sessions anymore, even from the host. Let’s reboot.

Alright, new topic. Calibrating stereo cameras.

Categories
3D Research AI/ML CNNs deep dev envs evolution GANs Gripper Gripper Research Linux Locomotion sexing sim2real simulation The Sentient Table UI Vision

Simulation Vision

We’ve got an egg in the gym environment now, so we need to collect some data for training the robot to go pick up an egg.

I’m going to have it save the rgba, depth and segmentation images to disk for Unet training. I left out the depth image for now. The pictures don’t look useful. But some papers are using the depth, so I might reconsider. Some weed bot paper uses 14-channel images with all sorts of extra domain specific data relevant to plants.

I wrote some code to take pics if the egg was in the viewport, and it took 1000 rgb and segmentation pictures or so. I need to change the colour of the egg for sure, and probably randomize all the textures a bit. But main thing is probably to make the segmentation layers with pixel colours 0,1,2, etc. so that it detects the egg and not so much the link in the foreground.

So sigmoid to softmax and so on. Switching to multi-class also raises the question of whether to switch to PyTorch and COCO panoptic segmentation based training. It will have to happen eventually, as I think all of the fastest implementations are currently PyTorch and COCO based. Keras might work fine for multi-class or multiple binary classification, but it’s sort of the beginning attempt: something that works, more proof of concept than final implementation. But I think Keras will be good enough for these in-simulation 256×256 images.

Regarding multi-class segmentation, karolzak says “it’s just a matter of changing num_classes argument and you would need to shape your mask in a different way (layer per class??), so for multiclass segmentation you would need a mask of shape (width, height, num_classes)”

I’ll keep logging my debugging though, if you’re reading this.

So I ran segmask_linkindex.py to see what it does, and how to get more useful data. The code is not running because the segmentation image actually has an array of arrays. I presume it’s a numpy array. I think it must be the rows and columns. So anyway I added a second layer to the loop, and output the pixel values, and when I ran it in the one mode:

-1
-1
-1
83886081
obUid= 1 linkIndex= 4
83886081
obUid= 1 linkIndex= 4
1
obUid= 1 linkIndex= -1
1
obUid= 1 linkIndex= -1
16777217
obUid= 1 linkIndex= 0
16777217
obUid= 1 linkIndex= 0
-1
-1
-1

And in the other mode

-1
-1
-1
1
obUid= 1 linkIndex= -1
1
obUid= 1 linkIndex= -1
1
obUid= 1 linkIndex= -1
-1
-1
-1

Ok I see. Hmm. Well the important thing is that this code is indeed for extracting the pixel information. I think it’s going to be best for the segmentation to use the simpler segmentation mask that doesn’t track the link info. Ok so I used that code from the guy’s thesis project, and that was interpolating the numbers. When I look at the unique elements of the mask without interpolation, I’ve got…

[  0   2 255]
[  0   2 255]
[  0   2 255]
[  0   2 255]
[  0   2 255]
[  0   1   2 255]
[  0   1   2 255]
[  0   2 255]
[  0   2 255]

Ok, so I think:

255 is the sky
0 is the plane
2 is the robotable
1 is the egg

So yeah, I was just confused because the segmentation masks were all black and white. But if you look closely with a pixel picker tool, the pixel values are (0,0,0), (1,1,1), (2,2,2), (255,255,255), so I just couldn’t see it.

The interpolation kinda helps, to be honest.

As per OpenAI’s domain randomization helping with Sim2Real, we want to randomize some textures and some other things like that. I also want to throw in some random chickens. Maybe some cats and dogs. I’m afraid of transfer learning, at this stage, because a lot of it has to do with changing the structure of the final layer of the neural network, and that might be tough. Let’s just do chickens and eggs.

An excerpt from OpenAI:

Costs

Both techniques increase the computational requirements: dynamics randomization slows training down by a factor of 3x, while learning from images rather than states is about 5-10x slower.

Ok that’s a bit more complex than I was thinking. I want to randomize textures and colours, first

I’ve downloaded and unzipped the ‘Describable Textures Dataset’

And ok it’s loading a random texture for the plane

and random colour for the egg and chicken
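Roughly, the randomization comes down to two random choices per episode. This is only a sketch of that part, not my actual environment code: the helper names and the directory layout of the unzipped dataset are assumptions.

```python
import os
import random


def pick_texture(texture_dir, rng=random):
    """Pick one image file at random from anywhere under `texture_dir`."""
    candidates = []
    for root, _dirs, files in os.walk(texture_dir):
        for name in files:
            if name.lower().endswith((".jpg", ".jpeg", ".png")):
                candidates.append(os.path.join(root, name))
    if not candidates:
        raise FileNotFoundError("no texture images under " + texture_dir)
    return rng.choice(sorted(candidates))


def random_rgba(rng=random):
    """Return a random opaque colour as [r, g, b, a], components in [0, 1]."""
    return [rng.random(), rng.random(), rng.random(), 1.0]
```

In PyBullet the chosen file would then go through loadTexture, and the resulting texture id (or the rgba list) would be passed to changeVisualShape for the plane, egg or chicken body.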

Ok, next thing is the Simulation CNN.

Interpolation doesn’t work though, for this, cause it interpolates from what’s available in the image:

[  0  85 170 255]
[  0  63 127 191 255]
[  0  63 127 191 255]

I kind of need the basic UID segmentation.

[  0   1   2   3 255]

Ok, pity about the mask colours, but anyway.

Let’s train the UNet on the new dataset.

We’ll need to make karolzak’s changes.

I’ve saved 2000+ rgb.jpg and seg.png files and we’ve got [0,1,2,3,255] [plane, egg, robot, chicken, sky]

So num_classes=5

And

“for multiclass segmentation you would need a mask of shape (width, height, num_classes) “

What is y.shape?

(2001, 256, 256, 1)

which is 2001 files, of 256 x 256 pixels, and one class. So if I change that to 5…? ValueError: cannot reshape array of size 131137536 into shape (2001,256,256,5)

Um… Ok I need to do more research. Brb.

So the keras_unet library is set up to input binary masks per class, and output binary masks per class.

I would rather use the ‘integer’ class output, and have it output a single array, with the class id per pixel. Similar to this question. In preparation for karolzak probably not knowing how to do this with his library, I’ve asked on stackoverflow for an elegant way to make the binary masks from a multi-class mask, in the meantime.

I coded it up using the library author’s suggested method, as he pointed out that the gains of the integer encoding method are minimal. I’ll check it out another time. I think it might still make sense for certain cases.
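The reshape error makes sense in hindsight: 131137536 is exactly 2001 × 256 × 256, so there is one value per pixel, and five channels would need five times as many. The single channel has to be expanded into one binary layer per class, not reshaped. A minimal NumPy sketch of that expansion, assuming the class ids listed above (the function name to_binary_masks is mine, not something from keras_unet):

```python
import numpy as np

CLASS_IDS = [0, 1, 2, 3, 255]  # plane, egg, robot, chicken, sky


def to_binary_masks(seg, class_ids=CLASS_IDS):
    """Turn integer masks (..., H, W) into binary masks (..., H, W, num_classes)."""
    seg = np.asarray(seg)
    # Broadcasting compares every pixel against every class id at once.
    return (seg[..., np.newaxis] == np.asarray(class_ids)).astype(np.float32)


if __name__ == "__main__":
    demo = np.array([[0, 1], [255, 3]])
    print(to_binary_masks(demo).shape)
```

For the array above, something like to_binary_masks(y[..., 0]) should give shape (2001, 256, 256, 5), and dropping plane and sky is just a matter of passing a shorter class_ids list.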

Ok that’s pretty awesome. We have 4 masks. Human, chicken, egg, robot. I left out plane and sky for now. That was just 2000 images of training, and I have 20000. I trained on another 2000 images, and it’s down to 0.008 validation loss, which is good enough!

So now I want to load the CNN model in the locomotion code, and feed it the images from the camera, and then have a reward function related to maximizing the egg pixels.
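I haven't written the reward yet, but the simplest version I can think of is the fraction of the image that the network labels as egg. A sketch under that assumption, where egg_mask is the egg layer of the model output as per-pixel probabilities and the 0.5 threshold is arbitrary:

```python
import numpy as np


def egg_pixel_reward(egg_mask, threshold=0.5):
    """Fraction of pixels whose predicted egg probability exceeds `threshold`."""
    egg_mask = np.asarray(egg_mask, dtype=np.float32)
    return float((egg_mask > threshold).mean())


if __name__ == "__main__":
    fake_prediction = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(egg_pixel_reward(fake_prediction))
```

The value stays between 0 and 1 regardless of image size, which should make it easier to mix with other reward terms later.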

I also need to look at the pybullet-planning project and see what it consists of, as I imagine they’ve made some progress on the next steps. “built-in implementations of standard motion planners, including PRM, RRT, biRRT, A* etc.” – I haven’t even come across these acronyms yet! Ok, they are motion planning. Solvers of some sort. Hmm.

Categories
AI/ML CNNs deep dev GANs Linux sexing Vision

Cloud GPUs: GCP

The attempted training of the U-Net on the Jetson NX has been a bit slow, making odd progress over 2 nights, and I’m not sure if it’s working. I’ve had to reduce batch size to 1, and the filter size, which has reduced the number of parameters by about a factor of 10, and still, loading the NN into memory sometimes dies on a concatenation call. The number of images per batch can also crash it, so perhaps some memory can be saved with a better image loading process.

Anyway, projects under an official NVIDIA repo are suggesting that we should be able to train smaller networks like resnet18, with 11 million parameters, on the Jetson. So maybe we can still avoid the cloud.

But judging by the NVIDIA TLT info, any training of resnet50s or 100s is going to need serious GPUs, memory and disk space.

After looking at Google, Amazon and Microsoft offerings, the AWS g4dn.xlarge instance looks like it might be the best option, at $0.526/hr, or Google’s got a T4 based compute engine for only $0.35/hr. These are good options, if 16GB of video ram will be enough. It should be, because we’re working with like 5GB on the Jetson.

Microsoft has the NC6 option, which looks good for a much more beefy GPU and memory, at $0.90/hr.

We’re just looking at Pay-as-you-go prices, as the 1-year and 3-year commitments will end up being expensive.

I’m still keen to try train on the Jetson, but the cloud is becoming more and more probable. In Sweden, visiting Miranda, we’re unable to order a Jetson AGX Xavier, the 32GB version. Arrow won’t ship here without a VAT number, and SiliconHighway is out of stock.

So, attempting Cloud GPUs. If you want to cut to the chase, read this one backwards. So many problems. In the end, it turned out setting it up yourself is practically impossible, but there is an ‘AI Platform’ section that works.

Amazon AWS. Tried to log in to AWS. “Authentication failed because your account has been suspended.” Tells me to create a new account. But then brings me back to the same failure screen. Ok, sending email to their accounts department. Next.

Google Cloud. I tried to create a VM and add a T4 GPU, but none of the regions have them. So I need to download the Gcloud SDK and CLI tool first, to run a command to describe the regions, according to the ‘Before you begin‘ instructions.

Ok, GPUs will only run on N1 and A2 VMs. The A2 VMs are only for A100s, so I need an N1 VM in one of these regions, and we add a T4 GPU.

There’s an option to load a specific docker, and unfortunately they don’t seem to have one with both Pytorch and TF2. Let’s start with TF2 gcr.io/deeplearning-platform-release/tf2-gpu.2-4

So this looks like a good enough VM. 30GB RAM, 8 cpus. For europe-west3, the cost is about 50 cents / hr for the VM and 41 cents / hr for the GPU.

n1-standard-8    8 vCPUs    30GB    $0.4896    $0.09840
1 GPU    16 GB GDDR6    $0.41 per GPU

So let’s round up to about $1/hour. I ended up picking the n1-standard-4 (4 cpus, 15 gb ram).
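To keep an eye on the bill, the arithmetic is simple enough to script. This sketch uses the n1-standard-8 and T4 figures quoted above; I don't have the n1-standard-4 price written down, so plug in the real number for whatever machine you end up with.

```python
VM_PER_HOUR = 0.4896   # n1-standard-8, from the figures above
GPU_PER_HOUR = 0.41    # one T4


def hourly_cost(vm=VM_PER_HOUR, gpu=GPU_PER_HOUR, gpus=1):
    """Combined on-demand price per hour for the VM plus its GPUs."""
    return vm + gpu * gpus


def hours_for_budget(budget, per_hour):
    """How many hours of runtime a given budget buys."""
    return budget / per_hour


if __name__ == "__main__":
    rate = hourly_cost()
    print("about $%.2f per hour" % rate)
    print("$50 lasts about %.0f hours" % hours_for_budget(50, rate))
```

That is where the "about $1/hour" comes from: roughly 49 cents for the VM plus 41 cents for the GPU.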

At these prices I’ll want to get things up and running asap. So I am going to prep a bit, before I click the Create VM button.

I had to try a few things to find a cloud instance with a gpu, because the official list didn’t really work. I eventually got one with a T4 GPU from europe-west4-c.

It seems like Google Drive isn’t really part of the google cloud platform ecosystem, so I started a storage bucket with 50GB of space, and am uploading the chicken images to it.

The instance doesn’t have pip or jupyter installed. So let’s do that…

ok so when I sudo’ed, I got this error

Jul 20 14:45:01 chicken-vm konlet-startup[1665]: {"errorDetail":{"message":"write /var/lib/docker/tmp/GetImageBlob362062711: no space left on device"},"error":"write /var/lib/docker/tmp/GetImageBl
 Jul 20 14:45:01 chicken-vm konlet-startup[1665]: ).
 Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 No containers created by previous runs of Konlet found.
 Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 Found 0 volume mounts in container chicken-vm declaration.
 Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 Error: Failed to start container: Error: No such image: gcr.io/deeplearning-platform-release/tf2-gpu.2-4
 Jul 20 14:45:01 chicken-vm konlet-startup[1665]: 2021/07/20 14:43:04 Saving welcome script to profile.d

So 10GB wasn’t enough to load gcr.io/deeplearning-platform-release/tf2-gpu.2-4 , I guess.

Ok deleting the VM. Next time, bigger hard drive. I’m now adding a cloud storage bucket and uploading the chicken images, so I can copy them to the VM’s drive later. It’s taking forever. Wow. Ok.

Now I am trying to spin up a VM again, and it’s practically impossible. I’ve tried every region and zone possible. Ok europe-west1-c. Finally. I also upped my ‘quota’ of gpus, under IAM->Quotas, in case that is a reason I couldn’t find a GPU VM. They reviewed and approved it in about 15 minutes.

+------------------+--------+-----------------+
|       Name       | Region | Requested Limit |
+------------------+--------+-----------------+
| GPUS_ALL_REGIONS | GLOBAL |        1        |
+------------------+--------+-----------------+

So after like 10 minutes of nothing, I see the docker container started up.

68ee22bf268f gcr.io/deeplearning-platform-release/tf2-gpu.2-4 "/entrypoint.sh /run…" 5 minutes ago Up 4 minutes klt-chicken-vm-template-1-ursn

I’ve enabled tcp:8080 port in the firewall settings, but the external ip and new port don’t seem to connect. https://35.195.66.139:8080/ Ah ha. http. We’re in!

Jupyter Lab starting up.

So I tried to download the gcloud tools to get gsutil to access my storage bucket, but was getting ‘Permission denied’, even as root. I chown’ed it to my user, but still no.

I had to go out, so I stopped the VM. Seems you can’t suspend a VM with a GPU. I also saw when I typed ‘sudo -i’ to switch user to root, it said to ‘docker attach’ to my container. But the container is just like a tty printing out logs, so you can get stuck in the docker, and need to ssh in again.

I think the issue was just that I need to be inside the docker to do things. The VM you log into is just a minimal container running environment. So I think that was my issue. Next time I install gsutil, I’ll run ‘docker exec -it 68ee22bf268f bash’ to get into the docker first.

Ok fired up the VM again. This time I exec’ed into the docker, and gsutil was already installed. gsutil cp -r gs://chicken-drive . is copying the files now. It’s slow, and it says to try with -m, for parallel copying, but I’m just going to let it carry on for now. It’s slow, but I can do some other stuff for now. So far our gcloud bill is $1.80.

Ok, /opt/jupyter/chicken-drive has my data now. But according to /opt/jupyter/.jupyter/jupyter_notebook_config.py, I need to move it under /home/jupyter.

Hmm. No space left on drive. What? 26GB all full. But it wasn’t full a second ago. How can moving files cause this? I guess the mv operation must copy and then delete. Ok, so deleting the new one. Let’s try again, one folder at a time. Oh boy. There is something a bit off about the google process. I didn’t start my container, and if I had, I’d probably have mapped a volume. But the host is sort of read only. Anyway. We’re in. I can see the files in Jupyter Lab.
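The copy-then-delete guess is right when the source and destination are on different filesystems: a move is only a cheap rename when both sit on the same one. Here is a sketch of a check I could have run first. The helper names are mine, and it only looks at device ids and free space, so treat it as a rough guide rather than a guarantee inside a container.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """True if both existing paths live on the same device (filesystem)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def tree_size(path):
    """Total size in bytes of the regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def safe_to_move(src, dst_dir):
    """A move is fine if it is a rename, or if dst has room for a full copy."""
    if same_filesystem(src, dst_dir):
        return True
    return shutil.disk_usage(dst_dir).free > tree_size(src)
```

Moving one folder at a time works for the same reason: each copy is small enough to fit before its source gets deleted.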

So now we’re training U-Net binary classification using keras-unet, by karolzak, based on the kz-isbi-challenge.ipynb notebook.

But now I’m getting this error when it’s clearly there…

FileNotFoundError: [Errno 2] No such file or directory: '/OID/v6/images/Chicken/train/'

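Ok well I can’t work it out, but changing it to a path relative to the notebook worked: base_dir = "../../../". (The leading slash makes the original an absolute path from the filesystem root, so it only exists if the data really sits at /OID.)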
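The difference between the two paths is easy to see by resolving them by hand. In this sketch the notebook directory is a made-up example, since I didn't write down where the notebook actually lives; the point is only how a leading slash changes the result.

```python
import posixpath

# Hypothetical notebook location, three levels below the data root.
NOTEBOOK_DIR = "/home/jupyter/chicken-drive/notebooks/unet/run1"


def resolve(base_dir, path):
    """Resolve `path` the way the OS would if `base_dir` were the working directory."""
    return posixpath.normpath(posixpath.join(base_dir, path))


print(resolve(NOTEBOOK_DIR, "/OID/v6/images/Chicken/train/"))
print(resolve(NOTEBOOK_DIR, "../../../OID/v6/images/Chicken/train/"))
```

The first one ignores the notebook directory entirely and lands at /OID under the filesystem root; the second climbs three levels up from the notebook and lands under /home/jupyter/chicken-drive in this made-up layout.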

Ok first test round of training, binary classification: chicken, not-chicken. Just 173 image/mask pairs, 10 epochs of 40 steps.

Now let’s try with the training set. 1989 chickens this time. 50/50 split. 30 epochs of 50 steps. Ok second round… hmm, not so good. Pretty much all black.

Ok I’m changing the parameters of the network, fixing some code, and starting again.

I see that the pngs were loading float values, whereas in the example, they were loading ints. I fixed it by adding a m = m.convert('L') to the mask (png) loading code. I think previously, it was training with the float values from 0 to 1, divided by 255, whereas the original example had int values from 0 to 255, divided by 255.

So I’m also resetting the parameters, to make this a larger network, since we’re training in the cloud. 512×512 instead of 256×256. Batch size of 3. Horizontal flip augmentation. 64 filters. 10 epochs of 100 steps. Go go go. Ok, out of memory. Batch size of 1. Still out of memory. Back to test set of 173 chickens. Ok it’s only maxing at 40% RAM now. I’ll let it run.

Ok, honestly I don’t know anymore. What is it even doing? Looks like it’s inversing black and white. That’s not very useful.

Ok before giving up, I’m going to make some changes.

The next day, I’m starting up the VM. Total cost so far, $8.84. The files are all missing, so I’m recopying, though using the gsutil -m cp -R gs://chicken-drive . option, and yes it is a lot faster. Though it slows down.

I think the current setup is maybe failing because we’re using 173 images with one kind of augmentation. Instead of 10 epochs of 100 steps of the same shit, let’s rather swap out the training images.

First problem is that Keras is basically broken, in this regard. I’ve immediately discovered that saving and loading a checkpoint does not save and load the metrics, and so it keeps evaluating against a loss of infinity, instead of what your saved model achieved. Very annoying.

Now, after stopping and restarting the VM, and enabling all cloud APIs, I’m having a new problem. gsutil no longer works. After 4% copied, network throughput drops to 0.0B/s. I tried reconnecting and now get:

Connection via Cloud Identity-Aware Proxy Failed
Code: 4003
Reason: failed to connect to backend
You may be able to connect without using the Cloud Identity-Aware Proxy.

I’ve switched back to ‘Allow default access’. Still getting 4003.

Ok, I’ve deleted the instance. Trying again. Started it up. It’s not installing the docker I asked for, after 22 minutes. Something is wrong. Let’s try again. Stopping VM. I’m ticking the ‘Run as priviliged’ box this time.

Ok now it’s working again. It even started up with the docker ready. I’m trying with the multiprocess copying again, and it slowed down at 55%, but is still going. Phew. Ok.

I changed to using the TF2 SavedModel format. Still restarts the ‘best’ metric. What a piece of shit. I can’t actually believe it. Ok I wrote my own code for finding the best, by saving all weights with the val_loss in the filename, and then loading the best weights for the next epoch. It’s still not perfect, but it’s better than Keras overwriting the best weights every time.
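The reason the 'best' metric restarts, as far as I can tell, is that a freshly created checkpoint callback has no memory of earlier runs, so its best loss starts at infinity again. My workaround boils down to something like this sketch. The filename scheme weights-<val_loss>.h5 is an assumption for illustration, not necessarily what my code uses.

```python
import os
import re

# Hypothetical naming scheme: weights-<val_loss>.h5, e.g. weights-0.2431.h5
PATTERN = re.compile(r"^weights-(\d+\.\d+)\.h5$")


def scored_checkpoints(directory):
    """Return [(val_loss, path), ...] for matching files, lowest loss first."""
    scored = []
    for name in os.listdir(directory):
        match = PATTERN.match(name)
        if match:
            scored.append((float(match.group(1)), os.path.join(directory, name)))
    return sorted(scored)


def best_checkpoint(directory):
    """Path of the checkpoint with the lowest val_loss, or None if there are none."""
    scored = scored_checkpoints(directory)
    return scored[0][1] if scored else None
```

Before each new round of training, the model loads whatever best_checkpoint returns, so the score survives restarts because it lives in the filename rather than in the callback.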

Interestingly, it seems like maybe my training on the Jetson was actually working, because the same weird little vignette-ing is occurring.

Ok we’re up to $20 billing, on gcloud. It’s adding up, but not too badly yet. Nothing seems to be beating a round of training from like 4 hours ago, so to keep things more exploratory, I added a 50/50 chance to pick from the saved weights at random, rather than loading the winner every time.

Something seems to be happening. The vignette is shrinking, but some chicken border action, maybe.

I left it running overnight, and this morning, we’re up to $33 spent, and today, we can’t log into the VM again. Pretty annoying. Of the 3 reasons for ‘Permission denied’, only one makes sense, Your key expired and Compute Engine deleted your ~/.ssh/authorized_keys file.

Same story if I run the gcloud commands: gcloud beta compute ssh --zone "europe-west4-c" "chicken-vm-template-1" --project "gpu-ggr"

So I apparently need to add a new public key to the Metadata section. I just know something is going to go wrong. Yeah, so I did everything I know I’m supposed to do, and it didn’t work. I generated an OpenSSH private/public key pair in PuttyGen, I changed the permissions on the private key so that only I have access, I updated the SSH Keys in the VM instance metadata, and the metadata for good measure. And ssh -i opensshprivate daniel_brownell@34.91.21.245 -v just ends up with Permission denied (publickey).

ssh-keygen -t rsa -f ~/.ssh/gcloud_instance1 -C daniel_brownell

Ok and then print the public key, and copy paste it to the VM Instance ‘Edit…’ / SSH Keys… and connect with PuTTY with the private key and… nope. Permission denied (publickey). Ok I need to go through these answers and find one that works. Same error with windows cmd line ssh, except it also complains that the openssh key is an invalid format. Try again later.

Fuck you gcloud. Ok I’m stopping and deleting the VM. $43 used so far.

Also, the training through the night didn’t improve on the val_loss score. Something’s fucked.

Ok I’ve started it up again a few days later. I was wondering about the warnings at the beginning of my training that various CUDA things were not installed. So apparently I need:

cos-extensions install gpu

and… no space left on device

Ok so more space.

/dev/sda1 31G 22G 9.2G 70% /mnt/stateful_partition

So I increased the boot disk to 35GB and called ‘cos-extensions install gpu’ again, after cd’ing into /mnt/stateful_partition, and it worked a bit better. It still has ‘ERROR: Unable to load the kernel module 'nvidia.ko'.’ in the logs though. But the install logs at ./mnt/stateful_partition/var/lib/nvidia/nvidia-installer.log say it’s ok…

So the error now is ‘Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64’

And so we need to modify the docker container run command, something like the example in the instructions.

Ok so our container is… gcr.io/deeplearning-platform-release/tf2-gpu.2-4

According to this stackoverflow answer, this already has everything installed. Ok but the host needs the drivers installed.

tf.config.list_physical_devices('GPU')
[]

So yeah, I think I need to install the cos crap, and restart the container with those volume and device bits.

docker stop klt-chicken-vm-template-1-ursn
docker run \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  gcr.io/deeplearning-platform-release/tf2-gpu.2-4 

...

[I 14:54:49.167 LabApp] Jupyter Notebook 6.3.0 is running at:
[I 14:54:49.168 LabApp] http://46fce08b5770:8080/
[I 14:54:49.168 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
^C^C^C^C^C^C^C^C^C^C

Not so good. Ok can’t access it either. -p 8080:8080 fixes that. It didn’t like --gpus all.

“Unable to determine GPU information”. Container optimised shit.

Ok I’m going to delete the VM again. Going to check out these nvidia cloud containers. There’s 21.07-tf2-py3 and NGC stuff.

So I can’t pull the dockers because there’s no space, not even after attaching a persistent disk, because images are stored on the boot disk. Ok, but I can tell docker to store its data on a persistent disk.

/etc/docker/daemon.json:

{
    "data-root": "/mnt/x/y/docker_data"
}

root@nvidia-ngc-tensorflow-test-b-1-vm:/mnt/disks/disk# docker run --gpus all --rm -it -p 8080:8080 -p 6006:6006 nvcr.io/nvidia/tensorflow:21.07-tf2-py3

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.

Followed the Ubuntu 20.04 driver installation,

cuda : Depends: cuda-11-4 (>= 11.4.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Oh boy. Ok so I used this trick to make some /tmp space:

mount --bind /path/to/dir/with/plenty/of/space /tmp

and then as per this answer and the nvidia instructions:

wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda_11.1.0_455.23.05_linux.run
chmod +x cuda_11.1.0_455.23.05_linux.run 
sudo ./cuda_11.1.0_455.23.05_linux.run 

or some newer version:

wget https://developer.download.nvidia.com/compute/cuda/11.4.1/local_installers/cuda_11.4.1_470.57.02_linux.run
sudo sh cuda_11.4.1_470.57.02_linux.run

‘boost::filesystem::filesystem_error’

Ok using all the space again. 32GB. Not enough. Fuck this. I’m deleting the VM again. 64GB. SSD persistent disk. Ok installed driver. Running docker…

And…

FFS. Something is compromised. In the time it took to install CUDA and run docker on an Ubuntu VM, someone (or something) managed to delete my root user.

Ok. Maybe it’s time to consider AWS again for GPUs. I think I can officially count GCP GPU as unusable. Learned a few useful things, but overall, yeesh.

I think maybe I’ll just run the training on a cheap non-GPU VM on GCP for now, so that I’m not paying for a GPU that I’m not using.

docker run -d -p 8080:8080 -v /home/daniel_brownell:/home/jupyter gcr.io/deeplearning-platform-release/tf2-cpu.2-4

Ok wow so now with the cpu version, the loss is improving like crazy. It went from 0.28 to 0.24 in 10 epochs (10 minutes or so). That sort of improvement was not happening after like 10 hours on the ‘gpu’.

So yeah, amazing. The code now does a sort of population based training, by picking a random previous set of weights instead of the best weights, half of the time. Overall it slows things down, but should result in a bit more variation in the end.
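
A rough sketch of that selection step, not the actual training code. It assumes checkpoints are named like weights-0.2439.hdf5 (the filename that shows up in an error message further down), with the loss in the name. The function name and the explore_prob parameter are made up for illustration.

```python
import random
import re

# Assumed naming scheme: weights-<loss>.hdf5
WEIGHTS_PATTERN = re.compile(r"weights-(\d+\.\d+)\.hdf5$")


def pick_starting_weights(filenames, rng=random, explore_prob=0.5):
    """Return the checkpoint to resume from, or None if there are none.

    With probability explore_prob a random earlier checkpoint is chosen,
    otherwise the one with the lowest loss in its filename.
    """
    scored = []
    for name in filenames:
        match = WEIGHTS_PATTERN.search(name)
        if match:
            scored.append((float(match.group(1)), name))
    if not scored:
        return None
    if rng.random() < explore_prob:
        return rng.choice(scored)[1]  # explore: any previous set of weights
    return min(scored)[1]  # exploit: the best (lowest loss) weights
```

Setting explore_prob=0 gives the plain "always resume from the best weights" behaviour, which makes it easy to compare the two approaches.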

What finally worked

Ok there’s also an ‘AI platform – notebook’ option. I might try that too.

Ok the instance started up. But it failed to start 4 services: nscd, unscd, crond, sshd. CPU use goes to zero. Nothing. Ok so I need to ssh tunnel apparently.

gcloud compute ssh --project gpu-ggr --zone europe-west1-b notebook -- -L 8080:localhost:8080

Ok that was easy. Let’s try this.

Successfully opened dynamic library libcudart.so.11.0

‘ModelCheckpoint’ object has no attribute ‘_implements_train_batch_hooks’

Ok, needed to change all keras.* etc. to tensorflow.keras.*

Ok fuck me that’s a lot faster than CPU.

Permission denied: ‘weights-0.2439.hdf5’

Ok, let’s sudo it.

Ok there she goes. It’s like 20 times faster maybe. Strangely isn’t doing much better than the CPU though. But I’ll let it run for a bit. It’s only been a minute. I think maybe the CPU doing well was just good luck. Perhaps we trained them too well on the original set of like 173 images, and it was getting good results on those original images.

Ok now it’s been an hour or so, and it’s not beating the CPU. I’ve changed the train / validation set to 50/50 now, and the learning rate is randomly chosen between 0.001 and 0.0003. And I’m upping the epochs to 30. And the filters to 64. batch_size=4, use_batch_norm=True.

We’re down to 0.233 after an hour and a half. 0.21 now, after maybe 3 hours.

Ok 5 hours, let’s check:

Holy shit it’s working. That’s great. I’ll leave it running overnight. The overnight results didn’t improve much for some reason.

(TODO: learn about focal loss / dice loss / Jaccard distance as possible changes to the loss function? Less necessary now.)

So it’s cool but it’s 364MB. We need it 1/4 size to run it on the Jetson NX I think.

So, retraining, with filters=32. We’re already down to 0.24 after an hour. Ok I stopped at 0.2104 after a few hours.
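
Why halving the filters should give roughly a quarter of the size: a convolution layer's weight count is the product of its input and output channels, so when both halve, the layer shrinks about 4x. A back-of-envelope sketch, assuming 3x3 convolutions and that most of the weights sit in layers whose input and output widths both scale with filters (an assumption about this U-net, not something counted from the real model):

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases for one 2D convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Hypothetical mid-network layer at the two settings.
wide = conv2d_params(3, 64, 64)    # filters=64
narrow = conv2d_params(3, 32, 32)  # filters=32
ratio = narrow / wide

print(wide, narrow, round(ratio, 3))  # 36928 9248 0.25
```

By that estimate the 364MB file would come down to somewhere around 90MB.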

So yeah. Good enough for now.

There’s some other things to train, too.

The eggs in simulation: generate views, save images to disk. save segmentation images to disk.

Train the walking again with the gripper.

Eggs in the real world. Use augmentation to place real egg pics in scenes. Possibly use Mask-RCNN/YOLACT code with COCO, instead of continuing in Keras.

The now-working U-net binary chicken segmentation is in Keras, so there will be some tricks required, to run a multi-class segmentation detector, or multiple binary classifiers. Advice for multi-class segmentation is here and the multiple binary classifier advice is here.

When we finally try running it all on a Jetson, we will maybe need to shrink the neural network further. But that can be done last minute. It looks like we can save the HDF5 (.h5) model to TF2’s SavedModel format with model.save(model_fname) and convert it to a frozen graph, to import into TensorRT, NVIDIA’s inference optimizer. Similar to this. TensorRT can quantize weights down to single bytes (INT8), I believe.

Categories
dev Linux

Low Linux memory

Spent enough time on this to warrant a note.

For some reason, pip install torch, which is what I was trying to do, kept dying. It’s a 700MB file, and top showed out of memory.

Ultimately the fix for that was:

pip install torch --no-cache-dir

(as far as I can tell, pip was holding the whole 700MB download in memory in order to cache it, and that is what ran out of memory)

I also ended up deleting the contents of ~/.cache/pip, which was 2.2GB. The new pip cache purge command only clears the wheel cache.

Anyway, trying to do development on a 23GB chromebook with GalliumOS gets tough.

I spend a lot of time moving things around. I got myself an NVMe SSD, with 512GB to alleviate the situation.

The most common trick for looking at disk space is df -h to see how full each filesystem is, and du -h --max-depth=1 to see how big the directories below your current dir are.
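
The same two checks can be sketched in Python, which is handy when the answer needs to feed into a script. The function names here are made up for illustration. Note that this sums apparent file sizes, whereas du reports disk blocks, so the numbers will differ a little.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size of regular files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def biggest_subdirs(path="."):
    """Like du --max-depth=1: (size, name) per direct subdirectory, largest first."""
    sizes = []
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_size_bytes(entry.path), entry.name))
    return sorted(sizes, reverse=True)


def free_space_gib(path="/"):
    """Like df -h for one mount point: free space in GiB."""
    return shutil.disk_usage(path).free / 2**30
```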

So, first thing first, the SSD doesn’t want to show up. Ah, the USB-C wasn’t pushed in all the way. Derp.

Second, to clear up some space, trim the systemd journal logs.

https://unix.stackexchange.com/questions/139513/how-to-clear-journalctl :

set a max amount of logs to retain (by time/space):
journalctl --vacuum-time=2d
journalctl --vacuum-size=500M

The third thing is to make some more swap space, just in case.

touch /media/chrx/0FEC49A4317DA4DA/swapfile
cd /media/chrx/0FEC49A4317DA4DA/
# 2048 bytes x 1048576 blocks = 2GiB of zeros
sudo dd if=/dev/zero of=swapfile bs=2048 count=1048576
sudo mkswap swapfile
sudo swapon swapfile

swapon

NAME                                   TYPE       SIZE   USED PRIO
/dev/zram0                             partition  5.6G 452.9M   -2
/media/chrx/0FEC49A4317DA4DA/swapfile  file         2G     0B   -3

Ok probably didn’t need more swap space. Not sure where /dev/zram0 is, but maybe I can free up more of it, and up the priority of the SSD?

Anyway, torch is installed now, so nevermind, until I need more memory.

Some more tricks:

Remove thumbnails:

du -sh ~/.cache/thumbnails

rm -rf ~/.cache/thumbnails/*

Clean apt cache:

sudo apt-get clean

Categories
dev Linux

Experiment Analysis and Wheels

The Ray framework has this Analysis class (and an ExperimentAnalysis class), and I noticed the code was kinda buggy, because it should have handled episode_reward_mean being NaN better: https://github.com/ray-project/ray/issues/9826. (episode_reward_mean is an averaged value, so it appears as NaN (Not a Number) for the first few rollouts.) It was fixed a mere 18 days ago, so I can download the nightly release instead.
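
The reason NaN is a problem when picking a best trial: every comparison against NaN is False, so a plain max() can return whatever happened to come first. A minimal sketch of the kind of guard that is needed. This is not Ray's actual fix, and the results mapping is a hypothetical {trial_name: {metric: value}} structure, not Ray's own.

```python
import math


def best_trial(results, metric="episode_reward_mean"):
    """Return the name of the trial with the highest metric, skipping NaN or missing values."""
    best_name, best_value = None, -math.inf
    for name, row in results.items():
        value = row.get(metric)
        if value is None or math.isnan(value):
            continue  # no finished episodes yet, so there is no mean to compare
        if value > best_value:
            best_name, best_value = name, value
    return best_name
```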

# pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp37-cp37m-manylinux1_x86_64.whl

or, since I still have 3.6 installed,

# pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
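
The only difference between those two URLs is the tags in the wheel filename: cp37 or cp36 is the CPython version, cp37m or cp36m is the ABI, and manylinux1_x86_64 is the platform. A small sketch of pulling those fields apart, following the PEP 427 naming scheme. The function name is made up, and filenames with the optional build tag are rejected rather than handled.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 fields (optional build tag not supported)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError("unexpected wheel name (build tag present?): " + filename)
    name, version, python_tag, abi_tag, platform_tag = parts
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl")
print(info["python"], info["platform"])  # cp36 manylinux1_x86_64
```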

Successfully installed aioredis-1.3.1 blessings-1.7 cachetools-4.1.1 colorful-0.5.4 contextvars-2.4 google-api-core-1.22.0 google-auth-1.20.0 googleapis-common-protos-1.52.0 gpustat-0.6.0 hiredis-1.1.0 immutables-0.14 nvidia-ml-py3-7.352.0 opencensus-0.7.10 opencensus-context-0.1.1 pyasn1-0.4.8 pyasn1-modules-0.2.8 ray-0.9.0.dev0 rsa-4.6

https://github.com/openai/gym/issues/1153

Phew. Wheels. Better than eggs, old greybeard probably said. https://pythonwheels.com/

What are wheels?

Wheels are the new standard of Python distribution and are intended to replace eggs.

Categories
Linux

Freeing up space on Ubuntu

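Clear journal entries older than 10 days (a one-off; for a permanent limit set MaxRetentionSec= or SystemMaxUse= in /etc/systemd/journald.conf):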
journalctl --vacuum-time=10d

Clean the apt cache
du -sh /var/cache/apt/archives
apt-get clean

There's a program to visualise space:
apt-get install baobab

Turns out the other big things are

pytorch is 1.3GB

python 2.7 and ansible are 500MB

in /var/lib, docker is 2.5GB and flatpak is 1.5GB

tensorflow is 450MB

I got rid of my old buckets code, and got another 1.5GB back by deleting docker completely and reinstalling it.

https://askubuntu.com/questions/935569/how-to-completely-uninstall-docker

and how to reinstall docker on Ubuntu:

apt-get install docker-ce docker-ce-cli containerd.io

https://docs.docker.com/engine/install/ubuntu/

Got rid of flatpak:

flatpak remote-add flathub https://flathub.org/repo/flathub.flatpakrepo

flatpak uninstall --all

This uninstalled inkscape and something gtk related.

I also got rid of anything python 2 related.


sudo apt-get purge python2.7

Categories
dev Hardware hardware_ Linux

RPi without keyboard and mouse

https://sendgrid.com/blog/complete-guide-set-raspberry-pi-without-keyboard-mouse/

https://github.com/motdotla/ansible-pi

First thing is you need an empty file called ‘ssh’ on the boot partition of the Raspbian SD card, to enable SSH.

https://www.raspberrypi.org/forums/viewtopic.php?t=144839

ok so I found the IP address of the PI

root@chrx:~# nmap -sP 192.168.101.0/24

Starting Nmap 7.60 ( https://nmap.org ) at 2020-04-05 17:06 UTC
Nmap scan report for _gateway (192.168.101.1)
Host is up (0.0026s latency).
MAC Address: B8:69:F4:1B:D5:0F (Unknown)
Nmap scan report for 192.168.101.43
Host is up (0.042s latency).
MAC Address: 28:0D:FC:76:BB:3E (Sony Interactive Entertainment)
Nmap scan report for 192.168.101.100
Host is up (0.049s latency).
MAC Address: 18:F0:E4:E9:AF:E3 (Unknown)
Nmap scan report for 192.168.101.101
Host is up (0.015s latency).
MAC Address: DC:85:DE:22:AC:5D (AzureWave Technology)
Nmap scan report for 192.168.101.103
Host is up (-0.057s latency).
MAC Address: 74:C1:4F:31:47:61 (Unknown)
Nmap scan report for 192.168.101.105
Host is up (-0.097s latency).
MAC Address: B8:27:EB:03:24:B0 (Raspberry Pi Foundation)

Nmap scan report for 192.168.101.111
Host is up (-0.087s latency).
MAC Address: 00:24:D7:87:78:EC (Intel Corporate)
Nmap scan report for 192.168.101.121
Host is up (-0.068s latency).
MAC Address: AC:E0:10:C0:84:26 (Liteon Technology)
Nmap scan report for 192.168.101.130
Host is up (-0.097s latency).
MAC Address: 80:5E:C0:52:7A:27 (Yealink(xiamen) Network Technology)
Nmap scan report for 192.168.101.247
Host is up (0.15s latency).
MAC Address: DC:4F:22:FB:0B:27 (Unknown)
Nmap scan report for chrx (192.168.101.127)
Host is up.
Nmap done: 256 IP addresses (11 hosts up) scanned in 2.45 seconds

if nmap is not installed,

apt-get install nmap

Connect to whatever IP it is

ssh -vvvv pi@192.168.101.105

Are you sure you want to continue connecting (yes/no)? yes

Cool, and to set up wifi, let’s check out this ansible script https://github.com/motdotla/ansible-pi

$ sudo apt update
$ sudo apt install software-properties-common
$ sudo apt-add-repository --yes --update ppa:ansible/ansible
$ sudo apt install ansible

ok 58MB install…

# ansible-playbook playbook.yml -i hosts --ask-pass --become -c paramiko

PLAY [Ansible Playbook for configuring brand new Raspberry Pi]

TASK [Gathering Facts]

TASK [pi : set_fact]
ok: [192.168.101.105]

TASK [pi : Configure WIFI] **
changed: [192.168.101.105]

TASK [pi : Update APT package cache]
[WARNING]: Updating cache and auto-installing missing dependency: python-apt
ok: [192.168.101.105]

TASK [pi : Upgrade APT to the lastest packages] *
changed: [192.168.101.105]

TASK [pi : Reboot] **
changed: [192.168.101.105]

TASK [pi : Wait for Raspberry PI to come back] **
ok: [192.168.101.105 -> localhost]

PLAY RECAP ****
192.168.101.105 : ok=7 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

And I’ll unplug the ethernet and try to connect by ssh again

Ah, but it’s moved up to 192.168.101.106 now

nmap -sP 192.168.101.0/24 (I checked again) and now it was ‘Unknown’, but ssh pi@192.168.101.106 worked

(If you can connect to your router, eg. 192.168.0.1 for most D-Link routers, you can go to something like Status -> Wireless, to see connected devices too, and skip the nmap stuff.)

I log in, then to configure some stuff:

sudo raspi-config

Under the interfacing options section, enable the camera and I2C

sudo apt-get install python-smbus
sudo apt-get install i2c-tools

ok tested with

raspistill -o out.jpg

Then copied across from my computer with

scp pi@192.168.101.106:/home/pi/out.jpg out.jpg

and then made it smaller (because uploading the 4MB version wasn’t working)

convert out.jpg -resize 800x600 new.jpg

Cool and it looks like we also need to expand the partition

sudo raspi-config again (Advanced Options, then the first option, which expands the filesystem)


Upon configuring the latest pi, I needed to first use the ethernet cable,

and then once logged in, use

sudo rfkill unblock 0

to turn on the wifi. The SSID and wifi password could be configured in raspi-config.


At Bitwäsherei, the ethernet cable to the router trick didn’t work.

Instead, as per the resident Gandalf’s advice, the instructions here

https://raspberrypi.stackexchange.com/questions/10251/prepare-sd-card-for-wifi-on-headless-pi

worked for setting up wireless access on the sd card.

“Since May 2016, Raspbian has been able to copy wifi details from /boot/wpa_supplicant.conf into /etc/wpa_supplicant/wpa_supplicant.conf to automatically configure wireless network access”

The file contains

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=«your_ISO-3166-1_two-letter_country_code»

network={
    ssid="«your_SSID»"
    psk="«your_PSK»"
    key_mgmt=WPA-PSK
}

Save, and put sd card in RPi. Wireless working and can ssh in again!
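
For setting up more than one card, the file can be generated rather than typed. A minimal sketch that renders the same template; the function name is made up, and the values passed in are yours to supply. It does not escape quotes inside the SSID or passphrase, so those would need handling separately.

```python
def wpa_supplicant_conf(ssid, psk, country):
    """Render the wpa_supplicant.conf shown above for one WPA-PSK network."""
    if len(country) != 2:
        raise ValueError("country must be a two-letter ISO 3166-1 code")
    if not 8 <= len(psk) <= 63:
        raise ValueError("a WPA passphrase must be 8 to 63 characters")
    return (
        "ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev\n"
        "update_config=1\n"
        f"country={country.upper()}\n"
        "\n"
        "network={\n"
        f'    ssid="{ssid}"\n'
        f'    psk="{psk}"\n'
        "    key_mgmt=WPA-PSK\n"
        "}\n"
    )
```

The returned text is what gets written to wpa_supplicant.conf on the boot partition of the SD card.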

2022 News flash:

Incredibly, some more issues.

New issue, user guide not updated yet

https://stackoverflow.com/questions/71804429/raspberry-pi-ssh-access-denied

In essence, the default pi user no longer exists, so you have to create it and set its password using either the official Imager tool or by creating a userconf file in the boot partition of your microSD card, which should contain a single line of text: username:hashed-password

For the old default user pi with password raspberry, the line is:

pi:$6$/4.VdYgDm7RJ0qM1$FwXCeQgDKkqrOU3RIRuDSKpauAbBvP11msq9X58c8Que2l1Dwq3vdJMgiZlQSbEXGaY5esVHGBNbCxKLVNqZW1

Categories
dev Linux

Setting up Java/C++ dev environment

The instructions in the PyBullet wiki got me thinking that I could use a proper IDE on the Linux machine. I’m too old to learn emacs, and I’m not doing this in vi.

PyCharm is pretty good for Python.

But the Bullet Engine examples are all in C++, so I went with Eclipse CDT https://www.eclipse.org/cdt/ for a nicer experience.

So Eclipse needs Java. So I installed java with

sudo apt install openjdk-8-jdk

Unpacked gzip tar file and deleted it

tar xvf eclipse-cpp-2020-03-R-incubation-linux-gtk-x86_64.tar.gz
rm eclipse-cpp-2020-03-R-incubation-linux-gtk-x86_64.tar.gz

./opt/eclipse/eclipse

Shit yeah! colours

Categories
dev Linux robots

ROS installation

Robot Operating System installation http://wiki.ros.org/melodic/Installation/Ubuntu

Categories
Linux

GalliumOS

An OS for Chromebooks. In hindsight, better than ChromeOS.

chrx and dual boot https://wiki.galliumos.org/Installing

Adjustments needed to be made for dev env setup.

apt-get install python3-distutils

Ok Docker https://docs.docker.com/install/linux/docker-ce/ubuntu/

Ok and did it solve the docker problem (Ubuntu on crouton)???

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

sudo apt-get install docker-ce docker-ce-cli containerd.io

pip install docker-compose
chmod +x /usr/local/bin/docker-compose

git clone https://github.com/javadan/buckets.git


docker-compose up

Yes, it got further. I’ve gotta fix the buckets code, but yes, GalliumOS is great.