Panoptic segmentation combines an instance segmentation algorithm with a semantic segmentation algorithm.

Some notable papers are listed here, with the benchmarks of the best related GitHub repositories.

For example,

  • MASK_RCNN algorithm for instance segmentation
  • DeepLabV2 Algorithm for semantic segmentation

The architecture of the neural network has two pyramids, one for semantics (classes), and one to count the instances.

After much circular investigation, I arrived at the notion that transfer learning from a pre-trained network, with the ‘fine tuning’ referring to adding a new class, is the way to go.

But we’re still suffering from not having found an example using PNG mask files. I can convert to COCO, and that might be what I do yet, because the COCO dataset had its own panoptic segmentation challenge and format. They seem to be winning this race. We’ll do COCO.

It will mostly involve writing or exporting info into json format, and following some terse, ambiguous instructions.

Another thing is that COCO wants bounding boxes too. So this will be an exercise in config generation to satisfy the COCO format requirements. I have the data from Open images, but COCO looks like the biggest game in town.
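Since the chicken masks are PNGs and COCO wants boxes, the boxes can probably be derived from the masks themselves. Here is a minimal sketch, assuming each mask is already loaded as a 2-D NumPy array where non-zero means "chicken". The function name is my own invention, and the [x, y, width, height] order is the COCO convention.

```python
import numpy as np


def mask_to_bbox_area(mask):
    """Return ([x, y, width, height], area) for a binary mask, or None if it is empty."""
    ys, xs = np.nonzero(mask)          # row and column indices of foreground pixels
    if ys.size == 0:
        return None                    # nothing labelled in this mask
    x0, x1 = int(xs.min()), int(xs.max())
    y0, y1 = int(ys.min()), int(ys.max())
    bbox = [x0, y0, x1 - x0 + 1, y1 - y0 + 1]
    return bbox, int(ys.size)          # area = number of foreground pixels


# Toy 5x6 mask with a 3x3 blob, standing in for one chicken.
toy_mask = np.zeros((5, 6), dtype=np.uint8)
toy_mask[1:4, 2:5] = 1
bbox, area = mask_to_bbox_area(toy_mask)
```

This only handles one object per mask. Several chickens in one PNG would need one mask per instance first.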

Then for the algorithm, there are numerous PyTorch libraries, especially a very relevant one, YOLACT Edge, using a ‘Darknet’ architecture, which comes from an old project, “Open Source Neural Networks in C”.

Hmm. It’s more instance segmentation than panoptic, but looks like a good compromise, to aim for. – It uses bounding boxes, so what will I do with all these chicken masks?

YOLACTEdge arxiv paper

Otherwise, the tensorflow object detection tutorials are here:

The eager_few_shot_od_training_tflite.ipynb notebook also looks like a winner for showing how to add a new Duck class to a MobileNet architecture. YOLACT Edge has a MobileNet model available too.

I am sitting with a thousand or so JPGs of chickens with corresponding PNG masks, sorted into train/val/test datasets. I was hoping for the Keras UNet segmentation demo to work because I initially thought UNet would be ideal for the egg light camera, but now I’m back to the FAIR detectron2 woods, to find a panoptic segmentation solution.

Let’s try the YOLACT Edge one, because it’s based on YOLO, (You only look once), a single shot object detector algorithm, but which is also more commonly known for ‘You only live once’, an affirmation of often reckless behaviour. YOLACT stands for You Only Look At CoefficienTs. In this case it looks like the state of the art, and it’s been used on the Jetson before, which is promising. At 30 frames per second on the Jetson AGX, we’ll probably be getting 20 or so on the Jetson NX. Since that’s using Torch to TensorRT to speed it up, it seems like we should try it. I was initially averse to using NVIDIA specific software, but we should make the most of this hardware (if we can)

It’s not really panoptic segmentation. But it’s looking Good Enough™ like what we need, rather than what we thought we wanted.

Let’s try these instructions:

We’ll try it on the NX. “Inside” the Docker. What’s our CUDA version?

nvcc --version


TensorRT should already be installed.

(On Nano, if nvcc not found, check out this link )

git clone
cd torch2trt
sudo python3 install --plugins

Here’s from the COCO panoptic readme. RELEVANT EXCERPT FOR….

Panoptic Segmentation

For the panoptic task, each annotation struct is a per-image annotation rather than a per-object annotation. Each per-image annotation has two parts: (1) a PNG that stores the class-agnostic image segmentation and (2) a JSON struct that stores the semantic information for each image segment. In more detail:

  1. To match an annotation with an image, use the image_id field (that is
  2. For each annotation, per-pixel segment ids are stored as a single PNG at annotation.file_name. The PNGs are in a folder with the same name as the JSON, i.e., annotations/name/ for annotations/name.json. Each segment (whether it’s a stuff or thing segment) is assigned a unique id. Unlabeled pixels (void) are assigned a value of 0. Note that when you load the PNG as an RGB image, you will need to compute the ids via ids=R+G*256+B*256^2.
  3. For each annotation, per-segment info is stored in annotation.segments_info. The id field stores the unique id of the segment and is used to retrieve the corresponding mask from the PNG. category_id gives the semantic category and iscrowd indicates the segment encompasses a group of objects (relevant for thing categories only). The bbox and area fields provide additional info about the segment.
  4. The COCO panoptic task has the same thing categories as the detection task, whereas the stuff categories differ from those in the stuff task (for details see the panoptic evaluation page). Finally, each category struct has two additional fields: isthing that distinguishes stuff and thing categories and color that is useful for consistent visualization.

annotation{
"image_id": int,
"file_name": str,
"segments_info": [segment_info],
}

segment_info{
"id": int,
"category_id": int,
"area": int,
"bbox": [x,y,width,height],
"iscrowd": 0 or 1,
}

categories[{"id": int, 
            "name": str,  
            "supercategory": str, 
            "isthing": 0 or 1, 
            "color": [R,G,B],}]

Ok, we can do this.
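The id formula from the readme (ids=R+G*256+B*256^2) is easy to get backwards, so here is a small sketch of both directions, plus pulling out one segment's mask. It assumes the PNG has already been loaded into an (H, W, 3) uint8 NumPy array. The function names are mine, not from any COCO tooling.

```python
import numpy as np


def rgb_to_id(rgb):
    """Collapse an (H, W, 3) RGB array into per-pixel segment ids."""
    rgb = rgb.astype(np.uint32)        # avoid uint8 overflow before multiplying
    return rgb[..., 0] + 256 * rgb[..., 1] + 256 ** 2 * rgb[..., 2]


def id_to_rgb(ids):
    """Spread segment ids back into an (H, W, 3) uint8 RGB array."""
    ids = np.asarray(ids, dtype=np.uint32)
    channels = [ids % 256, (ids // 256) % 256, (ids // 256 ** 2) % 256]
    return np.stack(channels, axis=-1).astype(np.uint8)


def segment_mask(png_rgb, segment_id):
    """Boolean mask of the pixels belonging to one segments_info entry."""
    return rgb_to_id(png_rgb) == segment_id


# Round trip on a tiny 2x2 "image" of segment ids (0 is void).
toy_ids = np.array([[0, 1], [256, 65536 + 2]])
toy_rgb = id_to_rgb(toy_ids)
```

Going the other way for the chickens would mean giving each chicken its own id, writing id_to_rgb(ids) out as the PNG, and listing the same ids in segments_info.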

Right, so, if anything, we want to transfer learn from a trained neural network. There’s some interesting discussion about implementing your own transfer learning of a coco dataset, in keras-retinanet here, but we’re looking at using Yolact Edge, based on pytorch, so let’s not get distracted. We need to create the COCO dataset. I’ve put this off for so long.

We need the COCO categories that are already trained, and I see there is the 2018 api which has the Panoptic challenge coco categories (panoptic_coco_categories.json) and ah ha this is what I have been searching for.
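For adding a chicken class next to the pretrained ones, the categories list needs to be searchable by name and extendable. This is a rough sketch, assuming the file is a JSON list of category structs like the ones in the readme excerpt. The inline sample only holds the bird entry (with the color field left out), and the "chicken" category is entirely hypothetical.

```python
import json

# Stand-in for the contents of panoptic_coco_categories.json (one entry only).
SAMPLE_CATEGORIES = """
[
  {"supercategory": "animal", "isthing": 1, "id": 16, "name": "bird"}
]
"""


def index_by_name(categories):
    """Map category name to its struct for quick lookups."""
    return {cat["name"]: cat for cat in categories}


def add_category(categories, name, supercategory, isthing=1):
    """Append a new category with the next unused id and return it."""
    new_cat = {
        "supercategory": supercategory,
        "isthing": isthing,
        "id": max(cat["id"] for cat in categories) + 1,
        "name": name,
    }
    categories.append(new_cat)
    return new_cat


categories = json.loads(SAMPLE_CATEGORIES)
bird = index_by_name(categories)["bird"]
chicken = add_category(categories, "chicken", "animal")   # hypothetical new class
```

With the real file the new id would land after the highest existing id rather than at 17, which is the point of computing it instead of hard-coding it.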


After pretty printing with

python3 -m json.tool panoptic_examples.json

here’s the example, for this bit.

"images": [
    {
        "license": 2,
        "file_name": "000000142238.jpg",
        "coco_url": "",
        "height": 427,
        "width": 640,
        "date_captured": "2013-11-20 16:47:35",
        "flickr_url": "",
        "id": 142238
    },
    {
        "license": 1,
        "file_name": "000000439180.jpg",
        "coco_url": "",
        "height": 360,
        "width": 640,
        "date_captured": "2013-11-19 01:25:39",
        "flickr_url": "",
        "id": 439180
    }
],

and we’ve got some images


and their masks.


Ah here’s ‘bird’ category.

"supercategory": "animal",
"color": [
"isthing": 1,
"id": 16,
"name": "bird"

“Let’s try get some visualisation working”

Ok hold on though. Let’s try get some visualisation working, before anything else. This looks like the ticket. But it is a python file, and running matplotlib, so ideally we’d transform this to a Jupyter Notebook. Ok, just New Notebook, copy paste. Run.

ModuleNotFoundError: No module named 'skimage'

[Big Detour and to the rescue, Datamachines]

Ok we can install it with !pip3 install scikit-image ? No, that fails… what did I do, right, I need to ssh into the Jetson,

chrx@chrx:~$ ssh -L 8888: -L 6006: chicken@

Then find the docker ID, and docker exec -it 519ed46162ae bash into it, and goddamnit what, UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xc3 in position 4029: ordinal not in range(128)

Ok so someone’s already had this happen, and it’s because the locale’s preferred encoding needs to be UTF-8. But it’s some obscure ANSI.

root@jetson:/# python -c 'import locale; print(locale.getpreferredencoding())'
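To convince myself what the error actually means, here is a tiny sketch. Byte 0xc3 is the first byte of many two-byte UTF-8 characters, so an ASCII decoder chokes on it while a UTF-8 decoder is fine. The sample string is just an illustration, not the file that actually failed.

```python
import locale

# 'é' encodes to the two bytes 0xc3 0xa9 in UTF-8.
raw = "café".encode("utf-8")

# What Python will use by default for open() when no encoding is given.
print(locale.getpreferredencoding())

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False               # same class of error as in the container

text = raw.decode("utf-8")         # decoding explicitly as UTF-8 works
```

When the locale cannot be changed, passing encoding="utf-8" to open() sidesteps the default, though that only helps for code I control, not for an installer script.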

Someone posted a bunch of steps for the L4T docker folks. That would be us. Do we really need this library ?

It’s to get this function.

from skimage.segmentation import find_boundaries

Yes, ok, it is quite hellish to install skimage. This was how to do it in debian, for skimage up to v. 0.13.1-2

apt install python3-skimage

But it then fails with “ImportError: cannot import name '_validate_lengths'”, which is resolved in 0.14.2
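If all that is needed from skimage is a boundary overlay, a NumPy-only stand-in might have been enough. This is a minimal sketch, assuming a 2-D array of integer labels. It marks every pixel that has a 4-connected neighbour with a different label. It is my own approximation and is not guaranteed to match the output of skimage's find_boundaries pixel for pixel.

```python
import numpy as np


def label_boundaries(labels):
    """Mark pixels whose left/right/up/down neighbour carries a different label."""
    labels = np.asarray(labels)
    boundary = np.zeros(labels.shape, dtype=bool)

    # Compare each pixel with its horizontal neighbour and flag both sides.
    diff_h = labels[:, 1:] != labels[:, :-1]
    boundary[:, 1:] |= diff_h
    boundary[:, :-1] |= diff_h

    # Same for vertical neighbours.
    diff_v = labels[1:, :] != labels[:-1, :]
    boundary[1:, :] |= diff_v
    boundary[:-1, :] |= diff_v
    return boundary


toy_labels = np.array([[1, 1, 2, 2],
                       [1, 1, 2, 2]])
toy_boundary = label_boundaries(toy_labels)
```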

I’ve asked on the forum, and am hoping NVIDIA can solve this one. The skimage docs say:

  1. Linux on 64-bit ARM processors (Nvidia Jetson):

As per the latest comment, (only 3 weeks ago, others were on the trail of similar tasks!), mmartial is pointing to datamachines, which has some Dockerfiles for building OpenCV and Tensorflow, and YOLOv4.

Ok, let’s try what the instructions suggest:

“make tensorflow_opencv to build all the tensorflow_opencv container images”

I’ll try the CuDNN version next if this doesn’t work…

Ok…we’re on step 16 of 42… Ooh Python 3.8, that’s an upgrade. Build those wheels, pip3! Doh, Step 24 of 42.

bazel: Exec format error

The command returned a non-zero code: 2

*Whomp whomp* sound

ok let’s try

make cudnn_tensorflow_opencv, no…

I asked on the Issues, and they noticed those are the amd64 builds, not the aarch64 build. I could use their DockerHub pre-build for now.

So after a detour, I am using this Dockerfile successfully to run Jupyter on the NX. We got stuck because skimage was difficult to install, and now we’re back on track, annotating the COCO, and so on.

chicken@jetson:~$ cat Dockerfile

RUN pip3 install jupyter jupyterlab --verbose
RUN jupyter lab --generate-config
RUN python3 -c "from import set_password; set_password('nvidia', '/root/.jupyter/jupyter_notebook_config.json')"
CMD /bin/bash -c "jupyter lab --ip --port 8888 --allow-root &> /var/log/jupyter.log" & \
echo "allow 10 sec for JupyterLab to start @ http://$(hostname -I | cut -d' ' -f1):8888 (password nvidia)" && \
echo "JupterLab logging location: /var/log/jupyter.log (inside the container)" && \

chicken@jetson:~$ sudo docker build -t nx_setup .

chicken@jetson:~$ sudo docker run -it -p 8888:8888 -p 6006:6006 --rm --runtime nvidia --network host -v /home/chicken/:/dmc nx_setup

So, where were we?

Right. Panoptic API, we wanted to run, first, so we could check progress. But it needed skimage installed. Haha. Ok, one week later… let’s try see the example.

Phew, ok. Getting back on track. So now we want to train it on the chickens.


“Back to COCO”

As someone teaching myself about this, I know what I ideally want is to transfer learn from a trained network. But it isn’t obvious how. I apparently need to chop off the last layer of a trained network, freeze most of the network, and then retrain the last bit.
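To make the "freeze most of it, retrain the last bit" idea concrete for myself, here is a toy sketch in plain NumPy rather than a real network. The "backbone" is a fixed projection that never gets updated, and only a small logistic-regression head is trained on top of its features. Every name and number here is made up for illustration. A real fine-tune would do the same thing with a pretrained model's layers.

```python
import numpy as np

# Hypothetical stand-in for pretrained backbone weights. These stay frozen.
W_FROZEN = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, -1.0]])


def backbone(x):
    """Frozen feature extractor: W_FROZEN is read, never written."""
    return np.tanh(x @ W_FROZEN)


def train_head(x, y, steps=200, lr=0.5):
    """Fit only a logistic-regression head on the frozen features."""
    feats = backbone(x)                    # computed once, treated as constants
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        grad = p - y                       # cross-entropy gradient w.r.t. the logit
        w -= lr * feats.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b


# Two toy classes: 1 = "chicken", 0 = "not chicken".
x = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-2.5, -1.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
frozen_before = W_FROZEN.copy()
w, b = train_head(x, y)
pred = (backbone(x) @ w + b) > 0
```

The thing to notice is that gradients only ever touch w and b. That is the whole trick, just scaled down.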

Well, back to this soon…


Here we have a suggestion from dbolya author of YOLACT and YOLACT++, the original.

try:
    self.load_state_dict(state_dict)
except RuntimeError as e:
    print('Ignoring "' + str(e) + '"')
and then resume training from yolact_im700_54_800000.pth:
python --config=<your_config> --resume=weights/yolact_im700_54_800000.pth --start_iter=0
When there are size mismatches between tensors, Pytorch will spit out an error message but also keep on loading the rest of the tensors anyway. So here we just attempt to load a checkpoint with the wrong number of classes, eat the errors the Pytorch complains about, and then start training from iteration 0 with just those couple of tensors being untrained. You should see only the C (class) and S (semantic segmentation) losses reset.
You probably also want to modify the learning rate, decay schedule, and number of iterations in your config to account for fine-tuning.
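A different way to get a similar effect, which is not what dbolya suggests above, is to filter the checkpoint before loading so that only tensors with matching shapes are kept. This is a sketch under assumptions: the state dicts are plain name-to-array mappings, NumPy arrays stand in for tensors, and the layer names and class counts are hypothetical.

```python
import numpy as np


def keep_matching(checkpoint, model_state):
    """Split checkpoint entries into those whose shape matches the model and the rest."""
    kept, skipped = {}, []
    for name, tensor in checkpoint.items():
        target = model_state.get(name)
        if target is not None and tuple(tensor.shape) == tuple(target.shape):
            kept[name] = tensor
        else:
            skipped.append(name)           # wrong class count, or layer no longer exists
    return kept, skipped


# Hypothetical checkpoint trained on 81 classes vs. a model with 2 (chicken + background).
checkpoint = {
    "backbone.conv.weight": np.zeros((8, 3, 3, 3)),
    "head.conf.weight": np.zeros((81, 256)),
}
model_state = {
    "backbone.conv.weight": np.zeros((8, 3, 3, 3)),
    "head.conf.weight": np.zeros((2, 256)),
}
kept, skipped = keep_matching(checkpoint, model_state)
# In PyTorch the filtered dict could then go to model.load_state_dict(kept, strict=False).
```

The skipped list doubles as a record of which layers start untrained, which should line up with the losses that reset.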

And an allusion to an example of its use, perhaps. And more clues about how to fine tune the ‘network head’.

You can do this by following the fine tuning procedure (#36) and then here:


Line 628 in f54b0a5

p = pred_layer(pred_x)

replace that with

p = pred_layer(pred_x.detach())

Ok… so here’s the YOLACT diagram:

A command to run the training.

Google also has a new detection algorithm and it looks much faster than Mask RCNN

EfficientNet: Improving Accuracy and Efficiency through AutoML and Model Scaling

Of course the state of the art keeps improving, but this looks like a stepping stone.




This is the S.O.T.A. The G.O.A.T. of 2020. So we’ll try it out:

TF2 Model Zoo introduces new SOTA models such as CenterNet, ExtremeNet, and EfficientDet.


Dynamic Style Transfer

Like a slider for style transfer.

More generally, neural style transfer is pretty common now. Some examples

Javascript version:

Also, a comixifying / sort of service:

Here’s an app for it: –

Which Image resolution should I use for training for deep neural network?

CIFAR dataset is 32px*32px,

MIT 128px*128px,

Stanford 96px*96px.

Following the advice here

“small-image models are much faster to train.”

“Here is a smoothed kernel-density plot of image sizes in our “Open Fruits” dataset:”

“We see here that the images peak at around 128x128 in size. So for our initial input size we will choose 1/3 of that: 48x48.

Now it’s time to experiment! What kind of model you end up building in this phase of the project is entirely up to you.”

I’ll have a look at the chicken images, and see how to scale them down. Maybe ffmpeg or convert or imagemagick pre-processing is better. But we’ll get there soon enough.
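One thing to watch when shrinking the chickens: the JPG can be averaged, but the mask must not be, because blending two label ids gives a third id that means nothing. Here is a minimal NumPy sketch, assuming an (H, W, C) image, an (H, W) mask of ids, and an integer shrink factor. The function name is mine.

```python
import numpy as np


def downscale(image, mask, factor):
    """Shrink an image by block averaging and its mask by nearest-neighbour sampling."""
    # Crop so both dimensions divide evenly by the factor.
    h = image.shape[0] - image.shape[0] % factor
    w = image.shape[1] - image.shape[1] % factor
    image, mask = image[:h, :w], mask[:h, :w]

    # Average each factor x factor block of pixels, per channel.
    blocks = image.reshape(h // factor, factor, w // factor, factor, -1)
    small_image = blocks.mean(axis=(1, 3)).astype(image.dtype)

    # Take one sample per block so label ids are never blended.
    small_mask = mask[::factor, ::factor]
    return small_image, small_mask


toy_image = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
toy_mask = np.array([[1, 1, 2, 2],
                     [1, 1, 2, 2],
                     [3, 3, 4, 4],
                     [3, 3, 4, 4]])
small_image, small_mask = downscale(toy_image, toy_mask, 2)
```

The same rule applies with ffmpeg or ImageMagick: use a smooth filter for the photos and a nearest-neighbour one for the masks.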

This is a notably relevant paper from 2019, that appears to be keeping track of eggs

“Our custom SSD object detection and classification model classified when chickens and eggs were detected by the video camera. Our models can label video frames with classifications for 8 breeds of chickens and 4 colors of eggs, with 98% accuracy on chickens or eggs alone and 82.5% accuracy while detecting both types of objects.”

“Tuned accuracy is needed for proper thresholding of object detection”

Also interesting,

Factors Affecting Egg Production in Backyard Chicken

This is maybe getting closer to the holy grail in my mind. I like the idea of bridging the gap between simulation and reality in the other direction too, by converting the world into object meshes. Real2Sim.

The OpenAI Rubik’s cube hand policy transfer was done with a camera in simulation and a camera in the real world. This could allow a sort of dreaming, i.e., running simulations on new 3D obj data.

It could acquire data that it could mull over, when chickens are asleep.


Pixel2Mesh: Generating 3D Mesh Models
from Single RGB Images

Remember Hinton’s dark knowledge. The trick is having a few models distill into one.

In trying to get Mesh R-CNN working, I had to add DEVICE=CPU to the config.

python3 demo/ --config-file configs/pix3d/meshrcnn_R50_FPN.yaml --input /home/chrx/Downloads/chickenegg.jpg --output output_demo --onlyhighest MODEL.WEIGHTS meshrcnn://meshrcnn_R50.pth

Success! It’s a chair.

There’s no chicken category in Pix3d. But getting closer. Just need a chicken and egg dataset.

Downloading blender again, to check out the obj file that was generated. Ok Blender doesn’t want to show it, but here’s a handy site to view OBJ files. The issue in blender required selecting the obj, then View > Frame Selected to make it zoom in. Switching to orthographic from perspective view also helps.

Chair is a pretty adaptable class.

Ran through the nice working jupyter notebook and produced this video

It is the Mask R-CNN algorithm from matterport, ported over by facebook labs, and better maintained. It was forked and fixed up for tourists.

We can train it on the robot eye view camera, maybe train it on google images of copyleft chickens and eggs.

I think this looks great, for endowing the robot with a basic “recognition” of the features of classes it’s been exposed to.

Seems I was oblivious to Facebook AI but of course they hire very smart people. I’d sell my soul for $240k/yr too. It is super nice to get a working Jupyter Notebook. Thank you.

Here are the other FB projects using detectron2, copy pasted:

Projects by Facebook

Note that these are research projects, and therefore may not have the same level of support or stability as detectron2.

External Projects

External projects in the community that use detectron2:

Also, more generally,

Errors encountered while attempting to install

File "", line 8, in
import tqdm
ImportError: No module named tqdm

pip3 uninstall tqdm
pip3 install tqdm

Ok so…

python3 -m pip install -e .

python3 --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml --webcam --opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl

Requires pyyaml>=5.1


pip install pyyaml==5.1
 Successfully built pyyaml
Installing collected packages: pyyaml
Attempting uninstall: pyyaml
Found existing installation: PyYAML 3.12
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

pip3 install --ignore-installed PyYAML
Successfully installed PyYAML-5.1

Next error...

ModuleNotFoundError: No module named 'torchvision'

pip install torchvision

Next error...

Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from


python3 --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml --webcam --opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl MODEL.DEVICE cpu

[08/17 20:53:11 detectron2]: Arguments: Namespace(confidence_threshold=0.5, config_file='../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml', input=None, opts=['MODEL.WEIGHTS', 'detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl', 'MODEL.DEVICE', 'cpu'], output=None, video_input=None, webcam=True)
[08/17 20:53:12 fvcore.common.checkpoint]: Loading checkpoint from detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl
[08/17 20:53:12 fvcore.common.file_io]: Downloading …
[08/17 20:53:12]: Downloading from …
model_final_f10217.pkl: 178MB [01:26, 2.05MB/s]
[08/17 20:54:39]: Successfully downloaded /root/.torch/fvcore_cache/detectron2/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl. 177841981 bytes.
[08/17 20:54:39 fvcore.common.file_io]: URL cached in /root/.torch/fvcore_cache/detectron2/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl
[08/17 20:54:39 fvcore.common.checkpoint]: Reading a file from 'Detectron2 Model Zoo'
0it [00:00, ?it/s]/opt/detectron2/detectron2/layers/ UserWarning: This overload of nonzero is deprecated:
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
return x.nonzero().unbind(1)
0it [00:06, ?it/s]
Traceback (most recent call last):
File "", line 118, in
cv2.namedWindow(WINDOW_NAME, cv2.WINDOW_NORMAL)
cv2.error: OpenCV(4.3.0) /io/opencv/modules/highgui/src/window.cpp:634: error: (-2:Unspecified error) The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Cocoa support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function 'cvNamedWindow'


pip install opencv-python

Requirement already satisfied: opencv-python in /usr/local/lib/python3.6/dist-packages (

Looks like 4.3.0 vs kinda thing

sudo apt-get install libopencv-*


/opt/detectron2/detectron2/layers/ UserWarning: This overload of nonzero is deprecated:
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
return x.nonzero().unbind(1)

def nonzero_tuple(x):
    """
    A 'as_tuple=True' version of torch.nonzero to support torchscript.
    because of
    """
    if x.dim() == 0:
        return x.unsqueeze(0).nonzero().unbind(1)
    return x.nonzero(as_tuple=True).unbind(1)

AttributeError: 'tuple' object has no attribute 'unbind'

FFS. Why does nothing ever fucking work ?
pytorch 1.6:
"putting 1.6.0 milestone for now; this isn't the worst, but it's a pretty bad user experience."

Yeah no shit.

let's try...

return x.nonzero(as_tuple=False).unbind(1)
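For my own sanity, here is why as_tuple=True blew up. In PyTorch, nonzero(as_tuple=False) returns a single (N, ndim) tensor of coordinates, and .unbind(1) then splits it into per-dimension columns. With as_tuple=True you already get a tuple, and a tuple has no .unbind. The sketch below is only a NumPy analogy of the two shapes (np.argwhere vs np.nonzero), not the torch code itself.

```python
import numpy as np

x = np.array([[0, 5],
              [7, 0]])

# Like torch's as_tuple=True: a tuple of 1-D index arrays, one per dimension.
as_tuple = np.nonzero(x)

# Like torch's as_tuple=False: one (N, ndim) array of coordinates.
as_matrix = np.argwhere(x)

# Splitting the matrix into columns recovers the tuple form, which is
# roughly what .unbind(1) does to the as_tuple=False result.
columns = tuple(as_matrix[:, d] for d in range(as_matrix.shape[1]))
```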

Ok next error same


Ok... back to this error (after adding as_tuple=False twice)

 File "", line 118, in
cv2.namedWindow(WINDOW_NAME, cv2.WINDOW_NORMAL)
cv2.error: OpenCV(4.3.0) /io/opencv/modules/highgui/src/window.cpp:634: error: (-2:Unspecified error) The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Cocoa support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function 'cvNamedWindow'

Decided to check if maybe this is a conda vs pip thing. Like maybe I just need to install the conda version instead?

But it looks like a GTK+ 2.x isn’t installed. Seems I installed it using pip, i.e. pip install opencv-contrib-python and that isn’t built with gtk+2.x. I can also use qt as the graphical interface.

“GTK supposedly uses more memory because GTK provides more functionality. Qt does less and uses less memory. If that is your logic, then you should also look at Aura and the many other user interface libraries providing less functionality.”
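Before rebuilding anything, it might be worth checking which GUI backends the installed OpenCV was actually built with. cv2.getBuildInformation() returns a big text dump, and the sketch below just picks the GTK and Qt lines out of such a dump. The sample text is a hypothetical excerpt I wrote for illustration. The real layout can differ between versions.

```python
def gui_backends(build_info):
    """Pick the GTK/Qt lines out of an OpenCV build-information dump."""
    found = {}
    for line in build_info.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip().upper().startswith(("GTK", "QT")):
            found[key.strip()] = value.strip()
    return found


# Hypothetical excerpt. The real text would come from cv2.getBuildInformation().
SAMPLE_BUILD_INFO = """
  GUI:
    QT:                          NO
    GTK+:                        NO
  Media I/O:
    JPEG:                        build
"""

backends = gui_backends(SAMPLE_BUILD_INFO)
```

If both come back as NO, cv2.namedWindow has nothing to draw with, which would match the "function is not implemented" error.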

So let’s make a whole new Chapter, because we’re installing OpenCV again! (Why? Because I want to try run the detectron2 file.)

pip3 uninstall opencv-python
pip3 uninstall opencv-contrib-python 

(or sudo apt-get remove ___)

and afterwards build the opencv package from source code from github.

git clone

cd ~/opencv

mkdir release

cd release



sudo make install

ok… pls…

python3 --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml --webcam --opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl MODEL.DEVICE cpu

sweet jaysus finally.

Here’s an image of the network from a medium article on RCNN:

FB really likes detecting things. I went with their PyTorch version. The matterport version didn’t work out of the box, so went with FB’s code to try image segmentation.

Caffe2 version:

PyTorch version:

Matterport’s version:

Deep Learning based Image Segmentation with OpenCV:

Also Watershed algorithm is available in OpenCV:



Segmenting an image by the watershed transformation is therefore a two-step process:

* Finding the markers and the segmentation criterion (the criterion or function which will be used to split the regions – it is most often the contrast or gradient, but not necessarily).

* Performing a marker-controlled watershed with these two elements.
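To see the two steps in action without OpenCV, here is a simplified marker-controlled flood in NumPy. It assumes an "elevation" array (for example a gradient magnitude) as the segmentation criterion and an integer marker array where 0 means unlabelled. It grows the markers outwards, lowest pixels first. This is my own stripped-down sketch and not OpenCV's cv2.watershed. In particular it draws no watershed lines between regions.

```python
import heapq

import numpy as np


def marker_watershed(elevation, markers):
    """Flood `elevation` from labelled `markers` (0 = unlabelled), lowest pixels first."""
    labels = markers.copy()
    height, width = elevation.shape
    heap, order = [], 0                    # `order` breaks ties by insertion time
    for r, c in zip(*np.nonzero(markers)):
        heapq.heappush(heap, (float(elevation[r, c]), order, int(r), int(c)))
        order += 1

    while heap:
        _, _, r, c = heapq.heappop(heap)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and labels[nr, nc] == 0:
                labels[nr, nc] = labels[r, c]      # claim the neighbour for this basin
                heapq.heappush(heap, (float(elevation[nr, nc]), order, nr, nc))
                order += 1
    return labels


# Two valleys separated by a ridge (value 9), one marker in each valley.
elevation = np.array([[0, 1, 2, 9, 2, 1, 0]])
markers = np.zeros_like(elevation)
markers[0, 0], markers[0, 6] = 1, 2
labels = marker_watershed(elevation, markers)
```

The ridge pixel ends up in whichever basin reaches it first, so the split falls at the high-contrast point, which is the behaviour the two-step description is after.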

Currently we have LSD-SLAM working, and that’s cool for us humans to see stuff, but having an object mesh to work with makes more sense. I don’t know if there’s really any difference, but at least in terms of simulator integration, this makes sense. I’m thinking, there’s object detection, semantic segmentation, etc, etc, and in the end, I want the robot to have a relative coordinate system, in a way. But robots will probably get by with just pixels and stochastic magic.

But the big idea for me, here, is transform monocular camera images into mesh objects. Those .obj files or whatever, could be imported into the physics engine, for training in simulation.



The PhD candidate: – In the Q&A at the end, she mentions AtlasNet as only being able to address local structures. Latest research looks interesting too

ShapeNet seems to be a common resource, and these obj files might be interesting

hope that works. It’s that guy on youtube who says ‘dear scholars’ and ‘what a time to be alive’.

Advertising was: Lambda GPU clouds, $20 for imagenet training, no setup required. Good to know.

looks like a nice UI for stuff :