18C Computer Vision Artificial Knowledge Links

Waarneming.nl: the model is revised at most three times per year.
Model 1, Naturalis: 2-20 photos per species
Model 2, Naturalis: 40 photos per species

Model 2-4, iNat: 40 photographers per species
Model 5, iNat: 1000 photos per species
…reference observations are used (5000 or 40)

https://groups.google.com/forum/#!topic/inaturalist/K9nJOC0Cjss

https://www.inaturalist.org/pages/help#cv-taxa
https://www.inaturalist.org/pages/help#computer-vision
https://www.inaturalist.org/pages/help#cv-select
https://www.inaturalist.org/blog/31806-a-new-vision-model
FWIW, there's also discussion and some additional charts at
https://forum.inaturalist.org/t/psst-new-vision-model-released/10854/11
https://www.inaturalist.org/pages/identification_quality_experiment
https://www.inaturalist.org/journal/loarie/10016-identification-quality-experiment-update
https://www.inaturalist.org/journal/loarie/9260-identification-quality-experiment-update
…about a rare species, but the system might still recommend one based on nearby observations
https://forum.inaturalist.org/t/identification-quality-on-inaturalist/7507
https://github.com/kueda/inaturalist-identification-quality-experiment/blob/master/identification-quality-experiment.ipynb
"nearby" means near in space and time
The model became more accurate on sedges and grasses
the vision model does not itself incorporate non-image data other than taxon IDs
b/c = because
https://www.inaturalist.org/blog/25510-vision-model-updates ("taxon and region comparisons" 20190614)
https://distill.pub/2020/circuits/zoom-in/ ("connections between neurons")
https://www.inaturalist.org/projects/flora-of-russia/journal/31726
https://www.inaturalist.org/posts/31726-
https://forum.inaturalist.org/t/provide-relevant-geographic-data-confidence-level-accuracy-scores-with-ai-suggestions/9226/2
https://forum.inaturalist.org/t/range-covered-by-the-seen-nearby-feature/2849/5

https://forum.inaturalist.org/t/use-computer-vision-to-annotate-observations/3331
https://forum.inaturalist.org/t/what-image-s-are-used-for-training-computer-vision/3307/6

Interesting article about the artificial intelligence behind iNaturalist: https://news.mongabay.com/wildtech/2017/08/smartphone-app-combines-computer-vision-and-crowdsourcing-to-explore-earths-biodiversity-one-photo-at-a-time/

= = = Nov2020
https://forum.inaturalist.org/t/better-use-of-location-in-computer-vision-suggestions/915/32
https://github.com/inaturalist/iNaturalistAPI/blob/main/lib/controllers/v1/computervision_controller.js#L368

Common ancestor for the top 3 raw results

https://github.com/inaturalist/iNaturalistAPI/blob/main/lib/controllers/v1/computervision_controller.js#L368

https://www.inaturalist.org/journal/ahospers/28858-funding-infrastructure-costs-images-on-amazon

I'm looking for a way of finding observations without coordinates. Many of these have Location Notes, so it is basically the missing longitude or latitude that I am looking for.
I am not interested in those with Latitude = 0 or Longitude = 0 (see https://www.inaturalist.org/projects/null, which is very inappropriately named: I am looking for NULLs, but this project identifies zeros instead. Nulls have no data (value unassigned, empty, or missing), whereas 0 is a specific datum, zero, like any other value, and not a "null").
At present, for this user, filtering on verifiable=false gives me more or less what I want, but it conflates these with observations failing any Data Quality criterion, not just missing coordinates.
https://www.inaturalist.org/observations?place_id=any&subview=grid&user_id=ahospers&verifiable=false
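A minimal sketch of the same query against the public iNaturalist API v1, assuming the same verifiable=false filter as in the URL above; like the website filter, it also returns records failing other data-quality criteria, so missing coordinates are filtered out client-side via the (assumed) geojson field:

import requests

params = {"user_id": "ahospers", "verifiable": "false", "per_page": 200}
resp = requests.get("https://api.inaturalist.org/v1/observations", params=params)
resp.raise_for_status()
# keep only records with no coordinates at all (geojson is empty when location is missing)
no_coords = [o for o in resp.json()["results"] if not o.get("geojson")]
print(len(no_coords), "observations without coordinates")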

I added a very basic search to atlases in response to Jane’s feature request: https://www.inaturalist.org/atlases. So now, if you wanted to see all ‘marked’, ‘active’ atlases of taxa in the Liliaceae, you’d use https://www.inaturalist.org/atlases?utf8=✓&filters[taxon_name]=Lilies&filters[taxon_id]=47328&filters[is_active]=True&filters[is_marked]=True

The out-of-range flag is vestigial; we don’t display it anywhere anymore (except in the old filter menu that’s still on https://www.inaturalist.org/observations/loarie). It worked directly on the taxon range, rather than using atlases.

Posted by ahospers, 20 March 2020

Comments


Coolia 2020(3)

In recent years there has been an explosion in the availability of smartphone apps that can help with mushroom identification in the field. A number of approaches are available, ranging from apps that identify mushrooms automatically using artificial intelligence (AI) and automated image recognition, through those that require the user to work through traditional dichotomous or multi-access keys, to those that may only offer a range of images without a clear system for identifying any species of interest.

Bachelor thesis: Magic Mushroom App - recognising edible mushrooms with deep learning (Python)
https://www.ntb.ch/fileadmin/NTB_Institute/ICE/projekte/MagicMushroom/JUNG_R._WAGNER_D._MagicMushroom_App-Pilzklassifikation_mit_CNNs.pdf

Deep Shrooms: classifying mushroom images
https://tuomonieminen.github.io/deep-shrooms/
https://github.com/TuomoNieminen/deep-shrooms (Python)
https://teekoivi.users.cs.helsinki.fi/
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
https://www.youtube.com/watch?v=f6Bf3gl4hWY

ShroomNet: an artificial neural network for identifying mushroom species
https://www.obermeier.ch/wp-content/uploads/2018/12/ShroomNET_small.pdf !!

Artificial Intelligence for plant identification on smartphones and tablets
https://bsbi.org/wp-content/uploads/dlm_uploads/BSBI-News-144-pp34-40-plant-id-apps-final.pdf

https://web.plant.id/

Tuomas Nieminen: Deep Learning in Quantifying Vascular Burden from Brain Images
https://www.semanticscholar.org/paper/TUOMAS-NIEMINEN-DEEP-LEARNING-IN-QUANTIFYING-BURDEN-Eskola/aea24dc5822ac9f5af4801f9aaf9ab864cf23aea

Obsidentify
https://play.google.com/store/apps/details?id=org.observation.obsidentify

Svampeatlas (Danish fungal atlas)
https://play.google.com/store/apps/details?id=com.noque.svampeatlas

German
https://play.google.com/store/apps/details?id=com.nastylion.pilz

iNaturalist Seek
https://play.google.com/store/apps/details?id=org.inaturalist.seek

Google Lens
https://play.google.com/store/apps/details?id=com.google.ar.lens


Posted by ahospers 12 months ago

No, of course not. Since there is a fairly long time between new versions of the recognition software, it will work roughly as follows, with the process being run through partly or in full each time. You start with the complete validated photo database. On that, a convolutional neural network is 'trained' (I would gladly explain exactly how this works down to the last detail, but that seems rather outside the scope of this forum).

The most important thing, in any case, is to find a balance between 'overfitting' and having no recognition at all. That is: prevent that only photos very closely resembling the photos in the database are recognized, while still making sure that species are recognized in a 'generalized' way, so that new photos of that species are recognized too.

When the complete database is used as training material and the complete database is also used as test material, it can happen that a very complex model scores extremely high. Great, 99% recognition!! Not quite. This is often a consequence of 'overfitting'. When you hand the model a new photo it has not seen before, the model often has no idea what to do with it, because it has learned to recognize the specific photos from the database rather than patterns/'species'.

All kinds of tricks have been devised to prevent this, the most commonly used option being 'K-fold cross-validation', where during training a different part of the data is held out at each iteration of the process (often around 20% of the data, depending on the size of the database), and the model is then tested/scored on that held-out part at each iteration (again, this could be explained in detail, but this does not really seem the right place).
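As an aside, a minimal Python sketch of K-fold cross-validation with scikit-learn; the classifier and the random features/labels are placeholders, not the actual iNaturalist or ObsIdentify pipeline:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# stand-in data: 200 "image feature vectors" with 5 made-up species labels
X = np.random.rand(200, 64)
y = np.random.randint(0, 5, size=200)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # each fold holds out ~20% of the data and scores the model on it
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("per-fold accuracy:", [round(s, 2) for s in scores])
print("mean accuracy:", round(float(np.mean(scores)), 2))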

The final percentage of correct predictions in that K-fold cross-validation of the last iteration often already gives a fairly good indication of the quality of the model, but for quality one usually looks at the 'confusion matrix', which makes clear what goes right and what goes wrong (the values and percentages of correctly predicted photos that come out of it could in principle be called 'test values'). The confusion matrix also shows which species an incorrectly predicted species was predicted as, and so on. Those who had statistics in their education will no doubt recognize the terms 'false positive' and 'false negative'. I can imagine that in this case the aim is to keep the number of false positives as low as possible: better no prediction than an incorrect prediction.
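A small illustration of such a confusion matrix, using made-up labels and predictions (the species names are placeholders):

from sklearn.metrics import confusion_matrix

y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]

# rows = true species, columns = predicted species
cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C"])
print(cm)
# The off-diagonal cell (row A, column B) is a photo of species A predicted as B:
# a false negative for A and, at the same time, a false positive for B.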

The post you refer to here is therefore only meant to fish out things that are overlooked during development. There, the focus is usually on the broad outlines and on optimizing the model in general (possibly with preprocessing, and above all neural network design: layers, number of neurons, learning rate, and a whole lot of other parameters). What a public test phase like the one in that post tries to achieve is to find out whether some 'overfitting' has crept into the model after all (there are always pitfalls, whatever method is chosen to prevent it), and to catch any recurring errors for particular species or species groups before the whole thing is put into use...

Posted by ahospers 10 months ago

Specific to the scenario you raise, the CV suggestion rules already adjust the “raw” list of CV matches to “insert” other sister species seen nearby. From this post, it seems that the suggestion algorithm currently (see the sketch after the list below):
https://forum.inaturalist.org/t/better-use-of-location-in-computer-vision-suggestions/915/32
finds the common ancestor for the top 3 raw results,
searches for additional taxa descending from that ancestor that have been observed within 100 km of the observation’s location, and
inserts those taxa into the list of raw results based on the frequency of nearby observations.
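A rough, illustrative Python sketch of that insertion logic; the ancestor chains and nearby-observation counts are hypothetical stand-ins, not the actual computervision_controller.js implementation linked above:

def common_ancestor(taxa, ancestors):
    # ancestors maps each taxon to its root-first ancestor chain, ending in the taxon itself
    chains = [ancestors[t] for t in taxa]
    shared = None
    for level in zip(*chains):
        if len(set(level)) == 1:
            shared = level[0]
        else:
            break
    return shared

def with_nearby_insertions(raw_results, ancestors, nearby_counts, top_n=3):
    # raw_results: list of (taxon, vision_score) from the model
    # nearby_counts: taxon -> number of observations within ~100 km of the observation
    anc = common_ancestor([t for t, _ in raw_results[:top_n]], ancestors)
    already = {t for t, _ in raw_results}
    extra = [(t, n) for t, n in nearby_counts.items()
             if t not in already and anc in ancestors[t]]
    # insert nearby descendants of the common ancestor, ranked by how often
    # they have been observed nearby
    extra.sort(key=lambda tn: tn[1], reverse=True)
    return raw_results + [(t, None) for t, _ in extra]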
My guess is that this “insertion” process may be failing for Trirhabda observations because the raw CV results do not contain 3 closely related species. There are currently 2,905 putative Trirhabda observations. Of these, 1,195 are identified just as being genus Trirhabda. iNat recognizes 26 total species in the genus. Of these, there are 7 species that have no observations at present.

There are 2-3 Trirhabda species I would expect to be covered by the CV model. The first is Trirhabda bacharidis (currently with 665 observations), which had about 335 verifiable observations when the most recent training dataset was collected on 29 September 2019. The CV model should also be aware of Trirhabda flaviolimbata, which had about 410 verifiable observations by the cut-off date. The third possible species is Trirhabda canadensis, which had about 120 verifiable observations by 29 September 2019. However, it’s possible that fewer than 50 of these had a community ID, which would have excluded the species.

So, when someone uploads an observation, there’s a maximum of 2 or 3 Trirhabda species that could be returned in the result set. For the insertion process to search for other species under Trirhabda, the raw result set would need to ID all those Trirhabda species as the top 3 results. Failing that, the insertion process could kick in at the Family level, if the top 3 results are all in Chrysomelidae, but that spans a huge number of genera and species, so I doubt this would result in additional Trirhabda species being inserted.

So in summary it could be that suggestions for Trirhabda will improve quite a bit once there are 4 or 5 species covered by CV.

But your scenario does suggest that it’s worth looking for any logic tweaks that would better handle Trirhabda observations without degrading suggestions for other scenarios.

Back on your broader proposal, I see benefits for the prioritization you suggest, but this order does cause me concern:

https://forum.inaturalist.org/t/better-use-of-location-in-computer-vision-suggestions/915/32

Posted by ahospers 7 months ago

For a lot of taxa I work with, the species-level suggestions are comprehensive and accurate, even within genera of 5 - 20 plant species. I’m concerned that making the genus-level suggestion more prominent than a high-confidence species ID will result in lots of observations with genus-level initial IDs where in fact CV did a fine job of finding the right species. That creates a lot more work for identifiers.

I would support prioritizing the genus just for those observations where the algorithm can identify factors that call into question the reliability of a good visual match. These might be (a rough sketch follows the list):

Many related species that are not in scope for CV.
High rate of previous misidentifications.
Low CV coverage rate for this geography.
Some variable factor reflecting how amenable each iconic taxon is to image-based ID (e.g. it’s realistic to identify many flowering plants to species or even subspecies level from photographs, but for many arthropods a genus- or family-level ID is the best that can reasonably be expected).
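A hypothetical sketch of that idea; the thresholds and input fields are invented for illustration and are not from iNaturalist’s actual suggestion code:

def suggestion_level(species_score, related_species_not_in_cv, past_misid_rate, regional_cv_coverage):
    # fall back to a genus-level suggestion only when a factor undermines
    # the reliability of an otherwise confident species match
    risky = (related_species_not_in_cv > 3
             or past_misid_rate > 0.2
             or regional_cv_coverage < 0.5)
    return "genus" if risky or species_score < 0.9 else "species"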

The figure you linked to is just a simplified cartoon to provide a general example of what’s happening. It’s actually the top 10 results. If you want to dig into the code, here’s the “common ancestor” calculation. That said, I suspect your explanation for what went wrong with that obs is probably correct. https://github.com/inaturalist/iNaturalistAPI/blob/main/lib/controllers/v1/computervision_controller.js#L368

https://www.tandfonline.com/doi/full/10.1080/00219266.2020.1739114?scroll=top&needAccess=true

From the abstract: “Overall, iNaturalist provided greater taxonomic resolution with proper scientific and common names and additional natural history information for many organisms, piquing student interest. Incorporating this smartphone technology may increase identification of local biodiversity and student engagement in the biological sciences.”

Posted by ahospers 7 months ago

We're currently training a new model based on an export in September that had ~18 million images of 35k+ taxa. It's running with the same setup that we've used on previous models, but with a lot more data, so it will probably take ~210 days and be done some time next Spring. We're simultaneously experimenting with an updated system (TensorFlow 2, Xception vs Inception) that seems to be much faster, e.g. it seems like it might do the same job in 40-60 days, so if it seems like the new system performs about the same as the old one in terms of accuracy, inference speed, etc., we might just switch over to that and have a new model deployed in January or February 2021.
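For what it’s worth, a minimal transfer-learning sketch in TensorFlow 2 / Keras with an Xception backbone, as mentioned above; the image size, class count, and frozen-backbone setup are illustrative assumptions, not iNaturalist’s actual training configuration:

import tensorflow as tf

NUM_CLASSES = 35000  # order of magnitude from the post (~35k taxa); placeholder

# Xception backbone pre-trained on ImageNet, without its classification head
base = tf.keras.applications.Xception(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
base.trainable = False  # freeze the backbone for an initial warm-up phase

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # datasets not shown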

FWIW, COVID has kind of put a hitch in our goal of training 2 models a year. We actually ordered some new hardware right before our local shelter in place orders were issued, and we didn't feel the benefit of the new hardware outweighed the COVID risk of spending extended time inside at the office to assemble everything and get it running. Uncertainty about when it would be safe to do so was part of why we didn't start training a new model in the spring (that and the general insanity of the pandemic), but eventually we realized things weren't likely to get much better any time soon so we just started a new training job on the old system.

The Academy is actually open to the public again now

Posted by ahospers 5 months ago
