Strategies for programmatic name cleaning

Scott Chamberlain

2020-09-16

Source: vignettes/name_cleaning.Rmd

taxize offers interactive prompts when using get_*() functions (e.g., get_tsn()). These prompts make it easy, in interactive use, to choose among results when more than one match is found.

However, to make your code reproducible you don’t want interactive prompts.

This vignette covers some options for programmatic name cleaning.

get_* functions

When using get_*() functions programmatically, you have a few options.

rows parameter

Normally, if you get more than one result, you get a prompt asking you to select which taxon you want.

get_tsn("Quercus b")
#>       tsn                       target             commonnames    nameusage
#> 1   19298             Quercus beebiana                         not accepted
#> 2  507263       Quercus berberidifolia               scrub oak     accepted
#> 3   19300              Quercus bicolor         swamp white oak     accepted
#> 4   19303             Quercus borealis                         not accepted
#> 5  195131 Quercus borealis var. maxima                         not accepted
#> 6  195166            Quercus boyntonii Boynton's sand post oak     accepted
#> 7  506533              Quercus brantii             Brant's oak     accepted
#> 8  195150            Quercus breviloba                         not accepted
#> 9  195099              Quercus breweri                         not accepted
#> 10 195168             Quercus buckleyi               Texas oak     accepted
#>
#> More than one TSN found for taxon 'Quercus b'!
#>
#> Enter rownumber of taxon (other inputs will return 'NA'):
#>
#> 1:

Instead, we can use the rows parameter to specify which records we want, by row number only (not by name). Here, we want the first 3 records:

get_tsn('Quercus b', rows = 1:3)
#>     tsn           target     commonnames    nameusage
#> 1 19298 Quercus beebiana                 not accepted
#> 2 19300  Quercus bicolor swamp white oak     accepted
#> 3 19303 Quercus borealis                 not accepted
#>
#> More than one TSN found for taxon 'Quercus b'!
#>
#> Enter rownumber of taxon (other inputs will return 'NA'):
#>
#> 1:

However, you still get a prompt as there is more than one result.

Thus, for full programmatic usage, you can specify a single row, if you happen to know which one you want:

get_tsn('Quercus b', rows = 3)
#> ══ 1 queries ═══════════════
#> ✔ Found: Quercus b
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
#> [1] "19303"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] TRUE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19303"

In reality it is unlikely you’ll know which row you want, unless perhaps you just want one result from each query, regardless of what it is.

underscore methods

Underscore methods are a better fit for programmatic use. Each get_*() function has a sister method with a trailing underscore, e.g., get_tsn() and get_tsn_().

get_tsn_("Quercus b")
#> $`Quercus b`
#> # A tibble: 5 x 4
#>   tsn    scientificName         commonNames                           nameUsage
#>   <chr>  <chr>                  <chr>                                 <chr>
#> 1 19300  Quercus bicolor        swamp white oak,chêne bicolore        accepted
#> 2 195166 Quercus boyntonii      Boynton's sand post oak,Boynton's oak accepted
#> 3 195168 Quercus buckleyi       Texas oak,Buckley's oak               accepted
#> 4 506533 Quercus brantii        Brant's oak                           accepted
#> 5 507263 Quercus berberidifolia scrub oak                             accepted

The result is a named list with one data.frame per taxon queried, which can be processed downstream with whatever logic is required in your workflow.

You can also combine the rows parameter with underscore functions, as a single number or a range of numbers:

get_tsn_("Quercus b", rows = 1)
#> $`Quercus b`
#> # A tibble: 1 x 4
#>   tsn   scientificName  commonNames                    nameUsage
#>   <chr> <chr>           <chr>                          <chr>
#> 1 19300 Quercus bicolor swamp white oak,chêne bicolore accepted

get_tsn_("Quercus b", rows = 1:2)
#> $`Quercus b`
#> # A tibble: 2 x 4
#>   tsn    scientificName    commonNames                           nameUsage
#>   <chr>  <chr>             <chr>                                 <chr>
#> 1 19300  Quercus bicolor   swamp white oak,chêne bicolore        accepted
#> 2 195166 Quercus boyntonii Boynton's sand post oak,Boynton's oak accepted
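As a rough illustration of the downstream processing that underscore methods allow, the sketch below picks one identifier per query from a list shaped like the get_tsn_() output. It uses a hand-built mock of that list (values copied from the output above) so it runs without network access; pick_tsn() is a hypothetical helper, not part of taxize, and exact matching on scientificName is only one possible selection rule.

```r
# Mock of the list returned by get_tsn_(): one data frame per queried name.
# Values are copied from the output shown above; a real call needs network access.
candidates <- list(
  "Quercus b" = data.frame(
    tsn = c("19300", "195166", "195168", "506533", "507263"),
    scientificName = c("Quercus bicolor", "Quercus boyntonii", "Quercus buckleyi",
                       "Quercus brantii", "Quercus berberidifolia"),
    nameUsage = rep("accepted", 5),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper (not a taxize function): return the TSN of the accepted
# row whose scientificName equals `wanted`, or NA when there is not exactly
# one such row.
pick_tsn <- function(df, wanted) {
  hit <- df[df$scientificName == wanted & df$nameUsage == "accepted", , drop = FALSE]
  if (nrow(hit) == 1) hit$tsn else NA_character_
}

# Apply the rule to every query; the result is a character vector named by query.
chosen <- vapply(candidates, pick_tsn, character(1), wanted = "Quercus bicolor")
```

Because the rule returns NA rather than prompting when a match is missing or ambiguous, the same script gives the same result on every run. The selected identifiers can then be turned into the S3 class with as.tsn().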

as.* methods

All get_*() functions have associated as.*() functions (e.g., get_tsn() and as.tsn()).

Many taxize functions use taxonomic identifier classes (S3 objects) that are the output of get_*() functions. as.*() methods make it easy to make the required S3 taxonomic identifier classes if you already know the identifier. For example:

Already a tsn, returns the same

as.tsn(get_tsn("Quercus douglasii"))
#> ══ 1 queries ═══════════════
#> ✔ Found: Quercus douglasii
#> ══ Results ═════════════════
#>
#> ● Total: 1
#> ● Found: 1
#> ● Not Found: 0
#> [1] "19322"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found"
#> attr(,"multiple_matches")
#> [1] FALSE
#> attr(,"pattern_match")
#> [1] FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"

numeric

as.tsn(c(19322, 129313, 506198))
#> [1] "19322"  "129313" "506198"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=129313"
#> [3] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=506198"

You can do the same for character or list inputs, depending on the data source.

The above as.tsn() examples have the parameter check = TRUE, meaning we ping the data source web service to make sure the identifier exists. You can skip that check by setting check = FALSE, and the result is returned much faster:

as.tsn(c("19322","129313","506198"), check = FALSE)
#> [1] "19322"  "129313" "506198"
#> attr(,"class")
#> [1] "tsn"
#> attr(,"match")
#> [1] "found" "found" "found"
#> attr(,"multiple_matches")
#> [1] FALSE FALSE FALSE
#> attr(,"pattern_match")
#> [1] FALSE FALSE FALSE
#> attr(,"uri")
#> [1] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19322"
#> [2] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=129313"
#> [3] "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=506198"

With the output of as.*() methods, you can then proceed with other taxize functions.

gnr_resolve

Some functions in taxize are meant specifically for name cleaning. One of those is gnr_resolve().

gnr_resolve() doesn’t prompt the way get_*() functions do; instead it returns a data.frame. So we don’t face the same problem, and can use gnr_resolve() in a programmatic workflow straight away.

spp <- names_list(rank = "species", size = 10)
gnr_resolve(spp, preferred_data_sources = 11)
#> # A tibble: 13 x 5
#>    user_supplied_na… submitted_name   matched_name        data_source_tit… score
#>  * <chr>             <chr>            <chr>               <chr>            <dbl>
#>  1 Astragalus radka… Astragalus radk… Astragalus radkane… GBIF Backbone T… 0.988
#>  2 Montanoa gigas    Montanoa gigas   Montanoa gigas Rze… GBIF Backbone T… 0.988
#>  3 Serratula semise… Serratula semis… Serratula semiserr… GBIF Backbone T… 0.988
#>  4 Serratula semise… Serratula semis… Serratula semiserr… GBIF Backbone T… 0.988
#>  5 Delosperma pagea… Delosperma page… Delosperma pageanu… GBIF Backbone T… 0.988
#>  6 Delosperma pagea… Delosperma page… Delosperma pageanu… GBIF Backbone T… 0.988
#>  7 Zieria hydroscop… Zieria hydrosco… Zieria hydroscopic… GBIF Backbone T… 0.988
#>  8 Baccharis flabel… Baccharis flabe… Baccharis flabella… GBIF Backbone T… 0.988
#>  9 Piper gonocarpum  Piper gonocarpum Piper gonocarpum T… GBIF Backbone T… 0.988
#> 10 Lathraea japonica Lathraea japoni… Lathraea japonica … GBIF Backbone T… 0.988
#> 11 Lathraea japonica Lathraea japoni… Lathraea japonica … GBIF Backbone T… 0.988
#> 12 Lathraea japonica Lathraea japoni… Lathraea japonica … GBIF Backbone T… 0.988
#> 13 Verbesina tachir… Verbesina tachi… Verbesina tachiren… GBIF Backbone T… 0.988
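Since gnr_resolve() can return several rows for the same supplied name, a programmatic workflow usually needs a rule to reduce them. The sketch below shows one such rule, keeping the highest-scoring row per supplied name, in base R. It runs on a mock data frame rather than a live query: the rows are invented for illustration, the column names are assumed to match the (truncated) headers printed above, and best_match() is a hypothetical helper, not a taxize function.

```r
# Mock data frame with columns assumed to match gnr_resolve() output.
# The rows are invented for illustration and are not real query results.
resolved <- data.frame(
  user_supplied_name = c("Name a", "Name a", "Name b"),
  matched_name = c("Name a Auth.", "Name a Auth. var.", "Name b Auth."),
  data_source_title = "Example source",
  score = c(0.988, 0.750, 0.988),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the highest-scoring match for each supplied name.
best_match <- function(df) {
  # Sort by name, then by descending score, so the best row comes first ...
  df <- df[order(df$user_supplied_name, -df$score), , drop = FALSE]
  # ... and drop every later row for the same supplied name.
  df[!duplicated(df$user_supplied_name), , drop = FALSE]
}

best <- best_match(resolved)
```

When scores tie, as they do in the output above, this keeps whichever tied row sorts first, so a real workflow may want an additional tie-breaking rule.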

Other functions

Some other functions in taxize use get_*() functions internally (e.g., classification()), but you can generally pass parameters through to those internal get_*() calls.

Feedback?

Let us know if you have ideas for better ways to do programmatic name cleaning at https://github.com/ropensci/taxize/issues or https://discuss.ropensci.org/ !
