Regular expression gymnastics

Day two

A brief explanation of the concepts of a file and ASCII text. What roles do pipes and the diamond operator play in preparing data for analysis?

Examples:
  1. capturing data from a complex text string:

    1. on the command line:

      1. cat Pta.seq.uniq | grep "^>" | sed -E 's#.*/gi=([[:digit:]]+).*/len=([[:digit:]]+).*#\1,\2#'

      2. cat Pta.seq.uniq | grep "^>" | sed -E 's#.*/gi=([[:digit:]]+).*#\1#' | sort | uniq | wc -l

      3. cat Pta.seq.uniq | grep "^>" | sed -E 's#.*/gi=([[:digit:]]+).*#\1#' | wc -l

    2. in R … examples with sub() from weather station data

  2. editing misspellings or irregular capitalization

  3. substitution of field separators
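As a rough sketch of items 2 and 3, the shell functions below normalize capitalization and replace a field separator. The function names (upper_case, commas_to_tabs) are hypothetical, chosen only for illustration, and the comma-separated input line is made up. For the case change, tr is a simple stand-in: it maps characters one-for-one rather than matching a regular expression.

```shell
# Hypothetical helper: convert every lowercase letter on stdin to uppercase.
# tr maps characters one-for-one; it is not a regular-expression tool.
upper_case() {
  tr '[:lower:]' '[:upper:]'
}

# Hypothetical helper: replace each comma, plus any spaces around it,
# with a single tab character.
commas_to_tabs() {
  tab=$(printf '\t')   # a literal tab, accepted by both GNU and BSD sed
  sed -E "s/ *, */${tab}/g"
}

# Two station names from the Day one list that have irregular capitalization.
printf '%s\n' 'Lake_Yellowstone' 'Moran' | upper_case
# prints:
# LAKE_YELLOWSTONE
# MORAN

# An illustrative comma-separated line (made up for this example).
printf '%s\n' 'WY486440 , MORAN,9705' | commas_to_tabs
```

Because upper_case changes every letter on the line, it would also turn ".csv" into ".CSV"; apply it to the station-name component only if the extension should stay lowercase.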

 

Day one

For our first work with regular expressions, we will start with a browser-based tool. Please bring a laptop. To experiment with the UNIX tools, we will use laptops that run Linux or UNIX, log into a remote Linux machine (teton), or share computers as needed. Below are the data we will work with.

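Here is a list of file names. We want to write a regular expression that captures the three underscore-delimited components of each name separately (for example, WY480140, ALTA, and 0658), leaving off the “.csv”. Note that the middle component may itself contain underscores.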

WY480140_ALTA_0658.csv
WY480540_BASIN_1991.csv
WY480915_BORDER_1458.csv
WY481175_BUFFALO_BILL_DAM_3649.csv
WY481675_CHEYENNE_2836.csv
WY481730_CHUGWATER_8790.csv
WY481905_COLONY_5951.csv
WY482595_DIVERSION_DAM_1327.csv
WY482715_DUBOIS_1476.csv
WY483100_EVANSTON_3425.csv
WY484065_GREEN_RIVER_6236.csv
WY485345_Lake_Yellowstone_2847.csv
WY485415_LARAMIE_5154.csv
WY485830_LUSK_6803.csv
WY486195_MIDWEST_1214.csv
WY486440_Moran_9705.csv
WY486660_NEWCASTLE_0202.csv
WY487105_PATHFINDER_DAM_0600.csv
WY487115_PAVILLION_3678.csv
WY487240_PINE_BLUFFS_6539.csv
WY487260_PINEDALE_6771.csv
WY487388_POWELL_FIELD_STATION_6778.csv
WY487760_RIVERTON_0259.csv
WY487845_ROCK_SPRINGS_3352.csv
WY487990_SARATOGA_1846.csv
WY488160_SHERIDAN_FIELD_STATION_7211.csv
WY488995_TORRINGTON_EXP_FARM_9001.csv
WY489615_WHEATLAND_5516.csv
WY489770_WORLAND_0653.csv
WY489905_YNP_MAMMOTH_7791.csv
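One possible way to capture the three components is sketched below with sed. This is only one of several workable patterns, and the function name split_name and the comma used to join the captured pieces are assumptions made for illustration. The pattern relies on the first component containing no underscore and the last component being all digits, so that the greedy middle group can absorb any underscores in the station name.

```shell
# Hypothetical helper: split one file name into its three components,
# printed as "first,middle,last" with the ".csv" extension dropped.
split_name() {
  printf '%s\n' "$1" |
    sed -E 's/^([^_]+)_(.+)_([0-9]+)\.csv$/\1,\2,\3/'
}

split_name 'WY480140_ALTA_0658.csv'
# prints: WY480140,ALTA,0658

split_name 'WY481175_BUFFALO_BILL_DAM_3649.csv'
# prints: WY481175,BUFFALO_BILL_DAM,3649

split_name 'WY485345_Lake_Yellowstone_2847.csv'
# prints: WY485345,Lake_Yellowstone,2847
```

The same pattern should carry over to the browser-based tool; there the three parenthesized groups appear as capture groups 1, 2, and 3.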

We will also work with a 19 MB file, Pta.seq.uniq, that contains nucleotide sequences of expressed genes from a pine (Pinus taeda). Below are four lines of text from that file (two header lines, each followed by its sequence), which we’ll use to experiment.

>gnl|UG|Pta#S26634826 EST1155143 Subtracted pine embryo library, Lib_C Pinus taeda cDNA clone PIMEC35, mRNA sequence /clone=PIMEC35 /gb=DT630731 /gi=74160799 /ti=1684917080 /ug=Pta.13463 /len=447

GCCTCAAATCTCATATATATATATATATTATTAtttttttttttttCAATTCCTCAAAACTTCAAACCCTCATAACCCACCCCCTCTCTCGCGCCCGTCAGACGAAAGAGCAAGAACAGTGACAAGATCCAACCGGGGGTTAGGCTGTGACAACAACTGAATTAAGCCGCGGTTAGGAGGGCCGGGTTAGAGGCCACACACTTGTGTGGTTTTCCTTCGTAGTTGGGTGATGATGAGATTCAAGGACAAAGGCAGCAAAACGGAAATCTAGTTGAAATTCTTGGTCTGGATCCCCCATCTTCATCAACGGTGTCTATTGGCTGAACCAAGGTAAACAGGCCTGTGCTCATTCCCTTTCCGCCTCTTCTCTGCAAACAGCAAGAAACGAAAGCAACAATGTTCGACAGTTGGAGCATCGATAACAGTACCTCGGCCGCGACCACCCTA

>gnl|UG|Pta#S25096398 RTDK1_8_D05.b1_A029 Roots, dark Pinus taeda cDNA clone RTDK1_8_D05_A029 3', mRNA sequence /clone=RTDK1_8_D05_A029 /clone_end=3' /gb=DR069584 /gi=67047276 /ti=891267715 /ug=Pta.18 /len=922

CGCAAAGTTGATCACAATCTGGCGGGTTCGACTGTCGCTGTAGAGACTCTGGTCGGACGTGAAGAGGGTCTGCCTGTTGAGGAGATTGACATAGTATTTGTTGTCGAACAAATTGGGAGTGCGAATATCCAAGTTGGTGGTGTTAACGGTAGTACTGGCAGGACAGGTGAGATAAAGATTCTTAGCGAAAGTCTGGTCCATCGTGGGATCCTGCATTTGTGAACCAGTGGTGCTATTATATAGTCTGTTATCGAAGGAGGAGCAGTTGCCTCTGCCAATCGTGTGTCCTCCTGAAAGGGCCACCAGATCTGTGTTATTCAAGCCTTTGGGACCGAAAATGCTGATAAGGTGCGTTACGTTGGAGGTCGGGGCAGGCAAATTGGCCAGAACGGTCGTACTATTGGCGAAGGTAAGGCTGTCCCTGCGGCCGAGTGGTATGGGATAGCATGGCCCACCAGCTATGTCGACGGAGTCACGGGCCGCTAAGGCAAGAATGTCTGCACACGACACAGTTCCGCTGCACGCCGCTTCTACGTTCTCTTTGATGTCATTGATGATTTTCAGAGCCTCCGCTCTGAGTGATAAGTTGGGCGCAACTGTTTGCTCCCCCGATGTTGAGTTCAGCAACACAGACCCGTCGCATCCCTGGACAAAACAGTCGTGGAAGTGGAGCCTCAGCAATCCTGCAGCTTGTGTGATGTCTGCACTCAAATAGGCTTCCATGCGCTGCCGCACTATCGACTCCAATGACGGGCAACTTGTGTTGTAGAACGTCCAGGAAAGACCCGCCACGGGAGTTGGCAGAGCGTTCACAGCACTACCATATACAATCACAAATATAGAAAGCAAAACAGTGGCCGGAGTCATATTTGCCTCTTTCTTTGATATGCAATACCTTCTTCGATCCAGATCCAGTTCAGGA
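To try the Day two pattern without the full 19 MB file, the sketch below feeds the two header lines shown above to the same grep and sed pipeline. The function name gi_len is hypothetical, and the sequence line in the sample input is truncated to its first few bases to keep the example short.

```shell
# Hypothetical helper: read FASTA-style text on stdin, keep only the
# header lines (those starting with ">"), and print "gi,len" for each.
gi_len() {
  grep '^>' |
    sed -E 's#.*/gi=([[:digit:]]+).*/len=([[:digit:]]+).*#\1,\2#'
}

# The quoted 'EOF' keeps the shell from interpreting #, |, and ' in the data.
gi_len <<'EOF'
>gnl|UG|Pta#S26634826 EST1155143 Subtracted pine embryo library, Lib_C Pinus taeda cDNA clone PIMEC35, mRNA sequence /clone=PIMEC35 /gb=DT630731 /gi=74160799 /ti=1684917080 /ug=Pta.13463 /len=447
GCCTCAAATCTCATATATATATATATATTATTA
>gnl|UG|Pta#S25096398 RTDK1_8_D05.b1_A029 Roots, dark Pinus taeda cDNA clone RTDK1_8_D05_A029 3', mRNA sequence /clone=RTDK1_8_D05_A029 /clone_end=3' /gb=DR069584 /gi=67047276 /ti=891267715 /ug=Pta.18 /len=922
EOF
# prints:
# 74160799,447
# 67047276,922
```

Using # as the sed delimiter avoids having to escape the slashes in /gi= and /len=.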