regex - Perl: removing unique lines between two files


Instead of removing duplicate lines, I'm interested in removing the unique lines found between two files. The files have different formats.

File 1:

m160505_031746_42156_s1_p0|105337|10450_16161
m160505_031746_42156_s1_p0|104750|20537_27903
m160505_031746_42156_s1_p0|103809|17563_25308
m160505_031746_42156_s1_p0|103217|8075_11486

File 2 (tab-separated):

accaatcccatcaccatctt    m160505_031746_42156_s1_p0|105337|10450_16161
attaaaataccattatatgg    m160505_031746_42156_s1_p0|104750|20537_27903
caaactccaactacgaactg    m160505_031746_42156_s1_p0|103809|17563_25308
atctatttaaacctaatcgg    m160505_031746_42156_s1_p0|103217|8075_11486
accaatcccatcaccatctt    m160505_031746_42156_s1_p0|152092|36592_40830
attaaaataccattatatgg    m160505_031746_42156_s1_p0|143825|13009_23809
caaactccaactacgaactg    m160505_031746_42156_s1_p0|143710|0_20191
atctatttaaacctaatcgg    m160505_031746_42156_s1_p0|140833|25358_34709

File 2 has the same lines as file 1 in column 2, preceded by 20 letters in column 1. Each 20-letter pattern in column 1 is repeated in file 2 (several times, more than twice), with a unique associated sequence at each occurrence.

I want to match the sequences in file 1 against the second column of file 2. If there is a match, I want to generate a new file with both columns for each match, maintaining the relationship file 2 has between the two columns. In effect, I'm looking to remove the rows in file 2 that do not have a column 2 match in file 1.

I realize my code needs help, but here is what I have so far, to give more of an idea of how I'm thinking. I may end up needing to use a hash, although I'm worried about doing that because of the repeats in column 1. I don't want to lose them, or their relationships to column 2.

use strict;
use warnings;

open(OUT, '>', '/path/to/out.txt') or die $!;
open(FMT0, '<', '/path/to/fmt0.txt') or die $!;

my $regex = qr/m160505_.*/;
while(my $line = <FMT0>){
    $line =~ $regex;
    open(FMT6, '<', '/path/to/fmt6.txt') or die $!;
    while(my $zero_fmt = <FMT6>){
        if ($zero_fmt =~ /([a-z]{20})\t($line)/i){
            print OUT $zero_fmt;
        }
    }
}

Thanks for the help!

Something like this might get the job done. :-)

grep -f <(grep ^m160505_ file1) file2 

Here's a Perl solution, since that's what you asked for:

#!/usr/bin/env perl

use strict;
use warnings;

die "usage: $0 <file1> <file2>\n"
  unless @ARGV == 2;

open(my $file1, '<', $ARGV[0])
  or die "could not open file1: $!\n";

# Remember every ID line from file1.
my %keys;
while (<$file1>) {
  chomp;
  $keys{$_} = 1 if /^m160505_/;
}

close($file1);

open(my $file2, '<', $ARGV[1])
  or die "could not open file2: $!\n";

# Print each file2 row whose second column is a known ID.
while (<$file2>) {
  chomp;
  my ($key) = /\t(.+)$/;
  next unless defined $key;   # skip lines without a tab-separated second column
  print "$_\n" if $keys{$key};
}

close($file2);

In action:

$ grep -f <(grep ^m160505_ file1) file2
accaatcccatcaccatctt    m160505_031746_42156_s1_p0|105337|10450_16161
attaaaataccattatatgg    m160505_031746_42156_s1_p0|104750|20537_27903
caaactccaactacgaactg    m160505_031746_42156_s1_p0|103809|17563_25308
atctatttaaacctaatcgg    m160505_031746_42156_s1_p0|103217|8075_11486

$ ./atgc.pl file1 file2
accaatcccatcaccatctt    m160505_031746_42156_s1_p0|105337|10450_16161
attaaaataccattatatgg    m160505_031746_42156_s1_p0|104750|20537_27903
caaactccaactacgaactg    m160505_031746_42156_s1_p0|103809|17563_25308
atctatttaaacctaatcgg    m160505_031746_42156_s1_p0|103217|8075_11486
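
One caveat with the grep one-liner: each line of file1 is used as a pattern that may match anywhere in a line of file2, so an ID that happens to be a prefix of a longer ID would match too. If that matters for your data, a possible variation is an exact comparison on column 2 with awk. This is only a sketch, assuming file1 holds one ID per line with no trailing whitespace and file2 is tab-separated. The temporary file names and the last sample row (an ID ending in 114860) are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the temp files and the final sample row are invented for illustration.
tmp=$(mktemp -d)

# file1: one ID per line.
printf '%s\n' \
  'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'm160505_031746_42156_s1_p0|103217|8075_11486' > "$tmp/file1"

# file2: 20-letter sequence, a tab, then an ID.
# The last row is hypothetical: its ID merely starts with an ID from file1.
printf '%s\t%s\n' \
  'accaatcccatcaccatctt' 'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'accaatcccatcaccatctt' 'm160505_031746_42156_s1_p0|152092|36592_40830' \
  'atctatttaaacctaatcgg' 'm160505_031746_42156_s1_p0|103217|8075_11486' \
  'atctatttaaacctaatcgg' 'm160505_031746_42156_s1_p0|103217|8075_114860' > "$tmp/file2"

# First file (NR == FNR): store each whole line as a key.
# Second file: print rows whose column 2 is exactly one of those keys.
filtered=$(awk -F '\t' 'NR == FNR { keep[$0]; next } $2 in keep' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$filtered"
rm -rf "$tmp"
```

With this sample data only the first and third rows of file2 are printed. The repeated column 1 values are not a problem, because the lookup is keyed on the IDs from file1 and every row of file2 is tested on its own.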
