Instead of removing duplicate lines, I am interested in removing the unique lines found between 2 files. The files have different formats.
File 1:

    m160505_031746_42156_s1_p0|105337|10450_16161
    m160505_031746_42156_s1_p0|104750|20537_27903
    m160505_031746_42156_s1_p0|103809|17563_25308
    m160505_031746_42156_s1_p0|103217|8075_11486

File 2 (tab separated):

    accaatcccatcaccatctt m160505_031746_42156_s1_p0|105337|10450_16161
    attaaaataccattatatgg m160505_031746_42156_s1_p0|104750|20537_27903
    caaactccaactacgaactg m160505_031746_42156_s1_p0|103809|17563_25308
    atctatttaaacctaatcgg m160505_031746_42156_s1_p0|103217|8075_11486
    accaatcccatcaccatctt m160505_031746_42156_s1_p0|152092|36592_40830
    attaaaataccattatatgg m160505_031746_42156_s1_p0|143825|13009_23809
    caaactccaactacgaactg m160505_031746_42156_s1_p0|143710|0_20191
    atctatttaaacctaatcgg m160505_031746_42156_s1_p0|140833|25358_34709
File 2 has the same lines as file 1 in column 2, preceded by 20 letters in column 1. The 20-letter pattern in column 1 is repeated in file 2 (several times, more than twice), with a unique associated sequence at each occurrence.

I want to match the sequences in file 1 against the second column of file 2. If there is a match, I want to generate a new file with both columns for each match, maintaining the relationship file 2 has between the 2 columns. In effect, I am looking to remove the rows in file 2 that do not have a column 2 match in file 1.

I realize my code needs help, but here is what I have so far, to give more of an idea of how I am thinking. I may end up needing to use a hash, although I am worried about doing so because of the repeats in column 1. I don't want to lose them, or their relationships to column 2.
    use strict;
    use warnings;

    open(OUT, '>', '/path/to/out.txt') or die $!;
    open(FMT0, '<', '/path/to/fmt0.txt') or die $!;

    my $regex = qr/m160505_.*/;

    while (my $line = <FMT0>) {
        $line =~ $regex;
        open(FMT6, '<', '/path/to/fmt6.txt') or die $!;
        while (my $zero_fmt = <FMT6>) {
            if ($zero_fmt =~ /([a-z]{20})\t($line)/i) {
                print OUT $zero_fmt;
            }
        }
    }
Thanks for the help!
Something like this might get the job done. :-)

    grep -f <(grep ^m160505_ file1) file2

Here's a Perl solution, since that's what you asked for:
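    #!/usr/bin/env perl
    use strict;
    use warnings;

    die "usage: $0 <file1> <file2>\n" unless @ARGV == 2;

    # Load every ID line from file1 into a hash for fast lookup.
    open(my $file1, '<', $ARGV[0]) or die "could not open file1: $!\n";
    my %keys;
    while (<$file1>) {
        chomp;
        $keys{$_} = 1 if /^m160505_/;
    }
    close($file1);

    # Print each file2 row whose second column is one of those IDs.
    # The hash is keyed on column 2, so repeats in column 1 are unaffected.
    open(my $file2, '<', $ARGV[1]) or die "could not open file2: $!\n";
    while (<$file2>) {
        chomp;
        my ($key) = /\t(.+)$/;
        next unless defined $key;    # skip lines that contain no tab
        print "$_\n" if $keys{$key};
    }
    close($file2);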
In action:

    $ grep -f <(grep ^m160505_ file1) file2
    accaatcccatcaccatctt m160505_031746_42156_s1_p0|105337|10450_16161
    attaaaataccattatatgg m160505_031746_42156_s1_p0|104750|20537_27903
    caaactccaactacgaactg m160505_031746_42156_s1_p0|103809|17563_25308
    atctatttaaacctaatcgg m160505_031746_42156_s1_p0|103217|8075_11486
    $ ./atgc.pl file1 file2
    accaatcccatcaccatctt m160505_031746_42156_s1_p0|105337|10450_16161
    attaaaataccattatatgg m160505_031746_42156_s1_p0|104750|20537_27903
    caaactccaactacgaactg m160505_031746_42156_s1_p0|103809|17563_25308
    atctatttaaacctaatcgg m160505_031746_42156_s1_p0|103217|8075_11486
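For comparison, the same hash-lookup idea can be sketched in awk. This is not part of the answer above, just a minimal illustration: the temporary directory, the `result` variable, and the cut-down sample rows (taken from the question) are assumptions made for the demo. It compares column 2 as a whole string, whereas `grep -f` treats each file 1 line as a pattern that may match anywhere in a row.

```shell
#!/bin/sh
# Build small sample inputs (rows borrowed from the question) in a temp dir.
tmp=$(mktemp -d)
printf '%s\n' \
  'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'm160505_031746_42156_s1_p0|104750|20537_27903' > "$tmp/file1"
# printf reuses the format for each pair of arguments: column1 <TAB> column2.
printf '%s\t%s\n' \
  'accaatcccatcaccatctt' 'm160505_031746_42156_s1_p0|105337|10450_16161' \
  'attaaaataccattatatgg' 'm160505_031746_42156_s1_p0|104750|20537_27903' \
  'accaatcccatcaccatctt' 'm160505_031746_42156_s1_p0|152092|36592_40830' > "$tmp/file2"

# While reading file1 (NR==FNR), store each whole line as an array key.
# While reading file2, print rows whose column 2 is exactly one of those keys.
result=$(awk -F'\t' 'NR==FNR { keep[$0] = 1; next } $2 in keep' \
  "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the array is keyed on the column 2 value, the repeated 20-letter pattern in column 1 is never used as a key: the first row is kept with its own column 2, and the third row (same column 1, different column 2) is dropped only because its column 2 is not in file 1.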