
Leetcode - Repeated DNA Sequences

 
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

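[Analysis]
The basic idea is easy to come up with: iterate over every length-10 substring of the input string and use a HashMap to count its occurrences; any substring that appears two or more times goes into the result.
Adding a substring to the result only on its second occurrence keeps duplicates out of the result. A direct implementation with String keys, however, gets Memory Limit Exceeded: the program simply uses too much memory.
The key to this problem is to convert the substrings being checked into ints to save memory. How can a substring be encoded efficiently? There are only four characters, A, C, G and T, so two bits are enough to tell them apart: 00, 01, 10 and 11. A 10-letter substring therefore fits in 20 bits.
The masking trick from the reference solution is worth learning. It uses an 18-bit mask, 0x3ffff, called eraser. Each time the window advances by one character, compute hint & eraser on the old code to drop the two bits of the oldest character, shift the result left by two bits, and add the code of the new character.
That yields the code of the new substring. Very neat.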
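The encoding step can be tried in isolation. The following is a minimal sketch, not the reference solution itself: the class name RollingDnaCode and the helpers code, encode and roll are hypothetical names introduced here for illustration, and a switch stands in for the HashMap lookup table. It assumes the input contains only A, C, G and T.

```java
class RollingDnaCode {
    // Keeps the low 18 bits, i.e. the codes of the 9 most recent letters.
    static final int ERASER = 0x3ffff;

    // 2-bit code per nucleotide: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Encodes the first 10 letters of s into a 20-bit integer.
    static int encode(String s) {
        int hint = 0;
        for (int i = 0; i < 10; i++) {
            hint = (hint << 2) + code(s.charAt(i));
        }
        return hint;
    }

    // Slides the window one letter to the right: drop the oldest letter's
    // two bits, make room with a shift, then append the new letter's code.
    static int roll(int hint, char next) {
        return ((hint & ERASER) << 2) + code(next);
    }

    public static void main(String[] args) {
        String s = "AAAAACCCCCA";
        int first = encode(s);                   // code of "AAAAACCCCC"
        int second = roll(first, s.charAt(10));  // code of "AAAACCCCCA"
        // The rolled code matches encoding the shifted window from scratch.
        System.out.println(second == encode(s.substring(1))); // prints true
    }
}
```

Without the mask, the two bits of the oldest letter would be shifted into bits 20 and 21, and the same 10-letter window could end up with different codes depending on what preceded it.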

[ref]
http://blog.csdn.net/coderhuhy/article/details/43647731

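import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class Solution {
    // Method 2: the HashMap stores int codes instead of Strings to avoid MLE
    // eraser keeps the low 18 bits of a code (the 9 most recent letters)
    public static final int eraser = 0x3ffff;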
    public static HashMap<Character, Integer> ati = new HashMap<Character, Integer>();
    static {
        ati.put('A', 0);
        ati.put('C', 1);
        ati.put('G', 2);
        ati.put('T', 3);
    }
    public List<String> findRepeatedDnaSequences(String s) {
        List<String> result = new ArrayList<String>();
        if (s == null || s.length() <= 10)
            return result;
        int N = s.length();
        int hint = 0;
        for (int i = 0; i < 10; i++) {
            hint = (hint << 2) + ati.get(s.charAt(i));
        }
        HashMap<Integer, Integer> checker = new HashMap<Integer, Integer>();
        checker.put(hint, 1);
        for (int i = 10; i < N; i++) {
            hint = ((hint & eraser) << 2) + ati.get(s.charAt(i));
            Integer value = checker.get(hint);
            if (value == null) {
                checker.put(hint, 1);
            } else if (value == 1) {
                checker.put(hint, value + 1);
                result.add(s.substring(i - 9, i + 1));
            }
        }
        return result;
    }
    // Method 1: String-keyed HashMap; correct, but gets Memory Limit Exceeded
    public List<String> findRepeatedDnaSequences1(String s) {
        HashMap<String, Integer> map = new HashMap<String, Integer>();
        int last = s.length() - 10;
        for (int i = 0; i <= last; i++) {
            String key = s.substring(i, i + 10);
            if (map.containsKey(key)) {
                map.put(key, map.get(key) + 1);
            } else {
                map.put(key, 1);
            }
        }
        List<String> result = new ArrayList<String>();
        for (String key : map.keySet()) {
            if (map.get(key) > 1)
                result.add(key);
        }
        return result;
    }
}