11 votes

Programming Challenge: Freestyle textual analysis.

I just realized that I completely glossed over this week's programming challenge. For this week, let's do something more flexible: write a program that accepts a file as input--either through a file name/path or through standard input--and performs any form of analysis you want on its contents. That's it!

For instance, you could count the occurrences of each word and find the most common ones, or you could determine the average sentence length, or the average unique words per sentence. You could even perform an analysis on changes in words and sentence structure over time (could be useful for e.g. poetry where metre may play an important role). You can stick with simple numbers or dive right into the grittiest forms of textual analysis currently available. You could output raw text or even a graphical representation. You could even do a bit of everything!
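
As one possible starting point, here is a minimal Python sketch of two of the analyses mentioned above: the most common words and the average sentence length. The function names (`read_input`, `analyze`) and the simple regex-based definitions of "word" and "sentence" are assumptions made for illustration, not part of the challenge.

```python
import re
import sys
from collections import Counter


def read_input(args):
    """Return the text of the file named in args[0], or of standard input."""
    if args:
        with open(args[0], encoding="utf-8") as handle:
            return handle.read()
    return sys.stdin.read()


def analyze(text, top=3):
    """Count words and measure the average sentence length in words."""
    # Assumption: a sentence ends at '.', '!' or '?'; a word is letters/apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    words = Counter(re.findall(r"[A-Za-z']+", text.lower()))
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return {
        "most_common": words.most_common(top),
        "sentences": len(sentences),
        "average_sentence_length": average,
    }
```

To use it as a script, one could call `print(analyze(read_input(sys.argv[1:])))`, which reads the named file if a path is given and standard input otherwise.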

How simple or complex your solution ends up being is completely up to you, but I encourage you to challenge yourself by e.g. learning a new language or about different textual analysis techniques, or by focusing on code quality rather than complexity, or even by taking a completely different approach to the problem than you ordinarily would. There are a lot of learning opportunities available here.

28 comments

  1. [4]
    39hp

    I wanted to make something that I could actually use. I'm a biologist and something that's really boring is making PCR primers. It's boring because it's super mundane and because I'm always second-guessing myself, making sure that I picked the right pair, that they fit all the criteria, and that I'm not making primers for the same strand.

    So I built a primer designer in Python. Just paste your sequence in FASTA format. It converts the sequence from multi-line to single-line, scans it in windows of the designated length, and seeks out stretches that contain the desired GC content. For each stretch that fits those criteria, the code prints: the sequence, the reverse complement, the location of the primer, the GC%, and the melting temperature of the primer.

    I'm sure this has been done before, but I'm super proud that I was able to build this from scratch:

    print("Paste FASTA sequence, then hit 'enter'...")
    
    # Read pasted lines until the first empty line.
    fastaSL=[]
    while True:
        fasta=input().strip()
        if fasta=='':
            break
        # Skip FASTA header lines (they start with '>'); keep only sequence lines.
        if not fasta.startswith('>'):
            fastaSL.append(fasta.upper())
    
    # Convert the multi-line sequence to a single line.
    gene=''.join(fastaSL)
    
    size=int(input('Primer length:\n'))
    GCmax=int(input('Max GC%:\n'))
    GCmin=int(input('Min GC%:\n'))
    
    # Base pairing used to build the reverse complement.
    pair={'A':'T','T':'A','C':'G','G':'C'}
    
    # Scan every window of the requested length.
    for loc in range(0,len(gene)-size+1):
        frag=gene[loc:loc+size]
        GCcount=frag.count('G')+frag.count('C')
        # Keep only windows whose GC% lies strictly between the two limits.
        if GCmax > 100*GCcount/len(frag) > GCmin:
            # Reverse the fragment, then complement each base (unknown bases become 'N').
            rcprim=''.join(pair.get(b,'N') for b in frag[::-1])
            print("......Coding Sequence...", frag)
            print("...Reverse Complement...", rcprim)
            print("......Primer Location...", loc, '-', loc+len(frag), 'bases')
            print("..................GC%...", 100*GCcount//len(frag))
            # Rough melting temperature: 4 C per G/C plus 2 C per other base.
            print("...................Tm...", 4*GCcount+2*(len(frag)-GCcount),'C \n')
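
    A possible variation, assuming the same criteria as the script: the three calculations packaged as small functions so they can be checked on a known sequence. The function names are hypothetical, and the melting temperature uses the same rule of thumb as the script (4 C per G or C, 2 C per other base).

```python
# Lookup table that swaps each base for its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(fragment):
    """Percentage of bases in the fragment that are G or C."""
    return 100 * (fragment.count("G") + fragment.count("C")) / len(fragment)


def reverse_complement(fragment):
    """Complement every base, then reverse the result."""
    return fragment.translate(COMPLEMENT)[::-1]


def rough_tm(fragment):
    """Rule-of-thumb melting temperature: 4 per G/C plus 2 per other base."""
    gc = fragment.count("G") + fragment.count("C")
    return 4 * gc + 2 * (len(fragment) - gc)


def find_primers(gene, size, gc_min, gc_max):
    """Yield (start, fragment) for each window with gc_min < GC% < gc_max."""
    for start in range(len(gene) - size + 1):
        fragment = gene[start:start + size]
        if gc_min < gc_percent(fragment) < gc_max:
            yield start, fragment
```

    For example, `list(find_primers("AAGCTT", 4, 40, 60))` walks the three 4-base windows of that sequence and keeps those with a GC% between 40 and 60.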
    
    5 votes
    1. [3]
      Emerald_Knight

      Good use for a programming challenge! I understand nothing about what it is you're actually doing as a biologist, but I'm glad to hear you got some practical benefit out of this :)

      1 vote
      1. [2]
        39hp

        I'm super happy with the progress I've made learning to program. I started last Friday(?), didn't really start trying things until Monday, and now I've built something I can use.

        I'm sure it sounds super corny, but I'm really proud of myself right now, lol!

        2 votes
        1. Emerald_Knight

          There's nothing corny about it. Every craftsman who enjoys their work will feel pride in their accomplishments!

  2. [14]
    39hp

    Sorry I’m late to this party, but do you have an intro to programming somewhere? I’m really interested in learning this skill, but I don’t even know where to start :(

    4 votes
    1. [6]
      Celeo

      I feel a little dirty linking to reddit, but /r/learnprogramming has a lot of resources in their wiki: https://www.reddit.com/r/learnprogramming/wiki/faq

      4 votes
      1. [2]
        Emerald_Knight

        Ha! Looks like we ended up both pointing to the same resource. You just slightly beat me to the punch :)

        1 vote
        1. Celeo

          Haha, a great idea then!

      2. [3]
        delicious_grownups

        Don't feel dirty! Reddit is not a bad place, inherently. It does far more good things than bad, I think. But this place is so pristine and we've all been having such quality conversation here that I'm already finding that I now need to use both sites to get my fix.

        1 vote
        1. [2]
          Celeo

          True. I've been unsubbing from more and more subreddits, distilling what I see to smaller communities that are still interesting. I come here for good conversation, but still keep reddit around for its constant stream of links.

          1. delicious_grownups

            And the porn.

            In all seriousness, the sheer force of numbers is part of it as well. I posted a link here about the Manafort family texts and it got a few upvotes and a few comments. I posted the same link to the /russialago sub and it got over 3k upvotes and hundreds of comments. Hard to give up a platform that gives you that kind of reach

    2. [2]
      Gyrfalcon

      I just linked this over in the daily discussion, and I think it's probably on the learnprogramming wiki, but I am personally a fan of Automate the Boring Stuff with Python. Python is a good first language, very forgiving and you can do a lot with it, plus the book helps you to work on practical problems pretty early on. Good luck, and don't be afraid to reach out to the community here as you learn!

      3 votes
      1. 39hp

        Thank you for that! I will be reading that book!

        1 vote
    3. Emerald_Knight

      I only know of codecademy, and I honestly don't care for how much they hold your hand and make it too easy to not actually learn anything. That being said, this subreddit might be able to help you. They list a number of resources to help you decide which language to start with and how to get introduced to it.

      Please let me know if you have any trouble with anything in there. I would be happy to help you find alternative resources, help you troubleshoot setting up, or clarify anything that seems confusing. If you manage to get things set up and learn the basics without any trouble, then I also wouldn't mind providing further guidance moving forward.

      2 votes
    4. [3]
      Soptik

      A really good website, where I learnt programming, is ict.social. 90% of the articles are free; only the advanced ones and the exercises (which you don't need) are paid. Don't worry, by the time you arrive at the advanced topics, you'll be experienced enough to search for the information elsewhere.

      C# is one of the most modern languages. It's very robust, you'll feel great when programming in it, and it has great support. It's one of the most powerful languages I've ever seen, and it's really fast as well. You can use it to make everything - Windows apps, Linux apps, Mac apps, Chrome OS apps, Android apps, iPhone apps, servers, web browsers, games... Everything. I started with this language.

      However, you might like Python as well. It's a great starter language, as its syntax feels much more natural than C# syntax. On the other hand, it's a really slow language. You'll find a lot of resources for it, because a lot of people start with this language. But because Python feels so natural, your projects will almost always look like a mess and you'll have problems finding your way around your own code as soon as you start coding something bigger. C# is stricter and not so messy.

      So I'd recommend Python for getting a general idea of what programming is and for short scripts (a few hundred lines at most), and C# if you're serious about programming and want to build big projects, or if you like a more structured and strict environment.

      Good luck!

      If you ever run into problems, PM me; I'd be glad to answer all your questions.

      1 vote
      1. [2]
        39hp

        Just an enormous thank you to everyone in this thread!

        I have some downtime between experiments and I'm teaching myself Python with some YouTube videos. I've got some data sets I'm going to try to play with and hopefully participate in that AI challenge.

        I dove in head first, and it honestly doesn't seem so bad yet. :)

        1 vote
        1. Soptik

          Good luck with the challenge :-) I'll post my C# solution of the challenge, so you can take a look if you really get stuck. Wikipedia could help you a lot here, it's probably much better explained there than in my challenge.

          Python is a great language; its syntax looks and feels so natural and beautiful. And take a look into object-oriented programming as soon as you learn the basics in Python, it'll be worth it, I promise :-)

          1 vote
  3. [10]
    Gyrfalcon

    I have a simple thing to submit this week; I may come back with something more complex if I have time. I have started working on a program to solve sudoku puzzles. To that end, I devised a way to represent the starting state of the board in a text file. Here is a sample one I made for testing.

    0,0,0,1
    1,3,1,2
    2,6,2,3
    3,1,3,4
    4,4,4,5
    5,7,5,6
    6,2,6,7
    7,5,7,8
    8,8,8,9
    

    The first number is the sector on the board, the next two are the column and row of the square, and the last number is the value to place in that square. Here is the relevant code section for processing the file into the structs I am using to store the information:

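    int main(int argc, char const *argv[])
    {
      BOARD gameBoard = newBoard();
    
      // TODO prompt for file name instead
      FILE *fp = fopen("testboard.txt", "r");
      if (fp == NULL)
      {
        perror("testboard.txt");
        return 1;
      }
    
      int value, row, column, sector;
      // Stop as soon as a full sector,column,row,value record cannot be read;
      // looping on !feof(fp) would run once more after the last line.
      while(fscanf(fp, "%d,%d,%d,%d", &sector,&column,&row,&value) == 4)
      {
        // Free the existing square before replacing it with the one from the file.
        free(gameBoard.squares[column][row]);
        gameBoard.squares[column][row] = newSquare(value, row, column, sector);
      }
    
      fclose(fp);
      printBoard(gameBoard);
    
      return 0;
    }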
    

    And lastly, here's the output from the test file I showed above:

     1       |          |          
             |  4       |          
             |          |  7       
    ------------------------------
        2    |          |          
             |     5    |          
             |          |     8    
    ------------------------------
           3 |          |          
             |        6 |          
             |          |        9
    

    I'm not sure if it will quite come out here but it looks good in the terminal!

    3 votes
    1. [9]
      Emerald_Knight

      You definitely went in a completely different direction than I'd intended, but it certainly suits the challenge description! Good idea with the 4-digit number placement format :)

      1 vote
      1. [8]
        Gyrfalcon

        Thanks! Encoding the sector was the only elegant way I could think of to check for what's missing in a sector. In the future I want to add a header with some more information, things like if it is a 9 number puzzle or a 4 number puzzle and the difficulty level. First I need to make some kind of algorithm to solve puzzles though!
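
        For what it's worth, on a standard 9x9 board the sector can also be derived from the column and row, so it would not strictly have to be stored in the file. A minimal Python sketch, assuming sectors are numbered 0 to 8 from left to right and top to bottom (which matches the sample file earlier in the thread); `sector_of`, `missing_in_sector` and `parse` are hypothetical helper names:

```python
def sector_of(column, row):
    """Sector index 0-8, numbered left to right, then top to bottom."""
    return 3 * (row // 3) + column // 3


def missing_in_sector(records, sector):
    """Values 1-9 not yet placed in the given sector.

    records is a list of (sector, column, row, value) tuples, one per file line.
    """
    placed = {value for _, column, row, value in records
              if sector_of(column, row) == sector}
    return sorted(set(range(1, 10)) - placed)


def parse(text):
    """Parse 'sector,column,row,value' lines, ignoring blank lines."""
    return [tuple(int(part) for part in line.split(","))
            for line in text.splitlines() if line.strip()]
```

        Keeping the sector in the file is still a reasonable choice; the computed value could then serve as a sanity check on each line.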

        1 vote
        1. [7]
          Emerald_Knight

          You might also consider some form of serialization. For instance, the first digit could indicate what kind of puzzle it is--4- or 9-number puzzle--and the following digits would follow a very strict scheme, e.g. each group of 9 details a quadrant, for each of its three rows from left to right, and a 0 would indicate an empty slot.

          For example, with a 4-number game:

          41234001020004321
          
          1 2 | 0 0
          3 4 | 1 0
          ---------
          2 0 | 4 3
          0 0 | 2 1
          

          If you wanted, you could add a delimiter to allow for a variable-sized puzzle (e.g. 10-number):

          4;1234001020004321
          

          There are plenty of ways you can do the encoding and decoding. You can also go for a multi-line option like you're currently doing. There's no one right answer.
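
          As a rough illustration, decoding the quadrant-by-quadrant scheme could look like the Python sketch below. It assumes quadrants are listed left to right and top to bottom (as in the 4-number example), that every value is a single digit, and that the puzzle size is a perfect square; `decode` is a hypothetical name.

```python
import math


def decode(serialized):
    """Turn '4;1234001020004321' (or '41234001020004321') into a list of rows."""
    if ";" in serialized:
        size_text, cells = serialized.split(";", 1)
    else:
        size_text, cells = serialized[0], serialized[1:]
    n = int(size_text)
    box = math.isqrt(n)                      # quadrant width, e.g. 2 for a 4-number puzzle
    if box * box != n or len(cells) != n * n:
        raise ValueError("unexpected puzzle size or cell count")
    grid = [[0] * n for _ in range(n)]
    for index, ch in enumerate(cells):
        quadrant, offset = divmod(index, n)  # which quadrant, and the position inside it
        row = box * (quadrant // box) + offset // box
        column = box * (quadrant % box) + offset % box
        grid[row][column] = int(ch)          # 0 marks an empty slot
    return grid
```

          Puzzles with values above 9 would need a separator between cells as well, since one character per cell is no longer enough.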

          As for implementing a solver, definitely look into strategies like X-wing and Swordfish. At some point you'll need to do brute force attempts for particularly difficult puzzles, of course, but those are a good starting point for solving the easier ones :)

          2 votes
          1. [6]
            Gyrfalcon

            Hmmm, the serialization could be good, but I originally devised the current scheme so that I could specify the puzzles sparsely, if you will. I will have to check those algorithms out, though for a first pass I want to try teaching the computer to solve the puzzles the way I do. For one I think it would be interesting, and for two I want to see the maximum potential of the strategies I currently employ without my ability to execute limiting them.

            2 votes
            1. [5]
              Emerald_Knight

              No problem. I just want to make sure you have additional options to consider.

              Out of curiosity, which strategies do you currently employ?

              2 votes
              1. [4]
                Gyrfalcon

                This isn't exactly what I do when solving myself, but this is the closest thing I think would be effective as a program. Anyway:

                1. For each square, check what is present in its row, column, and sector. The possibilities for that square are only the numbers that do not appear in these checks.
                2. Come back to each square. If it only has 1 possibility, assign it that value. If it has more than one possibility, check to see if any of its possibilities are unrepresented in its row, column, and sector. If it is the only possible position for a value in any of those groupings, it must be that value.
                3. After each assignment, redo step 1.
                4. Once every square is filled, the puzzle is solved.
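
                The four steps above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the board is a 9x9 list of lists with 0 for an empty square, the puzzle is valid, and the function names are hypothetical. It stops when no square can be decided, so harder puzzles may be left unfinished.

```python
def candidates(grid, r, c):
    """Step 1: digits not yet used in the row, column, or sector of (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def units(r, c):
    """The three groups of squares (row, column, sector) that contain (r, c)."""
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return [
        [(r, j) for j in range(9)],
        [(i, c) for i in range(9)],
        [(i, j) for i in range(br, br + 3) for j in range(bc, bc + 3)],
    ]


def pick_value(grid, r, c):
    """Step 2: return the forced value for an empty square, or None."""
    options = candidates(grid, r, c)
    if len(options) == 1:                  # only one possibility left
        return next(iter(options))
    for value in options:                  # only possible position for a value
        for unit in units(r, c):
            others = [(i, j) for i, j in unit
                      if (i, j) != (r, c) and grid[i][j] == 0]
            if all(value not in candidates(grid, i, j) for i, j in others):
                return value
    return None


def solve(grid):
    """Fill forced squares in place until stuck; return True if solved."""
    progress = True
    while progress:
        progress = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    value = pick_value(grid, r, c)
                    if value is not None:
                        # Step 3: possibilities are recomputed on demand,
                        # so every later check already sees this assignment.
                        grid[r][c] = value
                        progress = True
    return all(0 not in row for row in grid)   # step 4
```

                Recomputing the possibilities each time is slower than caching them per square, but it avoids the bookkeeping mistakes that a stale possibility list would cause.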

                I have been solving puzzles on my phone, and the app that I use makes this strategy relatively simple to carry out, since I can mark possibilities on every square, and then begin placing numbers. As I mentioned before, this strategy relies on me executing the setup properly, which happens less than I would like.

                2 votes
                1. [3]
                  Emerald_Knight

                  Ah, so the basic deduction process! You may or may not have at least used subgroup exclusion at one point or another, but in any case, there's a whole lot you could work with in terms of strategies.

                  Sudoku is fun :)

                  1 vote
                  1. [2]
                    Gyrfalcon

                    I'll be honest, I know very little about Sudoku. I learned to do them in elementary school and I am just picking them back up in college. Maybe once I have the current program working I can post it here in ~comp and we can discuss more!

                    1 vote
                    1. Emerald_Knight

                      That sounds like fun! I'd be interested in seeing what you come up with.

                      Honestly, I should really look into writing up a solver again myself. I once wrote one during my first year of CS studies as an optional assignment, and my code quality wasn't particularly good back then. I'd like to see what I could manage now that I'm actually comfortable with programming :)

                      Edit: Speaking of, I just found the old project skeleton for the original assignment code from my old class! I'm definitely going to mess around with this!

                      1 vote