Day 88 - web scraping (1)

ksyke 2024. 12. 3. 17:52

    p. 116~

    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Tag
    Beautiful Soup Documentation (Beautiful Soup 4.12.0)

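    import urllib.request
    url = 'https://m.naver.com/'
    # Parenthesized multi-item `with` is official syntax from Python 3.10.
    with (
        urllib.request.urlopen(url) as f,
        open('data2.csv', 'w') as fw,  # opened (and truncated) but never written to in this cell
    ):
        # f.read() returns bytes, so decode to str before printing
        print(f.read().decode('utf-8'))
    
    import requests
    # Same fetch with requests; .text is the response body decoded to str
    requests.get('http://m.naver.com').text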
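Some sites answer differently (or refuse) when a request carries no browser-like headers, and `urllib.request.Request` is the standard-library way to attach them. The following is a minimal sketch, not a definitive recipe: `build_request` and `fetch_html` are hypothetical helper names, and the User-Agent string is an illustrative assumption. Nothing is fetched until `fetch_html` is actually called.

```python
from urllib.request import Request, urlopen


def build_request(url: str) -> Request:
    """Wrap a URL in a Request that carries a User-Agent header."""
    # The header value is an illustrative assumption, not a required string.
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; study-script)"})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it with the charset the server declares."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the Content-Type header names no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the request is offline; only fetch_html() touches the network.
req = build_request("https://m.naver.com/")
print(req.full_url, req.get_method())
```

Calling `fetch_html("https://m.naver.com/")` would then return the page as a `str`, ready to hand to a parser.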

    html_doc = """
    <html>
    <head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    </body>
    </html>
    """
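    from bs4 import BeautifulSoup
    # Name the parser explicitly: it is 'html.parser' (with a dot), not 'html_parser'.
    # Omitting it makes bs4 guess a parser and emit a warning.
    soup = BeautifulSoup(html_doc, 'html.parser')
    soup.prettify()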
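The parser name matters: a misspelled name such as `html_parser` is rejected rather than silently ignored. A small sketch of that behaviour, assuming only the built-in `html.parser` is available; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# An unknown parser name raises FeatureNotFound.
try:
    BeautifulSoup(html, "html_parser")
    bad_name_rejected = False
except FeatureNotFound:
    bad_name_rejected = True

# The built-in parser is spelled with a dot.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str; print it to see the indentation.
pretty = soup.prettify()
print(pretty)
```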

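    # The <title> tag
    soup.title

    # Only the first <a> tag
    soup.a

    # Every <a> tag, as a list-like ResultSet
    soup.find_all('a')

    # The single tag whose id is "link2"
    soup.find(id='link2')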

    # class is a multi-valued attribute, so this returns a list: ['title']
    soup.p['class']
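Attribute access behaves like a dictionary lookup, with one twist for `class`. A short sketch on a made-up fragment (the HTML string here is an assumption for illustration, not the `html_doc` above):

```python
from bs4 import BeautifulSoup

html = '<p class="title main"><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

classes = soup.p["class"]       # multi-valued attribute: list of str
href = soup.a["href"]           # single-valued attribute: plain str
missing = soup.a.get("title")   # .get() returns None instead of raising KeyError
attrs = soup.a.attrs            # all attributes of the tag as a dict

print(classes, href, missing, attrs)
```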

    # Every <p> tag whose class is "story"
    soup.find_all('p', {'class': 'story'})
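A typical next step is to loop over the matches and pull out the fields you need. The sketch below uses a shortened copy of the sample markup (an assumption, so the block runs on its own) and also shows the `class_=` keyword as an alternative to the attribute dict.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Two spellings of the same filter: attribute dict vs. the class_ keyword.
by_dict = soup.find_all("p", {"class": "story"})
by_keyword = soup.find_all("p", class_="story")

# Collect (link text, href) pairs from every <a class="sister">.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="sister")]

for text, href in links:
    print(text, href)
```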

    # Text of the first <a>: 'Elsie'
    soup.a.get_text()

    # Text of the first <p>: "The Dormouse's story"
    soup.p.get_text()

    # get_text() already returns a str, so wrapping it in str() is unnecessary
    soup.find_all('p', {'class': 'story'})[0].get_text()
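`get_text()` also accepts a separator and a `strip` flag, which helps when the text is spread across nested tags. A minimal sketch on a one-line fragment (the HTML is an illustrative assumption):

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time <a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

# Default: all text nodes concatenated as-is.
plain = soup.p.get_text()

# stripped_strings yields each text node with surrounding whitespace removed.
pieces = list(soup.p.stripped_strings)

# A separator plus strip=True joins those stripped pieces.
joined = soup.p.get_text("|", strip=True)

print(plain)
print(pieces)
print(joined)
```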

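    # Replace the text of the first <p> (the string inside its <b>) with 'edit';
    # replace_with() returns the string that was removed
    soup.p.string.replace_with('edit')

    # Check the change in the formatted output
    soup.prettify()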
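`.string` only works when a tag has exactly one child to descend into; otherwise it is `None` and calling `replace_with` on it fails. A small sketch of both cases, using fragments made up for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")

# <p> has a single child <b>, which has a single string, so .string reaches it.
old = soup.p.string.replace_with("edit")
new_text = soup.p.get_text()

# A tag with several children has no single string: .string is None.
mixed = BeautifulSoup("<p>a<b>b</b></p>", "html.parser")
mixed_string = mixed.p.string

print(old, "->", new_text, "| mixed:", mixed_string)
```

Checking for `None` before calling `replace_with` avoids an `AttributeError` on tags with mixed content.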

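    # Direct children of <body> as a list, including whitespace-only strings
    soup.body.contents

    # contents[3] is the first <p class="story">; its first child is the leading text
    soup.body.contents[3].contents[0]

    # .children iterates over the same nodes, so list() matches .contents
    list(soup.body.children)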
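Because the newlines between tags are children too, positional indexes like `contents[3]` are fragile. One way around that, sketched here on a reduced `<body>` (an assumption for brevity), is to keep only the `Tag` children:

```python
from bs4 import BeautifulSoup, Tag

html = '<body>\n<p class="title">A</p>\n<p class="story">B</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Newline strings between the tags count as children.
all_children = body.contents

# Keep only real tags, dropping the whitespace strings.
tags_only = [child for child in body.children if isinstance(child, Tag)]
summary = [(tag.name, tag["class"], tag.get_text()) for tag in tags_only]

print(len(all_children), summary)
```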